From: Noah Lavine
Newsgroups: gmane.lisp.guile.devel
List-Id: "Developers list for Guile, the GNU extensibility library"
Subject: Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.
Date: Wed, 4 May 2011 00:13:18 -0400
References: <1297784103-18322-1-git-send-email-janneke-list@xs4all.nl> <1297784103-18322-3-git-send-email-janneke-list@xs4all.nl> <87r58gzuoy.fsf@gnu.org> <87vcxsycds.fsf@gnu.org> <87wri7ru80.fsf@netris.org>
In-Reply-To: <87wri7ru80.fsf@netris.org>
To: Mark H Weaver
Cc: Andy Wingo, Ludovic Courtès, guile-devel@gnu.org

Hello all,

I have another issue to raise. I think this is actually parallel to
some of the stuff in the (web) module, as you will see.

I've always thought it was ridiculous and hackish that I had to escape
spaces in path strings.
For instance, I have a folder called "Getting a Job" on my desktop,
whose path is ~/Desktop/Getting\ a\ Job. The strangeness arises because
path strings are actually lists (or vectors) encoded as strings.
Conceptually, the path ~/Desktop/Getting\ a\ Job is the list
("~" "Desktop" "Getting a Job"). In this representation, there are no
escapes and no separators. It always seemed cleaner to me to think
about it that way.

I think there should be some mechanism by which Guile users never have
to think about escaping spaces (and any other characters they want in
their paths). We don't have to represent paths with lists or vectors,
but there should be some mechanism for avoiding the escaping.

I said this is similar to the (web) module because of all of the
discussion there of how HTTP encodes data types in text, and how it's
better to think of a URI as a URI type rather than a special string,
etc. I think the same issue applies here: you've got a list (or a list
of lists, if you have a whole command line with arguments) encoded as a
string using ' ' and '/' as separators, and then you have to escape
those characters when you want to use them in a different way. The
whole thing gets unnecessarily complicated, because the right way to
think about this is as lists of strings.

Noah

On Tue, May 3, 2011 at 11:59 PM, Mark H Weaver wrote:
> Andy Wingo writes:
>> That's the crazy thing: file names on GNU aren't in any encoding!  They
>> are byte strings that may or may not decode to a string, given some
>> encoding.  Granted, they're mostly UTF-8 these days, but users have the
>> darndest files...
> [...]
>> On Tue 03 May 2011 00:18, ludo@gnu.org (Ludovic Courtès) writes:
>>> I think GLib and the like expect UTF-8 as the file name encoding and
>>> complain otherwise, so UTF-8 might be a better default than locale
>>> encoding (and it’s certainly wiser to be locale-independent.)
>>
>> It's more complicated than that.
>> Here's the old interface that they used, which attempted to treat
>> paths as utf-8:
>>
>>   http://developer.gnome.org/glib/unstable/glib-Character-Set-Conversion.html
>>   (search for "file name encoding")
>>
>> The new API is abstract, so it allows operations like "get-display-name"
>> and "get-bytes":
>>
>>   http://developer.gnome.org/gio/2.28/GFile.html  (search for "encoding"
>>   in that page)
>>
>>   "All GFiles have a basename (get with g_file_get_basename()). These
>>   names are byte strings that are used to identify the file on the
>>   filesystem (relative to its parent directory) and there is no
>>   guarantees that they have any particular charset encoding or even make
>>   any sense at all. If you want to use filenames in a user interface you
>>   should use the display name that you can get by requesting the
>>   G_FILE_ATTRIBUTE_STANDARD_DISPLAY_NAME attribute with
>>   g_file_query_info(). This is guaranteed to be in utf8 and can be used
>>   in a user interface. But always store the real basename or the GFile
>>   to use to actually access the file, because there is no way to go from
>>   a display name to the actual name."
>
> In my opinion, this is a bad approach to take in Guile.  When developers
> are careful to robustly handle filenames with invalid encoding, it will
> lead to overly complex code.  More often, when developers write more
> straightforward code, it will lead to code that works most of the time
> but fails badly when confronted with weird filenames.  This is the same
> type of problem that plagues Bourne shell scripts.  Let's please not go
> down that road.
>
> There is a better way.  We can do a variant of what Python 3 does,
> described in PEP 383.
>
> Basically, the idea is to provide alternative versions of
> scm_{to,from}_stringn that allow arbitrary bytevectors to be turned into
> strings and back again without any lossage.
> These alternative versions would be used for operations involving
> filenames et al, and should probably also be made available to users.
>
> Basically the idea is that "invalid bytes" are mapped to code points
> that will never appear in any valid encoding.  PEP 383 maps such bytes
> to a range of surrogate code points that are reserved for use in UTF-16
> surrogate pairs, and are otherwise considered invalid by Unicode.  There
> are other possible mapping schemes as well.  See section 3.7 of Unicode
> Technical Report #36 for more discussion on this.
>
> I can understand why some say that filenames in GNU are not really
> strings but rather bytevectors.  I respectfully disagree.  Filenames,
> environment variables, command-line arguments, etc, are _conceptually_
> strings.  Let's not muddle that concept just because the transition to
> Unicode has not yet been completed in the GNU world.
>
> Hopefully in the future, these old-style POSIX byte strings will once
> again become true strings in concept.  All that's required for this to
> happen is for popular software to agree to standardize on the use of
> UTF-8 for all of these things.  This is reasonably likely to happen at
> some point.
>
> In practice, I see no advantage to calling them bytevectors other than
> to allow lossless storage of oddball filenames.  It's not as if any sane
> user interface is going to display them in hex.  Think about it.  What
> are you really going to do with the bytevector version, other than to
> store it in case you want to convert it back into a filename,
> environment variable, or command-line argument?  Think about the mess
> that this will make to otherwise simple code.  Also think about the
> obscure bugs that will arise from programmers who balk at this and
> simply pass around the strings instead.
>
> Let's keep things simple.  Let's use plain strings for everything that
> is _conceptually_ a string.
> Let's instead deal with the occasional ill-encoded-filename by
> allowing strings to represent these oddballs.
>
>     Best,
>      Mark
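
For readers who want to see the PEP 383 mapping quoted above in
action, here is a minimal sketch in Python 3 (Python rather than Guile,
because PEP 383 is a Python feature; nothing here is a proposed Guile
API). The file name bytes are made up for illustration. The sketch
decodes a byte string that is not valid UTF-8 into an ordinary string,
works on it with a normal string operation, and turns it back into the
original bytes.

```python
import os

# Hypothetical file name as the kernel might hand it over: raw bytes that
# are not valid UTF-8 (0xE8 is "è" in ISO-8859-1, but is invalid here).
raw_name = b"Court\xe8s.txt"

# Decode with the PEP 383 error handler: each undecodable byte in the
# range 0x80..0xFF becomes a lone surrogate U+DC80..U+DCFF instead of
# raising UnicodeDecodeError.
as_string = raw_name.decode("utf-8", errors="surrogateescape")

# The result is an ordinary str, so ordinary string operations apply.
stem, extension = os.path.splitext(as_string)

# Encoding with the same handler maps the surrogate back to the original
# byte, so the round trip is lossless.
round_tripped = as_string.encode("utf-8", errors="surrogateescape")

print(repr(as_string))
print(extension)
print(round_tripped == raw_name)
```

The escaped byte 0xE8 shows up in the string as the lone surrogate
U+DCE8, which no valid UTF-8 input can produce, so it cannot collide
with a correctly encoded name. On POSIX systems, Python's os.fsdecode
and os.fsencode apply the same handler using the filesystem encoding.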