From: Mark H Weaver <mhw@netris.org>
To: Andy Wingo <wingo@pobox.com>
Cc: "Ludovic Courtès" <ludo@gnu.org>, guile-devel@gnu.org
Subject: Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.
Date: Tue, 03 May 2011 23:59:59 -0400
Message-ID: <87wri7ru80.fsf@netris.org>
In-Reply-To: <m3zkn4rzxx.fsf@unquote.localdomain> (Andy Wingo's message of "Tue, 03 May 2011 09:44:10 +0200")

Andy Wingo <wingo@pobox.com> writes:
> That's the crazy thing: file names on GNU aren't in any encoding!  They
> are byte strings that may or may not decode to a string, given some
> encoding.  Granted, they're mostly UTF-8 these days, but users have the
> darndest files...
[...]
> On Tue 03 May 2011 00:18, ludo@gnu.org (Ludovic Courtès) writes:
>> I think GLib and the like expect UTF-8 as the file name encoding and
>> complain otherwise, so UTF-8 might be a better default than locale
>> encoding (and it’s certainly wiser to be locale-independent.)
>
> It's more complicated than that.  Here's the old interface that they
> used, which attempted to treat paths as utf-8:
>
>   http://developer.gnome.org/glib/unstable/glib-Character-Set-Conversion.html
>   (search for "file name encoding")
>
> The new API is abstract, so it allows operations like "get-display-name"
> and "get-bytes":
>
>   http://developer.gnome.org/gio/2.28/GFile.html  (search for "encoding"
>   in that page)
>
>   "All GFiles have a basename (get with g_file_get_basename()). These
>   names are byte strings that are used to identify the file on the
>   filesystem (relative to its parent directory) and there is no
>   guarantees that they have any particular charset encoding or even make
>   any sense at all. If you want to use filenames in a user interface you
>   should use the display name that you can get by requesting the
>   G_FILE_ATTRIBUTE_STANDARD_DISPLAY_NAME attribute with
>   g_file_query_info(). This is guaranteed to be in utf8 and can be used
>   in a user interface. But always store the real basename or the GFile
>   to use to actually access the file, because there is no way to go from
>   a display name to the actual name."

In my opinion, this is a bad approach to take in Guile.  When developers
are careful to robustly handle filenames with invalid encoding, it will
lead to overly complex code.  More often, when developers write more
straightforward code, it will lead to code that works most of the time
but fails badly when confronted with weird filenames.  This is the same
type of problem that plagues Bourne shell scripts.  Let's please not go
down that road.

There is a better way.  We can do a variant of what Python 3 does,
described in PEP 383 <http://www.python.org/dev/peps/pep-0383/>.

The idea is to provide alternative versions of scm_{to,from}_stringn
that allow arbitrary bytevectors to be turned into strings and back
again without loss.  These alternative versions would be used for
operations involving filenames and the like, and should probably also
be made available to users.

The key point is that "invalid bytes" are mapped to code points that
will never appear in any valid encoding.  PEP 383 maps such bytes to a
range of surrogate code points (U+DC80 through U+DCFF) that are reserved
for use in UTF-16 surrogate pairs, and are otherwise considered invalid
by Unicode.  There are other possible mapping schemes as well.  See
section 3.7 of Unicode Technical Report #36
<http://www.unicode.org/reports/tr36/> for more discussion on this.
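
To make the mapping concrete, here is a minimal sketch in Python 3,
whose built-in "surrogateescape" error handler implements PEP 383.  The
helper names bytes_to_string and string_to_bytes are hypothetical
stand-ins for the proposed lossless variants of scm_{to,from}_stringn,
and UTF-8 is assumed as the nominal encoding.  This is only an
illustration of the technique, not a proposed Guile implementation.

```python
def bytes_to_string(raw: bytes) -> str:
    """Decode as UTF-8; each undecodable byte 0xNN becomes U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def string_to_bytes(text: str) -> bytes:
    """Inverse mapping: each escaped U+DCNN turns back into byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")


# A Latin-1 encoded name: byte 0xE9 is not valid UTF-8 in this position.
odd = b"caf\xe9.txt"
as_string = bytes_to_string(odd)
assert as_string == "caf\udce9.txt"
assert string_to_bytes(as_string) == odd

# Well-formed UTF-8 names decode exactly as they normally would.
good = "café.txt".encode("utf-8")
assert bytes_to_string(good) == "café.txt"

# Every possible byte value survives the round trip.
every_byte = bytes(range(256))
assert string_to_bytes(bytes_to_string(every_byte)) == every_byte
```

Only the ill-encoded bytes are escaped, so names that are already valid
UTF-8 are unaffected by the scheme.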

I can understand why some say that filenames in GNU are not really
strings but rather bytevectors.  I respectfully disagree.  Filenames,
environment variables, command-line arguments, etc., are _conceptually_
strings.  Let's not muddle that concept just because the transition to
Unicode has not yet been completed in the GNU world.

Hopefully in the future, these old-style POSIX byte strings will once
again become true strings in concept.  All that's required for this to
happen is for popular software to agree to standardize on the use of
UTF-8 for all of these things.  This is reasonably likely to happen at
some point.

In practice, I see no advantage to calling them bytevectors other than
to allow lossless storage of oddball filenames.  It's not as if any sane
user interface is going to display them in hex.  Think about it.  What
are you really going to do with the bytevector version, other than to
store it in case you want to convert it back into a filename,
environment variable, or command-line argument?  Think about the mess
that this will make of otherwise simple code.  Also think about the
obscure bugs that will arise from programmers who balk at this and
simply pass around the strings instead.

Let's keep things simple.  Let's use plain strings for everything that
is _conceptually_ a string.  Let's instead deal with the occasional
ill-encoded filename by allowing strings to represent these oddballs.
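
As a rough illustration of why this keeps code simple, the sketch below
(Python 3 with its PEP 383 "surrogateescape" handler, UTF-8 assumed;
backup_name and is_well_encoded are hypothetical helpers invented for
this example) manipulates an ill-encoded filename with ordinary string
operations and still recovers the exact original bytes.  A strict
encoder rejects the escaped code point, so an oddball name can still be
detected where that matters.

```python
import posixpath


def backup_name(raw: bytes) -> bytes:
    """Insert ".bak" before the extension of a raw POSIX file name."""
    name = raw.decode("utf-8", errors="surrogateescape")
    stem, ext = posixpath.splitext(name)  # plain string manipulation
    return (stem + ".bak" + ext).encode("utf-8", errors="surrogateescape")


def is_well_encoded(name: str) -> bool:
    """True if the string contains no escaped (originally invalid) bytes."""
    try:
        name.encode("utf-8")  # strict: lone surrogates are rejected
    except UnicodeEncodeError:
        return False
    return True


# The stray Latin-1 byte 0xE9 passes through the string code untouched.
assert backup_name(b"caf\xe9.txt") == b"caf\xe9.bak.txt"

# Properly encoded names behave exactly as expected.
assert backup_name("café.txt".encode("utf-8")) == "café.bak.txt".encode("utf-8")

# The oddball can still be told apart from a clean name.
assert not is_well_encoded("caf\udce9.txt")
assert is_well_encoded("café.txt")
```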

     Best,
      Mark


