unofficial mirror of guile-devel@gnu.org 
From: Mark H Weaver <mhw@netris.org>
To: Noah Lavine <noah.b.lavine@gmail.com>
Cc: "Andy Wingo" <wingo@pobox.com>, "Ludovic Courtès" <ludo@gnu.org>,
	guile-devel@gnu.org
Subject: Filenames and other POSIX byte strings as SCM strings without loss
Date: Mon, 23 May 2011 15:42:56 -0400
Message-ID: <87ipt1kxtr.fsf_-_@netris.org>
In-Reply-To: <BANLkTimXw9bVcgqRdk5mS1-byasW2rgpYg@mail.gmail.com> (Noah Lavine's message of "Tue, 17 May 2011 12:59:16 -0400")

Hello all,

Andy and I have been discussing how to deal with pathnames on IRC.

The tentative plan is to use normal strings to represent pathnames,
command-line arguments, environment variable values, and other such
POSIX byte strings.

We'd need to implement alternative conversions between POSIX byte
strings and SCM strings which would implement a bijective (one-to-one)
mapping between the set of all byte vectors and a subset of SCM strings.
For purposes of this email, suppose they are called
scm_to_permissive_stringn and scm_from_permissive_stringn.  On top of
these we would implement scm_to_permissive_locale_stringn,
scm_from_permissive_locale_stringn, and some other convenience
functions.

These alternative mappings would be used to convert between POSIX byte
strings and SCM strings.  We'd reserve 256 private-use code points
(somewhere in the ranges U+F0000..U+FFFFD or U+100000..U+10FFFD) which
would represent bytes of ill-formed byte sequences.  For purposes of
this email, suppose we choose the range U+109700..U+1097FF.

scm_from_permissive_locale_stringn would be used to convert filenames
and the like to SCM strings.  Each byte of an ill-formed byte sequence
in the filename would be mapped to one character in that range.  For example,
when using a UTF-8 locale, the filename 0x46 0x6F 0x6F 0xC0 0x80 0x41
would become a SCM string containing the characters: F, o, o, U+1097C0,
U+109780, A.
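
To make the byte-to-character escape concrete, here is a minimal C
sketch of just that mapping.  It assumes the range U+109700..U+1097FF
supposed above; the names PERMISSIVE_BASE, permissive_escape_of_byte,
permissive_is_escape and permissive_byte_of_escape are hypothetical
and not part of any existing Guile API.

```c
#include <stdint.h>

/* Hypothetical base of the reserved range U+109700..U+1097FF
   (the range supposed in this email, not a settled choice).  */
#define PERMISSIVE_BASE 0x109700u

/* Map one byte of an ill-formed sequence to its escape code point.  */
uint32_t
permissive_escape_of_byte (uint8_t b)
{
  return PERMISSIVE_BASE + b;
}

/* True if CP lies in the reserved escape range.  */
int
permissive_is_escape (uint32_t cp)
{
  return cp >= PERMISSIVE_BASE && cp <= PERMISSIVE_BASE + 0xFFu;
}

/* Recover the original byte from an escape code point.  The caller
   is expected to have checked permissive_is_escape (cp) first.  */
uint8_t
permissive_byte_of_escape (uint32_t cp)
{
  return (uint8_t) (cp - PERMISSIVE_BASE);
}
```

With this, the bytes 0xC0 and 0x80 from the filename above become
U+1097C0 and U+109780 respectively.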

A few details: it is important for security reasons that the mapping be
bijective (one-to-one) between all byte vectors and a subset of SCM
strings.  The subset would include all SCM strings that do not include
characters within the reserved range U+109700..U+1097FF.

Since scm_from_permissive_stringn maps invalid bytes to private-use code
points in the range U+109700..U+1097FF, we must ensure that properly
encoded code points in that range are mapped to something else.
Otherwise, two distinct POSIX byte strings might map to the same SCM
string.  The simplest solution is to consider any byte sequence which
would map to our reserved range to be invalid, and thus mapped one byte
at a time using this scheme.  For example, U+1097FF is represented in
UTF-8 as 0xF4 0x89 0x9F 0xBF.  Although scm_from_stringn would map this
sequence of bytes to the single code point U+1097FF (when using UTF-8),
scm_from_permissive_stringn would instead consider this entire byte
sequence to be invalid and map it to the 4 code points
U+1097F4, U+109789, U+10979F, U+1097BF.
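
The decoding direction described so far could be sketched in C roughly
as follows.  This is only an illustration under several assumptions:
the encoding is UTF-8, the SCM string is modelled as a plain array of
32-bit code points, the reserved range is the U+109700..U+1097FF
supposed above, and the names utf8_decode1 and from_permissive_utf8
are hypothetical stand-ins rather than real libguile functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Reserved range supposed in this email: U+109700..U+1097FF.  */
#define ESC_LO 0x109700u
#define ESC_HI 0x1097FFu

/* Strictly decode one UTF-8 sequence from s[0..n).  Returns its length
   in bytes and stores the code point in *cp, or returns 0 if the bytes
   are ill-formed (stray continuation byte, truncated or overlong
   sequence, surrogate, or value above U+10FFFF).  */
static size_t
utf8_decode1 (const uint8_t *s, size_t n, uint32_t *cp)
{
  static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
  size_t len, i;
  uint32_t c;

  if (n == 0)
    return 0;
  if (s[0] < 0x80)                { len = 1; c = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
  else
    return 0;
  if (len > n)
    return 0;
  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3F);
    }
  if (c < min_cp[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;
  *cp = c;
  return len;
}

/* Permissive decode: OUT must have room for N code points.  Returns
   the number of code points written.  A byte that does not start a
   well-formed sequence, or that starts a sequence encoding a code
   point in the reserved range, is emitted as ESC_LO + byte and the
   scan resumes at the next byte.  */
size_t
from_permissive_utf8 (const uint8_t *in, size_t n, uint32_t *out)
{
  size_t i = 0, count = 0;

  while (i < n)
    {
      uint32_t cp;
      size_t len = utf8_decode1 (in + i, n - i, &cp);

      if (len == 0 || (cp >= ESC_LO && cp <= ESC_HI))
        {
          out[count++] = ESC_LO + in[i];
          i += 1;
        }
      else
        {
          out[count++] = cp;
          i += len;
        }
    }
  return count;
}
```

Escaping only the first byte of a properly encoded reserved code point
is enough here: the remaining bytes are then stray continuation bytes
and get escaped one at a time, which yields the same four code points
as in the U+1097FF example.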

We must also make sure that scm_to_permissive_stringn never maps two
distinct SCM strings to the same POSIX byte string.  In particular, we
must make sure that the U+1097xx code points are only used to generate
_invalid_ byte sequences, and never valid ones.  The simplest way to do
this is to apply scm_from_permissive_stringn to the result and make sure
that it yields the original SCM string.  If not, an exception would be
thrown.
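
A sketch of the reverse direction with the round-trip check might look
like this in C.  The same assumptions apply: UTF-8 only, strings
modelled as arrays of 32-bit code points, the reserved range supposed
above, and hypothetical names (to_permissive_utf8 and its helpers).
Returning -1 stands in for the exception that would be thrown.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ESC_LO 0x109700u   /* reserved range supposed in this email */
#define ESC_HI 0x1097FFu

/* Strict decode of one UTF-8 sequence; returns its length, 0 if ill-formed.  */
static size_t
utf8_decode1 (const uint8_t *s, size_t n, uint32_t *cp)
{
  static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
  size_t len, i;
  uint32_t c;

  if (n == 0) return 0;
  if (s[0] < 0x80)                { len = 1; c = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
  else return 0;
  if (len > n) return 0;
  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80) return 0;
      c = (c << 6) | (s[i] & 0x3F);
    }
  if (c < min_cp[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;
  *cp = c;
  return len;
}

/* Permissive decode, needed here only for the round-trip check.  */
static size_t
from_permissive_utf8 (const uint8_t *in, size_t n, uint32_t *out)
{
  size_t i = 0, count = 0;
  while (i < n)
    {
      uint32_t cp;
      size_t len = utf8_decode1 (in + i, n - i, &cp);
      if (len == 0 || (cp >= ESC_LO && cp <= ESC_HI))
        { out[count++] = ESC_LO + in[i]; i += 1; }
      else
        { out[count++] = cp; i += len; }
    }
  return count;
}

/* Ordinary UTF-8 encoding; returns bytes written, 0 if CP is not encodable.  */
static size_t
utf8_encode1 (uint32_t cp, uint8_t *out)
{
  if (cp < 0x80) { out[0] = (uint8_t) cp; return 1; }
  if (cp < 0x800)
    {
      out[0] = (uint8_t) (0xC0 | (cp >> 6));
      out[1] = (uint8_t) (0x80 | (cp & 0x3F));
      return 2;
    }
  if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return 0;
  if (cp < 0x10000)
    {
      out[0] = (uint8_t) (0xE0 | (cp >> 12));
      out[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (uint8_t) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (uint8_t) (0xF0 | (cp >> 18));
  out[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (uint8_t) (0x80 | (cp & 0x3F));
  return 4;
}

/* Encode STR (N code points) into OUT, which needs room for 4*N bytes.
   Returns 0 and sets *OUT_LEN on success, or -1 if STR is not encodable
   or decoding the result does not give back STR exactly.  */
int
to_permissive_utf8 (const uint32_t *str, size_t n, uint8_t *out, size_t *out_len)
{
  size_t i, len = 0, back_n;
  uint32_t *back;
  int ok;

  for (i = 0; i < n; i++)
    if (str[i] >= ESC_LO && str[i] <= ESC_HI)
      out[len++] = (uint8_t) (str[i] - ESC_LO);   /* escape -> raw byte */
    else
      {
        size_t k = utf8_encode1 (str[i], out + len);
        if (k == 0) return -1;
        len += k;
      }

  back = malloc ((len ? len : 1) * sizeof *back);
  if (back == NULL) return -1;
  back_n = from_permissive_utf8 (out, len, back);
  ok = back_n == n && memcmp (back, str, n * sizeof *str) == 0;
  free (back);
  if (!ok) return -1;
  *out_len = len;
  return 0;
}
```

For instance, the string U+1097C3, U+1097A9 would produce the bytes
0xC3 0xA9, which is the valid UTF-8 encoding of U+00E9, so the
round-trip check fails and the conversion is rejected.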

So the tentative plan is to provide this alternative mapping, and use it
whenever accessing POSIX byte strings, whether they be filenames,
command-line arguments, environment variable values, fields within a
passwd, group, wtmp, or utmp file, system information (e.g. the hostname
or information from uname), etc.

We should allow the user to access this mapping directly, via

  scm_{to,from}_permissive_stringn,
  scm_{to,from}_permissive_locale_stringn,
  scm_{to,from}_permissive_utf8_stringn,

and also between strings and bytevectors in both Scheme and C:

  permissive-string->utf8,
  permissive-utf8->string,
  scm_permissive_string_to_utf8,
  scm_permissive_utf8_to_string,

and we should probably add procedures to convert between strings and
bytevectors using other encodings as well, most importantly the locale
encoding.

We'd also need permissive-string->pointer and
permissive-pointer->string.

I'm not sure about the names.  Suggestions welcome.

Regarding Noah's proposal to allow handling pathnames as sequences of
path components: both Andy and I like this idea.  However, as always,
the devil's in the details.  I'll write more about this in another
email.

    Best,
     Mark


