From: Mark H Weaver
Newsgroups: gmane.lisp.guile.devel
Subject: Re: [PATCH 2/5] [mingw]: Have compiled-file-name produce valid names.
Date: Tue, 17 May 2011 16:03:18 -0400
Message-ID: <87fwodp01l.fsf@netris.org>
To: Noah Lavine
Cc: Andy Wingo, Ludovic Courtès, guile-devel@gnu.org
In-Reply-To: (Noah Lavine's message of "Tue, 17 May 2011 12:59:16 -0400")

Noah Lavine writes:

> Mark is right that paths are basically just strings, even though
> occasionally they're not. I sort of like the idea of the PEP-383
> encoding (making paths strings that can potentially contain unused
> codepoints, which represent non-character bytes), but would that make
> path strings break under some Guile string operations?

Yes, this is indeed a problem. Instead of using isolated surrogate
code points as recommended by PEP-383, I think we should use one of
the alternative mappings proposed in section 3.7.4 of Unicode
Technical Report #36:

  1. Use 256 private-use code points, somewhere in the ranges
     F0000..FFFFD or 100000..10FFFD. This would probably cause the
     fewest security and interoperability problems. There is, however,
     some possibility of collision with other uses of private-use
     characters.

  2. Use pairs of noncharacter code points in the range FDD0..FDEF.
These are "super" private-use characters, and are discouraged for general interchange. The transformation would take each nibble of a byte Y, and add to FDD0 and FDE0, respectively. However, noncharacter code points may be replaced by U+FFFD ( =EF=BF=BD ) REPLACEMENT CHARACTER by = some implementations, especially when they use them internally. (Again, incoming characters must never be deleted, because that can cause security problems.) > Also, when we convert strings to paths, we need to know what encoding > the local filesystem uses. That will usually be UTF-8, but potentially > might not be, correct? Yes, that is correct. I haven't looked deeply into this, but clearly a lot of software uses the current locale encoding to interpret these POSIX byte strings, and I suspect at least some software uses UTF-8 to interpret filenames. Fortunately, most popular modern distributions of GNU are now using UTF-8 locales by default, which basically makes the problem disappear. Regardless, this method of mapping ill-formed byte sequences to private-use code points can used with _any_ encoding, not just UTF-8. Best, Mark
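
For illustration only, the two mappings quoted above can be sketched in
a few lines of Python (not Guile, and not anything proposed in this
thread). The base code point 0xF0000, the error-handler name
"pua-escape", and the function names decode_path, encode_path and
nibble_pair are all assumptions made for this sketch; UTR #36 only says
the 256 code points should sit somewhere in the private-use ranges.

```python
import codecs

# Assumed base for the 256 private-use code points (option 1).
# UTR #36 does not fix a particular base; 0xF0000 is a hypothetical choice.
PUA_BASE = 0xF0000


def _pua_escape(err):
    """Decode error handler: map each ill-formed byte to PUA_BASE + byte."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which to resume.
    return "".join(chr(PUA_BASE + b) for b in bad), err.end


codecs.register_error("pua-escape", _pua_escape)


def decode_path(raw, encoding="utf-8"):
    """Turn a POSIX byte string into text without dropping any byte."""
    return raw.decode(encoding, errors="pua-escape")


def encode_path(text, encoding="utf-8"):
    """Inverse of decode_path.

    Caveat: a genuine private-use character in the mapped range is
    indistinguishable from an escaped byte; this is the collision risk
    mentioned for option 1.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE <= cp < PUA_BASE + 0x100:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode(encoding)
    return bytes(out)


def nibble_pair(byte):
    """Option 2: high nibble added to FDD0, low nibble added to FDE0."""
    return chr(0xFDD0 + (byte >> 4)) + chr(0xFDE0 + (byte & 0x0F))
```

Because the handler is attached to the decoder rather than to UTF-8
itself, passing a different `encoding` to decode_path exercises the same
mapping for other encodings, which matches the point that the method is
not tied to UTF-8.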