From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark H Weaver
Newsgroups: gmane.lisp.guile.devel
Subject: Filenames and other POSIX byte strings as SCM strings without loss
Date: Mon, 23 May 2011 15:42:56 -0400
Message-ID: <87ipt1kxtr.fsf_-_@netris.org>
References: <1297784103-18322-1-git-send-email-janneke-list@xs4all.nl>
 <1297784103-18322-3-git-send-email-janneke-list@xs4all.nl>
 <87r58gzuoy.fsf@gnu.org> <87vcxsycds.fsf@gnu.org>
 <87wri7ru80.fsf@netris.org> <87zkn2stsf.fsf@gnu.org>
To: Noah Lavine
Cc: Andy Wingo, Ludovic Courtès, guile-devel@gnu.org
In-Reply-To: (Noah Lavine's message of "Tue, 17 May 2011 12:59:16 -0400")

Hello all,

Andy and I have been discussing on IRC how to deal with pathnames. The tentative plan is to use normal strings to represent pathnames, command-line arguments, environment variable values, and other such POSIX byte strings.

We'd need to implement alternative conversions between POSIX byte strings and SCM strings which would implement a bijective (one-to-one) mapping between the set of all byte vectors and a subset of SCM strings. For purposes of this email, suppose they are called scm_to_permissive_stringn and scm_from_permissive_stringn. On top of these we would implement scm_to_permissive_locale_stringn, scm_from_permissive_locale_stringn, and some other convenience functions.

These alternative mappings would be used to convert between POSIX byte strings and SCM strings. We'd reserve 256 private-use code points (somewhere in the ranges U+F0000..U+FFFFD or U+100000..U+10FFFD) which would represent bytes of ill-formed byte sequences.
For purposes of this email, suppose we choose the range U+109700..U+1097FF.

scm_from_permissive_locale_stringn would be used to convert filenames et al to SCM strings. Ill-formed byte sequences in the filename would be mapped to a sequence of Unicode characters in that range. For example, when using a UTF-8 locale, the filename 0x46 0x6F 0x6F 0xC0 0x80 0x41 would become an SCM string containing the characters: F, o, o, U+1097C0, U+109780, A.

A few details: it is important for security reasons that the mapping be bijective (one-to-one) between all byte vectors and a subset of SCM strings. The subset would include all SCM strings that do not contain characters within the reserved range U+109700..U+1097FF.

Since scm_from_permissive_stringn maps invalid bytes to private-use code points in the range U+109700..U+1097FF, we must ensure that properly encoded code points in that range are mapped to something else. Otherwise, two distinct POSIX byte strings might map to the same SCM string. The simplest solution is to consider any byte sequence which would map to our reserved range to be invalid, and thus mapped one byte at a time using this scheme.

For example, U+1097FF is represented in UTF-8 as 0xF4 0x89 0x9F 0xBF. Although scm_from_stringn would map this sequence of bytes to the single code point U+1097FF (when using UTF-8), scm_from_permissive_stringn would consider this entire byte sequence to be invalid, and instead map it to the 4 code points U+1097F4, U+109789, U+10979F, U+1097BF.

We must also make sure that scm_to_permissive_stringn never maps two distinct SCM strings to the same POSIX byte string. In particular, we must make sure that the U+1097xx code points are only used to generate _invalid_ byte sequences, and never valid ones. The simplest way to do this is to apply scm_from_permissive_stringn to the result and make sure that it yields the original SCM string. If not, an exception would be thrown.
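
To make the two directions of the mapping concrete, here is a minimal C sketch of the scheme described above. It is only an illustration under stated assumptions, not Guile code: it handles UTF-8 only (not arbitrary locale encodings), it works on plain arrays of code points rather than SCM strings, it returns (size_t)-1 where the real functions would throw an exception, and the names from_permissive_utf8, to_permissive_utf8 and ESC_BASE are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reserved range, as supposed in the email: U+109700..U+1097FF. */
#define ESC_BASE 0x109700u

static int is_escape(uint32_t c) { return c >= ESC_BASE && c <= ESC_BASE + 0xFFu; }

/* Strictly decode one UTF-8 sequence from s[0..n).  Returns its length (1..4)
   and stores the code point in *cp, or returns 0 if the sequence is ill-formed
   (bad lead or continuation byte, overlong form, surrogate, above U+10FFFF). */
static size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len, i;
    uint32_t c;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; c = s[0] & 0x1Fu; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0Fu; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07u; }
    else return 0;
    if (n < len) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3Fu);
    }
    if (c < min_for_len[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Encode one valid, non-escape code point as UTF-8; returns the byte count. */
static size_t utf8_encode_one(uint32_t c, unsigned char *o)
{
    if (c < 0x80) { o[0] = (unsigned char)c; return 1; }
    if (c < 0x800) {
        o[0] = (unsigned char)(0xC0 | (c >> 6));
        o[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        o[0] = (unsigned char)(0xE0 | (c >> 12));
        o[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        o[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    o[0] = (unsigned char)(0xF0 | (c >> 18));
    o[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
    o[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    o[3] = (unsigned char)(0x80 | (c & 0x3F));
    return 4;
}

/* Bytes -> code points.  `out` needs room for n code points.  Any byte that
   does not start a valid sequence, or that starts a valid encoding of a
   reserved code point, is escaped on its own as ESC_BASE + byte. */
size_t from_permissive_utf8(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t i = 0, k = 0;
    assert(s != NULL || n == 0);
    while (i < n) {
        uint32_t cp = 0;
        size_t len = utf8_decode_one(s + i, n - i, &cp);
        if (len == 0 || is_escape(cp)) { out[k++] = ESC_BASE + s[i]; i += 1; }
        else                           { out[k++] = cp;              i += len; }
    }
    return k;
}

/* Code points -> bytes.  `out` needs room for 4 * n bytes.  Returns the byte
   count, or (size_t)-1 if the input is not in the image of the decoder. */
size_t to_permissive_utf8(const uint32_t *cps, size_t n, unsigned char *out)
{
    size_t i, len = 0, back_n;
    uint32_t *back;
    int ok;

    for (i = 0; i < n; i++) {
        uint32_t c = cps[i];
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return (size_t)-1;
        if (is_escape(c)) out[len++] = (unsigned char)(c - ESC_BASE);
        else              len += utf8_encode_one(c, out + len);
    }
    /* Round-trip check: decoding the result must yield the original string. */
    back = malloc((len ? len : 1) * sizeof *back);
    if (back == NULL) return (size_t)-1;
    back_n = from_permissive_utf8(out, len, back);
    ok = back_n == n && (n == 0 || memcmp(back, cps, n * sizeof *back) == 0);
    free(back);
    return ok ? len : (size_t)-1;
}
```

Fed the examples above, the bytes 0x46 0x6F 0x6F 0xC0 0x80 0x41 decode to F, o, o, U+1097C0, U+109780, A and encode back to the same six bytes, while 0xF4 0x89 0x9F 0xBF decodes to the four escape code points rather than to U+1097FF. A lone U+109741 is rejected by the encoder, because the byte 0x41 it would produce decodes back to a plain A.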
So the tentative plan is to provide this alternative mapping, and use it whenever accessing POSIX byte strings, whether they be filenames, command-line arguments, environment variable values, fields within a passwd, group, wtmp, or utmp file, system information (e.g. the hostname or information from uname), etc.

We should allow the user to access this mapping directly, via scm_{to,from}_permissive_stringn, scm_{to,from}_permissive_locale_stringn, scm_{to,from}_permissive_utf8_stringn, and also between strings and bytevectors in both Scheme and C: permissive-string->utf8, permissive-utf8->string, scm_permissive_string_to_utf8, scm_permissive_utf8_to_string. We should probably add procedures to convert between strings and bytevectors using other encodings as well, most importantly the locale encoding. We'd also need permissive-string->pointer and permissive-pointer->string.

I'm not sure about the names. Suggestions welcome.

Regarding Noah's proposal to allow handling pathnames as sequences of path components: both Andy and I like this idea. However, as always, the devil's in the details. I'll write more about this in another email.

    Best,
     Mark