From: Andy Wingo
To: Mark H Weaver
Cc: Ludovic Courtès, guile-devel@gnu.org
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Filenames and other POSIX byte strings as SCM strings without loss
Date: Fri, 01 Jul 2011 12:51:27 +0200
Message-ID: <87boxe8ea8.fsf@pobox.com>
In-Reply-To: <87ipt1kxtr.fsf_-_@netris.org> (Mark H. Weaver's message of "Mon, 23 May 2011 15:42:56 -0400")

Hi Mark!

On Mon 23 May 2011 21:42, Mark H Weaver writes:

> The tentative plan is to use normal strings to represent pathnames,
> command-line arguments, environmental variable values, and other such
> POSIX byte strings.

Apologies for not giving you prompt feedback on this idea.  Basically I
think it sounds like a great, workable plan.

> For purposes of this email, suppose they are called
> scm_to_permissive_stringn and scm_from_permissive_stringn.  On top of
> these we would implement scm_to_permissive_locale_stringn,
> scm_from_permissive_locale_stringn, and some other convenience
> functions.

Sounds good.  "Permissive" sounds a bit odd but I can't think of another
name.  "Foreign"?  "Corrupt"?  "Possibly invalid"?  "Nonsense"?  "Raw"?
"Cooked"?  "Bytes"?  "scm_from_utf8_byte_string"?

> Since scm_from_permissive_stringn maps invalid bytes to private-use code
> points in the range U+109700..U+1097FF, we must ensure that properly
> encoded code points in that range are mapped to something else.
> Otherwise, two distinct POSIX byte strings might map to the same SCM
> string.  The simplest solution is to consider any byte sequence which
> would map to our reserved range to be invalid, and thus mapped one byte
> at a time using this scheme.
> For example, U+1097FF is represented in UTF-8 as 0xF4 0x89 0x9F 0xBF.
> Although scm_from_stringn would map this sequence of bytes to the
> single code point U+1097FF (when using UTF-8),
> scm_from_permissive_stringn would instead consider this entire byte
> sequence to be invalid, and instead map it to the 4 code points
> U+1097F4, U+109789, U+10979F, U+1097BF.

Works for me.

> So the tentative plan is to provide this alternative mapping, and use it
> whenever accessing POSIX byte strings, whether they be filenames,
> command-line arguments, environment variable values, fields within a
> passwd, group, wtmp, or utmp file, system information (e.g. the hostname
> or information from uname), etc.

Cool.

> We should allow the user to access this mapping directly, via
>
>   scm_{to,from}_permissive_stringn,
>   scm_{to,from}_permissive_locale_stringn,
>   scm_{to,from}_permissive_utf8_stringn,
>
> and also between strings and bytevectors in both Scheme and C:
>
>   permissive-string->utf8,
>   permissive-utf8->string,
>   scm_permissive_string_to_utf8,
>   scm_permissive_utf8_to_string,
>
> and we should probably add procedures to convert between strings and
> bytevectors using other encodings as well, most importantly the locale
> encoding.
>
> We'd also need permissive-string->pointer and
> permissive-pointer->string.
>
> I'm not sure about the names.  Suggestions welcome.

I'm liking "bytes".  scm_from_locale_byte_stringn.  byte-string->utf8.
Perhaps not clear enough though.  WDYT?

> Regarding Noah's proposal to allow handling pathnames as sequences of
> path components: both Andy and I like this idea.  However, as always,
> the devil's in the details.  I'll write more about this in another
> email.

Sure, let's get this lowest level in first.  Are you on it? :-)  There
is no hurry of course, just so we know...

Cheers,

Andy
-- 
http://wingolog.org/
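
P.S. To make the quoted byte mapping concrete, here is a rough sketch of
how I read it, for the UTF-8 case only.  This is not Guile code: the
names permissive_decode, permissive_encode and decode_one are
hypothetical, and the real functions would also have to handle the
locale encoding and build SCM strings rather than arrays of code points.
Invalid bytes, and any bytes that would decode into the reserved range,
go to U+109700 + byte, one byte at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reserved private-use range from the proposal: U+109700..U+1097FF. */
#define ESCAPE_BASE 0x109700u
#define ESCAPE_LAST 0x1097FFu

/* Hypothetical helper: strictly decode one UTF-8 sequence at s, with n
   bytes available.  Returns the sequence length and stores the code
   point in *cp, or returns 0 if the bytes are not well-formed UTF-8. */
static size_t decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                  /* stray continuation or bad lead byte */

    if (n < len) return 0;          /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                   /* overlong, out of range, or surrogate */
    *cp = c;
    return len;
}

/* Bytes -> code points.  out needs room for n entries (worst case is
   one code point per input byte).  Returns the number written. */
size_t permissive_decode(const unsigned char *in, size_t n, uint32_t *out)
{
    size_t i = 0, count = 0;

    assert(in != NULL && out != NULL);
    while (i < n) {
        uint32_t cp = 0;
        size_t len = decode_one(in + i, n - i, &cp);

        if (len == 0 || (cp >= ESCAPE_BASE && cp <= ESCAPE_LAST)) {
            /* Invalid, or would collide with the reserved range: escape
               this one byte.  Any continuation bytes that follow are
               invalid on their own and get escaped on later iterations. */
            out[count++] = ESCAPE_BASE + in[i];
            i += 1;
        } else {
            out[count++] = cp;
            i += len;
        }
    }
    return count;
}

/* Code points -> bytes.  out needs room for 4 * n bytes.  Assumes every
   input value is a valid Unicode scalar value. */
size_t permissive_encode(const uint32_t *in, size_t n, unsigned char *out)
{
    size_t i, count = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        uint32_t c = in[i];

        if (c >= ESCAPE_BASE && c <= ESCAPE_LAST) {
            out[count++] = (unsigned char)(c - ESCAPE_BASE);  /* raw byte */
        } else if (c < 0x80) {
            out[count++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[count++] = (unsigned char)(0xC0 | (c >> 6));
            out[count++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[count++] = (unsigned char)(0xE0 | (c >> 12));
            out[count++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[count++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[count++] = (unsigned char)(0xF0 | (c >> 18));
            out[count++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[count++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[count++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return count;
}
```

With this sketch, feeding the bytes 0xF4 0x89 0x9F 0xBF to
permissive_decode gives the four code points from the quoted example,
and permissive_encode turns them back into the same four bytes, which is
the "without loss" property the thread is after.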