From: ludo@gnu.org (Ludovic Courtès)
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Wide strings status
Date: Tue, 21 Apr 2009 23:37:35 +0200
Message-ID: <87bpqpu1r4.fsf@gnu.org>
References: <1240279908.3133.76.camel@localhost.localdomain>
To: guile-devel@gnu.org

Hello!

Mike Gran writes:

> Strings are internally encoded either as "narrow" 8-bit ISO-8859-1
> strings or as "wide" UTF-32 strings. Strings are usually created as
> narrow strings. Narrow strings get automatically widened to wide
> strings if non-8-bit characters are set! or appended to them.

Great!

> The machine-readable "write" form of strings has been changed. Before,
> non-printable characters were given as hex escapes, for example \xFF.
> Now there are three levels of hex escape for 8, 16, and 24 bit
> characters: \xFF, \uFFFF, \UFFFFFF. This is a pretty common convention.
> But after I coded this, I noticed that R6RS has a different convention
> and I'll probably go with that.

OK. I think it's probably good to follow R6RS when it has something to
say.

> The internal representation of strings seems to work already, but, the
> reader doesn't work yet. For now, one can make wide strings like this:
> >> (setlocale LC_ALL "")
> ==> "en_US.UTF-8"
> >> (define str (apply string (map integer->char '(100 200 300 400 500))))
> >> (write str)
> ==>"d\xc8\u012c\u0190\u01f4"
> > (display str)
> ==>dÈĬƐǴ

Eh eh, looks nice. Looking forward to typing `(λ (x y) (+ x y))'. ;-)

> This is all going to be slower than before because of the string
> conversion operations, but, I didn't want to do any premature
> optimization. First, I wanted to get it working, but, there is plenty
> of room for optimization later.

Good. Maybe it'd be nice to add simple micro-benchmarks for
`string-ref', `string-set!' et al. under `benchmarks'.

> Character encoding needs to be a property of ports, so that not all
> string operations are done in the current locale. This is necessary so
> that UTF-8-encoded source files are not interpreted differently based on
> the current locale.

You seem to imply that `scm_getc ()' will now return a Unicode
codepoint, is that right?

What about `scm_c_{read,write} ()', and `scm_{get,put}s ()'?

> The VM and interpreter need to be updated to deal with wide chars and
> probably in other ways that are unclear to me now. Wide strings are
> currently getting truncated to 8-bit somewhere in there.

The compiler could use bytevectors when dealing with bytecode. Maybe
that would clarify things.

Thanks,
Ludo'.
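
For anyone following the thread, here is a rough C sketch of the scheme quoted above: strings start narrow, get widened when a character above 0xFF is stored, and are written with the three escape levels (\xHH, \uHHHH, \UHHHHHH). This is only an illustration under my own assumptions, not Guile's actual code: the `wstr' type and the `wstr_*' and `xcalloc' names are hypothetical, and the escape format is the pre-R6RS one from the quoted message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical string type, NOT Guile's real layout: a string is either
   narrow (one ISO-8859-1 byte per character) or wide (one UTF-32 code
   unit per character).  */
typedef struct {
  int is_wide;
  size_t len;                /* length in characters */
  union {
    uint8_t *n;              /* valid when !is_wide */
    uint32_t *w;             /* valid when is_wide */
  } buf;
} wstr;

/* Zero-filled allocation that aborts on failure.  */
static void *xcalloc(size_t count, size_t size)
{
  void *p = calloc(count ? count : 1, size);
  if (p == NULL)
    abort();
  return p;
}

/* Strings start out narrow, filled with NUL characters.  */
wstr *wstr_make(size_t len)
{
  wstr *s = xcalloc(1, sizeof *s);
  s->len = len;
  s->buf.n = xcalloc(len, sizeof *s->buf.n);
  return s;
}

void wstr_free(wstr *s)
{
  if (s->is_wide)
    free(s->buf.w);
  else
    free(s->buf.n);
  free(s);
}

uint32_t wstr_ref(const wstr *s, size_t i)
{
  assert(i < s->len);
  return s->is_wide ? s->buf.w[i] : s->buf.n[i];
}

/* Store character C at index I, widening the whole string first if C
   does not fit in 8 bits.  */
void wstr_set(wstr *s, size_t i, uint32_t c)
{
  assert(i < s->len);
  if (!s->is_wide && c > 0xFF) {
    uint32_t *w = xcalloc(s->len, sizeof *w);
    for (size_t k = 0; k < s->len; k++)
      w[k] = s->buf.n[k];
    free(s->buf.n);
    s->buf.w = w;
    s->is_wide = 1;
  }
  if (s->is_wide)
    s->buf.w[i] = c;
  else
    s->buf.n[i] = (uint8_t) c;
}

/* Return a freshly allocated "write" form: printable ASCII as-is,
   everything else as \xHH, \uHHHH or \UHHHHHH.  */
char *wstr_write(const wstr *s)
{
  /* Worst case per character is "\U" plus 8 hex digits; add two quotes
     and the terminating NUL.  */
  size_t cap = s->len * 10 + 3;
  char *out = xcalloc(cap, 1);
  size_t pos = 0;

  out[pos++] = '"';
  for (size_t i = 0; i < s->len; i++) {
    unsigned long c = wstr_ref(s, i);
    if (c == '"' || c == '\\')
      pos += (size_t) snprintf(out + pos, cap - pos, "\\%c", (int) c);
    else if (c >= 0x20 && c < 0x7F)
      out[pos++] = (char) c;
    else if (c <= 0xFF)
      pos += (size_t) snprintf(out + pos, cap - pos, "\\x%02lx", c);
    else if (c <= 0xFFFF)
      pos += (size_t) snprintf(out + pos, cap - pos, "\\u%04lx", c);
    else
      pos += (size_t) snprintf(out + pos, cap - pos, "\\U%06lx", c);
  }
  out[pos++] = '"';
  out[pos] = '\0';
  return out;
}
```

Storing the code points 100, 200, 300, 400, 500 with `wstr_set' keeps the string narrow until 300 is stored, and `wstr_write' then yields "d\xc8\u012c\u0190\u01f4", the same text as in the quoted session. Widening copies the whole buffer, which is the kind of conversion cost the micro-benchmarks mentioned above would measure.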