From: Mike Gran
Newsgroups: gmane.lisp.guile.devel
Subject: Wide strings status
Date: Mon, 20 Apr 2009 19:11:48 -0700
Message-ID: <1240279908.3133.76.camel@localhost.localdomain>
To: Guile Devel

Hi,

OK. I've uploaded a "string-abstraction" branch so that you can see what I've been doing over the last couple of months.

Currently, I do have a version of Guile that uses Unicode codepoints for characters. The C representation of chars was changed to scm_t_uint32 throughout the code.

Strings are internally encoded either as "narrow" 8-bit ISO-8859-1 strings or as "wide" UTF-32 strings. Strings are usually created as narrow strings. Narrow strings get automatically widened to wide strings if non-8-bit characters are set! or appended to them.
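To make the narrow/wide scheme described above concrete, here is a minimal sketch in C. It is not the code in the string-abstraction branch: every name in it (sketch_string, sketch_string_set, and so on) is hypothetical, and it uses plain uint32_t where Guile would use scm_t_uint32. It only assumes what the text states: strings start out narrow (one ISO-8859-1 byte per character) and are widened to UTF-32 the first time a character above 0xFF is stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout, for illustration only; not Guile's actual struct. */
typedef struct {
    size_t len;               /* length in characters */
    int is_wide;              /* 0: narrow (Latin-1), 1: wide (UTF-32) */
    union {
        uint8_t *narrow;      /* code points 0..255, one byte each */
        uint32_t *wide;       /* any code point, four bytes each */
    } buf;
} sketch_string;

/* Create a narrow string of len characters, all set to fill.
   Returns 0 on success, -1 if allocation fails. */
int sketch_string_init(sketch_string *s, size_t len, uint8_t fill)
{
    s->len = len;
    s->is_wide = 0;
    s->buf.narrow = malloc(len ? len : 1);
    if (s->buf.narrow == NULL)
        return -1;
    memset(s->buf.narrow, fill, len);
    return 0;
}

/* Convert a narrow string to wide. A Latin-1 byte value is equal to
   its Unicode code point, so widening is a plain element-wise copy. */
int sketch_string_widen(sketch_string *s)
{
    uint32_t *w;
    size_t i;

    if (s->is_wide)
        return 0;
    w = malloc((s->len ? s->len : 1) * sizeof *w);
    if (w == NULL)
        return -1;
    for (i = 0; i < s->len; i++)
        w[i] = s->buf.narrow[i];
    free(s->buf.narrow);
    s->buf.wide = w;
    s->is_wide = 1;
    return 0;
}

/* Read the code point at index i, whatever the representation. */
uint32_t sketch_string_ref(const sketch_string *s, size_t i)
{
    assert(i < s->len);
    return s->is_wide ? s->buf.wide[i] : s->buf.narrow[i];
}

/* Store code point ch at index i, widening first if ch needs more
   than 8 bits. Returns 0 on success, -1 on bad index or allocation
   failure. */
int sketch_string_set(sketch_string *s, size_t i, uint32_t ch)
{
    if (i >= s->len)
        return -1;
    if (!s->is_wide && ch > 0xFF) {
        if (sketch_string_widen(s) != 0)
            return -1;
    }
    if (s->is_wide)
        s->buf.wide[i] = ch;
    else
        s->buf.narrow[i] = (uint8_t) ch;
    return 0;
}

/* Release the character buffer. */
void sketch_string_free(sketch_string *s)
{
    if (s->is_wide)
        free(s->buf.wide);
    else
        free(s->buf.narrow);
    s->len = 0;
}
```

In this sketch, storing 0xC8 in a fresh string leaves it narrow, while storing 0x12C (decimal 300) triggers the widening copy. Callers that go through sketch_string_ref and sketch_string_set never need to know which representation is in use.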
Outside of the core strings module and srfi-13, a set of accessor functions is used to access strings. I did my best to keep the internal representation of strings isolated to those two modules. This means that almost every instance of the pervasive scm_i_string_chars() was removed.

The machine-readable "write" form of strings has been changed. Before, non-printable characters were given as hex escapes, for example \xFF. Now there are three levels of hex escape for 8-, 16-, and 24-bit characters: \xFF, \uFFFF, \UFFFFFF. This is a pretty common convention. But after I coded this, I noticed that R6RS has a different convention, and I'll probably go with that.

The internal representation of strings seems to work already, but the reader doesn't work yet. For now, one can make wide strings like this:

    > (setlocale LC_ALL "")
    ==> "en_US.UTF-8"
    > (define str (apply string (map integer->char '(100 200 300 400 500))))
    > (write str)
    ==> "d\xc8\u012c\u0190\u01f4"
    > (display str)
    ==> dÈĬƐǴ

This is all going to be slower than before because of the string conversion operations, but I didn't want to do any premature optimization. First I wanted to get it working; there is plenty of room for optimization later.

Anyway, if, code-wise, it is agreed that I'm generally on the right track, the next steps are these:

Write a plethora of unit tests on what has been accomplished so far.

Character sets need to be modified to have more than 256 entries.

Character encoding needs to be a property of ports, so that not all string operations are done in the current locale. This is necessary so that UTF-8-encoded source files are not interpreted differently based on the current locale.

For programs that have been abusing strings for containing binary data, some accommodation needs to be made. Maybe make a "binary" locale.

The VM and interpreter need to be updated to deal with wide chars, and probably in other ways that are unclear to me now. Wide strings are currently getting truncated to 8-bit somewhere in there.

Thanks,

Mike Gran