From: Marius Vollmer
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Unicode and Guile
Date: Wed, 12 Nov 2003 01:06:39 +0100
Message-ID: <874qxajvqo.fsf@zagadka.ping.de>
In-Reply-To: <20031106181635.GA9546@lark> (Andy Wingo's message of "Thu, 6 Nov 2003 20:16:35 +0200")

Please allow me to randomly dump my thoughts on Guile and Unicode:

- The principal tension that I see is between having a memory-efficient
  representation (UTF-8) and one that is simple and concept-compatible
  with the old way (fixed-width, maybe UTF-32).

- But is there a fixed-width Unicode representation? I.e., is UTF-32
  just like ASCII only with more bits, or is there more to it? Are
  there combining characters in UTF-32? If there are, then there is no
  reason to go looking for a fixed-width, old-style text
  representation.

- If we go with a variable-width encoding, we can just as well use
  UTF-8 and replace strings/chars with something new, like Tom's
  texts/graphemes.

- What kind of data type are strings anyway? Vectors or lists?
  Traditionally, they have been mutable vectors, but variable-width
  encoding of 'characters' might force us to rethink this in general.
  People expect constant-time access for vector-like things, but we
  will probably not want to guarantee it for a variable-width encoding
  (with integers as indices).

- So the text/grapheme API should maybe be more abstract, and not use
  integers to refer to graphemes contained in texts, but some opaque
  'iterator', 'subtext' or 'grapheme range' thing.

- Shared subtexts or grapheme ranges are easy to do for read-only
  texts, but harder for mutable text.
  So texts should maybe be immutable by default. Mutable texts and
  pointers into them might use a more expensive data structure, like a
  gap buffer.

- For Guile specifically, the problematic thing is the C API. Right
  now, strings are pretty much fixed to be vectors of unsigned bytes.
  We can't do much about this without breaking code. So from that
  point of view, a new API for Unicode stuff looks like a good thing
  as well, if we can convince ourselves that people are willing to
  move over to that new API.

- The representation of texts would be determined by what is most
  natural for existing C code. I.e., I think that Gtk+ uses UTF-8, and
  if we find that most libraries that we want to access from Guile use
  UTF-8 as well, we should make our text representation UTF-8.

- Old code can be supported by allowing string-*, char-*, etc. to work
  on UTF-8 encoded texts that use only ASCII code points. That will
  cause problems for the 8-bit users (like latin-1, etc.), though. C
  code must avoid storing non-ASCII characters into such strings, and
  I'm not sure right now whether we can keep it from doing that in a
  compatible way.

- ... :)

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3 331E FAF8 226A D5D4 E405
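
On the UTF-32 question raised above: UTF-32 is fixed-width per code point, but combining characters are ordinary code points, so one user-perceived character can still span several UTF-32 units. A minimal sketch in Python (not Guile code, and only meant to illustrate the point) using the standard unicodedata module:

```python
import unicodedata

# The same visible letter, spelled two ways.
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# UTF-32 spends exactly 4 bytes per code point ("-le" avoids a BOM).
utf32_units = len(decomposed.encode("utf-32-le")) // 4
utf8_bytes = len(decomposed.encode("utf-8"))

# A non-zero canonical combining class marks a combining character.
is_combining = unicodedata.combining(decomposed[1]) != 0

# NFC normalization composes the pair into the single code point.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(utf32_units, utf8_bytes, is_combining, same_after_nfc)
```

Here the decomposed spelling takes two UTF-32 units for one visible letter, so indexing a UTF-32 array by integer gives constant-time access to code points, not to graphemes.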
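
The opaque 'iterator' idea from the list above could look roughly like the following. This is a sketch under loud assumptions: the names Text, Cursor, advance, subtext and code_points are all hypothetical, it is written in Python rather than against Guile's C API, and it steps by code point for brevity (real grapheme boundaries would need the Unicode segmentation rules on top).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cursor:
    """Opaque position in a Text; callers should not do arithmetic on it."""
    offset: int  # byte offset of a code point boundary (implementation detail)


class Text:
    """Immutable UTF-8 text (hypothetical API, for illustration only)."""

    def __init__(self, s: str):
        self._bytes = s.encode("utf-8")

    def start(self) -> Cursor:
        return Cursor(0)

    def at_end(self, c: Cursor) -> bool:
        return c.offset >= len(self._bytes)

    def advance(self, c: Cursor) -> Cursor:
        """Step over one code point by skipping UTF-8 continuation bytes."""
        i = c.offset + 1
        # Continuation bytes have the bit pattern 10xxxxxx.
        while i < len(self._bytes) and (self._bytes[i] & 0xC0) == 0x80:
            i += 1
        return Cursor(i)

    def subtext(self, a: Cursor, b: Cursor) -> str:
        """Return the text between two cursors (decoded for readability)."""
        return self._bytes[a.offset:b.offset].decode("utf-8")


def code_points(t: Text):
    """Yield each code point of t as a one-element string."""
    c = t.start()
    while not t.at_end(c):
        n = t.advance(c)
        yield t.subtext(c, n)
        c = n
```

For example, list(code_points(Text("a\u00e9\u20ac"))) walks a one-byte, a two-byte and a three-byte sequence without the caller ever seeing a byte offset. Because the text is immutable, cursors never go stale, which is the property that makes shared subtexts easy in the read-only case.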
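
For the mutable case, a gap buffer keeps the free space at the edit point, so repeated inserts and deletes near the same position are cheap while moving the edit point costs a copy. A minimal sketch, again in Python and with a hypothetical GapBuffer class; it stores one code point per slot for simplicity, which is an assumption and not a proposal for Guile's representation:

```python
class GapBuffer:
    """Mutable text with a movable gap of free slots (illustrative only)."""

    def __init__(self, text: str = "", gap: int = 16):
        self._buf = list(text) + [None] * gap
        self._gap_start = len(text)
        self._gap_end = len(self._buf)

    def __len__(self) -> int:
        return len(self._buf) - (self._gap_end - self._gap_start)

    def __str__(self) -> str:
        return "".join(self._buf[:self._gap_start] + self._buf[self._gap_end:])

    def _move_gap(self, pos: int) -> None:
        """Shift elements across the gap until the gap starts at pos."""
        if not 0 <= pos <= len(self):
            raise IndexError("position out of range")
        while self._gap_start > pos:
            self._gap_start -= 1
            self._gap_end -= 1
            self._buf[self._gap_end] = self._buf[self._gap_start]
        while self._gap_start < pos:
            self._buf[self._gap_start] = self._buf[self._gap_end]
            self._gap_start += 1
            self._gap_end += 1

    def _grow(self, extra: int = 16) -> None:
        """Open up extra free slots at the current gap position."""
        self._buf[self._gap_start:self._gap_start] = [None] * extra
        self._gap_end += extra

    def insert(self, pos: int, s: str) -> None:
        self._move_gap(pos)
        for ch in s:
            if self._gap_start == self._gap_end:
                self._grow()
            self._buf[self._gap_start] = ch
            self._gap_start += 1

    def delete(self, pos: int, count: int) -> None:
        """Delete up to count elements starting at pos by widening the gap."""
        self._move_gap(pos)
        count = min(count, len(self) - pos)
        self._gap_end += count
```

Usage: after g = GapBuffer("hello world") and g.insert(5, ","), str(g) is "hello, world". Note that any pointer into such a buffer has to be adjusted or invalidated when the gap moves, which is part of why mutable texts are the harder case.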