From: Tom Lord
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Unicode and Guile
Date: Tue, 11 Nov 2003 11:02:55 -0800 (PST)
Message-ID: <200311111902.LAA25202@morrowfield.regexps.com>
References: <20031021171534.GA13246@lark> <200310260003.RAA10375@morrowfield.regexps.com> <20031031132525.GB715@lark> <200311032031.MAA19389@morrowfield.regexps.com> <20031106181635.GA9546@lark>
In-reply-to: <20031106181635.GA9546@lark> (message from Andy Wingo on Thu, 6 Nov 2003 20:16:35 +0200)
To: wingo@pobox.com
Cc: guile-devel@gnu.org

> From: Andy Wingo

[...long thing...]

Thanks for the pointer to the Python type (on which I won't comment :-).
Thanks for the excuse to think about this more. At the end of this
proposal, I've addressed your "use case".

-t

		Towards Standard Scheme Unicode Support

* The Problems

There are two major obstacles to providing nice, non-culturally-biased
Unicode support in standard Scheme.

First, the required standard character and string procedures are
fundamentally inconsistent with the structure of Unicode. Second,
attempts to ignore that fact and "force fit" Unicode into them anyway
inevitably result in a set of text-manipulation primitives that are
too low level -- primitives that require even very simple
text-manipulation programs to be far more "aware" of the details of
Unicode encodings and structure than they ought to be.

** CHAR? Makes No Sense In Unicode

Consider the Unicode character U+00DF "LATIN SMALL LETTER SHARP S"
(aka eszett). Clearly it should behave this way:

    (char-alphabetic? eszett) => #t
    (char-lower-case? eszett) => #t

and it is required that:

    (char-ci=? eszett (char-upcase eszett)) => #t
    (char-upper-case? (char-upcase eszett)) => #t

but now what exactly does:

    (char-upcase eszett)

return? The upper-case mapping of eszett is a two-character sequence,
"SS". It's not even a Unicode base character plus combining
characters -- it's two base characters, a string.
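To see the contradiction in running code, here is a minimal sketch.
(It assumes the implementation admits U+00DF as a CHAR? value at code
point #xDF; whether INTEGER->CHAR accepts that argument is
implementation-dependent.)

    ;; Sketch only: assumes INTEGER->CHAR accepts the code point #xDF.
    (define eszett (integer->char #xDF)) ; U+00DF LATIN SMALL LETTER SHARP S
    (char-alphabetic? eszett)            ; => #t
    (char-lower-case? eszett)            ; => #t
    ;; CHAR-UPCASE must return a single character satisfying
    ;; (char-ci=? eszett (char-upcase eszett)), but the Unicode
    ;; upper-case mapping of U+00DF is the two-character string "SS",
    ;; so there is no single CHAR? it can correctly return:
    (char-upcase eszett)                 ; => ???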
Eszett is not an isolated anomaly (though, admittedly, it is not the
common case). Here is a pointer to the data file of similarly
problematic case mappings:

    http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

So, something has to give, somewhere :-)

[Case mappings are a particularly clear example, but I suspect that
there are other "character manipulation" operators that make sense in
Unicode but, similarly, don't map onto a standard CHAR? type.]

** Other Approaches are Too Low Level

Consider the example of attempting to write a procedure, in portable
Scheme, which performs "studly capitalization". It should accept a
string like:

    a studly capitalizer

and return a string like:

    a StUDly CaPItalIZer

In the simple world of the Scheme CHAR and STRING types, such a
procedure is quite simple to write _and_get_completely_correct_. It
would make a good exercise for a new programming student. Let's
assume that the student solves the problem in a reasonable way: by
iterating over the string and, at random positions, replacing a
character with its upper-case equivalent. Simple enough.
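In portable R5RS Scheme, the student's solution might look like this
sketch (RANDOM-BOOLEAN is a hypothetical coin-flip procedure, not
part of any standard):

    ;; Randomly upcase characters, in place, on a copy of STR.
    ;; RANDOM-BOOLEAN is hypothetical.
    (define (studly-capitalize str)
      (let ((s (string-copy str)))
        (do ((i 0 (+ i 1)))
            ((= i (string-length s)) s)
          (if (random-boolean)
              (string-set! s i (char-upcase (string-ref s i)))))))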
Unfortunately, there does not (and cannot) exist a mapping of Unicode
onto the standard character and string types that would not break our
student's program. His program can still _often_ give a correct
result, but to produce a completely correct program, he will have to
take a far different and, as things stand, more complicated approach.

** One Approach Comes Close

Ray Dillinger has recently proposed, on comp.lang.scheme, a treatment
of Unicode in which a CHAR? value may be:

    ~ a Unicode base character

    ~ a Unicode base character plus a sequence of 1 or more Unicode
      combining characters

That goes a very long way towards solving the problem. For example,
if I had asked our student to write an anagram generator instead of a
studly capitalizer, Ray's solution would preserve the correctness of
the student's program.

Unfortunately, Ray's approach still has problems. It cannot handle
case mappings correctly, as noted above. In Ray's system, there are
an infinite number of non-EQV? CHAR? values, and therefore
CHAR->INTEGER may return a bignum (for the Indic, Tibetan, and Hangul
Jamo alphabets, it would apparently return a bignum frequently). With
an infinite set of characters, libraries (such as SRFI-14
"Character-sets") which are designed with a finite character set in
mind cannot be ported. The issue of multi-character case mappings
aside, it is difficult to see how to preserve the required ordering
isomorphism between characters and their integer representations.

Nevertheless, Ray's idea that a "conceptual character" is part of an
infinite set of values, and a "conceptual string" a sequence of
those, is the basis of this proposal.

* The Proposal

The proposal has two parts. Part 1 introduces a new type, TEXT?,
which is a string-like type that is compatible with Unicode, and a
subtype of TEXT?, GRAPHEME?, to represent "conceptual characters".
Part 2 discusses what can become of the STRING? and CHAR? types in
this context.

** The TEXT? and GRAPHEME? Types

[This is a sketch of a specification -- not yet even a first draft of
a specification.]

~ (text? obj) =>

  True if OBJ is a text object, false otherwise. A text object
  represents a string of printed graphemes.

~ (utf8->text string) =>
~ (utf16->text string) =>
~ (utf16be->text string) =>
~ (utf16le->text string) =>
  [...]
~ (text->utf8 text) =>
  [...]

  The usual conversions from strings (presumed to be sequences of
  octets) to text, and back.

A subset of text objects are distinguished as graphemes:

~ (grapheme? obj) =>

  True if OBJ is a text object which is a grapheme, false otherwise.
  The set of graphemes is defined to be isomorphic to the set of all
  Unicode base characters and well-formed Unicode combining character
  sequences (and is thus an infinite set).

~ (grapheme=? g1 g2 [locale]) =>
~ (grapheme<? g1 g2 [locale])
  [...]
~ (grapheme-ci=? g1 g2 [locale])
~ (grapheme-ci<? g1 g2 [locale])

  The usual orderings. Here and elsewhere I've left the optional
  parameter LOCALE there as a kind of place-holder. There are many
  possible collation orders for text, and programs need a way to
  distinguish which they mean (as well as have a reasonable default).

  It is important to note that, in general, EQV? and EQUAL? do _not_
  test for grapheme equality. GRAPHEME=? must be used instead. Also
  note that this proposal does not include GRAPHEME->INTEGER or
  INTEGER->GRAPHEME. I have not included, but probably should
  include, a hash value procedure which hashes GRAPHEME=? values
  equally.

~ (grapheme-upcase g) =>
~ (grapheme-downcase g) =>
~ (grapheme-titlecase g) =>

  Note that these return texts, not necessarily graphemes. For
  example, GRAPHEME-UPCASE of eszett would return a text
  representation of "SS".

All texts, including graphemes, behave like (conceptual) strings:

~ (text-length text) =>

  Return the number of graphemes in TEXT.

~ (subtext text start end) =>

  Return a subtext of TEXT containing the graphemes beginning at
  index START (inclusive) and ending at END (exclusive).

~ (text=? t1 t2 [locale]) =>
~ (text<? t1 t2 [locale])
  [...]

  The usual ordering predicates.

~ (text-append text ...) =>
~ (list->text list-of-graphemes) =>

  Various constructors for text....

However, instead of TEXT-SET!, we have:

~ (text-replace! text start end replacement)

  Replace the graphemes at [START, END) in TEXT with the graphemes in
  the text object REPLACEMENT. Passing #t for END is equivalent to
  passing an index 1 position beyond START. TEXT must be a mutable
  text object (see below).

Implementations are permitted to make _some_ graphemes immutable. In
particular:

~ (text-ref text index) =>

  Return the grapheme at position INDEX in TEXT. The grapheme
  returned may be immutable.

~ (text->list text) =>

  The graphemes returned may be immutable.

~ (char->grapheme char) =>
~ (utf8->grapheme string) =>
  [....]

  Conversions to possibly immutable graphemes.

And some simple I/O extensions:

~ (read-grapheme [port]) =>
~ (peek-grapheme [port]) =>
  [etc.]

There is still an awkwardness, however. Consider writing the "StUDly
CaPItalIZer" procedure. It's tempting to write it as a loop that uses
an integer grapheme index to iterate over the text, randomly picking
graphemes to change the case of. That wouldn't work, though: changing
the case of one grapheme can change the length of the text, right at
the point being indexed, and invalidate the indexes.

So, texts really need markers that work like those in Emacs:

~ (make-text-marker text index) =>
~ (text-marker? obj) =>
~ (marker-text marker) =>
~ (marker-index marker) =>
~ (set-marker-index! marker index)
~ (set-marker! marker text index)

  etc.

  Changes (by TEXT-REPLACE!) to the region of a text object to the
  left of a marker leave the marker in the same position relative to
  the right end of the text, and vice versa. Changes to a region
  which _includes_ a marker leave the marker at the last grapheme
  index of the replacement text that was inserted or, if the
  replacement was empty, at its old index position minus the number
  of graphemes deleted to the marker's left.

The procedures SUBTEXT, TEXT-REPLACE!, TEXT-REF, and others that
accept indexes can accept markers as those indexes.
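For instance, here is a sketch of a correct studly capitalizer over
the proposed types (again with a hypothetical RANDOM-BOOLEAN; the
text, grapheme, and marker procedures are as proposed above):

    ;; Sketch over the proposed TEXT? type.  A multi-grapheme case
    ;; mapping (e.g. eszett -> "SS") lengthens the text, but the
    ;; marker keeps our position valid.
    (define (text-studly-capitalize! text)
      (let ((m (make-text-marker text 0)))
        (let loop ()
          (if (>= (marker-index m) (text-length text))
              text
              (begin
                (if (random-boolean)
                    ;; #t for END replaces the single grapheme at M's
                    ;; index; TEXT-REPLACE! leaves M at the last
                    ;; grapheme of the inserted replacement.
                    (text-replace! text (marker-index m) #t
                                   (grapheme-upcase
                                    (text-ref text (marker-index m)))))
                ;; Step one grapheme past the (possibly replaced) spot.
                (set-marker-index! m (+ 1 (marker-index m)))
                (loop))))))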
Unlike markers, text properties and overlays aren't strictly needed
to make TEXT? useful -- but they would make a good addition. The
issue is that mutating procedures (like TEXT-REPLACE!) should be
aware of properties in order to update them properly. If properties
and overlays are left out, and people have to implement them in a
higher layer, then their "attributed text" data type can't be passed
to a procedure that just expects a text object.

* Optional Changes to CHAR? and STRING?

The above specification of the TEXT? and GRAPHEME? types is useful on
its own, but it might be considerably more convenient in
implementations which also adopt the following ideas:

~ CHAR? is an octet; STRING? is a sequence of octets.

~ STRING? values are resizable.

~ STRING? values contain an "encoding" attribute, which may be any of:

      utf8
      utf16be
      utf16le
      utf32

  or an implementation-defined value. Note, however, that procedures
  such as STRING-REF ignore this attribute and view strings as
  sequences of octets. STRING-APPEND implicitly converts its second
  and subsequent arguments to the same encoding as its first.

~ (text? "a string") => #t
~ (grapheme? #\a) => #t

  In other words, all character values are graphemes, and all strings
  are text values.

These ideas _could_ be taken even a step further with the addition of:

~ TEXT? values contain an "encoding" attribute, just as strings do
  (utf8, etc.)

~ (string? a-text-value) => #t
~ (char? a-grapheme) =>

  All text values can be strings; some graphemes can be characters.
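A small sketch of the octet view versus the grapheme view under these
optional changes (assumption: the optional changes above are in
effect, so CHAR? values are octets, and #xCC #x81 are the two octets
of the UTF-8 encoding of U+0301 COMBINING ACUTE ACCENT):

    ;; "a" followed by the UTF-8 octets of U+0301 COMBINING ACUTE ACCENT.
    (define s (string #\a (integer->char #xCC) (integer->char #x81)))
    (string-length s)             ; => 3  -- three octets
    (text-length (utf8->text s))  ; => 1  -- one grapheme: "a" + U+0301
    (text? s)                     ; => #t -- every string is a text value
    (grapheme? #\a)               ; => #t -- every character is a grapheme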
* Summary

The new TEXT? and GRAPHEME? types present a simple and traditional
interface to "conceptual strings" and "conceptual characters". They
make it easy to express simple algorithms simply and without
reference to the internal structure of Unicode.

Reflecting the realities of global text processing, there is no bias
in the interfaces suggesting that the set of graphemes is finite.
Also reflecting those realities: the length of a text object may
change over time; a sequence replacement operator is supplied instead
of an element replacement operator; and markers (similar to those in
text editors) are provided for iteration and other ways of keeping
track of "a position within a text value".

There is no essential difference between a grapheme and a text object
of length 1, and thus the proposal makes GRAPHEME? a subtype of
TEXT?. If STRING? is suitably extended, then it may be equal to or a
subset of TEXT?. Conversely, if TEXT? is suitably extended, it may be
equal to or a subset of STRING?. It may be sensible to unify the two
types (although even analogous string procedures and text procedures
will still behave differently from one another).

CHAR? may be safely viewed as a subtype of GRAPHEME?, but the
converse is not, and cannot, be true.

--------------------------------

> Hm. Let's consider some use cases.

> Let's say an app wants to ask the user her name, she might want to
> write her name in her native Arabic. Or perhaps her address, or
> anything "local". If the app then wants to communicate this
> information to her Chinese friend (who also knows Arabic), the need
> for Unicode is fundamental. We can probably agree there.

Absolutely. What's more, if I'm sitting in California and write a
portable Scheme program that generates anagrams of a name, it'd be
awfully swell if (a) my code doesn't have to "know" anything special
about Unicode internals and (b) my code works when passed her name as
input.

> The question becomes, is the user's name logically a simple string
> (can we read it in with standard procedures), or must we use this
> text-buffer, complete with marks, multiple backends, et al? It
> seems more natural, to me, for this to be provided via simple
> strings, although I could be wrong here.

Scheme's requirements of the CHAR? and STRING? types simply don't map
onto Unicode. The case problem I illustrated above is one example,
and I _suspect_ that there are others, even if you do something like
what Ray is trying to do and make an infinitely large character set.

I _think_ the TEXT? and GRAPHEME? stuff above is about as natural as
"simple strings" -- it just doesn't try to give those types behavior
that makes no sense in Unicode.

> I was looking at what Python did
> (http://www.python.org/peps/pep-100.html), and they did make a
> distinction. They have a separate unicode string representation
> which, like strings, is a subclass of SequenceObject. So they
> maintained separate representation while being relatively
> transparent to the programmer. Pretty slick, it seems.

That URL is slightly wrong. It's:

    http://www.python.org/peps/pep-0100.html

It sounds _ok_, but it's got some problems. The genericity of it
(that these are still sequences) is winning... I'll discuss that
below. Mostly it's a little too low level. They're only (initially?)
supporting the 1-1 case conversions. They are exposing Unicode code
points and just handing users property tables for those. They don't
include a "marker" concept. These are all symptoms of starting off
with an implementation limited to the 16-bit code points -- they
haven't thought through how to do full Unicode support (and once they
do, I'll bet they wind up with something close to TEXT? and
GRAPHEME?).

> C#, Java, and ECMAScript (JavaScript) all apparently use UTF-16
> as their native format, although I've never coded in the first
> two. Networking was probably the most important consideration
> there.

Streaming conversions (e.g., for networking) are cheap and easy. I
think they made those choices to simplify implementations, and then
made the mistake of exposing that implementation detail in the
interfaces.

> Perhaps the way forward would be to leave what we've been
> calling "simple strings" alone, and somehow (perhaps with GOOPS,
> but haven't thought too much about this) pull the Python trick
> of having a unicode string that can be used everywhere simple
> strings can. Thoughts on that idea?

The proposal above makes it possible to pass text everywhere that
simple strings can be used. However, in that part of the proposal,
STRING-REF, STRING-SET!, and so forth are still specified to operate
on octets.

The proposal also makes it possible to pass strings everywhere that
text can be used. I think that's the more interesting direction: just
use the text- and grapheme- procedures from now on, except where you
_really_ want to refer to octets.

-t

_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel