Re: Unicode and Guile

From: Tom Lord <lord@emf.net>
Cc: guile-devel@gnu.org
Subject: Re: Unicode and Guile
Date: Tue, 11 Nov 2003 11:02:55 -0800 (PST)
Message-ID: <200311111902.LAA25202@morrowfield.regexps.com>
In-Reply-To: <20031106181635.GA9546@lark> (message from Andy Wingo on Thu, 6 Nov 2003 20:16:35 +0200)

    > From: Andy Wingo <wingo@pobox.com>

    [...long thing...]

Thanks for the pointer to the Python type (on which I won't comment
:-).   Thanks for the excuse to think about this more.

At the end of this proposal, I've addressed your "use case".

-t

               Towards Standard Scheme Unicode Support

* The Problems

  There are two major obstacles to providing nice,
  non-culturally-biased Unicode support in standard Scheme.  First,
  the required standard character and string procedures are
  fundamentally inconsistent with the structure of unicode.  Second,
  attempts to ignore that fact and "force fit" unicode into them
  anyway inevitably result in a set of text-manipulation primitives
  that are too low level -- that require even very simple text
  manipulation programs to be far more "aware" of the details of
  unicode encodings and structure than they ought to be.

** CHAR? Makes No Sense In Unicode

  Consider the unicode character U+00DF "LATIN SMALL LETTER SHARP S"
  (aka Eszett).

  Clearly it should behave this way:

	(char-alphabetic? eszett) => #t
	(char-lower-case? eszett) => #t

  and it is required that:

	(char-ci=? eszett (char-upcase eszett)) => #t
	(char-upper-case? (char-upcase eszett)) => #t

  but now what exactly does:

	(char-upcase eszett)

  return?  The upper case mapping of eszett is a two character
  sequence, "SS".  It's not even a Unicode base character plus
  combining characters -- it's two base characters, a string.

  Eszett is not an isolated anomaly (though, admittedly, it is not the
  common case).  Here is a pointer to the data file of similarly
  problematic case mappings:

	http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

  So, something has to give, somewhere :-)  

  [Case mappings are a particularly clear example but I suspect
  that there are other "character manipulation" operators that
  make sense in Unicode but, similarly, don't map onto a 
  standard CHAR? type.]

** Other Approaches are Too Low Level

  Consider the example of attempting to write a procedure,
  in portable scheme, which performs "studly capitalization".
  It should accept a string like:

	a studly capitalizer

  and return a string like:

	a StUDly CaPItalIZer

  In the simple world of the scheme CHAR and STRING types, such a
  procedure is quite simple to write _and_get_completely_correct_.
  It would make a good exercise for a new programming student.

  Let's assume that the student solves the problem in a reasonable
  way:  by iterating over the string and, at random positions, 
  replacing a character with its upper case equivalent.  Simple
  enough.
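
  A minimal sketch of such a solution in portable Scheme follows.  The
  name STUDLY-CAPITALIZE and the POSITIONS argument are illustrative
  assumptions: POSITIONS stands in for the student's random choices so
  that the result is reproducible.  The sketch is only correct because
  CHAR-UPCASE always returns exactly one character.

```scheme
;; Upcase the characters of STR at the given indexes.
;; POSITIONS is a list of indexes standing in for random choices.
(define (studly-capitalize str positions)
  (let ((result (string-copy str)))      ; work on a mutable copy
    (for-each
     (lambda (i)
       ;; One character in, one character out: the length never changes.
       (string-set! result i (char-upcase (string-ref result i))))
     positions)
    result))

(studly-capitalize "a studly capitalizer" '(2 4 5 9 11 12 16 17))
;; => "a StUDly CaPItalIZer"
```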

  Unfortunately, there does not (can not) exist a mapping of
  Unicode onto the standard character and string types that would not
  break our student's program.  His program can still _often_ give
  a correct result, but to produce a completely correct program, 
  he will have to take a far different and, as things stand, more
  complicated approach.

** One Approach Comes Close

  Ray Dillinger has recently proposed on comp.lang.scheme a 
  treatment of Unicode in which a CHAR? value may be:

	~ a unicode base character
        ~ a unicode base character plus a sequence of 1 or 
          more unicode combining characters

  That goes a very long way towards solving the problem.  For example, 
  if I had asked our student to write an anagram generator instead of
  a studly capitalizer, Ray's solution would preserve the correctness
  of the student's program.

  Unfortunately, Ray's approach still has problems.  It can not handle
  case mappings correctly, as noted above.  In Ray's system, there are
  an infinite number of non-EQV? CHAR? values and therefore
  CHAR->INTEGER may return a bignum (in Indic, Tibetan, and the Hangul
  Jamo alphabets, it would apparently return a bignum frequently).
  With an infinite set of characters, libraries (such as SRFI-14
  "Character Sets"), which are designed with a finite character set in
  mind, can not be ported.  The issue of multi-character case mappings
  aside, it is difficult to see how to preserve the required ordering
  isomorphism between characters and their integer representations.

  Nevertheless, Ray's idea that a "conceptual character" is part of 
  an infinite set of values and a "conceptual string" a sequence of
  those is the basis of this proposal.

* The Proposal

  The proposal has two parts.   Part 1 introduces a new type, TEXT?, 
  which is a string-like type that is compatible with Unicode, and
  a subtype of TEXT?, GRAPHEME?, to represent "conceptual
  characters". 

  Part 2 discusses what can become of the STRING? and CHAR? types in
  this context.

** The TEXT? and GRAPHEME? Types

  [This is a sketch of a specification -- not yet even a first
   draft of a specification.]

    ~ (text? obj) => <boolean>

	True if OBJ is a text object, false otherwise.

	A text object represents a string of printed graphemes.

    ~ (utf8->text string) => <text>
    ~ (utf16->text string) => <text>
    ~ (utf16be->text string) => <text>
    ~ (utf16le->text string) => <text>
    [...]
    ~ (text->utf8 text) => <string> 
    [...]
	The usual conversions from strings (presumed to be
        sequences of octets) to text.

  A subset of text objects are distinguished as graphemes:

    ~ (grapheme? obj) => <boolean>

      True if OBJ is a text object which is a grapheme,
      false otherwise.

      The set of graphemes is defined to be isomorphic to the set of
      all unicode base characters and well formed unicode combining
      character sequences (and is thus an infinite set).

    ~ (grapheme=? g1 g2 [locale]) => <boolean>
    ~ (grapheme<? g1 g2 [locale])
    ~ (grapheme>? g1 g2 [locale])
    [...]
    ~ (grapheme-ci=? g1 g2 [locale])
    ~ (grapheme-ci<? g1 g2 [locale])
    ~ (grapheme-ci>? g1 g2 [locale])

      The usual orderings.

      Here and elsewhere I've left the optional parameter LOCALE there
      as a kind of place-holder.  There are many possible collation
      orders for text and programs need a way to distinguish which
      they mean (as well as have a reasonable default).

  It is important to note that, in general, EQV? and EQUAL?  do _not_
  test for grapheme equality.  GRAPHEME=? must be used instead.

  Also note that this proposal does not include GRAPHEME->INTEGER or
  INTEGER->GRAPHEME.   I have not included, but probably should
  include, a hash value procedure which hashes GRAPHEME=? values 
  equally.

    ~ (grapheme-upcase g) => <text>
    ~ (grapheme-downcase g) => <text>
    ~ (grapheme-titlecase g) => <text>

       Note that these return texts, not necessarily graphemes.
       For example, GRAPHEME-UPCASE of eszett would return a 
       text representation of "SS".
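
       One way to picture this, as a sketch only: model a grapheme as
       a Scheme string (a base character plus any combining
       characters) and a case mapping as a procedure from such
       strings to strings.  The string representation, the MODEL-
       prefix, and the SPECIAL-UPCASE table are assumptions made for
       illustration and are not part of the proposal.  The table
       holds only the eszett entry from the example; a real
       implementation would presumably be driven by SpecialCasing.txt.

```scheme
;; A grapheme is modeled as a string.  (Assumption for illustration.)
(define eszett (string (integer->char #xDF)))

;; Case mappings whose result is longer than one character.
(define special-upcase
  (list (cons eszett "SS")))

;; Return the upper case form of grapheme G as a string, which may
;; hold more than one grapheme.
(define (model-grapheme-upcase g)
  (let ((special (assoc g special-upcase)))   ; ASSOC compares with EQUAL?
    (if special
        (cdr special)
        (list->string (map char-upcase (string->list g))))))

(model-grapheme-upcase "a")     ; => "A"
(model-grapheme-upcase eszett)  ; => "SS"
```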

  All texts, including graphemes, behave like (conceptual) strings:

    ~ (text-length text) => <integer>

      Return the number of graphemes in TEXT.

    ~ (subtext text start end) => <text>

      Return a subtext of TEXT containing the graphemes beginning at
      index START (inclusive) and ending at END (exclusive).

    ~ (text=? t1 t2 [locale]) => <boolean>
    ~ (text<? t1 t2 [locale]) => <boolean>
    [...]
        The usual ordering predicates.

    ~ (text-append text ...) => <text>
    ~ (list->text list-of-graphemes) => <text>

         Various constructors for text ....

    However, instead of TEXT-SET!, we have:

    ~ (text-replace! text start end replacement)

      Replace the graphemes at [START, END) in TEXT with 
      the graphemes in text object REPLACEMENT.  Passing
      #t for END is equivalent to passing an index 1
      position beyond START.

      TEXT must be a mutable text object (see below).
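
      The effect of a sequence replacement on length can be sketched
      with a simple model in which a text is a list of graphemes and
      each grapheme is a string.  The sketch is non-destructive,
      unlike TEXT-REPLACE!, and the names MODEL-TEXT-REPLACE and
      TAKE-FIRST are hypothetical.

```scheme
;; Return a list of the first N elements of LST.
(define (take-first lst n)
  (if (= n 0)
      '()
      (cons (car lst) (take-first (cdr lst) (- n 1)))))

;; Non-destructive model of TEXT-REPLACE!: return a new list in which
;; the graphemes at [START, END) are replaced by REPLACEMENT.
;; Passing #t for END means START + 1.
(define (model-text-replace text start end replacement)
  (let ((end (if (eq? end #t) (+ start 1) end)))
    (append (take-first text start)
            replacement
            (list-tail text end))))

(define eszett (string (integer->char #xDF)))
(define strasse (list "S" "t" "r" "a" eszett "e"))

;; Replacing one grapheme with two makes the text one grapheme longer.
(model-text-replace strasse 4 #t '("S" "S"))
;; => ("S" "t" "r" "a" "S" "S" "e")
```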

  Implementations are permitted to make _some_ graphemes immutable.
  In particular:

    ~ (text-ref text index) => <grapheme>

      Return  the grapheme at position INDEX in TEXT.
      The grapheme returned may be immutable.

    ~ (text->list text) => <list of graphemes>

      The graphemes returned may be immutable.

    ~ (char->grapheme char) => <grapheme>
    ~ (utf8->grapheme string) => <grapheme>
    [....]

       Conversions to possibly immutable graphemes.

  And some simple I/O extensions:

    ~ (read-grapheme [port]) => <grapheme>
    ~ (peek-grapheme [port]) => <grapheme>
    [etc.]

  There is still an awkwardness, however.  Consider writing the "StUDly
  CaPItalIZer" procedure.  It's tempting to write it as a loop that
  uses an integer grapheme index to iterate over the text, randomly
  picking graphemes to change the case of.  That wouldn't work though:
  changing the case of one character can change the length of text,
  right at the point being indexed, and invalidate the indexes.  So,
  texts really need markers that work like those in Emacs:

    ~ (make-text-marker text index) => <marker>
    ~ (text-marker? obj) => <boolean>
    ~ (marker-text marker) => <text>
    ~ (marker-index marker) => <index>
    ~ (set-marker-index! marker index)
    ~ (set-marker! marker text index)
    etc.

	Changes (by TEXT-REPLACE!) to the region of a text object to
        the left of a marker leave the marker in the same position
        relative to the right end of the text, and vice versa.

        Changes to a region which _includes_ a marker leave the
        marker at the last grapheme index of the replacement
        text that was inserted, or, if the replacement was empty, 
        at its old index position minus the number of graphemes
        deleted to the marker's left.

        The procedures SUBTEXT, TEXT-REPLACE!, and TEXT-REF 
        and others that accept indexes can accept markers as those
        indexes.
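
        The marker rules above can be read as a function from a
        marker's old index to its new index.  The sketch below is one
        interpretation, under the assumption that a replaced region
        [START, END) lies "to the left" of a marker when END <= INDEX
        and "to the right" when START > INDEX.  The procedure name is
        hypothetical.

```scheme
;; Return the new index of a marker at INDEX after the graphemes at
;; [START, END) are replaced by REPLACEMENT-LENGTH graphemes.
(define (marker-index-after-replace index start end replacement-length)
  (cond ((<= end index)
         ;; Change entirely to the left: keep the distance to the
         ;; right end of the text.
         (+ index (- replacement-length (- end start))))
        ((> start index)
         ;; Change entirely to the right: the index is unaffected.
         index)
        ((> replacement-length 0)
         ;; Region includes the marker: move to the last grapheme of
         ;; the inserted replacement.
         (+ start replacement-length -1))
        (else
         ;; Region includes the marker and the replacement is empty:
         ;; old index minus the graphemes deleted to its left.
         (- index (- index start)))))

(marker-index-after-replace 5 2 3 2)  ; => 6
(marker-index-after-replace 5 7 8 2)  ; => 5
(marker-index-after-replace 5 5 6 2)  ; => 6
(marker-index-after-replace 5 3 7 0)  ; => 3
```

        Under this reading, a capitalizer that replaces the grapheme
        at a marker with a two-grapheme upper case form finds the
        marker on the last inserted grapheme, so advancing it by one
        reaches the next unprocessed grapheme.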

  Unlike markers, text properties and overlays aren't strictly needed to
  make TEXT? useful -- but they would make a good addition.   The issue
  is that mutating procedures (like TEXT-REPLACE!) should be aware of
  properties in order to update them properly.    If properties and
  overlays are left out, and people have to implement them in a higher
  layer, then their "attributed text" data type can't be passed to a
  procedure that just expects a text object.

* Optional Changes to CHAR? and STRING?

  The above specification of the TEXT? and GRAPHEME? types is useful on its
  own, but it might be considerably more convenient in implementations
  which also adopt the following ideas:

    ~ CHAR? is an octet, STRING? a sequence of octets

    ~ STRING? values are resizable

    ~ STRING? values contain an "encoding" attribute which may be
      any of 
		utf8
                utf16be
                utf16le
                utf32

      or an implementation-defined value.   Note however that
      procedures such as STRING-REF ignore this attribute and 
      view strings as sequences of octets.

      STRING-APPEND implicitly converts its second and subsequent
      arguments to the same encoding as its first.

    ~ (text? "a string") => #t

    ~ (grapheme? #\a) => #t

  In other words, all character values are graphemes, and all strings
  are text values.

  These ideas _could_ be taken even a step further with the addition
  of:

    ~ TEXT? values contain an "encoding" attribute, just as strings
      do (utf-8, etc.)

    ~ (string? a-text-value) => #t

    ~ (char? a-grapheme) => <boolean>

  All text values can be strings;  some graphemes can be characters.

* Summary

  The new TEXT? and GRAPHEME? types present a simple and traditional
  interface to "conceptual strings" and "conceptual characters".  
  They make it easy to express simple algorithms simply and without
  reference to the internal structure of Unicode.

  Reflecting the realities of global text processing, there is
  no bias in the interfaces suggesting that the set of graphemes
  is finite.

  Also reflecting the realities of global text processing: the length
  of a text object may change over time; a sequence replacement
  operator is supplied instead of an element replacement operator; 
  and markers (similar to those in text editors) are provided for 
  iteration and other examples of keeping track of "a position within
  a text value".

  There is no essential difference between a grapheme and a text
  object of length 1, and thus the proposal makes GRAPHEME? a 
  subtype of TEXT?.

  If STRING? is suitably extended, then it may be equal to or a subset
  of TEXT?.  Conversely, if TEXT? is suitably extended, it may be
  equal to or a subset of STRING?.  It may be sensible to unify the
  two types (although even analogous string procedures and text
  procedures will still behave differently from one another).

  CHAR? may be safely viewed as a subtype of GRAPHEME?, but the 
  converse is not, and cannot be, true.

--------------------------------

    > Hm. Let's consider some use cases.

    > Let's say an app wants to ask the user her name, she might want to write
    > her name in her native Arabic. Or perhaps her address, or anything
    > "local". If the app then wants to communicate this information to her
    > Chinese friend (who also knows Arabic), the need for Unicode is
    > fundamental. We can probably agree there.

Absolutely.    What's more, if I'm sitting in california and write a
portable Scheme program that generates anagrams of a name, it'd be
awfully swell if (a) my code doesn't have to "know" anything special
about unicode internals;  (b) my code works when passed her name as input.

    > The question becomes, is the user's name logically a simple string (can
    > we read it in with standard procedures), or must we use this
    > text-buffer, complete with marks, multiple backends, et al? It seems
    > more natural, to me, for this to be provided via simple strings,
    > although I could be wrong here.

Scheme's requirements of the CHAR? and STRING? types simply don't map
onto unicode.   The case problem I illustrated above is one example
and I _suspect_ that there are others, even if you do something like
Ray's trying to do and make an infinitely large character set.

I _think_ the TEXT? and GRAPHEME? stuff above is about as natural as
"simple strings" -- it just doesn't try to give those types behavior
that makes no sense in Unicode.

    > I was looking at what Python did
    > (http://www.python.org/peps/pep-100.html), and they did make a
    > distinction. They have a separate unicode string representation which,
    > like strings, is a subclass of SequenceObject. So they maintained
    > separate representation while being relatively transparent to the
    > programmer. Pretty slick, it seems.

That URL is slightly wrong.  It's:

     http://www.python.org/peps/pep-0100.html

It sounds _ok_.   It's got some problems.   

The genericity of it (that these are still sequences) is
winning.... I'll discuss that below.

Mostly it's a little too low level.  They're only (initially?)
supporting the 1-1 case conversions.  They are exposing unicode code
points and just handing users property tables for those.  They don't
include a "marker" concept.  These are all symptoms of starting off
with an implementation limited to the 16-bit code points -- they
haven't thought through how to do full unicode support (and once they
do, I'll bet they wind up with something close to TEXT? and
GRAPHEME?).

    > C#, Java, and ECMAScript (JavaScript) all apparently use UTF-16
    > as their native format, although I've never coded in the first
    > two. Networking was probably the most important consideration
    > there.

Streaming conversions (e.g., for networking) are cheap and easy.  I
think they made those choices to simplify implementations, and then
made the mistake of exposing that implementation detail in the
interfaces.

    > Perhaps the way forward would be to leave what we've been
    > calling "simple strings" alone, and somehow (perhaps with GOOPS,
    > but haven't thought too much about this) pull the Python trick
    > of having a unicode string that can be used everywhere simple
    > strings can. Thoughts on that idea?

The proposal above makes it possible to pass text everywhere that
simple strings can be used.  However, in that part of the proposal,
string-ref, string-set! and so forth are still specified to operate
on octets.

The proposal also makes it possible to pass strings everywhere that
text can be used.   I think that's the more interesting direction: 
just use text- and grapheme- procedures from now on except where you
_really_ want to refer to octets.

-t
