Re: Unicode and Guile

From: Andy Wingo <wingo@pobox.com>
Subject: Re: Unicode and Guile
Date: Thu, 6 Nov 2003 20:16:35 +0200
Message-ID: <20031106181635.GA9546@lark>
In-Reply-To: <200311032031.MAA19389@morrowfield.regexps.com>

Well, I just wasted an afternoon reading the Unicode 4.0 spec. I'll
never have that time back again ;)

On Mon, 03 Nov 2003, Tom Lord wrote:

> Part of the problem is that Unicode specifications are very careful to
> _not_ define "character" (except ambiguously).
>
> In different contexts related to my question, it might mean a unicode
> code point, a code value, or something more complicated such as a
> grapheme (which may be represented as a string of unicode code
> points).

There is the term "abstract character"; abstract characters are what
get mapped to code points. But yes, I see that there are some problems
with the concept. What do you mean by "code value"?

We have the following passage in chapter two of the spec, talking
about one of the ten design principles, "Characters, Not Glyphs":

  The Unicode Standard draws a distinction between characters and glyphs.
  Characters are the abstract representations of the smallest components
  of written language that have semantic value. They represent primarily,
  but not exclusively, the letters, punctuation, and other signs that
  constitute natural language text and technical notation. Characters are
  represented by code points that reside only in a memory representation,
  as strings in memory, or on disk. The Unicode Standard deals only with
  character codes.

According to this passage, then, characters are identified with
character codes, i.e. code points, regardless of whether those code
points are encoded as UTF-8, UTF-16, or UTF-32.
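
To make the code point / encoding distinction concrete, here is a
minimal sketch in Python (not Guile code). The two sample characters
and the helper name encoded_sizes are assumptions chosen for
illustration; the point is only that one code point stays one
"character" while its byte length varies with the encoding form.

```python
# Illustration only: the same code point stored in three encoding forms.
samples = {
    "ARABIC LETTER AIN": "\u0639",           # inside the Basic Multilingual Plane
    "MUSICAL SYMBOL G CLEF": "\U0001D11E",   # outside it
}

def encoded_sizes(ch):
    """Return the byte length of one code point in UTF-8, UTF-16, UTF-32."""
    # The -le variants are used so that no byte-order mark is prepended.
    return tuple(len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {encoded_sizes(ch)}")
```

The second sample needs two 16-bit units (a surrogate pair) in UTF-16,
yet it is still a single code point.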

Granted, end users sometimes expect certain character combinations to be
treated as single characters, such as `ch' in traditional Spanish, or a
lower-case a combined with a grave accent in French. That level of
processing is much higher than that of simple strings, and more related
(it seems to me) to the rendering of glyphs on the screen.
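
The a-plus-grave case can be sketched with Python's standard
unicodedata module. This is only an illustration of why a
user-perceived character need not be a single code point; it says
nothing about how Guile should handle it.

```python
import unicodedata

# LATIN SMALL LETTER A followed by COMBINING GRAVE ACCENT.
decomposed = "a\u0300"
# NFC normalization composes the pair into the single code point U+00E0.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))          # number of code points before composing
print(len(composed))            # number of code points after composing
print(decomposed == composed)   # plain code-point comparison
```

Both strings would normally be drawn as the same glyph, but a
code-point-level comparison treats them as different strings.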

I've actually changed my mind about encodings; while UTF-8 or
UTF-16 are good for disk and network transfer, the extensive
character-based API of Scheme (read-char from ports, for instance) lends
itself better to uniform representation for individual characters. So
the natural native format for strings in memory might well be UTF-32.
But that's another issue...
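
A rough sketch of what "uniform representation" buys, written in
Python rather than Guile. The function names are hypothetical. With
4-byte units the nth code point sits at a fixed offset; with UTF-8 the
start of each character has to be found by scanning.

```python
def nth_code_point_utf32(buf, n):
    """Fetch code point n from UTF-32-LE bytes with one offset computation."""
    return int.from_bytes(buf[4 * n:4 * n + 4], "little")

def nth_code_point_utf8(buf, n):
    """Fetch code point n from UTF-8 bytes by scanning from the start."""
    # Continuation bytes look like 10xxxxxx; every other byte starts a character.
    starts = [i for i, b in enumerate(buf) if (b & 0xC0) != 0x80]
    start = starts[n]
    end = starts[n + 1] if n + 1 < len(starts) else len(buf)
    return ord(buf[start:end].decode("utf-8"))

text = "na\u00efve \u0639"
as_utf32 = text.encode("utf-32-le")
as_utf8 = text.encode("utf-8")
print(hex(nth_code_point_utf32(as_utf32, 6)))
print(hex(nth_code_point_utf8(as_utf8, 6)))
```

Both calls return the same code point; only the cost of locating it
differs.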

> It's a nasty problem to try to unify unicode types with scheme types.

Indeed :-/

> Suppose:
> 
> * CHAR? is a code value in some encoding (say, UTF-8 or UTF-16)

I think I'm leaning towards 4-byte values here so that you can never
read-char a partial character.
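
The "partial character" problem can be shown with Python's standard
incremental decoder (again only an illustration, not a proposal for
Guile's port code). Reading a UTF-8 stream one byte at a time can stop
in the middle of a character, whereas every 4-byte unit is a whole code
point.

```python
import codecs

data = "\u0639".encode("utf-8")   # one code point, two bytes in UTF-8
decoder = codecs.getincrementaldecoder("utf-8")()

# Feed one byte at a time: the first byte alone yields no character yet.
pieces = [decoder.decode(bytes([b])) for b in data]
print(pieces)

# With 4-byte units, each unit read decodes to a complete code point.
utf32 = "\u0639".encode("utf-32-le")
units = [utf32[i:i + 4] for i in range(0, len(utf32), 4)]
chars = [chr(int.from_bytes(u, "little")) for u in units]
print(chars)
```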

> * CHAR? is a unicode code point -- a 21 bit value.
> 
>   This approach has the same problems with string efficiency or 
>   complexity

Complexity isn't so much of an issue. Efficiency is, however;
applications with large amounts of string data might want to choose a
different encoding for their storage.
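
A quick way to see the storage cost, sketched in Python with an
arbitrary ASCII-only sample text (the sample is an assumption; real
data will differ):

```python
# Mostly-ASCII data is where fixed 4-byte units cost the most.
ascii_heavy = "The quick brown fox jumps over the lazy dog. " * 1000

sizes = {enc: len(ascii_heavy.encode(enc))
         for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for enc, size in sizes.items():
    print(f"{enc}: {size} bytes")
```

For pure ASCII text the UTF-32 form is four times the size of the
UTF-8 form, which is the efficiency concern above.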

> * CHAR? is a "grapheme" -- the user's idea of a character.
> 
>   Ray Dillenger is currently exploring this (see recent c.l.s.)

Will check this out. It does sound painful, though :/

>     >> There's a need for a new type, `text', which acts like the text
>     >> contents of an emacs buffer
> 
>     > Maybe. This issue is, in my opinion, orthogonal to simple strings.
> 
> But perhaps its worth mentioning in this context because it suggests a
> very straightforward approach for Guile:
> 
> CHAR? is 8 bits.  STRING? is a sequence of 8-bit chars.  And
> everything unicode is orthogonal to that.   While there may be support
> for manipulating unicode strings represented as STRING? and unicode
> characters represented as CHAR?, fundamentally, CHAR? and STRING? are
> kept butt-simple and the unicode support is something new.

Hm. Let's consider some use cases.

Let's say an app wants to ask the user her name; she might want to write
it in her native Arabic. The same goes for her address, or anything
else "local". If the app then wants to communicate this information to
her Chinese friend (who also knows Arabic), the need for Unicode is
fundamental. We can probably agree there.

The question becomes, is the user's name logically a simple string (can
we read it in with standard procedures), or must we use this
text-buffer, complete with marks, multiple backends, etc.? It seems
more natural, to me, for this to be provided via simple strings,
although I could be wrong here.

I was looking at what Python did
(http://www.python.org/peps/pep-100.html), and they did make a
distinction. They have a separate unicode string type which, like the
ordinary string type, implements the sequence protocol. So they
maintained separate representations while staying relatively
transparent to the programmer. Pretty slick, it seems.

C#, Java, and ECMAScript (JavaScript) all apparently use UTF-16 as their
native format, although I've never coded in the first two. Networking
was probably the most important consideration there.

Perhaps the way forward would be to leave what we've been calling
"simple strings" alone, and somehow (perhaps with GOOPS, though I
haven't thought much about this) pull the Python trick of having a
unicode string that can be used everywhere simple strings can. Thoughts
on that idea?
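
One way to picture that idea, sketched in Python rather than GOOPS.
The classes ByteString and WideString and the helper count_char are
hypothetical names invented for this sketch, and Latin-1 is an assumed
stand-in for an 8-bit "simple string". Two representations sit behind
one sequence interface, so code written against the interface does not
care which one it gets.

```python
from collections.abc import Sequence

class ByteString(Sequence):
    """Hypothetical 'simple string': one byte per character (Latin-1)."""
    def __init__(self, text):
        self._data = text.encode("latin-1")
    def __len__(self):
        return len(self._data)
    def __getitem__(self, i):          # integer indices only
        return chr(self._data[i])

class WideString(Sequence):
    """Hypothetical Unicode string: one 4-byte unit per code point."""
    def __init__(self, text):
        self._data = text.encode("utf-32-le")
    def __len__(self):
        return len(self._data) // 4
    def __getitem__(self, i):          # integer indices only
        if not -len(self) <= i < len(self):
            raise IndexError("string index out of range")
        i %= len(self)
        return chr(int.from_bytes(self._data[4 * i:4 * i + 4], "little"))

def count_char(s, ch):
    """Works on either representation through the shared interface."""
    return sum(1 for c in s if c == ch)

print(count_char(ByteString("banana"), "a"))
print(count_char(WideString("\u0639a\u0639"), "\u0639"))
```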

Regards,

wingo.

_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel