unofficial mirror of guile-devel@gnu.org 
* Unicode and Guile
@ 2003-10-21 17:15 Andy Wingo
  2003-10-25 17:08 ` Stephen Compall
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Andy Wingo @ 2003-10-21 17:15 UTC (permalink / raw)


Hey folks,

What's the plan on internationalization of strings in Guile?

If there is no plan, may I suggest that we move our internal
representation of strings to UTF-8. There's an interesting introductory
article written on www.joelonsoftware.com, although I don't have the
link ATM. This has the advantage that ASCII characters up to 127 are
represented the same. Of course, above that characters might take up to
four bytes, which means that all code that processes user-input strings
has to be changed. Painful, eh? But if we hope to write apps that deal
with all languages of the world, that's the only way.
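
To make the byte counts concrete, here is a minimal sketch in Python (chosen only for illustration; nothing in this thread proposes Python). It prints how many bytes a few sample characters occupy in UTF-8. As UTF-8 is currently defined (RFC 3629), a code point takes between one and four bytes, and the ASCII range is unchanged. The helper name `utf8_len` and the sample characters are assumptions made for the example.

```python
# Byte length of a few sample characters when encoded as UTF-8.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]  # A, e-acute, euro sign, musical G clef

def utf8_len(ch):
    """Number of bytes needed to encode a single character in UTF-8."""
    return len(ch.encode("utf-8"))

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
```

The ASCII character encodes to the same single byte it always had, which is the compatibility property mentioned above.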

So, reactions on that would be appreciated. To make it easy, may I also
suggest that we use GLib to handle all of the unicode mess for us. This
does introduce a dependency, but libglib-2.0.so is only 400K and it's
likely to be in memory anyway on most systems. We don't need to expose
any GLib-style functions, they can all be wrapped with their scheme
equivalents.

Since the underlying representation can still be stored as char*, it
might be possible to make a (ice-9 unicode) library that would override
all the original bindings for character and string functions. We can
still require that the reader accept the low half of ASCII for code, so
that can stay the same. It's only dealing with strings that would be an
issue (some reader modifications required there). Then to display a
string would be a simple matter of g_locale_from_utf8 ().

My native language is English, so I don't have to deal with this problem
too much. But GNU is not just for European languages, so we should do
our best to spread the love around. Also, from working on guile-gtk, we
really need to have a comprehensive framework for internationalization,
and it sucks when C is ahead of us in this department.

Thoughts?

Andy



_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel



* Re: Unicode and Guile
  2003-10-21 17:15 Unicode and Guile Andy Wingo
@ 2003-10-25 17:08 ` Stephen Compall
  2003-10-26  0:03   ` Tom Lord
  2003-10-31 13:16   ` Andy Wingo
  2003-11-02 21:23 ` Kevin Ryde
  2003-11-26 20:35 ` Mikael Djurfeldt
  2 siblings, 2 replies; 22+ messages in thread
From: Stephen Compall @ 2003-10-25 17:08 UTC (permalink / raw)
  Cc: guile-devel

Andy Wingo <wingo@pobox.com> writes:

> If there is no plan, may I suggest that we move our internal
> representation of strings to UTF-8. There's an interesting
> introductory article written on www.joelonsoftware.com, although I
> don't have the link ATM. This has the advantage that ASCII
> characters up to 127 are represented the same.

I think this may be a disadvantage.  As you say, UTF-8 strings are
still not ASCII-compatible, but since casting their data blocks to
char* still works for ASCII strings, people might be tempted to simply
do that, because other languages "don't matter enough to bother with
it".

> Of course, above that characters might take up to four bytes, which
> means that all code that processes user-input strings has to be
> changed. Painful, eh?  But if we hope to write apps that deal with
> all languages of the world, that's the only way.
> 
> So, reactions on that would be appreciated.

As a result, UCS-4 strings have the advantage of breaking code that
tries to merely interpret the data block as char*.  UCS-4 is what
wchar_t is in glibc.  I'd debate the virtues of treating all code
points equally, versus their status in UTF-8, but I'm sure that's
better done (and has been done) in another forum.  UCS-2 shouldn't
even be considered an option, and UTF-16 seems to offer the worst of
both worlds.

As for the semantics, I submit the way Emacs does it: node (elisp)Text
Representations, or
http://www.gnu.org/manual/elisp-manual-21-2.8/html_node/elisp_542.html

--
Stephen Compall or s11 or sirian

I think your opinions are reasonable, except for the one about my mental
instability.
		-- Psychology Professor, Fairfield University

Etacs Becker quarter Albright csim Delta Force defense information
warfare Perl-RSA CDC condor undercover SAFE analyzer ASPIC USCODE



* Re: Unicode and Guile
  2003-10-25 17:08 ` Stephen Compall
@ 2003-10-26  0:03   ` Tom Lord
  2003-10-26 12:34     ` Which Encoding? (was Re: Unicode and Guile) Stephen Compall
  2003-10-31 13:25     ` Unicode and Guile Andy Wingo
  2003-10-31 13:16   ` Andy Wingo
  1 sibling, 2 replies; 22+ messages in thread
From: Tom Lord @ 2003-10-26  0:03 UTC (permalink / raw)
  Cc: guile-devel



    > From: Stephen Compall <s11@member.fsf.org>


    > UTF-16 seems to offer the worst of
    > both worlds.
    [being both wide compared to 8-bit characters and involving
     variable length unicode character encodings, I presume.]


It's culturally discriminatory to regard utf-16 as worse than utf-8
in those regards.

Or, put differently, for many potential users, utf-16 is the best of
both worlds: it optimizes the size of the most common characters (for
some users), and it can also handle any Unicode character.



    > As for the semantics, I submit the way Emacs does it: node (elisp)Text
    > Representations, or
    > http://www.gnu.org/manual/elisp-manual-21-2.8/html_node/elisp_542.html

What do the index arguments to STRING-REF and STRING-SET refer to?
Byte positions or character positions?

Personally, I think they refer to byte positions and that new errors
can result from them (if the index isn't at a character boundary).
(Too bad that (1+ index) no longer means (next-character string
index).)  There's a need for a new type, `text', which acts like the
text contents of an emacs buffer and has (yes I agree) pretty much the
Emacs interface.  It should all be designed so that, internally,
people can write new ways to represent text objects and multiple text
object representations can coexist in the same application (just like
emacs).  There's no good reason not to throw in attributes, overlays,
and markers for text objects too (just like emacs).  ("There's nothing
new under the sun.")   And, eventually, people should mostly stop
using the STRING? type altogether except internally to implementations
of TEXT? and as a way to represent non-textual strings of bytes.

"Everything is UTF-32" isn't going to be practical for a long time and
then, after it is, the first roughly humanoid space-aliens to show up
with news of a life-filled galaxy will mean UTF-32 won't be practical
all over again :-)

-t



* Which Encoding? (was Re: Unicode and Guile)
  2003-10-26  0:03   ` Tom Lord
@ 2003-10-26 12:34     ` Stephen Compall
  2003-10-31 13:25     ` Unicode and Guile Andy Wingo
  1 sibling, 0 replies; 22+ messages in thread
From: Stephen Compall @ 2003-10-26 12:34 UTC (permalink / raw)
  Cc: guile-devel

Tom Lord <lord@emf.net> writes:

> It's culturally discriminatory to regard utf-16 as worse than utf-8
> in those regards.
> 
> Or, put differently, for many potential users, utf-16 is the best of
> both worlds: it optimizes the size of the most common characters
> (for some users), and it can also handle any Unicode character.

That's the thing -- it can't, at least not thinking in fixed-width
terms, which was my goal in suggesting UCS-4.  It may be able to
handle all *current* Unicode characters, but what about those in the
future?  Unicode supports code points higher than 16-bit.

I say it's the worst of both worlds (from the C API user's point of
view), because you have to deal with breaking ASCII compatibility for
7-bit code points, *and* still need surrogate characters
(i.e. variable width), for code points above 65535 (the difference
between UTF-16 and UCS-2).

UTF-16 suffers the same problem as UTF-8: programmers may be tempted
to simply treat the data block as fixed-width 16-bit strings (8-bit
for UTF-8, of course), which of course will break on the surrogate
characters.

If you want to assume that Unicode will never grow out of the 16-bit
set, then UCS-2 would be a much better choice than UTF-16, IMHO.  That
way, it is clear that C programs only need deal with fixed-width,
16-bit characters.
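
The variable-width point can be checked directly. The following is a small Python sketch (illustration only; the helper name `utf16_units` is made up for this example) that lists the 16-bit code units UTF-16 uses for a code point inside the 16-bit range and for one above 65535.

```python
def utf16_units(ch):
    """Return the 16-bit code units that encode ch in big-endian UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# EURO SIGN fits in one 16-bit unit.
print([hex(u) for u in utf16_units("\u20ac")])
# MUSICAL SYMBOL G CLEF (U+1D11E) needs a surrogate pair, i.e. two units.
print([hex(u) for u in utf16_units("\U0001d11e")])
```

Code that treats the data block as an array of fixed-width 16-bit characters would see the second example as two separate "characters", which is the failure mode described above.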

--
Stephen Compall or s11 or sirian

Since a politician never believes what he says, he is surprised
when others believe him.
		-- Charles DeGaulle

Ft. Meade Lexis-Nexis smuggle virus BROMURE JSOFC3IP emc plutonium
electronic surveillance quarter number key offensive information
warfare fraud Albania Khaddafi



* Re: Unicode and Guile
  2003-10-25 17:08 ` Stephen Compall
  2003-10-26  0:03   ` Tom Lord
@ 2003-10-31 13:16   ` Andy Wingo
  1 sibling, 0 replies; 22+ messages in thread
From: Andy Wingo @ 2003-10-31 13:16 UTC (permalink / raw)


Sorry it's taken me a long time to reply.

On Sat, 25 Oct 2003, Stephen Compall wrote:

> Andy Wingo <wingo@pobox.com> writes:
> 
> > If there is no plan, may I suggest that we move our internal
> > representation of strings to UTF-8. There's an interesting
> > introductory article written on www.joelonsoftware.com, although I
> > don't have the link ATM. This has the advantage that ASCII
> > characters up to 127 are represented the same.
> 
> I think this may be a disadvantage.  As you say, UTF-8 strings are
> still not ASCII-compatible, but since casting their data blocks to
> char* still works for ASCII strings, people might be tempted to simply
> do that, because other languages "don't matter enough to bother with
> it".

It is, however, a feasible conversion strategy. It is the approach taken
by Gtk+ and friends when they switched to Unicode. Apps don't break
(crash) in the switch, it's only that processing a multibyte string will
lead to strange things. Users will then file bugs / complain to the
author, and then things get fixed. It's a soft switch.

That said, I don't have a religious opinion on the matter.

Regards,

wingo.



* Re: Unicode and Guile
  2003-10-26  0:03   ` Tom Lord
  2003-10-26 12:34     ` Which Encoding? (was Re: Unicode and Guile) Stephen Compall
@ 2003-10-31 13:25     ` Andy Wingo
  2003-11-03 13:35       ` text buffers (was Re: Unicode and Guile) Stephen Compall
  2003-11-03 20:31       ` Unicode and Guile Tom Lord
  1 sibling, 2 replies; 22+ messages in thread
From: Andy Wingo @ 2003-10-31 13:25 UTC (permalink / raw)


Hi Tom,

On Sat, 25 Oct 2003, Tom Lord wrote:

> What do the index arguments to STRING-REF and STRING-SET refer to?
> Byte positions or character positions?

From (r5rs)Strings,

  The _length_ of a string is the number of characters that it contains.
  This number is an exact, non-negative integer that is fixed when the
  string is created.  The "valid indexes" of a string are the exact
  non-negative integers less than the length of the string.  The first
  character of a string has index 0, the second has index 1, and so on.

Clearly, the intention is not to specify the underlying representation.
It would not be Correct to allow string-ref to "leak out" details about
the underlying representation, by referencing partial characters for
instance.
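
The difference between the two kinds of index can be sketched in Python (used here only as an illustration; it is not a statement about how Guile does or should behave). Indexing the string by character always yields a whole character, while indexing the UTF-8 bytes can land in the middle of one. The helper `is_valid_utf8` is a name invented for this example.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

s = "na\u00efve"           # "naive" with a diaeresis on the i: five code points
raw = s.encode("utf-8")    # six bytes, because the accented i takes two

print(len(s), len(raw))            # character count vs byte count
print(s[3])                        # character index 3 is 'v'
print(is_valid_utf8(raw[3:4]))     # byte index 3 is the second byte of the accented i
```

A string-ref defined on byte positions would expose that stray continuation byte, which is the "leak" of representation details discussed above.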

> There's a need for a new type, `text', which acts like the text
> contents of an emacs buffer and has (yes I agree) pretty much the
> Emacs interface. It should all be designed so that, internally, people
> can write new ways to represent text objects and multiple text object
> representations can coexist in the same application (just like emacs).
> There's no good reason not to throw in attributes, overlays, and
> markers for text objects too (just like emacs).

Maybe. This issue is, in my opinion, orthogonal to simple strings.

Users of guile-gtk (the -gobject 2.0 branch) will just use GtkTextBuffer
(and its associated view, GtkTextView). Those that pine after emacs
won't be satisfied until you can read mail in your text buffer ;)

Regards,

wingo.



* Re: Unicode and Guile
  2003-10-21 17:15 Unicode and Guile Andy Wingo
  2003-10-25 17:08 ` Stephen Compall
@ 2003-11-02 21:23 ` Kevin Ryde
  2003-11-26 20:35 ` Mikael Djurfeldt
  2 siblings, 0 replies; 22+ messages in thread
From: Kevin Ryde @ 2003-11-02 21:23 UTC (permalink / raw)


Andy Wingo <wingo@pobox.com> writes:
>
> This has the advantage that ASCII characters up to 127 are
> represented the same. Of course, above that characters might take up to
> four bytes, which means that all code that processes user-input strings
> has to be changed.

There'll be a compatibility question for SCM_STRING_CHARS I think.
Application C code may be using that expecting to see chars as 8 bits,
not some encoded form.

Not sure quite how bad this will be though.  If people use strings to
hold raw 8-bit data from a socket or something then it's no doubt
pretty important to make sure they pass straight through, somehow.

There might be another question for SCM_CHAR.  It's got 24 bits (if
I'm not mistaken), if that's not enough then changing the relevant
macros will break binary compatibility.  (Not a terribly big deal, but
a bit annoying.)
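
As a quick arithmetic check (a Python sketch for illustration only): the largest Unicode code point is U+10FFFF, which needs 21 bits. The 24-bit figure below is simply the tentative number from the paragraph above, not something verified against the Guile sources.

```python
MAX_CODE_POINT = 0x10FFFF          # highest code point defined by Unicode
ASSUMED_SCM_CHAR_BITS = 24         # assumption taken from the message above

bits_needed = MAX_CODE_POINT.bit_length()
print(bits_needed)                           # bits required for any code point
print(bits_needed <= ASSUMED_SCM_CHAR_BITS)  # would it fit?
```

If the 24-bit figure is right, every code point would fit without changing the width of the character representation.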



* text buffers (was Re: Unicode and Guile)
  2003-10-31 13:25     ` Unicode and Guile Andy Wingo
@ 2003-11-03 13:35       ` Stephen Compall
  2003-11-03 20:34         ` Tom Lord
  2003-11-03 20:31       ` Unicode and Guile Tom Lord
  1 sibling, 1 reply; 22+ messages in thread
From: Stephen Compall @ 2003-11-03 13:35 UTC (permalink / raw)


> There's a need for a new type, `text', which acts like the text
> contents of an emacs buffer and has (yes I agree) pretty much the
> Emacs interface. It should all be designed so that, internally,
> people can write new ways to represent text objects and multiple
> text object representations can coexist in the same application
> (just like emacs).  There's no good reason not to throw in
> attributes, overlays, and markers for text objects too (just like
> emacs).

I am working on sort-of transcribing the code in emacs/src/buffer.[hc]
into a "buffer" data type in a Guile module.  I need this because I am
terribly dependent on buffers for almost any kind of data processing
:)

The advantages of transcribing Emacs source are fewer bugs and
carrying over the lovely optimizations, like the gap, that make
buffers work.

Right now, markers are part of the `impl_buffer' C data type.  The
intention of `impl_buffer' is to push most of the "good" interface out
to a goops class.  However, some of the details, like overlays, may be
better to leave as object properties or new behavior in subclasses,
rather than explicitly in impl_buffer as Emacs does it.  And then just
specify that even when (point buf) => 100, and (- (point-max buf)
(point-min buf)) => something over 100, (forward-char buf) won't
necessarily make (point buf) => 101 :)

The other interesting extension is making markers ports.  Generics
also provide a solution to the whole "current buffer" annoyance.

On a tangent, would it be useful to generalize the gap concept to be
available to any collection, not just ordered collections of
characters?
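
For readers unfamiliar with the gap idea, here is a minimal sketch of a gap buffer over arbitrary items, written in Python purely for illustration. It is not a transcription of the Emacs code or of the `impl_buffer' type mentioned above; the class name `GapBuffer` and its methods are hypothetical. Inserting or deleting at the gap is cheap, and moving the gap costs time proportional to the distance moved.

```python
class GapBuffer:
    """A sequence with a movable gap, so edits near one position are cheap."""

    def __init__(self, items=(), gap_size=8):
        items = list(items)
        self._buf = items + [None] * gap_size
        self._gap_start = len(items)       # first free slot
        self._gap_end = len(self._buf)     # one past the last free slot

    def __len__(self):
        return len(self._buf) - (self._gap_end - self._gap_start)

    def _move_gap(self, pos):
        # Shift items across the gap, one at a time, until it starts at pos.
        while self._gap_start > pos:
            self._gap_start -= 1
            self._gap_end -= 1
            self._buf[self._gap_end] = self._buf[self._gap_start]
        while self._gap_start < pos:
            self._buf[self._gap_start] = self._buf[self._gap_end]
            self._gap_start += 1
            self._gap_end += 1

    def insert(self, pos, item):
        if not 0 <= pos <= len(self):
            raise IndexError(pos)
        if self._gap_start == self._gap_end:
            # Gap exhausted: open a fresh one where the old gap was.
            grow = max(8, len(self._buf))
            self._buf[self._gap_start:self._gap_start] = [None] * grow
            self._gap_end += grow
        self._move_gap(pos)
        self._buf[self._gap_start] = item
        self._gap_start += 1

    def delete(self, pos):
        if not 0 <= pos < len(self):
            raise IndexError(pos)
        self._move_gap(pos)
        self._buf[self._gap_end] = None    # drop the item just after the gap
        self._gap_end += 1

    def to_list(self):
        return self._buf[:self._gap_start] + self._buf[self._gap_end:]


gb = GapBuffer("hello", gap_size=2)
gb.insert(0, "X")
gb.insert(6, "!")
gb.insert(3, "-")
gb.delete(1)
print("".join(gb.to_list()))

numbers = GapBuffer([1, 2, 3])
numbers.insert(1, 99)
print(numbers.to_list())
```

Nothing in the class depends on the items being characters, which suggests the generalization asked about is at least mechanically straightforward.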

--
Stephen Compall or s11 or sirian

Intellect annuls Fate.
So far as a man thinks, he is free.
		-- Ralph Waldo Emerson

afsatcom INSCOM JPL broadside Defcon Waco, Texas BCCI kibo Ermes Reno
Crypto AG Telex jihad Panama 22nd SAS



* Re: Unicode and Guile
  2003-10-31 13:25     ` Unicode and Guile Andy Wingo
  2003-11-03 13:35       ` text buffers (was Re: Unicode and Guile) Stephen Compall
@ 2003-11-03 20:31       ` Tom Lord
  2003-11-06 18:16         ` Andy Wingo
  1 sibling, 1 reply; 22+ messages in thread
From: Tom Lord @ 2003-11-03 20:31 UTC (permalink / raw)
  Cc: guile-devel




    > From: Andy Wingo <wingo@pobox.com>

    > On Sat, 25 Oct 2003, Tom Lord wrote:

    > > What do the index arguments to STRING-REF and STRING-SET refer to?
    > > Byte positions or character positions?

    > From (r5rs)Strings,

    >   The _length_ of a string is the number of characters that it contains.
    >   This number is an exact, non-negative integer that is fixed when the
    >   string is created.  The "valid indexes" of a string are the exact
    >   non-negative integers less than the length of the string.  The first
    >   character of a string has index 0, the second has index 1, and so on.

    > Clearly, the intention is not to specify the underlying representation.
    > It would not be Correct to allow string-ref to "leak out" details about
    > the underlying representation, by referencing partial characters for
    > instance.

Part of the problem is that Unicode specifications are very careful to
_not_ define "character" (except ambiguously).

In different contexts related to my question, it might mean a unicode
code point, a code value, or something more complicated such as a
grapheme (which may be represented as a string of unicode code
points).

It's a nasty problem to try to unify unicode types with scheme types.

Suppose:

* CHAR? is a code value in some encoding (say, UTF-8 or UTF-16)

  In other words, CHAR? is an 8 or 16 bit integer that happens to 
  coincide with ASCII values in some ways.

  A string is then a homogeneous array of such values -- and that's
  simple enough.

  But now CHAR? can't represent all unicode code points.

  A variation on this says that CHAR? is a (subset of?) 21 bit values
  and strings (semantically) a homogeneous array of those but now
  either STRING-REF and friends change in their "expected" complexity 
  or the string representation has to become quite complex.


* CHAR? is a unicode code point -- a 21 bit value.

  This approach has the same problems with string efficiency or 
  complexity -- but it has the advantage that algorithms defined 
  in terms of unicode code points (e.g. collation) translate very
  directly into Guile Scheme.


* CHAR? is a "grapheme" -- the user's idea of a character.

  Ray Dillinger is currently exploring this (see recent c.l.s.)

  It too requires a very complicated STRING? representation and,
  worse, an infinitely large set of characters.   On the other hand,
  of the three possibilities, it goes farthest in hiding the details
  of representation from users.
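
The gap between these candidate meanings of CHAR? shows up even on a tiny input. The following Python sketch (illustration only; the variable names are made up) counts the same three-code-point text as 8-bit code values, 16-bit code values, and code points. A reader would see two characters: an accented e and a clef.

```python
# 'e' + COMBINING ACUTE ACCENT, followed by MUSICAL SYMBOL G CLEF (U+1D11E)
text = "e\u0301\U0001d11e"

utf8_units = len(text.encode("utf-8"))             # CHAR? as an 8-bit code value
utf16_units = len(text.encode("utf-16-be")) // 2   # CHAR? as a 16-bit code value
code_points = len(text)                            # CHAR? as a code point

print(utf8_units, utf16_units, code_points)
```

None of the three counts matches the two characters a user perceives, which is what the grapheme option is trying to address.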


    >> There's a need for a new type, `text', which acts like the text
    >> contents of an emacs buffer and has (yes I agree) pretty much the
    >> Emacs interface. It should all be designed so that, internally, people
    >> can write new ways to represent text objects and multiple text object
    >> representations can coexist in the same application (just like emacs).
    >> There's no good reason not to throw in attributes, overlays, and
    >> markers for text objects too (just like emacs).

    > Maybe. This issue is, in my opinion, orthogonal to simple strings.

But perhaps it's worth mentioning in this context because it suggests a
very straightforward approach for Guile:

CHAR? is 8 bits.  STRING? is a sequence of 8-bit chars.  And
everything unicode is orthogonal to that.   While there may be support
for manipulating unicode strings represented as STRING? and unicode
characters represented as CHAR?, fundamentally, CHAR? and STRING? are
kept butt-simple and the unicode support is something new.

A nice side effect of that simple-minded approach is that it works
well with foreign functions written to handle UTF-8.


    > Users of guile-gtk (the -gobject 2.0 branch) will just use
    > GtkTextBuffer (and its associated view, GtkTextView). Those that
    > pine after emacs won't be satisfied until you can read mail in
    > your text buffer ;)

There's a point of view that says "look, traditional strings have a
very simple and clear operational model that is fundamentally
different from what a `unicode string' is.   It would be a shame to
take away support for that simple, traditional string type as a
precondition for making unicode text processing simpler."

-t



* Re: text buffers (was Re: Unicode and Guile)
  2003-11-03 13:35       ` text buffers (was Re: Unicode and Guile) Stephen Compall
@ 2003-11-03 20:34         ` Tom Lord
  2003-11-04 10:04           ` Stephen Compall
  0 siblings, 1 reply; 22+ messages in thread
From: Tom Lord @ 2003-11-03 20:34 UTC (permalink / raw)
  Cc: guile-devel




    > From: Stephen Compall <s11@member.fsf.org>

    > On a tangent, would it be useful to generalize the gap concept to be
    > available to any collection, not just ordered collections of
    > characters?

Maybe.   I don't recommend that you pursue the following right away
but other things to consider are:

1) buffers that aren't entirely memory resident but that are paged in
   "on demand"  (for huge texts)

2) alternatives to gap-buffers, such as splay tree variations


-t




* Re: text buffers (was Re: Unicode and Guile)
  2003-11-03 20:34         ` Tom Lord
@ 2003-11-04 10:04           ` Stephen Compall
  0 siblings, 0 replies; 22+ messages in thread
From: Stephen Compall @ 2003-11-04 10:04 UTC (permalink / raw)
  Cc: guile-devel

Tom Lord <lord@emf.net> writes:

> 1) buffers that aren't entirely memory resident but that are paged in
>    "on demand"  (for huge texts)

I did in fact think about these -- wrapping the buffer interface
around mmap, and abusing the gap seriously, maybe even making it a
separate buffer.  I also thought about "read on demand", particularly
for network access, in which data won't be downloaded until its buffer
position or after is accessed.  But if the class system won't allow
just anyone to do this cleanly, without cluttering the base
implementation (besides adding new hooks), then I won't look at it
again.  Besides, as you said, it's not good to pursue such premature
optimization :)

--
Stephen Compall or s11 or sirian

If the meanings of "true" and "false" were switched, then this sentence
would not be false.

anarchy encryption Defcon Bellcore LABLINK UNSCOM Janet Reno Ron Brown
MP5K-SD colonel tempest Aldergrove fraud smuggle NORAD



* Re: Unicode and Guile
  2003-11-03 20:31       ` Unicode and Guile Tom Lord
@ 2003-11-06 18:16         ` Andy Wingo
  2003-11-11 19:02           ` Tom Lord
  2003-11-12  0:06           ` Marius Vollmer
  0 siblings, 2 replies; 22+ messages in thread
From: Andy Wingo @ 2003-11-06 18:16 UTC (permalink / raw)


Well, I just wasted an afternoon reading the Unicode 4.0 spec. I'll
never have that time back again ;)

On Mon, 03 Nov 2003, Tom Lord wrote:

> Part of the problem is that Unicode specifications are very careful to
> _not_ define "character" (except ambiguously).
>
> In different contexts related to my question, it might mean a unicode
> code point, a code value, or something more complicated such as a
> grapheme (which may be represented as a string of unicode code
> points).

There is the term "abstract characters", which maps to code points. But
yes, I see that there are some problems with the concept. What do you
mean by "code value"?

We have the following passage in chapter two of the spec, talking
about one of the ten design principles, "Characters, Not Glyphs":

  The Unicode Standard draws a distinction between characters and glyphs.
  Characters are the abstract representations of the smallest components
  of written language that have semantic value. They represent primarily,
  but not exclusively, the letters, punctuation, and other signs that
  constitute natural language text and technical notation. Characters are
  represented by code points that reside only in a memory representation,
  as strings in memory, or on disk. The Unicode Standard deals only with
  character codes.

Character codes, or code points encoded via UTF-8, UTF-16, or UTF-32,
are characters, according to this passage.

Granted, end users sometimes expect certain character combinations to be
treated as characters, such as `ch' in traditional Spanish, or
lower-case a plus grave accent composite in French. That level of
processing is much higher than that of simple strings, and more related
(it seems to me) to the rendering of glyphs on the screen.

I've actually had a change of thought about encodings; while UTF-8 or
UTF-16 are good for disk and network transfer, the extensive
character-based API of Scheme (read-char from ports, for instance) lends
itself better to uniform representation for individual characters. So
the natural native format for strings in memory might well be UTF-32.
But that's another issue...
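
The "partial character" concern can be illustrated with a small Python sketch (for illustration only; `read_chars` is a hypothetical helper, not a Guile or Scheme API). Bytes arrive in chunks that split a multibyte UTF-8 sequence, and an incremental decoder holds back the incomplete bytes so the consumer only ever sees whole characters.

```python
import codecs

def read_chars(chunks, encoding="utf-8"):
    """Yield whole characters from byte chunks, buffering incomplete sequences."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        for ch in decoder.decode(chunk):
            yield ch
    # Flush; raises UnicodeDecodeError if the input ended mid-character.
    for ch in decoder.decode(b"", final=True):
        yield ch

data = "h\u00e9!".encode("utf-8")      # 'h', e-acute (two bytes), '!'
chunks = [data[:2], data[2:]]          # split in the middle of the e-acute
print(list(read_chars(chunks)))
```

This shows one way a port could keep a variable-width external encoding while still handing out uniform characters, which is roughly the split between storage format and in-memory format suggested above.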

> It's a nasty problem to try to unify unicode types with scheme types.

Indeed :-/

> Suppose:
> 
> * CHAR? is a code value in some encoding (say, UTF-8 or UTF-16)

I think I'm leaning towards 4-byte values here so that you can never
read-char a partial character.

> * CHAR? is a unicode code point -- a 21 bit value.
> 
>   This approach has the same problems with string efficiency or 
>   complexity

Complexity isn't so much of an issue. Efficiency is, however;
applications with large amounts of string data might want to choose a
different encoding for their storage.

> * CHAR? is a "grapheme" -- the user's idea of a character.
> 
>   Ray Dillinger is currently exploring this (see recent c.l.s.)

Will check this out. It does sound painful, though :/

>     >> There's a need for a new type, `text', which acts like the text
>     >> contents of an emacs buffer
> 
>     > Maybe. This issue is, in my opinion, orthogonal to simple strings.
> 
> But perhaps it's worth mentioning in this context because it suggests a
> very straightforward approach for Guile:
> 
> CHAR? is 8 bits.  STRING? is a sequence of 8-bit chars.  And
> everything unicode is orthogonal to that.   While there may be support
> for manipulating unicode strings represented as STRING? and unicode
> characters represented as CHAR?, fundamentally, CHAR? and STRING? are
> kept butt-simple and the unicode support is something new.

Hm. Let's consider some use cases.

Let's say an app wants to ask the user her name; she might want to write
her name in her native Arabic. Or perhaps her address, or anything
"local". If the app then wants to communicate this information to her
Chinese friend (who also knows Arabic), the need for Unicode is
fundamental. We can probably agree there.

The question becomes, is the user's name logically a simple string (can
we read it in with standard procedures), or must we use this
text-buffer, complete with marks, multiple backends, et al? It seems
more natural, to me, for this to be provided via simple strings,
although I could be wrong here.

I was looking at what Python did
(http://www.python.org/peps/pep-100.html), and they did make a
distinction. They have a separate unicode string representation which,
like strings, is a subclass of SequenceObject. So they maintained
separate representation while being relatively transparent to the
programmer. Pretty slick, it seems.

C#, Java, and ECMAScript (JavaScript) all apparently use UTF-16 as their
native format, although I've never coded in the first two. Networking
was probably the most important consideration there.

Perhaps the way forward would be to leave what we've been calling
"simple strings" alone, and somehow (perhaps with GOOPS, but haven't
thought too much about this) pull the Python trick of having a unicode
string that can be used everywhere simple strings can. Thoughts on that
idea?

Regards,

wingo.



* Re: Unicode and Guile
  2003-11-06 18:16         ` Andy Wingo
@ 2003-11-11 19:02           ` Tom Lord
  2003-11-12  0:29             ` Marius Vollmer
  2003-11-17 16:17             ` Andy Wingo
  2003-11-12  0:06           ` Marius Vollmer
  1 sibling, 2 replies; 22+ messages in thread
From: Tom Lord @ 2003-11-11 19:02 UTC (permalink / raw)
  Cc: guile-devel



    > From: Andy Wingo <wingo@pobox.com>

    [...long thing...]

Thanks for the pointer to the Python type (on which I won't comment
:-).   Thanks for the excuse to think about this more.

At the end of this proposal, I've addressed your "use case".

-t



               Towards Standard Scheme Unicode Support


* The Problems

  There are two major obstacles to providing nice,
  non-culturally-biased Unicode support in standard Scheme.  First,
  the required standard character and string procedures are
  fundamentally inconsistent with the structure of unicode.  Second,
  attempts to ignore that fact and "force fit" unicode into them
  anyway inevitably result in a set of text-manipulation primitives
  that are too low level -- that require even very simple text
  manipulation programs to be far more "aware" of the details of
  unicode encodings and structure than they ought to be.

** CHAR? Makes No Sense In Unicode

  Consider the unicode character U+00DF "LATIN SMALL LETTER SHARP S"
  (aka Eszett).

  Clearly it should behave this way:

	(char-alphabetic? eszett) => #t
	(char-lower-case? eszett) => #t

  and it is required that:

	(char-ci=? eszett (char-upcase eszett)) => #t
	(char-upper-case? (char-upcase eszett)) => #t

  but now what exactly does:

	(char-upcase eszett)

  return?  The upper case mapping of eszett is a two character
  sequence, "SS".  It's not even a Unicode base character plus
  combining characters -- it's two base characters, a string.

  Eszett is not an isolated anomaly (though, admittedly, it is not the
  common case).  Here is a pointer to the data file of similarly
  problematic case mappings:

	http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

  So, something has to give, somewhere :-)  

  [Case mappings are a particularly clear example but I suspect
  that there are other "character manipulation" operators that
  make sense in Unicode but, similarly, don't map onto a 
  standard CHAR? type.]
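
  The eszett behaviour is easy to observe in a language whose case
  mapping works on strings rather than characters.  A minimal Python
  sketch follows (illustration only; Python's str.upper applies the
  Unicode full case mappings, which is an analogy for, not a definition
  of, what CHAR-UPCASE would have to do):

```python
eszett = "\u00df"                     # LATIN SMALL LETTER SHARP S

print(eszett.isalpha(), eszett.islower())
upper = eszett.upper()
print(upper, len(upper))              # the upper case form is a two-character string

# Case-insensitive comparison still works, but only at the string level.
print(eszett.casefold() == upper.casefold())
```

  The result of upcasing has length two, so no single CHAR? value can
  stand for it.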



** Other Approaches are Too Low Level

  Consider the example of attempting to write a procedure,
  in portable scheme, which performs "studly capitalization".
  It should accept a string like:

	a studly capitalizer

  and return a string like:
  
	a StUDly CaPItalIZer

  In the simple world of the scheme CHAR and STRING types, such a
  procedure is quite simple to write _and_get_completely_correct_.
  It would make a good exercise for a new programming student.

  Let's assume that the student solves the problem in a reasonable
  way:  by iterating over the string and, at random positions, 
  replacing a character with its upper case equivalent.  Simple
  enough.

  Unfortunately, there does not (can not) exist a mapping of
  Unicode onto the standard character and string types that would not
  break our student's program.  His program can still _often_ give
  a correct result, but to produce a completely correct program, 
  he will have to take a far different and, as things stand, more
  complicated approach.
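
  For concreteness, the student's program might look like the sketch
  below.  `studly-capitalize!` is a hypothetical name, and "every
  other position" replaces the random choice only so that the result
  is reproducible:

```scheme
;; The student's approach: walk the string by index and upcase some
;; positions in place.  Every other position is chosen here instead
;; of a random one, to keep the sketch deterministic.
(define (studly-capitalize! str)
  (do ((i 0 (+ i 1)))
      ((>= i (string-length str)) str)
    (if (even? i)
        (string-set! str i (char-upcase (string-ref str i))))))

;; STRING-COPY gives a mutable string; literals may be read-only.
(studly-capitalize! (string-copy "abcd"))   ; "AbCd"
```

  STRING-SET! stores exactly one character per position, so a case
  mapping that yields two characters has nowhere to go in this
  program, whatever CHAR-UPCASE is made to return.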


** One Approach Comes Close

  Ray Dillinger has recently proposed on comp.lang.scheme a 
  treatment of Unicode in which a CHAR? value may be:

	~ a unicode base character
        ~ a unicode base character plus a sequence of 1 or 
          more unicode combining characters

  That goes a very long way towards solving the problem.  For example, 
  if I had asked our student to write an anagram generator instead of
  a studly capitalizer, Ray's solution would preserve the correctness
  of the student's program.

  Unfortunately, Ray's approach still has problems.  It can not handle
  case mappings correctly, as noted above.  In Ray's system, there are
  an infinite number of non-EQV? CHAR? values and therefore
  CHAR->INTEGER may return a bignum (in Indic, Tibetan, and the Hangul
  Jamo alphabets, it would apparently return a bignum frequently).
  With an infinite set of characters, libraries (such as SRFI-14
  "Character Sets"), which are designed with a finite character set in
  mind, can not be ported.  The issue of multi-character case mappings
  aside, it is difficult to see how to preserve the required ordering
  isomorphism between characters and their integer representations.

  Nevertheless, Ray's idea that a "conceptual character" is part of 
  an infinite set of values and a "conceptual string" a sequence of
  those is the basis of this proposal.


* The Proposal

  The proposal has two parts.   Part 1 introduces a new type, TEXT?, 
  which is a string-like type that is compatible with Unicode, and
  a subtype of TEXT?, GRAPHEME?, to represent "conceptual
  characters". 

  Part 2 discusses what can become of the STRING? and CHAR? types in
  this context.


** The TEXT? and GRAPHEME? Types

  [This is a sketch of a specification -- not yet even a first
   draft of a specification.]

    ~ (text? obj) => <boolean>

	True if OBJ is a text object, false otherwise.

	A text object represents a string of printed graphemes.

    ~ (utf8->text string) => <text>
    ~ (utf16->text string) => <text>
    ~ (utf16be->text string) => <text>
    ~ (utf16le->text string) => <text>
    [...]
    ~ (text->utf8 text) => <string> 
    [...]
	The usual conversions from strings (presumed to be
        sequences of octets) to text.

  A subset of text objects are distinguished as graphemes:

    ~ (grapheme? obj) => <boolean>

      True if OBJ is a text object which is a grapheme,
      false otherwise.

      The set of graphemes is defined to be isomorphic to the set of
      all unicode base characters and well formed unicode combining
      character sequences (and is thus an infinite set).
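
      A rough sketch of that grouping, assuming a Guile (2.0 or
      later) whose strings are sequences of code points.  The names
      `combining?` and `string->graphemes` are hypothetical, and
      only the range U+0300..U+036F is treated as combining, which
      is far from the complete property data:

```scheme
;; Assumption: only U+0300..U+036F (Combining Diacritical Marks) count
;; as combining characters; the real Unicode data covers far more.
(define (combining? c)
  (let ((n (char->integer c)))
    (and (>= n #x300) (<= n #x36F))))

;; Split STR into a list of strings, each holding a base character
;; followed by any combining characters attached to it.
(define (string->graphemes str)
  (let loop ((chars (string->list str)) (current '()) (done '()))
    (cond ((null? chars)
           (reverse (if (null? current)
                        done
                        (cons (list->string (reverse current)) done))))
          ((or (null? current) (combining? (car chars)))
           (loop (cdr chars) (cons (car chars) current) done))
          (else
           (loop (cdr chars)
                 (list (car chars))
                 (cons (list->string (reverse current)) done))))))

;; "e" + COMBINING ACUTE ACCENT + "x": three code points, two graphemes.
(define sample (string #\e (integer->char #x301) #\x))
(length (string->graphemes sample))   ; 2
```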

    ~ (grapheme=? g1 g2 [locale]) => <boolean>
    ~ (grapheme<? g1 g2 [locale])
    ~ (grapheme>? g1 g2 [locale])
    [...]
    ~ (grapheme-ci=? g1 g2 [locale])
    ~ (grapheme-ci<? g1 g2 [locale])
    ~ (grapheme-ci>? g1 g2 [locale])

      The usual orderings.

      Here and elsewhere I've left the optional parameter LOCALE there
      as a kind of place-holder.  There are many possible collation
      orders for text and programs need a way to distinguish which
      they mean (as well as have a reasonable default).


  It is important to note that, in general, EQV? and EQUAL?  do _not_
  test for grapheme equality.  GRAPHEME=? must be used instead.
             
  Also note that this proposal does not include GRAPHEME->INTEGER or
  INTEGER->GRAPHEME.   I have not included, but probably should
  include, a hash value procedure which hashes GRAPHEME=? values 
  equally.

    ~ (grapheme-upcase g) => <text>
    ~ (grapheme-downcase g) => <text>
    ~ (grapheme-titlecase g) => <text>

       Note that these return texts, not necessarily graphemes.
       For example, GRAPHEME-UPCASE of eszett would return a 
       text representation of "SS".

  All texts, including graphemes, behave like (conceptual) strings:

    ~ (text-length text) => <integer>

      Return the number of graphemes in TEXT.

    ~ (subtext text start end) => <text>

      Return a subtext of TEXT containing the graphemes beginning at
      index START (inclusive) and ending at END (exclusive).

    ~ (text=? t1 t2 [locale]) => <boolean>
    ~ (text<? t1 t2 [locale]) => <boolean>
    [...]
        The usual ordering predicates.

    ~ (text-append text ...) => <text>
    ~ (list->text list-of-graphemes) => <text>
      
         Various constructors for text ....


    However, instead of TEXT-SET!, we have:

  
    ~ (text-replace! text start end replacement)

      Replace the graphemes at [START, END) in TEXT with 
      the graphemes in text object REPLACEMENT.  Passing
      #t for END is equivalent to passing an index 1
      position beyond START.

      TEXT must be a mutable text object (see below).


  Implementations are permitted to make _some_ graphemes immutable.
  In particular:

    ~ (text-ref text index) => <grapheme>

      Return  the grapheme at position INDEX in TEXT.
      The grapheme returned may be immutable.


    ~ (text->list text) => <list of graphemes>

      The graphemes returned may be immutable.

    ~ (char->grapheme char) => <grapheme>
    ~ (utf8->grapheme string) => <grapheme>
    [....]

       Conversions to possibly immutable graphemes.


  And some simple I/O extensions:

    ~ (read-grapheme [port]) => <grapheme>
    ~ (peek-grapheme [port]) => <grapheme>
    [etc.]


  There is still an awkwardness, however.  Consider writing the "StUDly
  CaPItalIZer" procedure.  It's tempting to write it as a loop that
  uses an integer grapheme index to iterate over the text, randomly
  picking graphemes to change the case of.  That wouldn't work though:
  changing the case of one character can change the length of text,
  right at the point being indexed, and invalidate the indexes.  So,
  texts really need markers that work like those in Emacs:

    ~ (make-text-marker text index) => <marker>
    ~ (text-marker? obj) => <boolean>
    ~ (marker-text marker) => <text>
    ~ (marker-index marker) => <index>
    ~ (set-marker-index! marker index)
    ~ (set-marker! marker text index)
    etc.

	Changes (by TEXT-REPLACE!) to the region of a text object to
        the left of a marker leave the marker in the same position
        relative to the right end of the text, and vice versa.

        Changes to a region which _includes_ a marker leave the
        marker at the last grapheme index of the replacement
        text that was inserted, or, if the replacement was empty, 
        at its old index position minus the number of graphemes
        deleted to the marker's left.

        The procedures SUBTEXT, TEXT-REPLACE!, and TEXT-REF 
        and others that accept indexes can accept markers as those
        indexes.
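
  The marker rules above can be read as plain index arithmetic.  The
  sketch below is one interpretation, not part of the proposal:
  `adjust-marker` is a hypothetical name, and treating an insertion
  exactly at the marker as a change to its left is an assumption.

```scheme
;; New index for a marker at index M after the graphemes in
;; [START, END) are replaced by N graphemes.
(define (adjust-marker m start end n)
  (cond ((<= end m)                  ; change lies to the marker's left:
         (+ m (- n (- end start))))  ; keep distance from the right end
        ((< m start) m)              ; change lies to the marker's right
        ((zero? n) start)            ; marker's region deleted outright
        (else (+ start (- n 1)))))   ; last index of the inserted text

(adjust-marker 10 2 4 5)   ; 13, two graphemes became five on the left
(adjust-marker 1 2 4 5)    ; 1, the change is to the right
(adjust-marker 3 2 6 0)    ; 2, old index minus one deleted on the left
(adjust-marker 3 2 6 4)    ; 5, last index of the four inserted graphemes
```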

  Unlike markers, text properties and overlays aren't strictly needed to
  make TEXT? useful -- but they would make a good addition.   The issue
  is that mutating procedures (like TEXT-REPLACE!) should be aware of
  properties in order to update them properly.    If properties and
  overlays are left out, and people have to implement them in a higher
  layer, then their "attributed text" data type can't be passed to a
  procedure that just expects a text object.



* Optional Changes to CHAR? and STRING?

  The above specification of TEXT? and GRAPHEME? is useful on its
  own, but it might be considerably more convenient in implementations
  which also adopt the following ideas:

    ~ CHAR? is an octet, STRING? a sequence of octets

    ~ STRING? values are resizable

    ~ STRING? values contain an "encoding" attribute which may be
      any of 
		utf8
                utf16be
                utf16le
                utf32

      or an implementation-defined value.   Note however that
      procedures such as STRING-REF ignore this attribute and 
      view strings as sequences of octets.

      STRING-APPEND implicitly converts its second and subsequent
      arguments to the same encoding as its first.


    ~ (text? "a string") => #t

    ~ (grapheme? #\a) => #t

  In other words, all character values are graphemes, and all strings
  are text values.

  These ideas _could_ be taken even a step further with the addition
  of:

    ~ TEXT? values contain an "encoding" attribute, just as strings
      do (utf-8, etc.)

    ~ (string? a-text-value) => #t

    ~ (char? a-grapheme) => <boolean>

  All text values can be strings;  some graphemes can be characters.


* Summary

  The new TEXT? and GRAPHEME? types present a simple and traditional
  interface to "conceptual strings" and "conceptual characters".  
  They make it easy to express simple algorithms simply and without
  reference to the internal structure of Unicode.

  Reflecting the realities of global text processing, there is
  no bias in the interfaces suggesting that the set of graphemes
  is finite.

  Also reflecting the realities of global text processing: the length
  of a text object may change over time; a sequence replacement
  operator is supplied instead of an element replacement operator; 
  and markers (similar to those in text editors) are provided for 
  iteration and other examples of keeping track of "a position within
  a text value".

  There is no essential difference between a grapheme and a text
  object of length 1, and thus the proposal makes GRAPHEME? a 
  subtype of TEXT?.

  If STRING? is suitably extended, then it may be equal to or a subset
  of TEXT?.  Conversely, if TEXT? is suitably extended, it may be
  equal to or a subset of STRING?.  It may be sensible to unify the
  two types (although even analogous string procedures and text
  procedures will still behave differently from one another).

  CHAR? may be safely viewed as a subtype of GRAPHEME?, but the 
  converse is not, and can not, be true.




--------------------------------


    > Hm. Let's consider some use cases.

    > Let's say an app wants to ask the user her name, she might want to write
    > her name in her native Arabic. Or perhaps her address, or anything
    > "local". If the app then wants to communicate this information to her
    > Chinese friend (who also knows Arabic), the need for Unicode is
    > fundamental. We can probably agree there.

Absolutely.    What's more, if I'm sitting in California and write a
portable Scheme program that generates anagrams of a name, it'd be
awfully swell if (a) My code doesn't have to "know" anything special
about unicode internals;  (b) my code works when passed her name as input.


    > The question becomes, is the user's name logically a simple string (can
    > we read it in with standard procedures), or must we use this
    > text-buffer, complete with marks, multiple backends, et al? It seems
    > more natural, to me, for this to be provided via simple strings,
    > although I could be wrong here.

Scheme's requirements of the CHAR? and STRING? types simply don't map
onto unicode.   The case problem I illustrated above is one example
and I _suspect_ that there are others, even if you do something like
Ray's trying to do and make an infinitely large character set.

I _think_ the TEXT? and GRAPHEME? stuff above is about as natural as
"simple strings" -- it just doesn't try to give those types behavior
that makes no sense in Unicode.



    > I was looking at what Python did
    > (http://www.python.org/peps/pep-100.html), and they did make a
    > distinction. They have a separate unicode string representation which,
    > like strings, is a subclass of SequenceObject. So they maintained
    > separate representation while being relatively transparent to the
    > programmer. Pretty slick, it seems.

That URL is slightly wrong.  It's:

     http://www.python.org/peps/pep-0100.html

It sounds _ok_.   It's got some problems.   

The genericity of it (that these are still sequences) is
winning.... i'll discuss that below.

Mostly it's a little too low level.  They're only (initially?)
supporting the 1-1 case conversions.  They are exposing unicode code
points and just handing users property tables for those.  They don't
include a "marker" concept.  These are all symptoms of starting off
with an implementation limited to the 16-bit code points -- they
haven't thought through how to do full unicode support (and once they
do, I'll bet they wind up with something close to TEXT? and
GRAPHEME?).

    > C#, Java, and ECMAScript (JavaScript) all apparently use UTF-16
    > as their native format, although I've never coded in the first
    > two. Networking was probably the most important consideration
    > there.

Streaming conversions (e.g., for networking) are cheap and easy.  I
think they made those choices to simplify implementations, and then
made the mistake of exposing that implementation detail in the
interfaces.


    > Perhaps the way forward would be to leave what we've been
    > calling "simple strings" alone, and somehow (perhaps with GOOPS,
    > but haven't thought too much about this) pull the Python trick
    > of having a unicode string that can be used everywhere simple
    > strings can. Thoughts on that idea?

The proposal above makes it possible to pass text everywhere that
simple strings can be used.  However, in that part of the proposal,
string-ref, string-set! and so forth are still specified to operate
on octets.

The proposal also makes it possible to pass strings everywhere that
text can be used.   I think that's the more interesting direction: 
just use text- and grapheme- procedures from now on except where you
_really_ want to refer to octets.

-t


_______________________________________________
Guile-devel mailing list
Guile-devel@gnu.org
http://mail.gnu.org/mailman/listinfo/guile-devel



* Re: Unicode and Guile
  2003-11-06 18:16         ` Andy Wingo
  2003-11-11 19:02           ` Tom Lord
@ 2003-11-12  0:06           ` Marius Vollmer
  2003-11-12  1:27             ` Tom Lord
  1 sibling, 1 reply; 22+ messages in thread
From: Marius Vollmer @ 2003-11-12  0:06 UTC (permalink / raw)


Please allow me to randomly dump my thoughts on Guile and Unicode:

- The principal tension that I see is between having a memory
  efficient representation (UTF-8) and one that is simple and
  concept-compatible with the old way (fixed-width, maybe UTF-32).

- But is there a fixed-width Unicode representation?  I.e., is UTF-32
  just like ASCII only with more bits or is there more to it?  Are
  there combining characters in UTF-32?  If there are, then there is
  no reason to go looking for a fixed-width, old-style text
  representation.

- If we go with a variable width encoding, we can just as well use
  UTF-8 and replace strings/chars with something new, like Tom's
  texts/graphemes.

- What kind of data type are strings anyway?  Vectors or lists?
  Traditionally, they have been mutable vectors, but variable-width
  encoding of 'characters' might force us to rethink this, in general.
  People expect constant time accesses for vector-like things, but we
  will probably not want to guarantee them for a variable-width
  encoding (with integers as indices).

- So the text/grapheme API should maybe be more abstract, and not be
  using integers to refer to graphemes contained in texts but some
  opaque 'iterator', 'subtext' or 'grapheme range' thing.
  
- Shared subtexts or grapheme ranges are easy to do for read-only
  texts, but harder for mutable text.  So texts should maybe be
  immutable by default.  Mutable texts and pointers into it might use
  a more expensive data structure, like a gap buffer.

- For Guile specifically, the problematic thing is the C API.  Right
  now, strings are pretty much fixed to be vectors of unsigned bytes.
  We can't do much about this without breaking code.  So from that
  point of view, a new API for Unicode stuff looks like a good thing
  as well, when we can convince ourselves that people are willing to
  move over to that new API.

- The representation of texts would be determined by what is most
  natural for existing C code.  I.e., I think that Gtk+ uses UTF-8 and
  when we find that most libraries that we want to access from Guile
  use UTF-8 as well, we should make our text representation UTF-8.

- Old code can be supported by allowing string-*, char-*, etc. to work
  on UTF-8 encoded texts that use only ASCII code points.  That will
  cause problems for the 8-bit users (like latin-1, etc.), tho.  C
  code must avoid storing non-ASCII characters into such strings, and
  I'm not sure right now whether we can keep it from doing that in a
  compatible way.

- ... :)

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3  331E FAF8 226A D5D4 E405



* Re: Unicode and Guile
  2003-11-11 19:02           ` Tom Lord
@ 2003-11-12  0:29             ` Marius Vollmer
  2003-11-12  1:40               ` Tom Lord
  2003-11-17 16:17             ` Andy Wingo
  1 sibling, 1 reply; 22+ messages in thread
From: Marius Vollmer @ 2003-11-12  0:29 UTC (permalink / raw)
  Cc: guile-devel

Tom Lord <lord@emf.net> writes:

>     ~ (grapheme=? g1 g2 [locale]) => <boolean>
>     ~ (grapheme<? g1 g2 [locale])
>     ~ (grapheme>? g1 g2 [locale])
>     [...]
>     ~ (grapheme-ci=? g1 g2 [locale])
>     ~ (grapheme-ci<? g1 g2 [locale])
>     ~ (grapheme-ci>? g1 g2 [locale])
>
>       The usual orderings.

Is it a good idea to have an ordering among graphemes, or would it be
better to only order texts, i.e., to allow for the context of a
grapheme to determine the order?

>     ~ (make-text-marker text index) => <marker>

What about having _only_ markers and not allow integers as indices?
Also, what about making TEXTs immutable by default and instead let
TEXT-REPLACE, etc return a new text object?

>   The new TEXT? and GRAPHEME? types present a simple and traditional
>   interface to "conceptual strings" and "conceptual characters".  
>   They make it easy to express simple algorithms simply and without
>   reference to the internal structure of Unicode.

Indeed.

>   There is no essential difference between a grapheme and a text
>   object of length 1, and thus the proposal makes GRAPHEME? a 
>   subtype of TEXT?.

Do we need the concept of grapheme at all, then?

> The proposal also makes it possible to pass strings everywhere that
> text can be used.   I think that's the more interesting direction: 
> just use text- and grapheme- procedures from now on except where you
> _really_ want to refer to octets.

Could we make strings/chars go away completely over time?  For vectors
of octets, there is u8vector? from SRFI-4.

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3  331E FAF8 226A D5D4 E405



* Re: Unicode and Guile
  2003-11-12  0:06           ` Marius Vollmer
@ 2003-11-12  1:27             ` Tom Lord
  0 siblings, 0 replies; 22+ messages in thread
From: Tom Lord @ 2003-11-12  1:27 UTC (permalink / raw)
  Cc: guile-devel



    > From: Marius Vollmer <mvo@zagadka.de>

    > - But is there a fixed-width Unicode representation?  I.e., is UTF-32
    >   just like ASCII only with more bits or is there more to it?  Are
    >   there combining characters in UTF-32?  If there are, then there is
    >   no reason to go looking for a fixed-width, old-style text
    >   representation.

No offense, but you need to do more homework.  It's not easy homework,
either -- hence the "no offense".

Yes, there are combining characters in UTF-32.   UTF-32 is something
called a "character encoding form" and combining characters are
completely orthogonal to encoding forms.
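
A small illustration, assuming a Guile (2.0 or later) whose strings
are sequences of code points; `e-acute` is just an illustrative name.
Each code point would occupy one 32-bit unit in UTF-32, yet the
user-visible character below still takes two of them:

```scheme
;; "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-visible
;; character, two code points, hence two 32-bit units in UTF-32.
(define e-acute (string #\e (integer->char #x301)))

(string-length e-acute)                      ; 2
(map char->integer (string->list e-acute))   ; (101 769)
```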

If you can afford it, grab a copy of the Unicode standard.   Or check
on unicode.org -- for all I know it's freely available these days.

Please read my proposal on this list and c.l.s.  The standard Scheme
character and string types are not sanely unicode-friendly unless
interpreted as rather low-level operations.  It makes more sense to
say that CHAR? values are octets than to say they are "unicode
characters".


    > - If we go with a variable width encoding, we can just as well use
    >   UTF-8 and replace strings/chars with something new, like Tom's
    >   texts/graphemes.

It's not quite "replace" but yeah -- where traditionally you'd teach a
newbie to use characters and strings, teach them instead to use the
(subtly different) graphemes and texts.



    > - What kind of data type are strings anyway?  Vectors or lists?
    >   Traditionally, they have been mutable vectors, but variable-width
    >   encoding of 'characters' might force us to rethink this, in general.
    >   People expect constant time accesses for vector-like things, but we
    >   will probably not want to guarantee them for a variable-width
    >   encoding (with integers as indices).

A "vector of octets" is so remarkably useful that Scheme should not
fail to provide it.  CHAR? and STRING? types are compatible with
"octet and vector of octets" but not with Unicode.  So: add TEXT? and
GRAPHEME? for "string processing" and let CHAR? and STRING? be
octet-based types.

I'm as surprised as you are.  I've spent many months assuming that I
wanted CHAR? to be able to hold an arbitrary unicode code point -- but
it simply does not work out.


    > - So the text/grapheme API should maybe be more abstract, and not be
    >   using integers to refer to graphemes contained in texts but some
    >   opaque 'iterator', 'subtext' or 'grapheme range' thing.

It can use integers just fine except that, in the face of mutations to
a "text", integer indexes don't behave well.  So, yeah, there's a need
for "markers" which are an example of "cursors".  I think it's ok,
though, to expose the integers that underlie markers.  They
behave comparably to (what you expect of) the integers in most cases.


    > - Shared subtexts or grapheme ranges are easy to do for read-only
    >   texts, but harder for mutable text.  So texts should maybe be
    >   immutable by default.  Mutable texts and pointers into it might use
    >   a more expensive data structure, like a gap buffer.

I think that a tree structure is better than a gap buffer as the
default implementation option.  Shared subtexts should use markers to
represent their extents.


    > - For Guile specifically, the problematic thing is the C API.  Right
    >   now, strings are pretty much fixed to be vectors of unsigned bytes.
    >   We can't do much about this without breaking code.  So from that
    >   point of view, a new API for Unicode stuff looks like a good thing
    >   as well, when we can convince ourselves that people are willing to
    >   move over to that new API.

The proposal preserves the view that a string is a vector of unsigned
bytes.   It also adds a higher level view.



    > - The representation of texts would be determined by what is most
    >   natural for existing C code.  I.e., I think that Gtk+ uses UTF-8 and
    >   when we find that most libraries that we want to access from Guile
    >   use UTF-8 as well, we should make our text representation UTF-8.

That's an internal implementation detail, not a detail that should be
reflected in interface specifications.

It'd be just fine if Guile initially provided a C API that only
well-supported UTF-8, but that shouldn't be imposed on the
Scheme-level interfaces.


    > - Old code can be supported by allowing string-*, char-*, etc. to work
    >   on UTF-8 encoded texts that use only ASCII code points.  That will
    >   cause problems for the 8-bit users (like latin-1, etc.), tho.  C
    >   code must avoid storing non-ASCII characters into such strings, and
    >   I'm not sure right now whether we can keep it from doing that in a
    >   compatible way.

No, I think you're basically screwed in that area.   Sorry.

And that leaves you with a choice between forking Guile from R5RS or
breaking upwards compatibility to usages from C.   I suspect that the
damage to "usages from C" can be minimized to such a degree that
that's the way to go.

In this area, by the way, I'd suggest an encoding type which is not
UTF-8 but which is ISO-* for users of 8-bit sets.   If someone pokes
an unrepresentable character into an ISO-* string, either signal an
error or mutate the string's encoding --- either way will save current
C usages.
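
A rough sketch of the "mutate the string's encoding" option.  Nothing
here is an existing Guile API: `make-tagged-string` and
`tagged-string-set!` are hypothetical names, and storing code points
in a vector skips the real work of re-encoding the bytes.

```scheme
;; A hypothetical tagged string: (encoding . vector-of-code-points).
(define (make-tagged-string encoding . code-points)
  (cons encoding (list->vector code-points)))

;; Store CODE-POINT at index I.  If it does not fit the 8-bit
;; encoding, switch the tag to utf8 instead of signalling an error.
(define (tagged-string-set! ts i code-point)
  (if (and (eq? (car ts) 'iso-8859-1) (> code-point 255))
      (set-car! ts 'utf8))
  (vector-set! (cdr ts) i code-point))

(define ts (make-tagged-string 'iso-8859-1 72 105))
(tagged-string-set! ts 0 #xE9)     ; fits in 8 bits, tag unchanged
(tagged-string-set! ts 1 #x4E2D)   ; does not fit, tag becomes utf8
```

The error-signalling variant would replace the SET-CAR! with a call
to ERROR; either way, C code that only ever stores 8-bit values never
sees a change.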

-t




* Re: Unicode and Guile
  2003-11-12  0:29             ` Marius Vollmer
@ 2003-11-12  1:40               ` Tom Lord
  2003-11-12  2:30                 ` Marius Vollmer
  0 siblings, 1 reply; 22+ messages in thread
From: Tom Lord @ 2003-11-12  1:40 UTC (permalink / raw)
  Cc: guile-devel


    > From: Marius Vollmer <mvo@zagadka.de>

    > Tom Lord <lord@emf.net> writes:

    > >     ~ (grapheme=? g1 g2 [locale]) => <boolean>
    > >     ~ (grapheme<? g1 g2 [locale])
    > >     ~ (grapheme>? g1 g2 [locale])
    > >     [...]
    > >     ~ (grapheme-ci=? g1 g2 [locale])
    > >     ~ (grapheme-ci<? g1 g2 [locale])
    > >     ~ (grapheme-ci>? g1 g2 [locale])

    > >       The usual orderings.

    > Is it a good idea to have an ordering among graphemes, or would it be
    > better to only order texts, i.e., to allow for the context of a
    > grapheme to determine the order?

I think it's a fine idea to order graphemes but, depending on the
locale, the ordering of texts is _not_ a lexical ordering grounded in
grapheme ordering.

It would be good to provide a locale, perhaps the default, in which
ordering of texts _is_ a lexical ordering grounded in (default)
grapheme order.
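
A minimal sketch of such a lexical ordering, with graphemes modelled
as strings and a text as a list of them.  `graphemes<?` is a
hypothetical name, and STRING<? merely stands in for a default
grapheme order:

```scheme
;; Lexicographic "less than" on lists of graphemes, given a grapheme
;; ordering LESS?.  A proper prefix sorts before the longer text.
(define (graphemes<? less? a b)
  (cond ((null? b) #f)
        ((null? a) #t)
        ((less? (car a) (car b)) #t)
        ((less? (car b) (car a)) #f)
        (else (graphemes<? less? (cdr a) (cdr b)))))

(graphemes<? string<? '("a" "b") '("a" "c"))   ; #t
(graphemes<? string<? '("a" "b") '("a"))       ; #f
```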


    > >     ~ (make-text-marker text index) => <marker>

    > What about having _only_ markers and not allow integers as
    > indices?

Seems excessive and arbitrary.  How do I implement (Emacs') GOTO-CHAR
without standing on my head?


    > Also, what about making TEXTs immutable by default and instead let
    > TEXT-REPLACE, etc return a new text object?

Given an implementation that can do that efficiently, I see no
obstacle to implementing a new type, META-TEXT?, which is mutable in
exactly the way that TEXT? is in my proposal.   That'd be ridiculously
inconvenient though.   So, make META-TEXT? the same thing as TEXT?.

(I strongly suggest splay trees as an ideal implementation strategy
for TEXT?.   They would make _both_ mutating and functional
REPLACE efficient.)


    > >   There is no essential difference between a grapheme and a text
    > >   object of length 1, and thus the proposal makes GRAPHEME? a 
    > >   subtype of TEXT?.

    > Do we need the concept of grapheme at all, then?

Interesting question!  And it ties in with your question about "why
not just markers and not integer indexes".

I don't see a good way to ground markers _without_ integer indexes.

Graphemes are a reasonable "what the user thinks of as a character".

What does DELETE-BACKWARD-CHAR delete (for example) (at least by
default) if not a grapheme?  And in the non-default cases, how does it
analyze the TEXT?  value to figure out what to do?



    > > The proposal also makes it possible to pass strings everywhere that
    > > text can be used.   I think that's the more interesting direction: 
    > > just use text- and grapheme- procedures from now on except where you
    > > _really_ want to refer to octets.

    > Could we make strings/chars go away completely over time?  For vectors
    > of octets, there is u8vector? from SRFI-4.

I wouldn't object to seeing a complete unification of STRING? with
u8vector.   I'm not so sure that the CHAR? type is particularly useful
in the long run -- it's rather culturally biased.

-t




* Re: Unicode and Guile
  2003-11-12  1:40               ` Tom Lord
@ 2003-11-12  2:30                 ` Marius Vollmer
  2003-11-12  4:03                   ` Tom Lord
  0 siblings, 1 reply; 22+ messages in thread
From: Marius Vollmer @ 2003-11-12  2:30 UTC (permalink / raw)
  Cc: guile-devel

Tom Lord <lord@emf.net> writes:

>     > >     ~ (make-text-marker text index) => <marker>
>
>     > What about having _only_ markers and not allow integers as
>     > indices?
>
> Seems excessive and arbitrary.  How do I implement (Emacs') GOTO-CHAR
> without standing on my head?

Yes, right, there need to be conversions between markers and integers,
but I'm worried that people will write code like

    (do ((i 0 (1+ i)))
        ((>= i (text-length text)))
      (... (text-ref text i) ...))

and we'll have trouble implementing this efficiently for graphemes of
variable sizes.  When people are encouraged to use markers like this

    (do ((i (text-start text) (marker-forward i 1)))
        ((marker-at-end? i))
      (... (marker-ref i) ...))

things should be easier.  (Of course, there should also be things like
'text-map', etc.)

> (I strongly suggest splay trees as an ideal implementation strategy
> for TEXT?.   They would make _both_ mutating and functional
> REPLACE efficient.)

Ok, if there is no cost for making texts mutable, we should of course
do it.

>
>     > >   There is no essential difference between a grapheme and a text
>     > >   object of length 1, and thus the proposal makes GRAPHEME? a 
>     > >   subtype of TEXT?.
>
>     > Do we need the concept of grapheme at all, then?
>
> Interesting question!  And it ties in with your question about "why
> not just markers and not integer indexes".
>
> I don't see a good way to ground markers _without_ integer indexes.

Yes.  What I'm worried about is that it is expensive to go from an
integer index to the memory location where the indicated grapheme is
stored.  On the other hand, it is easy to increment the marker to the
next grapheme in a text.

> Graphemes are a reasonable "what the user thinks of as a character".

Yep, the concept of graphemes is useable, if only in the
documentation.  What I really had in mind was not the concept, but the
data type.  Is it important to have a new data type, or could we just
have

    (define (grapheme? obj) (and (text? obj) (= (text-length obj) 1)))
    (define grapheme=? text=?)
    (define grapheme<? text<?)
    ...

'read-grapheme' etc would probably need to remain.

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3  331E FAF8 226A D5D4 E405



* Re: Unicode and Guile
  2003-11-12  2:30                 ` Marius Vollmer
@ 2003-11-12  4:03                   ` Tom Lord
  2003-11-12 16:59                     ` Marius Vollmer
  0 siblings, 1 reply; 22+ messages in thread
From: Tom Lord @ 2003-11-12  4:03 UTC (permalink / raw)
  Cc: guile-devel


    > From: Marius Vollmer <mvo@zagadka.de>
    > Tom Lord <lord@emf.net> writes:

    > >     > >     ~ (make-text-marker text index) => <marker>

    > >     > What about having _only_ markers and not allow integers as
    > >     > indices?

    > > Seems excessive and arbitrary.  How do I implement (Emacs') GOTO-CHAR
    > > without standing on my head?

    > Yes, right, there need to be conversions between markers and integers,
    > but I'm worried that people will write code like

    >     (do ((i 0 (1+ i)))
    >         ((>= i (text-length text)))
    >       (... (text-ref text i) ...))

    > and we'll have trouble implementing this efficiently for graphemes of
    > variable sizes.  When people are encouraged to use markers like this

    >     (do ((i (text-start text) (marker-forward i 1)))
    >         ((marker-at-end? i))
    >       (... (marker-ref i) ...))

    > things should be easier.  (Of course, there should also be things like
    > 'text-map', etc.)

Integer indexes can be implemented quite efficiently.   Again, imagine
a splay tree representation of text in which each node is labeled with
its integer offsets.  ("splay" is not the only possible tree type to
which this idea applies.)
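
A rough sketch of the offset-labeled tree idea, under loud
assumptions: this is a plain binary tree with no splaying or
rebalancing, leaves are ordinary Scheme strings standing in for runs
of graphemes, and the names text-append, text-length and text-ref are
hypothetical stand-ins, not the proposed API.

```scheme
;; A text is either a string (a leaf chunk) or a node
;; #(left right left-length total-length).
(define (text-length t)
  (if (string? t)
      (string-length t)
      (vector-ref t 3)))

;; Joining two texts records the offsets in the new node, so no
;; characters are copied.
(define (text-append a b)
  (vector a b (text-length a) (+ (text-length a) (text-length b))))

;; Index lookup descends one level per step, using the stored
;; left-length to decide which side holds index I.
(define (text-ref t i)
  (cond ((string? t) (string-ref t i))
        ((< i (vector-ref t 2)) (text-ref (vector-ref t 0) i))
        (else (text-ref (vector-ref t 1) (- i (vector-ref t 2))))))
```

With this labeling the cost of (text-ref t i) depends on the depth of
the tree rather than on i; keeping that depth small is the job a
splay or other balancing scheme would do.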


    > > (I strongly suggest splay trees as an ideal implementation strategy
    > > for TEXT?.   They would make _both_ mutating and functional
    > > REPLACE efficient.)

    > Ok, if there is no cost for making texts mutable, we should of course
    > do it.

Yup.

    > > Graphemes are a reasonable "what the user thinks of as a character".

    > Yep, the concept of graphemes is useable, if only in the
    > documentation.  What I really had in mind was not the concept, but the
    > data type.  Is it important to have a new data type, or could we just
    > have

    >     (define (grapheme? obj) (and (text? obj) (= (text-length obj) 1)))
    >     (define grapheme=? text=?)
    >     (define grapheme<? text<?)
    >     ...

    > 'read-grapheme' etc would probably need to remain.

That's essentially what I proposed.   I wasn't trying to spec the
absolute _minimal_ basis set, just a usefully small basis set.

-t






* Re: Unicode and Guile
  2003-11-12  4:03                   ` Tom Lord
@ 2003-11-12 16:59                     ` Marius Vollmer
  0 siblings, 0 replies; 22+ messages in thread
From: Marius Vollmer @ 2003-11-12 16:59 UTC (permalink / raw)
  Cc: guile-devel

Tom Lord <lord@emf.net> writes:

>     >     (do ((i 0 (1+ i)))
>     >         ((>= i (text-length text)))
>     >       (... (text-ref text i) ...))
>
>     > and we'll have trouble implementing this efficiently for graphemes of
>     > variable sizes. [...]
>
> Integer indexes can be implemented quite efficiently.   Again, imagine
> a splay tree representation of text in which each node is labeled with
> its integer offsets.  ("splay" is not the only possible tree type to
> which this idea applies.)

But when you compare splay trees plus integer indices against UTF-8
vectors plus markers, doesn't the UTF+markers method win clearly, in
memory use, in speed and code simplicity (when you assume that texts
are not often modified)?

Also, UTF-8 or similar could often be passed directly to external
functions, maybe.  If we have to do encoding conversions anyway
whenever a string leaves Guile, then there is probably no point in
avoiding splay trees.

-- 
GPG: D5D4E405 - 2F9B BCCC 8527 692A 04E3  331E FAF8 226A D5D4 E405



* Re: Unicode and Guile
  2003-11-11 19:02           ` Tom Lord
  2003-11-12  0:29             ` Marius Vollmer
@ 2003-11-17 16:17             ` Andy Wingo
  1 sibling, 0 replies; 22+ messages in thread
From: Andy Wingo @ 2003-11-17 16:17 UTC (permalink / raw)


On Tue, 11 Nov 2003, Tom Lord wrote:

> Thanks for the pointer to the Python type (on which I won't comment
> :-).   Thanks for the excuse to think about this more.

And thanks for thinking this through a lot more properly than I was, and
for caring about the problem, and for having patience with the ignorant
:-)

> ** CHAR? Makes No Sense In Unicode

I think I'm starting to get a clue. Case mapping demonstrates this
pretty clearly... Incidentally, GLib's function for this is evidently
broken:

    gunichar g_unichar_toupper (gunichar c);

Although they do have g_utf8_strup, which operates on a string and does
the correct thing.
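
To make the problem concrete, here is a small sketch of why a
character-to-character signature cannot be right in general.  In
Unicode's special casing rules the uppercase of U+00DF (sharp s) is
the two-letter sequence "SS", so the mapping has to return a string.
The one-entry table and the names special-upcase,
upcase-char->string and text-upcase are hypothetical, for
illustration only.

```scheme
;; Hypothetical special-casing table: character -> replacement string.
;; U+00DF LATIN SMALL LETTER SHARP S uppercases to "SS".
(define special-upcase
  (list (cons (integer->char #xDF) "SS")))

;; Uppercase one character, returning a string so the result may be
;; longer than one character.
(define (upcase-char->string c)
  (let ((hit (assv c special-upcase)))
    (if hit
        (cdr hit)
        (string (char-upcase c)))))

;; Uppercase a whole string by concatenating the per-character results.
(define (text-upcase s)
  (apply string-append (map upcase-char->string (string->list s))))
```

The result can be longer than the input, which is exactly what a
function returning a single gunichar has no way to express.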

> * The Proposal
> 
>   The proposal has two parts.   Part 1 introduces a new type, TEXT?, 
>   which is a string-like type that is compatible with Unicode, and
>   a subtype of TEXT?, GRAPHEME?, to represent "conceptual
>   characters". 

Wow, you really have thought a lot more about this than I have.

>   It is important to note that, in general, EQV? and EQUAL?  do _not_
>   test for grapheme equality.  GRAPHEME=? must be used instead.

I can see why EQV? shouldn't test for equality: a precomposed grapheme
can be the same as one made with combining characters. But why not
overload EQUAL?, given that they would display the same (with a suitable
glyph rendering library)? Perhaps this is not possible in portable
Scheme? If this question is ignorant, my apologies.
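
A small sketch of the precomposed-versus-combining case, assuming a
Guile new enough to provide string-normalize-nfc (Guile 2.0 and
later, so not what was available when this was written).  The
grapheme=? defined here is only a stand-in built on normalization,
not the GRAPHEME=? of the proposal.

```scheme
;; U+00E9 LATIN SMALL LETTER E WITH ACUTE, as a single code point.
(define precomposed (string (integer->char #xE9)))

;; The same grapheme written as "e" plus U+0301 COMBINING ACUTE ACCENT.
(define decomposed (string #\e (integer->char #x301)))

;; A code-point-wise comparison sees two different strings.
(define raw-equal? (string=? precomposed decomposed))

;; Comparing after normalization treats them as the same grapheme.
(define (grapheme=? a b)
  (string=? (string-normalize-nfc a) (string-normalize-nfc b)))
```

Here raw-equal? is false while (grapheme=? precomposed decomposed) is
true, so an EQUAL? that compared graphemes would have to normalize
first.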

>   So, texts really need markers that work like those in Emacs:

It does indeed appear so. I withdraw my ridicule of this idea :-P

> * Optional Changes to CHAR? and STRING?
> 
>     ~ TEXT? values contain an "encoding" attribute, just as strings
>       do (utf-8, etc.)

Why should an implementation support more than one encoding, internally?

>     ~ (string? a-text-value) => #t

Would be difficult with Guile, given the C interface... Perhaps if there
were an abstract string type, with "simple strings" as a subtype, then C
functions wanting a string (just for reading) would not call
SCM_STRING_CHARS but scm_string_chars, or the like...

> [I]f I'm sitting in California and write a portable Scheme program
> that generates anagrams of a name, it'd be awfully swell if (a) My
> code doesn't have to "know" anything special about unicode internals;
> (b) my code works when passed her name as input.

Indeed.


Overall, your proposal is IMHO well thought out, and is of high quality.
I am humbled :). I hope something like this can go into Guile soon.

Cheers,

wingo.



* Re: Unicode and Guile
  2003-10-21 17:15 Unicode and Guile Andy Wingo
  2003-10-25 17:08 ` Stephen Compall
  2003-11-02 21:23 ` Kevin Ryde
@ 2003-11-26 20:35 ` Mikael Djurfeldt
  2 siblings, 0 replies; 22+ messages in thread
From: Mikael Djurfeldt @ 2003-11-26 20:35 UTC (permalink / raw)
  Cc: djurfeldt, Marius Vollmer

Andy Wingo <wingo@pobox.com> writes:

> What's the plan on internationalization of strings in Guile?

I've added three documents under a new directory in the workbook CVS module:

  workbook/i18n/jimbcomment.text
  workbook/i18n/projidea.text
  workbook/i18n/references.text

The references.text file mentions Jim Blandy's design documents which
currently reside as guile-core/doc/mbapi.texi and
guile-core/doc/mltext.texi.

Best regards,
M



Thread overview: 22+ messages
2003-10-21 17:15 Unicode and Guile Andy Wingo
2003-10-25 17:08 ` Stephen Compall
2003-10-26  0:03   ` Tom Lord
2003-10-26 12:34     ` Which Encoding? (was Re: Unicode and Guile) Stephen Compall
2003-10-31 13:25     ` Unicode and Guile Andy Wingo
2003-11-03 13:35       ` text buffers (was Re: Unicode and Guile) Stephen Compall
2003-11-03 20:34         ` Tom Lord
2003-11-04 10:04           ` Stephen Compall
2003-11-03 20:31       ` Unicode and Guile Tom Lord
2003-11-06 18:16         ` Andy Wingo
2003-11-11 19:02           ` Tom Lord
2003-11-12  0:29             ` Marius Vollmer
2003-11-12  1:40               ` Tom Lord
2003-11-12  2:30                 ` Marius Vollmer
2003-11-12  4:03                   ` Tom Lord
2003-11-12 16:59                     ` Marius Vollmer
2003-11-17 16:17             ` Andy Wingo
2003-11-12  0:06           ` Marius Vollmer
2003-11-12  1:27             ` Tom Lord
2003-10-31 13:16   ` Andy Wingo
2003-11-02 21:23 ` Kevin Ryde
2003-11-26 20:35 ` Mikael Djurfeldt
