* Wide strings status
From: Mike Gran @ 2009-04-21  2:11 UTC (permalink / raw)
  To: Guile Devel

Hi,

OK.  I've uploaded a "string-abstraction" branch so that you can see
what I've been doing over the last couple of months.  Currently, I do
have a version of Guile that uses Unicode codepoints for characters.

The C representation of chars was changed to scm_t_uint32 throughout the
code.

Strings are internally encoded either as "narrow" 8-bit ISO-8859-1
strings or as "wide" UTF-32 strings.  Strings are usually created as
narrow strings.  A narrow string is automatically widened if a character
that does not fit in 8 bits is stored into it with string-set! or
appended to it.
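
To make the narrow/wide idea concrete, here is a minimal sketch in C.  It
is not the code on the branch: the sketch_* names and the struct layout
are assumptions made up for illustration, and error handling is reduced
to assertions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical string type; names and layout are illustrative only. */
typedef struct {
    size_t len;
    int is_wide;              /* 0: one byte per char (ISO-8859-1); 1: UTF-32 */
    union {
        uint8_t  *narrow;
        uint32_t *wide;
    } buf;
} sketch_string;

/* New strings start out narrow and zero-filled. */
static sketch_string *sketch_make_string(size_t len)
{
    sketch_string *s = malloc(sizeof *s);
    assert(s != NULL);
    s->len = len;
    s->is_wide = 0;
    s->buf.narrow = calloc(len ? len : 1, 1);
    assert(s->buf.narrow != NULL);
    return s;
}

/* Convert a narrow string to wide in place.  An ISO-8859-1 byte value is
   also its Unicode code point, so widening is a plain element copy. */
static void sketch_widen(sketch_string *s)
{
    if (s->is_wide)
        return;
    uint32_t *w = malloc((s->len ? s->len : 1) * sizeof *w);
    assert(w != NULL);
    for (size_t i = 0; i < s->len; i++)
        w[i] = s->buf.narrow[i];
    free(s->buf.narrow);
    s->buf.wide = w;
    s->is_wide = 1;
}

static uint32_t sketch_string_ref(const sketch_string *s, size_t i)
{
    assert(i < s->len);
    return s->is_wide ? s->buf.wide[i] : s->buf.narrow[i];
}

/* Storing a code point above 0xFF is what triggers the widening. */
static void sketch_string_set(sketch_string *s, size_t i, uint32_t c)
{
    assert(i < s->len);
    if (!s->is_wide && c > 0xFF)
        sketch_widen(s);
    if (s->is_wide)
        s->buf.wide[i] = c;
    else
        s->buf.narrow[i] = (uint8_t) c;
}
```

In this sketch, setting code points 100 and 200 leaves the string narrow,
and setting 300 widens it while preserving the earlier characters.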

Outside of the core strings module and srfi-13, a set of accessor
functions is used to access strings.  I did my best to keep the internal
representation of strings isolated to those two modules.  This means
that almost every instance of the pervasive scm_i_string_chars() was
removed.

The machine-readable "write" form of strings has been changed.  Before,
non-printable characters were given as hex escapes, for example \xFF.
Now there are three levels of hex escape for 8-, 16-, and 24-bit
characters: \xFF, \uFFFF, \UFFFFFF.  This is a pretty common convention.
But after I coded this, I noticed that R6RS has a different convention
and I'll probably go with that.
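
The three-level escape convention described above can be sketched in C as
follows.  This is an illustration only, not the branch's printer: the
sketch_* names are hypothetical, and the choice to escape everything
outside printable ASCII is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format one code point as it would appear inside
   a written string.  Returns the number of characters produced. */
static int sketch_escape_char(uint32_t c, char *out, size_t outsize)
{
    assert(c <= 0x10FFFF);
    if (c == '"' || c == '\\')
        return snprintf(out, outsize, "\\%c", (char) c);
    if (c >= 0x20 && c < 0x7F)               /* printable ASCII: as-is */
        return snprintf(out, outsize, "%c", (char) c);
    if (c <= 0xFF)                           /* 8-bit:  \xFF     */
        return snprintf(out, outsize, "\\x%02x", (unsigned) c);
    if (c <= 0xFFFF)                         /* 16-bit: \uFFFF   */
        return snprintf(out, outsize, "\\u%04x", (unsigned) c);
    return snprintf(out, outsize, "\\U%06x", (unsigned) c);  /* \UFFFFFF */
}

/* Escape n code points into out (without surrounding quotes).
   Returns the length of the resulting NUL-terminated text. */
static size_t sketch_write_string(const uint32_t *chars, size_t n,
                                  char *out, size_t outsize)
{
    size_t used = 0;
    assert(outsize > 0);
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        char piece[16];
        int len = sketch_escape_char(chars[i], piece, sizeof piece);
        assert(len > 0 && used + (size_t) len < outsize);
        memcpy(out + used, piece, (size_t) len + 1);  /* copies the NUL too */
        used += (size_t) len;
    }
    return used;
}
```

Given the code points 100, 200, 300, 400, 500, this sketch produces the
text d\xc8\u012c\u0190\u01f4.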

The internal representation of strings seems to work already, but the
reader doesn't work yet.  For now, one can make wide strings like this:

> (setlocale LC_ALL "")
==> "en_US.UTF-8"

> (define str (apply string (map integer->char '(100 200 300 400 500))))

> (write str)
==> "d\xc8\u012c\u0190\u01f4"

> (display str)
==> dÈĬƐǴ

This is all going to be slower than before because of the string
conversion operations, but I didn't want to do any premature
optimization.  I wanted to get it working first; there is plenty of
room for optimization later.

Anyway, if it is agreed that I'm generally on the right track code-wise,
the next steps are these:

Write a plethora of unit tests on what has been accomplished so far.

Character sets need to be modified to have more than 256 entries.

Character encoding needs to be a property of ports, so that not all
string operations are done in the current locale.  This is necessary so
that UTF-8-encoded source files are not interpreted differently based on
the current locale.

For programs that have been abusing strings for containing binary data,
some accommodation needs to be made.  Maybe make a "binary" locale.

The VM and interpreter need to be updated to deal with wide chars, and
probably in other ways that are unclear to me now.  Wide strings are
currently getting truncated to 8 bits somewhere in there.
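
As an illustration of the per-port encoding step listed above, here is a
rough C sketch of a port that carries its own encoding and decodes bytes
without consulting the locale.  Everything named sketch_* or SKETCH_* is
hypothetical and not an existing Guile API, and the UTF-8 decoder is
deliberately simplified: it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SKETCH_ENC_LATIN1, SKETCH_ENC_UTF8 } sketch_encoding;

/* Hypothetical input port reading from an in-memory byte buffer. */
typedef struct {
    const uint8_t *bytes;
    size_t len;
    size_t pos;
    sketch_encoding enc;      /* set per port, independent of setlocale() */
} sketch_port;

#define SKETCH_EOF   0xFFFFFFFFu
#define SKETCH_ERROR 0xFFFFFFFEu

/* Read one code point, decoding according to the port's own encoding. */
static uint32_t sketch_read_char(sketch_port *p)
{
    assert(p != NULL);
    if (p->pos >= p->len)
        return SKETCH_EOF;

    uint8_t b = p->bytes[p->pos++];
    if (p->enc == SKETCH_ENC_LATIN1 || b < 0x80)
        return b;                         /* one byte, one code point */

    int extra;                            /* continuation bytes expected */
    uint32_t c;
    if ((b & 0xE0) == 0xC0)      { extra = 1; c = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { extra = 2; c = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { extra = 3; c = b & 0x07; }
    else
        return SKETCH_ERROR;              /* stray continuation or bad lead */

    while (extra-- > 0) {
        if (p->pos >= p->len)
            return SKETCH_ERROR;          /* truncated sequence */
        uint8_t cont = p->bytes[p->pos];
        if ((cont & 0xC0) != 0x80)
            return SKETCH_ERROR;
        p->pos++;
        c = (c << 6) | (cont & 0x3F);
    }
    return c;
}
```

With this sketch, the bytes 0x64 0xC3 0x88 read as the two code points 100
and 200 from a UTF-8 port, and as three separate characters from an
ISO-8859-1 port, whatever the current locale is.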

Thanks,

Mike Gran




