Wide strings status

From: Mike Gran <spk121@yahoo.com>
To: Guile Devel <guile-devel@gnu.org>
Subject: Wide strings status
Date: Mon, 20 Apr 2009 19:11:48 -0700
Message-ID: <1240279908.3133.76.camel@localhost.localdomain>

Hi,

OK.  I've uploaded a "string-abstraction" branch so that you can see
what I've been doing over the last couple of months.  Currently, I do
have a version of Guile that uses Unicode codepoints for characters.

The C representation of chars was changed to scm_t_uint32 throughout the
code.

Strings are internally encoded either as "narrow" 8-bit ISO-8859-1
strings or as "wide" UTF-32 strings.  Strings are usually created as
narrow strings.  Narrow strings get automatically widened to wide
strings if non-8-bit characters are set! or appended to them.
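To make the narrow/wide scheme concrete, here is a minimal C sketch of the idea.  It is not the code on the branch: the demo_string type and every demo_* function are hypothetical names invented for illustration, and only the behaviour (start narrow, widen when a character above 0xFF is stored) comes from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical string object: 1 byte or 4 bytes per character. */
typedef struct {
    size_t len;
    int is_wide;              /* 0: ISO-8859-1 bytes, 1: UTF-32 code points */
    union {
        uint8_t  *narrow;
        uint32_t *wide;
    } buf;
} demo_string;

/* Create a narrow string of LEN characters, all set to FILL. */
demo_string *demo_string_make(size_t len, uint8_t fill)
{
    demo_string *s = malloc(sizeof *s);
    assert(s != NULL);
    s->len = len;
    s->is_wide = 0;
    s->buf.narrow = malloc(len ? len : 1);
    assert(s->buf.narrow != NULL);
    for (size_t i = 0; i < len; i++)
        s->buf.narrow[i] = fill;
    return s;
}

/* Convert a narrow string to wide in place; no-op if already wide. */
void demo_string_widen(demo_string *s)
{
    if (s->is_wide)
        return;
    uint32_t *w = malloc((s->len ? s->len : 1) * sizeof *w);
    assert(w != NULL);
    for (size_t i = 0; i < s->len; i++)
        w[i] = s->buf.narrow[i];  /* ISO-8859-1 bytes equal their code points */
    free(s->buf.narrow);
    s->buf.wide = w;
    s->is_wide = 1;
}

/* Read character I as a code point, whatever the representation. */
uint32_t demo_string_ref(const demo_string *s, size_t i)
{
    assert(i < s->len);
    return s->is_wide ? s->buf.wide[i] : s->buf.narrow[i];
}

/* Store code point C at index I, widening first if C needs more than 8 bits. */
void demo_string_set(demo_string *s, size_t i, uint32_t c)
{
    assert(i < s->len);
    if (!s->is_wide && c > 0xFF)
        demo_string_widen(s);
    if (s->is_wide)
        s->buf.wide[i] = c;
    else
        s->buf.narrow[i] = (uint8_t)c;
}

void demo_string_free(demo_string *s)
{
    if (s->is_wide)
        free(s->buf.wide);
    else
        free(s->buf.narrow);
    free(s);
}
```

In this sketch, callers only ever see code points through demo_string_ref and demo_string_set, so storing 0xC8 keeps the string narrow while storing 0x12C switches it to the wide buffer without the caller noticing.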

Outside of the core strings module and srfi-13, strings are accessed
only through a set of accessor functions.  I did my best to keep the
internal representation of strings isolated to those two modules.  This
means that almost every instance of the pervasive scm_i_string_chars()
was removed.

The machine-readable "write" form of strings has been changed.  Before,
non-printable characters were given as hex escapes, for example \xFF.
Now there are three levels of hex escape, for 8-, 16-, and 24-bit
characters: \xFF, \uFFFF, \UFFFFFF.  This is a pretty common convention.
But after I coded this, I noticed that R6RS has a different convention
and I'll probably go with that.
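The three-level escape scheme can be sketched in C roughly as follows.  This is an illustration only, not the branch's printer: demo_format_char and demo_write_string are hypothetical names, and the choices to print ASCII 0x20-0x7E literally, to backslash-escape the quote and backslash characters, and to use lowercase hex digits are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format the write form of code point C into OUT (capacity CAP).
   Returns the number of characters produced, excluding the NUL. */
size_t demo_format_char(uint32_t c, char *out, size_t cap)
{
    int n;
    if (c == '"' || c == '\\')
        n = snprintf(out, cap, "\\%c", (char)c);        /* escaped literal */
    else if (c >= 0x20 && c <= 0x7E)
        n = snprintf(out, cap, "%c", (char)c);          /* printable ASCII */
    else if (c <= 0xFF)
        n = snprintf(out, cap, "\\x%02x", (unsigned)c); /* 8-bit escape */
    else if (c <= 0xFFFF)
        n = snprintf(out, cap, "\\u%04x", (unsigned)c); /* 16-bit escape */
    else
        n = snprintf(out, cap, "\\U%06x", (unsigned)c); /* 24-bit escape */
    assert(n > 0 && (size_t)n < cap);
    return (size_t)n;
}

/* Write the quoted form of the LEN code points in S into OUT (capacity CAP).
   Output is truncated, but still closed and NUL-terminated, if CAP is too
   small.  Returns the length written, excluding the NUL. */
size_t demo_write_string(const uint32_t *s, size_t len, char *out, size_t cap)
{
    size_t pos = 0;
    assert(out != NULL && cap >= 3);
    out[pos++] = '"';
    for (size_t i = 0; i < len; i++) {
        char tmp[16];
        size_t n = demo_format_char(s[i], tmp, sizeof tmp);
        if (pos + n + 2 > cap)
            break;                /* keep room for the closing quote and NUL */
        memcpy(out + pos, tmp, n);
        pos += n;
    }
    out[pos++] = '"';
    out[pos] = '\0';
    return pos;
}
```

The R6RS convention mentioned above uses a single variable-length escape terminated by a semicolon (for example \x12c;), so switching to it would replace the three escape branches with one.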

The internal representation of strings seems to work already, but the
reader doesn't work yet.  For now, one can make wide strings like this:

> (setlocale LC_ALL "")
==> "en_US.UTF-8"

> (define str (apply string (map integer->char '(100 200 300 400 500))))

> (write str)
==> "d\xc8\u012c\u0190\u01f4"

> (display str)
==> dÈĬƐǴ

This is all going to be slower than before because of the string
conversion operations, but I didn't want to do any premature
optimization.  First I wanted to get it working; there is plenty of
room for optimization later.

Anyway, if, code-wise, it is agreed that I'm generally on the right
track, the next steps are these:

Write a plethora of unit tests on what has been accomplished so far.

Character sets need to be modified to have more than 256 entries.

Character encoding needs to be a property of ports, so that not all
string operations are done in the current locale.  This is necessary so
that UTF-8-encoded source files are not interpreted differently based on
the current locale.
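One way to picture a per-port encoding is the C sketch below.  Nothing here exists in Guile: the demo_port struct, the demo_encoding enum, and demo_port_read_char are hypothetical, and the UTF-8 decoder is deliberately simplified (it does not reject overlong forms or surrogates).  The point is only that the same bytes decode the same way whatever setlocale says, because the encoding travels with the port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { DEMO_ENC_LATIN1, DEMO_ENC_UTF8 } demo_encoding;

/* Hypothetical input port over an in-memory byte buffer. */
typedef struct {
    const uint8_t *data;
    size_t len;
    size_t pos;
    demo_encoding enc;        /* property of the port, not of the locale */
} demo_port;

#define DEMO_EOF         0xFFFFFFFFu
#define DEMO_REPLACEMENT 0xFFFDu    /* returned for malformed input */

/* Read one character from P as a code point, using P's own encoding. */
uint32_t demo_port_read_char(demo_port *p)
{
    assert(p != NULL);
    if (p->pos >= p->len)
        return DEMO_EOF;

    uint8_t b0 = p->data[p->pos++];
    if (p->enc == DEMO_ENC_LATIN1 || b0 < 0x80)
        return b0;            /* one byte, one code point */

    int extra;                /* number of continuation bytes expected */
    uint32_t c;
    if ((b0 & 0xE0) == 0xC0)      { extra = 1; c = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; c = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; c = b0 & 0x07; }
    else
        return DEMO_REPLACEMENT;  /* stray continuation or invalid lead byte */

    for (int i = 0; i < extra; i++) {
        if (p->pos >= p->len || (p->data[p->pos] & 0xC0) != 0x80)
            return DEMO_REPLACEMENT;
        c = (c << 6) | (uint32_t)(p->data[p->pos++] & 0x3F);
    }
    return c;
}
```

Reading the bytes 0xC4 0xAC from a port marked DEMO_ENC_UTF8 yields the single code point 0x12C, while the same bytes on a DEMO_ENC_LATIN1 port yield 0xC4 followed by 0xAC.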

For programs that have been abusing strings for containing binary data,
some accommodation needs to be made.  Maybe make a "binary" locale.

The VM and interpreter need to be updated to deal with wide chars, and
probably in other ways that are unclear to me now.  Wide strings are
currently getting truncated to 8 bits somewhere in there.

Thanks,

Mike Gran
