unofficial mirror of guile-devel@gnu.org 
From: Mike Gran <spk121@yahoo.com>
To: guile-devel@gnu.org
Subject: Re: Wide strings
Date: Sun, 25 Jan 2009 16:16:12 -0800 (PST)
Message-ID: <437818.2998.qm@web37907.mail.mud.yahoo.com>
In-Reply-To: 87wscjvwyq.fsf@gnu.org

> From: Ludovic Courtès <ludo@gnu.org>

I believe that we should aim for R6RS strings.

I think the most important thing is to have humility in the face of an
impossible problem: how to encode all textual information.  It is
important to "stand on the shoulders of giants" here.  It becomes a
matter of deciding which actively developed library of wide character
functions is to be used and how to integrate it.

I am aware of three good, actively developed solutions.

1.  Use GNU libc functionality.  Encode wide strings as wchar_t.

2.  Use GLib functionality.  Encode wide strings as UTF-8.  Possibly
give up on O(1) indexing.  Alternatively, add indexing information to
each string to keep indexing O(1), which might negate the space
advantage of UTF-8.
 
3.  Use IBM's ICU4C.  Encode wide strings as UTF-16.  This adds an
obscure dependency.
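
To make the O(1) concern in option 2 concrete, here is a minimal
sketch of indexing into a UTF-8 string.  The function name
utf8_offset_of_index is made up for illustration (it is not a GLib or
Guile function), and the code assumes the input is valid,
NUL-terminated UTF-8.  Finding the n-th code point means scanning from
the start, so string-ref becomes O(n) unless extra index data is kept.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the byte offset of the code point at
   index n in the NUL-terminated UTF-8 string s, or (size_t)-1 if s
   has fewer than n + 1 code points.  Assumes s is valid UTF-8. */
size_t utf8_offset_of_index(const char *s, size_t n)
{
    assert(s != NULL);

    size_t seen = 0;  /* number of code points passed so far */
    for (size_t i = 0; s[i] != '\0'; i++) {
        /* Continuation bytes look like 10xxxxxx; any other byte
           starts a new code point. */
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (seen == n)
                return i;
            seen++;
        }
    }
    return (size_t)-1;
}
```

With a wchar_t vector the same lookup is a plain array access, which
is the trade-off being weighed here.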

Option 3 is likely a non-starter, because it seems that Guile has
tried to avoid adding new non-GNU dependencies.  It is technologically
a great solution, IMHO.

Option 1 is probably the way to go, because it keeps Guile close to
the metal and keeps dependencies out of it.  Unfortunately, UTF-8
strings would then have to be converted to and from wchar_t.

>  1. IMO it'd be nice to have ASCII strings special-cased so that they
>    are always encoded in ASCII.  This would allow for memory savings
>    since, e.g., most symbols are expected to contain only ASCII
>    characters.  It might also simplify interaction with C in certain
>    cases; for instance, it would make it easy to have statically
>    initialized ASCII Scheme strings.

Why not?  It does solve the initialization problem of dealing with strings
before setlocale has been called.

Let's say that a string is either an ASCII char vector or a wchar_t
vector.  A "character" then is just a Unicode codepoint.  String-ref
returns a wchar_t.  This is all in line with R6RS as I understand it.
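
A rough sketch of what such a representation could look like in C
follows.  All of the names (sketch_string, str_kind,
sketch_string_ref, sketch_hello) are invented for illustration and are
not Guile's actual internals.  It shows a tagged union over the two
storage kinds, a string-ref that returns a wchar_t either way, and a
statically initialized ASCII string.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical layout: a string is either narrow (ASCII) or wide. */
enum str_kind { STR_ASCII, STR_WIDE };

struct sketch_string {
    enum str_kind kind;
    size_t length;              /* number of characters */
    union {
        const char *ascii;      /* valid when kind == STR_ASCII */
        const wchar_t *wide;    /* valid when kind == STR_WIDE */
    } chars;
};

/* string-ref: O(1) for both kinds, always returns a wchar_t. */
wchar_t sketch_string_ref(const struct sketch_string *s, size_t k)
{
    assert(s != NULL && k < s->length);
    if (s->kind == STR_ASCII)
        return (wchar_t)(unsigned char)s->chars.ascii[k];
    return s->chars.wide[k];
}

/* An ASCII string can be initialized statically, with no locale
   setup and no conversion step. */
static const struct sketch_string sketch_hello =
    { STR_ASCII, 5, { .ascii = "hello" } };
```

Callers never need to know which storage kind they got, so symbols
and other ASCII-only strings keep one byte per character.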

There could then be a separate iterator and function set that does
(likely O(n)) operations on the grapheme clusters of strings.  A
grapheme cluster is a single written symbol which may be made up of
several codepoints.  Unicode Standard Annex #29 describes how to
partition a string into a set of graphemes.[1]
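
To illustrate why this is a separate O(n) pass, here is a deliberately
simplified sketch.  It is NOT an implementation of the Annex #29
rules: it only treats the Combining Diacritical Marks block (U+0300 to
U+036F) as extending the previous character, and the function names
are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Simplification: only U+0300..U+036F count as combining marks.
   The real UAX #29 rules cover far more cases. */
static int is_combining_mark(wchar_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Count approximate grapheme clusters in s[0..len).  Each codepoint
   that is not a combining mark starts a new cluster; a combining
   mark at position 0 is counted as its own cluster. */
size_t count_clusters_approx(const wchar_t *s, size_t len)
{
    assert(s != NULL || len == 0);

    size_t clusters = 0;
    for (size_t i = 0; i < len; i++) {
        if (i == 0 || !is_combining_mark(s[i]))
            clusters++;
    }
    return clusters;
}
```

For example, 'e' followed by U+0301 (combining acute accent) is two
codepoints for string-ref but one written symbol for this iterator.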

There is the problem of systems where wchar_t is 2 bytes instead of 4
bytes, like Cygwin.  For those systems, I'd recommend
restricting functionality to 16-bit characters instead of trying to
add an extra UTF-16 encoding/decoding step.  I think there should
always be a complete codepoint in each wchar_t.
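
A minimal sketch of that restriction, with a made-up function name:
a check for whether a codepoint can be stored whole in one storage
unit of a given size.  Rejecting surrogate values is my assumption
here, on the reasoning that a lone surrogate is never a complete
character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: can codepoint cp be stored as one complete
   unit of unit_size bytes (2 or 4)?  Pass sizeof(wchar_t) to ask
   about the current platform. */
int codepoint_fits(uint32_t cp, size_t unit_size)
{
    assert(unit_size == 2 || unit_size == 4);

    if (cp > 0x10FFFF)
        return 0;               /* beyond the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates (assumption: rejected) */
    if (unit_size == 2 && cp > 0xFFFF)
        return 0;               /* would need a UTF-16 pair */
    return 1;
}
```

On a 2-byte wchar_t system, string constructors could use a check
like codepoint_fits(cp, sizeof(wchar_t)) to refuse characters outside
the 16-bit range rather than splitting them into pairs.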

-- 
Mike Gran

[1] http://www.unicode.org/reports/tr29/



