From: ludo@gnu.org (Ludovic Courtès)
To: guile-devel@gnu.org
Subject: Re: Wide strings
Date: Mon, 26 Jan 2009 22:40:12 +0100
Message-ID: <87pri9lpab.fsf@gnu.org>
In-Reply-To: 437818.2998.qm@web37907.mail.mud.yahoo.com
Hello,
Mike Gran <spk121@yahoo.com> writes:
> There are 3 good, actively developed solutions of which I am aware.
>
> 1. Use GNU libc functionality. Encode wide strings as wchar_t.
That'd be POSIX functionality, actually.
> 2. Use GLib functionality. Encode wide strings as UTF-8. Possibly
> give up on O(1). Possibly add indexing information to string to allow
> O(1), which might negate the space advantage of UTF-8.
Technically, depending on GLib would seem unreasonable to me. :-)
BTW, Gnulib has a wealth of modules that could be helpful here:
http://www.gnu.org/software/gnulib/MODULES.html#posix_ext_unicode
I used a few of them in Guile-R6RS-Libs to implement `string->utf8' and
the like.
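To make the O(1) concern in option 2 concrete, here is a minimal sketch of
what indexing into a UTF-8 buffer involves. The function name
`utf8_offset_of' is hypothetical (it comes from neither GLib nor Gnulib),
and the sketch assumes the buffer already holds valid UTF-8. Finding the
Nth character means scanning from the start, hence O(n) unless extra
indexing information is stored alongside the string.

```c
#include <stddef.h>

/* Hypothetical helper, not part of GLib or Gnulib.
   Return the byte offset of the codepoint at position INDEX in the
   UTF-8 buffer S of NBYTES bytes, or (size_t) -1 if INDEX is out of
   range.  Assumes S is valid UTF-8.  */
size_t
utf8_offset_of (const unsigned char *s, size_t nbytes, size_t index)
{
  size_t seen = 0;

  for (size_t i = 0; i < nbytes; i++)
    {
      /* Continuation bytes look like 10xxxxxx; anything else starts
         a new codepoint.  */
      if ((s[i] & 0xC0) != 0x80)
        {
          if (seen == index)
            return i;
          seen++;
        }
    }

  return (size_t) -1;
}
```

With a fixed-width representation the same lookup is a single array
access, which is the trade-off being discussed.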
> 3. Use IBM's ICU4c. Encode wide strings as UTF-16. Thus, add an
> obscure dependency.
>
> Option 3 is likely a non-starter, because it seems that Guile has
> tried to avoid adding new non-GNU dependencies. It is technologically
> a great solution, IMHO.
At first sight, I'd rather avoid it as a dependency, if that's possible,
but that's mostly subjective.
> Let's say that a string is a union of either an ASCII char vector or a
> wchar_t vector. A "character" then is just a Unicode codepoint.
> String-ref returns a wchar_t. This is all in line with R6RS as I
> understand it.
Yes, that seems easily doable.
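For illustration, the union idea could look roughly like the sketch
below. All names (`hybrid_string', `hybrid_string_new',
`hybrid_string_ref', `hybrid_string_free') are made up for this example
and are not Guile's actual internals. The string is stored one byte per
character when every character is ASCII and as a wchar_t vector
otherwise, and the accessor is O(1) in both cases.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical representation: either an ASCII byte vector or a
   wchar_t vector, one element per character.  */
typedef struct
{
  int is_wide;                  /* non-zero if `chars.wide' is in use */
  size_t length;                /* number of characters */
  union
  {
    unsigned char *narrow;
    wchar_t *wide;
  } chars;
} hybrid_string;

/* Build a string from N codepoints, choosing the narrow form when all
   of them are ASCII.  Return NULL on allocation failure.  */
hybrid_string *
hybrid_string_new (const wchar_t *cps, size_t n)
{
  hybrid_string *s = malloc (sizeof *s);
  size_t slots = n ? n : 1;     /* avoid a zero-sized allocation */

  if (s == NULL)
    return NULL;

  s->length = n;
  s->is_wide = 0;
  for (size_t i = 0; i < n; i++)
    if ((unsigned long) cps[i] > 0x7F)
      {
        s->is_wide = 1;
        break;
      }

  if (s->is_wide)
    {
      s->chars.wide = malloc (slots * sizeof (wchar_t));
      if (s->chars.wide == NULL)
        {
          free (s);
          return NULL;
        }
      for (size_t i = 0; i < n; i++)
        s->chars.wide[i] = cps[i];
    }
  else
    {
      s->chars.narrow = malloc (slots);
      if (s->chars.narrow == NULL)
        {
          free (s);
          return NULL;
        }
      for (size_t i = 0; i < n; i++)
        s->chars.narrow[i] = (unsigned char) cps[i];
    }

  return s;
}

/* O(1) character access.  The caller must ensure I < S->length.  */
wchar_t
hybrid_string_ref (const hybrid_string *s, size_t i)
{
  return s->is_wide ? s->chars.wide[i] : (wchar_t) s->chars.narrow[i];
}

void
hybrid_string_free (hybrid_string *s)
{
  if (s == NULL)
    return;
  if (s->is_wide)
    free (s->chars.wide);
  else
    free (s->chars.narrow);
  free (s);
}
```

A real implementation would also have to handle mutation, e.g. widening
a narrow string when `string-set!' stores a non-ASCII character.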
> There could then be a separate iterator and function set that does
> (likely O(n)) operations on the grapheme clusters of strings. A
> grapheme cluster is a single written symbol which may be made up of
> several codepoints. Unicode Standard Annex #29 describes how to
> partition a string into a set of graphemes.[1]
Hmm, that seems like a difficult topic. It's not even mentioned in
SRFI-13. I suppose it can be addressed at a later stage, possibly by
providing a specific API.
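Just to show the shape such a separate O(n) API might take, here is a
deliberately crude sketch. It is NOT an implementation of UAX #29: it
only treats the Combining Diacritical Marks block (U+0300..U+036F) as
extending the preceding character, and the names `is_combining_mark'
and `count_clusters' are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Gross simplification: only U+0300..U+036F count as combining.
   UAX #29 defines many more rules than this.  */
static int
is_combining_mark (uint32_t cp)
{
  return cp >= 0x0300 && cp <= 0x036F;
}

/* Count approximate grapheme clusters in N codepoints by walking the
   whole vector once: a cluster starts at every codepoint that is not
   a combining mark (or at the very first codepoint).  */
size_t
count_clusters (const uint32_t *cps, size_t n)
{
  size_t count = 0;

  for (size_t i = 0; i < n; i++)
    if (i == 0 || !is_combining_mark (cps[i]))
      count++;

  return count;
}
```

The point is only that cluster-level operations need a linear pass over
the codepoints, unlike `string-ref'.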
> There is the problem of systems where wchar_t is 2 bytes instead of 4
> bytes, like Cygwin. For those systems, I'd recommend
> restricting functionality to 16-bit characters instead of trying to
> add an extra UTF-16 encoding/decoding step. I think there should
> always be a complete codepoint in each wchar_t.
Agreed. The GNU libc doc concurs (info "(libc) Extended Char Intro").
However, given this limitation, and other potential portability issues,
it's still unclear to me whether this would be a good choice. We need
to look more closely at what Gnulib has to offer, IMO.
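As a rough sketch of the restriction Mike describes, the conversion
could simply refuse any codepoint that does not fit in one wchar_t,
rather than emitting a UTF-16 surrogate pair. The function names below
are hypothetical, and the sketch assumes wchar_t is either 2 or at
least 4 bytes wide.

```c
#include <stdint.h>
#include <wchar.h>

/* Largest codepoint storable in a single wchar_t on this platform,
   assuming wchar_t is either 16 bits or at least 32 bits wide.  */
static uint32_t
max_codepoint (void)
{
  return sizeof (wchar_t) >= 4 ? 0x10FFFFu : 0xFFFFu;
}

/* Store CP in *OUT and return 1 if it fits in one wchar_t; return 0
   (leaving *OUT untouched) otherwise.  Surrogate code points are
   rejected too, so every wchar_t holds a complete codepoint.  */
int
codepoint_to_wchar (uint32_t cp, wchar_t *out)
{
  if (cp > max_codepoint ())
    return 0;
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;

  *out = (wchar_t) cp;
  return 1;
}
```

On a 2-byte wchar_t system this means anything outside the Basic
Multilingual Plane is rejected, which is the limitation in question.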
Thanks,
Ludo'.