unofficial mirror of guile-devel@gnu.org 
From: Mike Gran <spk121@yahoo.com>
To: "Mark H Weaver" <mhw@netris.org>, "Ludovic Courtès" <ludo@gnu.org>
Cc: "guile-devel@gnu.org" <guile-devel@gnu.org>
Subject: Re: Using libunistring for string comparisons et al
Date: Sat, 12 Mar 2011 13:28:05 -0800 (PST)
Message-ID: <336042.33326.qm@web37901.mail.mud.yahoo.com>

> From:Mark H Weaver <mhw@netris.org>

> 
> ludo@gnu.org (Ludovic Courtès) writes:
> > I find Cowan’s proposal for string iteration and the R6RS editors
> > response interesting:
> >
> >   http://www.r6rs.org/formal-comments/comment-235.txt
> 
> Cowan was proposing a complex new API.  I am not, nor did Gauche.
> An efficient implementation of string ports is all that is needed.
> 
> > I also think strings should remain what they currently are, with O(1)
> > random access.
> 
> I understand your position, and perhaps you are right.
> 
> Unfortunately, the alternatives are not pleasant.  We have a bunch of
> bugs in our string handling functions.  Currently, our case-insensitive
> string comparisons and case conversions are not correct for several
> languages including German, according to the R6RS among other things.

(Just as an aside, in discussions like this, I think it is important to
distinguish between
- where Guile doesn't match R6RS
- where R6RS doesn't match Unicode
- and where Unicode doesn't match reality

Each of these battles needs to be fought in the proper battlefield.)

> We could easily fix these problems by using libunistring, which provides
> the operations we need, but only if we use a single string
> representation, and one that is supported by libunistring (UTF-8,
> UTF-16, or UTF-32).

We do, in a manner of speaking, have a single string representation: UTF-32.
The 'narrow' encoding is UTF-32 with the initial 3 bytes of zero removed.
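
To make that concrete, here is a rough sketch of the idea.  This is not
Guile's actual string struct; the names sketch_string and
sketch_string_ref are made up for illustration.  The point is only that
character access stays O(1) in both representations, because a narrow
character is a UTF-32 code unit whose three zero bytes were never stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, for illustration only; not Guile's real struct. */
struct sketch_string
{
  int is_narrow;              /* 1: one byte per char, 0: four bytes per char */
  size_t len;                 /* length in characters (codepoints) */
  union
  {
    const uint8_t *narrow;    /* codepoints 0..255, high three bytes dropped */
    const uint32_t *wide;     /* full UTF-32 code units */
  } u;
};

/* O(1) character access in either representation. */
static uint32_t
sketch_string_ref (const struct sketch_string *s, size_t i)
{
  assert (i < s->len);
  return s->is_narrow ? (uint32_t) s->u.narrow[i] : s->u.wide[i];
}
```

A caller never needs to know which representation it got: the same
index yields the same codepoint either way.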

> Our use of two different internal string representations is another
> problem.  Right now, our string comparisons are painfully inefficient.
> Take a look at compare_strings in srfi-13.c.  It's also broken w.r.t.
> case-insensitive comparisons.  In order to fix this and make it
> efficient, we'll need to make several different variants:
> 
>   * case-sensitive
>      * narrow-narrow
>      * narrow-wide
>      * wide-wide (use libunistring's u32_cmp2 for this)
> 
>   * case-insensitive
>      * narrow-narrow
>      * narrow-wide
>      * wide-wide (use libunistring for this)
> 
> The case-insensitive narrow-narrow comparison must be able to handle
> this, for example (from r6rs-lib):
> 
>   (string-ci=? "Straße" "Strasse") => #t
> 
> I'm not yet sure what's involved in implementing the case-insensitive
> narrow-wide case properly.

It is not too difficult in practice, I think.  Converting narrow (Latin-1
aka truncated UTF-32) to wide (UTF-32) involves adding back
in the missing zero bytes.
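
A minimal sketch of what I mean, assuming the narrow data is a plain
array of bytes.  The function names are hypothetical and this is not
what srfi-13.c does today.  The first function widens a narrow buffer;
the second does a case-sensitive narrow-wide comparison by codepoint
without allocating anything, widening one character at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Widen LEN narrow characters from SRC into DST.  DST must have room
   for LEN elements.  Each byte becomes one UTF-32 code unit, which is
   the same as putting the three zero bytes back.  */
static void
sketch_widen (uint32_t *dst, const uint8_t *src, size_t len)
{
  for (size_t i = 0; i < len; i++)
    dst[i] = src[i];
}

/* Compare narrow string A with wide string B codepoint by codepoint.
   Returns -1, 0 or 1.  Case-sensitive only.  */
static int
sketch_cmp_narrow_wide (const uint8_t *a, size_t alen,
                        const uint32_t *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen;

  for (size_t i = 0; i < n; i++)
    {
      uint32_t ca = a[i];       /* widen on the fly */
      if (ca != b[i])
        return ca < b[i] ? -1 : 1;
    }
  if (alen == blen)
    return 0;
  return alen < blen ? -1 : 1;  /* the shorter string sorts first */
}
```

For the case-insensitive variant, one option would be to widen the
narrow operand into a temporary buffer and hand both UTF-32 buffers to
libunistring, at the cost of a copy.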

For the sake of history, here's how we got to where we are now.
- R6RS says characters are Unicode codepoints
- R6RS says string ops are O(1)
- The only Unicode encoding that uses codepoints as its atomic units and 
  is O(1) is UTF-32
- UTF-32 wastes space for most normal circumstances
Thus we invented the wide (UTF-32) / narrow (UTF-32 with initial
zeros stripped) encoding scheme we have now.  It may seem
suboptimal in terms of complexity and familiarity, but
what you see is an attempt to be optimal in terms of memory while
keeping O(1) access.
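
To illustrate the O(1) point: with a variable-width encoding such as
UTF-8, finding the Nth codepoint means walking the bytes before it.
The sketch below is hypothetical code, not taken from Guile, and it
assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the byte offset of codepoint number INDEX in the valid UTF-8
   buffer S of NBYTES bytes, or NBYTES if INDEX is past the end.  The
   loop visits every byte before the answer, so the cost grows with
   INDEX instead of being constant.  */
static size_t
sketch_utf8_offset (const uint8_t *s, size_t nbytes, size_t index)
{
  size_t off = 0;

  while (off < nbytes && index > 0)
    {
      off++;                    /* step over the lead byte */
      /* Continuation bytes have the bit pattern 10xxxxxx.  */
      while (off < nbytes && (s[off] & 0xC0) == 0x80)
        off++;
      index--;
    }
  return off;
}
```

With UTF-32, or with the narrow form, the same lookup is a single array
index.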

I actually at one point had a nearly complete version of Guile 1.8
that used UTF-8 and another that used UTF-32.  There are some
other reasons why UTF-8 is bad, which I could bore you with
ad nauseam.

Thanks,

Mike



