From: ludo@gnu.org (Ludovic Courtès)
To: Mark H Weaver <mhw@netris.org>
Cc: guile-devel@gnu.org
Subject: Re: Using libunistring for string comparisons et al
Date: Wed, 16 Mar 2011 12:26:32 +0100
Message-ID: <87oc5b8fx3.fsf@gnu.org>
In-Reply-To: <87ipvjlvgj.fsf@netris.org> (Mark H. Weaver's message of "Tue, 15 Mar 2011 21:12:28 -0400")

Hello Mark,

Mark H Weaver <mhw@netris.org> writes:

> Mike Gran <spk121@yahoo.com> writes:
>>> The reason I am still arguing this point is because I have looked
>>> seriously at what I would need to do to (A) fix our i18n problems and
>>> (B) make the code efficient. I very much want to fix these things,
>>> but the pain of trying to do this with our current scheme is too much
>>> for me to bear. I shouldn't have to rewrite libunistring, and I
>>> shouldn't have to write 3 or 4 different variants of each procedure
>>> that takes two string parameters.
>>
>> What procedures are giving incorrect results?
>
> I know of two categories of bugs. One has to do with case conversions
> and case-insensitive comparisons, which must be done on entire strings
> but are currently done for each character. Here are some examples:
>
> (string-upcase "Straße") => "STRAßE" (should be "STRASSE")
> (string-downcase "ΧΑΟΣΣ") => "χαοσσ" (should be "χαoσς")
> (string-downcase "ΧΑΟΣ Σ") => "χαοσ σ" (should be "χαoς σ")
> (string-ci=? "Straße" "Strasse") => #f (should be #t)
> (string-ci=? "ΧΑΟΣ" "χαoσ") => #f (should be #t)

(Mike pointed out that SRFI-13 does not consider these bugs, but that’s
linguistically wrong so I’d consider it a bug.  Note that all these
functions are ‘linguistically buggy’ anyway since they don’t have a
locale argument, which breaks with Turkish ‘İ’.)

Can we first check what would need to be done to fix this in 2.0.x?

At first glance:

  - “Straße” is normally stored as a Latin1 string, so it would need to
    be converted to UTF-* before it can be passed to one of the
    unicase.h functions.  *Or*, we could check with bug-libunistring
    what it would take to add Latin1 string case mapping functions.

    Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
    one-to-one case mapping.  All other Latin1 strings can be handled by
    iterating over characters, as is currently done.

    With this in mind, we could hack our way so that strings that
    contain an ‘ß’ are stored as UTF-32 (yes, that’s a hack.)

  - For ‘string-downcase’, the Greek strings above are wide strings, so
    they can be passed directly to u32_tolower & co.  For these, the fix
    is about two lines.

  - Case-insensitive comparison is more difficult, as you already
    pointed out.  To do it right we’d probably need to convert Latin1
    strings to UTF-32 and then pass them to u32_casecmp.  We don’t have
    to do the conversion every time, though: we could just change Latin1
    strings in-place so they now point to a wide stringbuf upon the
    first ‘string-ci=?’.
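
To make the Latin1 side of this concrete, here is a rough sketch in plain
C of the two small pieces involved: detecting ‘ß’ in a narrow string, and
widening a Latin1 buffer to UTF-32.  The names latin1_has_sharp_s and
latin1_to_utf32 are made up for illustration; they are not Guile or
libunistring functions, and the real stringbuf code would look different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return 1 if the Latin1 buffer S of length LEN
   contains 'ß' (byte 0xDF), the one Latin1 character without a
   one-to-one case mapping; return 0 otherwise.  */
int
latin1_has_sharp_s (const unsigned char *s, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (s[i] == 0xDF)
      return 1;
  return 0;
}

/* Hypothetical helper: widen LEN Latin1 bytes from SRC into DST.
   Latin1 byte values coincide with code points U+0000..U+00FF, so
   each byte is copied as one UTF-32 code unit.  DST must have room
   for LEN elements.  */
void
latin1_to_utf32 (const unsigned char *src, size_t len, uint32_t *dst)
{
  assert (src != NULL && dst != NULL);
  for (size_t i = 0; i < len; i++)
    dst[i] = src[i];
}
```

The widened buffer is what one would then hand to the unicase.h
functions such as u32_toupper or u32_casecmp; caching it in the string
would avoid repeating the conversion.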

Thoughts?

Thanks,
Ludo’.