From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: ludo@gnu.org (Ludovic =?iso-8859-1?Q?Court=E8s?=)
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Using libunistring for string comparisons et al
Date: Wed, 16 Mar 2011 17:58:31 +0100
Message-ID: <87bp1b57ew.fsf@gnu.org>
References: <204699.68103.qm@web37904.mail.mud.yahoo.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Trace: dough.gmane.org 1300294736 7189 80.91.229.12 (16 Mar 2011 16:58:56 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Wed, 16 Mar 2011 16:58:56 +0000 (UTC)
Cc: Mark H Weaver <mhw@netris.org>, "guile-devel@gnu.org" <guile-devel@gnu.org>
To: Mike Gran <spk121@yahoo.com>
Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Wed Mar 16 17:58:49 2011
Return-path: <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>
Envelope-to: guile-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.69)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1Pzu3c-00006Q-5u
	for guile-devel@m.gmane.org; Wed, 16 Mar 2011 17:58:48 +0100
Original-Received: from localhost ([127.0.0.1]:56597 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1Pzu3b-0001eb-7t
	for guile-devel@m.gmane.org; Wed, 16 Mar 2011 12:58:47 -0400
Original-Received: from [140.186.70.92] (port=36866 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Pzu3T-0001Z5-Nx
	for guile-devel@gnu.org; Wed, 16 Mar 2011 12:58:42 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <ludo@gnu.org>) id 1Pzu3R-0000FH-Rh
	for guile-devel@gnu.org; Wed, 16 Mar 2011 12:58:39 -0400
Original-Received: from solo.fdn.fr ([80.67.169.19]:38017)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <ludo@gnu.org>)
	id 1Pzu3R-0000Ds-GB
	for guile-devel@gnu.org; Wed, 16 Mar 2011 12:58:37 -0400
Original-Received: from nixey (unknown [193.50.110.208])
	(using TLSv1 with cipher DHE-RSA-AES128-SHA (128/128 bits))
	(Client did not present a certificate)
	(Authenticated sender: lcourtes)
	by smtp.fdn.fr (Postfix) with ESMTPSA id 09B804473A;
	Wed, 16 Mar 2011 17:58:36 +0100 (CET)
X-URL: http://www.fdn.fr/~lcourtes/
X-Revolutionary-Date: 26 =?iso-8859-1?Q?Vent=F4se?= an 219 de la
	=?iso-8859-1?Q?R=E9volution?=
X-PGP-Key-ID: 0xEA52ECF4
X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc
X-PGP-Fingerprint: 83C4 F8E5 10A3 3B4C 5BEA  D15D 77DD 95E2 EA52 ECF4
X-OS: x86_64-unknown-linux-gnu
In-Reply-To: <204699.68103.qm@web37904.mail.mud.yahoo.com> (Mike Gran's
	message of "Wed, 16 Mar 2011 08:22:26 -0700 (PDT)")
User-Agent: Gnus/5.110013 (No Gnus v0.13) Emacs/23.3 (gnu/linux)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 2)
X-Received-From: 80.67.169.19
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Developers list for Guile,
	the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=subscribe>
Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.devel:11886
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/11886>

Hi,

Mike Gran <spk121@yahoo.com> writes:

>> From: Ludovic Courtès <ludo@gnu.org>
>
>> > I know of two categories of bugs.  One has to do with case conversions
>> > and case-insensitive comparisons, which must be done on entire strings
>> > but are currently done for each character.  Here are some examples:
>> >
>> >   (string-upcase "Straße")         => "STRAßE"   (should be "STRASSE")
>> >   (string-downcase "ΧΑΟΣΣ")        => "χαοσσ"    (should be "χαοσς")
>> >   (string-downcase "ΧΑΟΣ Σ")       => "χαοσ σ"   (should be "χαος σ")
>> >   (string-ci=? "Straße" "Strasse") => #f         (should be #t)
>> >   (string-ci=? "ΧΑΟΣ" "χαοσ")      => #f         (should be #t)
>>
>> (Mike pointed out that SRFI-13 does not consider these bugs, but that’s
>> linguistically wrong so I’d consider it a bug.  Note that all these
>> functions are ‘linguistically buggy’ anyway since they don’t have a
>> locale argument, which breaks with Turkish ‘İ’.)
>>
>> Can we first check what would need to be done to fix this in 2.0.x?
>>
>> At first glance:
>>
>>   - “Straße” is normally stored as a Latin1 string, so it would need to
>>     be converted to UTF-* before it can be passed to one of the
>>     unicase.h functions.  *Or*, we could check with bug-libunistring
>>     what it would take to add Latin1 string case mapping functions.
>>
>>     Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
>>     one-to-one case mapping.  All other Latin1 strings can be handled by
>>     iterating over characters, as is currently done.
>
> There is the micro sign, which, when case folded, becomes a Greek mu.
> It is still a single character, but it is the only Latin-1 character that,
> when folded, becomes a non-Latin-1 character.

Blech.

It would have worked better with narrow == ASCII instead of
narrow == Latin1.  It’s a change we can still make, I think.
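
To make the two awkward cases concrete, here is a minimal sketch of
case-folding Latin1 bytes into UTF-32.  The function names are
hypothetical (they exist in neither Guile nor libunistring); it only
illustrates that ‘ß’ grows to two characters and that the micro sign
leaves the Latin1 range, which is why a folded Latin1 string cannot
always stay narrow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Guile or libunistring: case-fold one
   Latin1 byte into OUT (room for 2 code points) and return how many
   code points were written.  */
size_t
latin1_fold_char (unsigned char c, uint32_t out[2])
{
  assert (out != NULL);

  if (c == 0xDF)                /* ß folds to "ss": one char becomes two */
    {
      out[0] = 's';
      out[1] = 's';
      return 2;
    }
  if (c == 0xB5)                /* micro sign folds to Greek mu, U+03BC */
    {
      out[0] = 0x03BC;
      return 1;
    }
  if ((c >= 'A' && c <= 'Z')
      || (c >= 0xC0 && c <= 0xDE && c != 0xD7))   /* À..Þ, except × */
    {
      out[0] = c + 0x20;        /* the lower-case form is 0x20 above */
      return 1;
    }
  out[0] = c;                   /* everything else folds to itself */
  return 1;
}

/* Fold LEN Latin1 bytes from SRC into DST, which must have room for
   2 * LEN code points.  Returns the number of code points written.  */
size_t
latin1_fold (const unsigned char *src, size_t len, uint32_t *dst)
{
  size_t n = 0;
  for (size_t i = 0; i < len; i++)
    n += latin1_fold_char (src[i], dst + n);
  return n;
}
```

Folding the six bytes of “Straße” this way yields seven code points,
so the result has to be written to a separate buffer rather than
rewritten character by character in place.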

>>   - Case insensitive comparison is more difficult, as you already
>>     pointed out.  To do it right we’d probably need to convert Latin1
>>     strings to UTF-32 and then pass it to u32_casecmp.  We don’t have to
>>     do the conversion every time, though: we could just change Latin1
>>     strings in-place so they now point to a wide stringbuf upon the
>>     first ‘string-ci=’.
>>
>> Thoughts?
>
> What about the srfi-13 case insensitive comparisons (the ones that don't
> terminate in question marks, like string-ci<)?  Should they remain
> as srfi-13 suggests, or should they remain similar in behavior
> to the question-mark-terminated comparisons?

Well, if maintaining two string comparison algorithms is reasonable,
then we can keep both; otherwise, I’d vote for the R6RS way.

> Mark is right that fixing this will not be pretty.  The case insensitive
> string comparisons, for example, could be patched like the attached
> snippet. If you don't find it too ugly of an approach, I could work on
> a real patch.

Indeed it’s quite inelegant.  ;-)

How about changing to narrow == ASCII and then string comparison would
be:

  if (narrow (s1) != narrow (s2))
    {
      /* Handle ß -> ss.  Widen whichever string is still narrow.  */
      if (narrow (s1))
        widify (s1);
      else
        widify (s2);
    }

  if (narrow (s1))
    /* S1 and S2 are ASCII.  */
    return strcmp (char_data (s1), char_data (s2));
  else
    /* S1 and S2 are UTF-32.  */
    return u32_cmp (wide_char_data (s1), wide_char_data (s2));

Looks like that would remain reasonable while actually fixing our
problems.
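
For illustration, here is a self-contained sketch of that
representation switch.  The ‘struct strbuf’ layout and all function
names are assumptions made up for this example, not Guile’s actual
stringbuf code, and the comparison is a plain code-point ordering with
no case folding; the wide branch is where a call to something like
u32_casecmp would go in a real implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string buffer: either ASCII bytes or UTF-32.  */
struct strbuf
{
  int narrow;                   /* non-zero: ASCII; zero: UTF-32 */
  size_t len;                   /* length in characters */
  union { char *ascii; uint32_t *wide; } data;
};

/* Build a narrow buffer from a NUL-terminated ASCII string.  */
struct strbuf
strbuf_from_ascii (const char *s)
{
  struct strbuf b;
  assert (s != NULL);
  b.narrow = 1;
  b.len = strlen (s);
  b.data.ascii = malloc (b.len + 1);
  if (b.data.ascii == NULL)
    abort ();
  memcpy (b.data.ascii, s, b.len + 1);
  return b;
}

/* Build a wide buffer from LEN UTF-32 code points.  */
struct strbuf
strbuf_from_utf32 (const uint32_t *s, size_t len)
{
  struct strbuf b;
  b.narrow = 0;
  b.len = len;
  b.data.wide = malloc ((len ? len : 1) * sizeof *b.data.wide);
  if (b.data.wide == NULL)
    abort ();
  memcpy (b.data.wide, s, len * sizeof *b.data.wide);
  return b;
}

/* Switch B to the wide representation in place; no-op if already wide.  */
void
widify (struct strbuf *b)
{
  if (!b->narrow)
    return;
  uint32_t *w = malloc ((b->len ? b->len : 1) * sizeof *w);
  if (w == NULL)
    abort ();
  for (size_t i = 0; i < b->len; i++)
    w[i] = (unsigned char) b->data.ascii[i];
  free (b->data.ascii);
  b->data.wide = w;
  b->narrow = 0;
}

/* Three-way comparison by code point; widens one argument if the two
   representations differ.  */
int
strbuf_cmp (struct strbuf *s1, struct strbuf *s2)
{
  if (s1->narrow != s2->narrow)
    widify (s1->narrow ? s1 : s2);

  if (s1->narrow)
    /* Both are ASCII.  */
    return strcmp (s1->data.ascii, s2->data.ascii);

  /* Both are UTF-32.  */
  size_t n = s1->len < s2->len ? s1->len : s2->len;
  for (size_t i = 0; i < n; i++)
    if (s1->data.wide[i] != s2->data.wide[i])
      return s1->data.wide[i] < s2->data.wide[i] ? -1 : 1;
  return (s1->len > s2->len) - (s1->len < s2->len);
}
```

Two ASCII strings never leave the fast path, and a narrow string is
widened at most once, the first time it meets a wide one.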

As a side-effect, though, scm_from_latin1_locale would become slightly
less efficient because it’d need to check for non-ASCII chars.
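
The extra check would amount to one linear scan over the input,
roughly like the sketch below.  The helper name is hypothetical and
not taken from Guile.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return non-zero if all LEN bytes at S are
   ASCII, i.e. have the high bit clear.  */
int
all_ascii_p (const unsigned char *s, size_t len)
{
  assert (s != NULL || len == 0);
  for (size_t i = 0; i < len; i++)
    if (s[i] & 0x80)
      return 0;
  return 1;
}
```

A Latin1 input that passes this test could be stored narrow as-is;
anything else would have to be widened on the way in.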

Thanks,
Ludo’.