From: ludo@gnu.org (Ludovic Courtès)
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Using libunistring for string comparisons et al
Date: Wed, 16 Mar 2011 17:58:31 +0100
Message-ID: <87bp1b57ew.fsf@gnu.org>
References: <204699.68103.qm@web37904.mail.mud.yahoo.com>
In-Reply-To: <204699.68103.qm@web37904.mail.mud.yahoo.com> (Mike Gran's message of "Wed, 16 Mar 2011 08:22:26 -0700 (PDT)")
To: Mike Gran
Cc: Mark H Weaver, "guile-devel@gnu.org"

Hi,

Mike Gran writes:

>> From:Ludovic Courtès
>
>> > I know of two categories of bugs.  One has to do with case conversions
>> > and case-insensitive comparisons, which must be done on entire strings
>> > but are currently done for each character.  Here are some examples:
>> >
>> >  (string-upcase "Straße")         => "STRAßE"   (should be "STRASSE")
>> >  (string-downcase "ΧΑΟΣΣ")        => "χαοσσ"    (should be "χαoσς")
>> >  (string-downcase "ΧΑΟΣ Σ")       => "χαοσ σ"   (should be "χαoς σ")
>> >  (string-ci=? "Straße" "Strasse") => #f         (should be #t)
>> >  (string-ci=? "ΧΑΟΣ" "χαoσ")      => #f         (should be #t)
>>
>> (Mike pointed out that SRFI-13 does not consider these bugs, but that’s
>> linguistically wrong so I’d consider it a bug.  Note that all these
>> functions are ‘linguistically buggy’ anyway since they don’t have a
>> locale argument, which breaks with Turkish ‘İ’.)
>>
>> Can we first check what would need to be done to fix this in 2.0.x?
>>
>> At first glance:
>>
>>   - “Straße” is normally stored as a Latin1 string, so it would need to
>>     be converted to UTF-* before it can be passed to one of the
>>     unicase.h functions.  *Or*, we could check with bug-libunistring
>>     what it would take to add Latin1 string case mapping functions.
>>
>>     Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
>>     one-to-one case mapping.  All other Latin1 strings can be handled by
>>     iterating over characters, as is currently done.
>
> There is the micro sign, which, when case folded, becomes a Greek mu.
> It is still a single character, but, it is the only latin-1 character that,
> when folded, becomes a non-Latin-1 character

Blech.

It would have worked better with narrow == ASCII instead of
narrow == Latin1.  It’s a change we can still make, I think.

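To make the Latin1 problem concrete, here is a minimal sketch in C of a full uppercase mapping for Latin1 input. It is not Guile or libunistring code: `latin1_upcase_full` is a hypothetical helper written only for illustration. It covers the ‘ß’ expansion and the micro sign discussed above; the ‘ÿ’ row is an addition taken from the Unicode case mappings (its uppercase, U+0178, also lies outside Latin1). The point it illustrates is that the result can be longer than the input and can contain code points above U+00FF, so it cannot always be stored back into a Latin1 buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: uppercase LEN Latin1
   bytes from IN into the UTF-32 buffer OUT, which must have room for
   2 * LEN code points.  Returns the number of code points written.  */
static size_t
latin1_upcase_full (const unsigned char *in, size_t len, uint32_t *out)
{
  size_t n = 0;

  assert (in != NULL && out != NULL);

  for (size_t i = 0; i < len; i++)
    {
      unsigned char c = in[i];

      if (c == 0xDF)
        {
          /* ß expands to two characters, "SS".  */
          out[n++] = 'S';
          out[n++] = 'S';
        }
      else if (c == 0xB5)
        /* Micro sign: its uppercase is GREEK CAPITAL LETTER MU.  */
        out[n++] = 0x039C;
      else if (c == 0xFF)
        /* ÿ: its uppercase is U+0178, outside Latin1.  */
        out[n++] = 0x0178;
      else if ((c >= 'a' && c <= 'z')
               || (c >= 0xE0 && c <= 0xFE && c != 0xF7))
        /* ASCII and accented lowercase letters sit 0x20 above their
           uppercase forms; 0xF7 is the division sign, not a letter.  */
        out[n++] = (uint32_t) c - 0x20;
      else
        out[n++] = c;
    }

  return n;
}
```

Called on the six Latin1 bytes of “Straße”, this writes the seven code points of “STRASSE”, which is the result the per-character approach cannot produce.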
>>   - Case insensitive comparison is more difficult, as you already
>>     pointed out.  To do it right we’d probably need to convert Latin1
>>     strings to UTF-32 and then pass it to u32_casecmp.  We don’t have to
>>     do the conversion every time, though: we could just change Latin1
>>     strings in-place so they now point to a wide stringbuf upon the
>>     first ‘string-ci=’.
>>
>> Thoughts?
>
> What about the srfi-13 case insensitive comparisons (the ones that don't
> terminate in question marks, like string-ci<)?  Should they remain
> as srfi-13 suggests, or should they remain similar in behavior
> to the question-mark-terminated comparisons?

Well, if maintaining two string comparison algorithms is reasonable,
then we can keep both; otherwise, I’d vote for the R6RS way.

> Mark is right that fixing this will not be pretty.  The case insensitive
> string comparisons, for example, could be patched like the attached
> snippet. If you don't find it too ugly of an approach, I could work on
> a real patch.

Indeed it’s quite inelegant.  ;-)

How about changing to narrow == ASCII and then string comparison would
be:

  if (narrow (s1) != narrow (s2))
    {
      /* Handle ß -> ss.  Exactly one of the two strings is narrow
         here; widen that one so both end up as UTF-32.  */
      if (narrow (s1))
        widify (s1);
      else
        widify (s2);
    }

  if (narrow (s1))
    /* S1 and S2 are ASCII.  */
    return strcmp (char_data (s1), char_data (s2));
  else
    /* S1 and S2 are UTF-32.  */
    return u32_cmp (wide_char_data (s1), wide_char_data (s2));

Looks like that would remain reasonable while actually fixing our
problems.

As a side-effect, though, scm_from_latin1_locale would become slightly
less efficient because it’d need to check for non-ASCII chars.

Thanks,
Ludo’.
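
For illustration, the narrow == ASCII idea can be fleshed out as a small self-contained C sketch. Everything in it (`toy_string`, `toy_from_latin1`, `toy_widen`, `toy_cmp`) is hypothetical and is not Guile's stringbuf API. Like the pseudocode in the message, it compares code points without any case folding; a real `string-ci` implementation would fold first, for instance with libunistring's `u32_casecmp`. The scan in `toy_from_latin1` is the extra cost mentioned for Latin1 input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy string: "narrow" means every character is ASCII.  */
typedef struct
{
  bool narrow;
  size_t len;            /* length in characters */
  unsigned char *chars;  /* used when narrow */
  uint32_t *wide;        /* used when not narrow */
} toy_string;

static void *
xmalloc (size_t size)
{
  void *p = malloc (size ? size : 1);
  if (p == NULL)
    abort ();
  return p;
}

/* Build a toy string from Latin1 bytes.  The loop looking for bytes
   >= 0x80 is the additional check that narrow == ASCII requires.  */
static toy_string
toy_from_latin1 (const unsigned char *s, size_t len)
{
  toy_string r = { true, len, NULL, NULL };

  assert (s != NULL);
  for (size_t i = 0; i < len; i++)
    if (s[i] >= 0x80)
      {
        r.narrow = false;
        break;
      }

  if (r.narrow)
    {
      r.chars = xmalloc (len);
      memcpy (r.chars, s, len);
    }
  else
    {
      r.wide = xmalloc (len * sizeof *r.wide);
      for (size_t i = 0; i < len; i++)
        r.wide[i] = s[i];  /* Latin1 bytes are U+0000..U+00FF */
    }
  return r;
}

/* Convert a narrow string to UTF-32 in place; no-op if already wide.  */
static void
toy_widen (toy_string *s)
{
  if (!s->narrow)
    return;
  uint32_t *w = xmalloc (s->len * sizeof *w);
  for (size_t i = 0; i < s->len; i++)
    w[i] = s->chars[i];
  free (s->chars);
  s->chars = NULL;
  s->wide = w;
  s->narrow = false;
}

/* Code point comparison: negative, zero or positive, like strcmp.  */
static int
toy_cmp (toy_string *a, toy_string *b)
{
  if (a->narrow != b->narrow)
    {
      /* Exactly one string is narrow: widen that one.  */
      if (a->narrow)
        toy_widen (a);
      else
        toy_widen (b);
    }

  size_t n = a->len < b->len ? a->len : b->len;
  for (size_t i = 0; i < n; i++)
    {
      uint32_t ca = a->narrow ? a->chars[i] : a->wide[i];
      uint32_t cb = b->narrow ? b->chars[i] : b->wide[i];
      if (ca != cb)
        return ca < cb ? -1 : 1;
    }
  return (a->len > b->len) - (a->len < b->len);
}
```

With this layout, “Strasse” is created narrow and “Straße” is created wide, and comparing them widens the former in place, so the conversion happens at most once per string.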