From: Mike Gran
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Using libunistring for string comparisons et al
Date: Thu, 17 Mar 2011 11:07:58 -0700 (PDT)
Message-ID: <804418.17764.qm@web37908.mail.mud.yahoo.com>
To: Ludovic Courtès
Cc: Mark H Weaver, "guile-devel@gnu.org"

> From: Ludovic Courtès
> >> Can we first check what would need to be done to fix this in 2.0.x?
> >>
> >> At first glance:
> >>
> >>   - “Straße” is normally stored as a Latin1 string, so it would need to
> >>     be converted to UTF-* before it can be passed to one of the
> >>     unicase.h functions.  *Or*, we could check with bug-libunistring
> >>     what it would take to add Latin1 string case mapping functions.
> >>
> >>     Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
> >>     one-to-one case mapping.  All other Latin1 strings can be handled by
> >>     iterating over characters, as is currently done.
> >
> > There is the micro sign, which, when case folded, becomes a Greek mu.
> > It is still a single character, but, it is the only latin-1 character that,
> > when folded, becomes a non-Latin-1 character
>
> Blech.
>
> It would have worked better with narrow == ASCII instead of
> narrow == Latin1.  It’s a change we can still make, I think.

It would be easy enough to do.  If someone were to fight for
a narrow encoding of Latin-1, I would expect it to be you, since
you're the only committer whose name requires ISO-8859-1.
So if you're okay with it, who am I to complain?

>
> >>   - Case insensitive comparison is more difficult, as you already
> >>     pointed out.  To do it right we’d probably need to convert Latin1
> >>     strings to UTF-32 and then pass it to u32_casecmp.  We don’t have to
> >>     do the conversion every time, though: we could just change Latin1
> >>     strings in-place so they now point to a wide stringbuf upon the
> >>     first ‘string-ci=’.
> >>
> >> Thoughts?
> >

[...]

>
> Indeed it’s quite inelegant.  ;-)
>
> How about changing to narrow == ASCII and then string comparison would
> be:
>
>   if (narrow (s1) != narrow (s2))
>     {

It would be easier and cleaner, as you demonstrate.

I guess the question is about future-proofing.  If the complications with the Latin-1
/ UTF-32 dual encoding are constrained to upcase/downcase and string-ci
comparison ops, then it doesn't seem worth it to change it.  But if it is going
to cause endless problems down the road, ASCII/UTF-32 is simpler.

A lot of this debate is about expectations, I think.  For my part, I think that
the string-ci ops only have real value for English language and ASCII text.
For non-English non-ASCII processing, sorting case-insensitively by numeric
codepoint values in the absence of locale sorting rules seems like an odd thing
to want to do.

So I guess I'm not bothered with the ugly C necessary to make ISO-8859-1 work.
It is bad for string-ci ops but not too bad for upcase/downcase.  I also am
not too concerned that string-ci comparison ops for non-English non-ASCII
processing may be inefficient.  It does seem vital that string-locale comparison
ops be efficient, though.

Thanks,
Mike
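
For illustration only, the two Latin-1 special cases mentioned in this thread (sharp s and the micro sign) could be sketched in C roughly as below. This is not Guile or libunistring code: `latin1_casefold` and `latin1_folds_narrow` are hypothetical names made up for this sketch, and the mapping assumes Unicode full case folding restricted to the Latin-1 range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Guile or libunistring function): case-fold
   one Latin-1 character.  Writes 1 or 2 code points to OUT and returns
   how many were written.  */
size_t
latin1_casefold (uint8_t c, uint32_t out[2])
{
  assert (out != NULL);

  if (c >= 'A' && c <= 'Z')                  /* ASCII capitals */
    {
      out[0] = (uint32_t) c + 0x20;
      return 1;
    }
  if (c >= 0xC0 && c <= 0xDE && c != 0xD7)   /* U+00C0..U+00DE, skipping the multiplication sign */
    {
      out[0] = (uint32_t) c + 0x20;
      return 1;
    }
  if (c == 0xB5)                             /* micro sign folds to U+03BC, outside Latin-1 */
    {
      out[0] = 0x03BC;
      return 1;
    }
  if (c == 0xDF)                             /* sharp s folds to two characters, "ss" */
    {
      out[0] = 's';
      out[1] = 's';
      return 2;
    }
  out[0] = c;                                /* everything else folds to itself */
  return 1;
}

/* Hypothetical helper: return 1 if S can be folded character by character
   and still fit a narrow (Latin-1) buffer of the same length, 0 if not.  */
int
latin1_folds_narrow (const uint8_t *s, size_t len)
{
  for (size_t i = 0; i < len; i++)
    {
      uint32_t out[2];
      if (latin1_casefold (s[i], out) != 1 || out[0] > 0xFF)
        return 0;
    }
  return 1;
}
```

Under these assumptions, a string containing either character cannot be folded in place in a narrow buffer, which is the case that forces a conversion to a wide representation first.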