From: Mike Gran
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Using libunistring for string comparisons et al
Date: Sat, 12 Mar 2011 13:28:05 -0800 (PST)
Message-ID: <336042.33326.qm@web37901.mail.mud.yahoo.com>
To: Mark H Weaver, Ludovic Courtès
Cc: "guile-devel@gnu.org"

> From: Mark H Weaver
>
> ludo@gnu.org (Ludovic Courtès) writes:
> > I find Cowan’s proposal for string iteration and the R6RS editors
> > response interesting:
> >
> >   http://www.r6rs.org/formal-comments/comment-235.txt
>
> Cowan was proposing a complex new API.  I am not, nor did Gauche.
> An efficient implementation of string ports is all that is needed.
>
> > I also think strings should remain what they currently are, with O(1)
> > random access.
>
> I understand your position, and perhaps you are right.
>
> Unfortunately, the alternatives are not pleasant.  We have a bunch of
> bugs in our string handling functions.  Currently, our case-insensitive
> string comparisons and case conversions are not correct for several
> languages including German, according to the R6RS among other things.

(Just as an aside, in discussions like this, I think it is important to
distinguish between
- where Guile doesn't match R6RS
- where R6RS doesn't match Unicode
- and where Unicode doesn't match reality

Each of these battles needs to be fought on the proper battlefield.)

> We could easily fix these problems by using libunistring, which provides
> the operations we need, but only if we use a single string
> representation, and one that is supported by libunistring (UTF-8,
> UTF-16, or UTF-32).

We do, in a manner of speaking, have a single string representation: UTF-32.
The 'narrow' encoding is UTF-32 with the three leading zero bytes of each
code point removed.

> Our use of two different internal string representations is another
> problem.  Right now, our string comparisons are painfully inefficient.
> Take a look at compare_strings in srfi-13.c.  It's also broken w.r.t.
> case-insensitive comparisons.  In order to fix this and make it
> efficient, we'll need to make several different variants:
>
>   * case-sensitive
>     * narrow-narrow
>     * narrow-wide
>     * wide-wide (use libunistring's u32_cmp2 for this)
>
>   * case-insensitive
>     * narrow-narrow
>     * narrow-wide
>     * wide-wide (use libunistring for this)
>
> The case-insensitive narrow-narrow comparison must be able to handle
> this, for example (from r6rs-lib):
>
>   (string-ci=? "Straße" "Strasse") => #t
>
> I'm not yet sure what's involved in implementing the case-insensitive
> narrow-wide case properly.

It is not too difficult in practice, I think.  Converting narrow (Latin-1,
aka truncated UTF-32) to wide (UTF-32) involves adding back in the
missing zero bytes.

For the sake of history, here's how we got to where we are now.
- R6RS says characters are Unicode codepoints
- R6RS says string ops are O(1)
- The only Unicode encoding that uses codepoints as its atomic units and
  is O(1) is UTF-32
- UTF-32 wastes space in most normal circumstances
Thus we invented the wide (UTF-32) and narrow (UTF-32 with initial
zeros stripped) encoding scheme we have now.  It may seem
suboptimal in terms of complexity and familiarity, but
what you see is an attempt to be optimal in terms of both memory and
O(1) access.

At one point I actually had a nearly complete version of Guile 1.8
that used UTF-8 and another that used UTF-32.  There are some
other reasons why UTF-8 is bad, which I could bore you with
ad nauseam.

Thanks,

Mike
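
For illustration, the narrow-to-wide widening described above, and the
case-sensitive narrow-wide comparison from the quoted list, can be
sketched in C roughly as follows.  This is only a sketch under stated
assumptions: the names widen_latin1 and cmp_narrow_wide are hypothetical
and are not taken from Guile or libunistring, and the sketch assumes a
narrow string is a plain array of Latin-1 bytes and a wide string is a
plain array of UTF-32 code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Guile's actual code): widen a narrow
   (Latin-1) buffer into a caller-provided UTF-32 buffer of at least
   `len` elements.  Each Latin-1 byte is the code point U+0000..U+00FF,
   so widening is just zero-extension, i.e. "adding back the missing
   zero bytes". */
static void
widen_latin1 (const uint8_t *narrow, size_t len, uint32_t *wide)
{
  for (size_t i = 0; i < len; i++)
    wide[i] = (uint32_t) narrow[i];
}

/* Hypothetical helper: case-sensitive, code-point-wise comparison of a
   narrow string against a wide string without allocating a widened
   copy.  Returns a negative value, zero, or a positive value, in the
   style of memcmp. */
static int
cmp_narrow_wide (const uint8_t *a, size_t alen,
                 const uint32_t *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen;

  for (size_t i = 0; i < n; i++)
    {
      uint32_t ca = a[i];       /* zero-extend on the fly */
      if (ca != b[i])
        return ca < b[i] ? -1 : 1;
    }

  /* All shared positions are equal: the shorter string sorts first. */
  if (alen == blen)
    return 0;
  return alen < blen ? -1 : 1;
}
```

Neither helper does any case folding, so neither addresses the
"Straße" / "Strasse" example; that would still need a Unicode-aware
case-insensitive comparison on the widened data.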