From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Mark H Weaver Newsgroups: gmane.lisp.guile.devel Subject: Re: Using libunistring for string comparisons et al Date: Tue, 15 Mar 2011 18:49:17 -0400 Message-ID: <87sjuokniq.fsf@netris.org> References: <336042.33326.qm@web37901.mail.mud.yahoo.com> <878vwgmhah.fsf@netris.org> <511668.33680.qm@web37902.mail.mud.yahoo.com> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Trace: dough.gmane.org 1300230804 10943 80.91.229.12 (15 Mar 2011 23:13:24 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Tue, 15 Mar 2011 23:13:24 +0000 (UTC) Cc: Ludovic =?utf-8?Q?Court=C3=A8s?= , guile-devel@gnu.org To: Mike Gran Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Wed Mar 16 00:13:19 2011 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1PzdQV-0002v5-JK for guile-devel@m.gmane.org; Wed, 16 Mar 2011 00:13:19 +0100 Original-Received: from localhost ([127.0.0.1]:33208 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1PzdQU-0000DR-VJ for guile-devel@m.gmane.org; Tue, 15 Mar 2011 19:13:19 -0400 Original-Received: from [140.186.70.92] (port=46551 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Pzd3f-0005T5-Vf for guile-devel@gnu.org; Tue, 15 Mar 2011 18:49:45 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Pzd3e-0003vU-IG for guile-devel@gnu.org; Tue, 15 Mar 2011 18:49:43 -0400 Original-Received: from world.peace.net ([96.39.62.75]:35580) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Pzd3c-0003v6-Hj; Tue, 15 Mar 2011 18:49:40 -0400 Original-Received: from 209-6-93-251.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com ([209.6.93.251] helo=freedomincluded) by world.peace.net with esmtpa (Exim 4.69) (envelope-from ) id 1Pzd3W-00078l-OA; Tue, 15 Mar 2011 18:49:34 -0400 Original-Received: from mhw by freedomincluded with local (Exim 4.69) (envelope-from ) id 1Pzd3F-0001pv-Jn; Tue, 15 Mar 2011 18:49:17 -0400 In-Reply-To: <511668.33680.qm@web37902.mail.mud.yahoo.com> (Mike Gran's message of "Tue, 15 Mar 2011 13:39:02 -0700 (PDT)") User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3) X-Received-From: 96.39.62.75 X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.lisp.guile.devel:11875 Archived-At: Mike Gran writes: >> From:Mark H Weaver >> Despite the similarity of these two representations, they are >> sufficiently different that they cannot be handled by the same machine >> code.=C2=A0 That means you must either implement multiple inner loops, o= ne >> for each combination of string parameter representations, or else you >> must dispatch on the string representation within the inner loop.=C2=A0 = On >> modern architectures, wrongly predicted conditional branches are very >> expensive. > > Keep in mind that the UTF-8 forward iterator operation has conditional > branches.=C2=A0 Merely the act of advancing from one character to another > could take one of four paths, or more if you include the possibility > of invalid UTF-8 sequences. Yes, but for many common operations including simple string comparison, substring search, regexp search, and parsing, there is no need to know where the character boundaries are. These algorithms can be done bytewise on UTF-8. > Well, we covered O(1) vs O(n).=C2=A0 To make UTF-8 O(1), you need to store > additional indexing information of some sort.=C2=A0 There are various sch= emes, > but, depending the the scheme, you lose some of memory advantage of UTF-8 > vs UTF-32.=C2=A0 You can likely to better than UTF-32, though. I would prefer to either let our accessors be O(n), or else to create the index lazily, i.e. on the first usage of string-ref or string-set! In such a scheme, very few strings would include indices, and thus the overhead would be minimal. Anyway, the index overhead can be made arbitrarily small by increasing the chunk size. It is a classic time-space trade-off here. The chunk size could be made larger over the years, as usage of string-ref and string-set! become less common, and eventually the index stuff could be removed entirely. > So, one of the supposed benefits of UTF-8 strings=C2=A0for=C2=A0Guile=C2= =A0is that you > can more easily pass information back and forth between Guile and external > libraries or programs.=C2=A0 But in that case, when UTF-8 strings are rec= eived > by a=C2=A0UTF-8-enabled Guile from an external source, > you have to=C2=A0deal with both its O(1)-ness The indexing can be added lazily (and avoided more often than not). > the validity of its UTF-8 sequences, and the validity of the > codepoints that the UTF-8 sequences decode to (if the string is passed > back from program or library that you don't completely trust.) Yes, if we don't trust the library, then we'd have to validate the UTF-8. However, in many cases (e.g. when using libunistring) we could trust them. > So the question becomes *when* you'd deal with those questions.=C2=A0 The= best > policy would be to deal with them at the point a UTF-8 string enters your > application.=C2=A0 When a string enters your application, you should scru= b it > for invalid sequences and codepoints, throwing errors or replacing > bad characters=C2=A0with the Unicode=C2=A0replacement character=C2=A0and = whatnot. I agree, we should make sure that all internal UTF-8 strings are valid. As you rightly point out, postponing the validation would cause many problems. > And then there is a code-correctness argument against UTF-8.=C2=A0 It is = just > too easy to confuse a string's codepoint count with its memory size, and > it is just too easy to confuse iteration over bytes with iteration over > characters. True, but there are comparable code-correctness problems with the current scheme. If we use UTF-8 uniformly, we can efficiently use external libraries such as libunistring for many tasks, and avoid writing that code ourselves. In order to make our current scheme efficient, we must not only write and import much of the code from libunistring and other libraries, but we will in many cases need to write and maintain multiple versions of those routines, to handle different combinations of string representations. The reason I am still arguing this point is because I have looked seriously at what I would need to do to (A) fix our i18n problems and (B) make the code efficient. I very much want to fix these things, but the pain of trying to do this with our current scheme is too much for me to bear. I shouldn't have to rewrite libunistring, and I shouldn't have to write 3 or 4 different variants of each procedure that takes two string parameters. Mark