From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Mark H Weaver <mhw@netris.org>
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Using libunistring for string comparisons et al
Date: Tue, 15 Mar 2011 18:49:17 -0400
Message-ID: <87sjuokniq.fsf@netris.org>
References: <336042.33326.qm@web37901.mail.mud.yahoo.com>
	<878vwgmhah.fsf@netris.org>
	<511668.33680.qm@web37902.mail.mud.yahoo.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Trace: dough.gmane.org 1300230804 10943 80.91.229.12 (15 Mar 2011 23:13:24 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Tue, 15 Mar 2011 23:13:24 +0000 (UTC)
Cc: Ludovic =?utf-8?Q?Court=C3=A8s?= <ludo@gnu.org>, guile-devel@gnu.org
To: Mike Gran <spk121@yahoo.com>
Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Wed Mar 16 00:13:19 2011
Return-path: <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>
Envelope-to: guile-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.69)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1PzdQV-0002v5-JK
	for guile-devel@m.gmane.org; Wed, 16 Mar 2011 00:13:19 +0100
Original-Received: from localhost ([127.0.0.1]:33208 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1PzdQU-0000DR-VJ
	for guile-devel@m.gmane.org; Tue, 15 Mar 2011 19:13:19 -0400
Original-Received: from [140.186.70.92] (port=46551 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Pzd3f-0005T5-Vf
	for guile-devel@gnu.org; Tue, 15 Mar 2011 18:49:45 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mhw@netris.org>) id 1Pzd3e-0003vU-IG
	for guile-devel@gnu.org; Tue, 15 Mar 2011 18:49:43 -0400
Original-Received: from world.peace.net ([96.39.62.75]:35580)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mhw@netris.org>)
	id 1Pzd3c-0003v6-Hj; Tue, 15 Mar 2011 18:49:40 -0400
Original-Received: from 209-6-93-251.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com
	([209.6.93.251] helo=freedomincluded)
	by world.peace.net with esmtpa (Exim 4.69)
	(envelope-from <mhw@netris.org>)
	id 1Pzd3W-00078l-OA; Tue, 15 Mar 2011 18:49:34 -0400
Original-Received: from mhw by freedomincluded with local (Exim 4.69)
	(envelope-from <mhw@netris.org>)
	id 1Pzd3F-0001pv-Jn; Tue, 15 Mar 2011 18:49:17 -0400
In-Reply-To: <511668.33680.qm@web37902.mail.mud.yahoo.com> (Mike Gran's
	message of "Tue, 15 Mar 2011 13:39:02 -0700 (PDT)")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3)
X-Received-From: 96.39.62.75
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Developers list for Guile,
	the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/guile-devel>,
	<mailto:guile-devel-request@gnu.org?subject=subscribe>
Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.devel:11875
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/11875>

Mike Gran <spk121@yahoo.com> writes:
>> From:Mark H Weaver <mhw@netris.org>
>> Despite the similarity of these two representations, they are
>> sufficiently different that they cannot be handled by the same machine
>> code.=C2=A0 That means you must either implement multiple inner loops, o=
ne
>> for each combination of string parameter representations, or else you
>> must dispatch on the string representation within the inner loop.=C2=A0 =
On
>> modern architectures, wrongly predicted conditional branches are very
>> expensive.
>
> Keep in mind that the UTF-8 forward iterator operation has conditional
> branches.=C2=A0 Merely the act of advancing from one character to another
> could take one of four paths, or more if you include the possibility
> of invalid UTF-8 sequences.

Yes, but for many common operations including simple string comparison,
substring search, regexp search, and parsing, there is no need to know
where the character boundaries are.  These algorithms can be done
bytewise on UTF-8.

> Well, we covered O(1) vs O(n).=C2=A0 To make UTF-8 O(1), you need to store
> additional indexing information of some sort.=C2=A0 There are various sch=
emes,
> but, depending the the scheme, you lose some of memory advantage of UTF-8
> vs UTF-32.=C2=A0 You can likely to better than UTF-32, though.

I would prefer to either let our accessors be O(n), or else to create
the index lazily, i.e. on the first usage of string-ref or string-set!
In such a scheme, very few strings would include indices, and thus the
overhead would be minimal.

Anyway, the index overhead can be made arbitrarily small by increasing
the chunk size.  It is a classic time-space trade-off here.  The chunk
size could be made larger over the years, as usage of string-ref and
string-set! become less common, and eventually the index stuff could be
removed entirely.

> So, one of the supposed benefits of UTF-8 strings=C2=A0for=C2=A0Guile=C2=
=A0is that you
> can more easily pass information back and forth between Guile and external
> libraries or programs.=C2=A0 But in that case, when UTF-8 strings are rec=
eived
> by a=C2=A0UTF-8-enabled Guile from an external source,
> you have to=C2=A0deal with both its O(1)-ness

The indexing can be added lazily (and avoided more often than not).

> the validity of its UTF-8 sequences, and the validity of the
> codepoints that the UTF-8 sequences decode to (if the string is passed
> back from program or library that you don't completely trust.)

Yes, if we don't trust the library, then we'd have to validate the
UTF-8.  However, in many cases (e.g. when using libunistring) we could
trust them.

> So the question becomes *when* you'd deal with those questions.=C2=A0 The=
 best
> policy would be to deal with them at the point a UTF-8 string enters your
> application.=C2=A0 When a string enters your application, you should scru=
b it
> for invalid sequences and codepoints, throwing errors or replacing
> bad characters=C2=A0with the Unicode=C2=A0replacement character=C2=A0and =
whatnot.

I agree, we should make sure that all internal UTF-8 strings are valid.
As you rightly point out, postponing the validation would cause many
problems.

> And then there is a code-correctness argument against UTF-8.=C2=A0 It is =
just
> too easy to confuse a string's codepoint count with its memory size, and
> it is just too easy to confuse iteration over bytes with iteration over
> characters.

True, but there are comparable code-correctness problems with the
current scheme.  If we use UTF-8 uniformly, we can efficiently use
external libraries such as libunistring for many tasks, and avoid
writing that code ourselves.

In order to make our current scheme efficient, we must not only write
and import much of the code from libunistring and other libraries, but
we will in many cases need to write and maintain multiple versions of
those routines, to handle different combinations of string
representations.

The reason I am still arguing this point is because I have looked
seriously at what I would need to do to (A) fix our i18n problems and
(B) make the code efficient.  I very much want to fix these things,
but the pain of trying to do this with our current scheme is too much
for me to bear.  I shouldn't have to rewrite libunistring, and I
shouldn't have to write 3 or 4 different variants of each procedure
that takes two string parameters.

      Mark