unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
From: Mark H Weaver <mhw@netris.org>
To: Mike Gran <spk121@yahoo.com>
Cc: "Ludovic Courtès" <ludo@gnu.org>, guile-devel@gnu.org
Subject: Re: Using libunistring for string comparisons et al
Date: Tue, 15 Mar 2011 18:49:17 -0400	[thread overview]
Message-ID: <87sjuokniq.fsf@netris.org> (raw)
In-Reply-To: <511668.33680.qm@web37902.mail.mud.yahoo.com> (Mike Gran's message of "Tue, 15 Mar 2011 13:39:02 -0700 (PDT)")

Mike Gran <spk121@yahoo.com> writes:
>> From:Mark H Weaver <mhw@netris.org>
>> Despite the similarity of these two representations, they are
>> sufficiently different that they cannot be handled by the same machine
>> code.  That means you must either implement multiple inner loops, one
>> for each combination of string parameter representations, or else you
>> must dispatch on the string representation within the inner loop.  On
>> modern architectures, wrongly predicted conditional branches are very
>> expensive.
>
> Keep in mind that the UTF-8 forward iterator operation has conditional
> branches.  Merely the act of advancing from one character to another
> could take one of four paths, or more if you include the possibility
> of invalid UTF-8 sequences.

Yes, but for many common operations including simple string comparison,
substring search, regexp search, and parsing, there is no need to know
where the character boundaries are.  These algorithms can be done
bytewise on UTF-8.

> Well, we covered O(1) vs O(n).  To make UTF-8 O(1), you need to store
> additional indexing information of some sort.  There are various schemes,
> but, depending the the scheme, you lose some of memory advantage of UTF-8
> vs UTF-32.  You can likely to better than UTF-32, though.

I would prefer to either let our accessors be O(n), or else to create
the index lazily, i.e. on the first usage of string-ref or string-set!
In such a scheme, very few strings would include indices, and thus the
overhead would be minimal.

Anyway, the index overhead can be made arbitrarily small by increasing
the chunk size.  It is a classic time-space trade-off here.  The chunk
size could be made larger over the years, as usage of string-ref and
string-set! become less common, and eventually the index stuff could be
removed entirely.

> So, one of the supposed benefits of UTF-8 strings for Guile is that you
> can more easily pass information back and forth between Guile and external
> libraries or programs.  But in that case, when UTF-8 strings are received
> by a UTF-8-enabled Guile from an external source,
> you have to deal with both its O(1)-ness

The indexing can be added lazily (and avoided more often than not).

> the validity of its UTF-8 sequences, and the validity of the
> codepoints that the UTF-8 sequences decode to (if the string is passed
> back from program or library that you don't completely trust.)

Yes, if we don't trust the library, then we'd have to validate the
UTF-8.  However, in many cases (e.g. when using libunistring) we could
trust them.

> So the question becomes *when* you'd deal with those questions.  The best
> policy would be to deal with them at the point a UTF-8 string enters your
> application.  When a string enters your application, you should scrub it
> for invalid sequences and codepoints, throwing errors or replacing
> bad characters with the Unicode replacement character and whatnot.

I agree, we should make sure that all internal UTF-8 strings are valid.
As you rightly point out, postponing the validation would cause many
problems.

> And then there is a code-correctness argument against UTF-8.  It is just
> too easy to confuse a string's codepoint count with its memory size, and
> it is just too easy to confuse iteration over bytes with iteration over
> characters.

True, but there are comparable code-correctness problems with the
current scheme.  If we use UTF-8 uniformly, we can efficiently use
external libraries such as libunistring for many tasks, and avoid
writing that code ourselves.

In order to make our current scheme efficient, we must not only write
and import much of the code from libunistring and other libraries, but
we will in many cases need to write and maintain multiple versions of
those routines, to handle different combinations of string
representations.

The reason I am still arguing this point is because I have looked
seriously at what I would need to do to (A) fix our i18n problems and
(B) make the code efficient.  I very much want to fix these things,
but the pain of trying to do this with our current scheme is too much
for me to bear.  I shouldn't have to rewrite libunistring, and I
shouldn't have to write 3 or 4 different variants of each procedure
that takes two string parameters.

      Mark



  reply	other threads:[~2011-03-15 22:49 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-03-12 21:28 Using libunistring for string comparisons et al Mike Gran
2011-03-15 17:20 ` Mark H Weaver
2011-03-15 20:39   ` Mike Gran
2011-03-15 22:49     ` Mark H Weaver [this message]
2011-03-16  0:01       ` Mike Gran
2011-03-16  1:12         ` Mark H Weaver
2011-03-16 11:26           ` Ludovic Courtès
2011-03-17 15:38             ` Mark H Weaver
2011-03-17 15:56               ` Ludovic Courtès
2011-03-17 17:58                 ` Mark H Weaver
2011-03-18  0:10                   ` Thien-Thi Nguyen
2011-03-18  1:38                     ` Mark H Weaver
2011-03-18  8:46                       ` Thien-Thi Nguyen
2011-03-18 12:05                         ` Mark H Weaver
2011-03-20 22:12                   ` Ludovic Courtès
2011-03-30 10:14                     ` Andy Wingo
2011-03-17 21:47           ` Ludovic Courtès
2011-03-19 12:31           ` Andy Wingo
2011-03-19 14:06             ` Mark H Weaver
2011-03-19 14:53               ` Noah Lavine
2011-03-19 15:49                 ` Mark H Weaver
2011-03-19 15:08               ` Andy Wingo
2011-03-19 19:43                 ` Mark H Weaver
2011-03-19 16:37               ` Mark H Weaver
2011-03-20 21:49               ` Ludovic Courtès
2011-03-30  9:50               ` Andy Wingo
2011-03-29 12:39             ` Peter Brett
2011-03-29 13:35               ` Andy Wingo
2011-03-29 21:15               ` Ludovic Courtès
2011-03-31 14:59                 ` Peter Brett
2011-03-31 20:12                   ` Ludovic Courtès
2011-03-30  9:33       ` Andy Wingo
2011-03-16  0:22     ` Alex Shinn
  -- strict thread matches above, loose matches on Subject: below --
2011-03-17 18:07 Mike Gran
2011-03-16 15:22 Mike Gran
2011-03-16 16:58 ` Ludovic Courtès
2011-03-16  2:03 Mike Gran
2011-03-16  1:30 Mike Gran
2011-03-11  0:54 uc_tolower (uc_toupper (x)) Mike Gran
2011-03-11 22:33 ` Using libunistring for string comparisons et al Mark H Weaver
2011-03-11 22:36   ` Mark H Weaver
2011-03-11 23:09   ` Mark H Weaver
2011-03-12 13:46     ` Ludovic Courtès
2011-03-12 17:28       ` Mark H Weaver
2011-03-13 21:30         ` Ludovic Courtès
2011-03-30  9:05           ` Andy Wingo
2011-03-30  9:03     ` Andy Wingo
2011-03-31 14:19       ` Ludovic Courtès
2011-03-12 13:36   ` Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87sjuokniq.fsf@netris.org \
    --to=mhw@netris.org \
    --cc=guile-devel@gnu.org \
    --cc=ludo@gnu.org \
    --cc=spk121@yahoo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).