unofficial mirror of guile-devel@gnu.org 
* Re: uc_tolower (uc_toupper (x))
@ 2011-03-11  0:54 Mike Gran
  2011-03-11 22:33 ` Using libunistring for string comparisons et al Mark H Weaver
  0 siblings, 1 reply; 55+ messages in thread
From: Mike Gran @ 2011-03-11  0:54 UTC (permalink / raw)
  To: Mark H Weaver, guile-devel@gnu.org

> From: Mark H Weaver <mhw@netris.org>
> To: guile-devel@gnu.org
> Cc:
> Sent: Thursday, March 10, 2011 3:39 PM
> Subject: uc_tolower (uc_toupper (x))
> 
> I've noticed that srfi-13.c very frequently does:
> 
>   uc_tolower (uc_toupper (x))
> 
> Is there a good reason to do this instead of:
> 
>   uc_tolower (x)

Unicode defines a case folding algorithm as well as
a data table for case insensitive sorting.  Setting
things to lowercase is a decent approximation of
case folding.  But doing the upper->lower operation picks
up a few more of the corner cases, like U+03C2 GREEK
SMALL LETTER FINAL SIGMA and U+03C3 GREEK SMALL LETTER SIGMA
which are the same letter with different representations,
or U+00B5 MICRO SIGN and U+039C GREEK SMALL LETTER MU
which are supposed to have the same sort ordering.

Now that we've pulled in all of libunistring, it might
be a good idea to see if it has a complete implementation
of Unicode case folding, because upper->lower is also not
completely correct.
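For readers curious why the round trip matters: Python's built-in case
mappings follow the same Unicode default mappings that libunistring's
uc_toupper/uc_tolower implement, so the effect described above can be
sketched as follows (illustrative, not Guile code; the helper name is
made up):

```python
# Sketch of why uc_tolower (uc_toupper (x)) beats plain uc_tolower (x),
# using Python's Unicode default case mappings as a stand-in.

def fold_via_upper_lower(ch):
    return ch.upper().lower()   # the upper->lower round trip

# U+00B5 MICRO SIGN: lower() alone leaves it distinct from U+03BC, but
# upper() maps it to U+039C GREEK CAPITAL LETTER MU, which lowercases
# to U+03BC GREEK SMALL LETTER MU -- the round trip unifies the two.
assert '\u00b5'.lower() == '\u00b5'
assert fold_via_upper_lower('\u00b5') == '\u03bc'

# U+03C2 FINAL SIGMA uppercases to U+03A3, which lowercases to the
# ordinary U+03C3 SMALL SIGMA; lower() alone does not unify them.
assert '\u03c2'.lower() == '\u03c2'
assert fold_via_upper_lower('\u03c2') == '\u03c3'
```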

-Mike



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Using libunistring for string comparisons et al
  2011-03-11  0:54 uc_tolower (uc_toupper (x)) Mike Gran
@ 2011-03-11 22:33 ` Mark H Weaver
  2011-03-11 22:36   ` Mark H Weaver
                     ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-11 22:33 UTC (permalink / raw)
  To: Mike Gran; +Cc: guile-devel@gnu.org

Mike Gran <spk121@yahoo.com> writes:
> [...] But doing the upper->lower operation picks
> up a few more of the corner cases, like U+03C2 GREEK
> SMALL LETTER FINAL SIGMA and U+03C3 GREEK SMALL LETTER SIGMA
> which are the same letter with different representations,
> or U+00B5 MICRO SIGN and U+039C GREEK SMALL LETTER MU
> which are supposed to have the same sort ordering.

Ah, okay.  Makes sense.

> Now that we've pulled in all of libunistring, it might
> be a good idea to see if it has a complete implementation
> of unicode case folding, because upper->lower is also not
> completely correct.

I looked into this.  Indeed, the libunistring documentation mentions
that in some languages (e.g. German), the to_upper and to_lower
conversions cannot be done properly on a per-character basis, because
the number of characters can change.  These operations must be done on an
entire string.  For example:

<http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-2.html>

  (string-upcase "Straße") => "STRASSE"
  (string-foldcase "Straße") => "strasse"

libunistring contains all the necessary functions, including
case-insensitive string comparisons.  However, the only string
representations supported by these operations are: UTF-8, UTF-16,
UTF-32, or locale-encoded strings, and for comparisons both strings must
be the same encoding.

I'm aware that this proposal will be very controversial, but starting in
Guile 2.2, I think we ought to consider storing strings internally in
UTF-8, as is done in Gauche.  This would of course make string-ref and
string-set! into O(n) operations.  However, I claim that any code that
depends on string-ref and string-set! could be better written 




* Re: Using libunistring for string comparisons et al
  2011-03-11 22:33 ` Using libunistring for string comparisons et al Mark H Weaver
@ 2011-03-11 22:36   ` Mark H Weaver
  2011-03-11 23:09   ` Mark H Weaver
  2011-03-12 13:36   ` Ludovic Courtès
  2 siblings, 0 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-11 22:36 UTC (permalink / raw)
  To: Mike Gran; +Cc: guile-devel@gnu.org

Sorry, I accidentally sent out an only partly-written draft message.
Please disregard for now; I will finish writing it later.

     Mark




* Re: Using libunistring for string comparisons et al
  2011-03-11 22:33 ` Using libunistring for string comparisons et al Mark H Weaver
  2011-03-11 22:36   ` Mark H Weaver
@ 2011-03-11 23:09   ` Mark H Weaver
  2011-03-12 13:46     ` Ludovic Courtès
  2011-03-30  9:03     ` Using libunistring for string comparisons et al Andy Wingo
  2011-03-12 13:36   ` Ludovic Courtès
  2 siblings, 2 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-11 23:09 UTC (permalink / raw)
  To: Mike Gran; +Cc: guile-devel

I wrote:
> I'm aware that this proposal will be very controversial, but starting in
> Guile 2.2, I think we ought to consider storing strings internally in
> UTF-8, as is done in Gauche.  This would of course make string-ref and
> string-set! into O(n) operations.  However, I claim that any code that
> depends on string-ref and string-set! could be better written 

I guess I'd better write at least a few more arguments now, before minds
become hardened against it :)

It's a mistake to think of strings as arrays of characters.  This is an
adequate model for simple scripts, but not for more complex ones.  Even
Gerald Sussman said so in his recent WG1 ballot.

Gerald Sussman wrote in http://trac.sacrideo.us/wg/wiki/WG1BallotSussman
> It is not very good to think of strings as 1-dimensional arrays of
> characters.  What about accents and other hairy stuff? Be afraid!
> Consider the complexity of Unicode!

I claim that any reasonable code which currently uses string-ref and
string-set! could be more cleanly written using string ports or
string-{fold,unfold}{,-right}.

Anyway, I don't have time right now to write as persuasive an argument
as I'd like to.  After 2.0.1 perhaps I will try again.

      Mark




* Re: Using libunistring for string comparisons et al
  2011-03-11 22:33 ` Using libunistring for string comparisons et al Mark H Weaver
  2011-03-11 22:36   ` Mark H Weaver
  2011-03-11 23:09   ` Mark H Weaver
@ 2011-03-12 13:36   ` Ludovic Courtès
  2 siblings, 0 replies; 55+ messages in thread
From: Ludovic Courtès @ 2011-03-12 13:36 UTC (permalink / raw)
  To: guile-devel

Hello!

Mark H Weaver <mhw@netris.org> writes:

> I'm aware that this proposal will be very controversial, but starting in
> Guile 2.2, I think we ought to consider storing strings internally in
> UTF-8, as is done in Gauche.

I don’t think so.

Thanks,
Ludo’.





* Re: Using libunistring for string comparisons et al
  2011-03-11 23:09   ` Mark H Weaver
@ 2011-03-12 13:46     ` Ludovic Courtès
  2011-03-12 17:28       ` Mark H Weaver
  2011-03-13  4:05       ` O(1) accessors for UTF-8 backed strings Mark H Weaver
  2011-03-30  9:03     ` Using libunistring for string comparisons et al Andy Wingo
  1 sibling, 2 replies; 55+ messages in thread
From: Ludovic Courtès @ 2011-03-12 13:46 UTC (permalink / raw)
  To: guile-devel

Hello!

Mark H Weaver <mhw@netris.org> writes:

> I claim that any reasonable code which currently uses string-ref and
> string-set! could be more cleanly written using string ports or
> string-{fold,unfold}{,-right}.

I agree, and we should encourage this.  However...

I find Cowan’s proposal for string iteration and the R6RS editors
response interesting:

  http://www.r6rs.org/formal-comments/comment-235.txt

I also think strings should remain what they currently are, with O(1)
random access.

I think anything beyond that may require a new API, perhaps with a new
data type, which would deserve a SRFI with wider discussion.

Thanks,
Ludo’.





* Re: Using libunistring for string comparisons et al
  2011-03-12 13:46     ` Ludovic Courtès
@ 2011-03-12 17:28       ` Mark H Weaver
  2011-03-13 21:30         ` Ludovic Courtès
  2011-03-13  4:05       ` O(1) accessors for UTF-8 backed strings Mark H Weaver
  1 sibling, 1 reply; 55+ messages in thread
From: Mark H Weaver @ 2011-03-12 17:28 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-devel

ludo@gnu.org (Ludovic Courtès) writes:
> I find Cowan’s proposal for string iteration and the R6RS editors
> response interesting:
>
>   http://www.r6rs.org/formal-comments/comment-235.txt

Cowan was proposing a complex new API.  I am not, nor did Gauche.
An efficient implementation of string ports is all that is needed.

> I also think strings should remain what they currently are, with O(1)
> random access.

I understand your position, and perhaps you are right.

Unfortunately, the alternatives are not pleasant.  We have a bunch of
bugs in our string handling functions.  Currently, our case-insensitive
string comparisons and case conversions are not correct for several
languages including German, according to the R6RS among other things.

We could easily fix these problems by using libunistring, which provides
the operations we need, but only if we use a single string
representation, and one that is supported by libunistring (UTF-8,
UTF-16, or UTF-32).

So, our options appear to be:

  * Use only wide strings internally.

  * Reimplement several complex functions from libunistring within guile
    (string comparisons and case conversions).

  * Convert strings to a libunistring-supported representation, and
    possibly back again, on each operation.  For example, this will be
    needed when comparing two narrow strings, when comparing a narrow
    string to a wide string, or when applying a case conversion to a
    narrow string.

Our use of two different internal string representations is another
problem.  Right now, our string comparisons are painfully inefficient.
Take a look at compare_strings in srfi-13.c.  It's also broken w.r.t.
case-insensitive comparisons.  In order to fix this and make it
efficient, we'll need to make several different variants:

  * case-sensitive
     * narrow-narrow
     * narrow-wide
     * wide-wide (use libunistring's u32_cmp2 for this)

  * case-insensitive
     * narrow-narrow
     * narrow-wide
     * wide-wide (use libunistring for this)

The case-insensitive narrow-narrow comparison must be able to handle
this, for example (from r6rs-lib):

  (string-ci=? "Straße" "Strasse") => #t

I'm not yet sure what's involved in implementing the case-insensitive
narrow-wide case properly.
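For reference, the behavior the case-insensitive comparisons need is
Unicode full case folding, a sketch of which can be given with Python's
str.casefold, which implements that folding (the helper name is made up;
this is not Guile's implementation):

```python
# Case-insensitive comparison via Unicode full case folding, the
# operation R6RS string-ci=? requires; per-character lowercasing
# cannot pass the Strasse test because 'ß' must fold to "ss".

def string_ci_eq(a, b):
    return a.casefold() == b.casefold()

assert string_ci_eq("Straße", "Strasse")       # 'ß' folds to "ss"
assert not string_ci_eq("Straße", "Strasze")
assert "Straße".lower() != "Strasse".lower()   # lowercasing alone fails
```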

    Best,
     Mark




* Re: Using libunistring for string comparisons et al
@ 2011-03-12 21:28 Mike Gran
  2011-03-15 17:20 ` Mark H Weaver
  0 siblings, 1 reply; 55+ messages in thread
From: Mike Gran @ 2011-03-12 21:28 UTC (permalink / raw)
  To: Mark H Weaver, Ludovic Courtès; +Cc: guile-devel@gnu.org

> From: Mark H Weaver <mhw@netris.org>

> 
> ludo@gnu.org (Ludovic Courtès) writes:
> > I find Cowan’s proposal for string iteration and the R6RS editors
> > response interesting:
> >
> >   http://www.r6rs.org/formal-comments/comment-235.txt
> 
> Cowan was proposing a complex new API.  I am not, nor did Gauche.
> An efficient implementation of string ports is all that is needed.
> 
> > I also think strings should remain what they currently are, with O(1)
> > random access.
> 
> I understand your position, and perhaps you are right.
> 
> Unfortunately, the alternatives are not pleasant.  We have a bunch of
> bugs in our string handling functions.  Currently, our case-insensitive
> string comparisons and case conversions are not correct for several
> languages including German, according to the R6RS among other things.

(Just as an aside, in discussions like this, I think it important to distinguish
between
- where Guile doesn't match R6RS
- where R6RS doesn't match Unicode
- and where Unicode doesn't match reality

Each of these battles needs to be fought in the proper battlefield.)

> We could easily fix these problems by using libunistring, which provides
> the operations we need, but only if we use a single string
> representation, and one that is supported by libunistring (UTF-8,
> UTF-16, or UTF-32).

We do, in a manner of speaking, have a single string representation: UTF-32.
The 'narrow' encoding is UTF-32 with the initial 3 bytes of zero removed.

> Our use of two different internal string representations is another
> problem.  Right now, our string comparisons are painfully inefficient.
> Take a look at compare_strings in srfi-13.c.  It's also broken w.r.t.
> case-insensitive comparisons.  In order to fix this and make it
> efficient, we'll need to make several different variants:
> 
>   * case-sensitive
>      * narrow-narrow
>      * narrow-wide
>      * wide-wide (use libunistring's u32_cmp2 for this)
> 
>   * case-insensitive
>      * narrow-narrow
>      * narrow-wide
>      * wide-wide (use libunistring for this)
> 
> The case-insensitive narrow-narrow comparison must be able to handle
> this, for example (from r6rs-lib):
> 
>   (string-ci=? "Straße" "Strasse") => #t
> 
> I'm not yet sure what's involved in implementing the case-insensitive
> narrow-wide case properly.

It is not too difficult in practice, I think.  Converting narrow (Latin-1
aka truncated UTF-32) to wide (UTF-32) involves adding back
in the missing zero bytes.
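The widening described above can be sketched like this (illustrative
Python; the variable names are made up):

```python
# Widening a narrow (Latin-1) string to UTF-32 code points is just
# zero-extension, because Latin-1 byte values coincide with the first
# 256 Unicode code points.

narrow = "Straße".encode("latin-1")        # one byte per character
wide = [b for b in narrow]                 # zero-extend each byte
assert wide == [ord(c) for c in "Straße"]  # identical code points
```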

For the sake of history, here's how we got to where we are now.
- R6RS says characters are Unicode codepoints
- R6RS says string ops are O(1)
- The only Unicode encoding that uses codepoints as its atomic units and 
  is O(1) is UTF-32
- UTF-32 wastes space for most normal circumstances
Thus we invented the wide (UTF-32) / narrow (UTF-32 with initial
zeros stripped) encoding scheme we have now.  It may seem
suboptimal in terms of complexity and familiarity, but
what you see is an attempt to be optimal in terms of memory
use and O(1) access.

I actually at one point had a nearly complete version of Guile 1.8
that used UTF-8 and another that used UTF-32.  There are some
other reasons why UTF-8 is bad, which I could bore you with
ad nauseam.

Thanks,

Mike





* O(1) accessors for UTF-8 backed strings
  2011-03-12 13:46     ` Ludovic Courtès
  2011-03-12 17:28       ` Mark H Weaver
@ 2011-03-13  4:05       ` Mark H Weaver
  2011-03-13  4:42         ` Alex Shinn
  2011-03-30  9:20         ` Andy Wingo
  1 sibling, 2 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-13  4:05 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-devel

ludo@gnu.org (Ludovic Courtès) writes:
> I also think strings should remain what they currently are, with O(1)
> random access.

I just realized that it is possible to implement O(1) accessors for
UTF-8 backed strings.

The simplest solution is to break the stringbuf into chunks, where each
chunk holds exactly CHUNK_SIZE characters, except for the last chunk
which might have fewer.  To find a character at index i, we scan forward
(i % CHUNK_SIZE) characters from the beginning of chunk[i / CHUNK_SIZE].
Ideally, CHUNK_SIZE is a power of 2, which allows us to use bitwise
operations instead of division.  The number of characters to scan is
bounded by CHUNK_SIZE, and therefore takes O(1) time.

String-set! in general requires us to resize a chunk.  Although chunks
contain a constant number of characters, that of course maps to a
variable number of bytes.  There are various tricks we can do to reduce
the number of reallocations done, but even in the worst case, the time
spent is bounded by CHUNK_SIZE, and thus O(1).
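A minimal sketch of the chunked scheme described above (all names and
the CHUNK_SIZE value are illustrative, not Guile's actual
implementation):

```python
# Chunked UTF-8 stringbuf: each chunk holds exactly CHUNK_SIZE
# characters (the last may be shorter), so finding character i scans
# at most CHUNK_SIZE - 1 characters within one chunk -- O(1).

CHUNK_SIZE = 64   # a power of 2, so / and % become shift and mask

def make_chunked(s):
    """Break a string into CHUNK_SIZE-character chunks of UTF-8 bytes."""
    return [s[i:i + CHUNK_SIZE].encode("utf-8")
            for i in range(0, len(s), CHUNK_SIZE)]

def utf8_len(lead):
    """Byte length of a UTF-8 sequence, given its lead byte."""
    if lead < 0x80: return 1
    if lead < 0xE0: return 2
    if lead < 0xF0: return 3
    return 4

def string_ref(chunks, i):
    chunk = chunks[i // CHUNK_SIZE]     # a shift when CHUNK_SIZE = 2^k
    pos = 0
    for _ in range(i % CHUNK_SIZE):     # a mask when CHUNK_SIZE = 2^k
        pos += utf8_len(chunk[pos])     # skip one character
    return chunk[pos:pos + utf8_len(chunk[pos])].decode("utf-8")

chunks = make_chunked("aß" * 100)       # 200 characters, mixed widths
assert string_ref(chunks, 0) == "a"
assert string_ref(chunks, 65) == "ß"    # character 65 lives in chunk 1
```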

Converting a string to UTF-8 now consists of concatenating its chunks.
For short strings (<= CHUNK_SIZE) we can simply return the address of
the first chunk.  In the general case, we'll have to recombine the
chunks into a new block, which of course takes O(n) time.

With a little added complexity, we can reduce this cost in common cases,
and even eliminate it completely when string-set! is not used.  The idea
is to initially store a stringbuf in a single piece.  Only after the
first string-set! would the stringbuf be broken up into chunks.

A flag in the stringbuf would indicate whether or not it is chunked.  In
a chunked string, each chunk[i] points to its own memory block.  In an
unchunked string, the entire string is in one block of UTF-8, pointed to
by chunk[0].  chunk[i] for i>0 points into the middle of the block, at
an offset of i*CHUNK_SIZE characters.  This allows string-ref to treat
chunked and unchunked stringbufs equivalently.

It is probably worthwhile to actually _convert_ a stringbuf back into
unchunked form whenever we convert a string to utf8.  That way, in the
common case where string-set! is only used to initialize the string,
later conversions to utf8 can be done in constant time (except the first
one, which can be charged to the initialization routine).

Now, it is my hope that string-set! will be seldom used, and that we can
encourage people to instead use string ports and/or string-unfold and
string-unfold-right.  These other functions can be implemented without
string-set!, and will not require the use of the chunked representation.

Therefore, it is my hope that the chunked representation can be avoided
altogether in almost all cases, which enables fast constant-time
conversions to UTF-8, and thus the efficient use of libunistring and any
other library that understands UTF-8.

What do you think?

      Mark




* Re: O(1) accessors for UTF-8 backed strings
  2011-03-13  4:05       ` O(1) accessors for UTF-8 backed strings Mark H Weaver
@ 2011-03-13  4:42         ` Alex Shinn
  2011-03-15 15:46           ` Mark H Weaver
  2011-03-30  9:20         ` Andy Wingo
  1 sibling, 1 reply; 55+ messages in thread
From: Alex Shinn @ 2011-03-13  4:42 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

On Sun, Mar 13, 2011 at 1:05 PM, Mark H Weaver <mhw@netris.org> wrote:
> ludo@gnu.org (Ludovic Courtès) writes:
>> I also think strings should remain what they currently are, with O(1)
>> random access.
>
> I just realized that it is possible to implement O(1) accessors for
> UTF-8 backed strings.

It's possible with several approaches, but not necessarily worth it:

http://trac.sacrideo.us/wg/wiki/StringRepresentations

-- 
Alex




* Re: Using libunistring for string comparisons et al
  2011-03-12 17:28       ` Mark H Weaver
@ 2011-03-13 21:30         ` Ludovic Courtès
  2011-03-30  9:05           ` Andy Wingo
  0 siblings, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2011-03-13 21:30 UTC (permalink / raw)
  To: guile-devel

Hi Mark,

Mark H Weaver <mhw@netris.org> writes:

> Unfortunately, the alternatives are not pleasant.  We have a bunch of
> bugs in our string handling functions.  Currently, our case-insensitive
> string comparisons and case conversions are not correct for several
> languages including German, according to the R6RS among other things.
>
> We could easily fix these problems by using libunistring, which provides
> the operations we need, but only if we use a single string
> representation, and one that is supported by libunistring (UTF-8,
> UTF-16, or UTF-32).

I don’t think so.  For instance, you could “upgrade” narrow strings to
UTF-32 and then use libunistring on that.  That would fix case-folding
for “Straße”, I guess.

> So, our options appear to be:
>
>   * Use only wide strings internally.
>
>   * Reimplement several complex functions from libunistring within guile
>     (string comparisons and case conversions).
>
>   * Convert strings to a libunistring-supported representation, and
>     possibly back again, on each operation.  For example, this will be
>     needed when comparing two narrow strings, when comparing a narrow
>     string to a wide string, or when applying a case conversion to a
>     narrow string.
>
> Our use of two different internal string representations is another
> problem.  Right now, our string comparisons are painfully inefficient.

Inefficient in the (unlikely) case that you’re comparing a narrow and a
wide string of different lengths.

So yes, the current implementation has bugs, but I think most if not all
can be fixed with minimal changes.  Would you like to look into it
for 2.0.x?

Using UTF-8 internally has problems of its own, as Mike explained, which
is why it was rejected in the first place.

Thanks,
Ludo’.





* Re: O(1) accessors for UTF-8 backed strings
  2011-03-13  4:42         ` Alex Shinn
@ 2011-03-15 15:46           ` Mark H Weaver
  2011-03-16  0:07             ` Alex Shinn
  2011-03-19 13:02             ` Andy Wingo
  0 siblings, 2 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-15 15:46 UTC (permalink / raw)
  To: Alex Shinn; +Cc: Ludovic Courtès, guile-devel

Alex Shinn <alexshinn@gmail.com> wrote:
> On Sun, Mar 13, 2011 at 1:05 PM, Mark H Weaver <mhw@netris.org> wrote:
>> I just realized that it is possible to implement O(1) accessors for
>> UTF-8 backed strings.
>
> It's possible with several approaches, but not necessarily worth it:
>
> http://trac.sacrideo.us/wg/wiki/StringRepresentations

Alex, can you please clarify your position?  I fear that readers of your
message might assume that you are against my proposal to store strings
internally in UTF-8.  Having read the text that you referenced above, I
suspect that you are in favor of using UTF-8 with O(n) string accessors.

For those who may not be familiar with the special properties of UTF-8,
please read at least the section on "Common Algorithms and Usage
Patterns" near the end of the text Alex referenced.  In summary, many
operations on UTF-8 such as substring searches, regexp searches, and
parsing can be done one byte at a time, using the same inner loop that
would be used for ASCII or Latin-1.  Also, although it is not mentioned
there, even simple string comparisons (done lexicographically by code
point) can be done byte-wise on UTF-8.
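The bytewise-comparison property is easy to spot-check (illustrative
Python; the sample strings are arbitrary):

```python
# UTF-8 was designed so that byte order matches code-point order,
# so lexicographic comparison can run bytewise without decoding.

import itertools

samples = ["abc", "abd", "Ångström", "éclair", "日本語", "\U0001F600"]
for a, b in itertools.combinations(samples, 2):
    # Python compares strings by code point; bytes compare bytewise.
    assert (a < b) == (a.encode("utf-8") < b.encode("utf-8"))
```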

I'd also like to point out that the R6RS is the only relevant standard
that mandates O(1) string accessors.  The R5RS did not require this, and
WG1 for the R7RS has already voted against this requirement.

  http://trac.sacrideo.us/wg/ticket/27

I'll write more on this later.

    Mark




* Re: Using libunistring for string comparisons et al
  2011-03-12 21:28 Mike Gran
@ 2011-03-15 17:20 ` Mark H Weaver
  2011-03-15 20:39   ` Mike Gran
  0 siblings, 1 reply; 55+ messages in thread
From: Mark H Weaver @ 2011-03-15 17:20 UTC (permalink / raw)
  To: Mike Gran; +Cc: Ludovic Courtès, guile-devel

Mike Gran <spk121@yahoo.com> writes:
> We do, in a manner of speaking, have a single string representation:
> UTF-32.  The 'narrow' encoding is UTF-32 with the initial 3 bytes of
> zero removed.

Despite the similarity of these two representations, they are
sufficiently different that they cannot be handled by the same machine
code.  That means you must either implement multiple inner loops, one
for each combination of string parameter representations, or else you
must dispatch on the string representation within the inner loop.  On
modern architectures, wrongly predicted conditional branches are very
expensive.

> I actually at one point had a nearly complete version of Guile 1.8
> that used UTF-8 and another that used UTF-32.  There are some
> other reasons why UTF-8 is bad, which I could bore you with
> ad nauseam.

Can you please tell me why UTF-8 is bad, or point me to something that
explains it?  Everything I have found suggests that UTF-8 is very good.

   Thanks,
     Mark




* Re: Using libunistring for string comparisons et al
  2011-03-15 17:20 ` Mark H Weaver
@ 2011-03-15 20:39   ` Mike Gran
  2011-03-15 22:49     ` Mark H Weaver
  2011-03-16  0:22     ` Alex Shinn
  0 siblings, 2 replies; 55+ messages in thread
From: Mike Gran @ 2011-03-15 20:39 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel@gnu.org

> From: Mark H Weaver <mhw@netris.org>
> 
> Mike Gran <spk121@yahoo.com> writes:
> > We do, in a manner of speaking, have a single string representation:
> > UTF-32.  The 'narrow' encoding is UTF-32 with the initial 3 bytes
> > of zero removed.
> 
> Despite the similarity of these two representations, they are
> sufficiently different that they cannot be handled by the same machine
> code.  That means you must either implement multiple inner loops, one
> for each combination of string parameter representations, or else you
> must dispatch on the string representation within the inner loop.  On
> modern architectures, wrongly predicted conditional branches are very
> expensive.

Keep in mind that the UTF-8 forward iterator operation has conditional
branches.  Merely the act of advancing from one character to another
could take one of four paths, or more if you include the possibility
of invalid UTF-8 sequences.

Also, you need to factor what libunistring does into your calculations
of what is expensive.

> 
> > I actually at one point had a nearly complete version of Guile 1.8
> > that used UTF-8 and another that used UTF-32.  There are some
> > other reasons why UTF-8 is bad, which I could bore you with
> ad nauseam.
> 
> Can you please tell me why UTF-8 is bad, or point me to something that
> explains it?  Everything I have found suggests that UTF-8 is very good.

OK

Well, we covered O(1) vs O(n).  To make UTF-8 O(1), you need to store
additional indexing information of some sort.  There are various schemes,
but, depending on the scheme, you lose some of the memory advantage of UTF-8
vs UTF-32.  You can likely do better than UTF-32, though.

Next, you have the question of string character validation.  For example,
in Latin-1, all 8-bit values are valid characters.  In UTF-32, not all
32-bit values are valid codepoints: anything over 10FFFF or in the surrogate
region D800 to DFFF is invalid.  Likewise, in UTF-8, you can have invalid
sequences.  You can also have valid UTF-8 sequences that decode to the
invalid codepoints in the surrogate region.  In our current scheme, we
test the validity of the codepoints in UTF-32 strings when the strings
are created.

So, one of the supposed benefits of UTF-8 strings for Guile is that you
can more easily pass information back and forth between Guile and external
libraries or programs.  But in that case, when UTF-8 strings are received
by a UTF-8-enabled Guile from an external source, you have to deal with
its O(1)-ness, the validity of its UTF-8 sequences, and the validity of
the codepoints that the UTF-8 sequences decode to (if the string is passed
back from a program or library that you don't completely trust).

So the question becomes *when* you'd deal with those questions.  The best
policy would be to deal with them at the point a UTF-8 string enters your
application.  When a string enters your application, you should scrub it
for invalid sequences and codepoints, throwing errors or replacing
bad characters with the Unicode replacement character and whatnot.
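The validate-at-the-boundary policy can be sketched with a codec that
replaces invalid sequences (illustrative Python; Guile itself would use
libunistring's validation, and the function name here is made up):

```python
# Scrub incoming bytes once, at the boundary: invalid UTF-8 sequences
# (including UTF-8-encoded surrogates) become U+FFFD REPLACEMENT
# CHARACTER, so all internal strings can be trusted afterwards.

def scrub_utf8(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")

assert scrub_utf8(b"abc\xff") == "abc\ufffd"    # stray invalid byte
assert "\ufffd" in scrub_utf8(b"\xed\xa0\x80")  # encoded surrogate D800
```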

If you don't deal with O(1) indexing and invalid sequences at the point
a UTF-8 string enters the system, you will have a couple of
interesting effects.  First, UTF-8 iterators that have to deal with the
possibility of invalid sequences are less efficient than those that can
trust the validity of a sequence, because of the extra code involved.

And then what do you do when you find an invalid sequence?  If you throw
an error, will there be the possibility of non-local exits any time there
is a string iteration?  (You can't replace bad sequences with the
Unicode replacement character on the fly as bad sequences are discovered
during iteration, because the string could be read-only or in use
by another thread.)

And then there is a code-correctness argument against UTF-8.  It is just
too easy to confuse a string's codepoint count with its memory size, and
it is just too easy to confuse iteration over bytes with iteration over
characters.  I know that the Guile maintainers are geniuses, but they
are still human.  (I don't have any real data for this, but anecdotally,
languages that have gone the ISO-8859-n to 8-bit-clean to
UTF-8 upgrade route seem to have taken longer to get up to speed.)  But
this argument, for Guile, is now largely moot, since much of the string
access is through encoding-agnostic accessor functions.

None of this is insurmountable, of course; all of this could be overcome,
if the benefits outweighed the costs.

-Mike




* Re: Using libunistring for string comparisons et al
  2011-03-15 20:39   ` Mike Gran
@ 2011-03-15 22:49     ` Mark H Weaver
  2011-03-16  0:01       ` Mike Gran
  2011-03-30  9:33       ` Andy Wingo
  2011-03-16  0:22     ` Alex Shinn
  1 sibling, 2 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-15 22:49 UTC (permalink / raw)
  To: Mike Gran; +Cc: Ludovic Courtès, guile-devel

Mike Gran <spk121@yahoo.com> writes:
>> From: Mark H Weaver <mhw@netris.org>
>> Despite the similarity of these two representations, they are
>> sufficiently different that they cannot be handled by the same machine
>> code.  That means you must either implement multiple inner loops, one
>> for each combination of string parameter representations, or else you
>> must dispatch on the string representation within the inner loop.  On
>> modern architectures, wrongly predicted conditional branches are very
>> expensive.
>
> Keep in mind that the UTF-8 forward iterator operation has conditional
> branches.  Merely the act of advancing from one character to another
> could take one of four paths, or more if you include the possibility
> of invalid UTF-8 sequences.

Yes, but for many common operations including simple string comparison,
substring search, regexp search, and parsing, there is no need to know
where the character boundaries are.  These algorithms can be done
bytewise on UTF-8.

> Well, we covered O(1) vs O(n).  To make UTF-8 O(1), you need to store
> additional indexing information of some sort.  There are various schemes,
> but, depending on the scheme, you lose some of the memory advantage of UTF-8
> vs UTF-32.  You can likely do better than UTF-32, though.

I would prefer to either let our accessors be O(n), or else to create
the index lazily, i.e. on the first usage of string-ref or string-set!
In such a scheme, very few strings would include indices, and thus the
overhead would be minimal.

Anyway, the index overhead can be made arbitrarily small by increasing
the chunk size.  It is a classic time-space trade-off here.  The chunk
size could be made larger over the years, as usage of string-ref and
string-set! become less common, and eventually the index stuff could be
removed entirely.

> So, one of the supposed benefits of UTF-8 strings for Guile is that you
> can more easily pass information back and forth between Guile and external
> libraries or programs.  But in that case, when UTF-8 strings are received
> by a UTF-8-enabled Guile from an external source,
> you have to deal with both its O(1)-ness

The indexing can be added lazily (and avoided more often than not).

> the validity of its UTF-8 sequences, and the validity of the
> codepoints that the UTF-8 sequences decode to (if the string is passed
> back from program or library that you don't completely trust.)

Yes, if we don't trust the library, then we'd have to validate the
UTF-8.  However, in many cases (e.g. when using libunistring) we could
trust them.

> So the question becomes *when* you'd deal with those questions.  The best
> policy would be to deal with them at the point a UTF-8 string enters your
> application.  When a string enters your application, you should scrub it
> for invalid sequences and codepoints, throwing errors or replacing
> bad characters with the Unicode replacement character and whatnot.

I agree, we should make sure that all internal UTF-8 strings are valid.
As you rightly point out, postponing the validation would cause many
problems.

> And then there is a code-correctness argument against UTF-8.  It is just
> too easy to confuse a string's codepoint count with its memory size, and
> it is just too easy to confuse iteration over bytes with iteration over
> characters.

True, but there are comparable code-correctness problems with the
current scheme.  If we use UTF-8 uniformly, we can efficiently use
external libraries such as libunistring for many tasks, and avoid
writing that code ourselves.

In order to make our current scheme efficient, we must not only write
and import much of the code from libunistring and other libraries, but
we will in many cases need to write and maintain multiple versions of
those routines, to handle different combinations of string
representations.

The reason I am still arguing this point is because I have looked
seriously at what I would need to do to (A) fix our i18n problems and
(B) make the code efficient.  I very much want to fix these things,
but the pain of trying to do this with our current scheme is too much
for me to bear.  I shouldn't have to rewrite libunistring, and I
shouldn't have to write 3 or 4 different variants of each procedure
that takes two string parameters.

      Mark



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-15 22:49     ` Mark H Weaver
@ 2011-03-16  0:01       ` Mike Gran
  2011-03-16  1:12         ` Mark H Weaver
  2011-03-30  9:33       ` Andy Wingo
  1 sibling, 1 reply; 55+ messages in thread
From: Mike Gran @ 2011-03-16  0:01 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel@gnu.org

> The reason I am still arguing this point is because I have looked
> seriously at what I would need to do to (A) fix our i18n problems and
> (B) make the code efficient.  I very much want to fix these things,
> but the pain of trying to do this with our current scheme is too much
> for me to bear.  I shouldn't have to rewrite libunistring, and I
> shouldn't have to write 3 or 4 different variants of each procedure
> that takes two string parameters.

What procedures are giving incorrect results?

Thanks,

Mike





* Re: O(1) accessors for UTF-8 backed strings
  2011-03-15 15:46           ` Mark H Weaver
@ 2011-03-16  0:07             ` Alex Shinn
  2011-03-19 13:02             ` Andy Wingo
  1 sibling, 0 replies; 55+ messages in thread
From: Alex Shinn @ 2011-03-16  0:07 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

On Wed, Mar 16, 2011 at 12:46 AM, Mark H Weaver <mhw@netris.org> wrote:
> Alex Shinn <alexshinn@gmail.com> wrote:
>> On Sun, Mar 13, 2011 at 1:05 PM, Mark H Weaver <mhw@netris.org> wrote:
>>> I just realized that it is possible to implement O(1) accessors for
>>> UTF-8 backed strings.
>>
>> It's possible with several approaches, but not necessarily worth it:
>>
>> http://trac.sacrideo.us/wg/wiki/StringRepresentations
>
> Alex, can you please clarify your position?  I fear that readers of your
> message might assume that you are against my proposal to store strings
> internally in UTF-8.  Having read the text that you referenced above, I
> suspect that you are in favor of using UTF-8 with O(n) string accessors.

I didn't intend to make a recommendation either way,
just point to a useful resource where people have collected
ideas and data on the topic so you could make an informed
decision.

You are correct that I personally prefer simple UTF-8 with O(n)
string accessors, which is why the Unicode support I added
for Chicken does this, as does my own chibi-scheme.  But
the best string representation depends on your use cases.

-- 
Alex




* Re: Using libunistring for string comparisons et al
  2011-03-15 20:39   ` Mike Gran
  2011-03-15 22:49     ` Mark H Weaver
@ 2011-03-16  0:22     ` Alex Shinn
  1 sibling, 0 replies; 55+ messages in thread
From: Alex Shinn @ 2011-03-16  0:22 UTC (permalink / raw)
  To: Mike Gran; +Cc: Mark H Weaver, Ludovic Courtès, guile-devel@gnu.org

On Wed, Mar 16, 2011 at 5:39 AM, Mike Gran <spk121@yahoo.com> wrote:
>> From:Mark H Weaver <mhw@netris.org>
>>
>> Mike Gran <spk121@yahoo.com> writes:
>> > We do, in a matter of speaking, have a single string representation:
>> > UTF-32.  The 'narrow' encoding is UTF-32 with the initial 3 bytes of
>> > zero removed.
>>
>> Despite the similarity of these two representations, they are
>> sufficiently different that they cannot be handled by the same machine
>> code.  That means you must either implement multiple inner loops, one
>> for each combination of string parameter representations, or else you
>> must dispatch on the string representation within the inner loop.  On
>> modern architectures, wrongly predicted conditional branches are very
>> expensive.
>
> Keep in mind that the UTF-8 forward iterator operation has conditional
> branches.  Merely the act of advancing from one character to another
> could take one of four paths, or more if you include the possibility
> of invalid UTF-8 sequences.

No, technically you don't need any branching:

  /* first-byte lookup table encoded as an integer */
  #define magic 3841982464uL

  /* read a utf8 char and advance the pointer */
  int read_utf8_char(char **pp) {
    char *p = *pp;
    int tmp, len, res = (unsigned char)*p++;

    tmp = (res>>7);
    /* tmp is 0 if len is 1, 1 otherwise */
    len = (res>>4)*2;
    len = tmp*(((magic&(3<<len))>>len)+1) + (1-tmp);
    res = ((res<<len)&255)>>len;
    res = (tmp*((res<<6)+(63&*p))) + (1-tmp)*res;
    p += tmp;
    tmp = (len-1)>>1;
    /* tmp is 1 if len is 3 or 4, 0 otherwise */
    res = (tmp*((res<<6)+(63&*p))) + (1-tmp)*res;
    p += tmp;
    tmp = len>>2;
    /* tmp is 1 if len is 4, 0 otherwise */
    res = (tmp*((res<<6)+(63&*p))) + (1-tmp)*res;
    p += tmp;

    *pp = p;
    return res;
  }

It turns out this isn't worth it on x86 architectures.
Branch prediction is very fast.  Either way, the
overhead tends to be dwarfed by whatever it is
you're doing _with_ the char.

-- 
Alex




* Re: Using libunistring for string comparisons et al
  2011-03-16  0:01       ` Mike Gran
@ 2011-03-16  1:12         ` Mark H Weaver
  2011-03-16 11:26           ` Ludovic Courtès
                             ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-16  1:12 UTC (permalink / raw)
  To: Mike Gran; +Cc: Ludovic Courtès, guile-devel

Mike Gran <spk121@yahoo.com> writes:
>> The reason I am still arguing this point is because I have looked
>> seriously at what I would need to do to (A) fix our i18n problems and
>> (B) make the code efficient.  I very much want to fix these things,
>> but the pain of trying to do this with our current scheme is too much
>> for me to bear.  I shouldn't have to rewrite libunistring, and I
>> shouldn't have to write 3 or 4 different variants of each procedure
>> that takes two string parameters.
>
> What procedures are giving incorrect results?

I know of two categories of bugs.  One has to do with case conversions
and case-insensitive comparisons, which must be done on entire strings
but are currently done for each character.  Here are some examples:

  (string-upcase "Straße")         => "STRAßE"  (should be "STRASSE")
  (string-downcase "ΧΑΟΣΣ")        => "χαοσσ"   (should be "χαοσς")
  (string-downcase "ΧΑΟΣ Σ")       => "χαοσ σ"  (should be "χαος σ")
  (string-ci=? "Straße" "Strasse") => #f        (should be #t)
  (string-ci=? "ΧΑΟΣ" "χαοσ")      => #f        (should be #t)

Another big category of problems has to do with the fact that
scm_from_locale_{string,symbol,keyword} is currently used in many places
where the C string being converted is a compile-time constant.  This is
a bug unless the strings are ASCII-only, because the locale is normally
that of the user, which is not necessarily that of the source code.

Ludovic, Andy and I discussed this on IRC, and came to the conclusion
that UTF-8 should be the encoding assumed by functions such as
scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic,
scm_c_export, scm_c_define_module, scm_c_resolve_module,
scm_c_use_module, etc.  However, this creates pressure to make
scm_from_utf8_string and scm_from_utf8_symbol as efficient as possible.

With the current string representation scheme, the plan for
scm_from_utf8_string is to scan up to the first 100 characters of the
input string, and if the string is found to be ASCII-only, then we can
use scm_from_latin1_string.  Otherwise, we need to use scm_from_stringn
which is noticeably slower.

An unfortunate complication is that the snarfing macros such as
SCM_DEFINE et al arrange to store the symbol names as compile-time
constants and thus to put them in a read-only segment of the shared
library.  This is done with some preprocessor magic in snarf.h (see
SCM_IMMUTABLE_STRINGBUF).  I would like to make SCM_DEFINE et al work
for any UTF-8 strings, but I can do that with cpp only if UTF-8 is the
internal representation.  As things currently stand, those macros must
be limited to ASCII-only names, which is unfair to non-English speakers.

     Mark




* Re: Using libunistring for string comparisons et al
@ 2011-03-16  1:30 Mike Gran
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Gran @ 2011-03-16  1:30 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel@gnu.org

>   (string-upcase "Straße")         => "STRAßE"  (should be "STRASSE")
>   (string-downcase "ΧΑΟΣΣ")        => "χαοσσ"   (should be "χαοσς")
>   (string-downcase "ΧΑΟΣ Σ")       => "χαοσ σ"  (should be "χαος σ")

Well, yes and no.  R6RS yes.  SRFI-13 no.




* Re: Using libunistring for string comparisons et al
@ 2011-03-16  2:03 Mike Gran
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Gran @ 2011-03-16  2:03 UTC (permalink / raw)
  To: Alex Shinn; +Cc: Mark H Weaver, Ludovic Courtès, guile-devel@gnu.org

> From:Alex Shinn <alexshinn@gmail.com>

> > Keep in mind that the UTF-8 forward iterator operation has conditional
> > branches.  Merely the act of advancing from one character to another
> > could take one of four paths, or more if you include the possibility
> > of invalid UTF-8 sequences.
> 
> No, technically you don't need any branching:
> 
>   /* first-byte lookup table encoded as an integer */
>   #define magic 3841982464uL
> ...

Cool.  I stand corrected.

- Mike





* Re: Using libunistring for string comparisons et al
  2011-03-16  1:12         ` Mark H Weaver
@ 2011-03-16 11:26           ` Ludovic Courtès
  2011-03-17 15:38             ` Mark H Weaver
  2011-03-17 21:47           ` Ludovic Courtès
  2011-03-19 12:31           ` Andy Wingo
  2 siblings, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2011-03-16 11:26 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

Hello Mark,

Mark H Weaver <mhw@netris.org> writes:

> Mike Gran <spk121@yahoo.com> writes:
>>> The reason I am still arguing this point is because I have looked
>>> seriously at what I would need to do to (A) fix our i18n problems and
>>> (B) make the code efficient.  I very much want to fix these things,
>>> but the pain of trying to do this with our current scheme is too much
>>> for me to bear.  I shouldn't have to rewrite libunistring, and I
>>> shouldn't have to write 3 or 4 different variants of each procedure
>>> that takes two string parameters.
>>
>> What procedures are giving incorrect results?
>
> I know of two categories of bugs.  One has to do with case conversions
> and case-insensitive comparisons, which must be done on entire strings
> but are currently done for each character.  Here are some examples:
>
>   (string-upcase "Straße")         => "STRAßE"  (should be "STRASSE")
>   (string-downcase "ΧΑΟΣΣ")        => "χαοσσ"   (should be "χαοσς")
>   (string-downcase "ΧΑΟΣ Σ")       => "χαοσ σ"  (should be "χαος σ")
>   (string-ci=? "Straße" "Strasse") => #f        (should be #t)
>   (string-ci=? "ΧΑΟΣ" "χαοσ")      => #f        (should be #t)

(Mike pointed out that SRFI-13 does not consider these bugs, but that’s
linguistically wrong so I’d consider it a bug.  Note that all these
functions are ‘linguistically buggy’ anyway since they don’t have a
locale argument, which breaks with Turkish ‘İ’.)

Can we first check what would need to be done to fix this in 2.0.x?

At first glance:

  - “Straße” is normally stored as a Latin1 string, so it would need to
    be converted to UTF-* before it can be passed to one of the
    unicase.h functions.  *Or*, we could check with bug-libunistring
    what it would take to add Latin1 string case mapping functions.

    Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
    one-to-one case mapping.  All other Latin1 strings can be handled by
    iterating over characters, as is currently done.

    With this in mind, we could hack our way so that strings that
    contain an ‘ß’ are stored as UTF-32 (yes, that’s a hack.)

  - For ‘string-downcase’, the Greek strings above are wide strings, so
    they can be passed directly to u32_toupper & co.  For these, the fix
    is almost two lines.

  - Case insensitive comparison is more difficult, as you already
    pointed out.  To do it right we’d probably need to convert Latin1
    strings to UTF-32 and then pass it to u32_casecmp.  We don’t have to
    do the conversion every time, though: we could just change Latin1
    strings in-place so they now point to a wide stringbuf upon the
    first ‘string-ci=’.

Thoughts?

Thanks,
Ludo’.




* Re: Using libunistring for string comparisons et al
@ 2011-03-16 15:22 Mike Gran
  2011-03-16 16:58 ` Ludovic Courtès
  0 siblings, 1 reply; 55+ messages in thread
From: Mike Gran @ 2011-03-16 15:22 UTC (permalink / raw)
  To: Ludovic Courtès, Mark H Weaver; +Cc: guile-devel@gnu.org

[-- Attachment #1: Type: text/plain, Size: 3110 bytes --]

> From:Ludovic Courtès <ludo@gnu.org>

> > I know of two categories of bugs.  One has to do with case conversions
> > and case-insensitive comparisons, which must be done on entire strings
> > but are currently done for each character.  Here are some examples:
> >
> >   (string-upcase "Straße")         => "STRAßE"  (should be "STRASSE")
> >   (string-downcase "ΧΑΟΣΣ")        => "χαοσσ"   (should be "χαοσς")
> >   (string-downcase "ΧΑΟΣ Σ")       => "χαοσ σ"  (should be "χαος σ")
> >   (string-ci=? "Straße" "Strasse") => #f        (should be #t)
> >   (string-ci=? "ΧΑΟΣ" "χαοσ")      => #f        (should be #t)
> 
> (Mike pointed out that SRFI-13 does not consider these bugs, but that’s
> linguistically wrong so I’d consider it a bug.  Note that all these
> functions are ‘linguistically buggy’ anyway since they don’t have a
> locale argument, which breaks with Turkish ‘İ’.)
> 
> Can we first check what would need to be done to fix this in 2.0.x?
> 
> At first glance:
> 
>   - “Straße” is normally stored as a Latin1 string, so it would need to
>     be converted to UTF-* before it can be passed to one of the
>     unicase.h functions.  *Or*, we could check with bug-libunistring
>     what it would take to add Latin1 string case mapping functions.
> 
>     Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
>     one-to-one case mapping.  All other Latin1 strings can be handled by
>     iterating over characters, as is currently done.

There is also the micro sign, which, when case folded, becomes a Greek
mu.  It is still a single character, but it is the only Latin-1
character that, when folded, becomes a non-Latin-1 character.

> 
>     With this in mind, we could hack our way so that strings that
>     contain an ‘ß’ are stored as UTF-32 (yes, that’s a hack.)
> 
>   - For ‘string-downcase’, the Greek strings above are wide strings, so
>     they can be passed directly to u32_toupper & co.  For these, the fix
>     is almost two lines.
> 
>   - Case insensitive comparison is more difficult, as you already
>     pointed out.  To do it right we’d probably need to convert Latin1
>     strings to UTF-32 and then pass it to u32_casecmp.  We don’t have to
>     do the conversion every time, though: we could just change Latin1
>     strings in-place so they now point to a wide stringbuf upon the
>     first ‘string-ci=’.
> 
> Thoughts?

What about the SRFI-13 case-insensitive comparisons (the ones that don't
terminate in question marks, like string-ci<)?  Should they remain
as SRFI-13 suggests, or should they be made similar in behavior
to the question-mark-terminated comparisons?

Mark is right that fixing this will not be pretty.  The case insensitive
string comparisons, for example, could be patched like the attached
snippet. If you don't find it too ugly of an approach, I could work on
a real patch.

Thanks,

Mike

[-- Attachment #2: strorder.c.patch --]
[-- Type: text/x-patch; name="strorder.c.patch", Size: 8227 bytes --]

diff --git a/libguile/strorder.c b/libguile/strorder.c
index a51ce17..6df8343 100644
--- a/libguile/strorder.c
+++ b/libguile/strorder.c
@@ -21,6 +21,8 @@
 # include <config.h>
 #endif
 
+#include <unicase.h>
+
 #include "libguile/_scm.h"
 #include "libguile/chars.h"
 #include "libguile/strings.h"
@@ -42,6 +44,164 @@ srfi13_cmp (SCM s1, SCM s2, SCM (*cmp) (SCM, SCM, SCM, SCM, SCM, SCM))
     return SCM_BOOL_F;
 }
 
+#define SHARP_S (0xdf)
+#define MICRO_SIGN (0xb5)
+#define MU (0x3bc)
+/* This function does a comparison of the case-folded versions of S1
+   and S2.  It returns -1 if S1 < S2, 0 if they are equal,
+   or 1 if S1 > S2. */
+static int
+compare_folded_strings (SCM s1, SCM s2)
+{
+  if (!scm_is_string (s1))
+    scm_wrong_type_arg (__func__, 0, s1);
+  if (!scm_is_string (s2))
+    scm_wrong_type_arg (__func__, 1, s2);
+  if (scm_i_is_narrow_string (s1) && scm_i_is_narrow_string (s2))
+    {
+      size_t cindex1 = 0, cindex2 = 0;
+      const size_t cend1 = scm_i_string_length (s1);
+      const size_t cend2 = scm_i_string_length (s2);
+      ucs4_t a, b;
+      int ss1 = 0, ss2 = 0;
+  
+      /* For narrow strings, folding equals downcasing except for sharp
+	 s which becomes 'ss' and the micro sign which becomes Greek
+	 mu.  */
+      while (cindex1 < cend1 && cindex2 < cend2)
+	{
+	  if (ss1)
+	    a = (ucs4_t) 's';
+	  else
+	    {
+	      a = uc_tolower ((unsigned char) (scm_i_string_chars (s1))[cindex1]);
+	      if (a == SHARP_S)
+		{
+		  a = (ucs4_t) 's';
+		  ss1 = 2;
+		}
+	      if (a == MICRO_SIGN)
+		a = MU;
+	    }
+	  if (ss2)
+	    b = (ucs4_t) 's';
+	  else
+	    {
+	      b = uc_tolower ((unsigned char) (scm_i_string_chars (s2))[cindex2]);
+	      if (b == SHARP_S)
+		{
+		  b = 's';
+		  ss2 = 2;
+		}
+	      if (b == MICRO_SIGN)
+		b = MU;
+	    }
+	  if (a < b)
+	    return -1;
+	  else if (a > b)
+	    return 1;
+	  if (ss1)
+	    ss1 --;
+	  if (!ss1)
+	    cindex1 ++;
+	  if (ss2)
+	    ss2 --;
+	  if (!ss2)
+	    cindex2 ++;
+	}
+      if (cindex1 < cend1)
+	return 1;
+      else if (cindex2 < cend2)
+	return -1;
+
+      return 0;
+    }
+  else if (!scm_i_is_narrow_string (s1) && !scm_i_is_narrow_string (s2))
+    {
+      int ret, result;
+
+      ret = u32_casecmp ((const uint32_t *) scm_i_string_wide_chars (s1),
+			 scm_i_string_length (s1),
+			 (const uint32_t *) scm_i_string_wide_chars (s2),
+			 scm_i_string_length (s2),
+			 NULL, NULL, &result);
+      if (ret != 0)
+        scm_encoding_error (__func__, errno,
+			    "cannot do case-folded comparison",
+			    SCM_BOOL_F,
+			    /* FIXME: Faulty character unknown.  */
+			    SCM_BOOL_F);
+      return result;
+    }
+  else
+    {
+      int swap = 1, ss1 = 0;
+      uint32_t *str2 = NULL;
+      size_t cindex1 = 0, cindex2 = 0;
+      const size_t cend1 = scm_i_string_length (s1);
+      size_t cend2;
+      ucs4_t a, b;
+
+      /* Swap so that s1 is narrow and s2 is wide.  */
+      if (scm_i_is_narrow_string (s2))
+	{
+	  SCM s3;
+	  s3 = s1;
+	  s1 = s2;
+	  s2 = s3;
+	  swap = -1;
+	}
+      str2 = u32_casefold ((const uint32_t *) scm_i_string_wide_chars (s2),
+			   scm_i_string_length (s2),
+			   NULL, NULL, NULL, &cend2);
+      if (str2 == NULL)
+	scm_memory_error (__func__);
+
+      while (cindex1 < cend1 && cindex2 < cend2)
+	{
+	  if (ss1)
+	    a = (ucs4_t) 's';
+	  else
+	    {
+	      a = uc_tolower ((unsigned char) scm_i_string_chars (s1)[cindex1]);
+	      if (a == SHARP_S)
+		{
+		  a = (ucs4_t) 's';
+		  ss1 = 2;
+		}
+	      if (a == MICRO_SIGN)
+		a = MU;
+	    }
+	  b = str2[cindex2];
+	  if (a < b)
+	    {
+	      free (str2);
+	      return -1 * swap;
+	    }
+	  else if (a > b)
+	    {
+	      free (str2);
+	      return 1 * swap;
+	    }
+	  if (ss1)
+	    ss1 --;
+	  if (!ss1) 
+	    cindex1 ++;
+	  cindex2 ++;
+	}
+      free (str2);
+      if (cindex1 < cend1)
+	return 1 * swap;
+      else if (cindex2 < cend2)
+	return -1 * swap;
+
+      return 0;
+    }
+      
+  return 0;
+}
+
+
 static SCM scm_i_string_equal_p (SCM s1, SCM s2, SCM rest);
 SCM_DEFINE (scm_i_string_equal_p, "string=?", 0, 2, 1,
             (SCM s1, SCM s2, SCM rest),
@@ -80,8 +240,8 @@ static SCM scm_i_string_ci_equal_p (SCM s1, SCM s2, SCM rest);
 SCM_DEFINE (scm_i_string_ci_equal_p, "string-ci=?", 0, 2, 1,
             (SCM s1, SCM s2, SCM rest),
 	    "Case-insensitive string equality predicate; return @code{#t} if\n"
-	    "the two strings are the same length and their component\n"
-	    "characters match (ignoring case) at each position; otherwise\n"
+	    "case-folded versions of the two strings are the same length\n"
+            "and their component characters match at each position; otherwise\n"
 	    "return @code{#f}.")
 #define FUNC_NAME s_scm_i_string_ci_equal_p
 {
@@ -89,13 +249,13 @@ SCM_DEFINE (scm_i_string_ci_equal_p, "string-ci=?", 0, 2, 1,
     return SCM_BOOL_T;
   while (!scm_is_null (rest))
     {
-      if (scm_is_false (srfi13_cmp (s1, s2, scm_string_ci_eq)))
+      if (0 != compare_folded_strings (s1, s2))
         return SCM_BOOL_F;
       s1 = s2;
       s2 = scm_car (rest);
       rest = scm_cdr (rest);
     }
-  return srfi13_cmp (s1, s2, scm_string_ci_eq);
+  return scm_from_bool (0 == compare_folded_strings (s1, s2));
 }
 #undef FUNC_NAME
 
@@ -218,6 +378,7 @@ SCM scm_string_geq_p (SCM s1, SCM s2)
 }
 #undef FUNC_NAME
 
+
 static SCM scm_i_string_ci_less_p (SCM s1, SCM s2, SCM rest);
 SCM_DEFINE (scm_i_string_ci_less_p, "string-ci<?", 0, 2, 1,
             (SCM s1, SCM s2, SCM rest),
@@ -230,20 +391,20 @@ SCM_DEFINE (scm_i_string_ci_less_p, "string-ci<?", 0, 2, 1,
     return SCM_BOOL_T;
   while (!scm_is_null (rest))
     {
-      if (scm_is_false (srfi13_cmp (s1, s2, scm_string_ci_lt)))
+      if (-1 != compare_folded_strings (s1, s2))
         return SCM_BOOL_F;
       s1 = s2;
       s2 = scm_car (rest);
       rest = scm_cdr (rest);
     }
-  return srfi13_cmp (s1, s2, scm_string_ci_lt);
+  return scm_from_bool (-1 == compare_folded_strings (s1, s2));
 }
 #undef FUNC_NAME
 
 SCM scm_string_ci_less_p (SCM s1, SCM s2)
 #define FUNC_NAME s_scm_i_string_ci_less_p
 {
-  return srfi13_cmp (s1, s2, scm_string_ci_lt);
+  return scm_from_bool (-1 == compare_folded_strings (s1, s2));
 }
 #undef FUNC_NAME
 
@@ -259,20 +420,20 @@ SCM_DEFINE (scm_i_string_ci_leq_p, "string-ci<=?", 0, 2, 1,
     return SCM_BOOL_T;
   while (!scm_is_null (rest))
     {
-      if (scm_is_false (srfi13_cmp (s1, s2, scm_string_ci_le)))
+      if (1 == compare_folded_strings (s1, s2))
         return SCM_BOOL_F;
       s1 = s2;
       s2 = scm_car (rest);
       rest = scm_cdr (rest);
     }
-  return srfi13_cmp (s1, s2, scm_string_ci_le);
+  return scm_from_bool (1 != compare_folded_strings (s1, s2));
 }
 #undef FUNC_NAME
 
 SCM scm_string_ci_leq_p (SCM s1, SCM s2)
 #define FUNC_NAME s_scm_i_string_ci_leq_p
 {
-  return srfi13_cmp (s1, s2, scm_string_ci_le);
+  return scm_from_bool (1 != compare_folded_strings (s1, s2));
 }
 #undef FUNC_NAME
 
@@ -288,13 +449,13 @@ SCM_DEFINE (scm_i_string_ci_gr_p, "string-ci>?", 0, 2, 1,
     return SCM_BOOL_T;
   while (!scm_is_null (rest))
     {
-      if (scm_is_false (srfi13_cmp (s1, s2, scm_string_ci_gt)))
+      if (1 != compare_folded_strings (s1, s2))
         return SCM_BOOL_F;
       s1 = s2;
       s2 = scm_car (rest);
       rest = scm_cdr (rest);
     }
-  return srfi13_cmp (s1, s2, scm_string_ci_gt);
+  return scm_from_bool (1 == compare_folded_strings (s1, s2));
 }
 #undef FUNC_NAME
 
@@ -317,20 +478,20 @@ SCM_DEFINE (scm_i_string_ci_geq_p, "string-ci>=?", 0, 2, 1,
     return SCM_BOOL_T;
   while (!scm_is_null (rest))
     {
-      if (scm_is_false (srfi13_cmp (s1, s2, scm_string_ci_ge)))
+      if (-1 == compare_folded_strings (s1, s2))
         return SCM_BOOL_F;
       s1 = s2;
       s2 = scm_car (rest);
       rest = scm_cdr (rest);
     }
-  return srfi13_cmp (s1, s2, scm_string_ci_ge);
+  return scm_from_bool (-1 != compare_folded_strings (s1, s2));
 }
 #undef FUNC_NAME
 
 SCM scm_string_ci_geq_p (SCM s1, SCM s2)
 #define FUNC_NAME s_scm_i_string_ci_geq_p
 {
-  return srfi13_cmp (s1, s2, scm_string_ci_ge);
+  return scm_from_bool (-1 != compare_folded_strings (s1, s2));
 }
 #undef FUNC_NAME
 


* Re: Using libunistring for string comparisons et al
  2011-03-16 15:22 Mike Gran
@ 2011-03-16 16:58 ` Ludovic Courtès
  0 siblings, 0 replies; 55+ messages in thread
From: Ludovic Courtès @ 2011-03-16 16:58 UTC (permalink / raw)
  To: Mike Gran; +Cc: Mark H Weaver, guile-devel@gnu.org

Hi,

Mike Gran <spk121@yahoo.com> writes:

>> From:Ludovic Courtès <ludo@gnu.org>
>
>> > I know of two categories of bugs.  One has to do with case conversions
>> > and case-insensitive comparisons, which must be done on entire strings
>> > but are currently done for each character.  Here are some examples:
>> >
>> >   (string-upcase "Straße")         => "STRAßE"  (should be "STRASSE")
>> >   (string-downcase "ΧΑΟΣΣ")        => "χαοσσ"   (should be "χαοσς")
>> >   (string-downcase "ΧΑΟΣ Σ")       => "χαοσ σ"  (should be "χαος σ")
>> >   (string-ci=? "Straße" "Strasse") => #f        (should be #t)
>> >   (string-ci=? "ΧΑΟΣ" "χαοσ")      => #f        (should be #t)
>> 
>> (Mike pointed out that SRFI-13 does not consider these bugs, but that’s
>> linguistically wrong so I’d consider it a bug.  Note that all these
>> functions are ‘linguistically buggy’ anyway since they don’t have a
>> locale argument, which breaks with Turkish ‘İ’.)
>> 
>> Can we first check what would need to be done to fix this in 2.0.x?
>> 
>> At first glance:
>> 
>>   - “Straße” is normally stored as a Latin1 string, so it would need to
>>     be converted to UTF-* before it can be passed to one of the
>>     unicase.h functions.  *Or*, we could check with bug-libunistring
>>     what it would take to add Latin1 string case mapping functions.
>> 
>>     Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
>>     one-to-one case mapping.  All other Latin1 strings can be handled by
>>     iterating over characters, as is currently done.
>
> There is the micro sign, which, when case folded, becomes a Greek mu.
> It is still a single character, but, it is the only latin-1 character that,
> when folded, becomes a non-Latin-1 character

Blech.

It would have worked better with narrow == ASCII instead of
narrow == Latin1.  It’s a change we can still make, I think.

>>   - Case insensitive comparison is more difficult, as you already
>>     pointed out.  To do it right we’d probably need to convert Latin1
>>     strings to UTF-32 and then pass it to u32_casecmp.  We don’t have to
>>     do the conversion every time, though: we could just change Latin1
>>     strings in-place so they now point to a wide stringbuf upon the
>>     first ‘string-ci=’.
>> 
>> Thoughts?
>
> What about the srfi-13 case insensitive comparisons (the ones that don't
> terminate in question marks, like string-ci<)?  Should they remain
> as srfi-13 suggests, or should they remain similar in behavior
> to the question-mark-terminated comparisons?

Well, if maintaining two string comparison algorithms is reasonable,
then we can keep both; otherwise, I’d vote for the R6RS way.

> Mark is right that fixing this will not be pretty.  The case insensitive
> string comparisons, for example, could be patched like the attached
> snippet. If you don't find it too ugly of an approach, I could work on
> a real patch.

Indeed it’s quite inelegant.  ;-)

How about changing to narrow == ASCII and then string comparison would
be:

  if (narrow (s1) != narrow (s2))
    {
      /* Handle ß -> ss.  */
      if (!narrow (s1))
        widify (s1);
      else
        widify (s2);
    }

  if (narrow (s1))
    /* S1 and S2 are ASCII.  */
    return strcmp (char_data (s1), char_data (s2));
  else
    /* S1 and S2 are UTF-32.  */
    return u32_cmp (wide_char_data (s1), wide_char_data (s2));

Looks like that would remain reasonable while actually fixing our
problems.

As a side-effect, though, scm_from_latin1_locale would become slightly
less efficient because it’d need to check for non-ASCII chars.

Thanks,
Ludo’.




* Re: Using libunistring for string comparisons et al
  2011-03-16 11:26           ` Ludovic Courtès
@ 2011-03-17 15:38             ` Mark H Weaver
  2011-03-17 15:56               ` Ludovic Courtès
  0 siblings, 1 reply; 55+ messages in thread
From: Mark H Weaver @ 2011-03-17 15:38 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-devel

I have a compromise proposal, which could be implemented for 2.0.x:

We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs
to UTF-8, along with a flag that indicates whether it is known to be
ASCII-only.

Applying string-ref or string-set! to a narrow stringbuf would upgrade
it to a wide stringbuf, unless it is known to be ASCII-only.  Better
yet, string-ref should do this only when the index is above a certain
threshold value, and string-set! should do this only for stringbufs
longer than a certain threshold length.

This would keep our accessors O(1), but also ensure that most stringbufs
are narrow.  This is important not only for optimal memory usage, but
also because it means we don't have to worry so much about optimizing
the narrow-wide cases: then we can handle those cases by widening or
narrowing to make them the same width, and then calling libunistring.

In the eventual common case, where string-ref and string-set! are rarely
called, almost all stringbufs would be narrow, so converting to UTF-8
becomes an O(1) operation.

What do you think?

    Mark




* Re: Using libunistring for string comparisons et al
  2011-03-17 15:38             ` Mark H Weaver
@ 2011-03-17 15:56               ` Ludovic Courtès
  2011-03-17 17:58                 ` Mark H Weaver
  0 siblings, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2011-03-17 15:56 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

Hi Mark,

Mark H Weaver <mhw@netris.org> writes:

> I have a compromise proposal, which could be implemented for 2.0.x:
>
> We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs
> to UTF-8, along with a flag that indicates whether it is known to be
> ASCII-only.

The whole point of the narrow/wide distinction was to avoid
variable-width encodings.  In addition, we’d end up with 3 cases (ASCII,
UTF-8, or UTF-32) instead of 2, which seems quite complex to me.

What do you think of moving to narrow = ASCII, as I suggested earlier?

Thanks,
Ludo’.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-17 15:56               ` Ludovic Courtès
@ 2011-03-17 17:58                 ` Mark H Weaver
  2011-03-18  0:10                   ` Thien-Thi Nguyen
  2011-03-20 22:12                   ` Ludovic Courtès
  0 siblings, 2 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-17 17:58 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-devel

ludo@gnu.org (Ludovic Courtès) writes:
>> We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs
>> to UTF-8, along with a flag that indicates whether it is known to be
>> ASCII-only.
>
> The whole point of the narrow/wide distinction was to avoid
> variable-width encodings.  In addition, we’d end up with 3 cases (ASCII,
> UTF-8, or UTF-32) instead of 2, which seems quite complex to me.

Most functions would not care about the known_ascii_only flag, so really
it's just two cases.  (As you know, I'd prefer to have only one case).

> What do you think of moving to narrow = ASCII, as I suggested earlier?

The problem is that the narrow-wide cases will be much more common in
your scheme, and none of us has a good solution to those cases.  All the
solutions that handle those cases efficiently involve an unacceptable
increase in code complexity.  In your scheme, a large number of common
operations will require widening strings, which is bad for efficiency,
in both space and time for the common operations.

You may not realize the extent to which UTF-8's special properties
mostly eliminate the usual disadvantages of variable-width encodings.
Please allow me to explain how the most common string operations can be
implemented on UTF-8 strings.

* case-sensitive string comparison: Scan for the first unequal byte.
  The scan can be done bytewise, without paying any attention to UTF-8
  encoding.  Then you figure out how the differing characters compare to
  one another, which can be done in constant time.  Anyway, this is
  implemented by libunistring, and not our concern.

* case-insensitive string comparison: Same as above, but you might find
  that the differing characters are in the same equivalence class, and
  thus might have to restart the scan.  Anyway, this is implemented by
  libunistring, and not our concern.

* substring search: This can be implemented bytewise, exactly as if it
  was a fixed-width encoding.

* regexp search: The search itself can be implemented bytewise, exactly
  as if it was a fixed-width encoding.  Compiling the regexp can
  _almost_ be implemented as if the UTF-8-encoded regexp was in a
  fixed-width encoding, with just one added complication: a multibyte
  character followed by `*', `?' etc, must be compiled in such a way
  that the suffix operator applies to the whole character, and not just
  its final byte.  (In practice, it's probably more straightforward to
  handle compilation somewhat differently than outlined here, but you
  get the idea).

* parsing: Similarly, the parsing itself can be done bytewise, but
  compiling the grammar may require some considerations similar to the
  ones needed for compiling regexps.

Can you think of a real-world example where the variable-width encoding
of UTF-8 causes problems?

    Best,
     Mark



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
@ 2011-03-17 18:07 Mike Gran
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Gran @ 2011-03-17 18:07 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Mark H Weaver, guile-devel@gnu.org

> From:Ludovic Courtès <ludo@gnu.org>
> >> Can we first check what would need to be done to fix this in 2.0.x?
> >> 
> >> At first glance:
> >> 
> >>   - “Straße” is normally stored as a Latin1 string, so it would need to
> >>     be converted to UTF-* before it can be passed to one of the
> >>     unicase.h functions.  *Or*, we could check with bug-libunistring
> >>     what it would take to add Latin1 string case mapping functions.
> >> 
> >>     Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
> >>     one-to-one case mapping.  All other Latin1 strings can be handled 
> by
> >>     iterating over characters, as is currently done.
> >
> > There is the micro sign, which, when case folded, becomes a Greek mu.
> > It is still a single character, but, it is the only latin-1 character that,
> > when folded, becomes a non-Latin-1 character
> 
> Blech.
> 
> It would have worked better with narrow == ASCII instead of
> narrow == Latin1.  It’s a change we can still make, I think.

It would be easy enough to do.  If someone were to fight for
a narrow encoding of Latin-1, I would expect it to be you, since
you're the only committer whose name requires ISO-8859-1.
So if you're okay with it, who am I to complain?

> 
> >>   - Case insensitive comparison is more difficult, as you already
> >>     pointed out.  To do it right we’d probably need to convert Latin1
> >>     strings to UTF-32 and then pass it to u32_casecmp.  We don’t have 
> to
> >>     do the conversion every time, though: we could just change Latin1
> >>     strings in-place so they now point to a wide stringbuf upon the
> >>     first ‘string-ci=’.
> >> 
> >> Thoughts?
> >

[...]

> 
> Indeed it’s quite inelegant.  ;-)
> 
> How about changing to narrow == ASCII and then string comparison would
> be:
> 
>   if (narrow (s1) != narrow (s2))
>     {

It would be easier and cleaner, as you demonstrate.

I guess the question is about future-proofing.  If the complications with the Latin-1
/ UTF-32 dual encoding are constrained to upcase/downcase and string-ci
comparison ops, then it doesn't seem worth it to change it.  But if it is going
to cause endless problems down the road, ASCII/UTF-32 is simpler.

A lot of this debate is about expectations, I think.  For my part, I think that
the string-ci ops only have real value for English language and ASCII text.
For non-English non-ASCII processing, sorting case-insensitively by numeric
codepoint values in the absence of locale sorting rules seems like an odd thing
to want to do. 

So I guess I'm not bothered with the ugly C necessary to make ISO-8859-1 work.
It is bad for string-ci ops but not too bad for upcase/downcase.  I also am
not too concerned that string-ci comparison ops for non-English non-ASCII
processing may be inefficient.  It does seem vital that string-locale comparison
ops be efficient, though.

Thanks,
Mike



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-16  1:12         ` Mark H Weaver
  2011-03-16 11:26           ` Ludovic Courtès
@ 2011-03-17 21:47           ` Ludovic Courtès
  2011-03-19 12:31           ` Andy Wingo
  2 siblings, 0 replies; 55+ messages in thread
From: Ludovic Courtès @ 2011-03-17 21:47 UTC (permalink / raw)
  To: guile-devel

Hi!

Mark H Weaver <mhw@netris.org> writes:

>   (string-upcase "Straße")         => "STRAßE"  (should be "STRASSE")
>   (string-downcase "ΧΑΟΣΣ")        => "χαοσσ"   (should be "χαοσς")
>   (string-downcase "ΧΑΟΣ Σ")       => "χαοσ σ"  (should be "χαος σ")
>   (string-ci=? "Straße" "Strasse") => #f        (should be #t)
>   (string-ci=? "ΧΑΟΣ" "χαοσ")      => #f        (should be #t)

I need to think some more about the guts of the discussion ;-), but in
the meantime I’ve added these to i18n.test and they pass (thanks,
Glibc!).

Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-17 17:58                 ` Mark H Weaver
@ 2011-03-18  0:10                   ` Thien-Thi Nguyen
  2011-03-18  1:38                     ` Mark H Weaver
  2011-03-20 22:12                   ` Ludovic Courtès
  1 sibling, 1 reply; 55+ messages in thread
From: Thien-Thi Nguyen @ 2011-03-18  0:10 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

() Mark H Weaver <mhw@netris.org>
() Thu, 17 Mar 2011 13:58:42 -0400

   * regexp search: The search itself can be implemented bytewise, exactly
     as if it was a fixed-width encoding.  Compiling the regexp can
     _almost_ be implemented as if the UTF-8-encoded regexp was in a
     fixed-width encoding, with just one added complication: a multibyte
     character followed by `*', `?' etc, must be compiled in such a way
     that the suffix operator applies to the whole character, and not just
     its final byte.  (In practice, it's probably more straightforward to
     handle compilation somewhat differently than outlined here, but you
     get the idea).

In unibyte land, "." matches a byte.  OK.

In multibyte land done "bytewise", "." matches ____________.
(What goes in the blank?)



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-18  0:10                   ` Thien-Thi Nguyen
@ 2011-03-18  1:38                     ` Mark H Weaver
  2011-03-18  8:46                       ` Thien-Thi Nguyen
  0 siblings, 1 reply; 55+ messages in thread
From: Mark H Weaver @ 2011-03-18  1:38 UTC (permalink / raw)
  To: Thien-Thi Nguyen; +Cc: Ludovic Courtès, guile-devel

Thien-Thi Nguyen <ttn@gnuvola.org> writes:
> In unibyte land, "." matches a byte.  OK.
>
> In multibyte land done "bytewise", "." matches ____________.
> (What goes in the blank?)

"." (and more generally [^...]) is equivalent to (a|b|c|d|...) where
every valid UTF-8 character is present in the disjunction except for the
excluded characters.  However, this case does indeed require special
consideration for the sake of efficiency in the regexp compiler.

If we may assume that the searched string is valid UTF-8, and when only
ASCII characters are excluded (e.g. "."), then three additional states
are required in the generated DFA.  Let us call them S1, S2, and S3.  S1
accepts one arbitrary byte, i.e. it has an outgoing edge for every byte
which leads to the final state.  S2 accepts any two bytes, i.e. it has
an outgoing edge for every byte which leads to S1.  Similarly for S3.

Now, the start state of "." (or [^...]) has outgoing edges for every
byte from 0x80 to 0xFF, leading to S1, S2, or S3, depending on the high
bits of the first byte.  Specifically, byte values 0xC0 through 0xDF
lead to S1, 0xE0 through 0xEF lead to S2, and 0xF0 through 0xF7 lead to
S3.

When non-ASCII characters are excluded, additional states must be added:
one for each unique prefix of the excluded multibyte characters.  It's
quite straightforward.

   Regards,
     Mark



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-18  1:38                     ` Mark H Weaver
@ 2011-03-18  8:46                       ` Thien-Thi Nguyen
  2011-03-18 12:05                         ` Mark H Weaver
  0 siblings, 1 reply; 55+ messages in thread
From: Thien-Thi Nguyen @ 2011-03-18  8:46 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

() Mark H Weaver <mhw@netris.org>
() Thu, 17 Mar 2011 21:38:28 -0400

   If we may assume that the searched string is valid UTF-8, and when only
   ASCII characters are excluded (e.g. "."), then three additional states
   are required in the generated DFA.  Let us call them S1, S2, and S3.

   [handling these states]

   When non-ASCII characters are excluded, additional states must be added:
   one for each unique prefix of the excluded multibyte characters.  It's
   quite straightforward.

I don't understand what "excluded" means here.  Is this a property
of the string, the regexp, the (dynamic, environmental) operation, or ...?



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-18  8:46                       ` Thien-Thi Nguyen
@ 2011-03-18 12:05                         ` Mark H Weaver
  0 siblings, 0 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-18 12:05 UTC (permalink / raw)
  To: Thien-Thi Nguyen; +Cc: Ludovic Courtès, guile-devel

Thien-Thi Nguyen <ttn@gnuvola.org> writes:

> () Mark H Weaver <mhw@netris.org>
> () Thu, 17 Mar 2011 21:38:28 -0400
>
>    If we may assume that the searched string is valid UTF-8, and when only
>    ASCII characters are excluded (e.g. "."), then three additional states
>    are required in the generated DFA.  Let us call them S1, S2, and S3.
>
>    [handling these states]
>
>    When non-ASCII characters are excluded, additional states must be added:
>    one for each unique prefix of the excluded multibyte characters.  It's
>    quite straightforward.
>
> I don't understand what "excluded" means here.  Is this a property
> of the string, the regexp, the (dynamic, environmental) operation, or ...?

The excluded characters are the ones listed inside of a [^...] form.
For example, [^abc] excludes only the ASCII characters #\a, #\b, and
#\c.  [^abcλ] excludes also the non-ASCII character #\λ.

"." is equivalent to a [^...] form that excludes only newlines.


I should also note that although UTF-8 causes a few complications when
compiling regexps, UTF-32 creates much worse problems that require a
different DFA data structure and a slower scan.

In the UTF-8 case, each state in the DFA can contain a fixed lookup
table with 256 entries (one for each byte) as is used for ASCII or
Latin1.  However, in the UTF-32 case, it is _not_ practical for each
state to contain a lookup table with over 1 million entries (one for
each code point).

    Mark



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-16  1:12         ` Mark H Weaver
  2011-03-16 11:26           ` Ludovic Courtès
  2011-03-17 21:47           ` Ludovic Courtès
@ 2011-03-19 12:31           ` Andy Wingo
  2011-03-19 14:06             ` Mark H Weaver
  2011-03-29 12:39             ` Peter Brett
  2 siblings, 2 replies; 55+ messages in thread
From: Andy Wingo @ 2011-03-19 12:31 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

Greetings,

On Wed 16 Mar 2011 02:12, Mark H Weaver <mhw@netris.org> writes:

> Ludovic, Andy and I discussed this on IRC, and came to the conclusion
> that UTF-8 should be the encoding assumed by functions such as
> scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic,
> scm_c_export, scm_c_define_module, scm_c_resolve_module,
> scm_c_use_module, etc.

Can we step back a little and revisit this decision?

Clearly, we need to specify the encoding for these procedures, and have
it not be locale encoding.  However I don't think we would be breaking
anyone's code if we simply restricted it to 7-bit ASCII.

I am quite sensitive to the "justice" argument -- that we not restrict
the names our users give to Scheme identifiers, or the characters they
use in their strings.  But these values typically come from literals in
C source code, which has no portable superset of ASCII.

Furthermore, such a default would not restrict our users at all -- they
can always use the non-_c_ variants with a symbol explicitly constructed
with (e.g.) scm_from_utf8_symbol.

Finally, users are moving away from these functions anyway.  The thing
to do now is to write Scheme, not C: and in Scheme we do the Right
Thing.

So let's not let this particular consideration weigh too heavily on our
choice of character encoding.

                              *   *   *

And on the meta-level... I'm really happy with Guile 2.0, that we've got
more traffic on the mailing list, more contributors (like yourself!),
more folks on IRC.  This is all fantastic.  We need to take advantage of
this energy, but also not be overwhelmed by it ;-)

Important discussions should still take place on the mailing list, with
due calmness and consideration.  This allows everyone affected to
participate, whether or not they use IRC and happen to be online then.

Also, just from a personal perspective, I find that I think much better
when looking at a message-mode buffer than at an IRC window ;-)

Back to this question, though: I don't know what to think yet.  I'll
read the thread again.

Best regards,

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: O(1) accessors for UTF-8 backed strings
  2011-03-15 15:46           ` Mark H Weaver
  2011-03-16  0:07             ` Alex Shinn
@ 2011-03-19 13:02             ` Andy Wingo
  1 sibling, 0 replies; 55+ messages in thread
From: Andy Wingo @ 2011-03-19 13:02 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

On Tue 15 Mar 2011 16:46, Mark H Weaver <mhw@netris.org> writes:

> I'd also like to point out that the R6RS is the only relevant standard
> that mandates O(1) string accessors.  The R5RS did not require this, and
> WG1 for the R7RS has already voted against this requirement.
>
>   http://trac.sacrideo.us/wg/ticket/27

Only tangentially related to the question under discussion, but I find
this a bit baffling.  I found
http://trac.sacrideo.us/wg/wiki/StringRepresentations, but it's a lot of
original work (record1 ??), and doesn't lead to any conclusions.

http://trac.sacrideo.us/wg/wiki/WG1Ballot1Results lists no rationales.

Grr.

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-19 12:31           ` Andy Wingo
@ 2011-03-19 14:06             ` Mark H Weaver
  2011-03-19 14:53               ` Noah Lavine
                                 ` (4 more replies)
  2011-03-29 12:39             ` Peter Brett
  1 sibling, 5 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-19 14:06 UTC (permalink / raw)
  To: Andy Wingo; +Cc: Ludovic Courtès, guile-devel

Andy Wingo <wingo@pobox.com> writes:
>> Ludovic, Andy and I discussed this on IRC, and came to the conclusion
>> that UTF-8 should be the encoding assumed by functions such as
>> scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic,
>> scm_c_export, scm_c_define_module, scm_c_resolve_module,
>> scm_c_use_module, etc.
>
> Can we step back a little and revisit this decision?
>
> Clearly, we need to specify the encoding for these procedures, and have
> it not be locale encoding.  However I don't think we would be breaking
> anyone's code if we simply restricted it to 7-bit ASCII.
>
> I am quite sensitive to the "justice" argument -- that we not restrict
> the names our users give to Scheme identifiers, or the characters they
> use in their strings.  But these values typically come from literals in
> C source code, which has no portable superset of ASCII.

Not everyone writes portable code.  Who here limits their code to the
R6RS and avoids all Guile-specific features?  Portability may be
something to strive for, but when compelling reasons dictate otherwise,
it's not unreasonable to limit your portability to better compilers like
gcc.

For those who don't speak English but wish to hack with Guile, being
able to write code in their own language is a compelling reason.

Anyway, one can only hope that some future C standard supports Unicode,
but if the folks who control those standards don't give a damn about
non-English speakers, that doesn't mean we should follow their example.

> Furthermore, such a default would not restrict our users at all -- they
> can always use the non-_c_ variants with a symbol explicitly constructed
> with (e.g.) scm_from_utf8_symbol.

We have those convenience functions for a reason.  You recently proposed
several more convenience functions, so apparently you prefer to save
keystrokes like the rest of us.  I'm sure our non-English-speaking
comrades feel the same way.

Let me ask you this: why would you oppose changing the scm_c_ functions
to use UTF-8 by default?  If you're comfortable with ASCII-only names,
then UTF-8 will work fine for you, since ASCII strings are unchanged in
UTF-8.

> Finally, users are moving away from these functions anyway.  The thing
> to do now is to write Scheme, not C: and in Scheme we do the Right
> Thing.

If you write all your code in Scheme now, then you should care even less
about the scm_c_ functions.  So why oppose what you recently agreed to?


As a meta-comment: I've grown rather weary from fighting this battle
alone.  My hacking has completely stopped because of this argument.  To
those of you out there who care about this issue, please let your voices
be heard.  I know you're there, because a few of you stated your
opinions rather strongly on IRC.  If others don't join in soon, I'm
likely to soon give up on this, and be left with rather less enthusiasm
for Guile than when I started, I'm sorry to say.

      Mark



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-19 14:06             ` Mark H Weaver
@ 2011-03-19 14:53               ` Noah Lavine
  2011-03-19 15:49                 ` Mark H Weaver
  2011-03-19 15:08               ` Andy Wingo
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Noah Lavine @ 2011-03-19 14:53 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Andy Wingo, Ludovic Courtès, guile-devel

Hello all,

>> Furthermore, such a default would not restrict our users at all -- they
>> can always use the non-_c_ variants with a symbol explicitly constructed
>> with (e.g.) scm_from_utf8_symbol.
>
> We have those convenience functions for a reason.  You recently proposed
> several more convenience functions, so apparently you prefer to save
> keystrokes like the rest of us.  I'm sure our non-english-speaking
> comrades feel the same way.

I'm afraid I don't know enough about this issue to understand it
completely, but some idea seemed a little off. I think there are two
questions being conflated here: what Guile's internal string
representation should be, and what convenience functions should be
provided for users to easily make symbols.

The string representation is a somewhat difficult choice involving
tradeoffs. However, it seems to me that the functions provided to make
symbols should be more or less independent of that. After all, these
are *convenience* functions, whose functionality could always be
implemented in the more general underlying API. So I think the thing
to do here is make a list of what string formats you think people
should easily be able to use, and then make convenience functions for
each of them. Clearly ASCII is one. Perhaps UTF-8 is another.

(Unless of course you mean macros that need to define things at
compile time. In that case, you're really limited by C's lack of
compile-time computation ability, and I think the only real solution
would be to implement a preprocessor that can convert strings to
different formats. For instance, Guile could easily be such a
preprocessor.)

Noah



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-19 14:06             ` Mark H Weaver
  2011-03-19 14:53               ` Noah Lavine
@ 2011-03-19 15:08               ` Andy Wingo
  2011-03-19 19:43                 ` Mark H Weaver
  2011-03-19 16:37               ` Mark H Weaver
                                 ` (2 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Andy Wingo @ 2011-03-19 15:08 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

Hello!

I have been sitting in the sun pondering all this for an hour or so now,
and there is a lot to say, I think.  But we should get this out of the
way first:

On Sat 19 Mar 2011 15:06, Mark H Weaver <mhw@netris.org> writes:

> As a meta-comment: I've grown rather weary from fighting this battle
> alone.  My hacking has completely stopped because of this argument.  To
> those of you out there who care about this issue, please let your voices
> be heard.  I know you're there, because a few of you stated your
> opinions rather strongly on IRC.  If others don't join in soon, I'm
> likely to soon give up on this, and be left with rather less enthusiasm
> for Guile than when I started, I'm sorry to say.

Have patience :)  We will get there in time.  The process of
consensus-building is work.  This is a very important decision to make,
and its engineering implications are large.  We've only been discussing
it for a week :)

Wide-ranging opinions are important, but ultimately the impacts of this
decision rest on the shoulders of the maintainers of this code, which up
to now has mostly been Mike, and to an extent Ludovic and myself.  It
goes without saying that we would be most pleased to have your
hack-power pointed in its direction, but we need to all be on the same
page.  There's no need for "fighting battles" here; military
metaphors simply don't apply to hackers needing to understand each other
and talk over a problem.

If it's stressing you out, I suggest focusing on something else for the
meantime.  I suspect this thread will go on for at least a couple more
weeks.  At the end of it, whenever it ends, I hope we will all agree on
the right thing to do.

Peace,

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-19 14:53               ` Noah Lavine
@ 2011-03-19 15:49                 ` Mark H Weaver
  0 siblings, 0 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-19 15:49 UTC (permalink / raw)
  To: Noah Lavine; +Cc: Andy Wingo, Ludovic Courtès, guile-devel

Noah Lavine <noah.b.lavine@gmail.com> writes:
> I think there are two questions being conflated here: what Guile's
> internal string representation should be, and what convenience
> functions should be provided for users to easily make symbols.

Yes, you are absolutely right.  They are two separate questions.  They
were clearly separated in my mind, but I should have made that more
clear in my writing.

> So I think the thing to do here is make a list of what string formats
> you think people should easily be able to use, and then make
> convenience functions for each of them. Clearly ASCII is one. Perhaps
> UTF-8 is another.

We could do that, but there are a lot of these functions.  Making too
many different variants would significantly increase the number of
functions in our API.  I could perhaps be convinced that this is
worthwhile, but at present it seems to me that UTF-8 should be enough,
since ASCII strings are unchanged in UTF-8.

The only advantage I see to having ASCII variants is that they could be
implemented more efficiently, but the added speed compared to UTF-8
would almost certainly be lost in the noise unless the names were
absurdly long.

    Best,
     Mark



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-19 14:06             ` Mark H Weaver
  2011-03-19 14:53               ` Noah Lavine
  2011-03-19 15:08               ` Andy Wingo
@ 2011-03-19 16:37               ` Mark H Weaver
  2011-03-20 21:49               ` Ludovic Courtès
  2011-03-30  9:50               ` Andy Wingo
  4 siblings, 0 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-19 16:37 UTC (permalink / raw)
  To: Andy Wingo; +Cc: Ludovic Courtès, guile-devel

Andy Wingo <wingo@pobox.com> writes:
> I am quite sensitive to the "justice" argument -- that we not restrict
> the names our users give to Scheme identifiers, or the characters they
> use in their strings.  But these values typically come from literals in
> C source code, which has no portable superset of ASCII.

I should also mention that the only thing needed from a C compiler in
order to accept UTF-8 string constants is to treat 0x80-0xFF like any
other non-special characters.  In other words, it takes _no_ effort on
their part to support this.  In fact, it requires effort for them to
prevent it, by explicitly checking for 0x80-0xFF and reporting an error
in that case.

     Mark



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-19 15:08               ` Andy Wingo
@ 2011-03-19 19:43                 ` Mark H Weaver
  0 siblings, 0 replies; 55+ messages in thread
From: Mark H Weaver @ 2011-03-19 19:43 UTC (permalink / raw)
  To: Andy Wingo; +Cc: Ludovic Courtès, guile-devel

Andy Wingo <wingo@pobox.com> writes:
> Have patience :)  We will get there in time.  The process of
> consensus-building is work.  This is a very important decision to make,
> and its engineering implications are large.  We've only been discussing
> it for a week :)

Words of wisdom, to be sure.  For what it's worth, I very much
appreciate not only your excellent hacks, but also your talents at
fostering a friendly environment in which a community can flourish.
Ultimately, that's far more important than any technical issue.

Just please be aware that it is easier to be relaxed about a debate when
you're comfortable with the status quo, and when your approval is
required before changes can be made.  Don't get me wrong; I do not
begrudge that you have this authority.  You are a wise and fair-minded
maintainer, and saying "no" (when warranted) is one of your most
important roles.

I merely point out that, besides any differences in personality (and I
concede that I have much to learn from you about how to foster a
friendly community), in this debate you are in an inherently more
comfortable position than I am.

   Regards,
     Mark



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-19 14:06             ` Mark H Weaver
                                 ` (2 preceding siblings ...)
  2011-03-19 16:37               ` Mark H Weaver
@ 2011-03-20 21:49               ` Ludovic Courtès
  2011-03-30  9:50               ` Andy Wingo
  4 siblings, 0 replies; 55+ messages in thread
From: Ludovic Courtès @ 2011-03-20 21:49 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Andy Wingo, guile-devel

Hello Unicode fellows!  :-)

Mark H Weaver <mhw@netris.org> writes:

> Andy Wingo <wingo@pobox.com> writes:
>>> Ludovic, Andy and I discussed this on IRC, and came to the conclusion
>>> that UTF-8 should be the encoding assumed by functions such as
>>> scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic,
>>> scm_c_export, scm_c_define_module, scm_c_resolve_module,
>>> scm_c_use_module, etc.
>>
>> Can we step back a little and revisit this decision?
>>
>> Clearly, we need to specify the encoding for these procedures, and have
>> it not be locale encoding.  However I don't think we would be breaking
>> anyone's code if we simply restricted it to 7-bit ASCII.

[...]

> For those who don't speak English but wish to hack with Guile, being
> able to write code in their own language is a compelling reason.

As Andy said, our intent is to have people write Scheme code, not C.
Scheme code supports native languages.

[...]

> We have those convenience functions for a reason.  You recently proposed
> several more convenience functions, so apparently you prefer to save
> keystrokes like the rest of us.  I'm sure our non-English-speaking
> comrades feel the same way.
>
> Let me ask you this: why would you oppose changing the scm_c_ functions
> to use UTF-8 by default?

I knew adding these C functions would amount to opening a can of worms. :-)

So I understand your comment.  This would have been a non-issue in
Scheme code but in C we're in trouble.

Up to 1.8 ‘const char *’ in Guile’s API implicitly meant ASCII or
locale-encoded strings; internally, it was only ASCII though.

Thanks,
Ludo’.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-17 17:58                 ` Mark H Weaver
  2011-03-18  0:10                   ` Thien-Thi Nguyen
@ 2011-03-20 22:12                   ` Ludovic Courtès
  2011-03-30 10:14                     ` Andy Wingo
  1 sibling, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2011-03-20 22:12 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

Hi Mark,

Mark H Weaver <mhw@netris.org> writes:

> ludo@gnu.org (Ludovic Courtès) writes:
>>> We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs
>>> to UTF-8, along with a flag that indicates whether it is known to be
>>> ASCII-only.
>>
>> The whole point of the narrow/wide distinction was to avoid
>> variable-width encodings.  In addition, we’d end up with 3 cases (ASCII,
>> UTF-8, or UTF-32) instead of 2, which seems quite complex to me.
>
> Most functions would not care about the known_ascii_only flag, so really
> it's just two cases.  (As you know, I'd prefer to have only one case).

What about string-ref/set!?  Should they widen their argument as well
when it’s UTF-8?

>> What do you think of moving to narrow = ASCII, as I suggested earlier?
>
> The problem is that the narrow-wide cases will be much more common in
> your scheme, and none of us has a good solution to those cases.  All the
> solutions that handle those cases efficiently involve an unacceptable
> increase in code complexity.

IMO ASCII + UCS-4 can only be simpler than ASCII + UTF-8 + UCS-4.

> In your scheme, a large number of common operations will require
> widening strings, which is bad for efficiency, in both space and time
> for the common operations.

My feeling is that widening will be rare enough.  For instance, most of
the time, programs compare strings in the same language, which goes
through the fast path of ‘string=?’.

So here’s a plan for 2.0.x.  Remember that a design goal was to have
constant-time string-ref/set!; this is debatable, I agree, but Mike, I
and others on r6rs-discuss back then thought it was desirable.

To fix the bugs you identified in 2.0.x, I’m in favor of a
narrow = ASCII scheme and of applying Mike’s suggestion.  It should
allow us to fix our bugs with minimal changes.

For 2.1.x, things are different.  I’m happy to revisit not only the
internal storage approach but also the O(1) ref/set! (the latter should
be discussed in light of the trend in other Schemes, though.)

How does that sound?

> You may not realize the extent to which UTF-8's special properties
> mostly eliminate the usual disadvantages of variable-width encodings.
> Please allow me to explain how the most common string operations can be
> implemented on UTF-8 strings.

Thanks for the info!  I was probably not aware of all these properties.
But again, one design constraint in 2.0 was to provide O(1) random
access.  Sure this is doable with variable-width encoding as Clinger et
al. note at <http://trac.sacrideo.us/wg/wiki/StringRepresentations>, but
the narrow/wide scheme we came up with looked like a good trade-off.

Again, if you want to experiment with UTF-8 for internal storage, then
2.1 is yours.  ;-)

BTW, giving up O(1) access may be more work than it seems.  For instance
‘string-fold’, ‘string-map’, & co. would need to be compiled
efficiently, which won’t happen until we have an inliner in the
compiler, etc.  But all these could be worthy goals for 2.1/2.2.

Thanks,
Ludo’.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-19 12:31           ` Andy Wingo
  2011-03-19 14:06             ` Mark H Weaver
@ 2011-03-29 12:39             ` Peter Brett
  2011-03-29 13:35               ` Andy Wingo
  2011-03-29 21:15               ` Ludovic Courtès
  1 sibling, 2 replies; 55+ messages in thread
From: Peter Brett @ 2011-03-29 12:39 UTC (permalink / raw)
  To: guile-devel

Andy Wingo <wingo@pobox.com> writes:

> Finally, users are moving away from these functions anyway.  The thing
> to do now is to write Scheme, not C: and in Scheme we do the Right
> Thing.

I hope I'm misinterpreting this statement.  Using Guile as an extension
language for applications depends *heavily* on writing large amounts of
C to provide fundamental API functions that Scheme code can then invoke
to manipulate the application's state.  Is this discouraged with Guile
2.0, then?

> So let's not let this particular consideration weigh too heavily on our
> choice of character encoding.
>

It would certainly make my life as a downstream application maintainer
much, much easier if all Guile API functions that accept a C string
argument expected UTF-8.  Having to double-check (either by an explicit
save/restore or by code auditing) that the locale is set up correctly
whenever a string is passed to Guile is a pain.

                          Peter

-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-29 12:39             ` Peter Brett
@ 2011-03-29 13:35               ` Andy Wingo
  2011-03-29 21:15               ` Ludovic Courtès
  1 sibling, 0 replies; 55+ messages in thread
From: Andy Wingo @ 2011-03-29 13:35 UTC (permalink / raw)
  To: Peter Brett; +Cc: guile-devel

Hi Peter,

On Tue 29 Mar 2011 14:39, Peter Brett <peter@peter-b.co.uk> writes:

> Andy Wingo <wingo@pobox.com> writes:
>
>> Finally, users are moving away from these functions anyway.  The thing
>> to do now is to write Scheme, not C: and in Scheme we do the Right
>> Thing.
>
> I hope I'm misinterpreting this statement.  Using Guile as an extension
> language for applications depends *heavily* on writing large amounts of
> C to provide fundamental API functions that Scheme code can then invoke
> to manipulate the application's state.  Is this discouraged with Guile
> 2.0, then?

The short answer is, no, Guile still has a good C API, and its C API is
improving as well.

The long answer is that, given a choice, users should write in Scheme.
In the past there wasn't much of a choice for some things, as Guile was
slower and did not offer any dynamic FFI capability.  But now that Guile
is more capable, people should reconsider, even if they end up staying
with C.

Furthermore, the discussion at hand is about which encoding to choose
for routines taking a const char*.  We all agree that it should not, for
most cases, be the locale.  The thing under discussion is should it be
ASCII or latin1 or utf8.

Note that this dilemma was not to be had with pre-2.0 Guile, as pre-2.0
Guile did not do unicode properly.

>> So let's not let this particular consideration weigh too heavily on our
>> choice of character encoding.
>
> It would certainly make my life as a downstream application maintainer
> much, much easier if all Guile API functions that accept a C string
> argument expected UTF-8.

Understood.  It looks like we'll go that way for 2.2.

> Having to double-check (either by an explicit save/restore or by code
> auditing) that the locale is set up correctly whenever a string is
> passed to Guile is a pain.

This should not be the case now -- you can use scm_from_utf8_string,
etc., and as long as you are referencing ASCII identifiers, the locale
shouldn't matter for scm_c_module_ref et al.

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-29 12:39             ` Peter Brett
  2011-03-29 13:35               ` Andy Wingo
@ 2011-03-29 21:15               ` Ludovic Courtès
  2011-03-31 14:59                 ` Peter Brett
  1 sibling, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2011-03-29 21:15 UTC (permalink / raw)
  To: guile-devel

Hi Peter,

Peter Brett <peter@peter-b.co.uk> writes:

> It would certainly make my life as a downstream application maintainer
> much, much easier if all Guile API functions that accept a C string
> argument expected UTF-8.

Out of curiosity, what kind of C libraries and tools do you use?

My impression is that GLib & co. are well-equipped to deal with UTF-8
whereas other C libraries and programs would rather work with locale
encoding or ‘wchar_t’ using the standard C APIs.

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-11 23:09   ` Mark H Weaver
  2011-03-12 13:46     ` Ludovic Courtès
@ 2011-03-30  9:03     ` Andy Wingo
  2011-03-31 14:19       ` Ludovic Courtès
  1 sibling, 1 reply; 55+ messages in thread
From: Andy Wingo @ 2011-03-30  9:03 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

Hi Mark!

I think UTF-8 could be a good plan for 2.1/2.2, but I wanted to make
sure we understand what string-ref is good for...

On Sat 12 Mar 2011 00:09, Mark H Weaver <mhw@netris.org> writes:

> I claim that any reasonable code which currently uses string-ref and
> string-set! could be more cleanly written using string ports or
> string-{fold,unfold}{,-right}.

Folding and unfolding traverses the entire string.  Sometimes this is
indeed what you want.

But sometimes you want to search in a string, roll back and forward,
keep pointers to various parts in the string, etc.  Integer indices work
well as pointers to parts of strings, is what I'm saying here.  They are
immediate values that can be compared with < and >.

Any change to Guile's internal character encoding should not start from
the premise that string-ref is obsolete or unimportant, especially
considering that there is no other standard "string pointer" mechanism.

Regards,

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-13 21:30         ` Ludovic Courtès
@ 2011-03-30  9:05           ` Andy Wingo
  0 siblings, 0 replies; 55+ messages in thread
From: Andy Wingo @ 2011-03-30  9:05 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-devel

On Sun 13 Mar 2011 22:30, ludo@gnu.org (Ludovic Courtès) writes:

> So yes, the current implementation has bugs, but I think most if not all
> can be fixed with minimal changes.  Would you like to look into it
> for 2.0.x?

I very much agree with this sentiment for 2.0.x.  Let's not let the
perfect be the enemy of the good.  We can revisit things on master, for
2.2.

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: O(1) accessors for UTF-8 backed strings
  2011-03-13  4:05       ` O(1) accessors for UTF-8 backed strings Mark H Weaver
  2011-03-13  4:42         ` Alex Shinn
@ 2011-03-30  9:20         ` Andy Wingo
  1 sibling, 0 replies; 55+ messages in thread
From: Andy Wingo @ 2011-03-30  9:20 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

On Sun 13 Mar 2011 05:05, Mark H Weaver <mhw@netris.org> writes:

> ludo@gnu.org (Ludovic Courtès) writes:
>> I also think strings should remain what they currently are, with O(1)
>> random access.
>
> I just realized that it is possible to implement O(1) accessors for
> UTF-8 backed strings.

If we switch to UTF-8, I like this proposal, generally.

It would be good if it continued to support cheap, shared substrings,
but without the mutexen we have now.  Perhaps it would also be good to
do away with stringbuf objects entirely; dunno.

O(log N) accessors would also be fine with me, e.g. via skip lists or
tries or something.

A tricky problem!

Regards,

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-15 22:49     ` Mark H Weaver
  2011-03-16  0:01       ` Mike Gran
@ 2011-03-30  9:33       ` Andy Wingo
  1 sibling, 0 replies; 55+ messages in thread
From: Andy Wingo @ 2011-03-30  9:33 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

On Tue 15 Mar 2011 23:49, Mark H Weaver <mhw@netris.org> writes:

>> Well, we covered O(1) vs O(n).  To make UTF-8 O(1), you need to store
>> additional indexing information of some sort.  There are various schemes,
>> but, depending on the scheme, you lose some of the memory advantage
>> of UTF-8 vs UTF-32.  You can likely do better than UTF-32, though.
>
> I would prefer to either let our accessors be O(n), or else to create
> the index lazily, i.e. on the first usage of string-ref or string-set!
> In such a scheme, very few strings would include indices, and thus the
> overhead would be minimal.
>
> Anyway, the index overhead can be made arbitrarily small by increasing
> the chunk size.  It is a classic time-space trade-off here.  The chunk
> size could be made larger over the years, as usage of string-ref and
> string-set! become less common, and eventually the index stuff could be
> removed entirely.

Though I agree that string-set! should be discouraged -- as Clinger also
thought back in 1984, it seems -- string-ref is still important.  The
only thing that could replace it would be some sort of string cursor /
iteration protocol, and I would prefer for that to be standard (SRFI or
otherwise).

So let's factor string-ref into the "costs" of a potential switch to
UTF-8, be it in space or in time or whatever.

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-19 14:06             ` Mark H Weaver
                                 ` (3 preceding siblings ...)
  2011-03-20 21:49               ` Ludovic Courtès
@ 2011-03-30  9:50               ` Andy Wingo
  4 siblings, 0 replies; 55+ messages in thread
From: Andy Wingo @ 2011-03-30  9:50 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

On Sat 19 Mar 2011 15:06, Mark H Weaver <mhw@netris.org> writes:

> Let me ask you this: why would you oppose changing the scm_c_ functions
> to use UTF-8 by default?  If you're comfortable with ASCII-only names,
> then UTF-8 will work fine for you, since ASCII strings are unchanged in
> UTF-8.

This is true.  Well, I think UTF-8 is the Right Thing, but it is
complicated.  This discussion reminds me of the luserspace problem from
Worse is Better ;-)

I think I was clinging to ASCII because C is ASCII.  No one is clamoring
for "if" in Spanish, much less katakana identifiers.  I was thinking
that since we could not reap the benefits, why bother paying the cost.
But you have a good point that that might change in the future; who
knows?

If we do switch to UTF-8 in 2.2, though, then there's no real cost;
there's only benefits.  So perhaps we should just stop worrying about
potential costs, and learn to love the benefits (!), and just do the
UTF-8 thing.

I'm still being a little cagey in these replies as I haven't gotten to
the end of the thread yet :)

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-20 22:12                   ` Ludovic Courtès
@ 2011-03-30 10:14                     ` Andy Wingo
  0 siblings, 0 replies; 55+ messages in thread
From: Andy Wingo @ 2011-03-30 10:14 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Mark H Weaver, guile-devel

On Sun 20 Mar 2011 23:12, ludo@gnu.org (Ludovic Courtès) writes:

> For 2.1.x, things are different.  I’m happy to revisit not only the
> internal storage approach but also the O(1) ref/set! (the latter should
> be discussed in light of the trend in other Schemes, though.)

[...]

> Again, if you want to experiment with UTF-8 for internal storage, then
> 2.1 is yours.  ;-)

At the end of all this, I also agree that we should look into UTF-8 for
2.1/2.2.  The advantage of having a single string representation is good
enough for me; and that representation doesn't present bit memory-usage
gotchas (like our current UTF-32 if any char is non-latin1), then so
much the better.  It would be nice to get rid of mutexen at the same
time, given that UTF-8 can't be usefully modified in place.

That said, I would like to see a branch before merging into master.  If
the advantages are as great as we hope, it should be an obvious choice
to merge the branch in as soon as things look promising.

Regards,

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-30  9:03     ` Using libunistring for string comparisons et al Andy Wingo
@ 2011-03-31 14:19       ` Ludovic Courtès
  0 siblings, 0 replies; 55+ messages in thread
From: Ludovic Courtès @ 2011-03-31 14:19 UTC (permalink / raw)
  To: guile-devel

Hello,

Andy Wingo <wingo@pobox.com> writes:

> Any change to Guile's internal character encoding should not start from
> the premise that string-ref is obsolete or unimportant, especially
> considering that there is no other standard "string pointer" mechanism.

+1

There are idioms like:

  (let ((start (string-index s #\,))
        (end   (string-rindex s #\,)))
    (substring s (+ 1 start) end))

which may not have a better equivalent (sometimes ‘string-tokenize’ can
be used, sometimes not.)

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-29 21:15               ` Ludovic Courtès
@ 2011-03-31 14:59                 ` Peter Brett
  2011-03-31 20:12                   ` Ludovic Courtès
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Brett @ 2011-03-31 14:59 UTC (permalink / raw)
  To: guile-devel

ludo@gnu.org (Ludovic Courtès) writes:

> Hi Peter,
>
> Peter Brett <peter@peter-b.co.uk> writes:
>
>> It would certainly make my life as a downstream application maintainer
>> much, much easier if all Guile API functions that accept a C string
>> argument expected UTF-8.
>
> Out of curiosity, what kind of C libraries and tools do you use?
>
> My impression is that GLib & co. are well-equipped to deal with UTF-8
> whereas other C libraries and programs would rather work with locale
> encoding or ‘wchar_t’ using the standard C APIs.
>

Yep, lots of GLib.  Our file formats use UTF-8, and all of our
non-Scheme code uses UTF-8.  We are pretty careful about ensuring that
our C code is UTF-8 safe (using GLib!), but spend a lot of time crossing
our fingers and hoping that if we pass a string to Guile we'll get it
back in the right encoding.  Note that we have to support Guile 1.8.x as
well, probably until at least 2013.

Some tips on the safest way to do this would be great (but probably
belong on guile-user or [even better] in the fine manual).

                              Peter

-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Using libunistring for string comparisons et al
  2011-03-31 14:59                 ` Peter Brett
@ 2011-03-31 20:12                   ` Ludovic Courtès
  0 siblings, 0 replies; 55+ messages in thread
From: Ludovic Courtès @ 2011-03-31 20:12 UTC (permalink / raw)
  To: guile-devel

Hi,

Peter Brett <peter@peter-b.co.uk> writes:

> ludo@gnu.org (Ludovic Courtès) writes:

[...]

>> My impression is that GLib & co. are well-equipped to deal with UTF-8
>> whereas other C libraries and programs would rather work with locale
>> encoding or ‘wchar_t’ using the standard C APIs.
>>
>
> Yep, lots of GLib.  Our file formats use UTF-8, and all of our
> non-Scheme code uses UTF-8.

File formats aren’t a problem since port I/O can use your encoding of
choice.

> We are pretty careful about ensuring that our C code is UTF-8 safe
> (using GLib!), but spend a lot of time crossing our fingers and hoping
> that if we pass a string to Guile we'll get it back in the right
> encoding.  Note that we have to support Guile 1.8.x as well, probably
> until at least 2013.
>
> Some tips on the safest way to do this would be great (but probably
> belong on guile-user or [even better] in the fine manual).

NEWS for 2.0.0 explains the differences in details, sometimes with
examples on how to transition from 1.8.  You may find the deprecation
mechanism helpful too.

Hope this helps,
Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2011-03-31 20:12 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-11  0:54 uc_tolower (uc_toupper (x)) Mike Gran
2011-03-11 22:33 ` Using libunistring for string comparisons et al Mark H Weaver
2011-03-11 22:36   ` Mark H Weaver
2011-03-11 23:09   ` Mark H Weaver
2011-03-12 13:46     ` Ludovic Courtès
2011-03-12 17:28       ` Mark H Weaver
2011-03-13 21:30         ` Ludovic Courtès
2011-03-30  9:05           ` Andy Wingo
2011-03-13  4:05       ` O(1) accessors for UTF-8 backed strings Mark H Weaver
2011-03-13  4:42         ` Alex Shinn
2011-03-15 15:46           ` Mark H Weaver
2011-03-16  0:07             ` Alex Shinn
2011-03-19 13:02             ` Andy Wingo
2011-03-30  9:20         ` Andy Wingo
2011-03-30  9:03     ` Using libunistring for string comparisons et al Andy Wingo
2011-03-31 14:19       ` Ludovic Courtès
2011-03-12 13:36   ` Ludovic Courtès
  -- strict thread matches above, loose matches on Subject: below --
2011-03-12 21:28 Mike Gran
2011-03-15 17:20 ` Mark H Weaver
2011-03-15 20:39   ` Mike Gran
2011-03-15 22:49     ` Mark H Weaver
2011-03-16  0:01       ` Mike Gran
2011-03-16  1:12         ` Mark H Weaver
2011-03-16 11:26           ` Ludovic Courtès
2011-03-17 15:38             ` Mark H Weaver
2011-03-17 15:56               ` Ludovic Courtès
2011-03-17 17:58                 ` Mark H Weaver
2011-03-18  0:10                   ` Thien-Thi Nguyen
2011-03-18  1:38                     ` Mark H Weaver
2011-03-18  8:46                       ` Thien-Thi Nguyen
2011-03-18 12:05                         ` Mark H Weaver
2011-03-20 22:12                   ` Ludovic Courtès
2011-03-30 10:14                     ` Andy Wingo
2011-03-17 21:47           ` Ludovic Courtès
2011-03-19 12:31           ` Andy Wingo
2011-03-19 14:06             ` Mark H Weaver
2011-03-19 14:53               ` Noah Lavine
2011-03-19 15:49                 ` Mark H Weaver
2011-03-19 15:08               ` Andy Wingo
2011-03-19 19:43                 ` Mark H Weaver
2011-03-19 16:37               ` Mark H Weaver
2011-03-20 21:49               ` Ludovic Courtès
2011-03-30  9:50               ` Andy Wingo
2011-03-29 12:39             ` Peter Brett
2011-03-29 13:35               ` Andy Wingo
2011-03-29 21:15               ` Ludovic Courtès
2011-03-31 14:59                 ` Peter Brett
2011-03-31 20:12                   ` Ludovic Courtès
2011-03-30  9:33       ` Andy Wingo
2011-03-16  0:22     ` Alex Shinn
2011-03-16  1:30 Mike Gran
2011-03-16  2:03 Mike Gran
2011-03-16 15:22 Mike Gran
2011-03-16 16:58 ` Ludovic Courtès
2011-03-17 18:07 Mike Gran

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).