unofficial mirror of guile-devel@gnu.org 
* Wide strings
@ 2009-01-25 21:15 Mike Gran
  2009-01-25 22:31 ` Ludovic Courtès
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Gran @ 2009-01-25 21:15 UTC (permalink / raw)
  To: guile-devel

Hi.  I know there has been a lot of talk about wide characters and
Unicode over the years.  I'd like to see it happen, because how they are
implemented will determine the future of a couple of my side projects.
I could pitch in if you need some help.

I looked over the history of guile-devel, and there has been a
tremendous amount of discussion about it.  Also, the various Schemes
each seem to be inventing their own solutions.

Tom Lord's 2003 proposal
    http://lists.gnu.org/archive/html/guile-devel/2003-11/msg00036.html
Marius Vollmer's idea
    http://lists.gnu.org/archive/html/guile-devel/2005-08/msg00029.html
R6RS
    http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-2.html#node_chap_1
MIT Scheme
    http://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Internal-Representation-of-Characters.html

There has also been some back-and-forth about to what extent the
internal representation of strings should be accessible, whether the
internal representation should be a vector or if it can be something
more efficient, and how not to completely break regular expressions.

Also, there is the question as to whether a wide character is a
codepoint or a grapheme.

Is there a current proposal on the table for how to reach this?

If you're suffering from a dearth of opinions, I certainly have some
ideas.




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Wide strings
  2009-01-25 21:15 Wide strings Mike Gran
@ 2009-01-25 22:31 ` Ludovic Courtès
  2009-01-25 23:32   ` Neil Jerram
  2009-01-26  0:16   ` Mike Gran
  0 siblings, 2 replies; 17+ messages in thread
From: Ludovic Courtès @ 2009-01-25 22:31 UTC (permalink / raw)
  To: guile-devel

Hello!

Mike Gran <spk121@yahoo.com> writes:

> Hi.  I know there has been a lot of talk about wide characters and
> Unicode over the years.  I'd like to see it happen, because how they are
> implemented will determine the future of a couple of my side projects.
> I could pitch in if you need some help.

Indeed, it looks like you have some experience with GuCu!  ;-)

I agree it would be really nice to have Unicode support, but I'm not
aware of any "plan", so please go ahead!  :-)

A few considerations regarding the inevitable debate about the internal
string representation:

  1. IMO it'd be nice to have ASCII strings special-cased so that they
     are always encoded in ASCII.  This would allow for memory savings
     since, e.g., most symbols are expected to contain only ASCII
     characters.  It might also simplify interaction with C in certain
     cases; for instance, it would make it easy to have statically
     initialized ASCII Scheme strings [0].

  2. O(1) `string-{ref,set!}' is somewhat mandated by parts of SRFI-13.
     For instance, `substring' takes indices as parameters,
     `string-index' returns an index, etc. (John Cowan once argued that
     an abstract type to represent the position would remove this
     limitation [1], but the fact is that we have to live with SRFI-13).

  3. GLib et al. like UTF-8, and it'd be nice to minimize the overhead
     when interfacing with these libs (e.g., by avoiding translations
     from one string representation to another).

  4. It might be nice to be friendly to `wchar_t' and friends.

Interestingly, some of these things are contradictory.

Will Clinger has a good summary of a range of possible implementations:

  https://trac.ccs.neu.edu/trac/larceny/wiki/StringRepresentations

Thanks,
Ludo'.

[0] http://thread.gmane.org/gmane.lisp.guile.devel/7998
[1] http://lists.r6rs.org/pipermail/r6rs-discuss/2007-April/002252.html






* Re: Wide strings
  2009-01-25 22:31 ` Ludovic Courtès
@ 2009-01-25 23:32   ` Neil Jerram
  2009-01-26 20:24     ` Ludovic Courtès
  2009-01-26  0:16   ` Mike Gran
  1 sibling, 1 reply; 17+ messages in thread
From: Neil Jerram @ 2009-01-25 23:32 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-devel

2009/1/25 Ludovic Courtès <ludo@gnu.org>:
>
> I agree it would be really nice to have Unicode support, but I'm not
> aware of any "plan", so please go ahead!  :-)

Indeed.

> A few considerations regarding the inevitable debate about the internal
> string representation:

[...]

But what about the other possible debate, about the API?  Are you
thinking that we should accept R6RS's choice?

(I really haven't read up on all this enough - however when reading
Tom Lord's analysis just now, I was thinking "why not just specify
that things like char-upcase don't work in the difficult cases", and
it seems to me that this is what R6RS chose to do.  So at first glance
the R6RS API looks OK to me.

(Although I read them at the time, I can't remember now what Tom's
remaining concerns with the R6RS proposal were; should probably go
back and read those again.  On the other hand, Tom did eventually vote
for R6RS, so I would guess that they can't have been that bad.))

Regards,
        Neil





* Re: Wide strings
  2009-01-25 22:31 ` Ludovic Courtès
  2009-01-25 23:32   ` Neil Jerram
@ 2009-01-26  0:16   ` Mike Gran
  2009-01-26 15:21     ` Mike Gran
  2009-01-26 21:40     ` Ludovic Courtès
  1 sibling, 2 replies; 17+ messages in thread
From: Mike Gran @ 2009-01-26  0:16 UTC (permalink / raw)
  To: guile-devel

> From: Ludovic Courtès <ludo@gnu.org>

I believe that we should aim for R6RS strings.

I think the most important thing is to have humility in the face of an
impossible problem: how to encode all textual information.  It is
important to "stand on the shoulders of giants" here.  It becomes a
matter of deciding which actively developed library of wide character
functions is to be used and how to integrate it.

There are 3 good, actively developed solutions of which I am aware.

1.  Use GNU libc functionality.  Encode wide strings as wchar_t.

2.  Use GLib functionality.  Encode wide strings as UTF-8.  Possibly
give up on O(1).  Possibly add indexing information to strings to allow
O(1), which might negate the space advantage of UTF-8.
 
3.  Use IBM's ICU4c.  Encode wide strings as UTF-16.  Thus, add an
obscure dependency.

Option 3 is likely a non-starter, because it seems that Guile has
tried to avoid adding new non-GNU dependencies.  It is technologically
a great solution, IMHO.

Option 1 is probably the way to go, because it keeps Guile close to
the metal and keeps dependencies out of it.  Unfortunately, UTF-8
strings would require conversion.

>  1. IMO it'd be nice to have ASCII strings special-cased so that they
>    are always encoded in ASCII.  This would allow for memory savings
>    since, e.g., most symbols are expected to contain only ASCII
>    characters.  It might also simplify interaction with C in certain
>    cases; for instance, it would make it easy to have statically
>    initialized ASCII Scheme strings.

Why not?  It does solve the initialization problem of dealing with strings
before setlocale has been called.

Let's say that a string is a union of either an ASCII char vector or a
wchar_t vector.  A "character" then is just a Unicode codepoint.
String-ref returns a wchar_t.  This is all in line with R6RS as I
understand it.

There could then be a separate iterator and function set that does
(likely O(n)) operations on the grapheme clusters of strings.  A
grapheme cluster is a single written symbol which may be made up of
several codepoints.  Unicode Standard Annex #29 describes how to
partition a string into a set of graphemes.[1]

There is the problem of systems where wchar_t is 2 bytes instead of 4
bytes, like Cygwin.  For those systems, I'd recommend
restricting functionality to 16-bit characters instead of trying to
add an extra UTF-16 encoding/decoding step.  I think there should
always be a complete codepoint in each wchar_t.

-- 
Mike Gran

[1] http://www.unicode.org/reports/tr29/





* Re: Wide strings
  2009-01-26  0:16   ` Mike Gran
@ 2009-01-26 15:21     ` Mike Gran
  2009-01-26 21:40     ` Ludovic Courtès
  1 sibling, 0 replies; 17+ messages in thread
From: Mike Gran @ 2009-01-26 15:21 UTC (permalink / raw)
  To: Mike Gran, guile-devel

> > Ludo sez,
> Mike sez,
> >  1. IMO it'd be nice to have ASCII strings special-cased so that they
> >    are always encoded in ASCII.  This would allow for memory savings
> >    since, e.g., most symbols are expected to contain only ASCII
> >    characters.  It might also simplify interaction with C in certain
> >    cases; for instance, it would make it easy to have statically
> >    initialized ASCII Scheme strings.
> 
> Why not?  It does solve the initialization problem of dealing with strings
> before setlocale has been called.

One thing I only just noticed today is that the first 256 Unicode chars are
ISO-8859-1.  Maybe then the idea should be to make strings where the 
size of the character can be 1 or 4 bytes but the encoding is always in 
Unicode code points, which in the 1-byte-char case would then be 
coincidentally ISO-8859-1.

Thanks,
Mike Gran






* Re: Wide strings
  2009-01-25 23:32   ` Neil Jerram
@ 2009-01-26 20:24     ` Ludovic Courtès
  0 siblings, 0 replies; 17+ messages in thread
From: Ludovic Courtès @ 2009-01-26 20:24 UTC (permalink / raw)
  To: guile-devel

Hello!

Neil Jerram <neiljerram@googlemail.com> writes:

> But what about the other possible debate, about the API?  Are you
> thinking that we should accept R6RS's choice?

No, I think we have SRFI-1[34] to start with, both of which are well
defined in the context of Unicode.

> (I really haven't read up on all this enough - however when reading
> Tom Lord's analysis just now, I was thinking "why not just specify
> that things like char-upcase don't work in the difficult cases", and
> it seems to me that this is what R6RS chose to do.  So at first glance
> the R6RS API looks OK to me.

Regarding `ß' (German eszet), which is one of the "difficult cases"
mentioned by Tom Lord, SRFI-13 reads:

  Some characters case-map to more than one character.  For example, the
  Latin-1 German eszet character upper-cases to "SS."

    * This means that the R5RS function char-upcase is not well-defined,
      since it is defined to produce a (single) character result.

    * It means that an in-place string-upcase! procedure cannot be
      reliably defined, since the original string may not be long enough
      to contain the result -- an N-character string might upcase to a
      2N-character result.

    * It means that case-insensitive string-matching or searching is
      quite tricky.  For example, an N-character string s might match a
      2N-character string s'.

And then:

  SRFI 13 makes no attempt to deal with these issues; it uses a simple
  1-1 locale- and context-independent case-mapping

I think it's reasonable to stick to this approach at first, at least.
Locale-dependent case folding is part of `(ice-9 i18n)' anyway.

Thanks,
Ludo'.






* Re: Wide strings
  2009-01-26  0:16   ` Mike Gran
  2009-01-26 15:21     ` Mike Gran
@ 2009-01-26 21:40     ` Ludovic Courtès
  2009-01-27  5:38       ` Mike Gran
  1 sibling, 1 reply; 17+ messages in thread
From: Ludovic Courtès @ 2009-01-26 21:40 UTC (permalink / raw)
  To: guile-devel

Hello,

Mike Gran <spk121@yahoo.com> writes:

> There are 3 good, actively developed solutions of which I am aware.
>
> 1.  Use GNU libc functionality.  Encode wide strings as wchar_t.

That'd be POSIX functionality, actually.

> 2.  Use GLib functionality.  Encode wide strings as UTF-8.  Possibly
> give up on O(1).  Possibly add indexing information to string to allow
> O(1), which might negate the space advantage of UTF-8.

Technically, depending on GLib would seem unreasonable to me.  :-)

BTW, Gnulib has a wealth of modules that could be helpful here:

  http://www.gnu.org/software/gnulib/MODULES.html#posix_ext_unicode

I used a few of them in Guile-R6RS-Libs to implement `string->utf8' and
such like.

> 3.  Use IBM's ICU4c.  Encode wide strings as UTF-16.  Thus, add an
> obscure dependency.
>
> Option 3 is likely a non-starter, because it seems that Guile has
> tried to avoid adding new non-GNU dependencies.  It is technologically
> a great solution, IMHO.

At first sight, I'd rather avoid it as a dependency, if that's possible,
but that's mostly subjective.

> Let's say that a string is a union of either an ASCII char vector or a
> wchar_t vector.  A "character" then is just a Unicode codepoint.
> String-ref returns a wchar_t.  This is all in line with R6RS as I
> understand it.

Yes, that seems easily doable.

> There could then be a separate iterator and function set that does
> (likely O(n)) operations on the grapheme clusters of strings.  A
> grapheme cluster is a single written symbol which may be made up of
> several codepoints.  Unicode Standard Annex #29 describes how to
> partition a string into a set of graphemes.[1]

Hmm, that seems like a difficult topic.  It's not even mentioned in
SRFI-13.  I suppose it can be addressed at a later stage, possibly by
providing a specific API.

> There is the problem of systems where wchar_t is 2 bytes instead of 4
> bytes, like Cygwin.  For those systems, I'd recommend
> restricting functionality to 16-bit characters instead of trying to
> add an extra UTF-16 encoding/decoding step.  I think there should
> always be a complete codepoint in each wchar_t.

Agreed.  The GNU libc doc concurs (info "(libc) Extended Char Intro").

However, given this limitation, and other potential portability issues,
it's still unclear to me whether this would be a good choice.  We need
to look more closely at what Gnulib has to offer, IMO.

Thanks,
Ludo'.






* Re: Wide strings
  2009-01-26 21:40     ` Ludovic Courtès
@ 2009-01-27  5:38       ` Mike Gran
  2009-01-27  5:52         ` Mike Gran
  2009-01-27 18:59         ` Ludovic Courtès
  0 siblings, 2 replies; 17+ messages in thread
From: Mike Gran @ 2009-01-27  5:38 UTC (permalink / raw)
  To: guile-devel

Hello,

> Ludo' sez

>> Mike Gran <spk121@yahoo.com> writes:

> BTW, Gnulib has a wealth of modules that could be helpful here:

>  http://www.gnu.org/software/gnulib/MODULES.html#posix_ext_unicode

> I used a few of them in Guile-R6RS-Libs to implement `string->utf8'
> and such like.

The Gnulib routines seem perfectly complete.  I was unaware of them.
It wasn't clear to me at first glance whether wide regex is supported,
but otherwise they look fine.

>> There could then be a separate iterator and function set that does
>> (likely O(n)) operations on the grapheme clusters of strings.  A
>> grapheme cluster is a single written symbol which may be made up of
>> several codepoints.  Unicode Standard Annex #29 describes how to
>> partition a string into a set of graphemes.[1]

> Hmm, that seems like a difficult topic.  It's not even mentioned in
> SRFI-13.  I suppose it can be addressed at a later stage, possibly
> by providing a specific API.

Fair enough.  With wide strings in place, this could all be done in
pure Scheme anyway, and end up in some library.  I brought it up
really just to note the codepoint / grapheme problem.

> [...] We need to look more closely at what Gnulib has to offer, IMO.

Gnulib works for me.  Bruno is the maintainer of those funcs, so I'm
sure they work great.

So really the first questions to answer are the encoding question and
whether the R6RS string API is the goal.  

For the latter, I think the R6RS / SRFI-13 API is simple enough.  I like
it as a goal.

For the former, I rather like the idea that internally a string will
be encoded either as 4-byte chars of UTF-32 or 1-byte chars
of ISO-8859-1.  Since the first 256 chars of UTF-32 are ISO-8859-1, it
makes it trivial for string-ref/set to work with codepoints.

(Though, such a scheme would force scm_take_locale_string to become
scm_take_iso88591_string.)

> Thanks,
> Ludo'.

Thanks,
Mike





* Re: Wide strings
  2009-01-27  5:38       ` Mike Gran
@ 2009-01-27  5:52         ` Mike Gran
  2009-01-27  9:50           ` Andy Wingo
  2009-01-27 18:59         ` Ludovic Courtès
  1 sibling, 1 reply; 17+ messages in thread
From: Mike Gran @ 2009-01-27  5:52 UTC (permalink / raw)
  To: guile-devel

I said

> (Though, such a scheme would force scm_take_locale_string to become
> scm_take_iso88591_string.)

which is incorrect.  Under the proposed scheme, scm_take_locale_string 
would only be able to  use that storage directly if it happened to be 
ASCII or 8859-1.






* Re: Wide strings
  2009-01-27  5:52         ` Mike Gran
@ 2009-01-27  9:50           ` Andy Wingo
  0 siblings, 0 replies; 17+ messages in thread
From: Andy Wingo @ 2009-01-27  9:50 UTC (permalink / raw)
  To: Mike Gran; +Cc: guile-devel

On Tue 27 Jan 2009 06:52, Mike Gran <spk121@yahoo.com> writes:

> I said
>
>> (Though, such a scheme would force scm_take_locale_string to become
>> scm_take_iso88591_string.)
>
> which is incorrect.  Under the proposed scheme, scm_take_locale_string 
> would only be able to  use that storage directly if it happened to be 
> ASCII or 8859-1.

Perhaps as part of this, we should add
scm_{from,take}_{ascii,iso88591,ucs32}_string. This would help greatly
when you know the format of data that you're writing to object code or
in C, but you don't know the locale of the user.

Andy

ps. Good luck! Having this problem looked at, with an eye to solutions,
makes me very happy :-))
-- 
http://wingolog.org/





* Re: Wide strings
  2009-01-27  5:38       ` Mike Gran
  2009-01-27  5:52         ` Mike Gran
@ 2009-01-27 18:59         ` Ludovic Courtès
  2009-01-28 16:44           ` Mike Gran
  1 sibling, 1 reply; 17+ messages in thread
From: Ludovic Courtès @ 2009-01-27 18:59 UTC (permalink / raw)
  To: guile-devel

Hi!

Mike Gran <spk121@yahoo.com> writes:

> Gnulib works for me.  Bruno is the maintainer of those funcs, so I'm
> sure they work great.

Good!

> So really the first questions to answer are the encoding question and
> whether the R6RS string API is the goal.  

SRFI-1[34] (i.e., status quo in terms of supported APIs) seems like a
reasonable milestone.

> For the former, I rather like the idea that internally a string will
> be encoded either as 4-byte chars of UTF-32 or 1-byte chars
> of ISO-8859-1.  Since the first 256 chars of UTF-32 are ISO-8859-1, it
> makes it trivial for string-ref/set to work with codepoints.

Good to know.  That would give us O(1) ref/set!, and with Latin-1
special-cased, we'd have memory saving when interpreting Latin-1 code,
which is good.

> (Though, such a scheme would force scm_take_locale_string to become
> scm_take_iso88591_string.)

I think it would not *become* scm_take_iso88591_string, but
scm_take_iso88591_string (and others, as Andy suggested) would
*complement* it.

Thanks,
Ludo'.






* Re: Wide strings
  2009-01-27 18:59         ` Ludovic Courtès
@ 2009-01-28 16:44           ` Mike Gran
  2009-01-28 18:36             ` Andy Wingo
  2009-01-28 20:44             ` Clinton Ebadi
  0 siblings, 2 replies; 17+ messages in thread
From: Mike Gran @ 2009-01-28 16:44 UTC (permalink / raw)
  To: guile-devel

Hi,

Let's say that one possible goal is to add wide strings 
* using Gnulib functions 
* with minimal changes to the public Guile API 
* where chars become 4-byte codepoints and strings are internally 
 either UTF-32 or ISO-8859-1

Since I need this functionality taken care of, and since I have some
time to play with it, what's the procedure here? Should I mock
something up and submit it as a patch?  If I did, it would likely be 
a big patch.  Do we need to talk more about what needs to be
accomplished?  Do we need a complete specification?  Do we need
a vote on if it is a good idea?

Pragmatically, I see that this can be broken up into three steps.
(Not for public use.  Just as programming subtasks.)

1.  Convert the internal char and string representation to be 
explicitly ISO 8859-1.  Add the to/from locale conversion functionality
while still retaining 8-bit strings.  Replace C library funcs with 
Gnulib string funcs where appropriate.

2.  Convert the internal representation of chars to 4-byte 
codepoints, while still retaining 8-bit strings.

3.  Convert strings to be a union of 1 byte and 4 byte chars.

Thanks,

Mike Gran






* Re: Wide strings
  2009-01-28 16:44           ` Mike Gran
@ 2009-01-28 18:36             ` Andy Wingo
  2009-01-29  0:01               ` Ludovic Courtès
  2009-01-28 20:44             ` Clinton Ebadi
  1 sibling, 1 reply; 17+ messages in thread
From: Andy Wingo @ 2009-01-28 18:36 UTC (permalink / raw)
  To: Mike Gran; +Cc: guile-devel

Hi,

On Wed 28 Jan 2009 17:44, Mike Gran <spk121@yahoo.com> writes:

> Since I need this functionality taken care of, and since I have some
> time to play with it, what's the procedure here?

The best thing IMO would be to hack on it on a Git branch, with small
and correct patches. We could get you commit access if you don't already
have it (Ludo or Neil would have to reply on that). Then you could push
your work directly to a branch, so we all can review it easily.

> Do we need to talk more about what needs to be accomplished? Do we
> need a complete specification? Do we need a vote on if it is a good
> idea?

I think you're going in the right direction. More importantly, although
I can't speak for them, Neil and Ludo seem to think so too.

> 1.  Convert the internal char and string representation to be 
> explicitly ISO 8859-1.  Add the to/from locale conversion functionality
> while still retaining 8-bit strings.  Replace C library funcs with 
> Gnulib string funcs where appropriate.

Sounds appropriate to me. I am unfamiliar with the gnulib code; where do
the Unicode codepoint tables live? How does one update them? Do we get
full introspection on characters and their classes, properties, etc?

> 2.  Convert the internal representation of chars to 4-byte 
> codepoints, while still retaining 8-bit strings.

Currently, characters are immediate values, with an 8-bit tag. See
tags.h:333. So it seems we have 24 bits remaining, and unicode claims
that 21 bits are the minimum necessary -- so we're good, if you can
figure out a reasonable way to go from a 32-bit codepoint to a 24-bit
codepoint.

> 3.  Convert strings to be a union of 1 byte and 4 byte chars.

There's room on stringbufs to have a flag, I think. Dunno if that's the
right way to do it. Converting the symbols and keywords code to do the
right thing will be a little bit of work, too.

Happy hacking,

Andy
-- 
http://wingolog.org/





* Re: Wide strings
  2009-01-28 16:44           ` Mike Gran
  2009-01-28 18:36             ` Andy Wingo
@ 2009-01-28 20:44             ` Clinton Ebadi
  2009-01-28 23:49               ` Ludovic Courtès
  1 sibling, 1 reply; 17+ messages in thread
From: Clinton Ebadi @ 2009-01-28 20:44 UTC (permalink / raw)
  To: guile-devel

Mike Gran <spk121@yahoo.com> writes:

> Hi,
>
> Let's say that one possible goal is to add wide strings 
> * using Gnulib functions 
> * with minimal changes to the public Guile API 
> * where chars become 4-byte codepoints and strings are internally 
>  either UTF-32 or ISO-8859-1
>
> Since I need this functionality taken care of, and since I have some
> time to play with it, what's the procedure here? Should I mock
> something up and submit it as a patch?  If I did, it would likely be 
> a big patch.  Do we need to talk more about what needs to be
> accomplished?  Do we need a complete specification?  Do we need
> a vote on if it is a good idea?

You should take a look at Common Lisp strings[0] and streams[1]. The
gist is that a string is a uniform array of some subtype of
`character'[2], and character streams have an
:external-encoding--character data is converted to/from that format when
writing/reading the stream. Guile should be a bit more opaque: just
specify the string as being an ordered sequence of characters, and
provide conversion functions to/from uniform byte[^0] arrays in some
explicitly specified encoding.

The `scm_{to|from}_locale_string' functions provide enough abstraction
to make this doable without breaking anything that doesn't use
`scm_take_locale_string' (and even then Guile can detect when the locale
is not UCS-4, revert to `scm_from_locale_string' and `free' the taken
string immediately after conversion). This could be enhanced with
`scm_{to|from}_encoded_string ({char*|SCM} string, enum encoding)'
functions.

> Pragmatically, I see that this can be broken up into three steps.
> (Not for public use.  Just as a programming subtasks.)
>
> 1.  Convert the internal char and string representation to be 
> explicitly ISO 8859-1.  Add the to/from locale conversion functionality
> while still retaining 8-bit strings.  Replace C library funcs with 
> Gnulib string funcs where appropriate.

Initially, I would suggest just using UCS-4 internally and iconv[3] to
handle conversion to/from the locale dependent encodings for
C. Converting to an external encoding within `scm_to_{}_string' has
minimal overhead really--the stringbuf has to be copied anyway (likewise
for `scm_from_{}_string'). If you are writing the externally encoded
string to a stream it is even cheaper--no memory need be allocated
during conversion.

I think it is acceptable to restrict the encoding of the string passed
to `scm_take_string'.  If you are constructing strings that Guile can
take possession of, you probably have a bit of control over the
encoding; if you don't, generating a string and throwing it away more or
less immediately is still pretty cheap if malloc doesn't suck.  Adding a
`scm_take_encoded_string' and removing the guarantee from
`scm_take_locale_string' that Guile will not copy the string seems to be
all that is needed to make taking strings work more or less
transparently.

> 2.  Convert the internal representation of chars to 4-byte 
> codepoints, while still retaining 8-bit strings.
>
> 3.  Convert strings to be a union of 1 byte and 4 byte chars.

After getting a basic implementation done using a fixed-width internal
encoding, rather than doing something like this it seems better to make
the internal encoding flexible.

Basically `make-string' would be extended with an :internal-encoding
argument, or a new `make-string-with-internal-encoding' (with a better
name perhaps) introduced to explicitly specify the internal encoding
the application desires. 

An encoding would be implemented as a protocol of some sort that
implemented a few primitive operations: conversion to UCS-4[^1], length,
substring, concatenate, indexed ref, and indexed set! seem to be the
minimal set for an optimizable implementation. Indices would have an
unspecified type to allow for fancy internal encodings--e.g. a tree of
some sort of UTF-8 codepoints that supported fast substring and
concatenation. Allowing an internal encoding to not implement a
destructive set! opens up some interesting optimizations for purely
functional strings (e.g. for representing things like Emacs buffers
using fancy persistent trees that are efficiently updateable and can
maintain an edit history with nearly nil overhead).

Does this seem reasonable?

[0] http://www.lispworks.com/documentation/HyperSpec/Body/16_a.htm
[1] http://www.lispworks.com/documentation/HyperSpec/Body/21_a.htm
[2] http://www.lispworks.com/documentation/HyperSpec/Body/13_a.htm 
[3] http://www.gnu.org/software/libiconv/
[4] http://www.lispworks.com/documentation/HyperSpec/Body/f_by_by.htm
[5] http://www.lispworks.com/documentation/HyperSpec/Body/f_ldb.htm
[6] http://www.lispworks.com/documentation/HyperSpec/Body/f_dpb.htm#dpb

[^0] `byte'[4] in CL language is some arbitrary width sequence of bits;
     e.g. a /traditional/ byte would be of type `(byte 0 7)' and a
     32-bit machine word `(byte 0 31)'. Unrelatedly, you can do some
     neat things using these arbitrary width bytes with
     `ldb'[5]/`dpb'[6].
[^1] Minimally; ideally an internal encoding would be passed any format
     iconv understands and if possible convert directly to that, but if
     not use UCS-4 and punt to the default conversion function instead.

-- 
emacsen: "Like... windows are portals man...
emacsen: Dude... let's yank this shit out of the kill ring"





* Re: Wide strings
  2009-01-28 20:44             ` Clinton Ebadi
@ 2009-01-28 23:49               ` Ludovic Courtès
  0 siblings, 0 replies; 17+ messages in thread
From: Ludovic Courtès @ 2009-01-28 23:49 UTC (permalink / raw)
  To: guile-devel

Hello,

Clinton Ebadi <clinton@unknownlamer.org> writes:

> The `scm_{to|from}_locale_string' functions provide enough abstraction
> to make this doable without breaking anything that doesn't use
> `scm_take_locale_string' (and even then Guile can detect when the locale
> is not UCS-4, revert to `scm_from_locale_string' and `free' the taken
> string immediately after conversion). This could be enhanced with
> `scm_{to|from}_encoded_string ({char*|SCM} string, enum encoding)'
> functions.

Yes.

> Basically `make-string' would be extended with an :internal-encoding
> argument, or a new `make-string-with-internal-encoding' (with a better
> name perhaps) introduced to explicitly specify the internal encoding
> the application desires. 

I think the internal encoding of strings should not be exposed to
applications, at least not at the Scheme level.

Nevertheless, one could argue that it might be useful to expose it at
the C level, precisely to allow the efficient interaction with C code,
but that would need to be done with care, so that Guile is not
over-constrained.

Thanks,
Ludo'.






* Re: Wide strings
  2009-01-28 18:36             ` Andy Wingo
@ 2009-01-29  0:01               ` Ludovic Courtès
  2009-01-30  0:15                 ` Neil Jerram
  0 siblings, 1 reply; 17+ messages in thread
From: Ludovic Courtès @ 2009-01-29  0:01 UTC (permalink / raw)
  To: guile-devel

Hi,

Andy Wingo <wingo@pobox.com> writes:

> On Wed 28 Jan 2009 17:44, Mike Gran <spk121@yahoo.com> writes:
>
>> Since I need this functionality taken care of, and since I have some
>> time to play with it, what's the procedure here?
>
> The best thing IMO would be to hack on it on a Git branch, with small
> and correct patches. We could get you commit access if you don't already
> have it (Ludo or Neil would have to reply on that). Then you could push
> your work directly to a branch, so we all can review it easily.

Yep, setting up a branch is the easiest way.  You can then post updates
or requests for comments as things progress.  We'll need you to assign
the copyright for your changes to the FSF as well (I'll send you an
email sometime later, I need to go to sleep now).  In the meantime, you
can browse the GNU Coding Standards.  :-)

>> Do we need to talk more about what needs to be accomplished? Do we
>> need a complete specification? Do we need a vote on if it is a good
>> idea?
>
> I think you're going in the right direction. More importantly, although
> I can't speak for them, Neil and Ludo seem to think so too.

Yes, as far as I'm concerned.  I know you're probably more knowledgeable
than I am on this issue and I'm confident.

>> 1.  Convert the internal char and string representation to be 
>> explicitly ISO 8859-1.  Add the to/from locale conversion functionality
>> while still retaining 8-bit strings.  Replace C library funcs with 
>> Gnulib string funcs where appropriate.
>
> Sounds appropriate to me.

+1.

>> 2.  Convert the internal representation of chars to 4-byte 
>> codepoints, while still retaining 8-bit strings.
>
> Currently, characters are immediate values, with an 8-bit tag. See
> tags.h:333. So it seems we have 24 bits remaining, and unicode claims
> that 21 bits are the minimum necessary -- so we're good, if you can
> figure out a reasonable way to go from a 32-bit codepoint to a 24-bit
> codepoint.

Good (code)point.  It might be that we'll have to resort to cells for
chars themselves, while storing raw `wchar_t' in a stringbuf content.

>> 3.  Convert strings to be a union of 1 byte and 4 byte chars.
>
> There's room on stringbufs to have a flag, I think. Dunno if that's the
> right way to do it.

I had something like that in mind.

> Converting the symbols and keywords code to do the
> right thing will be a little bit of work, too.

Not if it's handled at the level of stringbufs, I think.

BTW, while BDW-GC isn't used, make sure to update `scm_i_stringbuf_free ()'
and friends so that they pass the right number of bytes that are to be
freed...

Thanks,
Ludo'.






* Re: Wide strings
  2009-01-29  0:01               ` Ludovic Courtès
@ 2009-01-30  0:15                 ` Neil Jerram
  0 siblings, 0 replies; 17+ messages in thread
From: Neil Jerram @ 2009-01-30  0:15 UTC (permalink / raw)
  To: Mike Gran; +Cc: guile-devel

ludo@gnu.org (Ludovic Courtès) writes:

>>> Do we need to talk more about what needs to be accomplished? Do we
>>> need a complete specification? Do we need a vote on if it is a good
>>> idea?
>>
>> I think you're going in the right direction. More importantly, although
>> I can't speak for them, Neil and Ludo seem to think so too.
>
> Yes, as far as I'm concerned.  I know you're probably more knowledgeable
> than I am on this issue and I'm confident.

For the record, I'm happy too - in fact I'm excited that Guile is
finally going to have wide strings.  Technically I think I'm less of
an expert here than everyone else who has commented, so I'm happy to
defer to the apparent consensus.  I see that Clinton's idea goes
beyond that, but (IIUC) it looks like we don't lose anything by
going with Latin-1/UTF-32 now, and then Clinton's idea could be added
later; is that correct?

Regards,
        Neil




