Wide strings status

unofficial mirror of guile-devel@gnu.org 

* Wide strings status
From: Mike Gran @ 2009-04-21  2:11 UTC
  To: Guile Devel

Hi,

OK.  I've uploaded a "string-abstraction" branch so that you can see
what I've been doing over the last couple of months.  Currently, I do
have a version of Guile that uses Unicode codepoints for characters.

The C representation of chars was changed to scm_t_uint32 throughout the
code.

Strings are internally encoded either as "narrow" 8-bit ISO-8859-1
strings or as "wide" UTF-32 strings.  Strings are usually created as
narrow strings.  Narrow strings get automatically widened to wide
strings if non-8-bit characters are set! or appended to them.
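
To make the narrow/wide scheme concrete, here is a minimal C sketch of
the idea.  The sketch_string type and its functions are hypothetical
names made up for illustration, not the code in the branch, and plain
uint32_t stands in for scm_t_uint32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical string object: 8-bit ISO-8859-1 or 32-bit UTF-32. */
typedef struct {
  size_t len;
  int is_wide;              /* 0: narrow buffer in use, 1: wide buffer */
  union {
    uint8_t *narrow;
    uint32_t *wide;
  } u;
} sketch_string;

/* Create a narrow string of LEN characters, all set to FILL.
   Returns 0 on success, -1 on allocation failure. */
int sketch_string_init(sketch_string *s, size_t len, uint8_t fill)
{
  s->len = len;
  s->is_wide = 0;
  s->u.narrow = malloc(len ? len : 1);
  if (!s->u.narrow)
    return -1;
  for (size_t i = 0; i < len; i++)
    s->u.narrow[i] = fill;
  return 0;
}

/* Convert a narrow string to wide in place.  Latin-1 bytes map
   one-to-one onto the first 256 Unicode codepoints. */
int sketch_string_widen(sketch_string *s)
{
  if (s->is_wide)
    return 0;
  uint32_t *w = malloc((s->len ? s->len : 1) * sizeof *w);
  if (!w)
    return -1;
  for (size_t i = 0; i < s->len; i++)
    w[i] = s->u.narrow[i];
  free(s->u.narrow);
  s->u.wide = w;
  s->is_wide = 1;
  return 0;
}

/* Read the codepoint at index I, whatever the representation. */
uint32_t sketch_string_ref(const sketch_string *s, size_t i)
{
  assert(i < s->len);
  return s->is_wide ? s->u.wide[i] : s->u.narrow[i];
}

/* Store codepoint C at index I, widening first if C needs it. */
int sketch_string_set(sketch_string *s, size_t i, uint32_t c)
{
  assert(i < s->len);
  if (!s->is_wide && c > 0xFF && sketch_string_widen(s) != 0)
    return -1;
  if (s->is_wide)
    s->u.wide[i] = c;
  else
    s->u.narrow[i] = (uint8_t) c;
  return 0;
}

void sketch_string_free(sketch_string *s)
{
  free(s->is_wide ? (void *) s->u.wide : (void *) s->u.narrow);
}
```

In this sketch, setting a character such as 200 leaves the string
narrow, while setting 300 triggers the one-time widening copy.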

Outside of the core strings module and srfi-13, strings are accessed
only through a set of accessor functions.  I did my best to keep the
internal representation of strings isolated to those two modules.  This
means that almost every instance of the pervasive scm_i_string_chars()
was removed.

The machine-readable "write" form of strings has been changed.  Before,
non-printable characters were given as hex escapes, for example \xFF.
Now there are three levels of hex escape for 8-, 16-, and 24-bit
characters: \xFF, \uFFFF, \UFFFFFF.  This is a pretty common convention.
But after I coded this, I noticed that R6RS has a different convention
and I'll probably go with that.
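
The three-level escape scheme can be sketched in C roughly as follows.
This is an illustration only: sketch_write_string and escape_one are
hypothetical names, and the assumptions that everything outside
printable ASCII gets escaped and that hex digits are lowercase are
guesses, not a description of the branch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format one codepoint into TMP; returns the number of characters.
   Assumption: anything outside printable ASCII is escaped. */
static int escape_one(char *tmp, size_t n, uint32_t c)
{
  if (c == '"' || c == '\\')
    return snprintf(tmp, n, "\\%c", (int) c);
  if (c >= 0x20 && c < 0x7F)
    return snprintf(tmp, n, "%c", (int) c);
  if (c <= 0xFF)
    return snprintf(tmp, n, "\\x%02x", (unsigned) c);
  if (c <= 0xFFFF)
    return snprintf(tmp, n, "\\u%04x", (unsigned) c);
  return snprintf(tmp, n, "\\U%06x", (unsigned) c);
}

/* Write the quoted, escaped form of LEN codepoints into OUT, which
   holds N bytes.  Returns the length written (excluding the NUL),
   or 0 if OUT is too small. */
size_t sketch_write_string(char *out, size_t n,
                           const uint32_t *cps, size_t len)
{
  char tmp[16];
  size_t used = 0;

  assert(out != NULL && cps != NULL);
  if (n < 3)                      /* two quotes plus the NUL */
    return 0;
  out[used++] = '"';
  for (size_t i = 0; i < len; i++) {
    int k = escape_one(tmp, sizeof tmp, cps[i]);
    /* Keep room for the closing quote and the NUL. */
    if (k < 0 || used + (size_t) k + 2 > n)
      return 0;
    memcpy(out + used, tmp, (size_t) k);
    used += (size_t) k;
  }
  out[used++] = '"';
  out[used] = '\0';
  return used;
}
```

Under those assumptions, the codepoints 100, 200, 300, 400, 500 come
out as "d\xc8\u012c\u0190\u01f4".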

The internal representation of strings seems to work already, but the
reader doesn't work yet.  For now, one can make wide strings like this:

> (setlocale LC_ALL "")
==> "en_US.UTF-8"

> (define str (apply string (map integer->char '(100 200 300 400 500))))

> (write str)
==> "d\xc8\u012c\u0190\u01f4"

> (display str)
==> dÈĬƐǴ

This is all going to be slower than before because of the string
conversion operations, but I didn't want to optimize prematurely.  I
wanted to get it working first; there is plenty of room for
optimization later.

Anyway, if it is agreed that, code-wise, I'm generally on the right
track, the next steps are these:

Write a plethora of unit tests on what has been accomplished so far.

Character sets need to be modified to have more than 256 entries.

Character encoding needs to be a property of ports, so that not all
string operations are done in the current locale.  This is necessary so
that UTF-8-encoded source files are not interpreted differently based on
the current locale.

Some accommodation needs to be made for programs that have been abusing
strings to hold binary data.  Maybe make a "binary" locale.

The VM and interpreter need to be updated to deal with wide chars, and
probably in other ways that are unclear to me now.  Wide strings are
currently getting truncated to 8 bits somewhere in there.

Thanks,

Mike Gran

* Re: Wide strings status
From: Ludovic Courtès @ 2009-04-21 21:37 UTC
  To: guile-devel

Hello!

Mike Gran <spk121@yahoo.com> writes:

> Strings are internally encoded either as "narrow" 8-bit ISO-8859-1
> strings or as "wide" UTF-32 strings.  Strings are usually created as
> narrow strings.  Narrow strings get automatically widened to wide
> strings if non-8-bit characters are set! or appended to them.

Great!

> The machine-readable "write" form of strings has been changed.  Before,
> non-printable characters were given as hex escapes, for example \xFF.
> Now there are three levels of hex escape for 8, 16, and 24 bit
> characters: \xFF, \uFFFF, \UFFFFFF.  This is a pretty common convention.
> But after I coded this, I noticed that R6RS has a different convention
> and I'll probably go with that.

OK.  I think it's probably good to follow R6RS when it has something to
say.

> The internal representation of strings seems to work already, but, the
> reader doesn't work yet.  For now, one can make wide strings like this:
>
>> (setlocale LC_ALL "")
> ==> "en_US.UTF-8"
>
>> (define str (apply string (map integer->char '(100 200 300 400 500))))
>
>> (write str)
> ==>"d\xc8\u012c\u0190\u01f4"
>
> (display str)
> ==>dÈĬƐǴ

Eh eh, looks nice.  Looking forward to typing `(λ (x y) (+ x y))'.  ;-)

> This is all going to be slower than before because of the string
> conversion operations, but, I didn't want to do any premature
> optimization.  First, I wanted to get it working, but, there is plenty
> of room for optimization later.

Good.  Maybe it'd be nice to add simple micro-benchmarks for
`string-ref', `string-set!' et al. under `benchmarks'.

> Character encoding needs to be a property of ports, so that not all
> string operations are done in the current locale.  This is necessary so
> that UTF-8-encoded source files are not interpreted differently based on
> the current locale.

You seem to imply that `scm_getc ()' will now return a Unicode
codepoint, is that right?  What about `scm_c_{read,write} ()', and
`scm_{get,put}s ()'?

> The VM and interpreter need to be updated to deal with wide chars and
> probably in other ways that are unclear to me now.  Wide strings are
> currently getting truncated to 8-bit somewhere in there.

The compiler could use bytevectors when dealing with bytecode.  Maybe
that would clarify things.

Thanks,
Ludo'.

* Re: Wide strings status
From: Mike Gran @ 2009-04-22  3:26 UTC
  To: Ludovic Courtès; +Cc: guile-devel

On Tue, 2009-04-21 at 23:37 +0200, Ludovic Courtès wrote:

> > This is all going to be slower than before because of the string
> > conversion operations, but, I didn't want to do any premature
> > optimization.  First, I wanted to get it working, but, there is plenty
> > of room for optimization later.
> 
> Good.  Maybe it'd be nice to add simple micro-benchmarks for
> `string-ref', `string-set!' et al. under `benchmarks'.
> 

I'll put it on my todo list.

> > Character encoding needs to be a property of ports, so that not all
> > string operations are done in the current locale.  This is necessary so
> > that UTF-8-encoded source files are not interpreted differently based on
> > the current locale.
> 
> You seem to imply that `scm_getc ()' will now return a Unicode
> codepoint, is that right?  What about `scm_c_{read,write} ()', and
> `scm_{get,put}s ()'?
> 

I vacillate on this, but I think the most logical approach is to have
scm_getc return codepoints and to have the rest of those functions
return strings that could contain wide characters.  This would apply if
and only if the port has been assigned a character encoding.  A port
without an associated encoding will be treated as de facto ISO-8859-1:
character values between 0 and 255 are stored without any
interpretation, and characters greater than 255 are invalid.  (Unicode
codepoints 0 to 255 are by design the same as ISO-8859-1.)
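
As a rough illustration of that rule, here is a C sketch of a
character-reading function over an in-memory buffer.  The sketch_port
type, the sketch_encoding enum and sketch_getc are hypothetical
stand-ins, not Guile's port API, and UTF-8 is just one example of an
assigned encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SKETCH_ENC_NONE, SKETCH_ENC_UTF8 } sketch_encoding;

/* Hypothetical port reading from a fixed byte buffer. */
typedef struct {
  const uint8_t *buf;
  size_t len;
  size_t pos;
  sketch_encoding enc;
} sketch_port;

#define SKETCH_EOF   (-1)
#define SKETCH_ERROR (-2)

/* Return the next codepoint, SKETCH_EOF at end of input, or
   SKETCH_ERROR on malformed UTF-8. */
int32_t sketch_getc(sketch_port *p)
{
  assert(p != NULL);
  if (p->pos >= p->len)
    return SKETCH_EOF;

  uint8_t b0 = p->buf[p->pos++];
  /* No encoding: treat the byte as ISO-8859-1, whose values are the
     same as Unicode codepoints 0 to 255.  ASCII is shared by both. */
  if (p->enc == SKETCH_ENC_NONE || b0 < 0x80)
    return b0;

  int extra;
  uint32_t cp, lowest;
  if ((b0 & 0xE0) == 0xC0) {
    extra = 1; cp = b0 & 0x1F; lowest = 0x80;
  } else if ((b0 & 0xF0) == 0xE0) {
    extra = 2; cp = b0 & 0x0F; lowest = 0x800;
  } else if ((b0 & 0xF8) == 0xF0) {
    extra = 3; cp = b0 & 0x07; lowest = 0x10000;
  } else {
    return SKETCH_ERROR;          /* stray continuation or bad lead */
  }

  if (p->len - p->pos < (size_t) extra)
    return SKETCH_ERROR;          /* truncated sequence */
  for (int i = 0; i < extra; i++) {
    uint8_t b = p->buf[p->pos++];
    if ((b & 0xC0) != 0x80)
      return SKETCH_ERROR;
    cp = (cp << 6) | (b & 0x3F);
  }
  /* Reject overlong forms, surrogates and out-of-range values. */
  if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return SKETCH_ERROR;
  return (int32_t) cp;
}
```

With the bytes C3 88 in the buffer, a port with no encoding yields the
two characters 0xC3 and 0x88, while a UTF-8 port yields the single
codepoint 0xC8.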

> > The VM and interpreter need to be updated to deal with wide chars and
> > probably in other ways that are unclear to me now.  Wide strings are
> > currently getting truncated to 8-bit somewhere in there.
> 
> The compiler could use bytevectors when dealing with bytecode.  Maybe
> that would clarify things.

On those issues, I'll have to defer to the wisdom of others.  I'll do
what I can with the C code, and then I'll need help.

> 
> Thanks,
> Ludo'.
> 

Thanks for taking the time.

-Mike

* Re: Wide strings status
From: Ludovic Courtès @ 2009-04-22 20:03 UTC
  To: Mike Gran; +Cc: guile-devel

Hello!

Mike Gran <spk121@yahoo.com> writes:

> On Tue, 2009-04-21 at 23:37 +0200, Ludovic Courtès wrote:

>> You seem to imply that `scm_getc ()' will now return a Unicode
>> codepoint, is that right?  What about `scm_c_{read,write} ()', and
>> `scm_{get,put}s ()'?
>> 
>
> I vacillate on this, but, I think the most logical approach is to have
> scm_getc return codepoints and to have the rest of those functions
> return strings that could contain wide characters.

Hmm, `scm_c_{read,write} ()' are biased toward binary data, according to
the manual and to their prototype (they take `void *' buffers).  So I
would keep them this way.

`scm_puts ()' is more of a concern since it takes a `char *', which the
caller may consider an 8-bit-encoded, null-terminated string.  We should
probably deprecate it, and have it treat its argument as an ISO-8859-1
string, transcoding as necessary.
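
One possible reading of "transcoding as necessary", sketched in C: the
8-bit string is taken to be ISO-8859-1 and converted to the encoding of
the destination, assumed here to be UTF-8.  sketch_latin1_to_utf8 is a
hypothetical helper for illustration, not an existing Guile function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Transcode the NUL-terminated ISO-8859-1 string SRC to UTF-8 in DST,
   which holds N bytes including the NUL.  Returns the number of bytes
   written (excluding the NUL), or (size_t) -1 if DST is too small. */
size_t sketch_latin1_to_utf8(char *dst, size_t n, const char *src)
{
  assert(dst != NULL && src != NULL);
  size_t len = strlen(src);
  size_t used = 0;

  for (size_t i = 0; i < len; i++) {
    unsigned char c = (unsigned char) src[i];
    size_t need = c < 0x80 ? 1 : 2;
    if (used + need + 1 > n)      /* keep room for the NUL */
      return (size_t) -1;
    if (c < 0x80) {
      dst[used++] = (char) c;     /* ASCII passes through unchanged */
    } else {
      /* Codepoints 0x80 to 0xFF take two bytes in UTF-8. */
      dst[used++] = (char) (0xC0 | (c >> 6));
      dst[used++] = (char) (0x80 | (c & 0x3F));
    }
  }
  if (used + 1 > n)
    return (size_t) -1;
  dst[used] = '\0';
  return used;
}
```

Because every ISO-8859-1 byte is a valid codepoint, this direction can
fail only when the output buffer is too small.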

And `scm_gets ()' doesn't exist actually.  ;-)

> This is if and only
> if the port has been assigned a character encoding.  If it doesn't have
> an associated encoding, ports will be treated as de facto ISO-8859-1,
> where character values between 0 and 255 are stored without any
> interpretation and characters greater than 255 are invalid.  (Unicode
> codepoints 0 to 255 are by design the same as ISO-8859-1.)

OK.

Thanks,
Ludo'.
