Unicode ports patch

* Unicode ports patch
@ 2009-08-25 15:06 Mike Gran
  2009-08-25 19:54 ` dsmich
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Mike Gran @ 2009-08-25 15:06 UTC
  To: Guile Devel

The latest commit 'Add full Unicode capability to ports and the default
reader' 889975e51accb80491af76fc5db980aeb3edd342 adds the majority of
the functionality for non-ASCII strings.  

The commit breaks functions that have to do with locale-specific case
conversion and character-sets, and some tests will fail for the time
being.  

It is a big commit, but I think it would have been hard to make it
smaller.

There are some new test files in that commit that can demo the
capabilities.

Thanks,

Mike

* Re: Unicode ports patch
  2009-08-25 15:06 Unicode ports patch Mike Gran
@ 2009-08-25 19:54 ` dsmich
  2009-08-25 19:58 ` Andy Wingo
  2009-09-01  8:19 ` Ludovic Courtès
  2 siblings, 0 replies; 9+ messages in thread
From: dsmich @ 2009-08-25 19:54 UTC
  To: Guile Devel, Mike Gran


---- Mike Gran <spk121@yahoo.com> wrote: 
> The latest commit 'Add full Unicode capability to ports and the default
> reader' 889975e51accb80491af76fc5db980aeb3edd342 adds the majority of
> the functionality for non-ASCII strings.  
> 
> The commit breaks functions that have to do with locale-specific case
> conversion and character-sets, and some tests will fail for the time
> being.  

Also breaks the build. ;^)

cc1: warnings being treated as errors
read.c: In function 'scm_read_expression':
read.c:820: warning: 'ch' may be used uninitialized in this function
make[3]: *** [libguile_la-read.lo] Error 1

This is a 32-bit Intel Debian Etch system, where cc is:
cc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
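
The failure above comes from -Werror ("warnings being treated as
errors") promoting an uninitialized-variable warning to a hard error.
The sketch below shows the general pattern and the usual fix, which is
to give the variable a definite value on every path.  It is not the code
at read.c:820; the function first_non_blank and its behaviour are
hypothetical and exist only to illustrate the warning.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reader loop; this is NOT the code at
   read.c:820.  Returns the first non-blank character of S, or -1 if
   there is none.  */
int
first_non_blank (const char *s)
{
  /* Initializing CH up front gives it a definite value on every path,
     including the one where the loop never assigns it.  Without the
     initializer, a compiler may report that CH "may be used
     uninitialized".  */
  int ch = -1;
  size_t i;

  assert (s != NULL);

  for (i = 0; s[i] != '\0'; i++)
    {
      if (s[i] != ' ' && s[i] != '\t')
        {
          ch = (unsigned char) s[i];
          break;
        }
    }

  return ch;
}
```

Whether an initializer or a restructured control flow is the better fix
depends on the real function; the initializer is simply the smallest
change that silences the warning without altering any path that already
assigns the variable.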

-Dale

* Re: Unicode ports patch
  2009-08-25 15:06 Unicode ports patch Mike Gran
  2009-08-25 19:54 ` dsmich
@ 2009-08-25 19:58 ` Andy Wingo
  2009-09-01  8:19 ` Ludovic Courtès
  2 siblings, 0 replies; 9+ messages in thread
From: Andy Wingo @ 2009-08-25 19:58 UTC
  To: Mike Gran; +Cc: Guile Devel

On Tue 25 Aug 2009 17:06, Mike Gran <spk121@yahoo.com> writes:

> The commit breaks functions that have to do with locale-specific case
> conversion and character-sets, and some tests will fail for the time
> being.

Ah, should have read guile-devel first :)

Happy hacking!

A
-- 
http://wingolog.org/

* Re: Unicode ports patch
  2009-08-25 15:06 Unicode ports patch Mike Gran
  2009-08-25 19:54 ` dsmich
  2009-08-25 19:58 ` Andy Wingo
@ 2009-09-01  8:19 ` Ludovic Courtès
  2009-09-01 18:25   ` Andy Wingo
  2 siblings, 1 reply; 9+ messages in thread
From: Ludovic Courtès @ 2009-09-01  8:19 UTC
  To: guile-devel

Hello!

Mike Gran <spk121@yahoo.com> writes:

> The latest commit 'Add full Unicode capability to ports and the default
> reader' 889975e51accb80491af76fc5db980aeb3edd342 adds the majority of
> the functionality for non-ASCII strings.  

This patch adds a few functions related to string ports:

  * libguile/strports.c: store string ports in locale encoding
    (scm_strport_to_locale_u8vector, scm_call_with_output_locale_u8vector)
    (scm_open_input_locale_u8vector, scm_get_output_locale_u8vector):
    new functions

I think it would be nicer if these used bytevectors instead of u8vectors
and were locale-independent (which would match the `string->utf8' &
co. API).  Also I would make `scm_strport_to_locale_u8vector ()'
private.  And finally, it'd be even better if it were documented in the
manual.  :-)

Actually I'm not convinced that `call-with-output-locale-*' and
`open-input-locale-*' are useful, precisely because we can use a string
port to get a string and then `string->utf8' to get at the string bits.

What do you think?

Thanks,
Ludo'.

* Re: Unicode ports patch
  2009-09-01  8:19 ` Ludovic Courtès
@ 2009-09-01 18:25   ` Andy Wingo
  2009-09-01 19:19     ` Mike Gran
  0 siblings, 1 reply; 9+ messages in thread
From: Andy Wingo @ 2009-09-01 18:25 UTC
  To: Ludovic Courtès; +Cc: guile-devel

Hi,

On Tue 01 Sep 2009 10:19, ludo@gnu.org (Ludovic Courtès) writes:

> Mike Gran <spk121@yahoo.com> writes:
>
>> The latest commit 'Add full Unicode capability to ports and the default
>> reader' 889975e51accb80491af76fc5db980aeb3edd342 adds the majority of
>> the functionality for non-ASCII strings.  
>
> This patch adds a few functions related to string ports:
>
>   * libguile/strports.c: store string ports in locale encoding
>     (scm_strport_to_locale_u8vector, scm_call_with_output_locale_u8vector)
>     (scm_open_input_locale_u8vector, scm_get_output_locale_u8vector):
>     new functions
>
> I think it would be nicer if these used bytevectors instead of u8vectors
> and were locale-independent (which would match the `string->utf8' &
> co. API).  Also I would make `scm_strport_to_locale_u8vector ()'
> private.  And finally, it'd be even better if it were documented in the
> manual.  :-)
>
> Actually I'm not convinced that `call-with-output-locale-*' and
> `open-input-locale-*' are useful, precisely because we can use a string
> port to get a string and then `string->utf8' to get at the string bits.

FWIW, I think I agree with all of Ludovic's comments; though if there is
a way that we can simply arrange to output bytes to an R6RS binary
output port, I think there are already efficient means to collect the
bytes from such a port in a bytevector.

Cheers,

Andy
-- 
http://wingolog.org/

* Re: Unicode ports patch
  2009-09-01 18:25   ` Andy Wingo
@ 2009-09-01 19:19     ` Mike Gran
  2009-09-01 19:34       ` Ludovic Courtès
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Gran @ 2009-09-01 19:19 UTC
  To: Andy Wingo, Ludovic Courtès; +Cc: guile-devel

----- Original Message ----
> From: Andy Wingo <wingo@pobox.com>
> To: Ludovic Courtès <ludo@gnu.org>
> Cc: guile-devel@gnu.org
> Sent: Tuesday, September 1, 2009 11:25:26 AM
> Subject: Re: Unicode ports patch
> 
> Hi,
> 
> On Tue 01 Sep 2009 10:19, ludo@gnu.org (Ludovic Courtès) writes:
> 
> > Mike Gran writes:
> >
> >> The latest commit 'Add full Unicode capability to ports and the default
> >> reader' 889975e51accb80491af76fc5db980aeb3edd342 adds the majority of
> >> the functionality for non-ASCII strings.  
> >
> > This patch adds a few functions related to string ports:
> >
> >  * libguile/strports.c: store string ports in locale encoding
> >    (scm_strport_to_locale_u8vector, scm_call_with_output_locale_u8vector)
> >    (scm_open_input_locale_u8vector, scm_get_output_locale_u8vector):
> >    new functions
> >
> > I think it would be nicer if these used bytevectors instead of u8vectors
> > and were locale-independent (which would match the `string->utf8' &
> > co. API).  Also I would make `scm_strport_to_locale_u8vector ()'
> > private.  And finally, it'd be even better if it were documented in the
> > manual.  :-)

I don't understand.  "it would be nicer if *these* ..."

To what does *these* refer: string ports?  It would be nicer if we replace
string ports with bytevector ports?  Or it would be nicer if 
scm_get_output_locale_u8vector was scm_get_output_bytevector?

"it would be nicer if these used bytevectors ... and were *locale-independent*"

It would be nicer if string ports were actually bytevector ports, and that 
they were locale-independent?  Or that scm_get_output_bytevector returned a 
locale-independent (ergo 8-bit or 32-bit) vector?

> >
> > Actually I'm not convinced that `call-with-output-locale-*' and
> > `open-input-locale-*' are useful, precisely because we can use a string
> > port to get a string and then `string->utf8' to get at the string bits.

"We can use a string port to get a string"

If we write to a string port and pop a result string?

"And then use string->utf8 to get at the string bits"

And then convert the result string to a UTF-8 encoded bytevector?

> 
> FWIW, I think I agree with all of Ludovic's comments; though if there is
> a way that we can simply arrange to output bytes to an R6RS binary
> output port, I think there are already efficient means to collect the
> bytes from such a port in a bytevector.

Thanks,
Mike

* Re: Unicode ports patch
  2009-09-01 19:19     ` Mike Gran
@ 2009-09-01 19:34       ` Ludovic Courtès
  2009-09-01 21:08         ` Mike Gran
  0 siblings, 1 reply; 9+ messages in thread
From: Ludovic Courtès @ 2009-09-01 19:34 UTC
  To: guile-devel

Hi!

Mike Gran <spk121@yahoo.com> writes:

>> On Tue 01 Sep 2009 10:19, ludo@gnu.org (Ludovic Courtès) writes:
>> 
>> > Mike Gran writes:
>> >
>> >> The latest commit 'Add full Unicode capability to ports and the default
>> >> reader' 889975e51accb80491af76fc5db980aeb3edd342 adds the majority of
>> >> the functionality for non-ASCII strings.  
>> >
>> > This patch adds a few functions related to string ports:
>> >
>> >  * libguile/strports.c: store string ports in locale encoding
>> >    (scm_strport_to_locale_u8vector, scm_call_with_output_locale_u8vector)
>> >    (scm_open_input_locale_u8vector, scm_get_output_locale_u8vector):
>> >    new functions
>> >
>> > I think it would be nicer if these used bytevectors instead of u8vectors
>> > and were locale-independent (which would match the `string->utf8' &
>> > co. API).  Also I would make `scm_strport_to_locale_u8vector ()'
>> > private.  And finally, it'd be even better if it were documented in the
>> > manual.  :-)
>
> I don't understand.  "it would be nicer if *these* ..."

"These" was for "these functions".

> "it would be nicer if these used bytevectors ... and were *locale-independent*"
>
> It would be nicer if string ports were actually bytevector ports, and that 
> they were locale-independent?  Or that scm_get_output_bytevector returned a 
> locale-independent (ergo 8-bit or 32-bit) vector?

The latter.

>> > Actually I'm not convinced that `call-with-output-locale-*' and
>> > `open-input-locale-*' are useful, precisely because we can use a string
>> > port to get a string and then `string->utf8' to get at the string bits.
>
> "We can use a string port to get a string"
>
> If we write to a string port and pop a result string?

Yes, with `with-output-to-string' for instance.

> "And then use string->utf8 to get at the string bits"
>
> And then convert the result string to a UTF-8 encoded bytevector?

`string->utf8' returns a bytevector containing the UTF-8-encoded string
it is passed.
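
To make "UTF-8-encoded bytes, independent of the locale" concrete, here
is a standalone C sketch of the encoding step itself.  It is not libguile
code and does not show how `string->utf8' is implemented; the function
name utf8_encode and the choice of a 32-bit code point array as input are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode N Unicode code points from CP into OUT as UTF-8.
   Returns the number of bytes written, or (size_t) -1 if a code point
   is invalid (a surrogate, or above U+10FFFF) or OUT is too small.  */
size_t
utf8_encode (const uint32_t *cp, size_t n, unsigned char *out, size_t cap)
{
  size_t i, len = 0;

  assert (cp != NULL && out != NULL);

  for (i = 0; i < n; i++)
    {
      uint32_t c = cp[i];
      size_t need;

      if ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return (size_t) -1;

      /* Number of bytes this code point occupies in UTF-8.  */
      need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
      if (cap - len < need)
        return (size_t) -1;

      switch (need)
        {
        case 1:
          out[len++] = (unsigned char) c;
          break;
        case 2:
          out[len++] = (unsigned char) (0xC0 | (c >> 6));
          out[len++] = (unsigned char) (0x80 | (c & 0x3F));
          break;
        case 3:
          out[len++] = (unsigned char) (0xE0 | (c >> 12));
          out[len++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
          out[len++] = (unsigned char) (0x80 | (c & 0x3F));
          break;
        default:
          out[len++] = (unsigned char) (0xF0 | (c >> 18));
          out[len++] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
          out[len++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
          out[len++] = (unsigned char) (0x80 | (c & 0x3F));
          break;
        }
    }

  return len;
}
```

The result depends only on the code points, never on the current locale,
which is the property being asked for in this thread.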

Thanks,
Ludo'.

* Re: Unicode ports patch
  2009-09-01 19:34       ` Ludovic Courtès
@ 2009-09-01 21:08         ` Mike Gran
  2009-09-02  8:01           ` Ludovic Courtès
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Gran @ 2009-09-01 21:08 UTC
  To: Ludovic Courtès, guile-devel

> > It would be nicer if string ports were actually bytevector ports, and that 
> > they were locale-independent?  Or that scm_get_output_bytevector returned a 
> > locale-independent (ergo 8-bit or 32-bit) vector?
> 
> The latter.

The test suite requires an API for testing the correctness of the encoding
when writing or displaying a string in a given locale.  It also needs an API
for checking that a locale-encoded byte array can be correctly converted to
a string.

What would you suggest?

Thanks,

Mike

* Re: Unicode ports patch
  2009-09-01 21:08         ` Mike Gran
@ 2009-09-02  8:01           ` Ludovic Courtès
  0 siblings, 0 replies; 9+ messages in thread
From: Ludovic Courtès @ 2009-09-02  8:01 UTC
  To: guile-devel

Hi!

Mike Gran <spk121@yahoo.com> writes:

>> > It would be nicer if string ports were actually bytevector ports, and that 
>> > they were locale-independent?  Or that scm_get_output_bytevector returned a 
>> > locale-independent (ergo 8-bit or 32-bit) vector?
>> 
>> The latter.
>
> The test suite requires an API for testing the correctness of the encoding
> when writing or displaying a string in a given locale.  It also needs an API
> for checking that a locale-encoded byte-array can be correctly converted to a string.

Hmm, OK, I understand.

> What would you suggest?

Have them return a bytevector instead of a u8vector.

My other concern was about adding it to the public API.  Do you think it
would be useful?  My initial feeling was that it may not be too useful,
hence not needing to be public, but I'm not sure.

How about adding (string->encoding STR ENCODING) => BYTEVECTOR in
`(rnrs bytevector)'[*]?  We also have `locale-encoding' in `(ice-9
i18n)', so combining the two should provide you with what you need and
may be generally useful.  Would that work for you?
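
As a rough illustration of what combining an encoding name with a
conversion step looks like, here is a C sketch built on the POSIX
iconv and nl_langinfo interfaces.  `string->encoding' is only a proposal
at this point in the thread, so nothing below reflects an actual Guile
API; the names current_locale_encoding and utf8_to_encoding are
hypothetical, and the input is assumed to be UTF-8.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>

/* Name of the current locale's character encoding, e.g. "UTF-8".
   The caller is expected to have called setlocale (LC_ALL, "") first;
   otherwise this reports the encoding of the default "C" locale.  */
const char *
current_locale_encoding (void)
{
  return nl_langinfo (CODESET);
}

/* Convert IN_LEN bytes of UTF-8 text at IN to ENCODING, writing at most
   CAP bytes to OUT.  Returns the number of bytes written, or
   (size_t) -1 on failure (unknown encoding, malformed input, output
   buffer too small, or any other error reported by iconv).  */
size_t
utf8_to_encoding (const char *encoding, const char *in, size_t in_len,
                  char *out, size_t cap)
{
  iconv_t cd;
  char *src = (char *) in;      /* iconv takes a non-const char **.  */
  char *dst = out;
  size_t src_left = in_len, dst_left = cap, rc;

  assert (encoding != NULL && in != NULL && out != NULL);

  cd = iconv_open (encoding, "UTF-8");  /* Arguments: tocode, fromcode.  */
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  rc = iconv (cd, &src, &src_left, &dst, &dst_left);
  if (rc != (size_t) -1)
    /* Flush any pending shift sequence for stateful encodings.  */
    rc = iconv (cd, NULL, NULL, &dst, &dst_left);
  iconv_close (cd);

  return rc == (size_t) -1 ? (size_t) -1 : cap - dst_left;
}
```

Passing current_locale_encoding () as the ENCODING argument would give
the locale-encoded bytes that the test suite wants to compare against.
How characters with no representation in the target encoding are handled
varies between iconv implementations, so a test should stick to
characters known to be representable.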

Thanks,
Ludo'.

[*] Eventually, `string->encoding', `uniform-array->bytevector' and
    similar extensions to R6RS should be moved into, say, `(rnrs
    bytevector gnu)', IMO.
