Unicode, ports and encoding

unofficial mirror of guile-devel@gnu.org 

* Unicode, ports and encoding
From: Mike Gran @ 2009-02-16 23:51 UTC
  To: guile-devel

More observations about wide strings and Guile.

First, here are the abridged call trees for low-level reading and
writing.

read <-+- scm_getc <-+- [the parser] <--- scm_read <--- scm_primitive_load
       |             |
       |             +- scm_read_char
       |
       |
       +- scm_c_read
       |
       +- read_without_guile




write <-+- scm_lfwrite <-+- scm_display
        |                |
        |                +- scm_putc <-+- scm_write_char
        |                              |
        |                              +- scm_newline
        |
        +- scm_flush


1.  To move to a Unicode-enabled guile, text information needs to be
    converted to an internal representation when read and converted
    back to the locale when written.  Most reading and writing for
    ports passes through scm_getc (input) and scm_lfwrite (output).
    Conversion between locale strings and internal strings should
    happen there.

2.  If string conversion occurs in scm_getc, then the scm_read reader
    will be receiving and parsing source code that has passed through
    the conversion routines.  This is initially not a problem since
    scheme code is largely ASCII, and Guile will start up in the C
    locale.

    But, if a source code file is not ASCII, the reader needs to be
    able to ascertain this before parsing the code from the file.  The
    encoding of a source code file is a property of the file and not
    the locale in which Guile is being run. 

    This implies that a source code file should have syntax to
    indicate its own encoding, if it is not ASCII.  Something akin to
    the <?xml encoding="utf-8"?> line in XML files.

3.  The text encoding of a port needs to be associated with the port.
    R6RS has the idea of transcoders for ports that require
    conversion.  It is daunting, but, having played with some ideas for a
    few weeks, it seems that at least a subset of the transcoder
    functionality needs to be implemented for this to make any sense.
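
To make points 1 and 3 concrete, here is a minimal sketch in C of a
getc-style routine that decodes according to an encoding stored on the
port itself.  Everything named toy_* or TOY_* is hypothetical and is
not part of libguile; only Latin-1 and UTF-8 are handled, and overlong
forms and surrogates are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encodings a port might carry; not Guile's actual API. */
enum toy_encoding { TOY_LATIN1, TOY_UTF8 };

/* A toy input port: a byte buffer plus the encoding associated with it. */
struct toy_port {
    const unsigned char *buf;
    size_t len;
    size_t pos;
    enum toy_encoding enc;
};

#define TOY_EOF     ((int32_t)-1)
#define TOY_INVALID ((int32_t)-2)

/* Read one character and return it as a Unicode code point.
   Returns TOY_EOF at end of input and TOY_INVALID on malformed or
   truncated UTF-8.  Overlong forms and surrogates are NOT rejected. */
int32_t toy_getc(struct toy_port *p)
{
    assert(p != NULL);
    if (p->pos >= p->len)
        return TOY_EOF;

    unsigned char b0 = p->buf[p->pos++];

    if (p->enc == TOY_LATIN1)
        return b0;              /* Latin-1 maps directly to U+0000..U+00FF */

    /* UTF-8: the lead byte tells how many continuation bytes follow. */
    int extra;
    int32_t cp;
    if (b0 < 0x80)
        return b0;
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else
        return TOY_INVALID;

    for (int i = 0; i < extra; i++) {
        if (p->pos >= p->len)
            return TOY_INVALID;         /* input ends mid-character */
        unsigned char b = p->buf[p->pos];
        if ((b & 0xC0) != 0x80)
            return TOY_INVALID;         /* not a continuation byte */
        p->pos++;
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}
```

The same two bytes 0xCE 0xBB come back as the single code point U+03BB
from a TOY_UTF8 port and as two characters from a TOY_LATIN1 port,
which is why the encoding has to travel with the port rather than with
the caller.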

I sent in my copyright assignment last week, so you should have it
now.

Thanks,

Mike Gran






* Re: Unicode, ports and encoding
From: Ludovic Courtès @ 2009-02-17 21:54 UTC
  To: guile-devel

Hello!

Mike Gran <spk121@yahoo.com> writes:

> 1.  To move to a Unicode-enabled guile, text information needs to be
>     converted to an internal representation when read and converted
>     back to the locale when written.  Most reading and writing for
>     ports passes through scm_getc (input) and scm_lfwrite (output).
>     Conversion between locale strings and internal strings should
>     happen there.

One strategy could be to have a new C port API, e.g., roughly based on
R6RS', with transcoders and all, and somehow arrange to have the current
port "API" mapped to that new shiny API.  It might be a bit ambitious,
though.

>     This implies that a source code file should have syntax to
>     indicate its own encoding, if it is not ASCII.  Something akin to
>     the <?xml encoding="utf-8"?> line in XML files.

One could imagine special treatment of, say, the first 10 lines of a
file, with the ability to recognize Emacs file variables like
"-*- coding: utf-8 -*-" and to change the current port transcoder
accordingly, something like that.

By default, the encoding used by `read' would be determined by the
input port's transcoder.
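
As a rough illustration of the file-variable idea, the sketch below
scans the first few lines of a buffer for "coding: NAME" and copies
NAME out.  find_coding_declaration and is_name_char are hypothetical
helpers, not Guile functions, and the sketch only looks for the
"coding:" key; it does not check for the surrounding "-*-" markers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Characters allowed in an encoding name for this sketch. */
static int is_name_char(char c)
{
    return isalnum((unsigned char)c) || c == '-' || c == '_' || c == '.';
}

/* Scan at most max_lines leading lines of buf for "coding: NAME" and
   copy NAME (truncated to fit) into out.  Returns 1 if a name was
   found, 0 otherwise.  Hypothetical helper, not part of libguile. */
int find_coding_declaration(const char *buf, size_t len, int max_lines,
                            char *out, size_t outsz)
{
    static const char key[] = "coding:";
    const size_t klen = sizeof key - 1;
    size_t i = 0;
    int line = 0;

    assert(buf != NULL && out != NULL && outsz > 0);

    while (i < len && line < max_lines) {
        /* Find the end of the current line. */
        size_t eol = i;
        while (eol < len && buf[eol] != '\n')
            eol++;

        /* Look for the key anywhere within this line. */
        for (size_t j = i; j + klen <= eol; j++) {
            if (memcmp(buf + j, key, klen) != 0)
                continue;
            size_t s = j + klen;
            while (s < eol && (buf[s] == ' ' || buf[s] == '\t'))
                s++;                    /* skip blanks after the colon */
            size_t e = s;
            while (e < eol && is_name_char(buf[e]))
                e++;
            if (e == s)
                continue;               /* key present but no name */
            size_t n = e - s;
            if (n >= outsz)
                n = outsz - 1;
            memcpy(out, buf + s, n);
            out[n] = '\0';
            return 1;
        }
        i = eol + 1;
        line++;
    }
    return 0;
}
```

A reader could call this on the first block of bytes read from a file,
with max_lines set to 10, and switch the port's transcoder only when it
returns 1; otherwise the port's existing setting stays in effect.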

> 3.  The text encoding of a port needs to be associated with the port.
>     R6RS has the idea of transcoders for ports that require
>     conversion.  It is daunting, but, having played with some ideas for a
>     few weeks, it seems that at least a subset of the transcoder
>     functionality needs to be implemented for this to make any sense.

Yes.

> I sent in my copyright assignment last week, so you should have it
> now.

Cool!

IIRC, the first step you suggested was the implementation of wide
string/char types.  Did you also work on this?

Thanks,
Ludo'.


* Re: Unicode, ports and encoding
From: Mike Gran @ 2009-02-17 23:45 UTC
  To: guile-devel

> From: Ludovic Courtès <ludo@gnu.org>
> > Mike Gran writes:

> >     This implies that a source code file should have syntax to
> >     indicate its own encoding, if it is not ASCII.  Something akin to
> >     the <?xml encoding="utf-8"?> line in XML files.
> 
> One could imagine special treatment of, say, the first 10 lines of a
> file, with the ability to recognize Emacs file variables like
> "-*- coding: utf-8 -*-" and to change the current port transcoder
> accordingly, something like that.

Yeah.  Something like that.

> IIRC, the first step you suggested was the implementation of wide
> string/char types.  Did you also work on this?

Sort of.

I thought I could start there, but it isn't easy.  There is a lot that
could be broken by modifying string processing.  So I tried writing some
tests first, so that I can check my work as I go along.  But the tests
have to be non-ASCII, so they need to be converted when they are read in.
It gets a little circular to use scm_from_locale_string to convert
non-ASCII strings in the test source and then have the test check
the behavior of scm_from_locale_string.

So, now I think a better route is to make some type of simplified
transcoding system available to ports so that non-ASCII tests are
read in correctly.  From there, one can work up toward wide strings
and chars while checking work along the way.

Thanks,

Mike Gran


* Re: Unicode, ports and encoding
From: Ludovic Courtès @ 2009-02-18  8:48 UTC
  To: guile-devel

Hi Mike,

Mike Gran <spk121@yahoo.com> writes:

> I thought I could start there, but it isn't easy.  There is a lot that
> could be broken by modifying string processing.  So I tried writing some
> tests first, so that I can check my work as I go along.  But the tests
> have to be non-ASCII, so they need to be converted when they are read in.
> It gets a little circular to use scm_from_locale_string to convert
> non-ASCII strings in the test source and then have the test check
> the behavior of scm_from_locale_string.

I see.  OTOH, it should be possible to write plain C tests that would
create strings using `scm_from_{utf8,locale}_string ()' (with sample
UTF-8 strings hardwired as raw byte arrays) and from there test
`scm_string_ref ()', etc., and all of the <libguile/chars.h> functions.
What do you think?
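
The shape of such a test might look like the sketch below: the sample
text is spelled as raw bytes, so the test source stays pure ASCII and
does not depend on how the reader decodes it.  No libguile calls are
made here; utf8_length and utf8_ref are hypothetical stand-ins for the
length and string-ref checks a real test would run on the result of
scm_from_utf8_string, and both assume well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* U+03BB, 'x', U+2192 as hardwired UTF-8 bytes, NUL-terminated. */
static const unsigned char sample[] = {
    0xCE, 0xBB, 'x', 0xE2, 0x86, 0x92, 0x00
};

/* Number of code points in a NUL-terminated, well-formed UTF-8 string:
   count every byte that is not a continuation byte. */
size_t utf8_length(const unsigned char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s; s++)
        if ((*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Code point at index k of a well-formed UTF-8 string, k < length. */
uint32_t utf8_ref(const unsigned char *s, size_t k)
{
    assert(s != NULL && k < utf8_length(s));

    /* Skip k whole characters: one lead byte plus its continuations. */
    while (k > 0) {
        s++;
        while ((*s & 0xC0) == 0x80)
            s++;
        k--;
    }

    /* Decode the character that starts at s. */
    uint32_t cp;
    int extra;
    if (*s < 0x80)                { cp = *s;        extra = 0; }
    else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
    else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
    else                          { cp = *s & 0x07; extra = 3; }
    while (extra-- > 0) {
        s++;
        cp = (cp << 6) | (*s & 0x3F);
    }
    return cp;
}
```

A test built this way would assert that the string has three
characters and that indices 0, 1 and 2 yield U+03BB, 'x' and U+2192,
without any non-ASCII text passing through the Scheme reader.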

Thanks,
Ludo'.
