unofficial mirror of guile-devel@gnu.org 
* string port encodings
@ 2013-01-15 14:36 Andy Wingo
  2013-01-15 15:20 ` Alex Shinn
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Andy Wingo @ 2013-01-15 14:36 UTC
  To: guile-devel

Quiz: what does this do?

  (define (f s)
    (with-output-to-string (lambda () (display s))))

When called with a string, what should it do?  Like (f "foo").

If you answered, "return the string", that's what I would think it
should do.

But no, currently the answer is locale-specific.  It encodes the string
according to the current locale, then decodes it from that encoding.  If
your locale can't encode the string, tough luck for you!

This is a bit crazy.  Surely the port should be textual?  Surely the
default encoding for a string port should be utf-8 or something that can
actually handle all strings?
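
For illustration, here is a minimal sketch of a workaround: bind the
`%default-port-encoding' fluid to UTF-8 around the call, so that the
string port can represent every character whatever the locale is.  The
name `f/utf8' is hypothetical and only used for this example.

```scheme
;; Sketch: pin the default port encoding so that the string port
;; created by `with-output-to-string' does not depend on the locale.
;; `f/utf8' is a hypothetical name, not part of Guile.
(define (f/utf8 s)
  (with-fluids ((%default-port-encoding "UTF-8"))
    (with-output-to-string
      (lambda () (display s)))))

;; GREEK SMALL LETTER ALPHA, built from its code point so that this
;; file stays pure ASCII.
(define alpha (string (integer->char #x03b1)))

;; Expected to hand back a string equal to its argument.
(f/utf8 alpha)
```

The port is created inside the dynamic extent of `with-fluids', which
is why the binding takes effect.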

Andy
-- 
http://wingolog.org/




* Re: string port encodings
  2013-01-15 14:36 string port encodings Andy Wingo
@ 2013-01-15 15:20 ` Alex Shinn
  2013-01-15 18:46 ` Mark H Weaver
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Alex Shinn @ 2013-01-15 15:20 UTC
  To: Andy Wingo; +Cc: guile-devel


On Tue, Jan 15, 2013 at 11:36 PM, Andy Wingo <wingo@pobox.com> wrote:

> Quiz: what does this do?
>
>   (define (f s)
>     (with-output-to-string (lambda () (display s))))
>
> When called with a string, what should it do?  Like (f "foo").
>
> If you answered, "return the string", that's what I would think it
> should do.
>
> But no, currently the answer is locale-specific.  It encodes the string
> according to the current locale, then decodes it from that encoding.  If
> your locale can't encode the string, tough luck for you!
>
> This is a bit crazy.  Surely the port should be textual?  Surely the
> default encoding for a string port should be utf-8 or something that can
> actually handle all strings?
>

Yes, the POSIX locale should refer to the external environment -
notably the terminal, command-line args and env variables.

Default file encodings are open to interpretation, but either using
POSIX or ignoring it completely and using heuristics combined with
overrides (e.g. -*- coding: ... -*-) are both reasonable.

But string ports are entirely internal - there's no reason to use
the locale for them.

-- 
Alex



* Re: string port encodings
  2013-01-15 14:36 string port encodings Andy Wingo
  2013-01-15 15:20 ` Alex Shinn
@ 2013-01-15 18:46 ` Mark H Weaver
  2013-01-15 21:21 ` Mike Gran
  2013-01-16 15:44 ` Ludovic Courtès
  3 siblings, 0 replies; 11+ messages in thread
From: Mark H Weaver @ 2013-01-15 18:46 UTC
  To: Andy Wingo; +Cc: guile-devel

Hi Andy,

Andy Wingo <wingo@pobox.com> writes:
> Quiz: what does this do?
>
>   (define (f s)
>     (with-output-to-string (lambda () (display s))))
>
> When called with a string, what should it do?  Like (f "foo").
>
> If you answered, "return the string", that's what I would think it
> should do.
>
> But no, currently the answer is locale-specific.  It encodes the string
> according to the current locale, then decodes it from that encoding.  If
> your locale can't encode the string, tough luck for you!
>
> This is a bit crazy.  Surely the port should be textual?  Surely the
> default encoding for a string port should be utf-8 or something that can
> actually handle all strings?

I agree wholeheartedly.  Unfortunately, the current broken behavior of
string ports has been recommended in our official forums as a clever way
to do arbitrary iconv conversions.  Nonetheless, I still think it ought
to be fixed properly in 2.2.
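
As an aside, encoding conversion does not need the string port trick.
A sketch, assuming a Guile that ships the (ice-9 iconv) module (it may
be newer than the version discussed in this thread):

```scheme
(use-modules (ice-9 iconv)        ; string->bytevector, bytevector->string
             (rnrs bytevectors))  ; bytevector->u8-list

;; GREEK SMALL LETTER ALPHA, built from its code point.
(define alpha (string (integer->char #x03b1)))

;; Encode explicitly: the encoding is an argument, not ambient state.
(define bytes (string->bytevector alpha "UTF-8"))

;; U+03B1 takes two bytes in UTF-8: #xCE #xB1.
(bytevector->u8-list bytes)

;; Decode the same bytes back into a string.
(bytevector->string bytes "UTF-8")
```

Both procedures take the encoding by name, so nothing here depends on
the locale or on `%default-port-encoding'.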

Here are some past discussions on this topic:

http://debbugs.gnu.org/cgi/bugreport.cgi?bug=11197
http://comments.gmane.org/gmane.lisp.guile.devel/14533

      Mark




* Re: string port encodings
  2013-01-15 14:36 string port encodings Andy Wingo
  2013-01-15 15:20 ` Alex Shinn
  2013-01-15 18:46 ` Mark H Weaver
@ 2013-01-15 21:21 ` Mike Gran
  2013-01-16 15:44 ` Ludovic Courtès
  3 siblings, 0 replies; 11+ messages in thread
From: Mike Gran @ 2013-01-15 21:21 UTC
  To: Andy Wingo, guile-devel

Hi Andy,

> From: Andy Wingo <wingo@pobox.com>
> But no, currently the answer is locale-specific.  It encodes the string
> according to the current locale, then decodes it from that encoding.  If
> your locale can't encode the string, tough luck for you!
> 
> This is a bit crazy.  Surely the port should be textual?  Surely the
> default encoding for a string port should be utf-8 or something that can
> actually handle all strings?

Some trivia.  The Unicode work began in mainline around Aug 2009.  At
first, string ports used locale encoding.
 
Then, in Sep 2009, 25ebc0340d30d1ceb786dbc8c3fe80c6e9ae0e87
had them as always using UTF-8 for the reasons that you mention here.

There was a discussion as to whether this was the right thing to do
that began with the following message
 
https://lists.gnu.org/archive/html/guile-devel/2010-01/msg00012.html
 
The commit 7b0419128bce68f48a158292430ed4a7202aa1b1 set string
ports to locale encoding.
 
-Mike




* Re: string port encodings
  2013-01-15 14:36 string port encodings Andy Wingo
                   ` (2 preceding siblings ...)
  2013-01-15 21:21 ` Mike Gran
@ 2013-01-16 15:44 ` Ludovic Courtès
  2013-01-16 16:57   ` Andy Wingo
  3 siblings, 1 reply; 11+ messages in thread
From: Ludovic Courtès @ 2013-01-16 15:44 UTC
  To: guile-devel

Hi!

Andy Wingo <wingo@pobox.com> skribis:

> But no, currently the answer is locale-specific.  It encodes the string
> according to the current locale, then decodes it from that encoding.  If
> your locale can't encode the string, tough luck for you!

SRFI-6 uses Unicode-capable ports since
ecb48dccbac6b8fdd969f50a23351ef7f4b91ce5.

Otherwise, %default-port-encoding governs (info "(guile) String Ports"):

 -- Scheme Procedure: call-with-output-string proc
 -- C Function: scm_call_with_output_string (proc)
     Calls the one-argument procedure PROC with a newly created output
     port.  When the function returns, the string composed of the
     characters written into the port is returned.  PROC should not
     close the port.

     Note that which characters can be written to a string port depend
     on the port's encoding.  The default encoding of string ports is
     specified by the `%default-port-encoding' fluid (*note
     `%default-port-encoding': Ports.).  For instance, it is an error
     to write Greek letter alpha to an ISO-8859-1-encoded string port
     since this character cannot be represented with ISO-8859-1:

          (define alpha (integer->char #x03b1)) ; GREEK SMALL LETTER ALPHA

          (with-fluids ((%default-port-encoding "ISO-8859-1"))
            (call-with-output-string
              (lambda (p)
                (display alpha p))))

          =>
          Throw to key `encoding-error'

     Changing the string port's encoding to a Unicode-capable encoding
     such as UTF-8 solves the problem.

> This is a bit crazy.  Surely the port should be textual?  Surely the
> default encoding for a string port should be utf-8 or something that can
> actually handle all strings?

As was said, “this has been recommended” (note the passive form!) on our
fora as a smart way to do encoding conversion.

The thing is, unlike R6RS, our ports can be used both for textual and
binary I/O.

This has been discussed at length already, and I think all the pros and
cons have been written already.  :-)

Thanks,
Ludo’.





* Re: string port encodings
  2013-01-16 15:44 ` Ludovic Courtès
@ 2013-01-16 16:57   ` Andy Wingo
  2013-01-16 17:37     ` Ludovic Courtès
  0 siblings, 1 reply; 11+ messages in thread
From: Andy Wingo @ 2013-01-16 16:57 UTC
  To: Ludovic Courtès; +Cc: guile-devel

Hi :)

On Wed 16 Jan 2013 16:44, ludo@gnu.org (Ludovic Courtès) writes:

> Andy Wingo <wingo@pobox.com> skribis:
>
>> But no, currently the answer is locale-specific.  It encodes the string
>> according to the current locale, then decodes it from that encoding.  If
>> your locale can't encode the string, tough luck for you!
>
> SRFI-6 uses Unicode-capable ports since
> ecb48dccbac6b8fdd969f50a23351ef7f4b91ce5

I have never heard of this srfi before; I always thought our string
ports "just worked" :P

> Otherwise, %default-port-encoding governs (info "(guile) String Ports"):

But why?  The documentation does not say it; it merely spends electrons
describing how to make string ports actually accept all characters.

You mention one use case:

> as a smart way to do encoding conversion.

But surely this is not a common case and is adequately handled by
set-port-encoding!, potentially via an optional argument.
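
A rough sketch of what the per-port approach could look like, using
only the existing `set-port-encoding!'.  The helper name
`display-to-string/utf8' is hypothetical.

```scheme
;; Sketch: choose the encoding on the port itself, before writing to
;; it, instead of relying on the ambient default.
;; `display-to-string/utf8' is a hypothetical name, not part of Guile.
(define (display-to-string/utf8 obj)
  (call-with-output-string
    (lambda (port)
      (set-port-encoding! port "UTF-8")
      (display obj port))))

;; GREEK SMALL LETTER ALPHA, built from its code point.
(define alpha (string (integer->char #x03b1)))

;; Expected to return a string equal to `alpha'.
(display-to-string/utf8 alpha)
```

A caller who really wants another encoding can pass a different name
to `set-port-encoding!' in the same place.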

> The thing is, unlike R6RS, our ports can be used both for textual and
> binary I/O.

I am aware of this, and not arguing against it :)

> This has been discussed at length already, and I think all the pros and
> cons have been written already.  :-)

"What from your father you’ve inherited, You must earn again, to own it
straight." -- Faust

:)

Flippantly yours,

Andy
-- 
http://wingolog.org/




* Re: string port encodings
  2013-01-16 16:57   ` Andy Wingo
@ 2013-01-16 17:37     ` Ludovic Courtès
  2013-01-16 18:16       ` Andy Wingo
  0 siblings, 1 reply; 11+ messages in thread
From: Ludovic Courtès @ 2013-01-16 17:37 UTC
  To: Andy Wingo; +Cc: guile-devel

Andy Wingo <wingo@pobox.com> skribis:

>> Otherwise, %default-port-encoding governs (info "(guile) String Ports"):
>
> But why?

Because %default-port-encoding specifies the default port encoding?  :-)

I think that was mostly the reason behind
<http://thread.gmane.org/gmane.lisp.guile.devel/9822>.

> You mention one use case:
>
>> as a smart way to do encoding conversion.
>
> But surely this is not a common case and is adequately handled by
> set-port-encoding!, potentially via an optional argument.

Yes.  So you could change the default string port encoding to UTF-8, and
update the doc accordingly, as I wrote in
<http://debbugs.gnu.org/cgi/bugreport.cgi?bug=11197>:

  In hindsight, UTF-8 does seem like a better default than the locale port
  encoding (which is what %default-port-encoding is, by default), but it
  does remain useful to specify a different encoding.

I just think this may have to wait until 2.2.

WDYT?

Ludo’.




* Re: string port encodings
  2013-01-16 17:37     ` Ludovic Courtès
@ 2013-01-16 18:16       ` Andy Wingo
  2013-01-31 11:04         ` Andy Wingo
  0 siblings, 1 reply; 11+ messages in thread
From: Andy Wingo @ 2013-01-16 18:16 UTC
  To: Ludovic Courtès; +Cc: guile-devel

On Wed 16 Jan 2013 18:37, ludo@gnu.org (Ludovic Courtès) writes:

> I just think this may have to wait until 2.2.
>
> WDYT?

Oh yes, agreed here.  Anyway let's let it simmer for a while.  Another
two or three of these threads should be enough to either reaffirm or
change the current state of things :)

A
-- 
http://wingolog.org/




* Re: string port encodings
  2013-01-16 18:16       ` Andy Wingo
@ 2013-01-31 11:04         ` Andy Wingo
  2013-01-31 17:55           ` Mark H Weaver
  2013-08-07  5:37           ` Mark H Weaver
  0 siblings, 2 replies; 11+ messages in thread
From: Andy Wingo @ 2013-01-31 11:04 UTC
  To: Ludovic Courtès; +Cc: guile-devel

Hi,

On Wed 16 Jan 2013 19:16, Andy Wingo <wingo@pobox.com> writes:

> On Wed 16 Jan 2013 18:37, ludo@gnu.org (Ludovic Courtès) writes:
>
>> I just think [string port encodings] may have to wait until 2.2.
>
> Oh yes, agreed here.  Anyway let's let it simmer for a while.  Another
> two or three of these threads should be enough to either reaffirm or
> change the current state of things :)

OK that was simmering long enough ;)

I just merged stable-2.0 to master.  There is now a failing test.

    (pass-if-equal
      '(*TOP* (foo "\xA0"))
      (xml->sxml "<foo>&nbsp;</foo>"
                 #:entities '((nbsp . "\xA0"))))

This one fails, with (encoding-error "scm_to_stringn" "cannot convert
narrow string to output locale" 84 #f #f).

It passes in stable-2.0 because "ASCII" is erroneously treated the same
as "ISO-8859-1".  In master, attempting to write a character above
#\x7F to an ASCII port will cause an encoding error.  That seems more
correct than the 2.0 behavior.  This error would have happened in
stable-2.0 if I had chosen an entity with a character above #\xFF.

Looking further, the cause is in sxml/upstream/SSAX.scm:

   (define (ssax:handle-parsed-entity port name entities
                                      content-handler str-handler seed)
    ...
           (call-with-input-string ent-body
             (lambda (port) (content-handler port new-entities seed)))
    ...)

Here is where I think this code goes wrong: its correctness appears to
depend on the default port encoding.  That is totally bogus.  It was
written long before we had such a thing.

Again, I think the default encoding for a string port should be one that
can represent all characters, and we should change this in master.
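
Until the default changes, a caller can pin the encoding around the
input string port.  A minimal sketch; `call-with-utf8-input-string' is
a hypothetical wrapper, not something SSAX or Guile provides.

```scheme
;; Sketch: open the input string port with a Unicode-capable default
;; encoding, so that reading STR back does not depend on the locale.
;; `call-with-utf8-input-string' is a hypothetical name.
(define (call-with-utf8-input-string str proc)
  (with-fluids ((%default-port-encoding "UTF-8"))
    (call-with-input-string str proc)))

;; NO-BREAK SPACE, the character from the failing test above.
(define nbsp (string (integer->char #xA0)))

;; Expected to read back the single character U+00A0.
(call-with-utf8-input-string nbsp read-char)
```

Note that the fluid stays bound while PROC runs, so any other port
PROC opens without an explicit encoding also defaults to UTF-8.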

Andy
-- 
http://wingolog.org/




* Re: string port encodings
  2013-01-31 11:04         ` Andy Wingo
@ 2013-01-31 17:55           ` Mark H Weaver
  2013-08-07  5:37           ` Mark H Weaver
  1 sibling, 0 replies; 11+ messages in thread
From: Mark H Weaver @ 2013-01-31 17:55 UTC
  To: Andy Wingo; +Cc: Ludovic Courtès, guile-devel

Andy Wingo <wingo@pobox.com> writes:
> Here is where I think this code goes wrong: its correctness appears to
> depend on the default port encoding.  That is totally bogus.  It was
> written long before we had such a thing.
>
> Again, I think the default encoding for a string port should be one that
> can represent all characters, and we should change this in master.

Yes, let's please fix this in master!

    Mark




* Re: string port encodings
  2013-01-31 11:04         ` Andy Wingo
  2013-01-31 17:55           ` Mark H Weaver
@ 2013-08-07  5:37           ` Mark H Weaver
  1 sibling, 0 replies; 11+ messages in thread
From: Mark H Weaver @ 2013-08-07  5:37 UTC
  To: Andy Wingo; +Cc: Ludovic Courtès, guile-devel

Andy Wingo <wingo@pobox.com> writes:
> Again, I think the default encoding for a string port should be one that
> can represent all characters, and we should change this in master.

Long ago on IRC, we agreed to fix this in master, and I just did so.

      Mark




Thread overview: 11+ messages
2013-01-15 14:36 string port encodings Andy Wingo
2013-01-15 15:20 ` Alex Shinn
2013-01-15 18:46 ` Mark H Weaver
2013-01-15 21:21 ` Mike Gran
2013-01-16 15:44 ` Ludovic Courtès
2013-01-16 16:57   ` Andy Wingo
2013-01-16 17:37     ` Ludovic Courtès
2013-01-16 18:16       ` Andy Wingo
2013-01-31 11:04         ` Andy Wingo
2013-01-31 17:55           ` Mark H Weaver
2013-08-07  5:37           ` Mark H Weaver
