unofficial mirror of bug-guile@gnu.org 
* bug#11197: problems with string ports and unicode
@ 2012-04-07 20:07 Klaus Stehle
  2012-04-09 21:12 ` Ludovic Courtès
  0 siblings, 1 reply; 8+ messages in thread
From: Klaus Stehle @ 2012-04-07 20:07 UTC (permalink / raw)
  To: 11197


Hi,

;;;; a very very short example script to describe the problem:

;; open a string port with unicode characters >= 0x0100
(define p (open-input-string "čtyří"))


Put this line into a script and run it with guile. The only output is:
=> Backtrace:

After that, guile hangs in an endless loop.

If you enter the line interactively into the REPL, everything works
properly and you can read all characters with (read-char p).



;;;; another very short script, which is possibly the same problem:

;; open a string port and unread a unicode character >= 0x0100
(define p (open-input-string "ibenik"))
(unread-char #\Š p)


Running these two lines as a script generates an error message:
=> ERROR: In procedure unread-char:
=> ERROR: Throw to key `encoding-error' with args
          `("scm_ungetc" "conversion to port encoding failed" 84 #f #\540)'.

If you enter the lines interactively into the REPL, everything works
properly and you can read all characters with (read-char p).


Cheers,
Klaus Stehle


----------------------------
guile --version
guile (GNU Guile) 2.0.5

uname -srm
Linux 2.6.32-5-amd64 x86_64

echo $LANG
de_DE.UTF-8


* bug#11197: problems with string ports and unicode
  2012-04-07 20:07 bug#11197: problems with string ports and unicode Klaus Stehle
@ 2012-04-09 21:12 ` Ludovic Courtès
  2012-04-11 16:08   ` Mark H Weaver
  0 siblings, 1 reply; 8+ messages in thread
From: Ludovic Courtès @ 2012-04-09 21:12 UTC (permalink / raw)
  To: Klaus Stehle; +Cc: 11197

Hi,

It may be that your string ports are created with a non-Unicode-capable
encoding.  Try something like:

  (define p
    (with-fluids ((%default-port-encoding "UTF-8"))
      (open-input-string "čtyří")))

More details in the manual (info "(guile) String Ports").
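
The same binding should also cover the second script from the report.
A minimal sketch, assuming the script file itself is decoded as UTF-8;
`port->string' is a hypothetical helper introduced only for this
illustration, not something Guile provides under that name here:

```scheme
;; -*- coding: utf-8 -*-
;; Create the string port while %default-port-encoding is bound to UTF-8,
;; so that the port's encoding can represent characters >= 0x0100.
(define p
  (with-fluids ((%default-port-encoding "UTF-8"))
    (open-input-string "ibenik")))

;; Push one character back; the port has to be able to encode it.
(unread-char #\Š p)

;; Hypothetical helper: drain a port into a string, one character at a time.
(define (port->string port)
  (let loop ((c (read-char port)) (acc '()))
    (if (eof-object? c)
        (list->string (reverse acc))
        (loop (read-char port) (cons c acc)))))

(define result (port->string p))
```

Note that the fluid only needs to be bound while the port is created;
the later `unread-char' and `read-char' calls use the encoding already
attached to the port.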

How does it work for you?

Ludo’.






* bug#11197: problems with string ports and unicode
  2012-04-09 21:12 ` Ludovic Courtès
@ 2012-04-11 16:08   ` Mark H Weaver
  2012-04-11 16:25     ` Ludovic Courtès
  0 siblings, 1 reply; 8+ messages in thread
From: Mark H Weaver @ 2012-04-11 16:08 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 11197, Klaus Stehle

ludo@gnu.org (Ludovic Courtès) writes:
> It may be that your string ports are created with a non-Unicode-capable
> encoding.  Try something like:
>
>   (define p
>     (with-fluids ((%default-port-encoding "UTF-8"))
>       (open-input-string "čtyří")))

IMO, this should not be needed.  Port encodings should only be relevant
when reading from ports involving byte strings, such as file ports or
socket ports.  The encoding used by Scheme strings is a purely internal
matter; from the user's perspective, Scheme strings are simply a
sequence of Unicode code points.

What _is_ needed is a file coding declaration near the top of the source
file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
the manual).  I tried that and it still fails for me.

I think this is a genuine bug.

     Mark






* bug#11197: problems with string ports and unicode
  2012-04-11 16:08   ` Mark H Weaver
@ 2012-04-11 16:25     ` Ludovic Courtès
  2012-04-11 17:53       ` Mark H Weaver
  0 siblings, 1 reply; 8+ messages in thread
From: Ludovic Courtès @ 2012-04-11 16:25 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: 11197, Klaus Stehle

Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

> ludo@gnu.org (Ludovic Courtès) writes:
>> It may be that your string ports are created with a non-Unicode-capable
>> encoding.  Try something like:
>>
>>   (define p
>>     (with-fluids ((%default-port-encoding "UTF-8"))
>>       (open-input-string "čtyří")))
>
> IMO, this should not be needed.  Port encodings should only be relevant
> when reading from ports involving byte strings, such as file ports or
> socket ports.  The encoding used by Scheme strings is a purely internal
> matter; from the user's perspective, Scheme strings are simply a
> sequence of Unicode code points.

Note that “UTF-8” above has nothing to do with Guile’s internal string
representation; it’s just one of the many encodings that can represent
“čtyří”.

> What _is_ needed is a file coding declaration near the top of the source
> file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
> the manual).

Yes.  And you actually need both–i.e., the ‘coding’ cookie won’t
magically make string ports use that encoding.
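
To illustrate the two separate pieces, here is a rough sketch of a
script that carries both (the names `p', `enc' and `first-char' are
only for illustration):

```scheme
;; -*- coding: utf-8 -*-
;; The declaration above only tells Guile's reader how to decode this file.

;; A string port created here still takes its encoding from
;; %default-port-encoding, independently of the declaration above.
(define p
  (with-fluids ((%default-port-encoding "UTF-8"))
    (open-input-string "čtyří")))

;; Inspect the encoding attached to the port.
(define enc (port-encoding p))

;; Read the first character back from the port.
(define first-char (read-char p))
```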

> I tried that and it still fails for me.

What fails exactly?

Thanks,
Ludo’.






* bug#11197: problems with string ports and unicode
  2012-04-11 16:25     ` Ludovic Courtès
@ 2012-04-11 17:53       ` Mark H Weaver
  2012-04-11 21:01         ` Ludovic Courtès
  0 siblings, 1 reply; 8+ messages in thread
From: Mark H Weaver @ 2012-04-11 17:53 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 11197, Klaus Stehle

Hi Ludovic,

ludo@gnu.org (Ludovic Courtès) writes:
> Mark H Weaver <mhw@netris.org> skribis:
>> ludo@gnu.org (Ludovic Courtès) writes:
>>> It may be that your string ports are created with a non-Unicode-capable
>>> encoding.  Try something like:
>>>
>>>   (define p
>>>     (with-fluids ((%default-port-encoding "UTF-8"))
>>>       (open-input-string "čtyří")))
>>
>> IMO, this should not be needed.  Port encodings should only be relevant
>> when reading from ports involving byte strings, such as file ports or
>> socket ports.  The encoding used by Scheme strings is a purely internal
>> matter; from the user's perspective, Scheme strings are simply a
>> sequence of Unicode code points.
>
> Note that “UTF-8” above has nothing to do with Guile’s internal string
> representation; it’s just one of the many encodings that can represent
> “čtyří”.

Okay, now I understand.  The problem is that internally, string ports
are implemented by converting the string into a stream of bytes in the
string port's encoding, and then the string port reads those bytes.

Nonetheless, it is very unfortunate that this internal implementation
detail "leaks" out into user code.  SRFI-6 says nothing about port
encodings, and portable code written for SRFI-6 will fail on Guile
unless the string is constrained to whatever the default port encoding
happens to be.

Conceptually, a string port is a textual port, not a binary port.  You
should be able to hand it an arbitrary string and read those characters
from it, as described in SRFI-6, without setting Guile-specific fluid
variables.  Similarly, you should be able to write arbitrary characters
to a string-output-port.

IMO, string ports should use UTF-8 as their initial port encoding, since
we know that UTF-8 can represent any Guile string.  This will allow
portable use of string ports.
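
For the output direction, a minimal sketch of what user code currently
has to do, assuming the Guile-specific fluid is bound by hand when the
port is created; `chars->string' is a name made up for this example:

```scheme
;; -*- coding: utf-8 -*-
;; Write a list of characters to a string output port whose encoding
;; is explicitly UTF-8, then collect the accumulated text.
(define (chars->string chars)
  (let ((out (with-fluids ((%default-port-encoding "UTF-8"))
               (open-output-string))))
    (for-each (lambda (c) (write-char c out)) chars)
    (get-output-string out)))

(define round-trip (chars->string (string->list "čtyří")))
```

With UTF-8 as the initial encoding of string ports, the `with-fluids'
wrapper would no longer be needed and plain SRFI-6 code would do.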

I realize that this would change the existing behavior of programs that
use binary I/O on string ports, but as things stand right now, portable
SRFI-6 code is broken on Guile.

What do you think?

>> What _is_ needed is a file coding declaration near the top of the source
>> file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
>> the manual).
>
> Yes.  And you actually need both–i.e., the ‘coding’ cookie won’t
> magically make string ports use that encoding.
>
>> I tried that and it still fails for me.
>
> What fails exactly?

It fails ungracefully (goes into an infinite loop while trying to print
the backtrace) without the %default-port-encoding setting.  It works
when I add both the %default-port-encoding setting and the coding
declaration.

     Thanks,
       Mark






* bug#11197: problems with string ports and unicode
  2012-04-11 17:53       ` Mark H Weaver
@ 2012-04-11 21:01         ` Ludovic Courtès
  2012-06-20 20:58           ` Ludovic Courtès
  2012-06-20 21:03           ` Ludovic Courtès
  0 siblings, 2 replies; 8+ messages in thread
From: Ludovic Courtès @ 2012-04-11 21:01 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: 11197, Klaus Stehle


Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

> Okay, now I understand.  The problem is that internally, string ports
> are implemented by converting the string into a stream of bytes in the
> string port's encoding, and then the string port reads those bytes.

Exactly.

[...]

> Conceptually, a string port is a textual port, not a binary port.

But not in Guile, where there’s no distinction between textual and
binary ports.  One can write code like:

  scheme@(guile-user)> (define (string->utf16 s)
                         (let ((p (with-fluids ((%default-port-encoding "UTF-16BE"))
                                    (open-input-string s))))
                           (get-bytevector-all p)))
  scheme@(guile-user)> (string->utf16 "hello")
  $4 = #vu8(0 104 0 101 0 108 0 108 0 111)
  scheme@(guile-user)> (use-modules (rnrs bytevectors))
  scheme@(guile-user)> (utf16->string $4)
  $5 = "hello"

> You should be able to hand it an arbitrary string and read those
> characters from it, as described in SRFI-6, without setting
> Guile-specific fluid variables.  Similarly, you should be able to
> write arbitrary characters to a string-output-port.

The SRFI-6 issue could be addressed with:


[-- Attachment #2: Type: text/x-patch, Size: 1137 bytes --]

diff --git a/module/srfi/srfi-6.scm b/module/srfi/srfi-6.scm
index 098b586..ba946ec 100644
--- a/module/srfi/srfi-6.scm
+++ b/module/srfi/srfi-6.scm
@@ -1,6 +1,6 @@
 ;;; srfi-6.scm --- Basic String Ports
 
-;; 	Copyright (C) 2001, 2002, 2003, 2006 Free Software Foundation, Inc.
+;; 	Copyright (C) 2001, 2002, 2003, 2006, 2012 Free Software Foundation, Inc.
 ;;
 ;; This library is free software; you can redistribute it and/or
 ;; modify it under the terms of the GNU Lesser General Public
@@ -23,10 +23,16 @@
 ;;; Code:
 
 (define-module (srfi srfi-6)
-  #:re-export (open-input-string open-output-string get-output-string))
+  #:export (open-input-string open-output-string)
+  #:re-export (get-output-string))
 
-;; Currently, guile provides these functions by default, so no action
-;; is needed, and this file is just a placeholder.
+(define (open-input-string s)
+  (with-fluids ((%default-port-encoding "UTF-8"))
+    ((@ (guile) open-input-string) s)))
+
+(define (open-output-string)
+  (with-fluids ((%default-port-encoding "UTF-8"))
+    ((@ (guile) open-output-string))))
 
 (cond-expand-provide (current-module) '(srfi-6))



It wouldn’t completely solve the problem.

> IMO, string ports should use UTF-8 as their initial port encoding, since
> we know that UTF-8 can represent any Guile string.  This will allow
> portable use of string ports.

The change was submitted and briefly discussed at
<http://thread.gmane.org/gmane.lisp.guile.devel/9822>.

I think the rationale was mostly backward compatibility (in 1.8 people
could mix Latin-1 textual and binary I/O), consistency with how other
ports behave, and the ability to change the default encoding of string
ports.

> I realize that this would change the existing behavior of programs that
> use binary I/O on string ports, but as things stand right now, portable
> SRFI-6 code is broken on Guile.
>
> What do you think?

In hindsight, UTF-8 does seem like a better default than the locale port
encoding (which is what %default-port-encoding is, by default), but it
does remain useful to specify a different encoding.

>>> What _is_ needed is a file coding declaration near the top of the source
>>> file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
>>> the manual).
>>
>> Yes.  And you actually need both–i.e., the ‘coding’ cookie won’t
>> magically make string ports use that encoding.
>>
>>> I tried that and it still fails for me.
>>
>> What fails exactly?
>
> It fails ungracefully (goes into an infinite loop while trying to print
> the backtrace) without the %default-port-encoding setting.

Indeed, it’s stuck in a deadlock:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007ffff75e1204 in __lll_lock_wait () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#1  0x00007ffff75dc4d4 in _L_lock_999 () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#2  0x00007ffff75dc2ea in pthread_mutex_lock () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#3  0x00007ffff7b30499 in scm_dynwind_pthread_mutex_lock (mutex=0x7ffff7dd28c0) at threads.c:1962
#4  0x00007ffff7b2bb0e in scm_mkstrport (pos=0x2, str=0x4, modes=327680, caller=<value optimized out>) at strports.c:287
#5  0x00007ffff7aac20b in display_backtrace_body (a=0x7fffffffc1a0) at backtrace.c:487
#6  0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f5d50, argv=0x6fa3b0, nargs=-1) at vm-i-system.c:895
#7  0x00007ffff7ac039e in scm_call_3 (proc=0x7f5d50, arg1=<value optimized out>, arg2=<value optimized out>, arg3=<value optimized out>) at eval.c:500
#8  0x00007ffff7b32504 in scm_internal_catch (tag=<value optimized out>, body=<value optimized out>, body_data=<value optimized out>, handler=<value optimized out>, handler_data=<value optimized out>) at throw.c:222
#9  0x00007ffff7aabbba in scm_display_backtrace_with_highlights (stack=<value optimized out>, port=<value optimized out>, first=<value optimized out>, depth=<value optimized out>, highlights=<value optimized out>)
    at backtrace.c:558
#10 0x00007ffff7ab725e in print_exception_and_backtrace (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:490
#11 pre_unwind_handler (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:534
#12 0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f3ce0, argv=0x6fa300, nargs=-1) at vm-i-system.c:895
#13 0x00007ffff7b4846e in scm_call_with_vm (vm=0x6f61f0, proc=0x7f3ce0, args=<value optimized out>) at vm.c:878
#14 0x00007ffff7b296db in scm_to_stringn (str=0x8dba80, lenp=0x7fffffffc4e8, encoding=<value optimized out>, handler=SCM_FAILED_CONVERSION_ERROR) at strings.c:2102
#15 0x00007ffff7b2bb73 in scm_mkstrport (pos=0x2, str=0x8dba80, modes=196608, caller=<value optimized out>) at strports.c:312
--8<---------------cut here---------------end--------------->8---

This could be fixed by calling ‘scm_new_port_table_entry’ after having
prepared the backing buffer, but the problem is that ‘pt->encoding’ is
needed before.

Thoughts?

Ludo’.


* bug#11197: problems with string ports and unicode
  2012-04-11 21:01         ` Ludovic Courtès
@ 2012-06-20 20:58           ` Ludovic Courtès
  2012-06-20 21:03           ` Ludovic Courtès
  1 sibling, 0 replies; 8+ messages in thread
From: Ludovic Courtès @ 2012-06-20 20:58 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: 11197, Klaus Stehle

Hi,

ludo@gnu.org (Ludovic Courtès) skribis:

> @@ -23,10 +23,16 @@
>  ;;; Code:
>  
>  (define-module (srfi srfi-6)
> -  #:re-export (open-input-string open-output-string get-output-string))
> +  #:export (open-input-string open-output-string)
> +  #:re-export (get-output-string))
>  
> -;; Currently, guile provides these functions by default, so no action
> -;; is needed, and this file is just a placeholder.
> +(define (open-input-string s)
> +  (with-fluids ((%default-port-encoding "UTF-8"))
> +    ((@ (guile) open-input-string) s)))
> +
> +(define (open-output-string)
> +  (with-fluids ((%default-port-encoding "UTF-8"))
> +    ((@ (guile) open-output-string))))

I’ve applied it as commit ecb48dccbac6b8fdd969f50a23351ef7f4b91ce5.

Thanks,
Ludo’.






* bug#11197: problems with string ports and unicode
  2012-04-11 21:01         ` Ludovic Courtès
  2012-06-20 20:58           ` Ludovic Courtès
@ 2012-06-20 21:03           ` Ludovic Courtès
  1 sibling, 0 replies; 8+ messages in thread
From: Ludovic Courtès @ 2012-06-20 21:03 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: 11197-done, Klaus Stehle

Hi,

ludo@gnu.org (Ludovic Courtès) skribis:

> Indeed, it’s stuck in a deadlock:
>
> (gdb) bt
> #0  0x00007ffff75e1204 in __lll_lock_wait () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
> #1  0x00007ffff75dc4d4 in _L_lock_999 () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
> #2  0x00007ffff75dc2ea in pthread_mutex_lock () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
> #3  0x00007ffff7b30499 in scm_dynwind_pthread_mutex_lock (mutex=0x7ffff7dd28c0) at threads.c:1962
> #4  0x00007ffff7b2bb0e in scm_mkstrport (pos=0x2, str=0x4, modes=327680, caller=<value optimized out>) at strports.c:287
> #5  0x00007ffff7aac20b in display_backtrace_body (a=0x7fffffffc1a0) at backtrace.c:487
> #6  0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f5d50, argv=0x6fa3b0, nargs=-1) at vm-i-system.c:895
> #7  0x00007ffff7ac039e in scm_call_3 (proc=0x7f5d50, arg1=<value optimized out>, arg2=<value optimized out>, arg3=<value optimized out>) at eval.c:500
> #8  0x00007ffff7b32504 in scm_internal_catch (tag=<value optimized out>, body=<value optimized out>, body_data=<value optimized out>, handler=<value optimized out>, handler_data=<value optimized out>) at throw.c:222
> #9  0x00007ffff7aabbba in scm_display_backtrace_with_highlights (stack=<value optimized out>, port=<value optimized out>, first=<value optimized out>, depth=<value optimized out>, highlights=<value optimized out>)
>     at backtrace.c:558
> #10 0x00007ffff7ab725e in print_exception_and_backtrace (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:490
> #11 pre_unwind_handler (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:534
> #12 0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f3ce0, argv=0x6fa300, nargs=-1) at vm-i-system.c:895
> #13 0x00007ffff7b4846e in scm_call_with_vm (vm=0x6f61f0, proc=0x7f3ce0, args=<value optimized out>) at vm.c:878
> #14 0x00007ffff7b296db in scm_to_stringn (str=0x8dba80, lenp=0x7fffffffc4e8, encoding=<value optimized out>, handler=SCM_FAILED_CONVERSION_ERROR) at strings.c:2102
> #15 0x00007ffff7b2bb73 in scm_mkstrport (pos=0x2, str=0x8dba80, modes=196608, caller=<value optimized out>) at strports.c:312
>
> This could be fixed by calling ‘scm_new_port_table_entry’ after having
> prepared the backing buffer, but the problem is that ‘pt->encoding’ is
> needed before.

Fixed in 03fcf93bff9f02a3d12ab86be4e67b996310aad4 (not particularly
elegant, but I couldn’t think of a better way.)  The test in that commit
captures the initial problem.

I’m marking this bug as “done”.  If you would like to discuss string
port encodings, separate binary/textual ports, or any other significant
change, you’re welcome to do so on guile-devel@gnu.org, of course.

Thanks!

Ludo’.





