* bug#11197: problems with string ports and unicode
From: Klaus Stehle @ 2012-04-07 20:07 UTC
To: 11197

Hi,

;;;; a very very short example script to describe the problem:

;; open a string port with unicode characters >= 0x0100
(define p (open-input-string "čtyří"))

Put the line into a script and start guile.
You will see the output:
=> Backtrace:

That's all, and guile will hang in an eternal loop.

If you enter the line interactively into the REPL, everything works
properly and you can read all characters with (read-char p).


;;;; another very short script, which is possibly the same problem:

;; open a string port and unread a unicode character >= 0x0100
(define p (open-input-string "ibenik"))
(unread-char #\Š p)

Running these two lines as a script generates an error message:
=> ERROR: In procedure unread-char:
=> ERROR: Throw to key `encoding-error' with args `("scm_ungetc" "conversion to port encoding failed" 84 #f #\540)'.

If you enter the lines interactively into the REPL, everything works
properly and you can read all characters with (read-char p).

Cheers,
Klaus Stehle

----------------------------
guile --version
guile (GNU Guile) 2.0.5

uname -srm
Linux 2.6.32-5-amd64 x86_64

echo $LANG
de_DE.UTF-8

* bug#11197: problems with string ports and unicode
From: Ludovic Courtès @ 2012-04-09 21:12 UTC
To: Klaus Stehle; +Cc: 11197

Hi,

It may be that your string ports are created with a non-Unicode-capable
encoding.  Try something like:

  (define p
    (with-fluids ((%default-port-encoding "UTF-8"))
      (open-input-string "čtyří")))

More details in the manual (info "(guile) String Ports").

How does it work for you?

Ludo’.

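To illustrate the suggestion above, here is a minimal sketch (not taken from the thread) that applies the same workaround to the second script in the report. It assumes the file is saved as UTF-8; `read-all-chars` is a hypothetical helper introduced only for this example.

```scheme
;; -*- coding: utf-8 -*-
;; Sketch: create the string port while %default-port-encoding is "UTF-8",
;; so that characters >= U+0100 can be converted to the port encoding.
(define p
  (with-fluids ((%default-port-encoding "UTF-8"))
    (open-input-string "ibenik")))

;; Push a character >= U+0100 back onto the port.
(unread-char #\Š p)

;; Hypothetical helper: read every remaining character into a string.
(define (read-all-chars port)
  (let loop ((chars '()))
    (let ((c (read-char port)))
      (if (eof-object? c)
          (list->string (reverse chars))
          (loop (cons c chars))))))

(define result (read-all-chars p))

;; Print a boolean rather than the string itself, so the output does not
;; depend on the encoding of the terminal.
(write (string=? result "Šibenik"))
(newline)
```

The port keeps the encoding it was created with, so the fluid only needs to be bound around `open-input-string`, not around the later `unread-char` and `read-char` calls.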
* bug#11197: problems with string ports and unicode
From: Mark H Weaver @ 2012-04-11 16:08 UTC
To: Ludovic Courtès; +Cc: 11197, Klaus Stehle

ludo@gnu.org (Ludovic Courtès) writes:
> It may be that your string ports are created with a non-Unicode-capable
> encoding.  Try something like:
>
>   (define p
>     (with-fluids ((%default-port-encoding "UTF-8"))
>       (open-input-string "čtyří")))

IMO, this should not be needed.  Port encodings should only be relevant
when reading from ports involving byte strings, such as file ports or
socket ports.  The encoding used by Scheme strings is a purely internal
matter; from the user's perspective, Scheme strings are simply a
sequence of Unicode code points.

What _is_ needed is a file coding declaration near the top of the source
file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
the manual).

I tried that and it still fails for me.  I think this is a genuine bug.

    Mark

* bug#11197: problems with string ports and unicode
From: Ludovic Courtès @ 2012-04-11 16:25 UTC
To: Mark H Weaver; +Cc: 11197, Klaus Stehle

Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

> ludo@gnu.org (Ludovic Courtès) writes:
>> It may be that your string ports are created with a non-Unicode-capable
>> encoding.  Try something like:
>>
>>   (define p
>>     (with-fluids ((%default-port-encoding "UTF-8"))
>>       (open-input-string "čtyří")))
>
> IMO, this should not be needed.  Port encodings should only be relevant
> when reading from ports involving byte strings, such as file ports or
> socket ports.  The encoding used by Scheme strings is a purely internal
> matter; from the user's perspective, Scheme strings are simply a
> sequence of Unicode code points.

Note that “UTF-8” above has nothing to do with Guile’s internal string
representation; it’s just one of the many encodings that can represent
“čtyří”.

> What _is_ needed is a file coding declaration near the top of the source
> file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
> the manual).

Yes.  And you actually need both–i.e., the ‘coding’ cookie won’t
magically make string ports use that encoding.

> I tried that and it still fails for me.

What fails exactly?

Thanks,
Ludo’.

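The "you need both" point can be sketched as a complete script (an illustration, not code from the thread): the coding declaration governs how the source file is decoded, while the fluid binding governs the encoding of the string port. The names `text` and `chars` are introduced only for this example.

```scheme
;; -*- coding: utf-8 -*-
;; 1. The coding declaration above tells Guile how to decode this file,
;;    so the string literal below is read as five characters.
(define text "čtyří")

;; 2. The fluid binding makes the new string port use an encoding that
;;    can represent every character of TEXT.
(define p
  (with-fluids ((%default-port-encoding "UTF-8"))
    (open-input-string text)))

;; Read the characters back from the port, in order.
(define chars
  (let loop ((acc '()))
    (let ((c (read-char p)))
      (if (eof-object? c)
          (reverse acc)
          (loop (cons c acc))))))

;; Check that the port yields exactly the characters of the original string.
(write (equal? chars (string->list text)))
(newline)
```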
* bug#11197: problems with string ports and unicode
From: Mark H Weaver @ 2012-04-11 17:53 UTC
To: Ludovic Courtès; +Cc: 11197, Klaus Stehle

Hi Ludovic,

ludo@gnu.org (Ludovic Courtès) writes:
> Mark H Weaver <mhw@netris.org> skribis:
>> ludo@gnu.org (Ludovic Courtès) writes:
>>> It may be that your string ports are created with a non-Unicode-capable
>>> encoding.  Try something like:
>>>
>>>   (define p
>>>     (with-fluids ((%default-port-encoding "UTF-8"))
>>>       (open-input-string "čtyří")))
>>
>> IMO, this should not be needed.  Port encodings should only be relevant
>> when reading from ports involving byte strings, such as file ports or
>> socket ports.  The encoding used by Scheme strings is a purely internal
>> matter; from the user's perspective, Scheme strings are simply a
>> sequence of Unicode code points.
>
> Note that “UTF-8” above has nothing to do with Guile’s internal string
> representation; it’s just one of the many encodings that can represent
> “čtyří”.

Okay, now I understand.  The problem is that internally, string ports
are implemented by converting the string into a stream of bytes in the
string port's encoding, and then the string port reads those bytes.

Nonetheless, it is very unfortunate that this internal implementation
detail "leaks" out into user code.  SRFI-6 says nothing about port
encodings, and portable code written for SRFI-6 will fail on Guile
unless the string is constrained to whatever the default port encoding
happens to be.

Conceptually, a string port is a textual port, not a binary port.  You
should be able to hand it an arbitrary string and read those characters
from it, as described in SRFI-6, without setting Guile-specific fluid
variables.  Similarly, you should be able to write arbitrary characters
to a string-output-port.

IMO, string ports should use UTF-8 as their initial port encoding, since
we know that UTF-8 can represent any Guile string.  This will allow
portable use of string ports.

I realize that this would change the existing behavior of programs that
use binary I/O on string ports, but as things stand right now, portable
SRFI-6 code is broken on Guile.

What do you think?

>> What _is_ needed is a file coding declaration near the top of the source
>> file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
>> the manual).
>
> Yes.  And you actually need both–i.e., the ‘coding’ cookie won’t
> magically make string ports use that encoding.
>
>> I tried that and it still fails for me.
>
> What fails exactly?

It fails ungracefully (goes into an infinite loop while trying to print
the backtrace) without the %default-port-encoding setting.  It works
when I add both the %default-port-encoding setting and the coding
declaration.

    Thanks,
      Mark

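The same consideration applies to output string ports. The following is a minimal sketch, not code from the thread: `call-with-utf8-output-string` is a hypothetical helper that creates the port under a UTF-8 default encoding, so that arbitrary characters can be written to it regardless of the locale.

```scheme
;; -*- coding: utf-8 -*-
;; Hypothetical helper: run PROC on a string output port that was created
;; with "UTF-8" as its encoding, and return the accumulated string.
(define (call-with-utf8-output-string proc)
  (let ((port (with-fluids ((%default-port-encoding "UTF-8"))
                (open-output-string))))
    (proc port)
    (get-output-string port)))

;; Write a character >= U+0100 followed by plain ASCII text.
(define out
  (call-with-utf8-output-string
   (lambda (port)
     (write-char #\Š port)
     (display "ibenik" port))))

;; Print a boolean so the result does not depend on the terminal encoding.
(write (string=? out "Šibenik"))
(newline)
```

Wrapping the port creation in a helper keeps the Guile-specific fluid in one place, which is the shape a SRFI-6 compatibility layer would naturally take.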
* bug#11197: problems with string ports and unicode
From: Ludovic Courtès @ 2012-04-11 21:01 UTC
To: Mark H Weaver; +Cc: 11197, Klaus Stehle

Hi Mark,

Mark H Weaver <mhw@netris.org> skribis:

> Okay, now I understand.  The problem is that internally, string ports
> are implemented by converting the string into a stream of bytes in the
> string port's encoding, and then the string port reads those bytes.

Exactly.

[...]

> Conceptually, a string port is a textual port, not a binary port.

But not in Guile, where there’s no distinction between textual and
binary ports.  One can write code like:

  scheme@(guile-user)> (define (string->utf16 s)
                         (let ((p (with-fluids ((%default-port-encoding "UTF-16BE"))
                                    (open-input-string s))))
                           (get-bytevector-all p)))
  scheme@(guile-user)> (string->utf16 "hello")
  $4 = #vu8(0 104 0 101 0 108 0 108 0 111)
  scheme@(guile-user)> (use-modules (rnrs bytevectors))
  scheme@(guile-user)> (utf16->string $4)
  $5 = "hello"

> You should be able to hand it an arbitrary string and read those
> characters from it, as described in SRFI-6, without setting
> Guile-specific fluid variables.  Similarly, you should be able to
> write arbitrary characters to a string-output-port.

The SRFI-6 issue could be addressed with:

diff --git a/module/srfi/srfi-6.scm b/module/srfi/srfi-6.scm
index 098b586..ba946ec 100644
--- a/module/srfi/srfi-6.scm
+++ b/module/srfi/srfi-6.scm
@@ -1,6 +1,6 @@
 ;;; srfi-6.scm --- Basic String Ports
 
-;; Copyright (C) 2001, 2002, 2003, 2006 Free Software Foundation, Inc.
+;; Copyright (C) 2001, 2002, 2003, 2006, 2012 Free Software Foundation, Inc.
 ;;
 ;; This library is free software; you can redistribute it and/or
 ;; modify it under the terms of the GNU Lesser General Public
@@ -23,10 +23,16 @@
 ;;; Code:
 
 (define-module (srfi srfi-6)
-  #:re-export (open-input-string open-output-string get-output-string))
+  #:export (open-input-string open-output-string)
+  #:re-export (get-output-string))
 
-;; Currently, guile provides these functions by default, so no action
-;; is needed, and this file is just a placeholder.
+(define (open-input-string s)
+  (with-fluids ((%default-port-encoding "UTF-8"))
+    ((@ (guile) open-input-string) s)))
+
+(define (open-output-string)
+  (with-fluids ((%default-port-encoding "UTF-8"))
+    ((@ (guile) open-output-string))))
 
 (cond-expand-provide (current-module) '(srfi-6))
 

It wouldn’t completely solve the problem.

> IMO, string ports should use UTF-8 as their initial port encoding, since
> we know that UTF-8 can represent any Guile string.  This will allow
> portable use of string ports.

The change was submitted and briefly discussed at
<http://thread.gmane.org/gmane.lisp.guile.devel/9822>.

I think the rationale was mostly backward compatibility (in 1.8 people
could mix Latin-1 textual and binary I/O), consistency with how other
ports behave, and the ability to change the default encoding of string
ports.

> I realize that this would change the existing behavior of programs that
> use binary I/O on string ports, but as things stand right now, portable
> SRFI-6 code is broken on Guile.
>
> What do you think?

In hindsight, UTF-8 does seem like a better default than the locale port
encoding (which is what %default-port-encoding is, by default), but it
does remain useful to specify a different encoding.

>>> What _is_ needed is a file coding declaration near the top of the source
>>> file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
>>> the manual).
>>
>> Yes.  And you actually need both–i.e., the ‘coding’ cookie won’t
>> magically make string ports use that encoding.
>>
>>> I tried that and it still fails for me.
>>
>> What fails exactly?
>
> It fails ungracefully (goes into an infinite loop while trying to print
> the backtrace) without the %default-port-encoding setting.

Indeed, it’s stuck in a deadlock:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007ffff75e1204 in __lll_lock_wait () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#1  0x00007ffff75dc4d4 in _L_lock_999 () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#2  0x00007ffff75dc2ea in pthread_mutex_lock () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#3  0x00007ffff7b30499 in scm_dynwind_pthread_mutex_lock (mutex=0x7ffff7dd28c0) at threads.c:1962
#4  0x00007ffff7b2bb0e in scm_mkstrport (pos=0x2, str=0x4, modes=327680, caller=<value optimized out>) at strports.c:287
#5  0x00007ffff7aac20b in display_backtrace_body (a=0x7fffffffc1a0) at backtrace.c:487
#6  0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f5d50, argv=0x6fa3b0, nargs=-1) at vm-i-system.c:895
#7  0x00007ffff7ac039e in scm_call_3 (proc=0x7f5d50, arg1=<value optimized out>, arg2=<value optimized out>, arg3=<value optimized out>) at eval.c:500
#8  0x00007ffff7b32504 in scm_internal_catch (tag=<value optimized out>, body=<value optimized out>, body_data=<value optimized out>, handler=<value optimized out>, handler_data=<value optimized out>) at throw.c:222
#9  0x00007ffff7aabbba in scm_display_backtrace_with_highlights (stack=<value optimized out>, port=<value optimized out>, first=<value optimized out>, depth=<value optimized out>, highlights=<value optimized out>) at backtrace.c:558
#10 0x00007ffff7ab725e in print_exception_and_backtrace (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:490
#11 pre_unwind_handler (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:534
#12 0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f3ce0, argv=0x6fa300, nargs=-1) at vm-i-system.c:895
#13 0x00007ffff7b4846e in scm_call_with_vm (vm=0x6f61f0, proc=0x7f3ce0, args=<value optimized out>) at vm.c:878
#14 0x00007ffff7b296db in scm_to_stringn (str=0x8dba80, lenp=0x7fffffffc4e8, encoding=<value optimized out>, handler=SCM_FAILED_CONVERSION_ERROR) at strings.c:2102
#15 0x00007ffff7b2bb73 in scm_mkstrport (pos=0x2, str=0x8dba80, modes=196608, caller=<value optimized out>) at strports.c:312
--8<---------------cut here---------------end--------------->8---

This could be fixed by calling ‘scm_new_port_table_entry’ after having
prepared the backing buffer, but the problem is that ‘pt->encoding’ is
needed before.

Thoughts?

Ludo’.

* bug#11197: problems with string ports and unicode
From: Ludovic Courtès @ 2012-06-20 20:58 UTC
To: Mark H Weaver; +Cc: 11197, Klaus Stehle

Hi,

ludo@gnu.org (Ludovic Courtès) skribis:

> @@ -23,10 +23,16 @@
>  ;;; Code:
>
>  (define-module (srfi srfi-6)
> -  #:re-export (open-input-string open-output-string get-output-string))
> +  #:export (open-input-string open-output-string)
> +  #:re-export (get-output-string))
>
> -;; Currently, guile provides these functions by default, so no action
> -;; is needed, and this file is just a placeholder.
> +(define (open-input-string s)
> +  (with-fluids ((%default-port-encoding "UTF-8"))
> +    ((@ (guile) open-input-string) s)))
> +
> +(define (open-output-string)
> +  (with-fluids ((%default-port-encoding "UTF-8"))
> +    ((@ (guile) open-output-string))))

I’ve applied it as commit ecb48dccbac6b8fdd969f50a23351ef7f4b91ce5.

Thanks,
Ludo’.

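With that patch applied, portable SRFI-6 code would import the module instead of binding the fluid itself. A minimal sketch follows, on the assumption that the installed Guile ships the patched `(srfi srfi-6)`. On an unpatched 2.0.5 the module only re-exports the core procedures, so the problem from the original report would remain. The name `first-char` is introduced only for this example, and Guile may print a warning that the import overrides a core binding.

```scheme
;; -*- coding: utf-8 -*-
;; Assumption: (srfi srfi-6) is the patched version, whose
;; open-input-string creates the port with "UTF-8" as its encoding.
(use-modules (srfi srfi-6))

;; No Guile-specific fluid is needed in user code any more.
(define p (open-input-string "čtyří"))

;; Read the first character, which is >= U+0100.
(define first-char (read-char p))

;; Print a boolean so the result does not depend on the terminal encoding.
(write (char=? first-char #\č))
(newline)
```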
* bug#11197: problems with string ports and unicode
From: Ludovic Courtès @ 2012-06-20 21:03 UTC
To: Mark H Weaver; +Cc: 11197-done, Klaus Stehle

Hi,

ludo@gnu.org (Ludovic Courtès) skribis:

> Indeed, it’s stuck in a deadlock:
>
> (gdb) bt
> #0  0x00007ffff75e1204 in __lll_lock_wait () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
> #1  0x00007ffff75dc4d4 in _L_lock_999 () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
> #2  0x00007ffff75dc2ea in pthread_mutex_lock () from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
> #3  0x00007ffff7b30499 in scm_dynwind_pthread_mutex_lock (mutex=0x7ffff7dd28c0) at threads.c:1962
> #4  0x00007ffff7b2bb0e in scm_mkstrport (pos=0x2, str=0x4, modes=327680, caller=<value optimized out>) at strports.c:287
> #5  0x00007ffff7aac20b in display_backtrace_body (a=0x7fffffffc1a0) at backtrace.c:487
> #6  0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f5d50, argv=0x6fa3b0, nargs=-1) at vm-i-system.c:895
> #7  0x00007ffff7ac039e in scm_call_3 (proc=0x7f5d50, arg1=<value optimized out>, arg2=<value optimized out>, arg3=<value optimized out>) at eval.c:500
> #8  0x00007ffff7b32504 in scm_internal_catch (tag=<value optimized out>, body=<value optimized out>, body_data=<value optimized out>, handler=<value optimized out>, handler_data=<value optimized out>) at throw.c:222
> #9  0x00007ffff7aabbba in scm_display_backtrace_with_highlights (stack=<value optimized out>, port=<value optimized out>, first=<value optimized out>, depth=<value optimized out>, highlights=<value optimized out>) at backtrace.c:558
> #10 0x00007ffff7ab725e in print_exception_and_backtrace (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:490
> #11 pre_unwind_handler (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0) at continuations.c:534
> #12 0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f3ce0, argv=0x6fa300, nargs=-1) at vm-i-system.c:895
> #13 0x00007ffff7b4846e in scm_call_with_vm (vm=0x6f61f0, proc=0x7f3ce0, args=<value optimized out>) at vm.c:878
> #14 0x00007ffff7b296db in scm_to_stringn (str=0x8dba80, lenp=0x7fffffffc4e8, encoding=<value optimized out>, handler=SCM_FAILED_CONVERSION_ERROR) at strings.c:2102
> #15 0x00007ffff7b2bb73 in scm_mkstrport (pos=0x2, str=0x8dba80, modes=196608, caller=<value optimized out>) at strports.c:312
>
> This could be fixed by calling ‘scm_new_port_table_entry’ after having
> prepared the backing buffer, but the problem is that ‘pt->encoding’ is
> needed before.

Fixed in 03fcf93bff9f02a3d12ab86be4e67b996310aad4 (not particularly
elegant, but I couldn’t think of a better way.)

The test in that commit captures the initial problem.

I’m marking this bug as “done”.  If you would like to discuss string
port encodings, separate binary/textual ports, or any other significant
change, you’re welcome to do so on guile-devel@gnu.org, of course.

Thanks!

Ludo’.