From: ludo@gnu.org (Ludovic Courtès)
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#11197: problems with string ports and unicode
Date: Wed, 11 Apr 2012 23:01:16 +0200
Message-ID: <87wr5mj84j.fsf@gnu.org>
References: <87ty0sa9tu.fsf@gnu.org> <87ty0q8d5h.fsf@netris.org> <87zkaip76h.fsf@gnu.org> <87lim288a6.fsf@netris.org>
In-Reply-To: <87lim288a6.fsf@netris.org> (Mark H. Weaver's message of "Wed, 11 Apr 2012 13:53:21 -0400")
To: Mark H Weaver
Cc: 11197@debbugs.gnu.org, Klaus Stehle

Hi Mark,

Mark H Weaver wrote:

> Okay, now I understand.  The problem is that internally, string ports
> are implemented by converting the string into a stream of bytes in the
> string port's encoding, and then the string port reads those bytes.

Exactly.

[...]

> Conceptually, a string port is a textual port, not a binary port.

But not in Guile, where there’s no distinction between textual and
binary ports.  One can write code like:

  scheme@(guile-user)> (define (string->utf16 s)
                         (let ((p (with-fluids ((%default-port-encoding "UTF-16BE"))
                                    (open-input-string s))))
                           (get-bytevector-all p)))
  scheme@(guile-user)> (string->utf16 "hello")
  $4 = #vu8(0 104 0 101 0 108 0 108 0 111)
  scheme@(guile-user)> (use-modules (rnrs bytevectors))
  scheme@(guile-user)> (utf16->string $4)
  $5 = "hello"

> You should be able to hand it an arbitrary string and read those
> characters from it, as described in SRFI-6, without setting
> Guile-specific fluid variables.  Similarly, you should be able to
> write arbitrary characters to a string-output-port.

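To make the failure mode concrete: since a string port is backed by
bytes in the port’s encoding, opening one can only succeed if every
character of the string is representable in that encoding.  The C sketch
below is purely illustrative and is not libguile code;
‘encode_latin1’ and ‘encode_utf8’ are hypothetical helpers made up for
the example.  It shows why a character such as λ (U+03BB) is rejected
by an ISO-8859-1 encoder, whereas UTF-8 accepts any valid code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative helpers only: these names are hypothetical and are not
   libguile functions.  Each one encodes N code points from CPS into OUT
   and returns the number of bytes written, or -1 if a code point cannot
   be represented in the target encoding or OUT is too small.  */

long
encode_latin1 (const uint32_t *cps, size_t n, uint8_t *out, size_t size)
{
  if (n > size)
    return -1;
  for (size_t i = 0; i < n; i++)
    {
      if (cps[i] > 0xFF)        /* ISO-8859-1 stops at U+00FF.  */
        return -1;
      out[i] = (uint8_t) cps[i];
    }
  return (long) n;
}

long
encode_utf8 (const uint32_t *cps, size_t n, uint8_t *out, size_t size)
{
  size_t len = 0;

  for (size_t i = 0; i < n; i++)
    {
      uint32_t c = cps[i];
      size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;

      /* Surrogates and values above U+10FFFF are not characters.  */
      if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)
          || size - len < need)
        return -1;

      if (need == 1)
        out[len++] = (uint8_t) c;
      else if (need == 2)
        {
          out[len++] = (uint8_t) (0xC0 | (c >> 6));
          out[len++] = (uint8_t) (0x80 | (c & 0x3F));
        }
      else if (need == 3)
        {
          out[len++] = (uint8_t) (0xE0 | (c >> 12));
          out[len++] = (uint8_t) (0x80 | ((c >> 6) & 0x3F));
          out[len++] = (uint8_t) (0x80 | (c & 0x3F));
        }
      else
        {
          out[len++] = (uint8_t) (0xF0 | (c >> 18));
          out[len++] = (uint8_t) (0x80 | ((c >> 12) & 0x3F));
          out[len++] = (uint8_t) (0x80 | ((c >> 6) & 0x3F));
          out[len++] = (uint8_t) (0x80 | (c & 0x3F));
        }
    }
  return (long) len;
}
```

With these, encoding U+03BB fails for Latin-1 (‘encode_latin1’ returns
-1) and yields the two bytes 0xCE 0xBB with ‘encode_utf8’.
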
The SRFI-6 issue could be addressed with:

diff --git a/module/srfi/srfi-6.scm b/module/srfi/srfi-6.scm
index 098b586..ba946ec 100644
--- a/module/srfi/srfi-6.scm
+++ b/module/srfi/srfi-6.scm
@@ -1,6 +1,6 @@
 ;;; srfi-6.scm --- Basic String Ports
 
-;; Copyright (C) 2001, 2002, 2003, 2006 Free Software Foundation, Inc.
+;; Copyright (C) 2001, 2002, 2003, 2006, 2012 Free Software Foundation, Inc.
 ;;
 ;; This library is free software; you can redistribute it and/or
 ;; modify it under the terms of the GNU Lesser General Public
@@ -23,10 +23,16 @@
 ;;; Code:
 
 (define-module (srfi srfi-6)
-  #:re-export (open-input-string open-output-string get-output-string))
+  #:export (open-input-string open-output-string)
+  #:re-export (get-output-string))
 
-;; Currently, guile provides these functions by default, so no action
-;; is needed, and this file is just a placeholder.
+(define (open-input-string s)
+  (with-fluids ((%default-port-encoding "UTF-8"))
+    ((@ (guile) open-input-string) s)))
+
+(define (open-output-string)
+  (with-fluids ((%default-port-encoding "UTF-8"))
+    ((@ (guile) open-output-string))))
 
 (cond-expand-provide (current-module) '(srfi-6))
 

It wouldn’t completely solve the problem.

> IMO, string ports should use UTF-8 as their initial port encoding, since
> we know that UTF-8 can represent any Guile string.  This will allow
> portable use of string ports.

The change was submitted and briefly discussed at .  I think the
rationale was mostly backward compatibility (in 1.8 people could mix
Latin-1 textual and binary I/O), consistency with how other ports
behave, and the ability to change the default encoding of string ports.

> I realize that this would change the existing behavior of programs that
> use binary I/O on string ports, but as things stand right now, portable
> SRFI-6 code is broken on Guile.
>
> What do you think?

In hindsight, UTF-8 does seem like a better default than the locale port
encoding (which is what %default-port-encoding is, by default), but it
does remain useful to specify a different encoding.

>>> What _is_ needed is a file coding declaration near the top of the source
>>> file, e.g. "coding: utf-8" (see "Character Encoding of Source Files" in
>>> the manual).
>>
>> Yes.  And you actually need both–i.e., the ‘coding’ cookie won’t
>> magically make string ports use that encoding.
>>
>>> I tried that and it still fails for me.
>>
>> What fails exactly?
>
> It fails ungracefully (goes into an infinite loop while trying to print
> the backtrace) without the %default-port-encoding setting.

Indeed, it’s stuck in a deadlock:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007ffff75e1204 in __lll_lock_wait ()
    from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#1  0x00007ffff75dc4d4 in _L_lock_999 ()
    from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#2  0x00007ffff75dc2ea in pthread_mutex_lock ()
    from /nix/store/vxycd107wjbhcj720hzkw2px7s7kr724-glibc-2.12.2/lib/libpthread.so.0
#3  0x00007ffff7b30499 in scm_dynwind_pthread_mutex_lock (mutex=0x7ffff7dd28c0)
    at threads.c:1962
#4  0x00007ffff7b2bb0e in scm_mkstrport (pos=0x2, str=0x4, modes=327680, caller=)
    at strports.c:287
#5  0x00007ffff7aac20b in display_backtrace_body (a=0x7fffffffc1a0)
    at backtrace.c:487
#6  0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f5d50,
    argv=0x6fa3b0, nargs=-1) at vm-i-system.c:895
#7  0x00007ffff7ac039e in scm_call_3 (proc=0x7f5d50, arg1=, arg2=, arg3=)
    at eval.c:500
#8  0x00007ffff7b32504 in scm_internal_catch (tag=, body=, body_data=,
    handler=, handler_data=) at throw.c:222
#9  0x00007ffff7aabbba in scm_display_backtrace_with_highlights (stack=,
    port=, first=, depth=, highlights=) at backtrace.c:558
#10 0x00007ffff7ab725e in print_exception_and_backtrace (error_port=0x6f6170,
    tag=0x66d4c0, args=0x8e6ea0) at continuations.c:490
#11 pre_unwind_handler (error_port=0x6f6170, tag=0x66d4c0, args=0x8e6ea0)
    at continuations.c:534
#12 0x00007ffff7b46c7b in vm_regular_engine (vm=0x6f61f0, program=0x7f3ce0,
    argv=0x6fa300, nargs=-1) at vm-i-system.c:895
#13 0x00007ffff7b4846e in scm_call_with_vm (vm=0x6f61f0, proc=0x7f3ce0, args=)
    at vm.c:878
#14 0x00007ffff7b296db in scm_to_stringn (str=0x8dba80, lenp=0x7fffffffc4e8,
    encoding=, handler=SCM_FAILED_CONVERSION_ERROR) at strings.c:2102
#15 0x00007ffff7b2bb73 in scm_mkstrport (pos=0x2, str=0x8dba80, modes=196608,
    caller=) at strports.c:312
--8<---------------cut here---------------end--------------->8---

This could be fixed by calling ‘scm_new_port_table_entry’ after having
prepared the backing buffer, but the problem is that ‘pt->encoding’ is
needed before.

Thoughts?

Ludo’.