From: David Kastrup
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#18520: string ports should not have an encoding
Date: Mon, 22 Sep 2014 15:09:25 +0200
Message-ID: <87wq8vls3e.fsf@fencepost.gnu.org>
References: <87iokgmttc.fsf@fencepost.gnu.org> <87k34vrhu2.fsf@gnu.org>
In-Reply-To: <87k34vrhu2.fsf@gnu.org> ("Ludovic Courtès"'s message of "Mon, 22 Sep 2014 13:54:29 +0200")
To: ludo@gnu.org (Ludovic Courtès)
Cc: 18520@debbugs.gnu.org
List-Id: "Bug reports for GUILE, GNU's Ubiquitous Extension Language"

ludo@gnu.org (Ludovic Courtès) writes:

> This has been addressed in two ways:

No, it hasn't.

>   1. In 2.0, (srfi srfi-6) uses Unicode-capable string ports (commit
>      ecb48dc.)

This issue report is not about adding more optional functionality on
top.  It is about _removing_ unwarranted redirection and complication
from existing core functionality.  The artifacts of making
with-input-from-string and with-output-to-string go through an
additional character->bytevector->character encoding/recoding layer
are not invisible.

>   2. In 2.2, string ports are always Unicode-capable, and
>      ‘%default-port-encoding’ is ignored (commit 6dce942.)

String ports should not be "Unicode capable" but transparent.
Characters in, characters out.  ftell/fseek should be based on
character position in strings rather than offsets in a magically
created bytestream of some particular encoding.

> So for 2.0, the workaround is to either use (srfi srfi-6), or force
> ‘%default-port-encoding’ to "UTF-8".

Which is what the latter _only_ does.  It still interprets
set-port-encoding! with respect to a byte stream meaning, and it still
calculates positions according to a byte stream meaning not related to
string positions:

    (use-modules (srfi srfi-6))
    (define s (list->string (map integer->char '(20 200 2000 20000))))
    (let ((port (open-input-string s)))
      (let loop ((ch (read-char port)))
        (if (not (eof-object? ch))
            (begin
              (format #t "~d, pos=~d\n" (char->integer ch) (ftell port))
              (loop (read-char port))))))

    20, pos=1
    200, pos=3
    2000, pos=5
    20000, pos=8

Tying string ports to an artificial bytevector presentation in a manner
bleeding through like that means that it is not possible to synchronize
string positions and stream positions when parts of the source string
are _not_ processed from within the stream.

Which is precisely the problem I am currently dealing with while
porting LilyPond: it has its own lexer working on a (UTF-8 encoded)
byte stream which is at the same time available as a string port.

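
For illustration: the positions printed above (1, 3, 5, 8) are
consistent with cumulative UTF-8 byte lengths of the four characters
rather than with string indices (1, 2, 3, 4).  The following is a
minimal sketch of that arithmetic, assuming Guile's (rnrs bytevectors)
module is available.  The helper names utf8-char-length, utf8-offsets
and byte-offset->char-index are hypothetical and not part of Guile;
this is not how Guile computes port positions internally, only a way
to reproduce and invert the numbers.

```scheme
;; Illustrative sketch only.  utf8-char-length, utf8-offsets and
;; byte-offset->char-index are hypothetical helpers, not Guile APIs.
(use-modules (rnrs bytevectors))

;; Number of bytes CH occupies when encoded as UTF-8.
(define (utf8-char-length ch)
  (bytevector-length (string->utf8 (string ch))))

;; Byte offsets just past each character of S in a UTF-8 encoding,
;; i.e. what a byte-based ftell would report after each read-char.
(define (utf8-offsets s)
  (let loop ((chars (string->list s)) (pos 0) (acc '()))
    (if (null? chars)
        (reverse acc)
        (let ((next (+ pos (utf8-char-length (car chars)))))
          (loop (cdr chars) next (cons next acc))))))

;; Map a UTF-8 byte offset back to a character index in S.  Returns
;; #f when the offset does not fall on a character boundary of S.
(define (byte-offset->char-index s offset)
  (let loop ((i 0) (pos 0))
    (cond ((= pos offset) i)
          ((or (> pos offset) (>= i (string-length s))) #f)
          (else (loop (+ i 1)
                      (+ pos (utf8-char-length (string-ref s i))))))))

(define s (list->string (map integer->char '(20 200 2000 20000))))
(utf8-offsets s)               ; => (1 3 5 8)
(byte-offset->char-index s 5)  ; => 3
```

Such a translation costs a walk over the string for every lookup and
only holds if the port's encoding really is UTF-8, so it is at best a
stopgap for a caller that must correlate ftell results with string
indices.
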
Whenever embedded Scheme is interpreted, the string port is moved to
the proper position, GUILE reads an expression and is told what to do
with it, the string port position is picked off and the LilyPond lexer
is moved to the respective position to continue.

If you take a look at , ftell on a string port is here used for
correlating the positions of parsed subexpressions with the original
data.  Reencoding strings in utf-8 is not going to make this work with
string indexing since ftell does not bear a useful relation to string
positions.

The behavior of ftell and port-encoding is perfectly fine for reading
from bytevectors or files, and reading from bytevectors or files also
does not incur an encode-when-open action governed by
%default-port-encoding in GUILE-2.0 and by hardwired UTF-8 in
GUILE-2.2.

But strings are already decoded characters.  Reencoding makes no sense
and detaches things like ftell and fseek from the actual input into the
port.

-- 
David Kastrup