From: David Kastrup
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#18520: string ports should not have an encoding
Date: Tue, 23 Sep 2014 13:54:15 +0200
Message-ID: <87h9zyk0wo.fsf@fencepost.gnu.org>
In-Reply-To: <87bnq6oelf.fsf@gnu.org> (Ludovic Courtès's message of "Tue, 23 Sep 2014 11:45:00 +0200")
To: ludo@gnu.org (Ludovic Courtès)
Cc: 18520@debbugs.gnu.org

ludo@gnu.org (Ludovic Courtès) writes:

> David Kastrup skribis:
>
>>> Line/column info remains identical regardless of the encoding, so I tend
>>> to think it’s more robust to use that.
>>
>> Column info remains identical regardless of the encoding?  Since when?
>
> The character on line L and column M is always there, regardless of
> whether the file is encoded in UTF-8, Latin-1, etc.
>
> Would that work for LilyPond?

Last time I looked, in the following line x was in column 3 in latin-1
encoding and in column 2 in utf-8 encoding:

üx

At any rate, we are missing the point of the issue.  The issue is not
whether a workaround may be designed for every way in which GUILE tries
tripping up its users.  The question is how GUILE may provide the least
amount of surprise to its users without sacrificing functionality.

GUILE's current implementation uses two character set conversions for
string ports.  For input string ports, the first is a batch encoding
performed when the string port is opened (using %default-port-encoding
in GUILE-2.0 and "UTF-8" in GUILE-2.2).  This encoding is set as the
port's encoding (I hope), and then, unless changed, every read operation
decodes using whatever encoding is current on the port at that time.

Accompanying the opening of a string with an encoding operation (whether
using a forced encoding or %default-port-encoding) is expensive (not
least of all because everything needs to be decoded again), leads to
arbitrary semantics for port positioning, and is asymmetric since the
port encoding is only used for reading on an input string and for
writing on an output string.

Oh, and for writing on an input string using unread-string, of course.
No kidding.  There is also a conversion in there.
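For anyone who wants to reproduce the column discrepancy without a GUILE build at hand, here is a minimal sketch in Python.  It is only a model of the behaviour described above, not GUILE code: the names `raw` and `column_of_x` are made up for illustration, and I am assuming the string is encoded once as UTF-8 when the "port" is opened and then decoded with whatever encoding is current at read time.

```python
# Illustrative model only: this is NOT GUILE code and not GUILE's API.
# Assumption: the string is batch-encoded as UTF-8 when the port is opened,
# and every read decodes the bytes with the port's current encoding.

text = "\u00fcx"  # the example line: u-umlaut followed by x

# Step 1: the batch conversion done once, at "port open" time.
raw = text.encode("utf-8")  # three bytes: 0xC3 0xBC 0x78


def column_of_x(port_encoding):
    """Return the 1-based column of 'x' when raw is read in port_encoding."""
    decoded = raw.decode(port_encoding)
    return decoded.index("x") + 1


# Step 2: the same bytes, read under two different port encodings.
print("utf-8  :", column_of_x("utf-8"))
print("latin-1:", column_of_x("latin-1"))
```

Under these assumptions the same byte sequence puts x in a different column depending on the encoding used for reading, which is why a column number is not a position that survives an encoding switch.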
Would it be worth ditching this sort of unnecessary conversion?  Well,
just look at:

commit be7ecef05c1eea66f30360f658c610710c5cb22e
Author: Andy Wingo
Date:   Sat Aug 31 10:44:07 2013 +0200

    unread-char: inline conversion from codepoint to bytes

    * libguile/ports.c (scm_ungetc_unlocked): Inline the conversion from
      codepoint to bytes for UTF-8 and latin-1 ports.  Speeds up a
      numbers-reading test case by 100% (!).

That sounds like quite some gain just for _simplifying_ the
back-and-forth conversion, and we could just be forgoing it instead
(yes, peek-char as getc+ungetc presents a challenge in connection with
encoding switches: I think that declaring the first impression of
peek-char as sticky would be reasonable).

At any rate, the above commit looks like it would make a hash out of

    (with-input-from-string "Huh\""
      (lambda ()
        (unread-string "\"ä" (current-input-port))
        (read)))

because of a broken character range check (I cannot currently check with
a compilation of master since that takes about a day on my computer, but
I would be surprised if the above worked fine).

So yes, the complexity required to deal with GUILE's current behavior
can introduce problems.

-- 
David Kastrup