From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Kastrup
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#18520: string ports should not have an encoding
Date: Tue, 23 Sep 2014 15:02:54 +0200
Message-ID: <87d2amjxq9.fsf@fencepost.gnu.org>
References: <87iokgmttc.fsf@fencepost.gnu.org> <87mw9rq20u.fsf@gnu.org>
 <87sijjlqx0.fsf@fencepost.gnu.org> <87sijjmvlr.fsf@gnu.org>
 <87bnq7lgg9.fsf@fencepost.gnu.org> <87d2anl79a.fsf@gnu.org>
 <87tx3zjod1.fsf@fencepost.gnu.org> <87egv2pwv5.fsf@gnu.org>
 <87lhpak8ye.fsf@fencepost.gnu.org> <87bnq6oelf.fsf@gnu.org>
 <87h9zyk0wo.fsf@fencepost.gnu.org> <87tx3yjzzw.fsf@gnu.org>
In-Reply-To: <87tx3yjzzw.fsf@gnu.org> ("Ludovic Courtès"'s message of
 "Tue, 23 Sep 2014 14:13:55 +0200")
To: ludo@gnu.org (Ludovic Courtès)
Cc: 18520@debbugs.gnu.org
List-Id: "Bug reports for GUILE, GNU's Ubiquitous Extension Language"

ludo@gnu.org (Ludovic Courtès) writes:

> David Kastrup skribis:
>
>> ludo@gnu.org (Ludovic Courtès) writes:
>>
>>> David Kastrup skribis:
>>>
>>>>> Line/column info remains identical regardless of the encoding, so I tend
>>>>> to think it’s more robust to use that.
>>>>
>>>> Column info remains identical regardless of the encoding? Since when?
>>>
>>> The character on line L and column M is always there, regardless of
>>> whether the file is encoded in UTF-8, Latin-1, etc.
>>>
>>> Would that work for LilyPond?
>>
>> Last time I looked, in the following line x was in column 3 in latin-1
>> encoding and in column 2 in utf-8 encoding:
>>
>> üx
>
> I’m not sure what you mean. This line contains two characters: ‘u’ with
> umlaut followed by ‘x’. ‘ü’ is in the first column, and ‘x’ in the
> second column.

It contains three bytes: 0xc3, 0xbc, 0x78. In utf-8, this is üx, in
Latin-1 it is Ã¼x. This whole issue is about string ports being
represented in terms of bytes, _not_ characters.

> Is there a simple way to reproduce the issue with LilyPond?

This issue is at best marginally about LilyPond, in that the semantics
chosen for GUILE-2.0 (and switched again in GUILE-2.2) are both
surprising and a source of headaches. They result in code like

    // we do our own utf8 encoding and verification in the parser, so we
    // use the no-conversion equivalent of latin1
    SCM str = scm_from_latin1_string (c_str ());
    scm_dynwind_begin ((scm_t_dynwind_flags)0);
    // Why doesn't scm_set_port_encoding_x work here?
    scm_dynwind_fluid (ly_lily_module_constant ("%default-port-encoding"), SCM_BOOL_F);
    str_port_ = scm_open_input_string (str);
    scm_dynwind_end ();
    scm_set_port_filename_x (str_port_, ly_string2scm (name_));
  }

which will, incidentally, stop working in GUILE-2.2, at which time
another workaround will be found.

GUILE is an extension language. The stance that any kind of dealing
with characters/strings must be under control of GUILE and its
character model is simply inappropriate. It is not the job of GUILE to
dictate how an application has to organize matters internally. For that
reason, its behavior needs to be straightforward and unsurprising. That
includes sane boundaries between strings as character vectors, byte
vectors, and encoding and decoding operations.

Going through a byte-based encoding when copying a character-based
string to a string, even when going through a string port, does not make
sense.

As a sign that this does not make sense, the effects of
%default-port-encoding and set-port-encoding! on input and output string
ports are asymmetric. More so in GUILE-2.2 than in GUILE-2.0, but
already in GUILE-2.0.

That inconsistency (and its effects on overall performance) is what this
issue is about. That I am tripping all over GUILE in the course of
working with LilyPond is at best incidental to this issue. I could
equally well be tripping over it when working with TeXmacs.

I am not going to further reply to this issue since this is _not_,
I repeat _not_, some complaint that I am too stupid to understand what
GUILE is doing here. I understand it perfectly well, and I am perfectly
able to hack around GUILE's deficiencies and inconsistencies.

One consequence of design problems like this is that the chosen
semantics under such a fundamental design problem are arbitrary and thus
more likely to change to different semantics in future versions. That
means a higher likelihood of future maintenance.
If I am going to have to redo this for GUILE-2.2 anyway, I prefer
doing it in a sane manner that will stick around for good. I don't see
that here. That does not mean that I am too stupid to work with the
GUILE 2.0 behavior or the GUILE 2.2 behavior or the GUILE 1.8 behavior
(in fact, the first port to GUILE 2 will set LC_CTYPE to C and just
stick with GUILE 1.8 behavior, but that's not a long-term perspective
since working with characters rather than bytes as string constituents
_is_ nicer for the user).

-- 
David Kastrup
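
The column disagreement described in this message (the bytes 0xc3, 0xbc,
0x78 read as Latin-1 versus as utf-8) can be sketched in a few lines of
plain C++. This is only an illustration under stated assumptions: it is
not code from LilyPond or GUILE, the helper names column_latin1 and
column_utf8 are hypothetical, and the utf-8 variant assumes well-formed
input and does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: 1-based column of the byte at `byte_index` when
// every byte counts as one character (the Latin-1 view of the data).
std::size_t column_latin1 (const std::string & /* bytes */,
                           std::size_t byte_index)
{
  return byte_index + 1;
}

// Hypothetical helper: 1-based column of the character that starts at
// `byte_index` when the bytes are read as utf-8.  Continuation bytes
// (bit pattern 10xxxxxx) do not start a new character, so they are not
// counted.  Assumes well-formed utf-8; nothing is validated here.
std::size_t column_utf8 (const std::string &bytes, std::size_t byte_index)
{
  std::size_t column = 0;
  for (std::size_t i = 0; i <= byte_index && i < bytes.size (); ++i)
    if ((static_cast<unsigned char> (bytes[i]) & 0xC0) != 0x80)
      ++column;
  return column;
}
```

For the three bytes above, the 'x' sits at byte index 2, so the Latin-1
view puts it in column 3 while the utf-8 view puts it in column 2. The
same byte offset therefore maps to different columns depending on which
decoding a port applies.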