From: David Kastrup
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#18520: string ports should not have an encoding
Date: Mon, 22 Sep 2014 15:34:51 +0200
Message-ID: <87sijjlqx0.fsf@fencepost.gnu.org>
References: <87iokgmttc.fsf@fencepost.gnu.org> <87mw9rq20u.fsf@gnu.org>
In-Reply-To: <87mw9rq20u.fsf@gnu.org> (Ludovic Courtès's message of "Mon, 22 Sep 2014 14:21:21 +0200")
To: ludo@gnu.org (Ludovic Courtès)
Cc: 18520@debbugs.gnu.org
List-Id: "Bug reports for GUILE, GNU's Ubiquitous Extension Language"

ludo@gnu.org (Ludovic Courtès) writes:

> David Kastrup skribis:
>
>> Guile-2.2 does not consult %default-port-encoding but uses UTF-8
>> consistently (I guess, overriding set-port-encoding! will again change
>> that).
>>
>> That still is not satisfactory.  For example, using ftell on the input
>> port will not report the string index of the string connected to the
>> string port but rather a byte index into a UTF-8 encoded version of the
>> string.  This is a number that has nothing to do with the original
>> string and cannot be used for correlating string and port.
>
> Right.
>
>> Ports fundamentally deliver characters, and so reading and writing from
>> a string source/sink should not involve _any_ coding system.
>>
>> Files fundamentally deliver bytes, a conversion is required.  The same
>> would be the case when opening a port on a _bytevector_.  Here an
>> encoding would make equally make sense, and ftell/fseek offsets would
>> naturally be in bytes.  But a port on a string delivers and consumes
>> characters.  Any conversion, even a fixed UTF-8 conversion, will destroy
>> the predictable nature of with-output-to-string and
>> with-input-from-string and the respective uses of string ports.
>
> Guile ports can be mixed textual/binary (unlike R6 ports, which are
> either textual or binary.)  Thus, they fundamentally deliver bytes,
> possibly with a textual conversion.

I think that is a mischaracterization.  GUILE ports at the current point
of time can _only_ be binary, to the degree that strings/texts first
have to be encoded into a binary stream before they can be passed
through a port.

Which is what this issue is about.

> Although the manual isn’t clear about it, ‘ftell’, when available,
> returns a position in bytes.

Which is not helpful if the input does not consist of bytes.

> The situation for string ports here is comparable to that of other
> ports used for textual I/O.

No.
The situation for file ports is that ftell refers to identifiable and
reproducible byte offsets of the input, the input being a file
consisting of bytes and indexed using bytes.

The situation for string ports is that ftell refers to unidentifiable
and incidental byte offsets of a temporary, inaccessible ad-hoc encoding
of the input, the input being a string consisting of characters and
indexed using characters.

> Do you have a situation where you were relying on 1.8’s behavior in
> that regard?  Could we see whether this can be solved differently?

I'm currently migrating LilyPond over to GUILE 2.0.  LilyPond has its
own UTF-8 verification, error flagging, processing and indexing.  I have
more than enough crashes and obscure errors to contend with as it
stands, so the first port will use LC_CTYPE=C (LC_CTYPE=ISO-8859-1 does
not work since then GUILE/iconv considers itself entitled to complain
about improper Latin-1) and will keep GUILE 2.0 from thinking about
UTF-8 at all.

Moving string processing to UTF-8 will be a gradual process, and a
separate project involving programmer choices about what to represent
where and how: much of LilyPond is written in C++ and so UTF-8 encoded
strings (rather than GUILE's strings, which are stored as either Latin-1
or UTF-32) are ubiquitous, with most of LilyPond's core literals fitting
in the common ASCII subset.

Whenever GUILE chooses to take decisions away from the user and
programmer, problems are likely to result, and workarounds will abound.

For efficiency reasons, it is not realistic to demand that any string
data passed between GUILE and LilyPond be encoded and reencoded at
every call gate: there are a great many of them.

-- 
David Kastrup
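
To illustrate the ftell point made above: the following is a minimal
sketch in plain Python, not Guile, and both helper names are
hypothetical.  It only shows how a byte offset into a UTF-8 encoding
diverges from the character index of the same string as soon as
non-ASCII characters are involved, and why mapping one back to the other
requires re-encoding the string.

```python
def utf8_byte_offset(text, char_index):
    """Return the number of bytes that the first `char_index` characters
    of `text` occupy once the string is encoded as UTF-8."""
    return len(text[:char_index].encode("utf-8"))


def char_index_from_byte_offset(text, byte_offset):
    """Map a UTF-8 byte offset back to a character index by re-encoding.

    Raises UnicodeDecodeError if the offset falls inside a multi-byte
    character, since such an offset matches no position in the string.
    """
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


sample = "Grüße, Welt"
# "Grüß" is 4 characters, but "ü" and "ß" take 2 bytes each in UTF-8.
print(utf8_byte_offset(sample, 4))             # 6
print(char_index_from_byte_offset(sample, 6))  # 4
```

For pure ASCII input the two numbers coincide, which is why the mismatch
tends to surface only once non-ASCII text passes through.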