From: David Kastrup <dak@gnu.org>
To: ludo@gnu.org (Ludovic Courtès)
Cc: 18520@debbugs.gnu.org
Subject: bug#18520: string ports should not have an encoding
Date: Tue, 23 Sep 2014 13:54:15 +0200
Message-ID: <87h9zyk0wo.fsf@fencepost.gnu.org>
In-Reply-To: <87bnq6oelf.fsf@gnu.org> ("Ludovic Courtès"'s message of "Tue, 23 Sep 2014 11:45:00 +0200")

ludo@gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak@gnu.org> skribis:
>
>>> Line/column info remains identical regardless of the encoding, so I tend
>>> to think it’s more robust to use that.
>>
>> Column info remains identical regardless of the encoding? Since when?
>
> The character on line L and column M is always there, regardless of
> whether the file is encoded in UTF-8, Latin-1, etc.
>
> Would that work for LilyPond?

Last time I looked, in the following line x was in column 3 in latin-1
encoding and in column 2 in utf-8 encoding:

  üx
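
To make that concrete, here is a minimal sketch that decodes the same
three bytes both ways and reports the 1-based column of x.  It assumes
a Guile recent enough to ship the (ice-9 iconv) module; column-of-x is
a name made up for this illustration.

```scheme
(use-modules (ice-9 iconv)        ; bytevector->string
             (rnrs bytevectors))  ; u8-list->bytevector

;; The UTF-8 encoding of "üx": ü is #xC3 #xBC, x is #x78.
(define bytes (u8-list->bytevector '(#xC3 #xBC #x78)))

;; 1-based column of #\x when BYTES is decoded with ENCODING.
;; (column-of-x is a hypothetical helper, not part of Guile.)
(define (column-of-x encoding)
  (+ 1 (string-index (bytevector->string bytes encoding) #\x)))

(format #t "UTF-8:      column ~a~%" (column-of-x "UTF-8"))       ; column 2
(format #t "ISO-8859-1: column ~a~%" (column-of-x "ISO-8859-1"))  ; column 3
```

The bytes on disk are identical in both runs; only the decoding
differs, and with it the column at which x is found.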

At any rate, we are missing the point of the issue. The issue is not
whether a workaround may be designed for every way in which GUILE tries
tripping up its users. The question is how GUILE may provide the least
amount of surprise to its users without sacrificing functionality.

GUILE's current implementation uses two character set conversions for
string ports. For input string ports, the first is a batch encoding of
the whole string when the string port is opened (using
%default-port-encoding in GUILE-2.0 and "UTF-8" in GUILE-2.2). This
encoding is set as the port's encoding (I hope). The second is a
decoding on every read operation, using whatever port encoding is
current at that time (the one set at opening, unless it has been
changed since).

Accompanying the opening of a string with an encoding operation (whether
using a forced encoding or %default-port-encoding) is expensive (not
least of all because everything needs to be decoded again), leads to
arbitrary semantics for port positioning, and is asymmetric since the
port encoding is only used for reading on an input string and for
writing on an output string.

Oh, and for writing on an input string using unread-string, of course.
No kidding. There is also a conversion in there.
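
The positioning point can be probed directly.  The following is only a
rough sketch: it assumes that ftell is supported on string ports in the
Guile at hand, probe-string-port is a made-up helper name, and no
output is claimed because the result depends on the Guile version and
on %default-port-encoding.

```scheme
;; Sketch only: report the encoding an input string port claims to
;; have, the first character read from it, and the position ftell
;; reports afterwards.  (probe-string-port is a hypothetical helper.)
(define (probe-string-port str)
  (call-with-input-string str
    (lambda (port)
      (let* ((enc (port-encoding port))
             (ch  (read-char port))
             (pos (ftell port)))
        (list enc ch pos)))))

;; #\xfc is ü; building the string from characters keeps this
;; example independent of the source file's own encoding.
(define test-string (string #\xfc #\x))

;; Once with whatever the default port encoding currently is ...
(write (probe-string-port test-string))
(newline)

;; ... and once with the default forced to Latin-1.
(with-fluids ((%default-port-encoding "ISO-8859-1"))
  (write (probe-string-port test-string))
  (newline))
```

If the reported position differs between the two runs although the same
single character was read from the same string, that is the arbitrary
positioning semantics referred to above.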

Would it be worth ditching this sort of unnecessary conversion? Well,
just look at:

  commit be7ecef05c1eea66f30360f658c610710c5cb22e
  Author: Andy Wingo <wingo@pobox.com>
  Date:   Sat Aug 31 10:44:07 2013 +0200

      unread-char: inline conversion from codepoint to bytes

      * libguile/ports.c (scm_ungetc_unlocked): Inline the conversion from
        codepoint to bytes for UTF-8 and latin-1 ports. Speeds up a
        numbers-reading test case by 100% (!).

That sounds like quite some gain just for _simplifying_ the
back-and-forth conversion, and we could just be forgoing it instead
(yes, peek-char as getc+ungetc presents a challenge in connection with
encoding switches: I think that declaring the first impression of
peek-char as sticky would be reasonable).

At any rate, the above commit looks like it would make a hash out of

  (with-input-from-string "Huh\""
    (lambda ()
      (unread-string "\"ä" (current-input-port))
      (read)))

because of a broken character range check (I cannot currently check with
a compilation of master since that takes about a day on my computer, but
I would be surprised if the above worked fine). So yes, the required
complexity to deal with GUILE's current behavior can introduce problems.

--
David Kastrup