From: David Kastrup <dak@gnu.org>
To: ludo@gnu.org (Ludovic Courtès)
Cc: 18520@debbugs.gnu.org
Subject: bug#18520: string ports should not have an encoding
Date: Mon, 22 Sep 2014 15:34:51 +0200
Message-ID: <87sijjlqx0.fsf@fencepost.gnu.org>
In-Reply-To: <87mw9rq20u.fsf@gnu.org> ("Ludovic Courtès"'s message of "Mon, 22 Sep 2014 14:21:21 +0200")

ludo@gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak@gnu.org> skribis:
>
>> Guile-2.2 does not consult %default-port-encoding but uses UTF-8
>> consistently (I guess, overriding set-port-encoding! will again change
>> that).
>>
>> That still is not satisfactory. For example, using ftell on the input
>> port will not report the string index of the string connected to the
>> string port but rather a byte index into a UTF-8 encoded version of the
>> string. This is a number that has nothing to do with the original
>> string and cannot be used for correlating string and port.
>
> Right.
>
>> Ports fundamentally deliver characters, and so reading and writing from
>> a string source/sink should not involve _any_ coding system.
>>
>> Files fundamentally deliver bytes, a conversion is required. The same
>> would be the case when opening a port on a _bytevector_. Here an
>> encoding would equally make sense, and ftell/fseek offsets would
>> naturally be in bytes. But a port on a string delivers and consumes
>> characters. Any conversion, even a fixed UTF-8 conversion, will destroy
>> the predictable nature of with-output-to-string and
>> with-input-from-string and the respective uses of string ports.
>
> Guile ports can be mixed textual/binary (unlike R6 ports, which are
> either textual or binary.) Thus, they fundamentally deliver bytes,
> possibly with a textual conversion.

I think that is a mischaracterization.  GUILE ports at the current point
in time can _only_ be binary, to the degree that strings/texts first
have to be encoded into a binary stream before they can be passed
through a port.  Which is what this issue is about.

> Although the manual isn’t clear about it, ‘ftell’, when available,
> returns a position in bytes.

Which is not helpful if the input does not consist of bytes.

> The situation for string ports here is comparable to that of other
> ports used for textual I/O.

No.  The situation for file ports is that ftell refers to identifiable
and reproducible byte offsets of the input, the input being a file
consisting of bytes and indexed using bytes.

The situation for string ports is that ftell refers to unidentifiable
and incidental byte offsets of a temporary inaccessible ad-hoc encoding
of the input, the input being a string consisting of characters and
indexed using characters.
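
To make the mismatch concrete, here is a small sketch in Python (not
Guile, and not taken from this thread) that contrasts character indices
into a string with byte offsets into a UTF-8 encoding of the same
string.  The helper name char_index_to_utf8_offset is made up for
illustration; it only assumes that a position is reported as a byte
offset into a UTF-8 encoding, as described above.

```python
# Illustrative only: character indices versus UTF-8 byte offsets.
# The function name is hypothetical and not part of any Guile API.

def char_index_to_utf8_offset(text: str, index: int) -> int:
    """Return the byte offset at which character `index` starts
    in the UTF-8 encoding of `text`."""
    return len(text[:index].encode("utf-8"))

text = "na\u00efve caf\u00e9"  # contains two non-ASCII characters

# Print character index, UTF-8 byte offset, and the character itself.
for i, ch in enumerate(text):
    print(i, char_index_to_utf8_offset(text, i), repr(ch))
```

The two columns agree only up to the first non-ASCII character.  After
that, a byte offset can no longer be used directly as an index into the
original string.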
> Do you have a situation where you were relying on 1.8’s behavior in
> that regard? Could we see whether this can be solved differently?

I'm currently migrating LilyPond over to GUILE 2.0.  LilyPond has its
own UTF-8 verification, error flagging, processing and indexing.  I have
more than enough crashes and obscure errors to contend with as it
stands, so the first port will use LC_CTYPE=C (LC_CTYPE=ISO-8859-1 does
not work since then GUILE/iconv considers itself entitled to complain
about improper Latin-1) and will keep GUILE 2.0 from thinking about
UTF-8 at all.  Moving string processing to UTF-8 will be a gradual
process, and a separate project involving programmer choices about what
to represent where and how: much of LilyPond is written in C++ and so
UTF-8 encoded strings (rather than GUILE's strings consisting of either
UCS-8 or UCS-32) are ubiquitous, with most of LilyPond's core literals
fitting in the common ASCII subset.

Whenever GUILE chooses to take decisions away from the user and
programmer, problems are likely to result, and workarounds will abound.
For efficiency reasons, it is not realistic to demand that any string
data passed between GUILE and LilyPond be encoded and reencoded at
every call gate: there are a great many of them.

--
David Kastrup