From: Mark H Weaver <mhw@netris.org>
To: David Kastrup <dak@gnu.org>
Cc: 18520@debbugs.gnu.org
Subject: bug#18520: string ports should not have an encoding
Date: Wed, 24 Sep 2014 01:30:59 -0400
Message-ID: <87oau5h9f0.fsf@yeeloong.lan>
In-Reply-To: <87iokgmttc.fsf@fencepost.gnu.org> (David Kastrup's message of "Mon, 22 Sep 2014 01:34:39 +0200")
David Kastrup <dak@gnu.org> writes:
> In Guile 2.0, at the time a string port is opened, the value of the
> fluid %default-port-encoding is used for deciding how to encode the
> string into a byte stream, [...]
I agree that this was a mistake. The issue is fixed on the master
branch.
> Ports fundamentally deliver characters, and so reading and writing from
> a string source/sink should not involve _any_ coding system.
David, you know as well as I that internally, there is always a coding
system. Strings have a coding system too, even if it's UCS-4. Emacs
uses something based on UTF-8, and I'd like Guile to do something
similar in the future.
I guess you don't like the fact that it is possible to expose the
internal representation via 'set-port-encoding!', 'ftell' or 'seek'.
I don't see this as a problem, and arguably it's a benefit.
First I'll address the non-standard 'set-port-encoding!'. As you say,
it doesn't even make sense on string ports, and arguably should be an
error. So why do you care if some internal details leak out when you do
this nonsensical thing? Admittedly, we're missing an opportunity to
report a possible bug to the user, but that's the only problem I see
here.
Regarding 'ftell' and 'seek', it's not entirely clear to me what the
best representation of those positions is. In some situations, I guess
it would be convenient for them to count Unicode code points or string
indices. In other situations, I could imagine it being more convenient
for them to count grapheme clusters or UTF-8 bytes.
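
To make the difference between these candidate position units concrete,
here is a minimal sketch. It is written in Python rather than Guile, and
the helper name position_views and the sample string are assumptions
made up for this illustration. It does not show what Guile's 'ftell'
returns. Grapheme clusters are left out because counting them needs
segmentation rules beyond the standard library.

```python
# Illustration only: how the "position" of one character differs
# depending on the unit used to count. position_views is a hypothetical
# helper, not part of any Guile or Python port API.

def position_views(text: str, index: int) -> dict:
    """Describe the position of text[index] in three different units."""
    prefix = text[:index]
    return {
        "code_points": len(prefix),                    # string index
        "utf8_bytes": len(prefix.encode("utf-8")),     # UTF-8 byte offset
        "ucs4_bytes": len(prefix.encode("utf-32-le")), # 4 bytes per code point
    }

# Precomposed i-diaeresis (U+00EF) and e-acute (U+00E9), written as
# escapes so the example does not depend on source-file normalization.
sample = "na\u00efve caf\u00e9"

views = position_views(sample, sample.index("c"))
print(views)
```

The same character sits at a different offset in each unit, so a seek
position is only meaningful relative to the representation that
produced it.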
R6RS, the only Scheme standard that supports getting or setting file
positions, gives us complete freedom to choose our representation of
positions on textual ports. The R6RS is explicit that they don't even
have to be integers, and if they are, they don't have to correspond to
bytes or characters.
For better or for worse, Guile's ports are fundamentally based on bytes,
and allow mixed binary and textual operations on all ports. Sometimes
this is very helpful, for example when implementing HTTP. I can think
of one other case where it's very helpful:
I don't know how deeply you've looked at UTF-8, but it has some unusual
properties that allow many (most?) string algorithms to be most
naturally (and efficiently) implemented by operating on bytes rather
than code points. Much of the time, you don't even have to be aware of
the code point boundaries, which is a great savings. Efficient lookup
tables based on bytes are also much cheaper than ones based on code
points, etc.
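
As a rough illustration of that property, the sketch below searches for
a substring by comparing raw UTF-8 bytes and never decodes a code
point. It is in Python rather than Guile, and find_utf8 and
byte_offset_to_code_point_index are hypothetical helpers invented for
this example. It relies on the fact that in UTF-8, lead bytes and
continuation bytes (those of the form 10xxxxxx) occupy disjoint ranges,
so a complete, valid UTF-8 needle can only match a valid UTF-8 haystack
at a code point boundary.

```python
# Illustration only: substring search over UTF-8 bytes. Both helpers
# are hypothetical names for this sketch, not an existing API.

def find_utf8(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of the first match, or -1 if none.

    A naive byte-by-byte scan; it never looks at code point boundaries.
    """
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        if haystack[i:i + m] == needle:
            return i
    return -1

def byte_offset_to_code_point_index(data: bytes, offset: int) -> int:
    """Count code points before offset by skipping continuation bytes."""
    return sum(1 for b in data[:offset] if (b & 0xC0) != 0x80)

text = "sm\u00f6rg\u00e5sbord"   # o-diaeresis and a-ring are 2 bytes each
needle = "bord"

byte_pos = find_utf8(text.encode("utf-8"), needle.encode("utf-8"))
char_pos = byte_offset_to_code_point_index(text.encode("utf-8"), byte_pos)

print(byte_pos, char_pos, text.find(needle))
```

The byte-level search lands on the same match as the code-point-level
search. The offset is converted to a code point index only at the end,
and only if the caller actually needs one.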
In fact, I intend to propose that in a future version of Guile, strings
not only be based on UTF-8 internally, but that this fact be exposed in
the API, allowing users to implement UTF-8 string operations that
operate on bytes rather than code points. I'd also like lightweight,
fast string ports that allow access to these bytes when desired.
This leads me to believe that it's a feature, not a bug, that string
ports use UTF-8 internally, and that it's possible (via non-standard
extensions) to get access to the underlying bytes.
Mark