bug#18520: string ports should not have an encoding

unofficial mirror of bug-guile@gnu.org 

* bug#18520: string ports should not have an encoding
@ 2014-09-21 23:34 David Kastrup
  2014-09-22 11:54 ` Ludovic Courtès
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: David Kastrup @ 2014-09-21 23:34 UTC
  To: 18520

In Guile 2.0, at the time a string port is opened, the value of the
fluid %default-port-encoding is used for deciding how to encode the
string into a byte stream, and set-port-encoding! may then be used for
deciding how to decode that byte stream back into characters.

This does not make sense as ports deliver characters, and strings
contain characters.  There is no point in going through bytes.

Guile-2.2 does not consult %default-port-encoding but uses UTF-8
consistently (I guess, overriding set-port-encoding! will again change
that).

That still is not satisfactory.  For example, using ftell on the input
port will not report the string index of the string connected to the
string port but rather a byte index into a UTF-8 encoded version of the
string.  This is a number that has nothing to do with the original
string and cannot be used for correlating string and port.

Ports fundamentally deliver characters, and so reading and writing from
a string source/sink should not involve _any_ coding system.

Files fundamentally deliver bytes, so a conversion is required.  The same
would be the case when opening a port on a _bytevector_.  Here an
encoding would equally make sense, and ftell/fseek offsets would
naturally be in bytes.  But a port on a string delivers and consumes
characters.  Any conversion, even a fixed UTF-8 conversion, will destroy
the predictable nature of with-output-to-string and
with-input-from-string and the respective uses of string ports.

In code like the following, the results should not depend on either the
fluid-set! or the set-port-encoding!, and the ftell should always output
successive integers independent from either fluid-set! or
set-port-encoding!.  set-port-encoding! should probably flag an error,
like an fseek on an unseekable device.

(fluid-set! %default-port-encoding "UTF-8")
(define s (list->string (map integer->char '(20 200 2000 20000))))
(with-input-from-string s
  (lambda ()
    (set-port-encoding! (current-input-port) "ISO-8859-1")
    (let loop ((ch (read-char (current-input-port))))
      (if (not (eof-object? ch))
	  (begin
	    (format #t "~d, pos=~d\n" (char->integer ch) (ftell (current-input-port)))
	    (loop (read-char (current-input-port))))))))

Again, things are quite different from bytevectors which could be
accepted instead of a string for opening ports with the string-port
commands, or could have their own port open/close commands, and the
respective ports then definitely would want to obey set-port-encoding!
(defaulting to %default-port-encoding) for _decoding_ the bytevector.

I don't know what r7rs might think here.  But for me, associating
encodings for connecting strings to ports does not make sense.  The
relation is one of characters to characters.

-- 
David Kastrup

* bug#18520: string ports should not have an encoding
  2014-09-21 23:34 bug#18520: string ports should not have an encoding David Kastrup
@ 2014-09-22 11:54 ` Ludovic Courtès
  2014-09-22 13:09   ` David Kastrup
  2014-09-22 12:21 ` Ludovic Courtès
  2014-09-24  5:30 ` Mark H Weaver
  2 siblings, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2014-09-22 11:54 UTC
  To: David Kastrup; +Cc: 18520

This has been addressed in two ways:

  1. In 2.0, (srfi srfi-6) uses Unicode-capable string ports (commit
     ecb48dc.)

  2. In 2.2, string ports are always Unicode-capable, and
     ‘%default-port-encoding’ is ignored (commit 6dce942.)

So for 2.0, the workaround is to either use (srfi srfi-6), or force
‘%default-port-encoding’ to "UTF-8".

HTH,
Ludo’.

* bug#18520: string ports should not have an encoding
  2014-09-22 11:54 ` Ludovic Courtès
@ 2014-09-22 13:09   ` David Kastrup
  0 siblings, 0 replies; 20+ messages in thread
From: David Kastrup @ 2014-09-22 13:09 UTC
  To: Ludovic Courtès; +Cc: 18520

ludo@gnu.org (Ludovic Courtès) writes:

> This has been addressed in two ways:

No, it hasn't.

>   1. In 2.0, (srfi srfi-6) uses Unicode-capable string ports (commit
>      ecb48dc.)

This issue report is not about adding more optional functionality on
top.  It is about _removing_ unwarranted redirection and complication
from existing core functionality.

The artifacts of making with-input-from-string and with-output-to-string
go through an additional character->bytevector->character
encoding/recoding layer are not invisible.

>   2. In 2.2, string ports are always Unicode-capable, and
>      ‘%default-port-encoding’ is ignored (commit 6dce942.)

String ports should not be "Unicode capable" but transparent.
Characters in, characters out.  ftell/fseek should be based on character
position in strings rather than offsets in a magically created
bytestream of some particular encoding.

> So for 2.0, the workaround is to either use (srfi srfi-6), or force
> ‘%default-port-encoding’ to "UTF-8".

Which is what the latter _only_ does.  It still interprets
set-port-encoding! with respect to a byte stream meaning, and it still
calculates positions according to a byte stream meaning not related to
string positions:

(use-modules (srfi srfi-6))
(define s (list->string (map integer->char '(20 200 2000 20000))))
(let ((port (open-input-string s)))
  (let loop ((ch (read-char port)))
    (if (not (eof-object? ch))
	(begin
	  (format #t "~d, pos=~d\n" (char->integer ch) (ftell port))
	  (loop (read-char port))))))

20, pos=1
200, pos=3
2000, pos=5
20000, pos=8

Tying string ports to an artificial bytevector presentation in a manner
bleeding through like that means that it is not possible to synchronize
string positions and stream positions when parts of the source string
are _not_ processed from within the stream.
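
The positions 1, 3, 5, 8 printed above are consistent with cumulative UTF-8 byte lengths rather than string indices.  A minimal sketch that reproduces them without any port, assuming the (rnrs bytevectors) module shipped with Guile 2.x is available (the helper name utf8-prefix-length is made up for this illustration):

```scheme
(use-modules (rnrs bytevectors))

;; Same test string as above: code points 20, 200, 2000 and 20000.
(define s (list->string (map integer->char '(20 200 2000 20000))))

;; Hypothetical helper: number of bytes the first N characters of STR
;; occupy once encoded as UTF-8.
(define (utf8-prefix-length str n)
  (bytevector-length (string->utf8 (substring str 0 n))))

;; Byte offsets after 1, 2, 3 and 4 characters.
(define offsets
  (map (lambda (n) (utf8-prefix-length s n))
       '(1 2 3 4)))

;; Print the string index next to the corresponding byte offset.
(for-each
 (lambda (n off)
   (simple-format #t "string index ~a, UTF-8 byte offset ~a\n" n off))
 '(1 2 3 4)
 offsets)
```

The string indices advance by one per character, while the byte offsets advance by 1, 2, 2 and 3, which is why the two cannot be compared directly.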

Which is precisely the problem I am currently dealing with while porting
LilyPond: it has its own lexer working on a (UTF-8 encoded) byte stream
which is at the same time available as a string port.  Whenever embedded
Scheme is interpreted, the string port is moved to the proper position,
GUILE reads an expression and is told what to do with it, the string
port position is picked off and the LilyPond lexer is moved to the
respective position to continue.

If you take a look at
<URL:http://git.savannah.gnu.org/cgit/lilypond.git/tree/scm/parser-ly-from-scheme.scm>,
ftell on a string port is here used for correlating the positions of
parsed subexpressions with the original data.  Reencoding strings in
utf-8 is not going to make this work with string indexing since ftell
does not bear a useful relation to string positions.

The behavior of ftell and port-encoding is perfectly fine for reading
from bytevectors or files, and reading from bytevectors or files also
does not incur an encode-when-open action governed by
%default-port-encoding in GUILE-2.0 and by hardwired UTF-8 in GUILE-2.2.

But strings are already decoded characters.  Reencoding makes no sense
and detaches things like ftell and fseek from the actual input into the
port.

-- 
David Kastrup

* bug#18520: string ports should not have an encoding
  2014-09-21 23:34 bug#18520: string ports should not have an encoding David Kastrup
  2014-09-22 11:54 ` Ludovic Courtès
@ 2014-09-22 12:21 ` Ludovic Courtès
  2014-09-22 13:34   ` David Kastrup
  2014-09-24  5:30 ` Mark H Weaver
  2 siblings, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2014-09-22 12:21 UTC
  To: David Kastrup; +Cc: 18520

I see my reply failed to address some of the points raised.

David Kastrup <dak@gnu.org> skribis:

> Guile-2.2 does not consult %default-port-encoding but uses UTF-8
> consistently (I guess, overriding set-port-encoding! will again change
> that).
>
> That still is not satisfactory.  For example, using ftell on the input
> port will not report the string index of the string connected to the
> string port but rather a byte index into a UTF-8 encoded version of the
> string.  This is a number that has nothing to do with the original
> string and cannot be used for correlating string and port.

Right.

> Ports fundamentally deliver characters, and so reading and writing from
> a string source/sink should not involve _any_ coding system.
>
> Files fundamentally deliver bytes, so a conversion is required.  The same
> would be the case when opening a port on a _bytevector_.  Here an
> encoding would equally make sense, and ftell/fseek offsets would
> naturally be in bytes.  But a port on a string delivers and consumes
> characters.  Any conversion, even a fixed UTF-8 conversion, will destroy
> the predictable nature of with-output-to-string and
> with-input-from-string and the respective uses of string ports.

Guile ports can be mixed textual/binary (unlike R6 ports, which are
either textual or binary.)  Thus, they fundamentally deliver bytes,
possibly with a textual conversion.

Although the manual isn’t clear about it, ‘ftell’, when available,
returns a position in bytes.  The situation for string ports here is
comparable to that of other ports used for textual I/O.

Do you have a situation where you were relying on 1.8’s behavior in that
regard?  Could we see whether this can be solved differently?

Thanks,
Ludo’.

* bug#18520: string ports should not have an encoding
  2014-09-22 12:21 ` Ludovic Courtès
@ 2014-09-22 13:34   ` David Kastrup
  2014-09-22 17:08     ` Ludovic Courtès
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-09-22 13:34 UTC
  To: Ludovic Courtès; +Cc: 18520

ludo@gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak@gnu.org> skribis:
>
>> Guile-2.2 does not consult %default-port-encoding but uses UTF-8
>> consistently (I guess, overriding set-port-encoding! will again change
>> that).
>>
>> That still is not satisfactory.  For example, using ftell on the input
>> port will not report the string index of the string connected to the
>> string port but rather a byte index into a UTF-8 encoded version of the
>> string.  This is a number that has nothing to do with the original
>> string and cannot be used for correlating string and port.
>
> Right.
>
>> Ports fundamentally deliver characters, and so reading and writing from
>> a string source/sink should not involve _any_ coding system.
>>
>> Files fundamentally deliver bytes, so a conversion is required.  The same
>> would be the case when opening a port on a _bytevector_.  Here an
>> encoding would equally make sense, and ftell/fseek offsets would
>> naturally be in bytes.  But a port on a string delivers and consumes
>> characters.  Any conversion, even a fixed UTF-8 conversion, will destroy
>> the predictable nature of with-output-to-string and
>> with-input-from-string and the respective uses of string ports.
>
> Guile ports can be mixed textual/binary (unlike R6 ports, which are
> either textual or binary.)  Thus, they fundamentally deliver bytes,
> possibly with a textual conversion.

I think that is a mischaracterization.  GUILE ports at the current point
of time can _only_ be binary, to the degree that strings/texts first
have to be encoded into a binary stream before they can be passed
through a port.  Which is what this issue is about.

> Although the manual isn’t clear about it, ‘ftell’, when available,
> returns a position in bytes.

Which is not helpful if the input does not consist of bytes.

> The situation for string ports here is comparable to that of other
> ports used for textual I/O.

No.  The situation for file ports is that ftell refers to identifiable
and reproducible byte offsets of the input, the input being a file
consisting of bytes and indexed using bytes.

The situation for string ports is that ftell refers to unidentifiable
and incidental byte offsets of a temporary inaccessible ad-hoc encoding
of the input, the input being a string consisting of characters and
indexed using characters.

> Do you have a situation where you were relying on 1.8’s behavior in
> that regard?  Could we see whether this can be solved differently?

I'm currently migrating LilyPond over to GUILE 2.0.  LilyPond has its
own UTF-8 verification, error flagging, processing and indexing.  I have
more than enough crashes and obscure errors to contend with as it
stands, so the first port will use LC_CTYPE=C (LC_CTYPE=ISO-8859-1 does
not work since then GUILE/iconv considers itself entitled to complain
about improper Latin-1) and will keep GUILE 2.0 from thinking about
UTF-8 at all.  Moving string processing to UTF-8 will be a gradual
process, and a separate project involving programmer choices about what
to represent where and how: much of LilyPond is written in C++ and so UTF-8
encoded strings (rather than GUILE's strings, stored as either Latin-1
or UCS-4) are ubiquitous, with most of LilyPond's core literals fitting
in the common ASCII subset.

Whenever GUILE chooses to take decisions away from the user and
programmer, problems are likely to result, and workarounds will abound.
For efficiency reasons, it is not realistic to demand that any string
data passed between GUILE and LilyPond be encoded and reencoded at
every call gate: there are a great many of them.

-- 
David Kastrup

* bug#18520: string ports should not have an encoding
  2014-09-22 13:34   ` David Kastrup
@ 2014-09-22 17:08     ` Ludovic Courtès
  2014-09-22 17:20       ` David Kastrup
  0 siblings, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2014-09-22 17:08 UTC
  To: David Kastrup; +Cc: 18520

David Kastrup <dak@gnu.org> skribis:

> I'm currently migrating LilyPond over to GUILE 2.0.  LilyPond has its
> own UTF-8 verification, error flagging, processing and indexing.

Do I understand correctly that LilyPond expects Guile strings to be byte
vectors, which it can feed with UTF-8 byte sequences that it built by
itself?

> If you take a look at
> <URL:http://git.savannah.gnu.org/cgit/lilypond.git/tree/scm/parser-ly-from-scheme.scm>,
> ftell on a string port is here used for correlating the positions of
> parsed subexpressions with the original data.  Reencoding strings in
> utf-8 is not going to make this work with string indexing since ftell
> does not bear a useful relation to string positions.

AIUI the result of ‘ftell’ is used in only one place, while ‘port-line’
and ‘port-column’ are used in other places.  The latter seems more
appropriate to me when it comes to tracking source location.

How is the result of ‘ftell’ used by callers of ‘read-lily-expression’?

> I have more than enough crashes and obscure errors to contend with as
> it stands,

Could you open a separate bug with the backtrace of such crashes, if you
think it may be Guile’s fault?

Thanks,
Ludo’.

* bug#18520: string ports should not have an encoding
  2014-09-22 17:08     ` Ludovic Courtès
@ 2014-09-22 17:20       ` David Kastrup
  2014-09-22 20:39         ` Ludovic Courtès
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-09-22 17:20 UTC
  To: Ludovic Courtès; +Cc: 18520

ludo@gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak@gnu.org> skribis:
>
>> I'm currently migrating LilyPond over to GUILE 2.0.  LilyPond has its
>> own UTF-8 verification, error flagging, processing and indexing.
>
> Do I understand correctly that LilyPond expects Guile strings to be byte
> vectors, which it can feed with UTF-8 byte sequences that it built by
> itself?

Not really.  LilyPond reads and parses its own files but it does divert
parts through GUILE occasionally in the process.  Some stuff is passed
through GUILE with time delays and parts wrapped into closures and
flagged with machine-identifiable source locations.

>> If you take a look at
>> <URL:http://git.savannah.gnu.org/cgit/lilypond.git/tree/scm/parser-ly-from-scheme.scm>,
>> ftell on a string port is here used for correlating the positions of
>> parsed subexpressions with the original data.  Reencoding strings in
>> utf-8 is not going to make this work with string indexing since ftell
>> does not bear a useful relation to string positions.
>
> AIUI the result of ‘ftell’ is used in only one place, while
> ‘port-line’ and ‘port-column’ are used in other places.

The ftell information is wrapped into an alist together with a closure
corresponding to the source location.  At a later point of time, the
surrounding string may be interpreted, and the source location is
correlated with the closure and the closure used instead of a call to
local-eval (which does not have the same power of evaluating materials
in a preserved lexical environment as a closure has).

> The latter seems more appropriate to me when it comes to tracking
> source location.

For error messages, yes.  For associating a position in a string with a
previously parsed closure, no.

> How is the result of ‘ftell’ used by callers of
> ‘read-lily-expression’?

See above.

>> I have more than enough crashes and obscure errors to contend with as
>> it stands,
>
> Could you open a separate bug with the backtrace of such crashes, if you
> think it may be Guile’s fault?

The backtraces are usually quite useless for diagnosing the crashes.
For example, there are crashes in scm_sloppy_assq.  If you look at the
code, it is clear that they can only happen for pairs that have already
been collected by garbage collection.  So the bug has occurred quite a
bit previously to the crash.

So one has to figure out how the collection could possibly have happened
(naturally, it didn't with GUILE 1.8).  You can try doing that with the
rather expensive process of "reverse execution" (which basically traces
and keeps a history you can then explore backwards from the crash), but
that requires that the bugs are reproducible, and with collection in a
separate thread, that is not really the case.  Sometimes a crash
segfaults, more often you get std::exception triggered.  All with the
same input and executable.

-- 
David Kastrup

* bug#18520: string ports should not have an encoding
  2014-09-22 17:20       ` David Kastrup
@ 2014-09-22 20:39         ` Ludovic Courtès
  2014-09-22 22:12           ` David Kastrup
  0 siblings, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2014-09-22 20:39 UTC
  To: David Kastrup; +Cc: 18520

David Kastrup <dak@gnu.org> skribis:

> ludo@gnu.org (Ludovic Courtès) writes:
>
>> David Kastrup <dak@gnu.org> skribis:
>>
>>> I'm currently migrating LilyPond over to GUILE 2.0.  LilyPond has its
>>> own UTF-8 verification, error flagging, processing and indexing.
>>
>> Do I understand correctly that LilyPond expects Guile strings to be byte
>> vectors, which it can feed with UTF-8 byte sequences that it built by
>> itself?
>
> Not really.  LilyPond reads and parses its own files but it does divert
> parts through GUILE occasionally in the process.  Some stuff is passed
> through GUILE with time delays and parts wrapped into closures and
> flagged with machine-identifiable source locations.

OK.

>>> If you take a look at
>>> <URL:http://git.savannah.gnu.org/cgit/lilypond.git/tree/scm/parser-ly-from-scheme.scm>,
>>> ftell on a string port is here used for correlating the positions of
>>> parsed subexpressions with the original data.  Reencoding strings in
>>> utf-8 is not going to make this work with string indexing since ftell
>>> does not bear a useful relation to string positions.
>>
>> AIUI the result of ‘ftell’ is used in only one place, while
>> ‘port-line’ and ‘port-column’ are used in other places.
>
> The ftell information is wrapped into an alist together with a closure
> corresponding to the source location.  At a later point of time, the
> surrounding string may be interpreted, and the source location is
> correlated with the closure and the closure used instead of a call to
> local-eval (which does not have the same power of evaluating materials
> in a preserved lexical environment as a closure has).
>
>> The latter seems more appropriate to me when it comes to tracking
>> source location.
>
> For error messages, yes.  For associating a position in a string with a
> previously parsed closure, no.

But wouldn’t a line/column pair be as suitable as a unique identifier as
the position in the file?

Also, if the result of ‘ftell’ is used as a unique identifier, does it
really matter whether it’s an offset measured in bytes or in characters?

(Trying to make sure I understand the problem.)

Thanks,
Ludo’.

* bug#18520: string ports should not have an encoding
  2014-09-22 20:39         ` Ludovic Courtès
@ 2014-09-22 22:12           ` David Kastrup
  2014-09-23  8:25             ` Ludovic Courtès
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-09-22 22:12 UTC
  To: Ludovic Courtès; +Cc: 18520

ludo@gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak@gnu.org> skribis:
>>
>> For error messages, yes.  For associating a position in a string with a
>> previously parsed closure, no.
>
> But wouldn’t a line/column pair be as suitable as a unique identifier as
> the position in the file?

As long as the reencoded UTF-8 is byte-identical to the original.  At
the current point of time, we flag non-UTF-8 sequences with a warning
and continue.

People complained previously about things like Latin-1 characters (most
likely to occur in comments or lyrics where they cause little or
well-identifiable havoc) leading to unceremonious aborts without
identifiable cause.

At any rate, the current behavior does not make sense.  Guile 2.0 might
refuse to turn a string into a port, and for Guile 2.2 the port encoding
may be used to have a UTF-8 rendition of the string characters be
interpreted in another encoding (like latin-1) but not the other way
round.

Both versions make only some half-baked sense.  Most resulting problems
can probably be worked around in some manner, but string ports are
actually the main stringbuf-like mechanism that Scheme has (dynamically
growing strings that are more compact than a list of characters).
Wedging a compulsory code conversion into it that is mirrored in the
port positions seems like a distraction.

> Also, if the result of ‘ftell’ is used as a unique identifier, does it
> really matter whether it’s an offset measured in bytes or in
> characters?

In the LilyPond lexer, stuff is usually measured with byte offsets.
Yes, one can certainly parse the UTF-8 character distances and hope to
arrive at the same results as the UTF-8 reencoding.
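
A minimal sketch of what such a workaround might look like, assuming the byte offset falls on a character boundary of valid UTF-8 and that (rnrs bytevectors) is available; the procedure name utf8-byte-offset->char-index is hypothetical and not part of GUILE:

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical helper: translate a byte offset into the UTF-8 encoding BV
;; back into a character index.  Every byte that is not a continuation
;; byte (bit pattern 10xxxxxx) starts a new character, so counting those
;; bytes in the first BYTE-OFFSET bytes yields the number of characters.
(define (utf8-byte-offset->char-index bv byte-offset)
  (let loop ((i 0) (chars 0))
    (if (>= i byte-offset)
        chars
        (loop (+ i 1)
              (if (= (logand (bytevector-u8-ref bv i) #xc0) #x80)
                  chars
                  (+ chars 1))))))

;; Usage with a four-character test string (code points 20, 200, 2000
;; and 20000), whose characters end at byte offsets 1, 3, 5 and 8.
(define s (list->string (map integer->char '(20 200 2000 20000))))
(define encoded (string->utf8 s))

(define indices
  (map (lambda (off) (utf8-byte-offset->char-index encoded off))
       '(1 3 5 8)))
```

This costs a linear scan per lookup and only works if the caller can reproduce the exact bytes the port used internally, which is part of the complication being objected to.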

But the point of GUILE's character set support was not really to make
everything more complicated, was it?

-- 
David Kastrup

* bug#18520: string ports should not have an encoding
  2014-09-22 22:12           ` David Kastrup
@ 2014-09-23  8:25             ` Ludovic Courtès
  2014-09-23  9:00               ` David Kastrup
  0 siblings, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2014-09-23  8:25 UTC
  To: David Kastrup; +Cc: 18520

David Kastrup <dak@gnu.org> skribis:

> ludo@gnu.org (Ludovic Courtès) writes:
>
>> David Kastrup <dak@gnu.org> skribis:
>>>
>>> For error messages, yes.  For associating a position in a string with a
>>> previously parsed closure, no.
>>
>> But wouldn’t a line/column pair be as suitable as a unique identifier as
>> the position in the file?
>
> As long as the reencoded UTF-8 is byte-identical to the original.

Sorry, what do you mean by “reencoded UTF-8”?  The internal string port
buffer?

Line/column info remains identical regardless of the encoding, so I tend
to think it’s more robust to use that.

Thanks,
Ludo’.

* bug#18520: string ports should not have an encoding
  2014-09-23  8:25             ` Ludovic Courtès
@ 2014-09-23  9:00               ` David Kastrup
  2014-09-23  9:45                 ` Ludovic Courtès
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-09-23  9:00 UTC
  To: Ludovic Courtès; +Cc: 18520

ludo@gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak@gnu.org> skribis:
>
>> ludo@gnu.org (Ludovic Courtès) writes:
>>
>>> David Kastrup <dak@gnu.org> skribis:
>>>>
>>>> For error messages, yes.  For associating a position in a string with a
>>>> previously parsed closure, no.
>>>
>>> But wouldn’t a line/column pair be as suitable as a unique identifier as
>>> the position in the file?
>>
>> As long as the reencoded UTF-8 is byte-identical to the original.
>
> Sorry, what do you mean by “reencoded UTF-8”?  The internal string port
> buffer?

Sure.  That's where ftell gets its info from.

> Line/column info remains identical regardless of the encoding, so I tend
> to think it’s more robust to use that.

Column info remains identical regardless of the encoding?  Since when?

-- 
David Kastrup

* bug#18520: string ports should not have an encoding
  2014-09-23  9:00               ` David Kastrup
@ 2014-09-23  9:45                 ` Ludovic Courtès
  2014-09-23 11:54                   ` David Kastrup
  0 siblings, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2014-09-23  9:45 UTC
  To: David Kastrup; +Cc: 18520

David Kastrup <dak@gnu.org> skribis:

>> Line/column info remains identical regardless of the encoding, so I tend
>> to think it’s more robust to use that.
>
> Column info remains identical regardless of the encoding?  Since when?

The character on line L and column M is always there, regardless of
whether the file is encoded in UTF-8, Latin-1, etc.

Would that work for LilyPond?

Ludo’.

* bug#18520: string ports should not have an encoding
  2014-09-23  9:45                 ` Ludovic Courtès
@ 2014-09-23 11:54                   ` David Kastrup
  2014-09-23 12:13                     ` Ludovic Courtès
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-09-23 11:54 UTC
  To: Ludovic Courtès; +Cc: 18520

ludo@gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak@gnu.org> skribis:
>
>>> Line/column info remains identical regardless of the encoding, so I tend
>>> to think it’s more robust to use that.
>>
>> Column info remains identical regardless of the encoding?  Since when?
>
> The character on line L and column M is always there, regardless of
> whether the file is encoded in UTF-8, Latin-1, etc.
>
> Would that work for LilyPond?

Last time I looked, in the following line x was in column 3 in latin-1
encoding and in column 2 in utf-8 encoding:

üx

At any rate, we are missing the point of the issue.  The issue is not
whether a workaround may be designed for every way in which GUILE tries
tripping up its users.  The question is how GUILE may provide the least
amount of surprise to its users without sacrificing functionality.

GUILE's current implementation uses two character set conversions for
string ports.  For input string ports, the first is a batch encoding
when the string port is opened (using %default-port-encoding in
GUILE-2.0 and "UTF-8" in GUILE-2.2).  This encoding is set as the
port's encoding (I hope), and then, unless changed, every read operation
decodes using whatever encoding is current at that time.

Accompanying the opening of a string with an encoding operation (whether
using a forced encoding or %default-port-encoding) is expensive (not
least of all because everything needs to be decoded again), leads to
arbitrary semantics for port positioning, and is asymmetric since the
port encoding is only used for reading on an input string and for
writing on an output string.

Oh, and for writing on an input string using unread-string, of course.
No kidding.  There is also a conversion in there.

Would it be worth ditching this sort of unnecessary conversion?  Well,
just look at:

    commit be7ecef05c1eea66f30360f658c610710c5cb22e
    Author: Andy Wingo <wingo@pobox.com>
    Date:   Sat Aug 31 10:44:07 2013 +0200

        unread-char: inline conversion from codepoint to bytes

        * libguile/ports.c (scm_ungetc_unlocked): Inline the conversion from
          codepoint to bytes for UTF-8 and latin-1 ports.  Speeds up a
          numbers-reading test case by 100% (!).

That sounds like quite some gain just for _simplifying_ the
back-and-forth conversion, and we could just be forgoing it instead
(yes, peek-char as getc+ungetc presents a challenge in connection with
encoding switches: I think that declaring the first impression of
peek-char as sticky would be reasonable).

At any rate, the above commit looks like it would make a hash out of

(with-input-from-string "Huh\""
  (lambda ()
    (unread-string "\"ä" (current-input-port))
    (read)))

because of a broken character range check (I cannot currently check with
a compilation of master since that takes about a day on my computer, but
I would be surprised if the above worked fine).  So yes, the required
complexity to deal with GUILE's current behavior can introduce problems.

-- 
David Kastrup

* bug#18520: string ports should not have an encoding
  2014-09-23 11:54                   ` David Kastrup
@ 2014-09-23 12:13                     ` Ludovic Courtès
  2014-09-23 13:02                       ` David Kastrup
  0 siblings, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2014-09-23 12:13 UTC
  To: David Kastrup; +Cc: 18520

David Kastrup <dak@gnu.org> skribis:

> ludo@gnu.org (Ludovic Courtès) writes:
>
>> David Kastrup <dak@gnu.org> skribis:
>>
>>>> Line/column info remains identical regardless of the encoding, so I tend
>>>> to think it’s more robust to use that.
>>>
>>> Column info remains identical regardless of the encoding?  Since when?
>>
>> The character on line L and column M is always there, regardless of
>> whether the file is encoded in UTF-8, Latin-1, etc.
>>
>> Would that work for LilyPond?
>
> Last time I looked, in the following line x was in column 3 in latin-1
> encoding and in column 2 in utf-8 encoding:
>
> üx

I’m not sure what you mean.  This line contains two characters: ‘u’ with
umlaut followed by ‘x’.  ‘ü’ is in the first column, and ‘x’ in the
second column.

If we get a different column number, that means we’re looking at a
different line.  It could be because the encoding of the input port from
which that line was read was incorrectly specified.  This is the issue
that would need to be fixed.

Is there a simple way to reproduce the issue with LilyPond?

Thanks,
Ludo’.

* bug#18520: string ports should not have an encoding
  2014-09-23 12:13                     ` Ludovic Courtès
@ 2014-09-23 13:02                       ` David Kastrup
  2014-09-23 16:01                         ` Ludovic Courtès
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-09-23 13:02 UTC
  To: Ludovic Courtès; +Cc: 18520

ludo@gnu.org (Ludovic Courtès) writes:

> David Kastrup <dak@gnu.org> skribis:
>
>> ludo@gnu.org (Ludovic Courtès) writes:
>>
>>> David Kastrup <dak@gnu.org> skribis:
>>>
>>>>> Line/column info remains identical regardless of the encoding, so I tend
>>>>> to think it’s more robust to use that.
>>>>
>>>> Column info remains identical regardless of the encoding?  Since when?
>>>
>>> The character on line L and column M is always there, regardless of
>>> whether the file is encoded in UTF-8, Latin-1, etc.
>>>
>>> Would that work for LilyPond?
>>
>> Last time I looked, in the following line x was in column 3 in latin-1
>> encoding and in column 2 in utf-8 encoding:
>>
>> üx
>
> I’m not sure what you mean.  This line contains two characters: ‘u’ with
> umlaut followed by ‘x’.  ‘ü’ is in the first column, and ‘x’ in the
> second column.

It contains three bytes. 0xc3, 0xbc, 0x78.  In utf-8, this is üx, in
Latin-1 it is Ã¼x.
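
A minimal sketch in plain C (not Guile code) of the point above: the same
three bytes put 'x' in a different column depending on how they are decoded.
The function names are hypothetical, and the UTF-8 variant assumes valid input.

```c
#include <assert.h>
#include <stddef.h>

/* Return the 1-based column of the byte at BYTE_INDEX when LINE is read
   as Latin-1, where every byte is exactly one character.  */
size_t
column_latin1 (const unsigned char *line, size_t byte_index)
{
  assert (line != NULL);
  (void) line;
  return byte_index + 1;
}

/* Return the 1-based column of the character starting at BYTE_INDEX when
   LINE is read as UTF-8.  Continuation bytes (10xxxxxx) do not start a
   new character, so they are not counted.  Assumes LINE is valid UTF-8.  */
size_t
column_utf8 (const unsigned char *line, size_t byte_index)
{
  size_t column = 1;
  assert (line != NULL);
  for (size_t i = 0; i < byte_index; i++)
    if ((line[i] & 0xC0) != 0x80)
      column++;
  return column;
}
```

For the bytes 0xc3, 0xbc, 0x78 the byte 0x78 sits at byte index 2, which is
column 3 under the Latin-1 reading and column 2 under the UTF-8 reading.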

This whole issue is about string ports _not_ being represented in terms
of characters but bytes.

> Is there a simple way to reproduce the issue with LilyPond?

This issue is at best marginally about LilyPond, in that the semantics
chosen for GUILE-2.0 (and switched again in GUILE-2.2) are both
surprising and a source of headaches.

They result in code like

  // we do our own utf8 encoding and verification in the parser, so we
  // use the no-conversion equivalent of latin1
  SCM str = scm_from_latin1_string (c_str ());
  scm_dynwind_begin ((scm_t_dynwind_flags)0);
  // Why doesn't scm_set_port_encoding_x work here?
  scm_dynwind_fluid (ly_lily_module_constant ("%default-port-encoding"), SCM_BOOL_F);
  str_port_ = scm_open_input_string (str);
  scm_dynwind_end ();
  scm_set_port_filename_x (str_port_, ly_string2scm (name_));
}

which will, incidentally, stop working in GUILE-2.2 at which time
another workaround will be found.

GUILE is an extension language.  The stance that any kind of dealing
with characters/strings has to be under the control of GUILE and its
character model is simply inappropriate.  It is not the job of GUILE to
dictate how an application has to organize matters internally.  For that
reason, its behavior needs to be straightforward and unsurprising.  That
includes sane boundaries between strings as character vectors, byte
vectors, and encoding and decoding operations.  Going through a
byte-based encoding when copying a character-based string to a string,
even when going through a string port, does not make sense.

As a sign that this does not make sense, the effects of
%default-port-encoding and set-port-encoding! on input and output string
ports are unsymmetric.  More so in GUILE-2.2 than in GUILE-2.0, but
already in GUILE-2.0.

That inconsistency (and its effects on overall performance) is what this
issue is about.  That I am tripping all over GUILE in the course of
working with LilyPond is at best incidental to this issue.  I could
equally well be tripping over it when working with TeXmacs.

I am not going to further reply to this issue since this is _not_,
I repeat _not_ some complaint that I am too stupid to understand what
GUILE is doing here.  I understand it perfectly well, and I am perfectly
able to hack around GUILE's deficiencies and inconsistencies.  One
consequence of design problems like this is that the chosen semantics
under such a fundamental design problem are arbitrary and thus more
likely to change to different semantics in future versions.  That means
a higher likelihood of future maintenance.  When I am going to have to
redo this for GUILE-2.2 anyway, I prefer doing it in a sane manner that
will stick around for good.

I don't see that here.  That does not mean that I am too stupid to work
with the GUILE 2.0 behavior or the GUILE 2.2 behavior or the GUILE 1.8
behavior (in fact, the first port to GUILE 2 will set LC_CTYPE to C and
just stick with GUILE 1.8 behavior, but that's not a long-term
perspective since working with characters rather than bytes as string
constituents _is_ nicer for the user).

-- 
David Kastrup

* bug#18520: string ports should not have an encoding
  2014-09-23 13:02                       ` David Kastrup
@ 2014-09-23 16:01                         ` Ludovic Courtès
  2014-09-23 16:21                           ` David Kastrup
  0 siblings, 1 reply; 20+ messages in thread
From: Ludovic Courtès @ 2014-09-23 16:01 UTC (permalink / raw)
  To: David Kastrup; +Cc: 18520

David Kastrup <dak@gnu.org> skribis:

> They result in code like
>
>   // we do our own utf8 encoding and verification in the parser, so we
>   // use the no-conversion equivalent of latin1
>   SCM str = scm_from_latin1_string (c_str ());
>   scm_dynwind_begin ((scm_t_dynwind_flags)0);
>   // Why doesn't scm_set_port_encoding_x work here?
>   scm_dynwind_fluid (ly_lily_module_constant ("%default-port-encoding"), SCM_BOOL_F);
>   str_port_ = scm_open_input_string (str);
>   scm_dynwind_end ();
>   scm_set_port_filename_x (str_port_, ly_string2scm (name_));
> }

So here ‘c_str’ returns a char * that is a UTF-8-encoded string, right?

In that case, it should be enough to do:

  /* Get a Scheme string from its UTF-8 representation.  */
  str = scm_from_utf8_string (c_str ());

  /* Create an input string port.  ‘read-char’ & co. will return each
     character from STR, one at a time.  */
  str_port = scm_open_input_string (str);

  scm_set_port_filename_x (str_port, file);

As long as textual I/O procedures are used on ‘str_port’, there’s no
need to worry about its encoding.

Now, to be able to use ‘ftell’ and assume it returns the position as a
number of bytes in the UTF-8 sequence, something like this should work
(for 2.0; for 2.2 nothing special is needed):

  /* Get a Scheme string from its UTF-8 representation.  */
  str = scm_from_utf8_string (c_str ());

  scm_dynwind_begin (0);

  /* Make sure the following string port uses UTF-8 as the internal
     encoding of its buffer.  */
  scm_dynwind_fluid (scm_c_public_ref ("guile", "%default-port-encoding"),
                     scm_from_latin1_string ("UTF-8"));

  /* Create an input string port.  ‘read-char’ & co. will return each
     character from STR, one at a time.  */
  str_port = scm_open_input_string (str);
  scm_dynwind_end ();

  scm_set_port_filename_x (str_port, file);

Does this help for LilyPond?

Ludo’.

* bug#18520: string ports should not have an encoding
  2014-09-23 16:01                         ` Ludovic Courtès
@ 2014-09-23 16:21                           ` David Kastrup
  2014-09-23 19:33                             ` Ludovic Courtès
  0 siblings, 1 reply; 20+ messages in thread
From: David Kastrup @ 2014-09-23 16:21 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 18520

ludo@gnu.org (Ludovic Courtès) writes:

> Does this help for LilyPond?

I stated quite definitely that I am perfectly capable of dealing with
the mess GUILE made of string ports.  The issue is that I should not
have to, nor should anybody else.

This issue _is_ _not_ _about_ _LilyPond_.  Working on LilyPond merely
shines a light on it.

So please stop painting this as a request for help.  It isn't.  It is a
request for change.

The subject line is "string ports should not have an encoding".  It
isn't "help, I don't understand string ports".

-- 
David Kastrup

* bug#18520: string ports should not have an encoding
  2014-09-23 16:21                           ` David Kastrup
@ 2014-09-23 19:33                             ` Ludovic Courtès
  0 siblings, 0 replies; 20+ messages in thread
From: Ludovic Courtès @ 2014-09-23 19:33 UTC (permalink / raw)
  To: David Kastrup; +Cc: 18520

David Kastrup <dak@gnu.org> skribis:

> I stated quite definitely that I am perfectly capable of dealing with
> the mess GUILE made of string ports.

Good to know, this was not my understanding until now.

The intent of the change in 2.2 is to hide the very fact that string
ports “have an encoding.”  So from that viewpoint, that bug is closed.

If the bug is about ‘ftell’, that’s a different story.  I would tend to
suggest that ‘ftell’ and ‘seek’ for string ports operate on an abstract
notion of position within the string port data.

This is in fact the path that the R6RS takes:

  For a binary port, the port-position procedure returns the index of
  the position at which the next byte would be read from or written to
  the port as an exact non-negative integer object.  For a textual port,
  port-position returns a value of some implementation-dependent type
  representing the port's position; this value may be useful only as the
  pos argument to set-port-position!, if the latter is supported on the
  port (see below).

Thus, I would suggest a clarification along these lines:

diff --git a/doc/ref/api-io.texi b/doc/ref/api-io.texi
index 02d92a2..8331378 100644
--- a/doc/ref/api-io.texi
+++ b/doc/ref/api-io.texi
@@ -443,8 +443,12 @@ open.
 @deffn {Scheme Procedure} seek fd_port offset whence
 @deffnx {C Function} scm_seek (fd_port, offset, whence)
 Sets the current position of @var{fd_port} to the integer
-@var{offset}, which is interpreted according to the value of
-@var{whence}.
+@var{offset}.  For a file port, @var{offset} is expressed
+as a number of bytes; for other types of ports, such as string
+ports, @var{offset} is an abstract representation of the
+position within the port's data, not necessarily expressed
+as a number of bytes.  @var{offset} is interpreted according to
+the value of @var{whence}.

 One of the following variables should be supplied for
 @var{whence}:
@@ -460,7 +464,7 @@ Seek from the end of the file.
 If @var{fd_port} is a file descriptor, the underlying system
 call is @code{lseek}.  @var{port} may be a string port.

-The value returned is the new position in the file.  This means
+The value returned is the new position in @var{fd_port}.  This means
 that the current position of a port can be obtained using:
 @lisp
 (seek port 0 SEEK_CUR)

Thoughts?

Thanks,
Ludo’.

* bug#18520: string ports should not have an encoding
  2014-09-21 23:34 bug#18520: string ports should not have an encoding David Kastrup
  2014-09-22 11:54 ` Ludovic Courtès
  2014-09-22 12:21 ` Ludovic Courtès
@ 2014-09-24  5:30 ` Mark H Weaver
  2014-09-24 12:00   ` David Kastrup
  2 siblings, 1 reply; 20+ messages in thread
From: Mark H Weaver @ 2014-09-24  5:30 UTC (permalink / raw)
  To: David Kastrup; +Cc: 18520

David Kastrup <dak@gnu.org> writes:

> In Guile 2.0, at the time a string port is opened, the value of the
> fluid %default-port-encoding is used for deciding how to encode the
> string into a byte stream, [...]

I agree that this was a mistake.  The issue is fixed on the master
branch.

> Ports fundamentally deliver characters, and so reading and writing from
> a string source/sink should not involve _any_ coding system.

David, you know as well as I that internally, there is always a coding
system.  Strings have a coding system too, even if it's UCS-4.  Emacs
uses something based on UTF-8, and I'd like Guile to do something
similar in the future.

I guess you don't like the fact that it is possible to expose the
internal representation via 'set-port-encoding!', 'ftell' or 'seek'.
I don't see this as a problem, and arguably it's a benefit.

First I'll address the non-standard 'set-port-encoding!'.  As you say,
it doesn't even make sense on string ports, and arguably should be an
error.  So why do you care if some internal details leak out when you do
this nonsensical thing?  Admittedly, we're missing an opportunity to
report a possible bug to the user, but that's the only problem I see
here.

Regarding 'ftell' and 'seek', it's not entirely clear to me what's the
best representation of those positions.  In some situations, I guess it
would be convenient for them to count unicode code points or string
indices.  In other situations, I could imagine it being more convenient
for them to count grapheme clusters or UTF-8 bytes.
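
As a rough illustration of how two of these position notions differ, here is
a plain C sketch (not Guile API; the function name is hypothetical) that maps
a string index to the corresponding byte offset in a UTF-8 encoding of the
same text.  It assumes valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte offset at which character number CHAR_INDEX (0-based)
   starts in the NUL-terminated UTF-8 string S, or the length of S in
   bytes if S has fewer characters.  Assumes S is valid UTF-8.  */
size_t
utf8_byte_offset (const char *s, size_t char_index)
{
  size_t offset = 0;
  assert (s != NULL);
  while (s[offset] != '\0')
    {
      /* A byte that is not a continuation byte starts a character.  */
      if (((unsigned char) s[offset] & 0xC0) != 0x80)
        {
          if (char_index == 0)
            break;
          char_index--;
        }
      offset++;
    }
  return offset;
}
```

The two numbers agree only for pure ASCII text; the mapping also needs a
linear scan, since UTF-8 characters have variable width.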

R6RS, the only Scheme standard that supports getting or setting file
positions, gives us complete freedom to choose our representation of
positions on textual ports.  The R6RS is explicit that they don't even
have to be integers, and if they are, they don't have to correspond to
bytes or characters.

For better or for worse, Guile's ports are fundamentally based on bytes,
and allow mixed binary and textual operations on all ports.  Sometimes
this is very helpful, for example when implementing HTTP.  I can think
of one other case where it's very helpful:

I don't know how deeply you've looked at UTF-8, but it has some unusual
properties that allow many (most?) string algorithms to be most
naturally (and efficiently) implemented by operating on bytes rather
than code points.  Much of the time, you don't even have to be aware of
the code point boundaries, which is a great savings.  Efficient lookup
tables based on bytes are also much cheaper than ones based on code
points, etc.
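
A small plain C sketch of that property, under the assumption that both
strings are valid UTF-8 and the needle is non-empty (the function name is
hypothetical): an ordinary byte-wise search needs no decoding, because lead
bytes and continuation bytes never overlap, so a match always starts on a
character boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the byte offset of the first occurrence of NEEDLE in HAYSTACK,
   or (size_t) -1 if there is none.  Both are NUL-terminated and assumed
   to be valid UTF-8; the search itself only compares bytes.  */
size_t
utf8_find (const char *haystack, const char *needle)
{
  const char *hit;
  assert (haystack != NULL && needle != NULL);
  hit = strstr (haystack, needle);
  return hit ? (size_t) (hit - haystack) : (size_t) -1;
}
```

Note that the result is a byte offset, not a string index, which is exactly
the distinction under discussion in this thread.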

In fact, I intend to propose that in a future version of Guile, strings
will not only be based on UTF-8 internally, but that this fact should be
exposed in the API, allowing users to implement UTF-8 string operations
that operate on bytes not code points.  I'd also like lightweight, fast
string ports that allow access to these bytes when desired.

This leads me to believe that it's a feature, not a bug, that string
ports use UTF-8 internally, and that it's possible (via non-standard
extensions) to get access to the underlying bytes.

      Mark

* bug#18520: string ports should not have an encoding
  2014-09-24  5:30 ` Mark H Weaver
@ 2014-09-24 12:00   ` David Kastrup
  0 siblings, 0 replies; 20+ messages in thread
From: David Kastrup @ 2014-09-24 12:00 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: 18520

Mark H Weaver <mhw@netris.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> In Guile 2.0, at the time a string port is opened, the value of the
>> fluid %default-port-encoding is used for deciding how to encode the
>> string into a byte stream, [...]
>
> I agree that this was a mistake.  The issue is fixed on the master
> branch.

The mistake is having a string port use a different
sequence-of-character encoding than a string.

>> Ports fundamentally deliver characters, and so reading and writing
>> from a string source/sink should not involve _any_ coding system.
>
> David, you know as well as I that internally, there is always a coding
> system.  Strings have a coding system too, even if it's UCS-4.  Emacs
> uses something based on UTF-8, and I'd like Guile to do something
> similar in the future.
>
> I guess you don't like the fact that it is possible to expose the
> internal representation via 'set-port-encoding!', 'ftell' or 'seek'.
> I don't see this as a problem, and arguably it's a benefit.

Shrug.  That arguable benefit went down in flames in Emacs 20.  It
triggered the last great migration of Emacs users to XEmacs.  It took
until Emacs 20.4 for the horrible mistake of exposing byte offsets to
the user in either strings or buffers to be corrected.

You write above "Emacs uses something based on UTF-8", and it's worth
pointing out that it does so starting with Emacs 23.  Previously Emacs
used its own peculiar multibyte encoding that existed long before UTF-8.
The important thing to note is that it was _completely_ hidden from
sight from Elisp users when the Emacs 20 tribulations were over.  Emacs
was able to swap out this multibyte encoding for the Emacs 23 coding
rather transparently, and the main reason to do that was to make UTF-8 a
favored encoding regarding performance of encoding/decoding and
processing of Elisp source files.

Emacs' internal encoding is not proper UTF-8.  You can take a random
byte string, tell Emacs that it is encoded in UTF-8, and decode it into
Emacs' internal representation.  All passages that happen to be proper
uniquely represented UTF-8 will pass the transcoding unchanged, but
everything else will be transcoded into a UTF-8-like representation of
"unencodable byte".  I think Emacs uses the UTF-8 forbidden code points
from 0xd800 to 0xd880 for encoding stray bytes, or something like that.

So if you reencode the unchanged "UTF-8" Emacs uses internally, the
result will again faithfully reproduce the random byte stream.

Garbage in, _same_ garbage out.  A very important property that many of
Emacs' supported file encodings share.  Notable exceptions are various
Japanese encodings based on escape characters.

At any rate, unless you are using explicit conversions like
string-as-unibyte or _encoding_ to Emacs' internal representation (it is
available as a named coding system), the representation is not exposed.
Strings are indexed per character, and buffers (which are at their heart
random-access string ports) are indexed per character.

Emacs has both unibyte and multibyte strings and unibyte and multibyte
buffers, and unibyte strings and buffers are the source for decoding and
the target for encoding into multibyte strings and buffers.  XEmacs does
not have unibyte strings/buffers, so a lot of string internals do not
need to make the distinction.  GUILE could probably get away without
unibyte strings as well because it has bytevectors.  This would imply
that if you wanted to do stuff akin to string operations on unibyte
strings, you'd have to first convert bytevectors to multibyte strings,
do your operations, convert back.  XEmacs chose _not_ to have unibyte
strings (and the corresponding complications to support both in the
primitives), Emacs chose to have them.  I think both approaches are
defensible.

Since GUILE presents itself as an extension language and since strings
will need to get passed in and out of extension languages all the time,
the implementation cost of offering a low-cost unibyte string is
probably even more defensible than with Elisp where Elisp is the main
processing language.

> First I'll address the non-standard 'set-port-encoding!'.  As you say,
> it doesn't even make sense on string ports, and arguably should be an
> error.  So why do you care if some internal details leak out when you
> do this nonsensical thing?  Admittedly, we're missing an opportunity
> to report a possible bug to the user, but that's the only problem I
> see here.
>
> Regarding 'ftell' and 'seek', it's not entirely clear to me what's the
> best representation of those positions.  In some situations, I guess
> it would be convenient for them to count unicode code points or string
> indices.  In other situations, I could imagine it being more
> convenient for them to count grapheme clusters or UTF-8 bytes.
>
> R6RS, the only Scheme standard that supports getting or setting file
> positions, gives us complete freedom to choose our representation of
> positions on textual ports.  The R6RS is explicit that they don't even
> have to be integers, and if they are, they don't have to correspond to
> bytes or characters.

R6RS gives you the freedom to match your semantics to your
implementation.  String ports are strings-in-progress (and Emacs buffers
are strings-in-progress on steroids), so it makes sense to match the
fseek/ftell semantics of string ports to those of strings and the
implementation to those of strings.  You don't have anything to gain
from converting characters to bytes and back just because you can.

> For better or for worse, Guile's ports are fundamentally based on
> bytes,

Seriously?  The whole point of this issue was that fundamentally basing
GUILE's string ports on bytes is for worse.

> and allow mixed binary and textual operations on all ports.

I'll go out on a limb here and state "they don't".  They work with bytes
(either located on file or in some internally generated or consumed byte
vector) and they input/output characters on their Scheme side, and you
can change the en/decoding system with which characters are put into
the stream or consumed.  Their external side is identical to its
internal side, and the Scheme/character/string side is fundamentally
different.  By changing the port encoding, you can change the conversion
between Scheme on the one side and internal/external on the other.  All
operations are binary on the internal side, and textual on the Scheme
side.  That there are encodings which are less costly does not
fundamentally change this.

> Sometimes this is very helpful, for example when implementing HTTP.  I
> can think of one other case where it's very helpful:
>
> I don't know how deeply you've looked at UTF-8,

It is a somewhat safe bet that a person who is the head maintainer of an
application conversing in UTF-8 while using GUILE-1.8 in its internals
has had some basic amount of exposure to UTF-8.  In general, the working
assumption "David just has little clue about computing" is rarely
helpful for dismissing matters since David tends to have picked up
tidbits occasionally since he started computing on systems where
lowercase letters already needed a multi-sextet representation in their
60-bit words.

So it is a reasonably safe bet that when David has some problems with
matters, chances are that a non-negligible percentage of other users
will not fare significantly better, so it is a somewhat relevant
indicator what to avoid.

> but it has some unusual properties that allow many (most?) string
> algorithms to be most naturally (and efficiently) implemented by
> operating on bytes rather than code points.  Much of the time, you
> don't even have to be aware of the code point boundaries, which is a
> great savings.  Efficient lookup tables based on bytes are also much
> cheaper than ones based on code points, etc.

That's all very nice but totally irrelevant for this issue.  If you like
UTF-8, by all means base the internal string representation of GUILE on
it.  It comes at a cost since strings in Scheme are writable (and there
are more operations for doing so than in Elisp) and indexed by
character.  Emacs has paid this cost: I think the basic speed of Emacs
dropped by a factor of 2 when indexing was moved from bytes to
characters around Emacs 20.2 or similar.

But this issue is about not using different internal coding and exposed
interfaces for strings and string ports.  Whatever internal string
representation you choose, it does not make sense to pick a different
representation and indexing for string ports.

> In fact, I intend to propose that in a future version of Guile,
> strings will not only be based on UTF-8 internally, but that this fact
> should be exposed in the API, allowing users to implement UTF-8 string
> operations that operate on bytes not code points.

This experiment has been tried and crashed and burnt with the initial
MULE versions in Emacs 20.  Current versions _do_ offer conversion-less
reinterpretations string-as-unibyte and string-as-multibyte and offer
working with either string type.  As explained, that comes at the cost
of having to make all primitives able to work with either.  They are
actually rarely used by application level programmers, so most
applications do not have this as a porting problem between Emacs and
XEmacs (XEmacs has only multibyte strings).

Personally, I'd consider that worth the cost in the case of GUILE.
While XEmacs gets along without this addition, it seems important for
efficient passing of data in and out of GUILE.  It would also make sense
to distinguish between multibyte (internal form of UTF-8, anything may
happen if it is not properly formed) and external UTF-8 (reading/writing
it uses a conversion process turning all illegal UTF-8 bytes into some
reproducible representation).

> I'd also like lightweight, fast string ports that allow access to
> these bytes when desired.

Any string port that does not involve encoding/decoding will be
lightweight and fast, lighter and faster than any implementation having
to code/decode gratuitously.  Which is one of the points of this issue,
even though I am more concerned with the conceptual cost than the
runtime cost.  But both have an impact.

> This leads me to believe that it's a feature, not a bug, that string
> ports use UTF-8 internally, and that it's possible (via non-standard
> extensions) to get access to the underlying bytes.

Getting confused about bytes and characters and introducing unnecessary
conversions is not a feature.  Even if you at one time use a UTF-8
based string representation, working with external UTF-8 will involve
encoding/decoding processes.  Forcing a string port to encode/decode
during operation will remain expensive.  Exposing string internals
beyond quite special-purpose functions will be hard to deal with.

All those lessons have already been learnt with Emacs.  If you want to
relearn them from scratch, the available developer power will not make
basing Emacs on GUILE realistic in the next 10 years: Emacs
fundamentally operates with texts.  Too many reliability or efficiency
problems doing that (or having to implement them as foreign datatypes
altogether) will not make Guilemacs acceptable.

So even in cases where multiple strategies are feasible, it may make
sense to lean towards Emacs' choices.  One choice that has served Emacs
well is to hide its internal encoding system well from the external
ones.  That way its switch to an internal coding system based on UTF-8
affected almost no existing Elisp packages, and the programming model
was conceptually clean.

-- 
David Kastrup

Thread overview: 20+ messages
2014-09-21 23:34 bug#18520: string ports should not have an encoding David Kastrup
2014-09-22 11:54 ` Ludovic Courtès
2014-09-22 13:09   ` David Kastrup
2014-09-22 12:21 ` Ludovic Courtès
2014-09-22 13:34   ` David Kastrup
2014-09-22 17:08     ` Ludovic Courtès
2014-09-22 17:20       ` David Kastrup
2014-09-22 20:39         ` Ludovic Courtès
2014-09-22 22:12           ` David Kastrup
2014-09-23  8:25             ` Ludovic Courtès
2014-09-23  9:00               ` David Kastrup
2014-09-23  9:45                 ` Ludovic Courtès
2014-09-23 11:54                   ` David Kastrup
2014-09-23 12:13                     ` Ludovic Courtès
2014-09-23 13:02                       ` David Kastrup
2014-09-23 16:01                         ` Ludovic Courtès
2014-09-23 16:21                           ` David Kastrup
2014-09-23 19:33                             ` Ludovic Courtès
2014-09-24  5:30 ` Mark H Weaver
2014-09-24 12:00   ` David Kastrup