* `write-region' writes different bytes than passed to it?
@ 2018-12-11 12:30 Philipp Stephani
2018-12-11 12:42 ` Philipp Stephani
2018-12-11 15:41 ` Eli Zaretskii
0 siblings, 2 replies; 24+ messages in thread
From: Philipp Stephani @ 2018-12-11 12:30 UTC (permalink / raw)
To: help-gnu-emacs
Hi,
Usually `write-region' uses the coding system bound to
`coding-system-for-write'. However, I've found an input for which
this doesn't seem to hold:
$ emacs -Q -batch -eval '(let ((coding-system-for-write (quote utf-8-unix)))
    (write-region "\xC1\xB2" nil "/tmp/test.txt"))' && hd /tmp/test.txt
00000000 f2 |.|
00000001
That is, instead of the byte sequence C1 B2 it writes the single byte
F2, which is an invalid UTF-8 sequence. Is that expected?
Thanks,
Philipp
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: `write-region' writes different bytes than passed to it?
2018-12-11 12:30 `write-region' writes different bytes than passed to it? Philipp Stephani
@ 2018-12-11 12:42 ` Philipp Stephani
2018-12-11 15:42 ` Eli Zaretskii
2018-12-11 15:41 ` Eli Zaretskii
1 sibling, 1 reply; 24+ messages in thread
From: Philipp Stephani @ 2018-12-11 12:42 UTC (permalink / raw)
To: help-gnu-emacs
On Tue, 11 Dec 2018 at 13:30, Philipp Stephani
<p.stephani2@gmail.com> wrote:
>
> Hi,
>
> usually `write-region' uses the coding system bound to
> `coding-system-for-write'. However, I've found a case where this
> doesn't seem to be the case:
>
> $ emacs -Q -batch -eval '(let ((coding-system-for-write (quote
> utf-8-unix))) (write-region "\xC1\xB2" nil "/tmp/test.txt"))' && hd
> /tmp/test.txt
> 00000000 f2 |.|
> 00000001
>
> That is, instead of the byte sequence C1 B2 it writes the single byte
> F2, which is an invalid UTF-8 sequence. Is that expected?
I've realized that I can use either string-to-multibyte or
string-as-multibyte to force writing the expected bytes. Still it
seems weird that when confronted with an invalid UTF-8 sequence
`write-region' occasionally writes a *different* invalid UTF-8
sequence.
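A minimal sketch of the `string-to-multibyte' workaround, using
`encode-coding-string' instead of a file so that it can be evaluated
directly. The helper name `my-encode-raw-bytes' is hypothetical. The
assumption is that `string-to-multibyte' turns each non-ASCII byte into
a raw-byte character, and that each raw-byte character is emitted as its
single byte when encoding:

```elisp
;; Hypothetical helper, for illustration only.
(defun my-encode-raw-bytes (bytes)
  "Encode unibyte BYTES as UTF-8 after converting with `string-to-multibyte'."
  (encode-coding-string (string-to-multibyte bytes) 'utf-8-unix))

;; Both bytes are raw-byte characters after the conversion, so the
;; encoded result should be the same two bytes as the input.
(equal (my-encode-raw-bytes "\xC1\xB2") "\xC1\xB2")
```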
* Re: `write-region' writes different bytes than passed to it?
2018-12-11 12:30 `write-region' writes different bytes than passed to it? Philipp Stephani
2018-12-11 12:42 ` Philipp Stephani
@ 2018-12-11 15:41 ` Eli Zaretskii
2018-12-11 16:36 ` Stefan Monnier
2018-12-22 22:58 ` Philipp Stephani
1 sibling, 2 replies; 24+ messages in thread
From: Eli Zaretskii @ 2018-12-11 15:41 UTC (permalink / raw)
To: help-gnu-emacs
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Tue, 11 Dec 2018 13:30:07 +0100
>
> usually `write-region' uses the coding system bound to
> `coding-system-for-write'. However, I've found a case where this
> doesn't seem to be the case:
>
> $ emacs -Q -batch -eval '(let ((coding-system-for-write (quote
> utf-8-unix))) (write-region "\xC1\xB2" nil "/tmp/test.txt"))' && hd
> /tmp/test.txt
> 00000000 f2 |.|
> 00000001
>
> That is, instead of the byte sequence C1 B2 it writes the single byte
> F2, which is an invalid UTF-8 sequence. Is that expected?
Yes, because "\xC1\xB2" just happens to be the internal multibyte
representation of a raw-byte F2. Raw bytes are always converted to
their single-byte values on output, regardless of the encoding you
request.
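The relationship between the byte F2 and the two-byte sequence C1 B2 can
be inspected from Lisp. This is only a sketch; the function name
`my-raw-byte-facts' is hypothetical, and it uses standard primitives
(`string-to-multibyte', `string-bytes', `char-charset',
`multibyte-char-to-unibyte') to look at a raw byte held in a multibyte
string:

```elisp
;; Hypothetical helper, for illustration only.
(defun my-raw-byte-facts ()
  "Return a list of facts about raw byte F2 in unibyte and multibyte form."
  (let* ((raw (string-to-multibyte "\xF2")) ; multibyte string holding raw byte F2
         (ch  (aref raw 0)))                ; the raw-byte character itself
    (list (multibyte-string-p "\xC1\xB2")   ; nil: the literal from the report is unibyte
          (multibyte-string-p raw)          ; t
          (length raw)                      ; 1 character ...
          (string-bytes raw)                ; ... occupying 2 bytes internally
          (char-charset ch)                 ; eight-bit
          (multibyte-char-to-unibyte ch)))) ; 242, i.e. #xF2

(my-raw-byte-facts)
```

The point of the sketch is that the string literal in the report is a
unibyte string whose two bytes merely coincide with the internal form of
a single raw-byte character.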
* Re: `write-region' writes different bytes than passed to it?
2018-12-11 12:42 ` Philipp Stephani
@ 2018-12-11 15:42 ` Eli Zaretskii
0 siblings, 0 replies; 24+ messages in thread
From: Eli Zaretskii @ 2018-12-11 15:42 UTC (permalink / raw)
To: help-gnu-emacs
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Tue, 11 Dec 2018 13:42:59 +0100
>
> Still it seems weird that when confronted with an invalid UTF-8
> sequence `write-region' occasionally writes a *different* invalid
> UTF-8 sequence.
The internal representation of characters is not UTF-8, it is a
superset of UTF-8. So some sequences that are invalid UTF-8 are valid
for the internal representation.
* Re: `write-region' writes different bytes than passed to it?
2018-12-11 15:41 ` Eli Zaretskii
@ 2018-12-11 16:36 ` Stefan Monnier
2018-12-11 18:05 ` Eli Zaretskii
2018-12-22 22:59 ` Philipp Stephani
2018-12-22 22:58 ` Philipp Stephani
1 sibling, 2 replies; 24+ messages in thread
From: Stefan Monnier @ 2018-12-11 16:36 UTC (permalink / raw)
To: help-gnu-emacs
> Yes, because "\xC1\xB2" just happens to be the internal multibyte
> representation of a raw-byte F2. Raw bytes are always converted to
> their single-byte values on output, regardless of the encoding you
> request.
Maybe we shouldn't encode unibyte strings (under the assumption
that a unibyte string is already encoded: it's a sequence of bytes
rather than a sequence of chars).
Stefan
* Re: `write-region' writes different bytes than passed to it?
2018-12-11 16:36 ` Stefan Monnier
@ 2018-12-11 18:05 ` Eli Zaretskii
2018-12-11 19:47 ` Stefan Monnier
2018-12-22 23:13 ` Philipp Stephani
2018-12-22 22:59 ` Philipp Stephani
1 sibling, 2 replies; 24+ messages in thread
From: Eli Zaretskii @ 2018-12-11 18:05 UTC (permalink / raw)
To: help-gnu-emacs
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Tue, 11 Dec 2018 11:36:13 -0500
>
> > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > representation of a raw-byte F2. Raw bytes are always converted to
> > their single-byte values on output, regardless of the encoding you
> > request.
>
> Maybe we shouldn't encode unibyte strings (under the assumption
> that a unibyte string is already encoded: it's a sequence of bytes
> rather than a sequence of chars).
I'm not sure that single use case is important enough to change
something that was working like that since Emacs 23. Who knows how
many more important use cases this will break?
This whole area is crawling with heuristics, whose only justification
is that it does TRT in the vast majority of use cases.
* Re: `write-region' writes different bytes than passed to it?
2018-12-11 18:05 ` Eli Zaretskii
@ 2018-12-11 19:47 ` Stefan Monnier
2018-12-22 23:16 ` Philipp Stephani
2018-12-22 23:13 ` Philipp Stephani
1 sibling, 1 reply; 24+ messages in thread
From: Stefan Monnier @ 2018-12-11 19:47 UTC (permalink / raw)
To: help-gnu-emacs
> I'm not sure that single use case is important enough to change
> something that was working like that since Emacs 23. Who knows how
> many more important use cases this will break?
Oh, indeed, especially since it sounds to me like the problem is in the
original code (if you don't want to change bytes, then use a `binary`
encoding rather than utf-8).
> This whole area is crawling with heuristics, whose only justification
> is that it does TRT in the vast majority of use cases.
Exactly: I think we should try and get rid of those heuristics
(progressively). Actually, it's already what we've been doing since
Emacs-20, tho "lately" the progression in this respect has slowed
down I think.
Stefan
* Re: `write-region' writes different bytes than passed to it?
2018-12-11 15:41 ` Eli Zaretskii
2018-12-11 16:36 ` Stefan Monnier
@ 2018-12-22 22:58 ` Philipp Stephani
2018-12-23 15:20 ` Eli Zaretskii
2018-12-24 4:27 ` Stefan Monnier
1 sibling, 2 replies; 24+ messages in thread
From: Philipp Stephani @ 2018-12-22 22:58 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: help-gnu-emacs
On Tue, 11 Dec 2018 at 16:52, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Tue, 11 Dec 2018 13:30:07 +0100
> >
> > usually `write-region' uses the coding system bound to
> > `coding-system-for-write'. However, I've found a case where this
> > doesn't seem to be the case:
> >
> > $ emacs -Q -batch -eval '(let ((coding-system-for-write (quote
> > utf-8-unix))) (write-region "\xC1\xB2" nil "/tmp/test.txt"))' && hd
> > /tmp/test.txt
> > 00000000 f2 |.|
> > 00000001
> >
> > That is, instead of the byte sequence C1 B2 it writes the single byte
> > F2, which is an invalid UTF-8 sequence. Is that expected?
>
> Yes, because "\xC1\xB2" just happens to be the internal multibyte
> representation of a raw-byte F2. Raw bytes are always converted to
> their single-byte values on output, regardless of the encoding you
> request.
>
Is that documented somewhere?
Or, in other words, what are the semantics of
(let ((coding-system-for-write 'utf-8-unix)) (write-region STRING ...))
?
There are two easy cases:
1. STRING is a unibyte string containing only bytes within the ASCII range
2. STRING is a multibyte string containing only Unicode scalar values
In those cases the answer is simple: The form writes the UTF-8
representation of STRING.
However, the interesting cases are as follows:
3. STRING is a unibyte string with at least one byte outside the ASCII range
4. STRING is a multibyte string with at least one element that is not
a Unicode scalar value
My example is an instance of (3). I admit I haven't read the entire
Emacs Lisp reference manual, but quite some parts of it, and I
couldn't find a description of the cases (3) and (4). Naively there
are a couple options:
- Signal an error. That would seem appropriate as such strings can't
be encoded as UTF-8. However, evidently Emacs doesn't do this.
- For case 3, write the bytes in STRING, ignoring the coding system. I
had expected this to happen, but apparently it isn't the case either.
* Re: `write-region' writes different bytes than passed to it?
2018-12-11 16:36 ` Stefan Monnier
2018-12-11 18:05 ` Eli Zaretskii
@ 2018-12-22 22:59 ` Philipp Stephani
1 sibling, 0 replies; 24+ messages in thread
From: Philipp Stephani @ 2018-12-22 22:59 UTC (permalink / raw)
To: Stefan Monnier; +Cc: help-gnu-emacs
On Tue, 11 Dec 2018 at 17:50, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:
>
> > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > representation of a raw-byte F2. Raw bytes are always converted to
> > their single-byte values on output, regardless of the encoding you
> > request.
>
> Maybe we shouldn't encode unibyte strings (under the assumption
> that a unibyte string is already encoded: it's a sequence of bytes
> rather than a sequence of chars).
>
That's what I'd expect (either this, or a signal).
* Re: `write-region' writes different bytes than passed to it?
2018-12-11 18:05 ` Eli Zaretskii
2018-12-11 19:47 ` Stefan Monnier
@ 2018-12-22 23:13 ` Philipp Stephani
1 sibling, 0 replies; 24+ messages in thread
From: Philipp Stephani @ 2018-12-22 23:13 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: help-gnu-emacs
On Tue, 11 Dec 2018 at 19:41, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > From: Stefan Monnier <monnier@iro.umontreal.ca>
> > Date: Tue, 11 Dec 2018 11:36:13 -0500
> >
> > > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > > representation of a raw-byte F2. Raw bytes are always converted to
> > > their single-byte values on output, regardless of the encoding you
> > > request.
> >
> > Maybe we shouldn't encode unibyte strings (under the assumption
> > that a unibyte string is already encoded: it's a sequence of bytes
> > rather than a sequence of chars).
>
> I'm not sure that single use case is important enough to change
> something that was working like that since Emacs 23. Who knows how
> many more important use cases this will break?
It's important for correctness and for actually describing what "encoding" does.
>
> This whole area is crawling with heuristics, whose only justification
> is that it does TRT in the vast majority of use cases.
>
Why should this be the right thing, what use case should it cover? Do
we expect users to explicitly put the byte sequences for the
(undocumented) internal encoding into unibyte strings? Shouldn't we
rather expect that users want to write unibyte strings as is, in all
cases?
* Re: `write-region' writes different bytes than passed to it?
2018-12-11 19:47 ` Stefan Monnier
@ 2018-12-22 23:16 ` Philipp Stephani
0 siblings, 0 replies; 24+ messages in thread
From: Philipp Stephani @ 2018-12-22 23:16 UTC (permalink / raw)
To: Stefan Monnier; +Cc: help-gnu-emacs
On Tue, 11 Dec 2018 at 20:53, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:
>
> > I'm not sure that single use case is important enough to change
> > something that was working like that since Emacs 23. Who knows how
> > many more important use cases this will break?
>
> Oh, indeed, especially since it sounds to me like the problem is in the
> original code (if you don't want to change bytes, then use a `binary`
> encoding rather than utf-8).
That wouldn't work with multibyte strings, right? Because they need to
be encoded.
>
> > This whole area is crawling with heuristics, whose only justification
> > is that it does TRT in the vast majority of use cases.
>
> Exactly: I think we should try and get rid of those heuristics
> (progressively). Actually, it's already what we've been doing since
> Emacs-20, tho "lately" the progression in this respect has slowed
> down I think.
>
I'd definitely welcome any simplification in this area. There seems to
be a lot of incidental complexity and undocumented corner cases here.
* Re: `write-region' writes different bytes than passed to it?
2018-12-22 22:58 ` Philipp Stephani
@ 2018-12-23 15:20 ` Eli Zaretskii
2019-02-10 19:06 ` Philipp Stephani
2018-12-24 4:27 ` Stefan Monnier
1 sibling, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2018-12-23 15:20 UTC (permalink / raw)
To: help-gnu-emacs
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sat, 22 Dec 2018 23:58:07 +0100
> Cc: help-gnu-emacs <help-gnu-emacs@gnu.org>
>
> > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > representation of a raw-byte F2. Raw bytes are always converted to
> > their single-byte values on output, regardless of the encoding you
> > request.
> >
>
> Is that documented somewhere?
Which part(s)?
> Or, in other words, what are the semantics of
>
> (let ((coding-system-for-write 'utf-8-unix)) (write-region STRING ...))
>
> ?
>
> There are two easy cases:
> 1. STRING is a unibyte string containing only bytes within the ASCII range
> 2. STRING is a multibyte string containing only Unicode scalar values
> In those cases the answer is simple: The form writes the UTF-8
> representation of STRING.
> However, the interesting cases are as follows:
> 3. STRING is a unibyte string with at least one byte outside the ASCII range
> 4. STRING is a multibyte string with at least one element that is not
> a Unicode scalar value
You are actually asking what code conversion does in these cases, so
let's limit the discussion to that part. write-region is not really
relevant here.
One technicality before I answer the question: there are no "Unicode
scalar values" in Emacs strings and buffers. The internal
representation is a multibyte one, so any non-ASCII character, be it a
valid Unicode character or a raw byte, is always stored as a multibyte
sequence. So let's please use a less confusing wording, like
"strictly valid UTF-8 sequence" or something to that effect.
> My example is an instance of (3). I admit I haven't read the entire
> Emacs Lisp reference manual, but quite some parts of it, and I
> couldn't find a description of the cases (3) and (4). Naively there
> are a couple options:
> - Signal an error. That would seem appropriate as such strings can't
> be encoded as UTF-8. However, evidently Emacs doesn't do this.
> - For case 3, write the bytes in STRING, ignoring the coding system. I
> had expected this to happen, but apparently it isn't the case either.
IMO, doing encoding on unibyte strings invokes undefined behavior,
since encoding is only defined for multibyte strings. Admittedly, we
don't say that explicitly (we could if that's deemed important), but
the entire description in "Coding System Basics" makes no sense
without this assumption, and even hints at that indirectly:
The coding system ‘raw-text’ is special in that it prevents character
code conversion, and causes the buffer visited with this coding system
to be a unibyte buffer. For historical reasons, you can save both
unibyte and multibyte text with this coding system.
The last sentence implicitly tells you that coding systems other than
raw-text (with the exception of no-conversion, described in the very
next paragraph) can only be meaningfully used when writing multibyte
text.
Since this is undefined behavior, Emacs can do anything that best
suits the relevant use cases. What it actually does is convert raw
bytes from their internal two-byte representation to a single byte.
Emacs jumps through many hoops to avoid exposing the internal
multibyte representation of raw bytes outside of buffers and strings,
and this is one of those hoops. That's because exposing that internal
representation is considered to be corruption of the original byte
stream, and is not generally useful.
Signaling an error in this situation is also not useful, because it
turns out many Lisp programs did this kind of thing in the past (Gnus
is a notable example), and undoubtedly quite a few still do.
Emacs handles this case like it does because many years of bitter
experience have taught us that this suits best the use cases we want
to support. In particular, signaling errors when encountering invalid
UTF-8 sequences is a bad idea in a text-editing application, where
users expect an arbitrary byte stream to pass unscathed from input to
output. This is why Emacs is decades ahead of other similar systems,
such as Guile, which still throw exceptions in such cases (and claim
that they are "correct").
> > I'm not sure that single use case is important enough to change
> > something that was working like that since Emacs 23. Who knows how
> > many more important use cases this will break?
>
> It's important for correctness and for actually describing what "encoding" does.
So does labeling this as undefined behavior, which is what it is. We
don't really need to describe undefined behavior in detail, because
Lisp programs shouldn't do that.
> Do we expect users to explicitly put the byte sequences for the
> (undocumented) internal encoding into unibyte strings? Shouldn't we
> rather expect that users want to write unibyte strings as is, in all
> cases?
To avoid the undefined behavior, a Lisp program should never try to
encode a unibyte string with anything other than no-conversion or
raw-text (the latter also allows the application to convert EOL
format, if that is desired). IOW, you should have used either
raw-text-unix or no-conversion in your example, not utf-8.
> > Oh, indeed, especially since it sounds to me like the problem is in the
> > original code (if you don't want to change bytes, then use a `binary`
> > encoding rather than utf-8).
>
> That wouldn't work with multibyte strings, right? Because they need to
> be encoded.
You can detect when a string is a unibyte string with
multibyte-string-p, if your application needs to handle both unibyte
and multibyte strings. For unibyte strings, use only raw-text or
no-conversion.
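The advice above can be sketched as a small helper that picks the coding
system from the kind of string it is given. The name
`my-write-string-to-file' and the choice of `utf-8-unix' for multibyte
text are assumptions made for illustration; only `multibyte-string-p',
`no-conversion' and `write-region' come from the discussion:

```elisp
;; Hypothetical helper, for illustration only.
(defun my-write-string-to-file (string file)
  "Write STRING to FILE.
A unibyte STRING is written byte for byte with `no-conversion';
a multibyte STRING is encoded as UTF-8 (an assumption of this sketch)."
  (let ((coding-system-for-write
         (if (multibyte-string-p string)
             'utf-8-unix
           'no-conversion)))
    ;; A non-nil, non-t, non-string VISIT argument suppresses the
    ;; "Wrote file" message.
    (write-region string nil file nil 'silent)))
```

With this helper, the unibyte string "\xC1\xB2" from the original report
would be expected to reach the file as the two bytes C1 B2, while a
multibyte string still goes through UTF-8 encoding.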
> > Exactly: I think we should try and get rid of those heuristics
> > (progressively). Actually, it's already what we've been doing since
> > Emacs-20, tho "lately" the progression in this respect has slowed
> > down I think.
>
> I'd definitely welcome any simplification in this area. There seems to
> be a lot of incidental complexity and undocumented corner cases here.
AFAIK, all of those heuristics are in the undefined behavior
department. Lisp programs are well advised to stay away from that.
If Lisp programs do stay away, they will never need to deal with the
complexity and the undocumented corner cases.
We keep the current behavior for backward compatibility, and for this
reason I would suggest avoiding changes in this area unless we have a
very good reason for a change. What was the reason you needed to
write something like the original snippet?
* Re: `write-region' writes different bytes than passed to it?
2018-12-22 22:58 ` Philipp Stephani
2018-12-23 15:20 ` Eli Zaretskii
@ 2018-12-24 4:27 ` Stefan Monnier
2019-02-10 19:15 ` Philipp Stephani
1 sibling, 1 reply; 24+ messages in thread
From: Stefan Monnier @ 2018-12-24 4:27 UTC (permalink / raw)
To: help-gnu-emacs
> There are two easy cases:
> 1. STRING is a unibyte string containing only bytes within the ASCII range
> 2. STRING is a multibyte string containing only Unicode scalar values
> In those cases the answer is simple: The form writes the UTF-8
> representation of STRING.
Not sure what you mean by "unicode scalar values", but a multibyte
string is a sequence of chars, i.e. a sequence of char codes (integers).
And utf-8 is a way to encode a sequence of integer char codes into
a sequence of bytes.
So your sample code will pretty much always write the utf-8
representation of the multibyte string.
[ The only exception is when the multibyte string contains chars in the
eight-bit charset, because those are supposed to stand for raw bytes.
This exception is used to make sure that if you read a file using
the utf-8 coding-system and the file's content is not valid utf-8,
writing the buffer will still generate the exact same byte sequence. ]
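The round trip described in the bracketed note can be sketched with
strings instead of files. The helper name `my-utf-8-round-trip' is
hypothetical; it decodes a unibyte byte sequence as UTF-8 and encodes the
result again, so that invalid bytes pass through the eight-bit charset
and back:

```elisp
;; Hypothetical helper, for illustration only.
(defun my-utf-8-round-trip (bytes)
  "Decode unibyte BYTES as UTF-8, then encode the result as UTF-8 again."
  (encode-coding-string (decode-coding-string bytes 'utf-8-unix)
                        'utf-8-unix))

;; F2 followed by a space is not valid UTF-8, so it is decoded to a
;; raw-byte character and should be written back as the same byte.
(equal (my-utf-8-round-trip "abc \xF2 xyz") "abc \xF2 xyz")
```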
> However, the interesting cases are as follows:
> 3. STRING is a unibyte string with at least one byte outside the ASCII range
I don't think this case is clearly documented, indeed.
I believe what happens currently is that Emacs looks at the byte
sequence in the unibyte string as if it was the internal representation
of a multibyte string. Changing behavior (e.g. by simply outputting the
bytes unchanged like I suggested) will likely affect some code out there
somewhere. I think it'd be a good change, tho, because I think that any
code thus affected is likely buggy and needs to be fixed anyway (and
actually that change might be the fix the code needs).
What makes this question a bit more tricky is that when a string is all
ASCII, Emacs tends to choose rather arbitrarily between unibyte
and multibyte. But if we decide that coding-system doesn't affect
unibyte strings, then we get into trouble with
(let ((coding-system-for-write 'ebcdic-int)) (write-region STRING ...))
since for a purely ASCII string, we still need to do a conversion,
so we'd need to be more careful about the distinction between unibyte and
multibyte ASCII strings.
Maybe we should just drop support for coding systems that aren't
supersets of ASCII and be done with it, but I'm not sure we're ready to
do that.
Stefan
* Re: `write-region' writes different bytes than passed to it?
2018-12-23 15:20 ` Eli Zaretskii
@ 2019-02-10 19:06 ` Philipp Stephani
2019-02-10 20:05 ` Eli Zaretskii
0 siblings, 1 reply; 24+ messages in thread
From: Philipp Stephani @ 2019-02-10 19:06 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: help-gnu-emacs
On Sun, 23 Dec 2018 at 16:21, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 22 Dec 2018 23:58:07 +0100
> > Cc: help-gnu-emacs <help-gnu-emacs@gnu.org>
> >
> > > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > > representation of a raw-byte F2. Raw bytes are always converted to
> > > their single-byte values on output, regardless of the encoding you
> > > request.
> > >
> >
> > Is that documented somewhere?
>
> Which part(s)?
All of it? ;)
Basically, "what is the behavior of write-region".
>
> > Or, in other words, what are the semantics of
> >
> > (let ((coding-system-for-write 'utf-8-unix)) (write-region STRING ...))
> >
> > ?
> >
> > There are two easy cases:
> > 1. STRING is a unibyte string containing only bytes within the ASCII range
> > 2. STRING is a multibyte string containing only Unicode scalar values
> > In those cases the answer is simple: The form writes the UTF-8
> > representation of STRING.
> > However, the interesting cases are as follows:
> > 3. STRING is a unibyte string with at least one byte outside the ASCII range
> > 4. STRING is a multibyte string with at least one element that is not
> > a Unicode scalar value
>
> You are actually asking what code conversion does in these cases, so
> let's limit the discussion to that part. write-region is not really
> relevant here.
>
> One technicality before I answer the question: there are no "Unicode
> scalar values" in Emacs strings and buffers. The internal
> representation is a multibyte one, so any non-ASCII character, be it a
> valid Unicode character or a raw byte, is always stored as a multibyte
> sequence. So let's please use a less confusing wording, like
> "strictly valid UTF-8 sequence" or something to that effect.
I don't think we should change the terminology. Emacs multibyte
strings are sequences of integers (in most cases, scalar values), not
UTF-8 strings. They are internally represented as byte arrays, but
that's a different story.
>
> > My example is an instance of (3). I admit I haven't read the entire
> > Emacs Lisp reference manual, but quite some parts of it, and I
> > couldn't find a description of the cases (3) and (4). Naively there
> > are a couple options:
> > - Signal an error. That would seem appropriate as such strings can't
> > be encoded as UTF-8. However, evidently Emacs doesn't do this.
> > - For case 3, write the bytes in STRING, ignoring the coding system. I
> > had expected this to happen, but apparently it isn't the case either.
>
> IMO, doing encoding on unibyte strings invokes undefined behavior,
> since encoding is only defined for multibyte strings.
That is very unfortunate. Is there any hope we can get out of that situation?
> Admittedly, we
> don't say that explicitly (we could if that's deemed important), but
> the entire description in "Coding System Basics" makes no sense
> without this assumption, and even hints at that indirectly:
>
> The coding system ‘raw-text’ is special in that it prevents character
> code conversion, and causes the buffer visited with this coding system
> to be a unibyte buffer. For historical reasons, you can save both
> unibyte and multibyte text with this coding system.
>
> The last sentence implicitly tells you that coding systems other than
> raw-text (with the exception of no-conversion, described in the very
> next paragraph) can only be meaningfully used when writing multibyte
> text.
That's true, but very subtle. You first have to read the description
of a certain encoding to figure out how other encodings behave.
>
> Since this is undefined behavior, Emacs can do anything that best
> suits the relevant use cases. What it actually does is convert raw
> bytes from their internal two-byte representation to a single byte.
> Emacs jumps through many hoops to avoid exposing the internal
> multibyte representation of raw bytes outside of buffers and strings,
> and this is one of those hoops. That's because exposing that internal
> representation is considered to be corruption of the original byte
> stream, and is not generally useful.
But in this question there is never any internal representation, just
a byte array that happens to match the internal representation of
something else.
>
> Signaling an error in this situation is also not useful, because it
> turns out many Lisp programs did this kind of thing in the past (Gnus
> is a notable example), and undoubtedly quite a few still do.
Well, if the behavior is unspecified, then signaling an error would
absolutely be a legal (and even expected) behavior.
>
> Emacs handles this case like it does because many years of bitter
> experience have taught us that this suits best the use cases we want
> to support. In particular, signaling errors when encountering invalid
> UTF-8 sequences is a bad idea in a text-editing application, where
> users expect an arbitrary byte stream to pass unscathed from input to
> output. This is why Emacs is decades ahead of other similar systems,
> such as Guile, which still throw exceptions in such cases (and claim
> that they are "correct").
I'm not saying that Emacs should necessarily start signaling errors when
visiting files with invalid UTF-8 sequences. That it degrades
gracefully in this case is very welcome and user-friendly.
But visiting a file can't result in a call to write-region with a
unibyte string, right?
>
> > > I'm not sure that single use case is important enough to change
> > > something that was working like that since Emacs 23. Who knows how
> > > many more important use cases this will break?
> >
> > It's important for correctness and for actually describing what "encoding" does.
>
> So does labeling this as undefined behavior, which is what it is. We
> don't really need to describe undefined behavior in detail, because
> Lisp programs shouldn't do that.
Rather than describing it in detail, it should be removed. Unspecified
behavior makes a programming system hard to use and reason about.
>
> > Do we expect users to explicitly put the byte sequences for the
> > (undocumented) internal encoding into unibyte strings? Shouldn't we
> > rather expect that users want to write unibyte strings as is, in all
> > cases?
>
> To avoid the undefined behavior, a Lisp program should never try to
> encode a unibyte string with anything other than no-conversion or
> raw-text (the latter also allows the application to convert EOL
> format, if that is desired). IOW, you should have used either
> raw-text-unix or no-conversion in your example, not utf-8.
If Lisp code shouldn't try that, then the encoding functions should
signal an error on such cases.
>
> > > Oh, indeed, especially since it sounds to me like the problem is in the
> > > original code (if you don't want to change bytes, then use a `binary`
> > > encoding rather than utf-8).
> >
> > That wouldn't work with multibyte strings, right? Because they need to
> > be encoded.
>
> You can detect when a string is a unibyte string with
> multibyte-string-p, if your application needs to handle both unibyte
> and multibyte strings. For unibyte strings, use only raw-text or
> no-conversion.
I get that, but this is too subtle and nontrivial.
>
> > > Exactly: I think we should try and get rid of those heuristics
> > > (progressively). Actually, it's already what we've been doing since
> > > Emacs-20, tho "lately" the progression in this respect has slowed
> > > down I think.
> >
> > I'd definitely welcome any simplification in this area. There seems to
> > be a lot of incidental complexity and undocumented corner cases here.
>
> AFAIK, all of those heuristics are in the undefined behavior
> department. Lisp programs are well advised to stay away from that.
> If Lisp programs do stay away, they will never need to deal with the
> complexity and the undocumented corner cases.
You can't tell programmers to stay away from something. Either it
should work as documented or signal an error. Silently doing the wrong
thing is the worst choice.
>
> We keep the current behavior for backward compatibility, and for this
> reason I would suggest avoiding changes in this area unless we have a
> very good reason for a change. What was the reason you needed to
> write something like the original snippet?
>
I'm writing a function to write an arbitrary string to a file. This
should be trivial, but as you can see, it isn't.
* Re: `write-region' writes different bytes than passed to it?
2018-12-24 4:27 ` Stefan Monnier
@ 2019-02-10 19:15 ` Philipp Stephani
2019-02-10 20:13 ` Eli Zaretskii
2019-02-10 22:25 ` Stefan Monnier
0 siblings, 2 replies; 24+ messages in thread
From: Philipp Stephani @ 2019-02-10 19:15 UTC (permalink / raw)
To: Stefan Monnier; +Cc: help-gnu-emacs
Am Mo., 24. Dez. 2018 um 05:28 Uhr schrieb Stefan Monnier
<monnier@iro.umontreal.ca>:
>
> > There are two easy cases:
> > 1. STRING is a unibyte string containing only bytes within the ASCII range
> > 2. STRING is a multibyte string containing only Unicode scalar values
> > In those cases the answer is simple: The form writes the UTF-8
> > representation of STRING.
>
> Not sure what you mean by "unicode scalar values"
What the Unicode standard says :)
> but a multibyte
> string is a sequence of chars, i.e. a sequence of char codes (integers)
> And utf-8 is a way to encode a sequence of integer char codes into
> a sequence of bytes.
"Character" is an underspecified term, therefore I generally try to avoid it.
To recap: An Emacs Lisp multibyte string is a sequence of integers of
a certain range. The range is a superset of the set of Unicode scalar
values.
>
> So your sample code will pretty much always write the utf-8
> representation of the multibyte string.
>
> [ The only exception is when the multibyte string contains chars in the
> eight-bit charset, because those are supposed to stand for raw bytes.
> This exception is used to make sure that if you read a file using
> the utf-8 coding-system and the file's content is not valid utf-8,
> writing the buffer will still generate the exact same byte sequence. ]
>
> > However, the interesting cases are as follows:
> > 3. STRING is a unibyte string with at least one byte outside the ASCII range
>
> I don't think this case is clearly documented, indeed.
>
> I believe what happens currently is that Emacs looks at the byte
> sequence in the unibyte string as if it was the internal representation
> of a multibyte string. Changing behavior (e.g. by simply outputting the
> bytes unchanged like I suggested) will likely affect some code out there
> somewhere. I think it'd be a good change, tho, because I think that any
> code thus affected is likely buggy and needs to be fixed anyway (and
> actually that change might be the fix the code needs).
>
> What makes this question a bit more tricky is that when a string is all
> ASCII, Emacs tends to choose rather arbitrarily between unibyte
> and multibyte. But if we decide that coding-system doesn't affect
> unibyte strings, then we get into trouble with
>
> (let ((coding-system-for-write 'ebcdic-int)) (write-region STRING ...))
>
> since for a purely ASCII string, we still need to do a conversion,
> so we'd need to be more careful about the distinction between unibyte and
> multibyte ASCII strings.
>
> Maybe we should just drop support for coding systems that aren't
> supersets of ASCII and be done with it, but I'm not sure we're ready to
> do that.
>
That might be one option. Others might be:
1. Signal an error whenever Emacs attempts to encode a unibyte string
and the encoding isn't "raw-text" or "no-conversion"
2. Like (1), but only signal an error if the encoding isn't ASCII-compatible
* Re: `write-region' writes different bytes than passed to it?
2019-02-10 19:06 ` Philipp Stephani
@ 2019-02-10 20:05 ` Eli Zaretskii
0 siblings, 0 replies; 24+ messages in thread
From: Eli Zaretskii @ 2019-02-10 20:05 UTC (permalink / raw)
To: help-gnu-emacs
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sun, 10 Feb 2019 20:06:57 +0100
> Cc: help-gnu-emacs <help-gnu-emacs@gnu.org>
>
> > > > Yes, because "\xC1\xB2" just happens to be the internal multibyte
> > > > representation of a raw-byte F2. Raw bytes are always converted to
> > > > their single-byte values on output, regardless of the encoding you
> > > > request.
> > > >
> > >
> > > Is that documented somewhere?
> >
> > Which part(s)?
>
> All of it? ;)
> Basically, "what is the behavior of write-region".
Like I said, write-region is not relevant here, encoding is.
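To make the C1 B2 / F2 correspondence concrete, here is a small sketch.
The helper my-raw-byte-internal-bytes is hypothetical, and the bit
layout it uses is an assumption about the internal representation that
happens to agree with the bytes seen in this thread; the other calls
are ordinary Emacs functions:

```elisp
;; Assumption: a raw byte B in #x80..#xFF is stored internally as the
;; two bytes (#xC0 | bit 6 of B) and (#x80 | low six bits of B).
(defun my-raw-byte-internal-bytes (byte)
  "Return the two internal bytes assumed to represent raw BYTE."
  (list (logior #xC0 (logand (ash byte -6) 1))
        (logior #x80 (logand byte #x3F))))

(my-raw-byte-internal-bytes #xF2)   ; => (193 178), i.e. C1 B2

;; A literal made only of hex escapes is unibyte, so encoding it is
;; the problematic case from the original report.
(multibyte-string-p "\xC1\xB2")     ; => nil

;; Converting to multibyte first turns each byte into a raw-byte
;; character, which the encoder writes back out as that same byte.
(encode-coding-string (string-to-multibyte "\xC1\xB2") 'utf-8-unix)
```

The last form returns a unibyte string holding C1 B2, which matches the
string-to-multibyte workaround mentioned earlier in the thread.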
> > One technicality before I answer the question: there are no "Unicode
> > scalar values" in Emacs strings and buffers. The internal
> > representation is a multibyte one, so any non-ASCII character, be it a
> > valid Unicode character or a raw byte, is always stored as a multibyte
> > sequence. So let's please use a less confusing wording, like
> > "strictly valid UTF-8 sequence" or something to that effect.
>
> I don't think we should change the terminology. Emacs multibyte
> strings are sequences of integers
No, they are not. They are sequences of bytes (as evidenced by the
"multibyte" part) which represent sequences of Unicode codepoints.
The latter are scalar integers. But these scalars are not explicitly
present in the multibyte representation.
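Both views can be observed from Lisp. A short illustration (not from
the original messages): the character-level accessors report code
points, while string-bytes reports the size of the underlying
multibyte representation:

```elisp
(let ((s "\u00e9"))               ; LATIN SMALL LETTER E WITH ACUTE
  (list (multibyte-string-p s)    ; t
        (length s)                ; 1 character
        (string-bytes s)          ; 2 bytes in the internal representation
        (aref s 0)))              ; 233, the code point U+00E9
;; => (t 1 2 233)
```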
> > IMO, doing encoding on unibyte strings invokes undefined behavior,
> > since encoding is only defined for multibyte strings.
>
> That is very unfortunate. Is there any hope we can get out of that situation?
Unlikely.
> But in this question there is never any internal representation
Yes, there is: you have succeeded in using one of the few loopholes to
create such a byte sequence.
> > Signaling an error in this situation is also not useful, because it
> > turns out many Lisp programs did this kind of thing in the past (Gnus
> > is a notable example), and undoubtedly quite a few still do.
>
> Well, if the behavior is unspecified, then signaling an error would
> absolutely be a legal (and even expected) behavior.
It's possible, but not useful, so we don't do that.
> I'm not saying that Emacs should necessarily start signaling errors when
> visiting files with invalid UTF-8 sequences. That it degrades
> gracefully in this case is very welcome and user-friendly.
> But visiting a file can't result in a call to write-region with a
> unibyte string, right?
Why not? Of course it can: imagine that you modify some part of the
file's text that doesn't include raw undecoded bytes, then write the
result to a file. You will expect that portions of text you didn't
modify remain intact, right?
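The same round trip can be sketched at the string level, as an
illustration only (visiting and saving a file goes through buffers,
not through these calls). An undecodable byte survives decoding as a
raw byte and comes back unchanged when the edited text is encoded:

```elisp
(let* ((bytes (unibyte-string ?a #xF2 ?b))              ; F2 is not valid UTF-8 here
       (text (decode-coding-string bytes 'utf-8-unix))  ; multibyte, F2 kept as a raw byte
       (edited (concat "x" text)))                      ; modify a part without raw bytes
  (equal (encode-coding-string edited 'utf-8-unix)
         (unibyte-string ?x ?a #xF2 ?b)))
;; => t
```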
> > > It's important for correctness and for actually describing what "encoding" does.
> >
> > So does labeling this as undefined behavior, which is what it is. We
> > don't really need to describe undefined behavior in detail, because
> > Lisp programs shouldn't do that.
>
> Rather than describing it in detail, it should be removed. Unspecified
> behavior makes a programming system hard to use and reason about.
It cannot be removed. Raw bytes that cannot be decoded are a fact of
life, removing them will make Emacs a lame duck.
> > To avoid the undefined behavior, a Lisp program should never try to
> > encode a unibyte string with anything other than no-conversion or
> > raw-text (the latter also allows the application to convert EOL
> > format, if that is desired). IOW, you should have used either
> > raw-text-unix or no-conversion in your example, not utf-8.
>
> If Lisp code shouldn't try that, then the encoding functions should
> signal an error on such cases.
Signaling an error is not useful, so Emacs should not do that.
> > You can detect when a string is a unibyte string with
> > multibyte-string-p, if your application needs to handle both unibyte
> > and multibyte strings. For unibyte strings, use only raw-text or
> > no-conversion.
>
> I get that, but this is too subtle and nontrivial.
Then try not to write code that could bump into these subtleties. You
shouldn't need that.
> > AFAIK, all of those heuristics are in the undefined behavior
> > department. Lisp programs are well advised to stay away from that.
> > If Lisp programs do stay away, they will never need to deal with the
> > complexity and the undocumented corner cases.
>
> You can't tell programmers to stay away from something.
No, but I can advise them.
> Either it should work as documented or signal an error. Silently
> doing the wrong thing is the worst choice.
It doesn't do the wrong thing, it does the right thing: it stays out
of the hair of programmers who might need to write such stuff
(assuming they know what they are doing).
> > What was the reason you needed to write something like the
> > original snippet?
>
> I'm writing a function to write an arbitrary string to a file. This
> should be trivial, but as you can see, it isn't.
It wasn't a string, it was a sequence of bytes that cannot be
interpreted as a text string.
* Re: `write-region' writes different bytes than passed to it?
2019-02-10 19:15 ` Philipp Stephani
@ 2019-02-10 20:13 ` Eli Zaretskii
2019-02-10 22:25 ` Stefan Monnier
1 sibling, 0 replies; 24+ messages in thread
From: Eli Zaretskii @ 2019-02-10 20:13 UTC (permalink / raw)
To: help-gnu-emacs
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sun, 10 Feb 2019 20:15:57 +0100
> Cc: help-gnu-emacs <help-gnu-emacs@gnu.org>
>
> > > There are two easy cases:
> > > 1. STRING is a unibyte string containing only bytes within the ASCII range
> > > 2. STRING is a multibyte string containing only Unicode scalar values
> > > In those cases the answer is simple: The form writes the UTF-8
> > > representation of STRING.
> >
> > Not sure what you mean by "unicode scalar values"
>
> What the Unicode standard says :)
A multibyte Unicode string doesn't contain Unicode scalar values, it
contains their UTF-8 encoding.
> To recap: An Emacs Lisp multibyte string is a sequence of integers of
> a certain range.
No, it's a sequence of bytes that can be interpreted as representing a
sequence of integers.
> > Maybe we should just drop support for coding systems that aren't
> > supersets of ASCII and be done with it, but I'm not sure we're ready to
> > do that.
>
> That might be one option.
Just a month or two ago someone asked about one variation of EBCDIC
that we didn't support directly. So no, it's too early to drop them.
> 1. Signal an error whenever Emacs attempts to encode a unibyte string
> and the encoding isn't "raw-text" or "no-conversion"
> 2. Like (1), but only signal an error if the encoding isn't ASCII-compatible
Signaling an error in these cases is a non-starter. If you don't like
what Emacs does in these cases, just don't write such code. Emacs is
not a tool whose primary goal is educating novice programmers, it is
also an industry-strength system that allows doing low-level stuff
when needed. If we signal errors in those cases, we will throw out
valid use cases for no good reason.
* Re: `write-region' writes different bytes than passed to it?
2019-02-10 19:15 ` Philipp Stephani
2019-02-10 20:13 ` Eli Zaretskii
@ 2019-02-10 22:25 ` Stefan Monnier
2019-02-11 3:31 ` Eli Zaretskii
1 sibling, 1 reply; 24+ messages in thread
From: Stefan Monnier @ 2019-02-10 22:25 UTC (permalink / raw)
To: help-gnu-emacs
> 1. Signal an error whenever Emacs attempts to encode a unibyte string
> and the encoding isn't "raw-text" or "no-conversion"
Sounds good to me. I have similar extra checks in my local Emacs hacks
(as well as signaling errors when trying to decode a multibyte string).
They've helped me track down encoding problems in Gnus.
They rarely trigger nowadays, but it's likely because I've fixed most
occurrences in the Elisp code I happen to use. I'd be surprised if
there aren't any such problems lurking in many other places.
Often the corresponding code "works" in practice (i.e. circumstances
make it do the right thing even though in general it may break), and
fixing it so it doesn't trigger the check requires non-trivial changes.
IOW the tradeoff is not very good when it comes to motivating the
package's maintainer to fix his code (non-trivial rework which will
likely introduce new bugs in order to fix mostly hypothetical old bugs).
Stefan
* Re: `write-region' writes different bytes than passed to it?
2019-02-10 22:25 ` Stefan Monnier
@ 2019-02-11 3:31 ` Eli Zaretskii
2019-02-11 14:05 ` Stefan Monnier
0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2019-02-11 3:31 UTC (permalink / raw)
To: help-gnu-emacs
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Sun, 10 Feb 2019 17:25:04 -0500
>
> > 1. Signal an error whenever Emacs attempts to encode a unibyte string
> > and the encoding isn't "raw-text" or "no-conversion"
>
> Sounds good to me.
I object to such a change.
* Re: `write-region' writes different bytes than passed to it?
2019-02-11 3:31 ` Eli Zaretskii
@ 2019-02-11 14:05 ` Stefan Monnier
2019-02-11 16:37 ` Eli Zaretskii
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Monnier @ 2019-02-11 14:05 UTC (permalink / raw)
To: help-gnu-emacs
>> > 1. Signal an error whenever Emacs attempts to encode a unibyte string
>> > and the encoding isn't "raw-text" or "no-conversion"
>> Sounds good to me.
> I object to such a change.
I would too because of the breakage it can/will introduce.
But I still think it's a good idea ;-)
[ Lots of good ideas can't be applied, sadly. ]
Maybe in this specific case we could introduce a "strict encoding mode"
controlled by a config var.
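A rough sketch of what such a mode could look like at the Lisp level.
Everything named my-... is hypothetical; nothing like this exists in
Emacs, and a real implementation would presumably live in the C
encoder. The sketch also assumes that pure-ASCII unibyte strings
should stay exempt, since ASCII literals are often unibyte:

```elisp
(require 'cl-lib)

(defvar my-strict-encoding nil
  "Hypothetical switch: when non-nil, refuse to encode unibyte non-ASCII strings.")

(defun my-encode-coding-string (string coding-system)
  "Like `encode-coding-string', but honor `my-strict-encoding'."
  (when (and my-strict-encoding
             (not (multibyte-string-p string))
             ;; Assumed exemption: only bytes above 127 are suspicious.
             (cl-some (lambda (byte) (> byte 127)) string)
             ;; Compare the base so that EOL variants such as
             ;; raw-text-unix are accepted as well.
             (not (memq (coding-system-base coding-system)
                        '(raw-text no-conversion))))
    (error "Refusing to encode unibyte string with `%s'" coding-system))
  (encode-coding-string string coding-system))
```

With my-strict-encoding bound to t, encoding the unibyte "\xC1\xB2"
with utf-8-unix signals an error, while raw-text-unix and multibyte
input pass through unchanged.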
Stefan "not volunteering to write the patch"
* Re: `write-region' writes different bytes than passed to it?
2019-02-11 14:05 ` Stefan Monnier
@ 2019-02-11 16:37 ` Eli Zaretskii
2019-02-11 19:44 ` Stefan Monnier
0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2019-02-11 16:37 UTC (permalink / raw)
To: help-gnu-emacs
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Mon, 11 Feb 2019 09:05:25 -0500
>
> >> > 1. Signal an error whenever Emacs attempts to encode a unibyte string
> >> > and the encoding isn't "raw-text" or "no-conversion"
> >> Sounds good to me.
> > I object to such a change.
>
> I would too because of the breakage it can/will introduce.
> But I still think it's a good idea ;-)
> [ Lots of good ideas can't be applied, sadly. ]
Yes, in a better world, a better deity would be well advised to make
raw bytes unnecessary and non-existent.
* Re: `write-region' writes different bytes than passed to it?
2019-02-11 16:37 ` Eli Zaretskii
@ 2019-02-11 19:44 ` Stefan Monnier
2019-02-11 20:20 ` Eli Zaretskii
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Monnier @ 2019-02-11 19:44 UTC (permalink / raw)
To: help-gnu-emacs
> Yes, in a better world, a better deity would be well advised to make
> raw bytes unnecessary and non-existent.
I think raw bytes are OK. The problem here is the encoding of a unibyte
text (i.e. treating the unibyte text as holding chars rather than bytes).
The two are quite different: Raw bytes happen because of "faulty" data.
Encoding of unibyte text happens because of faulty *code* (the code
should either not encode, or should be using multibyte text instead).
To some extent both are unavoidable, but we have more control over the code
executed in Emacs than over the data it has to manipulate.
Stefan
* Re: `write-region' writes different bytes than passed to it?
2019-02-11 19:44 ` Stefan Monnier
@ 2019-02-11 20:20 ` Eli Zaretskii
2019-02-11 22:06 ` Stefan Monnier
0 siblings, 1 reply; 24+ messages in thread
From: Eli Zaretskii @ 2019-02-11 20:20 UTC (permalink / raw)
To: help-gnu-emacs
> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Mon, 11 Feb 2019 14:44:57 -0500
>
> > Yes, in a better world, a better deity would be well advised to make
> > raw bytes unnecessary and non-existent.
>
> I think raw bytes are OK. The problem here is the encoding of a unibyte
> text (i.e. treating the unibyte text as holding chars rather than bytes).
If you have raw bytes inside otherwise legible text, encoding that
text will have to encode the bytes as well. So as soon as you have
the former, you need to deal with the latter.
* Re: `write-region' writes different bytes than passed to it?
2019-02-11 20:20 ` Eli Zaretskii
@ 2019-02-11 22:06 ` Stefan Monnier
0 siblings, 0 replies; 24+ messages in thread
From: Stefan Monnier @ 2019-02-11 22:06 UTC (permalink / raw)
To: help-gnu-emacs
>> > Yes, in a better world, a better deity would be well advised to make
>> > raw bytes unnecessary and non-existent.
>> I think raw bytes are OK. The problem here is the encoding of a unibyte
>> text (i.e. treating the unibyte text as holding chars rather than bytes).
> If you have raw bytes inside otherwise legible text, encoding that
> text will have to encode the bytes as well.
But the "otherwise legible text" is multibyte text, so the semantics of
encoding that text is perfectly clear (regardless of whether it contains raw bytes).
What is not clear is the semantics of something like:
(encode-coding-string (encode-coding-string STR CODING1) CODING2)
> So as soon as you have the former, you need to deal with the latter.
In your example, encoding is not applied to unibyte text, only
to multibyte text. The raw-bytes there don't cause any trouble (they
just do the job they were designed to do).
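For contrast, a well-defined way to go from one encoding to another is
to decode back to multibyte text in between, so that each encoding
step only ever sees multibyte input. A small illustration, with the
coding systems chosen arbitrarily:

```elisp
(let* ((utf8-bytes (encode-coding-string "\u00e9" 'utf-8))   ; unibyte C3 A9
       (text (decode-coding-string utf8-bytes 'utf-8))       ; multibyte again
       (latin1-bytes (encode-coding-string text 'iso-latin-1)))
  (list (multibyte-string-p utf8-bytes)   ; nil
        (multibyte-string-p text)         ; t
        (append latin1-bytes nil)))       ; (233), the single byte E9
;; => (nil t (233))
```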
Stefan
Thread overview: 24+ messages
2018-12-11 12:30 `write-region' writes different bytes than passed to it? Philipp Stephani
2018-12-11 12:42 ` Philipp Stephani
2018-12-11 15:42 ` Eli Zaretskii
2018-12-11 15:41 ` Eli Zaretskii
2018-12-11 16:36 ` Stefan Monnier
2018-12-11 18:05 ` Eli Zaretskii
2018-12-11 19:47 ` Stefan Monnier
2018-12-22 23:16 ` Philipp Stephani
2018-12-22 23:13 ` Philipp Stephani
2018-12-22 22:59 ` Philipp Stephani
2018-12-22 22:58 ` Philipp Stephani
2018-12-23 15:20 ` Eli Zaretskii
2019-02-10 19:06 ` Philipp Stephani
2019-02-10 20:05 ` Eli Zaretskii
2018-12-24 4:27 ` Stefan Monnier
2019-02-10 19:15 ` Philipp Stephani
2019-02-10 20:13 ` Eli Zaretskii
2019-02-10 22:25 ` Stefan Monnier
2019-02-11 3:31 ` Eli Zaretskii
2019-02-11 14:05 ` Stefan Monnier
2019-02-11 16:37 ` Eli Zaretskii
2019-02-11 19:44 ` Stefan Monnier
2019-02-11 20:20 ` Eli Zaretskii
2019-02-11 22:06 ` Stefan Monnier