> > Date: Sun, 22 Nov 2015 14:56:12 +0000
> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com,
> > emacs-devel@gnu.org
> >
> > - The multibyte API should use an extension of UTF-8 to encode Emacs
> >   strings. The extension is the obvious one already in use in multiple
> >   places.
>
> It is only used in one place: the internal representation of
> characters in buffers and strings. Emacs _never_ lets this internal
> representation leak outside.
If I run this in *scratch*:
(with-temp-buffer
  (insert #x3fff40)
  (describe-char (point-min)))
then the resulting help buffer says "buffer code: #xF8 #x8F #xBF #xBD
#x80". Is that not considered a leak?
> In practice the last sentence means that
> text that Emacs encoded in UTF-8 will only include either valid UTF-8
> sequences of characters whose codepoints are below #x200000 or single
> bytes that don't belong to any UTF-8 sequence.
>
I get the same result as above when running
(with-temp-buffer
  (call-process "echo" nil t nil (string #x3fff40))
  (describe-char (point-min)))
which means the non-UTF-8 sequence even "leaks" to the external process!
>
> You are suggesting to expose the internal representation to outside
> application code, which predictably will cause that representation to
> leak into Lisp. That'd be a disaster. We had something like that
> back in the Emacs 20 era, and it took many years to plug those leaks.
> We would be making a grave mistake to go back there.
>
I don't suggest leaking anything that isn't already leaked. The extension
of the codespace to 22 bits is well documented.
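The 22-bit bound is even exposed to Lisp; a quick sanity check (results
from my Emacs, assuming no recent changes):
(max-char)
4194303  ; = #x3FFFFF, the top of the 22-bit codespace
(characterp #x3FFFFF)
t
(characterp #x400000)
nil  ; one past the documented codespace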
>
> What you suggest is also impossible without deep changes in how we
> decode and encode text: that process maps codepoints above #x1FFFFF to
> either codepoints below that mark or to raw bytes. So it's impossible
> to produce these high codes in UTF-8 compatible form while handling
> UTF-8 text. To say nothing about the simple fact that no library
> function in any C library will ever be able to do anything useful with
> such codepoints, because they are our own invention.
>
Unless the behavior changed recently, that doesn't seem to be the case:
(encode-coding-string (string #x3fff40) 'utf-8-unix)
"\370\217\277\275\200"
Or are you talking about something different?
>
> > - There should be a one-to-one mapping between Emacs multibyte strings
> >   and encoded module API strings.
>
> UTF-8 encoded strings satisfy that requirement.
>
No! UTF-8 can only encode Unicode scalar values. Only the Emacs extension
to UTF-8 (which Emacs unfortunately also calls "UTF-8") satisfies this.
If you are talking about this extension, then we are talking about the
same thing anyway.
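To illustrate (assuming current behavior): #x110000 is one past the
Unicode codespace, so strict UTF-8 cannot represent it, yet it is a valid
Emacs character and gets the extended encoding:
(characterp #x110000)
t
(encode-coding-string (string #x110000) 'utf-8-unix)
"\364\220\200\200"  ; F4 90 80 80, invalid in strict UTF-8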
>
> > Therefore non-shortest forms, illegal code unit sequences, and code
> > unit sequences that would encode values outside the range of Emacs
> > characters are illegal and raise a signal.
>
> Once again, this was tried in the past and was found to be a bad idea.
> Emacs provides features to test the result of converting invalid
> sequences, for the purposes of detecting such problems, but it leaves
> that to the application.
>
It's probably OK to accept invalid sequences for consistency with
decode-coding-string and friends. I don't really like it though: the module
API, like decode-coding-string, is not a general-purpose UI for end users,
and accepting invalid sequences is error-prone and can even introduce
security issues (see e.g.
https://blogs.oracle.com/CoreJavaTechTips/entry/the_overhaul_of_java_utf).
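For example, here's the overlong two-byte form of ?/ (exactly the pitfall
from the Java article above). If I read the behavior correctly, Emacs
doesn't signal and instead keeps the bytes as raw-byte characters:
(mapcar (lambda (c) (format "#x%X" c))
        (decode-coding-string "\300\257" 'utf-8-unix))
("#x3FFFC0" "#x3FFFAF")  ; the raw bytes #xC0 #xAF, not "/"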
>
> > Likewise, such sequences will never be returned from Emacs.
>
> Emacs doesn't return invalid sequences, if the original text didn't
> include raw bytes. If there were raw bytes in the original text,
> Emacs has no choice but return them back, or else it will violate a
> basic expectation from a text-processing program: that it shall never
> change the portions of text that were not affected by the processing.
>
It seems that Emacs does return invalid sequences for characters such as
#x3fff40 (or anything else outside of Unicode except the 128 values for
encoding raw bytes).
Returning raw bytes means that encoding and decoding isn't a perfect
roundtrip, because adjacent raw bytes can happen to form a valid UTF-8
sequence; here the bytes #xC2 #xBB decode to a single "»":
(decode-coding-string
 (encode-coding-string (string #x3fffc2 #x3fffbb) 'utf-8-unix)
 'utf-8-unix)
"»"
We might be able to live with that, as it's an extreme edge case.
>
> > I think this is a relatively simple and unsurprising approach. It allows
> > encoding the documented Emacs character space while still being fully
> > compatible with UTF-8 and not resorting to undocumented Emacs internals.
>
> So does the approach I suggested. The advantage of my suggestion is
> that it follows a long Emacs tradition about every aspect of encoding
> and decoding text, and doesn't require any changes in the existing
> infrastructure.
>
What are the exact differences between the approaches? As far as I can see,
differences exist only for the following points:
- Accepting invalid sequences. I consider that a bug in general-purpose
APIs, including decode-coding-string. However, given that Emacs already
extends the Unicode codespace and therefore has to accept some invalid
sequences anyway, it might be OK if it's clearly documented.
- Emitting raw bytes instead of extended sequences. Though I'm not a fan of
this, it might be unavoidable in order to treat strings transparently
(which is desirable; see the sketch below).
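For completeness, the opposite direction (decode, then encode) does appear
to be transparent, which is the guarantee that matters for not corrupting
untouched text:
(let ((bytes "\377\200abc"))
  (equal (encode-coding-string
          (decode-coding-string bytes 'utf-8-unix)
          'utf-8-unix)
         bytes))
t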