Eli Zaretskii wrote on Sun, 22 Nov 2015 at 19:04:

> > From: Philipp Stephani
> > Date: Sun, 22 Nov 2015 14:56:12 +0000
> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com,
> >   emacs-devel@gnu.org
> >
> > - The multibyte API should use an extension of UTF-8 to encode
> >   Emacs strings. The extension is the obvious one already in use
> >   in multiple places.
>
> It is only used in one place: the internal representation of
> characters in buffers and strings.  Emacs _never_ lets this internal
> representation leak outside.

If I run in *scratch*:

(with-temp-buffer
  (insert #x3fff40)
  (describe-char (point-min)))

then the resulting help buffer says "buffer code: #xF8 #x8F #xBF #xBD
#x80". Is that not considered a leak?

> In practice the last sentence means that text that Emacs encoded in
> UTF-8 will only include either valid UTF-8 sequences of characters
> whose codepoints are below #x200000 or single bytes that don't
> belong to any UTF-8 sequence.

I get the same result as above when running

(with-temp-buffer
  (call-process "echo" nil t nil (string #x3fff40))
  (describe-char (point-min)))

which means the non-UTF-8 sequence is even "leaked" to the external
process!

> You are suggesting to expose the internal representation to outside
> application code, which predictably will cause that representation
> to leak into Lisp.  That'd be a disaster.  We had something like
> that back in the Emacs 20 era, and it took many years to plug those
> leaks.  We would be making a grave mistake to go back there.

I don't suggest leaking anything that isn't already leaked. The
extension of the codespace to 22 bits is well documented.

> What you suggest is also impossible without deep changes in how we
> decode and encode text: that process maps codepoints above #x1FFFFF
> to either codepoints below that mark or to raw bytes.  So it's
> impossible to produce these high codes in UTF-8 compatible form
> while handling UTF-8 text.  To say nothing about the simple fact
> that no library function in any C library will ever be able to do
> anything useful with such codepoints, because they are our own
> invention.

Unless the behavior changed recently, that doesn't seem to be the
case:

(encode-coding-string (string #x3fff40) 'utf-8-unix)
"\370\217\277\275\200"

Or are you talking about something different?

> > - There should be a one-to-one mapping between Emacs multibyte
> >   strings and encoded module API strings.
>
> UTF-8 encoded strings satisfy that requirement.

No! UTF-8 can only encode Unicode scalar values. Only the Emacs
extension of UTF-8 (which Emacs unfortunately also calls "UTF-8", I
think) satisfies this. If you are talking about this extension, then
we are talking about the same thing anyway.

> > Therefore non-shortest forms, illegal code unit sequences, and
> > code unit sequences that would encode values outside the range of
> > Emacs characters are illegal and raise a signal.
>
> Once again, this was tried in the past and was found to be a bad
> idea.  Emacs provides features to test the result of converting
> invalid sequences, for the purposes of detecting such problems, but
> it leaves that to the application.

It's probably OK to accept invalid sequences for consistency with
decode-coding-string and friends. I don't really like it, though: the
module API, like decode-coding-string, is not a general-purpose UI
for end users, and accepting invalid sequences in a programming
interface is error-prone and can even introduce security issues (see
e.g.
https://blogs.oracle.com/CoreJavaTechTips/entry/the_overhaul_of_java_utf).
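To illustrate the error-proneness (a sketch from my Emacs; the
printed representation may differ in other versions): decoding the
overlong two-byte form of NUL, which RFC 3629 forbids, doesn't
signal. The bytes silently come back as raw-byte characters rather
than decoding to U+0000 or raising an error:

(decode-coding-string "\300\200" 'utf-8-unix)
"\300\200"

A strict decoder would reject such input outright.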
> > Likewise, such sequences will never be returned from Emacs.
>
> Emacs doesn't return invalid sequences, if the original text didn't
> include raw bytes.  If there were raw bytes in the original text,
> Emacs has no choice but return them back, or else it will violate a
> basic expectation from a text-processing program: that it shall
> never change the portions of text that were not affected by the
> processing.

It seems that Emacs does return invalid sequences for characters such
as #x3fff40 (or anything else outside of Unicode except the 128
values for encoding raw bytes). Returning raw bytes means that
encoding and decoding isn't a perfect roundtrip:

(decode-coding-string
 (encode-coding-string (string #x3fffc2 #x3fffbb) 'utf-8-unix)
 'utf-8-unix)
"»"

We might be able to live with that, as it's an extreme edge case.

> > I think this is a relatively simple and unsurprising approach. It
> > allows encoding the documented Emacs character space while still
> > being fully compatible with UTF-8 and not resorting to
> > undocumented Emacs internals.
>
> So does the approach I suggested.  The advantage of my suggestion is
> that it follows a long Emacs tradition about every aspect of
> encoding and decoding text, and doesn't require any changes in the
> existing infrastructure.

What are the exact differences between the approaches? As far as I
can see, differences exist only for the following points:

- Accepting invalid sequences. I consider that a bug in
  general-purpose APIs, including decode-coding-string. However,
  given that Emacs already extends the Unicode codespace and
  therefore has to accept some invalid sequences anyway, it might be
  OK if it's clearly documented.

- Emitting raw bytes instead of extended sequences (see the example
  below). Though I'm not a fan of this, it might be unavoidable in
  order to treat strings transparently (which is desirable).
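As an illustration of the second point (again output from my Emacs,
so treat the exact results as a sketch): a raw-byte character encodes
back to its original byte, not to the 5-byte extended sequence
"\370\217\277\277\277" that the UTF-8 bit pattern would otherwise
produce:

(encode-coding-string (string #x3fffff) 'utf-8-unix)
"\377"

Codepoints above the Unicode range that are not raw bytes do get the
extended sequence, as shown above for #x3fff40.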