Eli Zaretskii wrote on Sun, 22 Nov 2015 at 19:04:

> > From: Philipp Stephani
> > Date: Sun, 22 Nov 2015 14:56:12 +0000
> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com,
> >   emacs-devel@gnu.org
> >
> > - The multibyte API should use an extension of UTF-8 to encode
> >   Emacs strings. The extension is the obvious one already in use
> >   in multiple places.
>
> It is only used in one place: the internal representation of
> characters in buffers and strings.  Emacs _never_ lets this internal
> representation leak outside.

If I run in *scratch*:

(with-temp-buffer
  (insert #x3fff40)
  (describe-char (point-min)))

then the resulting help buffer says "buffer code: #xF8 #x8F #xBF #xBD
#x80". Is that not considered a leak?

> In practice the last sentence means that text that Emacs encoded in
> UTF-8 will only include either valid UTF-8 sequences of characters
> whose codepoints are below #x200000 or single bytes that don't
> belong to any UTF-8 sequence.

I get the same result as above when running

(with-temp-buffer
  (call-process "echo" nil t nil (string #x3fff40))
  (describe-char (point-min)))

which means the non-UTF-8 sequence is even "leaked" to the external
process!

> You are suggesting to expose the internal representation to outside
> application code, which predictably will cause that representation
> to leak into Lisp.  That'd be a disaster.  We had something like
> that back in the Emacs 20 era, and it took many years to plug those
> leaks.  We would be making a grave mistake to go back there.

I don't suggest leaking anything that isn't already leaked. The
extension of the codespace to 22 bits is well documented.

> What you suggest is also impossible without deep changes in how we
> decode and encode text: that process maps codepoints above #x1FFFFF
> to either codepoints below that mark or to raw bytes.  So it's
> impossible to produce these high codes in UTF-8 compatible form
> while handling UTF-8 text.  To say nothing about the simple fact
> that no library function in any C library will ever be able to do
> anything useful with such codepoints, because they are our own
> invention.

Unless the behavior changed recently, that doesn't seem to be the
case:

(encode-coding-string (string #x3fff40) 'utf-8-unix)
"\370\217\277\275\200"

Or are you talking about something different?

> > - There should be a one-to-one mapping between Emacs multibyte
> >   strings and encoded module API strings.
>
> UTF-8 encoded strings satisfy that requirement.

No! UTF-8 can only encode Unicode scalar values. Only the Emacs
extension of UTF-8 (which Emacs unfortunately also calls "UTF-8", I
think) satisfies this. If you are talking about this extension, then
we are talking about the same thing anyway.

> > Therefore non-shortest forms, illegal code unit sequences, and
> > code unit sequences that would encode values outside the range of
> > Emacs characters are illegal and raise a signal.
>
> Once again, this was tried in the past and was found to be a bad
> idea.  Emacs provides features to test the result of converting
> invalid sequences, for the purposes of detecting such problems, but
> it leaves that to the application.

It's probably OK to accept invalid sequences for consistency with
decode-coding-string and friends. I don't really like it, though: the
module API, like decode-coding-string, is not a general-purpose UI
for end users, and accepting invalid sequences in a programming
interface is error-prone and can even introduce security issues (see
e.g.
https://blogs.oracle.com/CoreJavaTechTips/entry/the_overhaul_of_java_utf).
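To illustrate the error-proneness (a sketch from my Emacs; the
printed representation may differ in other versions): decoding the
overlong two-byte form of NUL, which RFC 3629 forbids, doesn't
signal. The bytes silently come back as raw-byte characters rather
than decoding to U+0000 or raising an error:

(decode-coding-string "\300\200" 'utf-8-unix)
"\300\200"

A strict decoder would reject such input outright.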
> > Likewise, such sequences will never be returned from Emacs.
>
> Emacs doesn't return invalid sequences, if the original text didn't
> include raw bytes.  If there were raw bytes in the original text,
> Emacs has no choice but return them back, or else it will violate a
> basic expectation from a text-processing program: that it shall
> never change the portions of text that were not affected by the
> processing.

It seems that Emacs does return invalid sequences for characters such
as #x3fff40 (or anything else outside of Unicode except the 128
values for encoding raw bytes). Returning raw bytes means that
encoding and decoding isn't a perfect roundtrip:

(decode-coding-string
 (encode-coding-string (string #x3fffc2 #x3fffbb) 'utf-8-unix)
 'utf-8-unix)
"»"

We might be able to live with that, as it's an extreme edge case.

> > I think this is a relatively simple and unsurprising approach. It
> > allows encoding the documented Emacs character space while still
> > being fully compatible with UTF-8 and not resorting to
> > undocumented Emacs internals.
>
> So does the approach I suggested.  The advantage of my suggestion is
> that it follows a long Emacs tradition about every aspect of
> encoding and decoding text, and doesn't require any changes in the
> existing infrastructure.

What are the exact differences between the approaches? As far as I
can see, differences exist only for the following points:

- Accepting invalid sequences. I consider that a bug in
  general-purpose APIs, including decode-coding-string. However,
  given that Emacs already extends the Unicode codespace and
  therefore has to accept some invalid sequences anyway, it might be
  OK if it's clearly documented.

- Emitting raw bytes instead of extended sequences (see the example
  below). Though I'm not a fan of this, it might be unavoidable in
  order to treat strings transparently (which is desirable).
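As an illustration of the second point (again output from my Emacs,
so treat the exact results as a sketch): a raw-byte character encodes
back to its original byte, not to the 5-byte extended sequence
"\370\217\277\277\277" that the UTF-8 bit pattern would otherwise
produce:

(encode-coding-string (string #x3fffff) 'utf-8-unix)
"\377"

Codepoints above the Unicode range that are not raw bytes do get the
extended sequence, as shown above for #x3fff40.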