Eli Zaretskii wrote on Sat, 21 Nov 2015 at 14:23:

> > From: Philipp Stephani
> > Date: Sat, 21 Nov 2015 12:11:45 +0000
> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
> >
> > No, we cannot, or rather should not. It is unreasonable to expect
> > external modules to know the intricacies of the internal
> > representation. Most Emacs hackers don't.
> >
> > Fine with me, but how would we then represent Emacs strings that are
> > not valid Unicode strings? Just raise an error?
>
> No need to raise an error. Strings that are returned to modules
> should be encoded into UTF-8. That encoding already takes care of
> these situations: it either produces the UTF-8 encoding of the
> equivalent Unicode characters, or outputs raw bytes.

Then we should document such situations and give module authors a way to
detect them. For example, what happens if a sequence of such raw bytes
happens to be a valid UTF-8 sequence? Is there a way for module code to
detect this situation?

> We are using this all the time when we save files or send stuff over
> the network.
>
> > No, I meant strict UTF-8, not its Emacs extension.
> >
> > That would be possible and provide a clean interface. However, Emacs
> > strings are extended, so we'd need to specify how they interact with
> > UTF-8 strings.
> >
> > * If a module passes a char sequence that's not a valid UTF-8 string,
> > but a valid Emacs multibyte string, what should happen? Error,
> > undefined behavior, silently accepted?
>
> We are quite capable of quietly accepting such strings, so that is
> what I would suggest. Doing so would be in line with what Emacs does
> when such invalid sequences come from other sources, like files.

If we accept such strings, then we should document what the extensions
are.

- Are UTF-8-like sequences encoding surrogate code points accepted?
- Are UTF-8-like sequences encoding integers outside the Unicode
  codespace accepted?
- Are non-shortest forms accepted?
- Are other invalid code unit sequences accepted?

If the answer to any of these is "yes", we can't say we accept UTF-8,
because we don't. Rather, we should say what is actually accepted.

> > * If copy_string_contents is passed an Emacs string that is not a
> > valid Unicode string, what should happen?
>
> How can that happen? The Emacs string comes from the Emacs bowels, so
> it must be a "valid" string by Emacs standards. Or maybe I don't
> understand what you mean by "invalid Unicode string".

A sequence of integers where at least one element is not a Unicode
scalar value.

> In any case, we already deal with any such problems when we save a
> buffer to a file, or send it over the network. This isn't some new
> problem we need to cope with.

Yes, but the module interface is new; it doesn't necessarily have to
have the same behavior. If we say we emit only UTF-8, then we should do
so.

> > OK, then we can use that, of course. The question of handling invalid
> > UTF-8 strings is still open, though, as make_multibyte_string doesn't
> > enforce valid UTF-8.
>
> It doesn't enforce valid UTF-8 because it can handle invalid UTF-8 as
> well. That's by design.

Then whatever it handles needs to be specified.

> > If it's the contract of make_multibyte_string that it will always
> > accept UTF-8, then that should be added as a comment to that function.
> > Currently I don't see it documented anywhere.
>
> That part of the documentation is only revealed to veteran Emacs
> hackers, subject to swearing not to reveal that to the uninitiated and
> to some blood-letting that seals the oath ;-)

I see ;-)
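[For concreteness, here is what the distinctions in the bullet list above
amount to in code. This is only an illustrative sketch, not anything
taken from Emacs or from the module API: a checker for strict UTF-8 in
the RFC 3629 sense, rejecting exactly the forms the list asks about
(encoded surrogates, values beyond the Unicode codespace, non-shortest
forms, and other invalid byte sequences). The name is_strict_utf8 is
made up for this example.]

#include <stdbool.h>
#include <stddef.h>

/* Return true if BUF[0..LEN-1] is strict UTF-8 in the RFC 3629 sense:
   shortest-form encodings only, no encoded surrogates (U+D800..U+DFFF),
   nothing above U+10FFFF, and no stray or truncated sequences.  */
static bool
is_strict_utf8 (const unsigned char *buf, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char lead = buf[i];
      size_t trailing;
      unsigned int cp, min;

      if (lead < 0x80)                    /* ASCII */
        { i++; continue; }
      else if ((lead & 0xE0) == 0xC0)     /* 2-byte sequence */
        { trailing = 1; cp = lead & 0x1F; min = 0x80; }
      else if ((lead & 0xF0) == 0xE0)     /* 3-byte sequence */
        { trailing = 2; cp = lead & 0x0F; min = 0x800; }
      else if ((lead & 0xF8) == 0xF0)     /* 4-byte sequence */
        { trailing = 3; cp = lead & 0x07; min = 0x10000; }
      else
        return false;       /* stray continuation byte, or 0xF8..0xFF */

      if (len - i - 1 < trailing)
        return false;       /* truncated sequence */

      for (size_t k = 1; k <= trailing; k++)
        {
          if ((buf[i + k] & 0xC0) != 0x80)
            return false;   /* not a continuation byte */
          cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

      if (cp < min)
        return false;       /* non-shortest ("overlong") form */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;       /* UTF-8-like encoding of a surrogate */
      if (cp > 0x10FFFF)
        return false;       /* outside the Unicode codespace */

      i += 1 + trailing;
    }
  return true;
}

[A module that is happy with whatever extended encoding Emacs quietly
accepts or emits would simply skip such a check; the point of the sketch
is only to spell out what "strict UTF-8" excludes.]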
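[On the receiving side, a module could apply such a check to whatever
copy_string_contents hands back before treating the result as strict
UTF-8. The sketch below assumes the copy_string_contents interface that
later shipped in emacs-module.h, where a NULL buffer makes the first
call report the required size including the terminating NUL;
is_strict_utf8 is the hypothetical checker from the previous sketch, and
get_strict_utf8 is likewise invented for illustration.]

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <emacs-module.h>

static bool is_strict_utf8 (const unsigned char *buf, size_t len);

/* Copy the UTF-8 text of VALUE out of Emacs and return it as a
   malloc'd, NUL-terminated buffer, or NULL if the copy fails or the
   result is not strict UTF-8 (e.g. Emacs emitted raw bytes).  */
static char *
get_strict_utf8 (emacs_env *env, emacs_value value)
{
  ptrdiff_t size = 0;

  /* First call: a NULL buffer only queries the required size,
     which includes the terminating NUL byte.  */
  if (!env->copy_string_contents (env, value, NULL, &size))
    return NULL;

  char *buf = malloc (size);
  if (buf == NULL)
    return NULL;

  /* Second call: copy the actual bytes.  */
  if (!env->copy_string_contents (env, value, buf, &size))
    {
      free (buf);
      return NULL;
    }

  /* Exclude the trailing NUL from validation.  */
  if (!is_strict_utf8 ((const unsigned char *) buf, (size_t) size - 1))
    {
      free (buf);
      return NULL;
    }

  return buf;
}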