From: Philipp Stephani
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sat, 21 Nov 2015 12:11:45 +0000
> Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
>
>     No, we cannot, or rather should not. It is unreasonable to expect
>     external modules to know the intricacies of the internal
>     representation. Most Emacs hackers don't.
>
> Fine with me, but how would we then represent Emacs strings that are not valid
> Unicode strings? Just raise an error?
No need to raise an error.  Strings that are returned to modules
should be encoded into UTF-8.  That encoding already takes care of
these situations: it either produces the UTF-8 encoding of the
equivalent Unicode characters, or outputs raw bytes.
We are using this all the time when we save files or send stuff over
the network.
>     No, I meant strict UTF-8, not its Emacs extension.
>
> That would be possible and provide a clean interface. However, Emacs strings
> are extended, so we'd need to specify how they interact with UTF-8 strings.
>
> * If a module passes a char sequence that's not a valid UTF-8 string, but a
>   valid Emacs multibyte string, what should happen? Error, undefined behavior,
>   silently accepted?
We are quite capable of quietly accepting such strings, so that is
what I would suggest.  Doing so would be in line with what Emacs does
when such invalid sequences come from other sources, like files.

If we accept such strings, then we should document what the extensions are.

- Are UTF-8-like sequences encoding surrogate code points accepted?
- Are UTF-8-like sequences encoding integers outside the Unicode codespace accepted?
- Are non-shortest forms accepted?
- Are other invalid code unit sequences accepted?

If the answer to any of these is "yes", we can't say we accept UTF-8,
because we don't. Rather we should say what is actually accepted.

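To make the four questions above concrete, here is a minimal sketch of what a strict UTF-8 check would reject. It is not taken from the Emacs sources; the function name `is_strict_utf8` is hypothetical and only illustrates the cases (surrogates, values above U+10FFFF, non-shortest forms, other malformed sequences) that an interface claiming "UTF-8" would have to refuse or else document as extensions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of any Emacs API.
   Return true if BUF[0..LEN) is well-formed UTF-8 in the strict sense:
   no surrogates, nothing above U+10FFFF, no non-shortest forms.  */
static bool
is_strict_utf8 (const unsigned char *buf, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char b = buf[i];
      size_t n;           /* number of continuation bytes expected */
      unsigned long cp;   /* code point decoded so far */
      unsigned long min;  /* smallest code point allowed for this length */

      if (b < 0x80)
        {
          i++;            /* ASCII byte */
          continue;
        }
      else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
      else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
      else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
      else
        return false;     /* stray continuation byte, or 0xF8..0xFF */

      if (len - i <= n)
        return false;     /* sequence truncated at end of buffer */

      for (size_t k = 1; k <= n; k++)
        {
          unsigned char c = buf[i + k];
          if ((c & 0xC0) != 0x80)
            return false; /* not a continuation byte */
          cp = (cp << 6) | (c & 0x3F);
        }

      if (cp < min)
        return false;     /* non-shortest form */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;     /* surrogate code point */
      if (cp > 0x10FFFF)
        return false;     /* outside the Unicode codespace */

      i += n + 1;
    }
  return true;
}
```

Each early `return false` corresponds to one of the questions; an interface that silently accepts any of these inputs is accepting a superset of UTF-8, and that superset is what would need documenting.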
> * If copy_string_contents is passed an Emacs string that is not a valid Unicode
>   string, what should happen?
How can that happen?  The Emacs string comes from the Emacs bowels, so
it must be "valid" string by Emacs standards.  Or maybe I don't
understand what you mean by "invalid Unicode string".

A sequence of integers where at least one element is not a Unicode scalar value.

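The definition above can be written down directly. The following is a sketch only; `is_scalar_value` and `is_valid_unicode_string` are hypothetical names, and representing a string as an array of `uint32_t` is an assumption made for illustration, not how Emacs stores strings.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A Unicode scalar value is any code point in 0..0x10FFFF
   except the surrogate range 0xD800..0xDFFF.  */
static bool
is_scalar_value (uint32_t c)
{
  return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

/* Hypothetical helper: true if every element of CHARS[0..LEN)
   is a Unicode scalar value, i.e. the sequence is a valid
   Unicode string in the sense used above.  */
static bool
is_valid_unicode_string (const uint32_t *chars, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (!is_scalar_value (chars[i]))
      return false;
  return true;
}
```

Under this definition a lone surrogate, or any integer above 0x10FFFF, makes the whole string invalid, even though Emacs itself may be able to hold such an element.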
In any case, we already deal with any such problems when we save a
buffer to a file, or send it over the network.  This isn't some new
problem we need to cope with.

Yes, but the module interface is new; it doesn't necessarily have to have
the same behavior. If we say we emit only UTF-8, then we should do so.

> OK, then we can use that, of course. The question of handling invalid UTF-8
> strings is still open, though, as make_multibyte_string doesn't enforce valid
> UTF-8.
It doesn't enforce valid UTF-8 because it can handle invalid UTF-8 as well.  That's by design.

Then whatever it handles needs to be specified.

> If it's the contract of make_multibyte_string that it will always accept UTF-8,
> then that should be added as a comment to that function. Currently I don't see
> it documented anywhere.
That part of the documentation is only revealed to veteran Emacs
hackers, subject to swearing not to reveal that to the uninitiated and
to some blood-letting that seals the oath ;-)

I see ;-)