Eli Zaretskii <eliz@gnu.org> wrote on Sat, 21 Nov 2015 at 14:23:
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sat, 21 Nov 2015 12:11:45 +0000
> Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
>
>     No, we cannot, or rather should not. It is unreasonable to expect
>     external modules to know the intricacies of the internal
>     representation. Most Emacs hackers don't.
>
> Fine with me, but how would we then represent Emacs strings that are not valid
> Unicode strings? Just raise an error?
No need to raise an error.  Strings that are returned to modules
should be encoded into UTF-8.  That encoding already takes care of
these situations: it either produces the UTF-8 encoding of the
equivalent Unicode characters, or outputs raw bytes.

Then we should document such situations and give module authors a way
to detect them. For example, what happens if a sequence of such raw
bytes happens to be a valid UTF-8 sequence? Is there a way for module
code to detect this situation?
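To make the question concrete, here is a minimal C sketch. It assumes that
raw bytes are copied to the module verbatim, which is my reading of "outputs
raw bytes" above. The helper `is_two_byte_utf8` and the two sample buffers
are hypothetical names for illustration, not part of any Emacs API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if buf[0..len) is exactly one well-formed
   two-byte UTF-8 sequence (U+0080..U+07FF). */
static bool is_two_byte_utf8(const unsigned char *buf, size_t len)
{
  return len == 2
         && buf[0] >= 0xC2 && buf[0] <= 0xDF  /* lead byte, not overlong */
         && (buf[1] & 0xC0) == 0x80;          /* continuation byte */
}

/* Bytes a module would see for the one-character string U+00E9. */
static const unsigned char from_character[] = { 0xC3, 0xA9 };

/* Bytes a module would see for a string holding the two raw bytes
   0xC3 0xA9, ASSUMING raw bytes are passed through unchanged. */
static const unsigned char from_raw_bytes[] = { 0xC3, 0xA9 };

/* True only if the two buffers differ, i.e. if the module has any
   way to tell the two Emacs strings apart from the bytes alone. */
static bool module_can_tell_apart(void)
{
  return memcmp(from_character, from_raw_bytes,
                sizeof from_character) != 0;
}
```

Under that assumption both buffers are byte-for-byte identical and both
pass the well-formedness check, so the module has nothing to distinguish
them by.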
I've thought a bit more about this issue, and in the following I'll
attempt to derive the desired behavior from first principles without
referring to internal Emacs functions.
- There are two sets of functions for creating and reading strings,
  unibyte and multibyte. If a string of the wrong type is passed, a
  signal is raised. This way the two types are clearly separated.
- The behavior of the unibyte API is uncontroversial and has no failure
  modes apart from generic ones such as wrong type, argument out of
  range, OOM.
- The multibyte API should use an extension of UTF-8 to encode Emacs
  strings. The extension is the obvious one already in use in multiple
  places.
- There should be a one-to-one mapping between Emacs multibyte strings
  and encoded module API strings. Therefore non-shortest forms, illegal
  code unit sequences, and code unit sequences that would encode values
  outside the range of Emacs characters are illegal and raise a signal.
  Likewise, such sequences will never be returned from Emacs.
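As a rough illustration of the last two points, here is a C sketch of the
strict check. It rests on assumptions the list does not spell out: that the
"obvious" extension is the UTF-8 bit pattern carried on to five-byte
sequences, that the upper bound is 0x3FFFFF (the value of `(max-char)`), and
that surrogate code points are accepted. `decode_strict` and
`valid_module_string` are hypothetical names, not proposed API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Largest Emacs character code, the value of (max-char). */
#define MAX_EMACS_CHAR 0x3FFFFFu

/* Decode one character from s[0..len).  On success, store the code
   point in *out and return the number of bytes consumed.  Return 0
   if the sequence has to be rejected. */
static size_t decode_strict(const unsigned char *s, size_t len,
                            uint32_t *out)
{
  /* Smallest code point that needs n bytes, indexed by n. */
  static const uint32_t min_for_len[] =
    { 0, 0, 0x80, 0x800, 0x10000, 0x200000 };
  if (len == 0)
    return 0;
  unsigned char b = s[0];
  size_t n;
  uint32_t c;
  if (b < 0x80)      { n = 1; c = b; }
  else if (b < 0xC0) return 0;             /* stray continuation byte */
  else if (b < 0xE0) { n = 2; c = b & 0x1F; }
  else if (b < 0xF0) { n = 3; c = b & 0x0F; }
  else if (b < 0xF8) { n = 4; c = b & 0x07; }
  else if (b < 0xFC) { n = 5; c = b & 0x03; }  /* assumed extension */
  else return 0;                           /* no such lead byte */
  if (len < n)
    return 0;                              /* truncated sequence */
  for (size_t i = 1; i < n; i++) {
    if ((s[i] & 0xC0) != 0x80)
      return 0;                            /* not a continuation byte */
    c = (c << 6) | (s[i] & 0x3F);
  }
  if (c < min_for_len[n])
    return 0;                              /* non-shortest form */
  if (c > MAX_EMACS_CHAR)
    return 0;                              /* outside Emacs characters */
  *out = c;
  return n;
}

/* True if the whole buffer consists of acceptable sequences. */
static bool valid_module_string(const unsigned char *s, size_t len)
{
  while (len > 0) {
    uint32_t c;
    size_t n = decode_strict(s, len, &c);
    if (n == 0)
      return false;
    s += n;
    len -= n;
  }
  return true;
}
```

Because every accepted byte sequence decodes to exactly one code point and
every code point has exactly one shortest encoding, a check of this kind is
what would make the mapping one-to-one. In the real API a rejection would
raise a signal rather than return false.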
I think this is a relatively simple and unsurprising approach. It allows
encoding the documented Emacs character space while still being fully
compatible with UTF-8 and not resorting to undocumented Emacs internals.