Eli Zaretskii wrote on Sat, 21 Nov 2015 at 14:23:

> > From: Philipp Stephani
> > Date: Sat, 21 Nov 2015 12:11:45 +0000
> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
> >
> > No, we cannot, or rather should not. It is unreasonable to expect
> > external modules to know the intricacies of the internal
> > representation. Most Emacs hackers don't.
> >
> > Fine with me, but how would we then represent Emacs strings that are
> > not valid Unicode strings? Just raise an error?
>
> No need to raise an error. Strings that are returned to modules
> should be encoded into UTF-8. That encoding already takes care of
> these situations: it either produces the UTF-8 encoding of the
> equivalent Unicode characters, or outputs raw bytes.

Then we should document such situations and give module authors a way to
detect them. For example, what happens if a sequence of such raw bytes
happens to be a valid UTF-8 sequence? Is there a way for module code to
detect this situation?

> We are using this all the time when we save files or send stuff over
> the network.
>
> > No, I meant strict UTF-8, not its Emacs extension.
> >
> > That would be possible and provide a clean interface. However, Emacs
> > strings are extended, so we'd need to specify how they interact with
> > UTF-8 strings.
> >
> > * If a module passes a char sequence that's not a valid UTF-8 string,
> > but a valid Emacs multibyte string, what should happen? Error,
> > undefined behavior, silently accepted?
>
> We are quite capable of quietly accepting such strings, so that is
> what I would suggest. Doing so would be in line with what Emacs does
> when such invalid sequences come from other sources, like files.

If we accept such strings, then we should document what the extensions
are.

- Are UTF-8-like sequences encoding surrogate code points accepted?
- Are UTF-8-like sequences encoding integers outside the Unicode
  codespace accepted?
- Are non-shortest forms accepted?
- Are other invalid code unit sequences accepted?

If the answer to any of these is "yes", we can't say we accept UTF-8,
because we don't. Rather, we should say what is actually accepted.

> > * If copy_string_contents is passed an Emacs string that is not a
> > valid Unicode string, what should happen?
>
> How can that happen? The Emacs string comes from the Emacs bowels, so
> it must be a "valid" string by Emacs standards. Or maybe I don't
> understand what you mean by "invalid Unicode string".

A sequence of integers where at least one element is not a Unicode
scalar value.

> In any case, we already deal with any such problems when we save a
> buffer to a file, or send it over the network. This isn't some new
> problem we need to cope with.

Yes, but the module interface is new; it doesn't necessarily have to
have the same behavior. If we say we emit only UTF-8, then we should do
so.

> > OK, then we can use that, of course. The question of handling invalid
> > UTF-8 strings is still open, though, as make_multibyte_string doesn't
> > enforce valid UTF-8.
>
> It doesn't enforce valid UTF-8 because it can handle invalid UTF-8 as
> well. That's by design.

Then whatever it handles needs to be specified.

> > If it's the contract of make_multibyte_string that it will always
> > accept UTF-8, then that should be added as a comment to that function.
> > Currently I don't see it documented anywhere.
>
> That part of the documentation is only revealed to veteran Emacs
> hackers, subject to swearing not to reveal that to the uninitiated and
> to some blood-letting that seals the oath ;-)

I see ;-)
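[For concreteness, here is what the distinctions in the bullet list above
amount to in code. This is only an illustrative sketch, not anything
taken from Emacs or from the module API: a checker for strict UTF-8 in
the RFC 3629 sense, rejecting exactly the forms the list asks about
(encoded surrogates, values beyond the Unicode codespace, non-shortest
forms, and other invalid byte sequences). The name is_strict_utf8 is
made up for this example.]

#include <stdbool.h>
#include <stddef.h>

/* Return true if BUF[0..LEN-1] is strict UTF-8 in the RFC 3629 sense:
   shortest-form encodings only, no encoded surrogates (U+D800..U+DFFF),
   nothing above U+10FFFF, and no stray or truncated sequences.  */
static bool
is_strict_utf8 (const unsigned char *buf, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char lead = buf[i];
      size_t trailing;
      unsigned int cp, min;

      if (lead < 0x80)                    /* ASCII */
        { i++; continue; }
      else if ((lead & 0xE0) == 0xC0)     /* 2-byte sequence */
        { trailing = 1; cp = lead & 0x1F; min = 0x80; }
      else if ((lead & 0xF0) == 0xE0)     /* 3-byte sequence */
        { trailing = 2; cp = lead & 0x0F; min = 0x800; }
      else if ((lead & 0xF8) == 0xF0)     /* 4-byte sequence */
        { trailing = 3; cp = lead & 0x07; min = 0x10000; }
      else
        return false;       /* stray continuation byte, or 0xF8..0xFF */

      if (len - i - 1 < trailing)
        return false;       /* truncated sequence */

      for (size_t k = 1; k <= trailing; k++)
        {
          if ((buf[i + k] & 0xC0) != 0x80)
            return false;   /* not a continuation byte */
          cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

      if (cp < min)
        return false;       /* non-shortest ("overlong") form */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;       /* UTF-8-like encoding of a surrogate */
      if (cp > 0x10FFFF)
        return false;       /* outside the Unicode codespace */

      i += 1 + trailing;
    }
  return true;
}

[A module that is happy with whatever extended encoding Emacs quietly
accepts or emits would simply skip such a check; the point of the sketch
is only to spell out what "strict UTF-8" excludes.]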
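[On the receiving side, a module could apply such a check to whatever
copy_string_contents hands back before treating the result as strict
UTF-8. The sketch below assumes the copy_string_contents interface that
later shipped in emacs-module.h, where a NULL buffer makes the first
call report the required size including the terminating NUL;
is_strict_utf8 is the hypothetical checker from the previous sketch, and
get_strict_utf8 is likewise invented for illustration.]

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <emacs-module.h>

static bool is_strict_utf8 (const unsigned char *buf, size_t len);

/* Copy the UTF-8 text of VALUE out of Emacs and return it as a
   malloc'd, NUL-terminated buffer, or NULL if the copy fails or the
   result is not strict UTF-8 (e.g. Emacs emitted raw bytes).  */
static char *
get_strict_utf8 (emacs_env *env, emacs_value value)
{
  ptrdiff_t size = 0;

  /* First call: a NULL buffer only queries the required size,
     which includes the terminating NUL byte.  */
  if (!env->copy_string_contents (env, value, NULL, &size))
    return NULL;

  char *buf = malloc (size);
  if (buf == NULL)
    return NULL;

  /* Second call: copy the actual bytes.  */
  if (!env->copy_string_contents (env, value, buf, &size))
    {
      free (buf);
      return NULL;
    }

  /* Exclude the trailing NUL from validation.  */
  if (!is_strict_utf8 ((const unsigned char *) buf, (size_t) size - 1))
    {
      free (buf);
      return NULL;
    }

  return buf;
}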