Eli Zaretskii <eliz@gnu.org> wrote on Sat, 21 Nov 2015 at 12:10:

> >     (Btw, I don't think we should worry about changing the internal
> >     representation of characters in Emacs, because make_multibyte_string
> >     will be updated as needed.)
> >
> > This is a crucial point. If the internal encoding never changes, then we can
> > declare that those string parameters are expected to be in the internal
> > encoding.
>
> No, we cannot, or rather should not.  It is unreasonable to expect
> external modules to know the intricacies of the internal
> representation.  Most Emacs hackers don't.
>

Fine with me, but how would we then represent Emacs strings that are not
valid Unicode strings? Just raise an error?
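
To make the question concrete, here is a rough sketch of which character
codes would be affected. It assumes the ranges described in the Text
Representations node of the Elisp manual (character codes up to #x3FFFFF,
with #x3FFF80..#x3FFFFF standing for raw 8-bit bytes). The names
classify_char and enum char_class are made up for illustration and are not
part of Emacs or of the module API.

```c
/* Sketch only: classify an Emacs character code by whether strict
   UTF-8 can represent it.  All names here are hypothetical.  */
enum char_class
{
  CHAR_UNICODE_SCALAR,   /* representable in strict UTF-8 */
  CHAR_SURROGATE,        /* U+D800..U+DFFF */
  CHAR_BEYOND_UNICODE,   /* #x110000..#x3FFF7F, Emacs extension */
  CHAR_RAW_BYTE,         /* #x3FFF80..#x3FFFFF, raw 8-bit bytes */
  CHAR_OUT_OF_RANGE      /* not an Emacs character code at all */
};

enum char_class
classify_char (long c)
{
  if (c < 0 || c > 0x3FFFFF)
    return CHAR_OUT_OF_RANGE;
  if (c >= 0x3FFF80)
    return CHAR_RAW_BYTE;
  if (c > 0x10FFFF)
    return CHAR_BEYOND_UNICODE;
  if (c >= 0xD800 && c <= 0xDFFF)
    return CHAR_SURROGATE;
  return CHAR_UNICODE_SCALAR;
}
```

Everything other than CHAR_UNICODE_SCALAR is a case where a strict UTF-8
interface needs a defined behavior.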


>
> > But see the discussion in
> > https://github.com/aaptel/emacs-dynamic-module/issues/37: the comment in
> > mule-conf.el seems to indicate that the internal encoding is not stable.
>
> That discussion is about zero-copy access to Emacs buffer text and
> Emacs strings inside module code.


Only partially. The encoding discussion is part of that as well, because the
encoding has to be specified before zero-copy access is even possible.


>   Such access is indeed impossible
> without either knowing _something_ about the internal representation,
> or having additional APIs in emacs-module.c that allow modules such
> access while hiding the details of the internal representation.  We
> could discuss extending the module functionality to include this.
>
>
Yes, there's no need for that in this subthread though.


> But that is a separate issue from what module_make_function and
> module_make_string do.  These two functions are basic, and don't need
> to know about the internal representation or use it.  While direct
> access to Emacs buffer text will be needed by only some modules,
> module_make_function will be used by all of them, and
> module_make_string by many.
>
> So I think we shouldn't conflate these two issues; they are separate.
>

OK.


>
> >     This is what my comments were about. I think that you, by contrast,
> >     are talking about the encoding of the _input_ strings, in this case
> >     the 'documentation' argument to module_make_function and 'str'
> >     argument to module_make_string. My assumption was that these
> >     arguments will always have to be in UTF-8 encoding; if that assumption
> >     is true, then no decoding via code_convert_string_norecord is
> >     necessary, since make_multibyte_string will DTRT. We can (and
> >     probably should) document the fact that all non-ASCII strings must be
> >     UTF-8 encoded as a requirement of the emacs-module interface.
> >
> > Or rather, an extension to UTF-8 capable of encoding surrogate code points and
> > numbers that are not code points, as described in
> > https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html.
>
> No, I meant strict UTF-8, not its Emacs extension.
>

That would be possible and provide a clean interface. However, Emacs
strings are extended, so we'd need to specify how they interact with UTF-8
strings.

   - If a module passes a char sequence that's not a valid UTF-8 string,
   but a valid Emacs multibyte string, what should happen? An error,
   undefined behavior, or silent acceptance?
   - If copy_string_contents is passed an Emacs string that is not a valid
   Unicode string, what should happen? Error, or should the internal
   representation be silently leaked?
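
For the first case, if we went with "error", the check itself would be
cheap. A minimal sketch of a strict UTF-8 validator follows, assuming
"strict" means RFC 3629: no overlong forms, no surrogates, nothing above
U+10FFFF. The name strict_utf8_p is hypothetical; I'm not claiming such a
function exists in Emacs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch only: return true if the LEN bytes at S are strict UTF-8.
   The longer sequences of the Emacs extension are rejected.  */
bool
strict_utf8_p (const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char b = s[i];
      size_t n;                 /* number of continuation bytes */
      unsigned long cp, min;    /* decoded value, smallest legal value */

      if (b < 0x80)
        {
          i++;                  /* ASCII */
          continue;
        }
      else if ((b & 0xE0) == 0xC0)
        { n = 1; cp = b & 0x1F; min = 0x80; }
      else if ((b & 0xF0) == 0xE0)
        { n = 2; cp = b & 0x0F; min = 0x800; }
      else if ((b & 0xF8) == 0xF0)
        { n = 3; cp = b & 0x07; min = 0x10000; }
      else
        return false;           /* stray continuation byte or 0xF8..0xFF */

      if (len - i <= n)
        return false;           /* sequence truncated */
      for (size_t k = 1; k <= n; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return false;       /* not a continuation byte */
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }
      if (cp < min)
        return false;           /* overlong encoding */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;           /* surrogate */
      if (cp > 0x10FFFF)
        return false;           /* beyond Unicode */
      i += n + 1;
    }
  return true;
}
```

A module-facing function could run such a check on its input and signal an
error on failure. Whether that is preferable to silently accepting the
input is exactly the open question.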


> > If it's stable, we can use make_multibyte_string; if not, we can
> > only use make_unibyte_string.
>
> If the arguments strings are in strict UTF-8, then
> make_multibyte_string will DTRT automagically, no matter what the
> internal representation is.  That is their contract.
>

OK, then we can use that, of course. The question of handling invalid UTF-8
strings is still open, though, as make_multibyte_string doesn't enforce
valid UTF-8.

If it's the contract of make_multibyte_string that it will always accept
UTF-8, then that should be added as a comment to that function. Currently I
don't see it documented anywhere.