Eli Zaretskii wrote on Sat, 21 Nov 2015 at 12:10:

> > (Btw, I don't think we should worry about changing the internal
> > representation of characters in Emacs, because make_multibyte_string
> > will be updated as needed.)
> >
> > This is a crucial point. If the internal encoding never changes, then we can
> > declare that those string parameters are expected to be in the internal
> > encoding.
>
> No, we cannot, or rather should not. It is unreasonable to expect
> external modules to know the intricacies of the internal
> representation. Most Emacs hackers don't.

Fine with me, but how would we then represent Emacs strings that are
not valid Unicode strings? Just raise an error?

> > But see the discussion in
> > https://github.com/aaptel/emacs-dynamic-module/issues/37: the comment in
> > mule-conf.el seems to indicate that the internal encoding is not stable.
>
> That discussion is about zero-copy access to Emacs buffer text and
> Emacs strings inside module code.

The encoding discussion is partly tied to that one, because the
encoding has to be specified before zero-copy access is even possible.

> Such access is indeed impossible
> without either knowing _something_ about the internal representation,
> or having additional APIs in emacs-module.c that allow modules such
> access while hiding the details of the internal representation. We
> could discuss extending the module functionality to include this.

Yes, there's no need for that in this subthread though.

> But that is a separate issue from what module_make_function and
> module_make_string do. These two functions are basic, and don't need
> to know about the internal representation or use it. While direct
> access to Emacs buffer text will be needed by only some modules,
> module_make_function will be used by all of them, and
> module_make_string by many.
>
> So I think we shouldn't conflate these two issues; they are separate.

OK.

> > This is what my comments were about. I think that you, by contrast,
> > are talking about the encoding of the _input_ strings, in this case
> > the 'documentation' argument to module_make_function and 'str'
> > argument to module_make_string. My assumption was that these
> > arguments will always have to be in UTF-8 encoding; if that assumption
> > is true, then no decoding via code_convert_string_norecord is
> > necessary, since make_multibyte_string will DTRT. We can (and
> > probably should) document the fact that all non-ASCII strings must be
> > UTF-8 encoded as a requirement of the emacs-module interface.
> >
> > Or rather, an extension to UTF-8 capable of encoding surrogate code points and
> > numbers that are not code points, as described in
> > https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html.
>
> No, I meant strict UTF-8, not its Emacs extension.

That would be possible and would provide a clean interface. However,
Emacs strings use the extended encoding, so we'd need to specify how
they interact with strict UTF-8 strings:

- If a module passes a char sequence that's not a valid UTF-8 string,
  but a valid Emacs multibyte string, what should happen? Error,
  undefined behavior, silently accepted?
- If copy_string_contents is passed an Emacs string that is not a valid
  Unicode string, what should happen? Error, or should the internal
  representation be silently leaked?

> > If it's stable, we can use make_multibyte_string; if not, we can
> > only use make_unibyte_string.
>
> If the arguments strings are in strict UTF-8, then
> make_multibyte_string will DTRT automagically, no matter what the
> internal representation is. That is their contract.

OK, then we can use that, of course. The question of handling invalid
UTF-8 strings is still open, though, as make_multibyte_string doesn't
enforce valid UTF-8.

If it's the contract of make_multibyte_string that it will always
accept UTF-8, then that should be added as a comment to that function.
Currently I don't see it documented anywhere.
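
To make the first open question above concrete (a module passing bytes
that are not valid UTF-8), here is a minimal sketch of what a strict
check could look like if we decide to reject such input. Note that
strict_utf8_p is a hypothetical helper name of my own; it is not
something that exists in emacs-module.c, and I'm not claiming this is
how Emacs should implement it. It only illustrates what "strict UTF-8"
would exclude: overlong forms, surrogate code points, and anything
above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of emacs-module.c.  Return true if the
   LEN bytes starting at STR form strict UTF-8: no overlong encodings,
   no surrogate code points, nothing above U+10FFFF.  */
static bool
strict_utf8_p (const unsigned char *str, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char c = str[i];
      size_t n;           /* number of continuation bytes expected */
      unsigned long cp;   /* code point decoded so far */
      unsigned long min;  /* smallest code point valid for this length */

      if (c < 0x80)
        {
          /* Plain ASCII byte.  */
          i++;
          continue;
        }
      else if ((c & 0xE0) == 0xC0)
        { n = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0)
        { n = 2; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0)
        { n = 3; cp = c & 0x07; min = 0x10000; }
      else
        return false;     /* stray continuation byte, or 0xF8..0xFF */

      if (len - i <= n)
        return false;     /* sequence truncated at end of input */

      for (size_t k = 1; k <= n; k++)
        {
          unsigned char cc = str[i + k];
          if ((cc & 0xC0) != 0x80)
            return false; /* not a continuation byte */
          cp = (cp << 6) | (cc & 0x3F);
        }

      if (cp < min)
        return false;     /* overlong encoding */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;     /* surrogate code point */
      if (cp > 0x10FFFF)
        return false;     /* beyond the Unicode range */

      i += n + 1;
    }
  return true;
}
```

Under this assumption, module_make_string could run such a check on
'str' before calling make_multibyte_string and signal an error when it
fails. Whether signaling an error is the right reaction, as opposed to
silently accepting the bytes, is exactly the part that still needs to
be decided.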