Eli Zaretskii wrote on Sat, 23 Dec 2017 at 15:52:

> > From: Philipp Stephani
> > Date: Sat, 23 Dec 2017 14:29:56 +0000
> > Cc: emacs-devel@gnu.org, phst@google.com
> >
> > OK, but why do we need external functions for doing that? What is
> > missing in our own code to detect such a situation?
> >
> > Not much I think, it's just easiest to use Gnulib functions because
> > they are well-documented, have a clean interface, and are probably
> > bug-free.
> > coding.c has check_utf_8, which is quite similar, but has an
> > incompatible interface (it takes struct coding_system objects) and
> > also checks for embedded newlines, which isn't necessary here.
>
> So let's use check_utf_8, as its downsides don't sound serious to me,
>

Well, it needs to be rewritten significantly to take char * and length
arguments instead of the coding_system struct.

> and OTOH using unistring functions will bloat Emacs

u8-check.c is just 77 LoC (including all boilerplate, comments, and
empty lines), so I don't think it blows up Emacs in any significant way.

> for the benefit of
> a single use case, not to mention create two different methods for
> doing the same job, which IMO is even more confusing to any newcomer
> to the Emacs internals.
>

Agreed, it's somewhat confusing, but I think not too much. The two
functions have quite different use cases: check_utf_8 is a specialized
function that requires a coding system with significant set-up and is
only used once (in decode_coding_gap), while u8_check is a
general-purpose function.
Having not much experience with coding.c, I find the functions in that
file much more confusing and harder to understand than the ones from
libunistring. The libunistring functions tend to have a single, clear
purpose, while the coding.c functions often do many different things at
once.

>
> Btw, doesn't find_charsets_in_text do the same job cleaner and
> quicker? AFAIU, all you need is make sure there are no characters
> from the 2 eight-bit-* charsets in the text, or did I miss something?
>

What I need to check is one of the following:
- Is the initial string either a well-formed UTF-8 unibyte string, or a
  multibyte string that represents a Unicode scalar value sequence?
- Is the encoded string a well-formed UTF-8 unibyte string?
Given my understanding of the implementation of coding.c, these two
criteria should be equivalent. (Unfortunately that doesn't seem to be
documented.) So I choose to implement the second check, which is easier
and allows delaying the check until we know we have to signal an error.
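
To make the second check concrete, here is a rough, self-contained sketch of a well-formedness test with a (pointer, length) interface. It is neither the Gnulib u8-check.c code nor check_utf_8 from coding.c. The name utf8_check_sketch is made up for this illustration, and the return convention (NULL when the input is well-formed, otherwise a pointer to the first offending byte) is only modeled on the way u8_check is documented in libunistring. The byte ranges follow the Unicode Standard's table of well-formed UTF-8 byte sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: return NULL if the N bytes
   at S are well-formed UTF-8, else a pointer to the first byte of the
   first ill-formed sequence.  Unlike check_utf_8, this needs no
   coding_system object and does not look at newlines.  */
const unsigned char *
utf8_check_sketch (const unsigned char *s, size_t n)
{
  assert (s != NULL || n == 0);

  size_t i = 0;
  while (i < n)
    {
      unsigned char c = s[i];
      size_t len;
      /* Allowed range of the second byte; narrowed for some lead bytes.  */
      unsigned char lo = 0x80, hi = 0xBF;

      if (c < 0x80)
        {
          i++;                  /* ASCII */
          continue;
        }
      else if (c >= 0xC2 && c <= 0xDF)
        len = 2;                /* 0xC0 and 0xC1 would be overlong */
      else if (c >= 0xE0 && c <= 0xEF)
        {
          len = 3;
          if (c == 0xE0)
            lo = 0xA0;          /* reject overlong 3-byte forms */
          if (c == 0xED)
            hi = 0x9F;          /* reject surrogates U+D800..U+DFFF */
        }
      else if (c >= 0xF0 && c <= 0xF4)
        {
          len = 4;
          if (c == 0xF0)
            lo = 0x90;          /* reject overlong 4-byte forms */
          if (c == 0xF4)
            hi = 0x8F;          /* reject values above U+10FFFF */
        }
      else
        return s + i;           /* stray continuation or invalid lead byte */

      if (n - i < len)
        return s + i;           /* sequence truncated at end of input */
      if (s[i + 1] < lo || s[i + 1] > hi)
        return s + i;
      for (size_t k = 2; k < len; k++)
        if ((s[i + k] & 0xC0) != 0x80)
          return s + i;         /* not a continuation byte */
      i += len;
    }
  return NULL;
}
```

A caller in the position described above would run this on the encoded unibyte string only once it already knows an error may have to be signaled, and treat a non-NULL result as "not a well-formed UTF-8 string".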