Eli Zaretskii wrote on Sun, 22 Nov 2015 at 20:20:

> > From: Philipp Stephani
> > Date: Sun, 22 Nov 2015 18:19:29 +0000
> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
> >
> >     I already suggested what we should say in the documentation: that
> >     these interfaces accept and produce UTF-8 encoded non-ASCII text.
> >
> >
> > If the interface accepts UTF-8, then it must signal an error for invalid
> > sequences; the Unicode standard mandates this.
>
> The Unicode standard cannot mandate anything for Emacs, because Emacs
> is not subject to Unicode standardization.
>

True, but I think we shouldn't make the terminology more confusing. If we say "UTF-8", we should mean "UTF-8 as defined in the Unicode standard", not the Emacs extension of UTF-8. That's all.

>
> > If the interface produces UTF-8, then it must only ever produce valid
> > sequences
>
> As I explained, this would violate the basic expectation from a text
> editing program.
>
> > That's why I propose to not encode raw bytes as bytes, but as the Emacs integer
> > codes used to represent them.
>
> If we do that, no external code will be able to do anything useful
> with such "bytes". Module authors will have to write their own
> replacements for library functions. This will never be accepted by
> our users.
>

I wouldn't be so pessimistic, but I was convinced by consistency with encode-coding-string. So yes, let's use the raw bytes (and document that).

>
> > If any byte sequence is accepted, then the behavior becomes more complex. We
> > need to exhaustively describe the behavior for any possible byte sequence,
> > otherwise module authors cannot make any assumption.
>
> We say that we accept valid UTF-8 encoded strings; anything else
> might produce invalid UTF-8 on output.
>

Couldn't we just say "it behaves as if encoding and decoding were done using the utf-8-unix coding system"? Because I think that's what this boils down to.
>
> > No matter what we expect or tolerate, we need to state that.
>
> No, we don't. When the callers violate the contract, they cannot
> expect to know in detail what will happen. If they want to know, they
> will have to read the source.
>

So you want this to be unspecified or undefined behavior? That might be OK (we already have that in several places), but we still need to state what the contract is.

>
> > Module authors are not end users.
>
> They are users like anyone who writes Lisp. They came to expect that
> Emacs behaves in certain ways, and modules should follow suit.
>
> > I agree that end users should not see errors on decoding failure,
> > but modules use only programmatic access, where we can be more
> > strict.
>
> You cannot be more strict, unless you rewrite the whole
> encoding/decoding machinery, or write specialized code to detect and
> reject invalid UTF-8 before it is passed to a decoder. There are no
> good reasons to do either, so let's not.
>
> > An Emacs string is a sequence of integers.
>
> No, it's a sequence of bytes.
>

From https://www.gnu.org/software/emacs/manual/html_node/elisp/String-Basics.html :
"In Emacs Lisp, characters are simply integers ... A string is a fixed sequence of characters"
How a string is represented internally shouldn't be the concern of module authors.

>
> > I agree that we shouldn't add such limitations. But I disagree that we should
> > leave the behavior undocumented in such cases.
>
> OK, so let's agree to disagree. If that disagreement gets in your way
> of fixing the issues related to this discussion, please say so, and I
> will fix them myself
>

No, I will definitely fix it. I think our disagreement is way smaller than it might look like.
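
For illustration, the "specialized code to detect and reject invalid UTF-8 before it is passed to a decoder" mentioned above might look roughly like the following C sketch. It is not Emacs code and not a proposal for the module API. The name utf8_valid is hypothetical, and the check is for strict UTF-8 as defined in the Unicode standard, so stray raw bytes are rejected as well.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Emacs.  Return true if the LEN bytes
   at S are well-formed UTF-8 in the Unicode sense: no overlong forms,
   no surrogates, nothing above U+10FFFF.  */
bool
utf8_valid (const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char b = s[i];
      size_t n;          /* number of continuation bytes expected */
      uint32_t cp;       /* code point being assembled */
      uint32_t min;      /* smallest code point allowed for this length */

      if (b < 0x80)
        {
          i++;           /* plain ASCII */
          continue;
        }
      else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
      else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
      else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
      else
        return false;    /* stray continuation byte, or 0xF8..0xFF */

      if (len - i <= n)
        return false;    /* sequence is cut off at the end of the input */

      for (size_t k = 1; k <= n; k++)
        {
          unsigned char c = s[i + k];
          if ((c & 0xC0) != 0x80)
            return false;               /* not a continuation byte */
          cp = (cp << 6) | (c & 0x3F);
        }

      if (cp < min)
        return false;                   /* overlong encoding */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;                   /* UTF-16 surrogate */
      if (cp > 0x10FFFF)
        return false;                   /* beyond the Unicode range */

      i += n + 1;
    }
  return true;
}
```

A caller wanting the strict behavior would run this over the buffer first and signal an error on false. Whether that extra pass is worth having is exactly the point under dispute above.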