Eli Zaretskii <eliz@gnu.org> wrote on Sun, 22 Nov 2015 at 20:20:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sun, 22 Nov 2015 18:19:29 +0000
> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com,
> >  emacs-devel@gnu.org
> >
> >     I already suggested what we should say in the documentation: that
> >     these interfaces accept and produce UTF-8 encoded non-ASCII text.
> >
> >
> > If the interface accepts UTF-8, then it must signal an error for invalid
> > sequences; the Unicode standard mandates this.
>
> The Unicode standard cannot mandate anything for Emacs, because Emacs
> is not subject to Unicode standardization.
>

True, but I think we shouldn't make the terminology more confusing. If we
say "UTF-8", we should mean "UTF-8 as defined in the Unicode standard", not
the Emacs extension of UTF-8. That's all.
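
As a rough illustration of what "UTF-8 as defined in the Unicode standard"
implies, here is a minimal sketch in Python (not Emacs code; the helper name
is_strict_utf8 is made up for this example). Python's strict "utf-8" codec
rejects ill-formed sequences such as stray bytes, overlong forms, and encoded
surrogates, which is roughly what a strictly conforming interface would have
to do:

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        # errors="strict" raises on any ill-formed sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

With this definition, b"caf\xc3\xa9" is accepted, while a lone b"\xff", the
overlong form b"\xc0\x80", and the encoded surrogate b"\xed\xa0\x80" are all
rejected.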


>
> > If the interface produces UTF-8, then it must only ever produce valid
> > sequences
>
> As I explained, this would violate the basic expectation from a text
> editing program.
>
> > That's why I propose to not encode raw bytes as bytes, but as the Emacs
> > integer codes used to represent them.
>
> If we do that, no external code will be able to do anything useful
> with such "bytes".  Module authors will have to write their own
> replacements for library functions.  This will never be accepted by
> our users.
>

I wouldn't be so pessimistic, but I was convinced by consistency with
encode-coding-string. So yes, let's use the raw bytes (and document that).


>
> > If any byte sequence is accepted, then the behavior becomes more
> > complex. We need to exhaustively describe the behavior for any possible
> > byte sequence, otherwise module authors cannot make any assumption.
>
> We say that we accept valid UTF-8 encoded strings; anything else
> might produce invalid UTF-8 on output.
>

Couldn't we just say "it behaves as if encoding and decoding were done
using the utf-8-unix coding system"? Because I think that's what this boils
down to.
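
To make the "raw bytes are preserved" idea concrete, here is a sketch in
Python. It uses Python's surrogateescape error handler purely as an analogy
for a decoder that keeps undecodable bytes around and writes them back
unchanged on encoding; it is not how utf-8-unix is implemented in Emacs, and
roundtrip is a hypothetical helper name:

```python
def roundtrip(data: bytes) -> bytes:
    """Decode leniently, then encode again (hypothetical helper)."""
    # Bytes that are not valid UTF-8 are mapped to lone surrogates
    # (U+DC80..U+DCFF) instead of raising an error.
    text = data.decode("utf-8", errors="surrogateescape")
    # Encoding maps those surrogates back to the original bytes.
    return text.encode("utf-8", errors="surrogateescape")
```

Valid UTF-8 passes through unchanged, and invalid input comes back byte for
byte, so the output is valid UTF-8 only if the input was.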


>
> > No matter what we expect or tolerate, we need to state that.
>
> No, we don't.  When the callers violate the contract, they cannot
> expect to know in detail what will happen.  If they want to know, they
> will have to read the source.
>

So you want this to be unspecified or undefined behavior? That might be OK
(we already have that in several places), but we still need to state what
the contract is.


>
> > Module authors are not end users.
>
> They are users like anyone who writes Lisp.  They came to expect that
> Emacs behaves in certain ways, and modules should follow suit.
>
> > I agree that end users should not see errors on decoding failure,
> > but modules use only programmatic access, where we can be more
> > strict.
>
> You cannot be more strict, unless you rewrite the whole
> encoding/decoding machinery, or write specialized code to detect and
> reject invalid UTF-8 before it is passed to a decoder.  There are no
> good reasons to do either, so let's not.
>
> > An Emacs string is a sequence of integers.
>
> No, it's a sequence of bytes.
>

From
https://www.gnu.org/software/emacs/manual/html_node/elisp/String-Basics.html
:
"In Emacs Lisp, characters are simply integers ... A string is a fixed
sequence of characters"
How a string is represented internally shouldn't be the concern of module
authors.


>
> > I agree that we shouldn't add such limitations. But I disagree that we
> > should leave the behavior undocumented in such cases.
>
> OK, so let's agree to disagree.  If that disagreement gets in your way
> of fixing the issues related to this discussion, please say so, and I
> will fix them myself.
>
>
No, I will definitely fix it. I think our disagreement is much smaller than
it might look.