From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Philipp Stephani
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Sun, 22 Nov 2015 09:25:08 +0000
> Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
>
>     > Fine with me, but how would we then represent Emacs strings that are
>     > not valid Unicode strings? Just raise an error?
>
>     No need to raise an error. Strings that are returned to modules
>     should be encoded into UTF-8. That encoding already takes care of
>     these situations: it either produces the UTF-8 encoding of the
>     equivalent Unicode characters, or outputs raw bytes.
>
> Then we should document such a situation and give module authors a way to
> detect them.
I already suggested what we should say in the documentation: that
these interfaces accept and produce UTF-8 encoded non-ASCII text.

If the interface accepts UTF-8, then it must signal an error for
invalid sequences; the Unicode standard mandates this. If the
interface produces UTF-8, then it must only ever produce valid
sequences; this is again required by the Unicode standard.

> For example, what happens if a sequence of such raw bytes happens
> to be a valid UTF-8 sequence? Is there a way for module code to detect this
> situation?
How can you detect that if you are only given the byte stream?  You
can't.  You need some additional information to be able to distinguish
between these two alternatives.

That's why I propose to not encode raw bytes as bytes, but as the
Emacs integer codes used to represent them.

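
To illustrate the ambiguity, here is a minimal Python sketch (not Emacs code). The helper names are hypothetical, and the 0x3FFF00 offset reflects my understanding of how Emacs maps a raw byte 0x80..0xFF to an internal character code; treat it as an assumption. The bytes C3 A9 could be the text "é" or two raw bytes, and only the integer-code form keeps the two apart.

```python
# Assumption: Emacs represents a raw byte b (0x80..0xFF) internally as the
# character code 0x3FFF00 + b.  The helper names below are hypothetical.
RAW_BYTE_OFFSET = 0x3FFF00


def text_to_codes(text):
    """Represent genuine text as a list of Unicode code points."""
    return [ord(ch) for ch in text]


def raw_bytes_to_codes(data):
    """Represent raw bytes as integer codes; ASCII bytes map to themselves."""
    return [b if b < 0x80 else RAW_BYTE_OFFSET + b for b in data]


text = "\u00e9"            # the character e-acute
raw = bytes([0xC3, 0xA9])  # two raw bytes that happen to look like UTF-8

# As byte streams, the two are indistinguishable.
print(text.encode("utf-8") == raw)

# As integer codes, they are clearly different.
print([hex(c) for c in text_to_codes(text)])
print([hex(c) for c in raw_bytes_to_codes(raw)])
```

With a byte stream alone, the receiver has nothing to tell the two cases apart; the list of integers carries that extra information.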
Look, an Emacs module _must_ support non-ASCII text, otherwise it
would be severely limited, to say the least.

Absolutely!

Having interfaces that
accept and produce UTF-8 encoded strings is the simplest complete
solution to this problem.  So we must at least support that much.

Agreed.

Supporting strings of raw bytes is also possible, probably even
desirable, but it's an extension, something that would be required
much more rarely.  Such strings cannot be meaningfully treated as
text: you cannot ask if some byte is an upper-case or lower-case
letter, you cannot display such strings as readable text, you cannot
count characters in them, etc.  Such strings are useful for a limited
number of specialized jobs, and handling them in Lisp requires some
caution, because if you treat them as normal text strings, you get
surprises.

Yes. However, without an interface they are awkward to produce.

So let's solve the more important issues first, and talk about
extensions later.  The more important issue is how can a module pass
to Emacs non-ASCII text and get back non-ASCII text.  And the answer
to that is to use UTF-8 encoded strings.

Full agreement.

>     We are quite capable of quietly accepting such strings, so that is
>     what I would suggest. Doing so would be in line with what Emacs does
>     when such invalid sequences come from other sources, like files.
>
> If we accept such strings, then we should document what the extensions are.
> - Are UTF-8-like sequences encoding surrogate code points accepted?
> - Are UTF-8-like sequences encoding integers outside the Unicode codespace
> accepted?
> - Are non-shortest forms accepted?
> - Are other invalid code unit sequences accepted?
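
The four questions above can be made concrete with one sample byte sequence per category. This is only a sketch in Python, whose strict UTF-8 decoder is used as a stand-in for "accepts UTF-8 and nothing else"; the function name and the sample table are hypothetical and say nothing about what Emacs itself does with these inputs.

```python
def is_strict_utf8(data):
    """Return True only if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# One sample per question in the list; all are ill-formed UTF-8.
SAMPLES = {
    "surrogate code point U+D800": b"\xed\xa0\x80",
    "integer beyond U+10FFFF": b"\xf4\x90\x80\x80",
    "non-shortest form of '/'": b"\xc0\xaf",
    "other invalid code unit (0xFF)": b"\xff",
}

for label, data in SAMPLES.items():
    print(label, "->", "accepted" if is_strict_utf8(data) else "rejected")

# Well-formed input for comparison.
print("valid text ->", is_strict_utf8("\u00e9".encode("utf-8")))
```

A strict decoder answers "no" to all four questions; an interface that answers "yes" to any of them accepts a superset of UTF-8, and that superset is what would need documenting.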
_Anything_ can be accepted.  _Any_ byte sequence.  Emacs will cope.

Not if they accept UTF-8. The Unicode standard rules out accepting
invalid byte sequences. If any byte sequence is accepted, then the
behavior becomes more complex. We need to exhaustively describe the
behavior for any possible byte sequence, otherwise module authors
cannot make any assumption.

The perpetrator will probably get back after processing a string that
is not entirely human-readable, or its processing will sometimes
produce surprises, like if the string is lower-cased.  But nothing bad
will happen to Emacs, it won't crash and won't garble its display.
Moreover, just passing such a string to Emacs, then outputting it back
without any changes will produce an exact copy of the input, which is
quite a feat, considering that the input was "invalid".
If you want to see what "bad" things can happen, take a Latin-1
encoded FILE and visit it with "C-x RET c utf-8 RET C-x C-f FILE RET".
Then play with the buffer a while.  This is what happens when Emacs is
told the text is in UTF-8, when it really isn't.  There's no
catastrophe, but the luser who does that might be amply punished, at
the very least she will not see the letters she expects.  However, if
you save such a buffer to a file, using UTF-8, you will get the same
Latin-1 encoded text as was there originally.
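
The behavior described above (tolerant decoding, surprises during text processing, exact round trip on output) can be imitated outside Emacs. The sketch below uses Python's "surrogateescape" error handler purely as an analogy; it is not how Emacs implements this, and the two helper names are hypothetical.

```python
def tolerant_decode(data):
    """Decode as UTF-8, smuggling undecodable bytes through as lone surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


def tolerant_encode(text):
    """Inverse of tolerant_decode: smuggled bytes come back out unchanged."""
    return text.encode("utf-8", errors="surrogateescape")


# "CAFE" with an upper-case E-acute, encoded as Latin-1: not valid UTF-8.
latin1_bytes = "CAF\u00c9".encode("latin-1")

text = tolerant_decode(latin1_bytes)

# No error is raised, but the last "character" is not the expected letter.
print(text == "CAF\u00c9")

# Unchanged text round-trips to an exact copy of the "invalid" input.
print(tolerant_encode(text) == latin1_bytes)

# Text processing gives a surprise: the smuggled byte is not lower-cased,
# so the result still ends in the byte for the upper-case letter.
print(tolerant_encode(text.lower()))
```

Nothing crashes and nothing is lost, but case conversion silently skips the byte that was never really a letter, which is the kind of surprise the paragraph above refers to.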
Now, given such resilience, why do we need to raise an error?

The Unicode standard says so. If we document that *a superset of
UTF-8* is accepted, then we don't need to raise an error. So I'd
suggest we do exactly that, but describe what that superset is.

> If the answer to any of these is "yes", we can't say we accept UTF-8, because
> we don't.
We _expect_ UTF-8, and if given that, will produce known, predictable
results when the string is processed as text.  We can _tolerate_
violations, resulting in somewhat surprising behavior, if such a text
is treated as "normal" human-readable text.  (If the module knows what
it does, and really means to work with raw bytes, then Emacs will do
what the module expects, and produce raw bytes on output, as
expected.)

No matter what we expect or tolerate, we need to state that. If all
byte sequences are accepted, then we also need to state that, but
describe what the behavior is if there are invalid UTF-8 sequences in
the input.

> Rather we should say what is actually accepted.
Saying that is meaningless in this case, because we can accept
anything.  _If_ the module wants the string it passes to be processed
as human-readable text that consists of recognizable characters, then
the module should _only_ pass valid UTF-8 sequences.  But raising
errors upon detecting violations was discovered long ago to be a bad
idea that users resented.  So we don't, and neither should the module
API.

Module authors are not end users. I agree that end users should not
see errors on decoding failure, but modules use only programmatic
access, where we can be more strict.

>     > * If copy_string_contents is passed an Emacs string that is not a
>     > valid Unicode string, what should happen?
>
>     How can that happen? The Emacs string comes from the Emacs bowels, so
>     it must be "valid" string by Emacs standards. Or maybe I don't
>     understand what you mean by "invalid Unicode string".
>
> A sequence of integers where at least one element is not a Unicode scalar
> value.
Emacs doesn't store characters as scalar Unicode values, so this
doesn't really explain to me your concept of a "valid Unicode string".

An Emacs string is a sequence of integers. It doesn't have to be a
sequence of scalar values.

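
Spelled out, the definition being used here is: a string is a sequence of integers, and it is a valid Unicode string only if every element is a Unicode scalar value (0..0x10FFFF, excluding the surrogate range 0xD800..0xDFFF). A minimal Python sketch of that check follows; the function names are hypothetical, and the 0x3FFF80 sample assumes that Emacs uses codes in that area for raw bytes.

```python
def is_scalar_value(code):
    """True if code is a Unicode scalar value (any code point but a surrogate)."""
    return 0 <= code <= 0x10FFFF and not (0xD800 <= code <= 0xDFFF)


def is_valid_unicode_string(codes):
    """True if every integer in the sequence is a Unicode scalar value."""
    return all(is_scalar_value(c) for c in codes)


print(is_valid_unicode_string([0x48, 0xE9]))      # ordinary text
print(is_valid_unicode_string([0x48, 0xD800]))    # contains a surrogate
print(is_valid_unicode_string([0x48, 0x3FFF80]))  # assumed raw-byte code
```

Only the first sequence can be encoded as well-formed UTF-8; the question in this thread is what copy_string_contents should do with the other two.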
>     In any case, we already deal with any such problems when we save a
>     buffer to a file, or send it over the network. This isn't some new
>     problem we need to cope with.
>
> Yes, but the module interface is new, it doesn't necessarily have to have the
> same behavior.
Of course, it does!  Modules are Emacs extensions, so the interface
should support the same features that core Emacs does.  Why?  Because
there are no limits to what creative minds can do with this feature,
so we should not artificially impose such limitations where we have
sound, time-proven infrastructure that doesn't need them.

I agree that we shouldn't add such limitations. But I disagree that we
should leave the behavior undocumented in such cases.

> If we say we emit only UTF-8, then we should do so.
We emit only valid UTF-8, provided that its source (if it came from a
module) was valid UTF-8.

Then in turn we shouldn't say we emit only UTF-8.