From: Philipp Stephani
Newsgroups: gmane.emacs.devel
Subject: Re: Dynamic loading progress
Date: Sun, 22 Nov 2015 18:19:29 +0000
To: Eli Zaretskii
Cc: aurelien.aptel+emacs@gmail.com, tzz@lifelogs.com, emacs-devel@gnu.org
Eli Zaretskii <eliz@gnu.org> wrote on Sun, 22 Nov 2015 at 18:35:

> > From: Philipp Stephani
> > Date: Sun, 22 Nov 2015 09:25:08 +0000
> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
> >
> >      > Fine with me, but how would we then represent Emacs strings
> >      > that are not valid Unicode strings? Just raise an error?
> >      No need to raise an error. Strings that are returned to modules
> >      should be encoded into UTF-8. That encoding already takes care of
> >      these situations: it either produces the UTF-8 encoding of the
> >      equivalent Unicode characters, or outputs raw bytes.
> >
> > Then we should document such situations and give module authors a way
> > to detect them.
>
> I already suggested what we should say in the documentation: that
> these interfaces accept and produce UTF-8 encoded non-ASCII text.

If the interface accepts UTF-8, then it must signal an error for invalid
sequences; the Unicode standard mandates this. If the interface produces
UTF-8, then it must only ever produce valid sequences; this, too, is
required by the Unicode standard.

> > For example, what happens if a sequence of such raw bytes happens
> > to be a valid UTF-8 sequence? Is there a way for module code to
> > detect this situation?
>
> How can you detect that if you are only given the byte stream?  You
> can't.  You need some additional information to be able to distinguish
> between these two alternatives.

That's why I propose not to encode raw bytes as bytes, but as the Emacs
integer codes used to represent them.

> Look, an Emacs module _must_ support non-ASCII text, otherwise it
> would be severely limited, to say the least.

Absolutely!

> Having interfaces that
> accept and produce UTF-8 encoded strings is the simplest complete
> solution to this problem.  So we must at least support that much.

Agreed.

> Supporting strings of raw bytes is also possible, probably even
> desirable, but it's an extension, something that would be required
> much more rarely.  Such strings cannot be meaningfully treated as
> text: you cannot ask if some byte is an upper-case or lower-case
> letter, you cannot display such strings as readable text, you cannot
> count characters in them, etc.
> Such strings are useful for a limited number
> of specialized jobs, and handling them in Lisp requires some caution,
> because if you treat them as normal text strings, you get surprises.

Yes. However, without an interface they are awkward to produce.

> So let's solve the more important issues first, and talk about
> extensions later.  The more important issue is how a module can pass
> non-ASCII text to Emacs and get non-ASCII text back.  And the answer
> to that is to use UTF-8 encoded strings.

Full agreement.

> >      We are quite capable of quietly accepting such strings, so that is
> >      what I would suggest. Doing so would be in line with what Emacs does
> >      when such invalid sequences come from other sources, like files.
> >
> > If we accept such strings, then we should document what the extensions are:
> > - Are UTF-8-like sequences encoding surrogate code points accepted?
> > - Are UTF-8-like sequences encoding integers outside the Unicode
> >   codespace accepted?
> > - Are non-shortest forms accepted?
> > - Are other invalid code unit sequences accepted?
>
> _Anything_ can be accepted.  _Any_ byte sequence.  Emacs will cope.

Not if they accept UTF-8. The Unicode standard rules out accepting
invalid byte sequences. If any byte sequence is accepted, then the
behavior becomes more complex: we need to describe the behavior
exhaustively for every possible byte sequence, otherwise module authors
cannot make any assumptions.

> The perpetrator will probably get back after processing a string that
> is not entirely human-readable, or its processing will sometimes
> produce surprises, like if the string is lower-cased.  But nothing bad
> will happen to Emacs, it won't crash and won't garble its display.
> Moreover, just passing such a string to Emacs, then outputting it back
> without any changes will produce an exact copy of the input, which is
> quite a feat, considering that the input was "invalid".
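To make the four questions above concrete: a decoder that is strict in the
sense of the Unicode standard rejects all four cases. The sketch below is
illustrative only, written by the editor and not part of any proposed module
API (the function name is made up); it shows what "accepting only UTF-8"
would mean in practice:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if the NBYTES bytes at S are valid UTF-8 in the strict
   sense of the Unicode standard: no surrogates (U+D800..U+DFFF), no
   code points above U+10FFFF, no non-shortest ("overlong") forms, and
   no stray lead or continuation bytes.  Illustrative sketch only.  */
static bool
valid_utf8_p (const uint8_t *s, size_t nbytes)
{
  size_t i = 0;
  while (i < nbytes)
    {
      uint8_t b = s[i];
      size_t len;
      uint32_t cp, min;
      if (b < 0x80)      { i++; continue; }     /* ASCII */
      else if (b < 0xC2) return false;          /* stray continuation, or
                                                   overlong 2-byte lead */
      else if (b < 0xE0) { len = 2; cp = b & 0x1F; min = 0x80; }
      else if (b < 0xF0) { len = 3; cp = b & 0x0F; min = 0x800; }
      else if (b < 0xF5) { len = 4; cp = b & 0x07; min = 0x10000; }
      else               return false;          /* 0xF5..0xFF never occur */
      if (nbytes - i < len)
        return false;                           /* truncated sequence */
      for (size_t j = 1; j < len; j++)
        {
          if ((s[i + j] & 0xC0) != 0x80)
            return false;                       /* not a continuation byte */
          cp = (cp << 6) | (s[i + j] & 0x3F);
        }
      if (cp < min)
        return false;                           /* non-shortest form */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;                           /* surrogate code point */
      if (cp > 0x10FFFF)
        return false;                           /* outside the codespace */
      i += len;
    }
  return true;
}
```

Every byte sequence either passes all of these checks or is rejected; the
whole disagreement is about whether the rejected ones should signal an
error or be tolerated with documented behavior.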
> If you want to see what "bad" things can happen, take a Latin-1
> encoded FILE and visit it with "C-x RET c utf-8 RET C-x C-f FILE RET".
> Then play with the buffer a while.  This is what happens when Emacs is
> told the text is in UTF-8, when it really isn't.  There's no
> catastrophe, but the luser who does that might be amply punished; at
> the very least she will not see the letters she expects.  However, if
> you save such a buffer to a file, using UTF-8, you will get the same
> Latin-1 encoded text as was there originally.
>
> Now, given such resilience, why do we need to raise an error?

The Unicode standard says so. If we document that *a superset of UTF-8*
is accepted, then we don't need to raise an error. So I'd suggest we do
exactly that, but describe what that superset is.

> > If the answer to any of these is "yes", we can't say we accept UTF-8,
> > because we don't.
>
> We _expect_ UTF-8, and if given that, will produce known, predictable
> results when the string is processed as text.  We can _tolerate_
> violations, resulting in somewhat surprising behavior, if such a text
> is treated as "normal" human-readable text.  (If the module knows what
> it does, and really means to work with raw bytes, then Emacs will do
> what the module expects, and produce raw bytes on output, as
> expected.)

No matter what we expect or tolerate, we need to state it. If all byte
sequences are accepted, then we also need to state that, and describe
what the behavior is when there are invalid UTF-8 sequences in the
input.

> > Rather we should say what is actually accepted.
>
> Saying that is meaningless in this case, because we can accept
> anything.  _If_ the module wants the string it passes to be processed
> as human-readable text that consists of recognizable characters, then
> the module should _only_ pass valid UTF-8 sequences.  But raising
> errors upon detecting violations was discovered long ago to be a bad
> idea that users resented.
> So we don't, and neither should the module API.

Module authors are not end users. I agree that end users should not see
errors on decoding failure, but modules use only programmatic access,
where we can be stricter.

> >      > * If copy_string_contents is passed an Emacs string that is
> >      > not a valid Unicode string, what should happen?
> >
> >      How can that happen? The Emacs string comes from the Emacs bowels, so
> >      it must be a "valid" string by Emacs standards. Or maybe I don't
> >      understand what you mean by "invalid Unicode string".
> >
> > A sequence of integers where at least one element is not a Unicode
> > scalar value.
>
> Emacs doesn't store characters as scalar Unicode values, so this
> doesn't really explain to me your concept of a "valid Unicode string".

An Emacs string is a sequence of integers. It doesn't have to be a
sequence of scalar values.

> >      In any case, we already deal with any such problems when we save a
> >      buffer to a file, or send it over the network. This isn't some new
> >      problem we need to cope with.
> >
> > Yes, but the module interface is new, it doesn't necessarily have to
> > have the same behavior.
>
> Of course it does!  Modules are Emacs extensions, so the interface
> should support the same features that core Emacs does.  Why? Because
> there are no limits to what creative minds can do with this feature, so
> we should not artificially impose such limitations where we have
> sound, time-proven infrastructure that doesn't need them.

I agree that we shouldn't add such limitations. But I disagree that we
should leave the behavior undocumented in such cases.

> > If we say we emit only UTF-8, then we should do so.
>
> We emit only valid UTF-8, provided that its source (if it came from a
> module) was valid UTF-8.

Then in turn we shouldn't say we emit only UTF-8.
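[Editor's note] The two notions debated above, "Unicode scalar value" and
"Emacs integer codes for raw bytes", can be stated in a few lines. The
constants below reflect Emacs's internal character representation as the
editor understands it (character codes up to #x3FFFFF, raw bytes 0x80..0xFF
stored as #x3FFF80..#x3FFFFF); the helper names are invented for
illustration, not taken from any Emacs source:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed constants mirroring Emacs's character representation.  */
#define MAX_CHAR     0x3FFFFF  /* largest Emacs character code */
#define BYTE8_OFFSET 0x3FFF00  /* raw byte B is stored as BYTE8_OFFSET + B */

/* A Unicode scalar value is any code point in [0, 0x10FFFF] except the
   surrogates U+D800..U+DFFF.  Only scalar values have a UTF-8 encoding,
   which is why a string containing anything else cannot round-trip
   through a strict UTF-8 interface.  */
static bool
scalar_value_p (uint32_t c)
{
  return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

/* Emacs represents the raw bytes 0x80..0xFF as the character codes
   0x3FFF80..0x3FFFFF, well outside the Unicode codespace.  Encoding
   these codes instead of the bytes themselves lets a module distinguish
   raw bytes from byte sequences that merely look like valid UTF-8.  */
static bool
raw_byte_char_p (uint32_t c)
{
  return c >= BYTE8_OFFSET + 0x80 && c <= MAX_CHAR;
}
```

Under this reading, "an Emacs string that is not a valid Unicode string"
is simply one whose elements do not all satisfy scalar_value_p, and
Philipp's proposal is to report raw bytes via their raw_byte_char_p codes
rather than smuggling them into the byte stream.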