From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Philipp Stephani <p.stephani2@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: Dynamic loading progress
Date: Sun, 22 Nov 2015 18:19:29 +0000
Message-ID: <CAArVCkROBfCxh1qcSW9ApP-6m60YFyMR7H3W0xZ_rkYauF8umg@mail.gmail.com>
References: <CA+5B0FOuWbpBUTsrE4tzzoLxACPQ-mgxx7zJKyW2LR77QRM=Ug@mail.gmail.com>
	<83k2ptq5t3.fsf@gnu.org> <87h9kxx60e.fsf@lifelogs.com>
	<CA+5B0FPp9nYEmoyDLrutJpcOZBtpV9kxx7LdPqrsj26rnj11qA@mail.gmail.com>
	<CAArVCkS515CVbS1UfavFGAq0dGO=e_mGftMbhF_eBw3SSu3Xjg@mail.gmail.com>
	<877flswse5.fsf@lifelogs.com>
	<CAArVCkT0M8o4MDP1RaP-r9JqumoQaMbhANRrycSEyyCj+mqUcA@mail.gmail.com>
	<8737wgw7kf.fsf@lifelogs.com>
	<CA+5B0FOGrn01XZzKJvXdWLPL62ONUzoEBfQRwLiKqLmd6Ta3RA@mail.gmail.com>
	<87io5bv1it.fsf@lifelogs.com>
	<CA+5B0FOp8Ub1+V_2G4CC1r2aG1hLKmZdSic59MfOy=9QoovSRQ@mail.gmail.com>
	<87egfzuwca.fsf@lifelogs.com>
	<CAArVCkSEHxSd3X2PnEvRJk5n1wOR0y9neU7AxGYEHSqKRG+y3Q@mail.gmail.com>
	<876118u6f2.fsf@lifelogs.com>
	<CA+5B0FPz-vo+Y=38=21jRQuEHANzFG_cf3tPDiwEbK2TO4+JdA@mail.gmail.com>
	<CA+5B0FNW48d3S5CJfxHK9HHVHPmuYqaT3K9tn5MVTgv_qas5Rw@mail.gmail.com>
	<ryhmvud820v.fsf@dod.no>
	<CA+5B0FMU1Ry6mRSinyV5Ar8DaL4VciEUEbTe1NcXZUQ2-4y4TA@mail.gmail.com>
	<8737w3qero.fsf@lifelogs.com> <831tbn9g9j.fsf@gnu.org>
	<878u5upw7o.fsf@lifelogs.com>
	<83ziya8xph.fsf@gnu.org> <83y4du80xo.fsf@gnu.org>
	<CAArVCkTwVbA58_wfj7O-Et83M8YJ9jfpCKhYn466BYO8T2cG0A@mail.gmail.com>
	<837fld6lps.fsf@gnu.org>
	<CAArVCkSTdg=EjSiN69TqLoH_ufkz_vzV6qLKNae2QbEXadYomg@mail.gmail.com>
	<83si3z4s5n.fsf@gnu.org>
	<CAArVCkQ0qUTUr5GZ+xmCub2tEWc0YzFKRsHEN-FFv3ioAc2n0w@mail.gmail.com>
	<83mvu74nhm.fsf@gnu.org>
	<CAArVCkR+LqXPbHnWKW+2FQ61z+AyWR6ThBAb5ens=mwN+rS_mQ@mail.gmail.com>
	<83d1v34hba.fsf@gnu.org>
	<CAArVCkRBF7+yJcFiYA6KmZzKp5EGP6iauQ=0hkH5KJZbMRH7LA@mail.gmail.com>
	<83io4u2aze.fsf@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=001a114b07d4ddf5780525252681
X-Trace: ger.gmane.org 1448216413 28787 80.91.229.3 (22 Nov 2015 18:20:13 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sun, 22 Nov 2015 18:20:13 +0000 (UTC)
Cc: aurelien.aptel+emacs@gmail.com, tzz@lifelogs.com, emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun Nov 22 19:20:11 2015
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1a0ZF5-0001cz-Nt
	for ged-emacs-devel@m.gmane.org; Sun, 22 Nov 2015 19:20:04 +0100
Original-Received: from localhost ([::1]:56998 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1a0ZF5-0000RL-Nn
	for ged-emacs-devel@m.gmane.org; Sun, 22 Nov 2015 13:20:03 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:49430)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1a0ZEo-0000RA-4E
	for emacs-devel@gnu.org; Sun, 22 Nov 2015 13:19:48 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1a0ZEl-0000XG-NE
	for emacs-devel@gnu.org; Sun, 22 Nov 2015 13:19:46 -0500
Original-Received: from mail-wm0-x230.google.com ([2a00:1450:400c:c09::230]:38868)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>)
	id 1a0ZEi-0000We-E3; Sun, 22 Nov 2015 13:19:40 -0500
Original-Received: by wmec201 with SMTP id c201so79225000wme.1;
	Sun, 22 Nov 2015 10:19:39 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:references:in-reply-to:from:date:message-id:subject:to
	:cc:content-type;
	bh=sF7LS9qzsWwViJAnwO2Lf8HNiyozdDCsfnwJARwJzQ0=;
	b=YR1Tid9Fc85qR4mXcvJuftpRtoAWDr5k6YYlzOrUB2bQqkd6eV4kTizTn+F0rcIa8x
	vE8wVCe6GZi0nfrV2jnj5BMwUsUPiEy1aLXeJKf/TQxmR2PZrpFaclKVrfCE0ns0i/b/
	FyWeJWiD9SBmcJKSbxMmEeuAvLe4CwcUK8M4/kAlV81mZhr4ze3rr3bEb81L1PhO3OUz
	oVG1VYH0wJfkmsZGQc+zVBLFQV4CPfQjNRnLHV5G8ITUNHSI2ELKDigccUsNx3A3N6D4
	rPcVbPrHJhNQudCdlOMi+pFMFjjTaUApJSmJZiV4nE8szjNnEkpK1pAK+zuipGy5YFIj
	u9Uw==
X-Received: by 10.28.187.4 with SMTP id l4mr11747573wmf.33.1448216379648; Sun,
	22 Nov 2015 10:19:39 -0800 (PST)
In-Reply-To: <83io4u2aze.fsf@gnu.org>
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
	(bad octet value).
X-Received-From: 2a00:1450:400c:c09::230
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:195045
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/195045>

--001a114b07d4ddf5780525252681
Content-Type: text/plain; charset=UTF-8

Eli Zaretskii <eliz@gnu.org> wrote on Sun, 22 Nov 2015 at 18:35:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sun, 22 Nov 2015 09:25:08 +0000
> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
> >
> >     > Fine with me, but how would we then represent Emacs strings that are
> >     > not valid Unicode strings? Just raise an error?
> >
> >     No need to raise an error. Strings that are returned to modules
> >     should be encoded into UTF-8. That encoding already takes care of
> >     these situations: it either produces the UTF-8 encoding of the
> >     equivalent Unicode characters, or outputs raw bytes.
> >
> > Then we should document such a situation and give module authors a way to
> > detect them.
>
> I already suggested what we should say in the documentation: that
> these interfaces accept and produce UTF-8 encoded non-ASCII text.
>

If the interface accepts UTF-8, then it must signal an error for invalid
sequences; the Unicode standard mandates this.
If the interface produces UTF-8, then it must only ever produce valid
sequences; this is again required by the Unicode standard.
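
To make "valid sequences" concrete, here is a minimal sketch of the check a
strict interface would have to perform. The function name is hypothetical
and is not part of emacs-module.h; it only illustrates the well-formedness
rules (no non-shortest forms, no surrogates, nothing above U+10FFFF).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Emacs module API.
   Returns true iff buf[0..len) is well-formed UTF-8. */
bool utf8_is_well_formed(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char b0 = buf[i];
        size_t need;       /* number of continuation bytes expected */
        uint32_t cp, min;  /* decoded value and smallest value for this length */

        if (b0 < 0x80) { i++; continue; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { need = 1; cp = b0 & 0x1F; min = 0x80; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { need = 2; cp = b0 & 0x0F; min = 0x800; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { need = 3; cp = b0 & 0x07; min = 0x10000; }
        else return false; /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */

        if (len - i - 1 < need) return false;          /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false;      /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                    /* non-shortest form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; /* surrogate */
        if (cp > 0x10FFFF) return false;               /* outside the codespace */
        i += need + 1;
    }
    return true;
}
```

With this check, the UTF-8 encoding of U+00E9 (C3 A9) passes, while the
lone Latin-1 byte E9 is rejected.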


>
> > For example, what happens if a sequence of such raw bytes happens
> > to be a valid UTF-8 sequence? Is there a way for module code to detect
> > this situation?
>
> How can you detect that if you are only given the byte stream?  You
> can't.  You need some additional information to be able to distinguish
> between these two alternatives.
>

That's why I propose encoding raw bytes not as bytes, but as the Emacs
integer codes used to represent them.
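
A sketch of what I mean, under the assumption that Emacs represents a raw
byte 0x80..0xFF internally as the character code byte + 0x3FFF00 (that is,
0x3FFF80..0x3FFFFF, well above the Unicode codespace). The function names
are hypothetical and only illustrate the mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: offset between a raw byte and the Emacs character code
   that stands for it.  Not taken from emacs-module.h. */
enum { RAW_BYTE_OFFSET = 0x3FFF00 };

/* Map a raw byte (0x80..0xFF) to its assumed Emacs integer code. */
uint32_t raw_byte_to_code(unsigned char byte)
{
    assert(byte >= 0x80);
    return (uint32_t)byte + RAW_BYTE_OFFSET;
}

/* True iff CODE lies in the assumed raw-byte range 0x3FFF80..0x3FFFFF. */
bool code_is_raw_byte(uint32_t code)
{
    return code >= 0x3FFF80 && code <= 0x3FFFFF;
}

/* Inverse of raw_byte_to_code. */
unsigned char code_to_raw_byte(uint32_t code)
{
    assert(code_is_raw_byte(code));
    return (unsigned char)(code - RAW_BYTE_OFFSET);
}
```

As a byte stream, the character U+00E9 and the two raw bytes C3 A9 are
identical. As integer codes they are {0xE9} versus {0x3FFFC3, 0x3FFFA9},
so a module can tell them apart with code_is_raw_byte.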


>
> Look, an Emacs module _must_ support non-ASCII text, otherwise it
> would be severely limited, to say the least.


Absolutely!


> Having interfaces that
> accept and produce UTF-8 encoded strings is the simplest complete
> solution to this problem.  So we must at least support that much.
>

Agreed.


>
> Supporting strings of raw bytes is also possible, probably even
> desirable, but it's an extension, something that would be required
> much more rarely.  Such strings cannot be meaningfully treated as
> text: you cannot ask if some byte is upper-case or lower-case letter,
> you cannot display such strings as readable text, you cannot count
> characters in it, etc.  Such strings are useful for a limited number
> of specialized jobs, and handling them in Lisp requires some caution,
> because if you treat them as normal text strings, you get surprises.
>

Yes. However, without an interface for them, raw-byte strings are awkward
for a module to produce.


>
> So let's solve the more important issues first, and talk about
> extensions later.  The more important issue is how can a module pass
> to Emacs non-ASCII text and get back non-ASCII text.  And the answer
> to that is to use UTF-8 encoded strings.
>

Full agreement.


>
> >     We are quite capable of quietly accepting such strings, so that is
> >     what I would suggest. Doing so would be in line with what Emacs does
> >     when such invalid sequences come from other sources, like files.
> >
> > If we accept such strings, then we should document what the extensions are.
> > - Are UTF-8-like sequences encoding surrogate code points accepted?
> > - Are UTF-8-like sequences encoding integers outside the Unicode codespace
> >   accepted?
> > - Are non-shortest forms accepted?
> > - Are other invalid code unit sequences accepted?
>
> _Anything_ can be accepted.  _Any_ byte sequence.  Emacs will cope.
>

Not if the interfaces are documented as accepting UTF-8. The Unicode
standard rules out accepting invalid byte sequences.
If any byte sequence is accepted, then the behavior becomes more complex.
We need to exhaustively describe the behavior for any possible byte
sequence, otherwise module authors cannot make any assumption.
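
The four cases I asked about can be told apart mechanically, which is why
I think they can also be documented individually. A minimal sketch follows;
the enum and function names are hypothetical, and the function classifies
exactly one UTF-8-like sequence of up to four bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    SEQ_VALID,        /* well-formed UTF-8 */
    SEQ_SURROGATE,    /* encodes U+D800..U+DFFF */
    SEQ_OUT_OF_RANGE, /* encodes an integer above U+10FFFF */
    SEQ_OVERLONG,     /* non-shortest form */
    SEQ_MALFORMED     /* bad lead byte, bad continuation, or wrong length */
} seq_class;

/* Classify s[0..len), which must hold exactly one UTF-8-like sequence. */
seq_class classify_sequence(const unsigned char *s, size_t len)
{
    /* Smallest value that legitimately needs a sequence of N bytes. */
    static const uint32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    assert(s != NULL || len == 0);
    if (len == 0) return SEQ_MALFORMED;

    unsigned char b0 = s[0];
    size_t n;
    uint32_t cp;
    if (b0 < 0x80)                { n = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { n = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; }
    else return SEQ_MALFORMED;    /* stray continuation or 5+ byte lead */

    if (len != n) return SEQ_MALFORMED;
    for (size_t k = 1; k < n; k++) {
        if ((s[k] & 0xC0) != 0x80) return SEQ_MALFORMED;
        cp = (cp << 6) | (s[k] & 0x3F);
    }
    if (n > 1 && cp < min_for_len[n]) return SEQ_OVERLONG;
    if (cp >= 0xD800 && cp <= 0xDFFF) return SEQ_SURROGATE;
    if (cp > 0x10FFFF) return SEQ_OUT_OF_RANGE;
    return SEQ_VALID;
}
```

For example, C0 AF is classified as a non-shortest form, ED A0 80 as a
surrogate, and F4 90 80 80 as outside the codespace. Whatever Emacs does
in each case is what the documentation would need to state.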


> The perpetrator will probably get back after processing a string that
> is not entirely human-readable, or its processing will sometimes
> produce surprises, like if the string is lower-cased.  But nothing bad
> will happen to Emacs, it won't crash and won't garble its display.
> Moreover, just passing such a string to Emacs, then outputting it back
> without any changes will produce an exact copy of the input, which is
> quite a feat, considering that the input was "invalid".
>
> If you want to see what "bad" things can happen, take a Latin-1
> encoded FILE and visit it with "C-x RET c utf-8 RET C-x C-f FILE RET".
> Then play with the buffer a while.  This is what happens when Emacs is
> told the text is in UTF-8, when it really isn't.  There's no
> catastrophe, but the luser who does that might be amply punished, at
> the very least she will not see the letters she expects.  However, if
> you save such a buffer to a file, using UTF-8, you will get the same
> Latin-1 encoded text as was there originally.
>
> Now, given such resilience, why do we need to raise an error?
>

The Unicode standard says so. If we document that *a superset of UTF-8* is
accepted, then we don't need to raise an error. So I'd suggest we do
exactly that, but describe what that superset is.


>
> > If the answer to any of these is "yes", we can't say we accept UTF-8,
> > because we don't.
>
> We _expect_ UTF-8, and if given that, will produce known, predictable
> results when the string is processed as text.  We can _tolerate_
> violations, resulting in somewhat surprising behavior, if such a text
> is treated as "normal" human-readable text.  (If the module knows what
> it does, and really means to work with raw bytes, then Emacs will do
> what the module expects, and produce raw bytes on output, as
> expected.)
>

No matter what we expect or tolerate, we need to state that. If all byte
sequences are accepted, then we also need to state that, but describe what
the behavior is if there are invalid UTF-8 sequences in the input.


>
> > Rather we should say what is actually accepted.
>
> Saying that is meaningless in this case, because we can accept
> anything.  _If_ the module wants the string it passes to be processed
> as human-readable text that consists of recognizable characters, then
> the module should _only_ pass valid UTF-8 sequences.  But raising
> errors upon detecting violations was discovered long ago a bad idea
> that users resented.  So we don't, and neither should the module API.
>

Module authors are not end users. I agree that end users should not see
errors on decoding failure, but modules use only programmatic access, where
we can be more strict.


>
> >     > * If copy_string_contents is passed an Emacs string that is not a
> >     > valid Unicode string, what should happen?
> >
> >     How can that happen? The Emacs string comes from the Emacs bowels, so
> >     it must be "valid" string by Emacs standards. Or maybe I don't
> >     understand what you mean by "invalid Unicode string".
> >
> > A sequence of integers where at least one element is not a Unicode scalar
> > value.
>
> Emacs doesn't store characters as scalar Unicode values, so this
> doesn't really explain to me your concept of a "valid Unicode string".
>

An Emacs string is a sequence of integers. It doesn't have to be a sequence
of scalar values.
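
To pin down the definition I am using: a "valid Unicode string" is a
sequence of integers each of which is a Unicode scalar value, i.e. in
0..0x10FFFF excluding the surrogate range 0xD800..0xDFFF. A short sketch
with hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True iff C is a Unicode scalar value. */
bool is_unicode_scalar_value(uint32_t c)
{
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

/* True iff every element of codes[0..n) is a Unicode scalar value. */
bool is_valid_unicode_string(const uint32_t *codes, size_t n)
{
    assert(codes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!is_unicode_scalar_value(codes[i]))
            return false;
    }
    return true;
}
```

A string containing a lone surrogate such as 0xD800, or any integer above
0x10FFFF, is an Emacs string that fails this test, and it is for those
strings that the behavior of copy_string_contents needs to be specified.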


>
> >     In any case, we already deal with any such problems when we save a
> >     buffer to a file, or send it over the network. This isn't some new
> >     problem we need to cope with.
> >
> > Yes, but the module interface is new, it doesn't necessarily have to have
> > the same behavior.
>
> Of course, it does!  Modules are Emacs extensions, so the interface
> should support the same features that core Emacs does.  Why? because
> there's no limits to what creative minds can do with this feature, so
> we should not artificially impose such limitations where we have
> sound, time-proven infrastructure that doesn't need them.
>

I agree that we shouldn't add such limitations. But I disagree that we
should leave the behavior undocumented in such cases.


>
> > If we say we emit only UTF-8, then we should do so.
>
> We emit only valid UTF-8, provided that its source (if it came from a
> module) was valid UTF-8.
>

Then in turn we shouldn't say we emit only UTF-8.

--001a114b07d4ddf5780525252681--