From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Philipp Stephani <p.stephani2@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: Dynamic loading progress
Date: Sun, 22 Nov 2015 19:10:44 +0000
Message-ID: <CAArVCkR+AoMyf7BizSgrug2cfHUen=PfmU3C-hMBCoff=YetOA@mail.gmail.com>
References: <CA+5B0FOuWbpBUTsrE4tzzoLxACPQ-mgxx7zJKyW2LR77QRM=Ug@mail.gmail.com>
	<83k2ptq5t3.fsf@gnu.org> <87h9kxx60e.fsf@lifelogs.com>
	<CA+5B0FPp9nYEmoyDLrutJpcOZBtpV9kxx7LdPqrsj26rnj11qA@mail.gmail.com>
	<CAArVCkS515CVbS1UfavFGAq0dGO=e_mGftMbhF_eBw3SSu3Xjg@mail.gmail.com>
	<877flswse5.fsf@lifelogs.com>
	<CAArVCkT0M8o4MDP1RaP-r9JqumoQaMbhANRrycSEyyCj+mqUcA@mail.gmail.com>
	<8737wgw7kf.fsf@lifelogs.com>
	<CA+5B0FOGrn01XZzKJvXdWLPL62ONUzoEBfQRwLiKqLmd6Ta3RA@mail.gmail.com>
	<87io5bv1it.fsf@lifelogs.com>
	<CA+5B0FOp8Ub1+V_2G4CC1r2aG1hLKmZdSic59MfOy=9QoovSRQ@mail.gmail.com>
	<87egfzuwca.fsf@lifelogs.com>
	<CAArVCkSEHxSd3X2PnEvRJk5n1wOR0y9neU7AxGYEHSqKRG+y3Q@mail.gmail.com>
	<876118u6f2.fsf@lifelogs.com>
	<CA+5B0FPz-vo+Y=38=21jRQuEHANzFG_cf3tPDiwEbK2TO4+JdA@mail.gmail.com>
	<CA+5B0FNW48d3S5CJfxHK9HHVHPmuYqaT3K9tn5MVTgv_qas5Rw@mail.gmail.com>
	<ryhmvud820v.fsf@dod.no>
	<CA+5B0FMU1Ry6mRSinyV5Ar8DaL4VciEUEbTe1NcXZUQ2-4y4TA@mail.gmail.com>
	<8737w3qero.fsf@lifelogs.com> <831tbn9g9j.fsf@gnu.org>
	<878u5upw7o.fsf@lifelogs.com>
	<83ziya8xph.fsf@gnu.org> <83y4du80xo.fsf@gnu.org>
	<CAArVCkTwVbA58_wfj7O-Et83M8YJ9jfpCKhYn466BYO8T2cG0A@mail.gmail.com>
	<837fld6lps.fsf@gnu.org>
	<CAArVCkSTdg=EjSiN69TqLoH_ufkz_vzV6qLKNae2QbEXadYomg@mail.gmail.com>
	<83si3z4s5n.fsf@gnu.org>
	<CAArVCkQ0qUTUr5GZ+xmCub2tEWc0YzFKRsHEN-FFv3ioAc2n0w@mail.gmail.com>
	<83mvu74nhm.fsf@gnu.org>
	<CAArVCkR+LqXPbHnWKW+2FQ61z+AyWR6ThBAb5ens=mwN+rS_mQ@mail.gmail.com>
	<83d1v34hba.fsf@gnu.org>
	<CAArVCkRBF7+yJcFiYA6KmZzKp5EGP6iauQ=0hkH5KJZbMRH7LA@mail.gmail.com>
	<CAArVCkTnaf=rh47Xv4rykPJyyMJwUZ_iASBencLLOo1q5OMAOw@mail.gmail.com>
	<83egfh3o7n.fsf@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=001a1130cd242364b9052525de54
X-Trace: ger.gmane.org 1448219471 13590 80.91.229.3 (22 Nov 2015 19:11:11 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sun, 22 Nov 2015 19:11:11 +0000 (UTC)
Cc: aurelien.aptel+emacs@gmail.com, tzz@lifelogs.com, emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun Nov 22 20:11:06 2015
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1a0a2T-0004nq-02
	for ged-emacs-devel@m.gmane.org; Sun, 22 Nov 2015 20:11:05 +0100
Original-Received: from localhost ([::1]:57145 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1a0a2T-00062a-56
	for ged-emacs-devel@m.gmane.org; Sun, 22 Nov 2015 14:11:05 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:60574)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1a0a2O-00062S-5e
	for emacs-devel@gnu.org; Sun, 22 Nov 2015 14:11:02 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1a0a2M-0003a0-1L
	for emacs-devel@gnu.org; Sun, 22 Nov 2015 14:11:00 -0500
Original-Received: from mail-wm0-x22b.google.com ([2a00:1450:400c:c09::22b]:36622)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>)
	id 1a0a2J-0003Zl-1q; Sun, 22 Nov 2015 14:10:55 -0500
Original-Received: by wmww144 with SMTP id w144so73004268wmw.1;
	Sun, 22 Nov 2015 11:10:54 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:references:in-reply-to:from:date:message-id:subject:to
	:cc:content-type;
	bh=lk8TT9whmSj+o/ZSrjHMz8PAtcT+zhfZfbQfiWxssiQ=;
	b=uaCjIlbHyVpgzF6aMlqttws4+veqN0ejepqGbksQhHNpT5QuSxheni0MoiOtMevcwK
	wSymxGpeZzMQE5PebS7RQafIwJq/5U/B1K3mmxfEYekCcjb4n9qxz/nnUHQdbHRqgjm2
	NQD/1dpm+sU9vdgWUI7zosyLbPy5jiEIVFFiWEHAzI4RRXvpd8iwJ/VOwSMXQCDCWbbY
	3BaqrnXMMYDp0ARgbC7APv15Yq6AUzZS5Evp311S9O4OCcNRW4TGXfspcYd/kGgK9tIq
	imjlOdqhMixfJtwL2os0W+KbNmtoO2xw0dPSsep9TOrFXnd9l+B1+qbrNxMJcdTQ7U+T
	nNVw==
X-Received: by 10.194.116.170 with SMTP id jx10mr37656wjb.166.1448219454430;
	Sun, 22 Nov 2015 11:10:54 -0800 (PST)
In-Reply-To: <83egfh3o7n.fsf@gnu.org>
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
	(bad octet value).
X-Received-From: 2a00:1450:400c:c09::22b
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:195050
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/195050>

--001a1130cd242364b9052525de54
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Eli Zaretskii <eliz@gnu.org> wrote on Sun, 22 Nov 2015 at 19:04:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sun, 22 Nov 2015 14:56:12 +0000
> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
> >
> > - The multibyte API should use an extension of UTF-8 to encode Emacs
> > strings.
> > The extension is the obvious one already in use in multiple places.
>
> It is only used in one place: the internal representation of
> characters in buffers and strings.  Emacs _never_ lets this internal
> representation leak outside.


If I evaluate the following in the *scratch* buffer:

(with-temp-buffer
  (insert #x3fff40)
  (describe-char (point-min)))

then the resulting help buffer says "buffer code: #xF8 #x8F #xBF #xBD
#x80". Is that not considered a leak?
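To make that byte sequence concrete, here is a minimal Python sketch (not
Emacs code; the function name is made up for illustration). It assumes the
extension simply continues the classic UTF-8 bit layout with a five-byte
form for code points up to #x3FFFFF, and it makes no attempt to model how
Emacs stores raw-byte characters internally.

```python
def encode_extended_utf8(code: int) -> bytes:
    """Encode a code point with the classic UTF-8 bit layout, extended to 5 bytes.

    Assumption: values above 0x1FFFFF use the five-byte pattern
    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx.
    """
    if not 0 <= code <= 0x3FFFFF:
        raise ValueError("outside the 22-bit code space")
    if code < 0x80:
        return bytes([code])
    if code < 0x800:
        lead, n_cont = 0xC0, 1
    elif code < 0x10000:
        lead, n_cont = 0xE0, 2
    elif code < 0x200000:
        lead, n_cont = 0xF0, 3
    else:
        lead, n_cont = 0xF8, 4
    # Lead byte carries the top bits; each continuation byte carries 6 bits.
    out = [lead | (code >> (6 * n_cont))]
    for i in range(n_cont - 1, -1, -1):
        out.append(0x80 | ((code >> (6 * i)) & 0x3F))
    return bytes(out)


print(" ".join(f"#x{b:02X}" for b in encode_extended_utf8(0x3FFF40)))
# prints: #xF8 #x8F #xBF #xBD #x80
```

For code points inside Unicode the sketch agrees with ordinary UTF-8; only
the last branch goes beyond it.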


>   In practice the last sentence means that
> text that Emacs encoded in UTF-8 will only include either valid UTF-8
> sequences of characters whose codepoints are below #x200000 or single
> bytes that don't belong to any UTF-8 sequence.
>

I get the same result as above when running

(with-temp-buffer
  (call-process "echo" nil t nil (string #x3fff40))
  (describe-char (point-min)))

That means the non-UTF-8 sequence is even "leaked" to the external process!


>
> You are suggesting to expose the internal representation to outside
> application code, which predictably will cause that representation to
> leak into Lisp.  That'd be a disaster.  We had something like that
> back in the Emacs 20 era, and it took many years to plug those leaks.
> We would be making a grave mistake to go back there.
>

I don't suggest leaking anything that isn't already leaked. The extension
of the codespace to 22 bits is well documented.


>
> What you suggest is also impossible without deep changes in how we
> decode and encode text: that process maps codepoints above #1FFFFF to
> either codepoints below that mark or to raw bytes.  So it's impossible
> to produce these high codes in UTF-8 compatible form while handling
> UTF-8 text.  To say nothing about the simple fact that no library
> function in any C library will ever be able to do anything useful with
> such codepoints, because they are our own invention.
>

Unless the behavior changed recently, that doesn't seem to be the case:

(encode-coding-string (string #x3fff40) 'utf-8-unix)
"\370\217\277\275\200"

Or are you talking about something different?
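For anyone who doesn't read octal escapes fluently, this small Python
sketch (a hypothetical helper, not part of Emacs) converts the printed
result into hex octets so it can be compared with the "buffer code" that
describe-char reported earlier in this message.

```python
def octets_from_octal_escapes(printed: str) -> bytes:
    """Turn a string made only of \\ooo octal escapes into the bytes it denotes."""
    return bytes(int(token, 8) for token in printed.split("\\") if token)


encoded = octets_from_octal_escapes(r"\370\217\277\275\200")
print(" ".join(f"#x{b:02X}" for b in encoded))
# prints: #xF8 #x8F #xBF #xBD #x80
```

The octets are identical to the buffer code, which is why I read this as
the extended sequence being produced by the encoder.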


>
> > - There should be a one-to-one mapping between Emacs multibyte strings
> > and encoded module API strings.
>
> UTF-8 encoded strings satisfy that requirement.
>

No! UTF-8 can only encode Unicode scalar values. Only the Emacs extension
to UTF-8 (which I think Emacs unfortunately also calls "UTF-8") satisfies
this. If you are talking about this extension, then we are talking about
the same thing anyway.
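As a rough illustration of what "Unicode scalar value" excludes, here is a
Python sketch; the function name is mine, and the range check follows the
Unicode definition (U+0000..U+10FFFF minus the surrogates).

```python
def is_unicode_scalar_value(code: int) -> bool:
    """True for U+0000..U+10FFFF excluding the surrogate range U+D800..U+DFFF."""
    return 0 <= code <= 0x10FFFF and not (0xD800 <= code <= 0xDFFF)


for code in (0xBB, 0x10FFFF, 0xD800, 0x110000, 0x3FFF40, 0x3FFFC2):
    kind = "scalar value" if is_unicode_scalar_value(code) else "not encodable in standard UTF-8"
    print(f"#x{code:X}: {kind}")
```

Everything from #x110000 up to #x3FFFFF falls outside, so a one-to-one
mapping for all Emacs characters needs something beyond standard UTF-8.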


>
> > Therefore non-shortest forms, illegal code unit sequences, and code
> > unit sequences that would encode values outside the range of Emacs
> > characters are illegal and raise a signal.
>
> Once again, this was tried in the past and was found to be a bad idea.
> Emacs provides features to test the result of converting invalid
> sequences, for the purposes of detecting such problems, but it leaves
> that to the application.
>

It's probably OK to accept invalid sequences for consistency with
decode-coding-string and friends. I don't really like it though: the module
API, like decode-coding-string, is not a general-purpose UI for end users,
and accepting invalid sequences is error-prone and can even introduce
security issues (see e.g.
https://blogs.oracle.com/CoreJavaTechTips/entry/the_overhaul_of_java_utf).
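To show the kind of strictness I have in mind, here is a sketch using
Python's built-in strict UTF-8 decoder as a stand-in (the helper name is
hypothetical, and this is not how Emacs behaves). The well-known hazard
with non-shortest forms is that a lenient decoder turns C0 AF into "/",
which can slip past a filter that only looked for the byte 2F.

```python
def accepts_strict_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8 per Python's strict decoder."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "plain '/'": b"/",
    "non-shortest '/' (C0 AF)": b"\xc0\xaf",
    "encoded surrogate (ED A0 80)": b"\xed\xa0\x80",
    "five-byte extended form": b"\xf8\x8f\xbf\xbd\x80",
}
for label, data in samples.items():
    print(f"{label}: {'accepted' if accepts_strict_utf8(data) else 'rejected'}")
```

Note that a decoder this strict also rejects the five-byte extended form,
so a module API that wants the full Emacs character range would have to
relax exactly that one rule and nothing else.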


>
> > Likewise, such sequences will never be returned from Emacs.
>
> Emacs doesn't return invalid sequences, if the original text didn't
> include raw bytes.  If there were raw bytes in the original text,
> Emacs has no choice but return them back, or else it will violate a
> basic expectation from a text-processing program: that it shall never
> change the portions of text that were not affected by the processing.
>

It seems that Emacs does return invalid sequences for characters such as
#x3fff40 (or anything else outside of Unicode except the 128 values for
encoding raw bytes).
Returning raw bytes means that encoding and decoding isn't a perfect
roundtrip:

(decode-coding-string
 (encode-coding-string (string #x3fffc2 #x3fffbb) 'utf-8-unix)
 'utf-8-unix)
"»"

We might be able to live with that as it's an extreme edge case.
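The effect can be modelled outside Emacs. The Python sketch below is only
a model under stated assumptions: raw-byte characters are taken to be
#x3FFF00 plus the byte value, encoding emits them as single bytes, and
decoding maps undecodable bytes back to raw-byte characters (here via
Python's "surrogateescape" handler). The function names are made up.

```python
from typing import List

RAW_BASE = 0x3FFF00  # assumption: raw byte b (0x80..0xFF) is character RAW_BASE + b


def encode(chars: List[int]) -> bytes:
    """Encode characters as UTF-8, emitting raw-byte characters as single bytes."""
    out = bytearray()
    for c in chars:
        if 0x3FFF80 <= c <= 0x3FFFFF:
            out.append(c - RAW_BASE)
        else:
            out += chr(c).encode("utf-8")
    return bytes(out)


def decode(data: bytes) -> List[int]:
    """Decode UTF-8, turning undecodable bytes back into raw-byte characters."""
    text = data.decode("utf-8", errors="surrogateescape")
    return [
        RAW_BASE + (ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ord(ch)
        for ch in text
    ]


original = [0x3FFFC2, 0x3FFFBB]
print([hex(c) for c in decode(encode(original))])  # prints: ['0xbb']
```

A lone raw byte survives the roundtrip in this model; the information is
lost only when adjacent raw bytes happen to form a valid UTF-8 sequence,
as C2 BB does.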


>
> > I think this is a relatively simple and unsurprising approach. It allows
> > encoding the documented Emacs character space while still being fully
> > compatible with UTF-8 and not resorting to undocumented Emacs internals.
>
> So does the approach I suggested.  The advantage of my suggestion is
> that it follows a long Emacs tradition about every aspect of encoding
> and decoding text, and doesn't require any changes in the existing
> infrastructure.
>

What are the exact differences between the approaches? As far as I can see,
they differ only on the following points:
- Accepting invalid sequences. I consider that a bug in general-purpose
APIs, including decode-coding-string. However, given that Emacs already
extends the Unicode codespace and therefore has to accept some invalid
sequences anyway, it might be OK if it's clearly documented.
- Emitting raw bytes instead of extended sequences. Though I'm not a fan of
this, it might be unavoidable if strings are to be treated transparently
(which is desirable).

--001a1130cd242364b9052525de54--