From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Philipp Stephani <p.stephani2@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: Dynamic loading progress
Date: Sun, 22 Nov 2015 14:56:12 +0000
Message-ID: <CAArVCkTnaf=rh47Xv4rykPJyyMJwUZ_iASBencLLOo1q5OMAOw@mail.gmail.com>
References: <CA+5B0FOuWbpBUTsrE4tzzoLxACPQ-mgxx7zJKyW2LR77QRM=Ug@mail.gmail.com>
	<83k2ptq5t3.fsf@gnu.org> <87h9kxx60e.fsf@lifelogs.com>
	<CA+5B0FPp9nYEmoyDLrutJpcOZBtpV9kxx7LdPqrsj26rnj11qA@mail.gmail.com>
	<CAArVCkS515CVbS1UfavFGAq0dGO=e_mGftMbhF_eBw3SSu3Xjg@mail.gmail.com>
	<877flswse5.fsf@lifelogs.com>
	<CAArVCkT0M8o4MDP1RaP-r9JqumoQaMbhANRrycSEyyCj+mqUcA@mail.gmail.com>
	<8737wgw7kf.fsf@lifelogs.com>
	<CA+5B0FOGrn01XZzKJvXdWLPL62ONUzoEBfQRwLiKqLmd6Ta3RA@mail.gmail.com>
	<87io5bv1it.fsf@lifelogs.com>
	<CA+5B0FOp8Ub1+V_2G4CC1r2aG1hLKmZdSic59MfOy=9QoovSRQ@mail.gmail.com>
	<87egfzuwca.fsf@lifelogs.com>
	<CAArVCkSEHxSd3X2PnEvRJk5n1wOR0y9neU7AxGYEHSqKRG+y3Q@mail.gmail.com>
	<876118u6f2.fsf@lifelogs.com>
	<CA+5B0FPz-vo+Y=38=21jRQuEHANzFG_cf3tPDiwEbK2TO4+JdA@mail.gmail.com>
	<CA+5B0FNW48d3S5CJfxHK9HHVHPmuYqaT3K9tn5MVTgv_qas5Rw@mail.gmail.com>
	<ryhmvud820v.fsf@dod.no>
	<CA+5B0FMU1Ry6mRSinyV5Ar8DaL4VciEUEbTe1NcXZUQ2-4y4TA@mail.gmail.com>
	<8737w3qero.fsf@lifelogs.com> <831tbn9g9j.fsf@gnu.org>
	<878u5upw7o.fsf@lifelogs.com>
	<83ziya8xph.fsf@gnu.org> <83y4du80xo.fsf@gnu.org>
	<CAArVCkTwVbA58_wfj7O-Et83M8YJ9jfpCKhYn466BYO8T2cG0A@mail.gmail.com>
	<837fld6lps.fsf@gnu.org>
	<CAArVCkSTdg=EjSiN69TqLoH_ufkz_vzV6qLKNae2QbEXadYomg@mail.gmail.com>
	<83si3z4s5n.fsf@gnu.org>
	<CAArVCkQ0qUTUr5GZ+xmCub2tEWc0YzFKRsHEN-FFv3ioAc2n0w@mail.gmail.com>
	<83mvu74nhm.fsf@gnu.org>
	<CAArVCkR+LqXPbHnWKW+2FQ61z+AyWR6ThBAb5ens=mwN+rS_mQ@mail.gmail.com>
	<83d1v34hba.fsf@gnu.org>
	<CAArVCkRBF7+yJcFiYA6KmZzKp5EGP6iauQ=0hkH5KJZbMRH7LA@mail.gmail.com>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=047d7b5d382cdca0910525224f83
X-Trace: ger.gmane.org 1448204195 7807 80.91.229.3 (22 Nov 2015 14:56:35 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sun, 22 Nov 2015 14:56:35 +0000 (UTC)
Cc: aurelien.aptel+emacs@gmail.com, tzz@lifelogs.com, emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun Nov 22 15:56:34 2015
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1a0W47-0008Vm-44
	for ged-emacs-devel@m.gmane.org; Sun, 22 Nov 2015 15:56:31 +0100
Original-Received: from localhost ([::1]:56370 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1a0W46-0000q4-KQ
	for ged-emacs-devel@m.gmane.org; Sun, 22 Nov 2015 09:56:30 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:38829)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1a0W42-0000pw-Ms
	for emacs-devel@gnu.org; Sun, 22 Nov 2015 09:56:28 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1a0W41-0007bV-Al
	for emacs-devel@gnu.org; Sun, 22 Nov 2015 09:56:26 -0500
Original-Received: from mail-wm0-x22d.google.com ([2a00:1450:400c:c09::22d]:37225)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>)
	id 1a0W3z-0007bG-2n; Sun, 22 Nov 2015 09:56:23 -0500
Original-Received: by wmww144 with SMTP id w144so75334425wmw.0;
	Sun, 22 Nov 2015 06:56:22 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:references:in-reply-to:from:date:message-id:subject:to
	:cc:content-type;
	bh=IeK2vMNmIaB5MCUKEzYRjxtB82mtAwjYLZW68y9k6Xg=;
	b=sXyUXDDKQEWaiiKI7fVLD3UAlciWA4PX7Oal1w/Mfe8zMa/2c6I+2Xh9oNWIivdvQe
	sGvYPFpX6ulHNpPgMNrp4lUIMDj+TAjcUa3g9TXE1p3I75bf3+wfxl3tRxvRU8otlNON
	fhggOFMr5QvNPUEOGIYNp84/s0S7hOg5EFcBRr9/ltgPLeTQEsc8lQoAEnh7dlg6cmAa
	v4I7c8NQ6DCTR+2IUEAh3YuquYORORpmvnUWa6x7d4xYoz37JPCUh8MSe9zRY4hvaowk
	xTeyk4oMLec1dB4vpxLUYTy+4grWLfo8sxjrj+68ynZMHB8UZZINbMepYsK1tj8E3aK8
	vonA==
X-Received: by 10.194.19.163 with SMTP id g3mr25024637wje.166.1448204182526;
	Sun, 22 Nov 2015 06:56:22 -0800 (PST)
In-Reply-To: <CAArVCkRBF7+yJcFiYA6KmZzKp5EGP6iauQ=0hkH5KJZbMRH7LA@mail.gmail.com>
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
	(bad octet value).
X-Received-From: 2a00:1450:400c:c09::22d
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:195020
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/195020>

--047d7b5d382cdca0910525224f83
Content-Type: text/plain; charset=UTF-8

On Sun, 22 Nov 2015 at 10:25, Philipp Stephani <p.stephani2@gmail.com>
wrote:

> On Sat, 21 Nov 2015 at 14:23, Eli Zaretskii <eliz@gnu.org> wrote:
>
>> > From: Philipp Stephani <p.stephani2@gmail.com>
>> > Date: Sat, 21 Nov 2015 12:11:45 +0000
>> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com,
>> >     emacs-devel@gnu.org
>> >
>> >     No, we cannot, or rather should not. It is unreasonable to expect
>> >     external modules to know the intricacies of the internal
>> >     representation. Most Emacs hackers don't.
>> >
>> > Fine with me, but how would we then represent Emacs strings that are
>> > not valid Unicode strings? Just raise an error?
>>
>> No need to raise an error.  Strings that are returned to modules
>> should be encoded into UTF-8.  That encoding already takes care of
>> these situations: it either produces the UTF-8 encoding of the
>> equivalent Unicode characters, or outputs raw bytes.
>>
>
> Then we should document such situations and give module authors a way to
> detect them. For example, what happens if a sequence of such raw bytes
> happens to be a valid UTF-8 sequence? Is there a way for module code to
> detect this situation?
>
>

I've thought a bit more about this issue, and in the following I'll attempt
to derive the desired behavior from first principles without referring to
internal Emacs functions.

There are two kinds of Emacs strings, unibyte and multibyte.
https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html
and
https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Codes.html
agree that multibyte strings are sequences of integers (let's avoid the
overloaded and vague term "characters") in the range 0 to #x3FFFFF
(inclusive). It is also clear that within the subset of that range
corresponding to the Unicode codespace, the integers are interpreted as
Unicode code points. Given that new APIs should use UTF-8, the following
approach looks reasonable to me:
- There are two sets of functions for creating and reading strings, unibyte
and multibyte. If a string of the wrong type is passed, a signal is raised.
This way the two types are clearly separated.
- The behavior of the unibyte API is uncontroversial and has no failure
modes apart from generic ones such as wrong type, argument out of range,
OOM.
- The multibyte API should use an extension of UTF-8 to encode Emacs
strings. The extension is the obvious one already in use in multiple places.
- There should be a one-to-one mapping between Emacs multibyte strings and
encoded module API strings. Therefore non-shortest forms, illegal code unit
sequences, and code unit sequences that would encode values outside the
range of Emacs characters are illegal and raise a signal. Likewise, such
sequences will never be returned from Emacs.
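
The first two points could be sketched roughly as follows. This is only an
illustration in Python, not module API code: EmacsString, WrongStringType,
and the four function names are hypothetical stand-ins, and the sketch
leaves out the UTF-8 encoding step entirely so that only the type
separation is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmacsString:
    """Hypothetical stand-in for an Emacs string value."""
    multibyte: bool
    data: tuple  # bytes 0..255 if unibyte, integers 0..#x3FFFFF if multibyte


class WrongStringType(Exception):
    """Stands in for the signal raised on a unibyte/multibyte mismatch."""


def make_unibyte(raw: bytes) -> EmacsString:
    # Any byte sequence is acceptable; there is nothing to validate.
    return EmacsString(multibyte=False, data=tuple(raw))


def make_multibyte(values) -> EmacsString:
    values = tuple(values)
    if any(not 0 <= v <= 0x3FFFFF for v in values):
        raise ValueError("value outside 0..#x3FFFFF")
    return EmacsString(multibyte=True, data=values)


def read_unibyte(s: EmacsString) -> bytes:
    # A multibyte string passed to the unibyte reader is rejected.
    if s.multibyte:
        raise WrongStringType("expected a unibyte string")
    return bytes(s.data)


def read_multibyte(s: EmacsString) -> tuple:
    # A unibyte string passed to the multibyte reader is rejected.
    if not s.multibyte:
        raise WrongStringType("expected a multibyte string")
    return s.data
```

With this shape, a module that asks for the wrong kind of string gets an
error instead of a silent conversion, which is what keeps the two types
separated.
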
I think this is a relatively simple and unsurprising approach. It allows
encoding the documented Emacs character space while still being fully
compatible with UTF-8 and not resorting to undocumented Emacs internals.
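
To make the one-to-one mapping concrete, here is a minimal sketch of a
strict codec in Python. It assumes that the extension is the generalized
UTF-8 bit layout (4-byte forms up to #x1FFFFF and a 5-byte form beyond
that), which this message does not spell out. The names encode, decode, and
_FORMS are made up for the example, and surrogate values get no special
treatment here because the proposal above does not mention them.

```python
MAX_EMACS_CHAR = 0x3FFFFF  # upper bound of the documented range

# (smallest value needing this form, lead-byte mask, lead-byte pattern,
#  number of continuation bytes); the 5-byte row is the assumed extension.
_FORMS = [
    (0x000000, 0x80, 0x00, 0),
    (0x000080, 0xE0, 0xC0, 1),
    (0x000800, 0xF0, 0xE0, 2),
    (0x010000, 0xF8, 0xF0, 3),
    (0x200000, 0xFC, 0xF8, 4),
]


def encode(values):
    """Encode integers in 0..MAX_EMACS_CHAR using the shortest form."""
    out = bytearray()
    for v in values:
        if not 0 <= v <= MAX_EMACS_CHAR:
            raise ValueError(f"value out of range: {v:#x}")
        # The form with the largest minimum not exceeding v is the shortest.
        _, _, pattern, ncont = max(f for f in _FORMS if f[0] <= v)
        out.append(pattern | (v >> (6 * ncont)))
        for i in reversed(range(ncont)):
            out.append(0x80 | ((v >> (6 * i)) & 0x3F))
    return bytes(out)


def decode(data):
    """Decode strictly; reject anything encode() would never produce."""
    values, i = [], 0
    while i < len(data):
        lead = data[i]
        form = next((f for f in _FORMS if (lead & f[1]) == f[2]), None)
        if form is None:
            raise ValueError(f"illegal lead byte {lead:#x} at offset {i}")
        minimum, mask, _, ncont = form
        tail = data[i + 1 : i + 1 + ncont]
        if len(tail) != ncont or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"illegal or truncated sequence at offset {i}")
        v = lead & ~mask & 0xFF
        for b in tail:
            v = (v << 6) | (b & 0x3F)
        if v < minimum:
            raise ValueError(f"non-shortest form at offset {i}")
        if v > MAX_EMACS_CHAR:
            raise ValueError(f"value out of range at offset {i}")
        values.append(v)
        i += 1 + ncont
    return values
```

For Unicode scalar values the output is ordinary UTF-8, so
encode([0x20AC]) gives the same bytes as "\u20ac".encode("utf-8"). The
manual pages linked above place raw 8-bit bytes at #x3FFF80 to #x3FFFFF, so
under this sketch the raw byte pair C3 A9, stored as the two values
#x3FFFC3 and #x3FFFA9, encodes to two 5-byte sequences and cannot be
confused with the 2-byte encoding of U+00E9.
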

--047d7b5d382cdca0910525224f83--