From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Philipp Stephani
Newsgroups: gmane.emacs.devel
Subject: Re: Dynamic loading progress
Date: Sun, 22 Nov 2015 14:56:12 +0000
Message-ID:
References: <83k2ptq5t3.fsf@gnu.org> <87h9kxx60e.fsf@lifelogs.com> <877flswse5.fsf@lifelogs.com> <8737wgw7kf.fsf@lifelogs.com> <87io5bv1it.fsf@lifelogs.com> <87egfzuwca.fsf@lifelogs.com> <876118u6f2.fsf@lifelogs.com> <8737w3qero.fsf@lifelogs.com> <831tbn9g9j.fsf@gnu.org> <878u5upw7o.fsf@lifelogs.com> <83ziya8xph.fsf@gnu.org> <83y4du80xo.fsf@gnu.org> <837fld6lps.fsf@gnu.org> <83si3z4s5n.fsf@gnu.org> <83mvu74nhm.fsf@gnu.org> <83d1v34hba.fsf@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=047d7b5d382cdca0910525224f83
X-Trace: ger.gmane.org 1448204195 7807 80.91.229.3 (22 Nov 2015 14:56:35 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sun, 22 Nov 2015 14:56:35 +0000 (UTC)
Cc: aurelien.aptel+emacs@gmail.com, tzz@lifelogs.com, emacs-devel@gnu.org
To: Eli Zaretskii
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun Nov 22 15:56:34 2015
Return-path:
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1a0W47-0008Vm-44 for ged-emacs-devel@m.gmane.org; Sun, 22 Nov 2015 15:56:31 +0100
Original-Received: from localhost ([::1]:56370 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a0W46-0000q4-KQ for ged-emacs-devel@m.gmane.org; Sun, 22 Nov 2015 09:56:30 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:38829) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a0W42-0000pw-Ms for emacs-devel@gnu.org; Sun, 22 Nov 2015 09:56:28 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1a0W41-0007bV-Al for emacs-devel@gnu.org; Sun, 22 Nov 2015 09:56:26 -0500
Original-Received: from mail-wm0-x22d.google.com ([2a00:1450:400c:c09::22d]:37225) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a0W3z-0007bG-2n; Sun, 22 Nov 2015 09:56:23 -0500
Original-Received: by wmww144 with SMTP id w144so75334425wmw.0; Sun, 22 Nov 2015 06:56:22 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-type; bh=IeK2vMNmIaB5MCUKEzYRjxtB82mtAwjYLZW68y9k6Xg=; b=sXyUXDDKQEWaiiKI7fVLD3UAlciWA4PX7Oal1w/Mfe8zMa/2c6I+2Xh9oNWIivdvQe sGvYPFpX6ulHNpPgMNrp4lUIMDj+TAjcUa3g9TXE1p3I75bf3+wfxl3tRxvRU8otlNON fhggOFMr5QvNPUEOGIYNp84/s0S7hOg5EFcBRr9/ltgPLeTQEsc8lQoAEnh7dlg6cmAa v4I7c8NQ6DCTR+2IUEAh3YuquYORORpmvnUWa6x7d4xYoz37JPCUh8MSe9zRY4hvaowk xTeyk4oMLec1dB4vpxLUYTy+4grWLfo8sxjrj+68ynZMHB8UZZINbMepYsK1tj8E3aK8 vonA==
X-Received: by 10.194.19.163 with SMTP id g3mr25024637wje.166.1448204182526; Sun, 22 Nov 2015 06:56:22 -0800 (PST)
In-Reply-To:
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value).
X-Received-From: 2a00:1450:400c:c09::22d
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:195020
Archived-At:

--047d7b5d382cdca0910525224f83
Content-Type: text/plain; charset=UTF-8

Philipp Stephani wrote on Sun, 22 Nov 2015 at 10:25:

> Eli Zaretskii wrote on Sat, 21 Nov 2015 at 14:23:
>
>> > From: Philipp Stephani
>> > Date: Sat, 21 Nov 2015 12:11:45 +0000
>> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
>> >
>> > No, we cannot, or rather should not.
>> > It is unreasonable to expect
>> > external modules to know the intricacies of the internal
>> > representation. Most Emacs hackers don't.
>> >
>> > Fine with me, but how would we then represent Emacs strings that are not valid
>> > Unicode strings? Just raise an error?
>>
>> No need to raise an error. Strings that are returned to modules
>> should be encoded into UTF-8. That encoding already takes care of
>> these situations: it either produces the UTF-8 encoding of the
>> equivalent Unicode characters, or outputs raw bytes.
>>
>
> Then we should document such a situation and give module authors a way to
> detect them. For example, what happens if a sequence of such raw bytes
> happens to be a valid UTF-8 sequence? Is there a way for module code to
> detect this situation?
>

I've thought a bit more about this issue, and in the following I'll attempt to derive the desired behavior from first principles without referring to internal Emacs functions.

There are two kinds of Emacs strings, unibyte and multibyte. https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html and https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Codes.html agree that multibyte strings are sequences of integers (let's avoid the overloaded and vague term "characters") in the range 0 to #x3FFFFF (inclusive). It is also clear that within the subset of that range corresponding to the Unicode codespace, the integers are interpreted as Unicode code points. Given that new APIs should use UTF-8, the following approach looks reasonable to me:

- There are two sets of functions for creating and reading strings, one for unibyte and one for multibyte strings. If a string of the wrong type is passed, a signal is raised. This way the two types are clearly separated.
- The behavior of the unibyte API is uncontroversial and has no failure modes apart from generic ones such as wrong type, argument out of range, OOM.
- The multibyte API should use an extension of UTF-8 to encode Emacs strings. The extension is the obvious one already in use in multiple places.
- There should be a one-to-one mapping between Emacs multibyte strings and encoded module API strings. Therefore non-shortest forms, illegal code unit sequences, and code unit sequences that would encode values outside the range of Emacs characters are rejected and raise a signal. Likewise, such sequences will never be returned from Emacs.

I think this is a relatively simple and unsurprising approach. It allows encoding the documented Emacs character space while still being fully compatible with UTF-8 and not resorting to undocumented Emacs internals.
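To make the multibyte rules above concrete, here is a rough Python sketch of what a strict encoder and decoder could look like. It is only an illustration under assumptions of mine, not Emacs code: I take "the obvious extension" to mean the usual UTF-8 bit layout continued to 4-byte and 5-byte forms so that values up to #x3FFFFF fit, I don't treat surrogate code points specially, and the names `decode_extended_utf8`, `encode_extended_utf8`, and `InvalidEncoding` (standing in for the signal) are made up for this example.

```python
MAX_EMACS_CHAR = 0x3FFFFF  # upper bound of the integer range given in the message

# Assumed layout: (lead mask, lead pattern, payload mask, trailing bytes, minimum value).
_FORMS = [
    (0x80, 0x00, 0x7F, 0, 0x0),
    (0xE0, 0xC0, 0x1F, 1, 0x80),
    (0xF0, 0xE0, 0x0F, 2, 0x800),
    (0xF8, 0xF0, 0x07, 3, 0x10000),
    (0xFC, 0xF8, 0x03, 4, 0x200000),
]


class InvalidEncoding(ValueError):
    """Hypothetical stand-in for the signal raised on bad input."""


def decode_extended_utf8(data):
    """Decode bytes into a list of integers, rejecting anything not one-to-one."""
    values = []
    i = 0
    while i < len(data):
        lead = data[i]
        for mask, pattern, payload, ntrail, minimum in _FORMS:
            if (lead & mask) == pattern:
                break
        else:
            raise InvalidEncoding(f"illegal lead byte at offset {i}")
        if i + ntrail >= len(data):
            raise InvalidEncoding(f"truncated sequence at offset {i}")
        value = lead & payload
        for byte in data[i + 1 : i + 1 + ntrail]:
            if (byte & 0xC0) != 0x80:
                raise InvalidEncoding(f"illegal continuation byte after offset {i}")
            value = (value << 6) | (byte & 0x3F)
        if value < minimum:
            raise InvalidEncoding(f"non-shortest form at offset {i}")
        if value > MAX_EMACS_CHAR:
            raise InvalidEncoding(f"value out of range at offset {i}")
        values.append(value)
        i += 1 + ntrail
    return values


def encode_extended_utf8(values):
    """Encode integers in 0..MAX_EMACS_CHAR using the shortest form only."""
    out = bytearray()
    for value in values:
        if not 0 <= value <= MAX_EMACS_CHAR:
            raise InvalidEncoding(f"value {value:#x} out of range")
        # Pick the first form whose successor's minimum exceeds the value.
        ntrail = sum(1 for form in _FORMS[1:] if value >= form[4])
        lead_pattern = _FORMS[ntrail][1]
        out.append(lead_pattern | (value >> (6 * ntrail)))
        for shift in range(6 * (ntrail - 1), -1, -6):
            out.append(0x80 | ((value >> shift) & 0x3F))
    return bytes(out)
```

With this layout, ordinary Unicode text round-trips byte for byte with standard UTF-8, while an overlong sequence such as the two bytes C0 80, or a 5-byte sequence encoding #x400000, is rejected instead of being silently accepted. Because the encoder only ever emits the shortest form, each integer sequence has exactly one byte representation, which is the one-to-one property asked for above.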
--047d7b5d382cdca0910525224f83--