From: Philipp Stephani
Newsgroups: gmane.emacs.devel
Subject: Re: Dynamic loading progress
Date: Sun, 22 Nov 2015 19:10:44 +0000
To: Eli Zaretskii
Cc: aurelien.aptel+emacs@gmail.com, tzz@lifelogs.com, emacs-devel@gnu.org
In-Reply-To: <83egfh3o7n.fsf@gnu.org>

Eli Zaretskii wrote on Sun, 22 Nov 2015 at 19:04:

> > From: Philipp Stephani
> > Date: Sun, 22 Nov 2015 14:56:12 +0000
> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
> >
> > - The multibyte API should use an extension of UTF-8 to encode Emacs
> > strings.
> > The extension is the obvious one already in use in multiple places.
>
> It is only used in one place: the internal representation of
> characters in buffers and strings.  Emacs _never_ lets this internal
> representation leak outside.

If I evaluate the following in *scratch*:

(with-temp-buffer
  (insert #x3fff40)
  (describe-char (point-min)))

then the resulting help buffer says
"buffer code: #xF8 #x8F #xBF #xBD #x80".  Is that not considered a leak?

> In practice the last sentence means that
> text that Emacs encoded in UTF-8 will only include either valid UTF-8
> sequences of characters whose codepoints are below #x200000 or single
> bytes that don't belong to any UTF-8 sequence.

I get the same result as above when running

(with-temp-buffer
  (call-process "echo" nil t nil (string #x3fff40))
  (describe-char (point-min)))

which means the non-UTF-8 sequence is even "leaked" to the external
process!

> You are suggesting to expose the internal representation to outside
> application code, which predictably will cause that representation to
> leak into Lisp.  That'd be a disaster.  We had something like that
> back in the Emacs 20 era, and it took many years to plug those leaks.
> We would be making a grave mistake to go back there.

I don't suggest leaking anything that isn't already leaked.  The
extension of the codespace to 22 bits is well documented.

> What you suggest is also impossible without deep changes in how we
> decode and encode text: that process maps codepoints above #x1FFFFF to
> either codepoints below that mark or to raw bytes.  So it's impossible
> to produce these high codes in UTF-8 compatible form while handling
> UTF-8 text.  To say nothing about the simple fact that no library
> function in any C library will ever be able to do anything useful with
> such codepoints, because they are our own invention.

Unless the behavior changed recently, that doesn't seem to be the case:

(encode-coding-string (string #x3fff40) 'utf-8-unix)
"\370\217\277\275\200"

Or are you talking about something different?

> > - There should be a one-to-one mapping between Emacs multibyte
> > strings and encoded module API strings.
>
> UTF-8 encoded strings satisfy that requirement.

No!  UTF-8 can only encode Unicode scalar values.  Only the Emacs
extension to UTF-8 (which, unfortunately, I think Emacs also calls
"UTF-8") satisfies this.  If you are talking about this extension, then
we are talking about the same thing anyway.

> > Therefore non-shortest forms, illegal code unit sequences, and code
> > unit sequences that would encode values outside the range of Emacs
> > characters are illegal and raise a signal.
>
> Once again, this was tried in the past and was found to be a bad idea.
> Emacs provides features to test the result of converting invalid
> sequences, for the purposes of detecting such problems, but it leaves
> that to the application.

It's probably OK to accept invalid sequences for consistency with
decode-coding-string and friends.  I don't really like it, though: the
module API, like decode-coding-string, is not a general-purpose UI for
end users, and accepting invalid sequences is error-prone and can even
introduce security issues (see e.g.
https://blogs.oracle.com/CoreJavaTechTips/entry/the_overhaul_of_java_utf).

> > Likewise, such sequences will never be returned from Emacs.
>
> Emacs doesn't return invalid sequences, if the original text didn't
> include raw bytes.  If there were raw bytes in the original text,
> Emacs has no choice but return them back, or else it will violate a
> basic expectation from a text-processing program: that it shall never
> change the portions of text that were not affected by the processing.

It seems that Emacs does return invalid sequences for characters such
as #x3fff40 (or anything else outside of Unicode except the 128 values
for encoding raw bytes).

Returning raw bytes means that encoding and decoding isn't a perfect
roundtrip:

(decode-coding-string (encode-coding-string (string #x3fffc2 #x3fffbb) 'utf-8-unix) 'utf-8-unix)
"»"

We might be able to live with that as it's an extreme edge case.

> > I think this is a relatively simple and unsurprising approach.  It
> > allows encoding the documented Emacs character space while still
> > being fully compatible with UTF-8 and not resorting to undocumented
> > Emacs internals.
>
> So does the approach I suggested.  The advantage of my suggestion is
> that it follows a long Emacs tradition about every aspect of encoding
> and decoding text, and doesn't require any changes in the existing
> infrastructure.

What are the exact differences between the approaches?  As far as I can
see, differences exist only for the following points:

- Accepting invalid sequences.  I consider that a bug in
  general-purpose APIs, including decode-coding-string.  However, given
  that Emacs already extends the Unicode codespace and therefore has to
  accept some invalid sequences anyway, it might be OK if it's clearly
  documented.
- Emitting raw bytes instead of extended sequences.  Though I'm not a
  fan of this, it might be unavoidable in order to treat strings
  transparently (which is desirable).
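
To make the byte sequence discussed above concrete, here is a minimal Python sketch that applies the ordinary UTF-8 bit pattern to a 22-bit code space, allowing a fifth byte.  The helper name encode_extended_utf8 is hypothetical, and the layout is an assumption inferred from the bytes quoted in this message, not taken from the Emacs sources.  Raw-byte characters are deliberately not treated specially here.

```python
def encode_extended_utf8(codepoint: int) -> bytes:
    """Encode one code point with the UTF-8 bit pattern, up to 5 bytes.

    Hypothetical helper: the 22-bit limit mirrors the code space mentioned
    in the message; raw-byte characters get no special handling.
    """
    if not 0 <= codepoint <= 0x3FFFFF:
        raise ValueError(f"code point out of range: {codepoint:#x}")
    if codepoint < 0x80:
        return bytes([codepoint])
    # (exclusive upper bound, lead-byte marker, number of continuation bytes)
    for limit, lead, ncont in ((0x800, 0xC0, 1), (0x10000, 0xE0, 2),
                               (0x200000, 0xF0, 3), (0x4000000, 0xF8, 4)):
        if codepoint < limit:
            break
    # The lead byte carries the topmost bits, each continuation byte 6 more.
    out = [lead | (codepoint >> (6 * ncont))]
    for shift in range(6 * (ncont - 1), -1, -6):
        out.append(0x80 | ((codepoint >> shift) & 0x3F))
    return bytes(out)


print(encode_extended_utf8(0x3FFF40).hex(" "))  # f8 8f bf bd 80
```

For code points that are Unicode scalar values, the result coincides with standard UTF-8; only values above #x10FFFF produce sequences that a conforming UTF-8 decoder would reject.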


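
The strict variant discussed in this message (non-shortest forms, malformed sequences, and out-of-range values raise an error instead of being passed through) could look roughly like the following Python sketch.  The function name decode_extended_utf8_strict is hypothetical, and the 5-byte layout is again an assumption based on the bytes quoted in the message.  It illustrates the proposal, not what Emacs actually does.

```python
def decode_extended_utf8_strict(data: bytes) -> list:
    """Decode bytes into code points; raise ValueError on malformed input."""
    # (lead-byte mask, lead-byte marker, continuation count, smallest legal value)
    forms = ((0x80, 0x00, 0, 0x0), (0xE0, 0xC0, 1, 0x80),
             (0xF0, 0xE0, 2, 0x800), (0xF8, 0xF0, 3, 0x10000),
             (0xFC, 0xF8, 4, 0x200000))
    result, i = [], 0
    while i < len(data):
        lead = data[i]
        for mask, marker, ncont, minimum in forms:
            if (lead & mask) == marker:
                break
        else:
            # Stray continuation byte or a lead byte beyond the 5-byte form.
            raise ValueError(f"invalid lead byte {lead:#04x} at offset {i}")
        tail = data[i + 1:i + 1 + ncont]
        if len(tail) != ncont or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        value = lead & ~mask & 0xFF
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        if value < minimum:
            raise ValueError(f"non-shortest form at offset {i}")
        if value > 0x3FFFFF:
            raise ValueError(f"value {value:#x} outside the 22-bit range")
        result.append(value)
        i += 1 + ncont
    return result


print([hex(c) for c in decode_extended_utf8_strict(b"\xf8\x8f\xbf\xbd\x80")])
# ['0x3fff40']
```

With such a decoder, the overlong sequence C0 AF or a lone byte BB is rejected outright.  The lenient alternative would have to map such bytes to raw-byte characters, which is where the imperfect roundtrip comes from.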