From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Philipp Stephani
Newsgroups: gmane.emacs.devel
Subject: Re: Dynamic loading progress
Date: Sun, 22 Nov 2015 09:25:08 +0000
References: <83k2ptq5t3.fsf@gnu.org> <87h9kxx60e.fsf@lifelogs.com>
 <877flswse5.fsf@lifelogs.com> <8737wgw7kf.fsf@lifelogs.com>
 <87io5bv1it.fsf@lifelogs.com> <87egfzuwca.fsf@lifelogs.com>
 <876118u6f2.fsf@lifelogs.com> <8737w3qero.fsf@lifelogs.com>
 <831tbn9g9j.fsf@gnu.org> <878u5upw7o.fsf@lifelogs.com>
 <83ziya8xph.fsf@gnu.org> <83y4du80xo.fsf@gnu.org>
 <837fld6lps.fsf@gnu.org> <83si3z4s5n.fsf@gnu.org>
 <83mvu74nhm.fsf@gnu.org> <83d1v34hba.fsf@gnu.org>
In-Reply-To: <83d1v34hba.fsf@gnu.org>
To: Eli Zaretskii
Cc: aurelien.aptel+emacs@gmail.com, tzz@lifelogs.com, emacs-devel@gnu.org
Xref: news.gmane.org gmane.emacs.devel:194994
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=001a1148eec6d91d7d05251daf90

--001a1148eec6d91d7d05251daf90
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable


Eli Zaretskii <eliz@gnu.org> wrote on Sat, 21 Nov 2015 at 14:23:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 21 Nov 2015 12:11:45 +0000
> > Cc: tzz@lifelogs.com, aurelien.aptel+emacs@gmail.com, emacs-devel@gnu.org
> >
> >     No, we cannot, or rather should not. It is unreasonable to expect
> >     external modules to know the intricacies of the internal
> >     representation. Most Emacs hackers don't.
> >
> > Fine with me, but how would we then represent Emacs strings that are not
> > valid Unicode strings? Just raise an error?
>
> No need to raise an error.  Strings that are returned to modules
> should be encoded into UTF-8.  That encoding already takes care of
> these situations: it either produces the UTF-8 encoding of the
> equivalent Unicode characters, or outputs raw bytes.

Then we should document such situations and give module authors a way to
detect them. For example, what happens if a sequence of such raw bytes
happens to be a valid UTF-8 sequence? Is there a way for module code to
detect this situation?

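To make the ambiguity concrete, here is a minimal sketch in C. It assumes a
simplified model in which each string element is either a Unicode character
or a raw byte; `struct elem`, `encode_lenient` and
`raw_bytes_collide_with_character` are hypothetical names for illustration,
not Emacs internals or part of the module API. Under that assumption, one
character and two raw bytes can produce identical output:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one string element: a Unicode character or a
   raw byte.  This is NOT Emacs's actual internal representation.  */
struct elem
{
  int is_raw_byte;     /* nonzero: VALUE is a raw byte in 0x80..0xFF */
  unsigned long value; /* otherwise: a Unicode scalar value */
};

/* Encode N elements into OUT: characters as UTF-8, raw bytes verbatim.
   OUT must have room for 4 * N bytes.  Returns the bytes written.  */
static size_t
encode_lenient (const struct elem *in, size_t n, unsigned char *out)
{
  size_t len = 0;
  for (size_t i = 0; i < n; i++)
    {
      unsigned long c = in[i].value;
      if (in[i].is_raw_byte)
        {
          assert (c >= 0x80 && c <= 0xFF);
          out[len++] = (unsigned char) c;
        }
      else if (c < 0x80)
        out[len++] = (unsigned char) c;
      else if (c < 0x800)
        {
          out[len++] = (unsigned char) (0xC0 | (c >> 6));
          out[len++] = (unsigned char) (0x80 | (c & 0x3F));
        }
      else if (c < 0x10000)
        {
          out[len++] = (unsigned char) (0xE0 | (c >> 12));
          out[len++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
          out[len++] = (unsigned char) (0x80 | (c & 0x3F));
        }
      else
        {
          assert (c <= 0x10FFFF);
          out[len++] = (unsigned char) (0xF0 | (c >> 18));
          out[len++] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
          out[len++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
          out[len++] = (unsigned char) (0x80 | (c & 0x3F));
        }
    }
  return len;
}

/* Return 1 if the one-character string U+00E9 and the string of two raw
   bytes 0xC3 0xA9 encode to exactly the same byte sequence.  */
static int
raw_bytes_collide_with_character (void)
{
  const struct elem as_char[] = { { 0, 0xE9 } };
  const struct elem as_raw[] = { { 1, 0xC3 }, { 1, 0xA9 } };
  unsigned char a[4], b[8];
  size_t la = encode_lenient (as_char, 1, a);
  size_t lb = encode_lenient (as_raw, 2, b);
  return la == lb && memcmp (a, b, la) == 0;
}
```

In this model a module that only sees the output bytes has no way to tell
the two inputs apart, which is why some out-of-band signal would be needed.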

> We are using this all the time when we save files or send stuff over
> the network.
>
> >     No, I meant strict UTF-8, not its Emacs extension.
> >
> > That would be possible and provide a clean interface. However, Emacs
> > strings are extended, so we'd need to specify how they interact with
> > UTF-8 strings.
> >
> > * If a module passes a char sequence that's not a valid UTF-8 string,
> >   but a valid Emacs multibyte string, what should happen? Error,
> >   undefined behavior, silently accepted?
>
> We are quite capable of quietly accepting such strings, so that is
> what I would suggest.  Doing so would be in line with what Emacs does
> when such invalid sequences come from other sources, like files.

If we accept such strings, then we should document what the extensions are.
- Are UTF-8-like sequences encoding surrogate code points accepted?
- Are UTF-8-like sequences encoding integers outside the Unicode codespace
  accepted?
- Are non-shortest forms accepted?
- Are other invalid code unit sequences accepted?
If the answer to any of these is "yes", we can't say we accept UTF-8,
because we don't. Rather we should say what is actually accepted.

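For reference, the four questions above map onto the four rejection cases
of a strict UTF-8 validator. The following is a minimal sketch in C, not
code from Emacs; `utf8_is_strict` is a hypothetical name, and the checks
follow the well-formedness rules of the Unicode standard as I understand
them:

```c
#include <stddef.h>

/* Return 1 if the N bytes at S are well-formed UTF-8 in the strict
   sense, 0 otherwise.  Illustrative only; not an Emacs function.  */
static int
utf8_is_strict (const unsigned char *s, size_t n)
{
  size_t i = 0;
  while (i < n)
    {
      unsigned char b = s[i];
      unsigned long cp, min;
      size_t extra;

      if (b < 0x80)
        {
          i++;                  /* ASCII */
          continue;
        }
      else if ((b & 0xE0) == 0xC0)
        extra = 1, cp = b & 0x1F, min = 0x80;
      else if ((b & 0xF0) == 0xE0)
        extra = 2, cp = b & 0x0F, min = 0x800;
      else if ((b & 0xF8) == 0xF0)
        extra = 3, cp = b & 0x07, min = 0x10000;
      else
        return 0;               /* stray continuation byte or 0xF8..0xFF */

      if (n - i <= extra)
        return 0;               /* truncated sequence */
      for (size_t k = 1; k <= extra; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return 0;           /* missing continuation byte */
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }

      if (cp < min)
        return 0;               /* non-shortest form */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogate code point */
      if (cp > 0x10FFFF)
        return 0;               /* outside the Unicode codespace */
      i += extra + 1;
    }
  return 1;
}
```

An interface that accepts any input for which this function returns 0 is
accepting a superset of UTF-8, and that superset is what would need to be
documented.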

> > * If copy_string_contents is passed an Emacs string that is not a valid
> >   Unicode string, what should happen?
>
> How can that happen?  The Emacs string comes from the Emacs bowels, so
> it must be "valid" string by Emacs standards.  Or maybe I don't
> understand what you mean by "invalid Unicode string".

A sequence of integers where at least one element is not a Unicode scalar
value.

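Spelled out as code, that definition could look like the following minimal
C sketch. The function names are hypothetical and only illustrate the
definition; a Unicode scalar value is any code point in 0..0x10FFFF except
the surrogate range 0xD800..0xDFFF:

```c
#include <stddef.h>

/* Return 1 if C is a Unicode scalar value: a code point in the
   codespace 0..0x10FFFF that is not a surrogate.  */
static int
is_unicode_scalar_value (long c)
{
  return (c >= 0 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0x10FFFF);
}

/* Return the index of the first element of SEQ that is not a Unicode
   scalar value, or N if the whole sequence is a valid Unicode string
   in the sense used above.  */
static size_t
first_non_scalar (const long *seq, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    if (!is_unicode_scalar_value (seq[i]))
      break;
  return i;
}
```

A string containing a lone surrogate, or a character code above 0x10FFFF,
would be "invalid" by this test even though Emacs itself can hold it.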

> In any case, we already deal with any such problems when we save a
> buffer to a file, or send it over the network.  This isn't some new
> problem we need to cope with.

Yes, but the module interface is new; it doesn't necessarily have to have
the same behavior. If we say we emit only UTF-8, then we should do so.


> > OK, then we can use that, of course. The question of handling invalid
> > UTF-8 strings is still open, though, as make_multibyte_string doesn't
> > enforce valid UTF-8.
>
> It doesn't enforce valid UTF-8 because it can handle invalid UTF-8 as
> well.  That's by design.

Then whatever it handles needs to be specified.


> > If it's the contract of make_multibyte_string that it will always accept
> > UTF-8, then that should be added as a comment to that function.
> > Currently I don't see it documented anywhere.
>
> That part of the documentation is only revealed to veteran Emacs
> hackers, subject to swearing not to reveal that to the uninitiated and
> to some blood-letting that seals the oath ;-)

I see ;-)
--001a1148eec6d91d7d05251daf90--