From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Philipp Stephani <p.stephani2@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: String encoding in json.c
Date: Sat, 23 Dec 2017 17:27:22 +0000
Message-ID: <CAArVCkTgrMe0LqFXcsmTccvUWKcTK9L4mLAXtrxOQj7rwmMv1A@mail.gmail.com>
References: <CAArVCkSoF9SQx_tUyx_wHc5bMKavY+qOeBYo=xiCLO4oXdz7jQ@mail.gmail.com>
	<83tvwhjyi5.fsf@gnu.org>
	<CAArVCkQCbuE4o_oYyXRc-vjS0ppHLW5cL_wRLMzhT+iFqYUZRA@mail.gmail.com>
	<83mv29jv99.fsf@gnu.org>
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="001a11481742988f610561053f6d"
Cc: emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
In-Reply-To: <83mv29jv99.fsf@gnu.org>
Xref: news.gmane.org gmane.emacs.devel:221393
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/221393>

--001a11481742988f610561053f6d
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Eli Zaretskii <eliz@gnu.org> wrote on Sat, 23 Dec 2017 at 16:53:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 23 Dec 2017 15:31:06 +0000
> > Cc: emacs-devel@gnu.org
> >
> >  The coding operations are "expensive no-ops" except when they aren't,
> >  and that is exactly when we need their 'expensive" parts.
> >
> > In which case are they not no-ops?
>
> When the input is not a valid UTF-8 sequence.  When that happens, we
> produce a special representation of such raw bytes instead of
> signaling EILSEQ and refusing to decode the input.  Encoding (if and
> when it is done) then performs the opposite conversion, producing the
> same single raw byte in the output stream.  This allows Emacs to
> manipulate text that included invalid sequences without crashing,
> because all the low-level primitives that walk buffer text and strings
> by characters assume the internal representation of each character is
> valid.
>

OK, thanks for the refresher. I was aware of the single byte
representation, but forgot how exactly it's handled during coding.


>
> > Using utf-8-unix as encoding seems to keep the encoding intact.
>
> First, you forget about decoding.


OK, let's treat encoding and decoding separately.

- We encode Lisp strings when passing them to Jansson. Jansson only accepts
UTF-8 strings and fails (with proper error reporting, not crashing) when
encountering non-UTF-8 strings. I think encoding can only make a difference
here for strings that contain sequences of bytes that are themselves valid
UTF-8 code unit sequences, such as "Ä\xC3\x84". This string is encoded as
"\xC3\x84\xC3\x84" using utf-8-unix. (Note how this is a case where
encoding and decoding are not inverses of each other.) Without encoding,
the string contents will be \xC3\x84 plus two invalid (overlong) 2-byte
sequences, which is how Emacs represents raw bytes internally. I
think it's not obvious at all which interpretation is correct; after all,
"Ä\xC3\x84" is not equal to "ÄÄ", but the two strings now result in the
same JSON representation. That is surprising at the very least, and I'd
argue that the other behavior (raising an error) would be more correct and
more obvious.

- We decode UTF-8 strings after receiving them from Jansson. Jansson
guarantees to only ever emit well-formed UTF-8. Given that for well-formed
UTF-8 strings, the UTF-8 representation and the Emacs representation are
one and the same, we don't need decoding.
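
To make the collision in the first point concrete, here is a rough sketch in
C. It is only a model, not Emacs code: `struct elem`, `put_utf8`,
`model_encode` and `same_encoding` are names I made up for illustration, and
the model simply assumes that encoding with utf-8-unix turns characters into
UTF-8 and emits each raw byte as that single byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one element of a Lisp string: either a Unicode
   character or a raw byte.  This is not an actual Emacs type.  */
struct elem {
  int is_raw_byte;   /* nonzero: `value' is a raw byte in 0x80..0xFF */
  uint32_t value;    /* otherwise: a Unicode scalar value */
};

/* Write the UTF-8 encoding of scalar value C to OUT; return its length.  */
static size_t put_utf8(uint32_t c, unsigned char *out)
{
  assert(c <= 0x10FFFF);
  if (c < 0x80) {
    out[0] = (unsigned char)c;
    return 1;
  }
  if (c < 0x800) {
    out[0] = (unsigned char)(0xC0 | (c >> 6));
    out[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
  }
  if (c < 0x10000) {
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
  }
  out[0] = (unsigned char)(0xF0 | (c >> 18));
  out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
  out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
  out[3] = (unsigned char)(0x80 | (c & 0x3F));
  return 4;
}

/* Model of the encoding step: characters become UTF-8, raw bytes are
   copied through as single bytes.  OUT needs room for 4 * N bytes.  */
size_t model_encode(const struct elem *s, size_t n, unsigned char *out)
{
  size_t len = 0;
  for (size_t i = 0; i < n; i++) {
    if (s[i].is_raw_byte) {
      assert(s[i].value >= 0x80 && s[i].value <= 0xFF);
      out[len++] = (unsigned char)s[i].value;
    } else {
      len += put_utf8(s[i].value, out + len);
    }
  }
  return len;
}

/* Return 1 if both modeled strings encode to the same byte sequence.
   Limited to 16 elements per string to keep the buffers on the stack.  */
int same_encoding(const struct elem *a, size_t na,
                  const struct elem *b, size_t nb)
{
  unsigned char ea[64], eb[64];
  assert(na <= 16 && nb <= 16);
  size_t la = model_encode(a, na, ea);
  size_t lb = model_encode(b, nb, eb);
  return la == lb && memcmp(ea, eb, la) == 0;
}
```

Feeding this the three elements U+00C4, raw \xC3, raw \x84 on one side and
the two elements U+00C4, U+00C4 on the other makes `same_encoding` return 1:
two strings that are not equal in Lisp end up as the same four bytes.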


>   And second, encoding keeps the
> encoding intact precisely because it is not a no-op: raw bytes are
> held in buffer and string text as special multibyte sequences, not as
> single bytes, so just copying them to output instead of encoding will
> produce non-UTF-8 multibyte sequences.
>

That's the correct behavior, I think. JSON values must be valid Unicode
strings, and raw bytes are not.
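
For reference, this is roughly the kind of check I'd expect a strict UTF-8
consumer to apply. It is a minimal sketch following RFC 3629, not Jansson's
actual code, and `utf8_well_formed` is a hypothetical name. It rejects lone
bytes such as \xC3, any lead byte outside C2..F4 (which covers the overlong
C0/C1 forms as well as 5-byte forms starting with F8), surrogates, and code
points above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the LEN bytes at S are well-formed UTF-8 in the sense of
   RFC 3629, 0 otherwise.  Hypothetical helper, for illustration only.  */
int utf8_well_formed(const unsigned char *s, size_t len)
{
  assert(s != NULL || len == 0);
  size_t i = 0;
  while (i < len) {
    unsigned char b = s[i];
    size_t need;            /* number of continuation bytes expected */
    unsigned long cp, min;  /* decoded code point and smallest legal value */

    if (b < 0x80) {         /* ASCII */
      i++;
      continue;
    } else if (b >= 0xC2 && b <= 0xDF) {
      need = 1; cp = b & 0x1F; min = 0x80;
    } else if (b >= 0xE0 && b <= 0xEF) {
      need = 2; cp = b & 0x0F; min = 0x800;
    } else if (b >= 0xF0 && b <= 0xF4) {
      need = 3; cp = b & 0x07; min = 0x10000;
    } else {
      return 0;             /* stray continuation byte, C0/C1, or F5..FF */
    }

    if (len - i - 1 < need)
      return 0;             /* sequence truncated at end of input */

    for (size_t k = 1; k <= need; k++) {
      unsigned char c = s[i + k];
      if ((c & 0xC0) != 0x80)
        return 0;           /* not a continuation byte */
      cp = (cp << 6) | (c & 0x3F);
    }

    if (cp < min)
      return 0;             /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
      return 0;             /* UTF-16 surrogate */
    if (cp > 0x10FFFF)
      return 0;             /* beyond the Unicode range */

    i += need + 1;
  }
  return 1;
}
```

With a check like this, "\xC3\x84" passes, while a lone "\xC3", the overlong
pair "\xC0\x84", and anything starting with \xF8 are all rejected, so text
containing raw bytes that is copied out without encoding produces an error
rather than silently different JSON.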


>
> > I've spot-checked some other code where we interface with external
> libraries, namely dbusbind.c and
> > gnutls.c. In no cases I've found explicit coding operations (except for
> filenames, where the situation is
> > different); these files always use SDATA directly. dbusbind.c even has
> the comment
> >
> >   /* We need to send a valid UTF-8 string.  We could encode `object'
> >      but by not encoding it, we guarantee it's valid utf-8, even if
> >      it contains eight-bit-bytes.  Of course, you can still send
> >      manually-crafted junk by passing a unibyte string.  */
>
> If gnutls.c and dbusbind.c don't encode and decode text that comes
> from and goes to outside, then they are buggy.


Not necessarily. As I said, the internal encoding of multibyte strings is
even documented in the Lisp reference, and the above comment indicates
that it's OK to rely on that information, at least within the Emacs
codebase. BTW, that comment was added by Stefan in
commit e454a4a330cc6524cf0d2604b4fafc32d5bda795, where he removed an
explicit encoding step.


>   (At least for
> gnutls.c, I think you are mistaken, because the encoding/decoding is
> in process.c, see, e.g., read_process_output.)
>

Some parts are definitely encoded, but for example, there is c_hostname in
Fgnutls_boot, which doesn't encode the user-supplied string.


>
> > It's the *current* json.c (and emacs-module.c) that's inconsistent
> > with the rest of the codebase.
>
> Well, I disagree with that conclusion.  Just look at all the calls to
> decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
> and you will see where we do that.
>

We obviously do *some* encoding/decoding. But when interacting with
third-party libraries, we seem to leave it out pretty frequently, if those
libraries use UTF-8 as well.

--001a11481742988f610561053f6d--