From: Philipp Stephani
Newsgroups: gmane.emacs.devel
Subject: Re: String encoding in json.c
Date: Sat, 23 Dec 2017 17:27:22 +0000
References: <83tvwhjyi5.fsf@gnu.org> <83mv29jv99.fsf@gnu.org>
In-Reply-To: <83mv29jv99.fsf@gnu.org>
To: Eli Zaretskii
Cc: emacs-devel@gnu.org


Eli Zaretskii <eliz@gnu.org> wrote on Sat, 23 Dec 2017 at 16:53:
> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 23 Dec 2017 15:31:06 +0000
> > Cc: emacs-devel@gnu.org
> >
> >  The coding operations are "expensive no-ops" except when they aren't,
> >  and that is exactly when we need their "expensive" parts.
> >
> > In which case are they not no-ops?
>
> When the input is not a valid UTF-8 sequence.  When that happens, we
> produce a special representation of such raw bytes instead of
> signaling EILSEQ and refusing to decode the input.  Encoding (if and
> when it is done) then performs the opposite conversion, producing the
> same single raw byte in the output stream.  This allows Emacs to
> manipulate text that included invalid sequences without crashing,
> because all the low-level primitives that walk buffer text and strings
> by characters assume the internal representation of each character is
> valid.

OK, thanks for the refresher. I was aware of the single byte
representation, but forgot how exactly it's handled during coding.

> > Using utf-8-unix as encoding seems to keep the encoding intact.
>
> First, you forget about decoding.

OK, let's treat encoding and decoding separately.

- We encode Lisp strings when passing them to Jansson. Jansson only
  accepts UTF-8 strings and fails (with proper error reporting, not
  crashing) when encountering non-UTF-8 strings. I think encoding can
  only make a difference here for strings that contain sequences of raw
  bytes that are themselves valid UTF-8 code unit sequences, such as
  "Ä\xC3\x84". This string is encoded as "\xC3\x84\xC3\x84" using
  utf-8-unix. (Note how this is a case where encoding and decoding are
  not inverses of each other.) Without encoding, the string contents
  will be \xC3\x84 plus two invalid 5-byte sequences. I think it's not
  obvious at all which interpretation is correct; after all,
  "Ä\xC3\x84" is not equal to "ÄÄ", but the two strings now result in
  the same JSON representation. This is at least surprising, and I'd
  argue that the other behavior (raising an error) would be more
  correct and more obvious.

- We decode UTF-8 strings after receiving them from Jansson. Jansson
  guarantees to only ever emit well-formed UTF-8. Given that for
  well-formed UTF-8 strings, the UTF-8 representation and the Emacs
  representation are one and the same, we don't need decoding.
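
The point about encoding and decoding not being inverses can be
sketched in a few lines of Python. This is only an analogy, not Emacs
or Jansson code: Python's surrogateescape error handler stands in for
Emacs's raw-byte characters, and all function and variable names are
made up for illustration.

```python
# Analogy only: "surrogateescape" plays the role of Emacs raw bytes.
# None of these names exist in Emacs or Jansson.


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, keeping undecodable bytes as escaped code points."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_lenient(text: str) -> bytes:
    """Encode to UTF-8, turning escaped code points back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def encode_strict(text: str) -> bytes:
    """Encode to UTF-8, raising UnicodeEncodeError for escaped raw bytes."""
    return text.encode("utf-8")


# "Ä" followed by the raw bytes 0xC3 0x84 (escaped as U+DCC3 U+DC84).
mixed = "\u00c4\udcc3\udc84"
# "ÄÄ", a string without any raw bytes.
plain = "\u00c4\u00c4"

# Both strings produce the same bytes under the lenient encoder.
same_bytes = encode_lenient(mixed) == encode_lenient(plain)

# Decoding those bytes yields "ÄÄ", so the round trip loses the raw bytes.
round_trip = decode_lenient(encode_lenient(mixed))
```

Under these assumptions the lenient encoder maps the two different
strings to the same four bytes, and decoding them again gives "ÄÄ"
rather than the original. The strict encoder instead signals an error
for the string with raw bytes, which corresponds to the "raising an
error" behavior argued for above.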

> And second, encoding keeps the
> encoding intact precisely because it is not a no-op: raw bytes are
> held in buffer and string text as special multibyte sequences, not as
> single bytes, so just copying them to output instead of encoding will
> produce non-UTF-8 multibyte sequences.

That's the correct behavior, I think. JSON values must be valid Unicode
strings, and raw bytes are not.
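
As a rough sketch of what "reject instead of convert" looks like at a
library boundary, the following Python snippet checks bytes for
well-formedness before handing them on. Python's strict UTF-8 codec
stands in for the validation a UTF-8-only library performs; the helper
names are hypothetical and this is not Jansson's actual code.

```python
# Hypothetical helpers; the strict codec is only a stand-in for the
# validation done by a library that accepts nothing but UTF-8.


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8."""
    try:
        # Strict decoding rejects truncated, overlong and surrogate forms.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def pass_to_library(data: bytes) -> bytes:
    """Hand data over unchanged, or report an error instead of crashing."""
    if not is_well_formed_utf8(data):
        raise ValueError("string is not well-formed UTF-8")
    return data
```

With this check, the bytes for "Ä" pass through untouched, while a lone
continuation byte or an overlong two-byte sequence (used here only as
a generic example of an ill-formed multibyte sequence) is reported as
an error to the caller.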

> > I've spot-checked some other code where we interface with external
> > libraries, namely dbusbind.c and gnutls.c. In no cases I've found
> > explicit coding operations (except for filenames, where the situation
> > is different); these files always use SDATA directly. dbusbind.c even
> > has the comment
> >
> >   /* We need to send a valid UTF-8 string.  We could encode `object'
> >      but by not encoding it, we guarantee it's valid utf-8, even if
> >      it contains eight-bit-bytes.  Of course, you can still send
> >      manually-crafted junk by passing a unibyte string.  */
>
> If gnutls.c and dbusbind.c don't encode and decode text that comes
> from and goes to outside, then they are buggy.

Not necessarily. As mentioned, the internal encoding of multibyte
strings is even documented in the Lisp reference; and the above comment
indicates that it's OK to use that information at least within the
Emacs codebase.
BTW, that comment was added by Stefan in commit
e454a4a330cc6524cf0d2604b4fafc32d5bda795, where he removed an explicit
encoding step.
> (At least for
> gnutls.c, I think you are mistaken, because the encoding/decoding is
> in process.c, see, e.g., read_process_output.)

Some parts are definitely encoded, but for example, there is c_hostname
in Fgnutls_boot, which doesn't encode the user-supplied string.

> > It's the *current* json.c (and emacs-module.c) that's inconsistent
> > with the rest of the codebase.
>
> Well, I disagree with that conclusion.  Just look at all the calls to
> decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
> and you will see where we do that.

We obviously do *some* encoding/decoding. But when interacting with
third-party libraries, we seem to leave it out pretty frequently, if
those libraries use UTF-8 as well.