From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Philipp Stephani Newsgroups: gmane.emacs.devel Subject: Re: String encoding in json.c Date: Sat, 23 Dec 2017 15:31:06 +0000 Message-ID: References: <83tvwhjyi5.fsf@gnu.org> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="001a1144d8b2cda52f0561039f8b" X-Trace: blaine.gmane.org 1514043014 31974 195.159.176.226 (23 Dec 2017 15:30:14 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Sat, 23 Dec 2017 15:30:14 +0000 (UTC) Cc: emacs-devel@gnu.org To: Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sat Dec 23 16:30:10 2017 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eSlkX-0007lh-H2 for ged-emacs-devel@m.gmane.org; Sat, 23 Dec 2017 16:30:09 +0100 Original-Received: from localhost ([::1]:45151 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eSlmU-0006lc-5t for ged-emacs-devel@m.gmane.org; Sat, 23 Dec 2017 10:32:10 -0500 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:41556) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eSllh-0006kf-EN for emacs-devel@gnu.org; Sat, 23 Dec 2017 10:31:22 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eSllf-0004YV-Vu for emacs-devel@gnu.org; Sat, 23 Dec 2017 10:31:21 -0500 Original-Received: from mail-qk0-x235.google.com ([2607:f8b0:400d:c09::235]:46737) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1eSlle-0004X6-7T; Sat, 23 Dec 2017 10:31:18 -0500 Original-Received: by mail-qk0-x235.google.com with SMTP id b132so7903425qkc.13; Sat, 23 Dec 2017 07:31:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; 
h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=KRMe5LCkNBRiYfsF23s3G3J6A/Sr6ISGEYlsHvaNGaM=; b=Haj9rWJaOMdh4VpC2SQerPUFnx0sr37ZSKXZV/Yy1L9InLMCklBno3uGKT7LfIX98C SuaacvAi9JWE1ULPIiBcxObOBAPScDgPGTrzNJLb1pJ/NOXKG++grn1rZ+xhp8wvL+bZ NXsQgBFQOVy4kF3oTzvQgVIdvg6pmwP/xKaas2ZUP9lLPNFF58uvcE7g4bypBRNg3Xsu XOY4uNROQrTAigc0vwI2dehGvdldADxAVIs/aN1n4TGbOsQOZeYugSUeubGMEMwaaD9U MAUm9couCZu+8gegYOQNEpL+MP/TIbYNW3rkk77S0tOis48HpevEZUW7l1LPhWe5Ae9M QO8Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=KRMe5LCkNBRiYfsF23s3G3J6A/Sr6ISGEYlsHvaNGaM=; b=uAbFjdmUGH3M282Xy2dMhmjuDuLtNNL4Mav4gsT9WyF9YScDzdLWQ3Y9TSPrwincQX cAhih9TSW0Y0TRtlXHMkPTo09PKog/CY+i0wtPsknKwQDqXKvU7q4tfpCMfBCVg6h+p6 QMWcRVkwuwSgQNc/bsLBRxboXTWreuZmf7UdO0A1avs34dWtPiR6ozSxJJr371AQ5JLr 1jvgy3p8xWbc71MD4js5uKKAPBPPiSYEgLj3cJTXsFAVciffPcfIkZY3KxrC6V4zX81m 9jRsCWnjovMf2d0LJd5sFlW9ezkC7bq25lMI4uMvsE5tF/VGoyS80rmIe9DoCccJ2Rxc psXw== X-Gm-Message-State: AKGB3mIvGD5GFhb4pGgoMmSkRLVD8+sSSwKIEUnXqFyPFS7R5pGQU49a iEtQbBGETy4PNHFCVX/611Q7YNfI4uYvGkosT1WQ6Q== X-Google-Smtp-Source: ACJfBos/jpoZFb2AUTkJrWXMleEMTKWcgz2xzYw+/Ns3stQWHSjXvUxwN1Az1NbgzXmIRE2YOgYDKHtpDbtzQ4774x0= X-Received: by 10.55.33.17 with SMTP id h17mr21686583qkh.143.1514043077402; Sat, 23 Dec 2017 07:31:17 -0800 (PST) In-Reply-To: <83tvwhjyi5.fsf@gnu.org> X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:400d:c09::235 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:221383 Archived-At:

--001a1144d8b2cda52f0561039f8b
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Eli Zaretskii <eliz@gnu.org> wrote on Sat, 23 Dec 2017 at 15:43:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Sat, 23 Dec 2017 14:26:09 +0000
> >
> > I've benchmarked serialization and parsing of JSON with and without explicit encoding. I've found that leaving
> > out the coding makes both operations significantly faster – from a speedup of a factor of 1.11 ± 0.06 for
> > parsing canada.json to 1.57 ± 0.08 for serializing twitter.json. Other speedups are in between, but the
> > speedup is always significant (to at least one standard deviation). All unit tests pass when leaving out the
> > coding steps – which isn't surprising given that currently the coding operations are expensive no-ops.
>
> The coding operations are "expensive no-ops" except when they aren't,
> and that is exactly when we need their "expensive" parts.

In which case are they not no-ops? I've spot-checked some of the implementation details of coding.c, and I haven't found obvious cases where they are not no-ops. Emacs appears to use the obvious extension of UTF-8 for integers that are not Unicode scalar values, and that's even documented in character.h and the Elisp reference manual. Using utf-8-unix as the encoding seems to keep the byte sequence intact.

> > Therefore I'd suggest documenting the internal string encoding in lisp.h or character.h and removing the explicit
> > coding in json.c and emacs-module.c. It's very unlikely that the internal string encoding will change frequently,
> > and if so, the unit tests should catch potential issues caused by that.
>
> As I've already said, I don't think this particular case should be an
> exception wrt how Emacs behaves with external strings everywhere
> else.  We suffer similar slow-downs in those other places as well, and
> IMO this is a small penalty to pay for making sure our objects are
> valid and won't crash Emacs.

I've spot-checked some other code where we interface with external libraries, namely dbusbind.c and gnutls.c. In no case have I found explicit coding operations (except for filenames, where the situation is different); these files always use SDATA directly. dbusbind.c even has the comment

  /* We need to send a valid UTF-8 string.  We could encode `object'
     but by not encoding it, we guarantee it's valid utf-8, even if
     it contains eight-bit-bytes.  Of course, you can still send
     manually-crafted junk by passing a unibyte string.  */

So not only do we not encode strings explicitly, we even *prefer* not encoding them, and we do rely on the internal string encoding being an extension of UTF-8. It's the *current* json.c (and emacs-module.c) that's inconsistent with the rest of the codebase.

--001a1144d8b2cda52f0561039f8b
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable


--001a1144d8b2cda52f0561039f8b--
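
To make the "when are they not no-ops" question concrete: handing the internal bytes unchanged to a library that expects UTF-8 is only safe when those bytes are already strict UTF-8, that is, every sequence is shortest-form and decodes to a Unicode scalar value. Bytes produced by the extension of UTF-8 mentioned above (for integers that are not scalar values) would fail such a check. The following is a minimal sketch of that check, not code from Emacs; `is_strict_utf8` is a hypothetical name, and nothing here reflects how coding.c or json.c is actually written.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of Emacs: return true if BUF[0..LEN)
   is strict UTF-8, i.e. every sequence is shortest-form and decodes
   to a Unicode scalar value (no surrogates, nothing above U+10FFFF).  */
bool
is_strict_utf8 (const unsigned char *buf, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char b = buf[i];
      size_t n;           /* number of continuation bytes expected */
      unsigned long cp;   /* code point decoded so far */
      unsigned long min;  /* smallest code point valid for this length */

      if (b < 0x80)
        {
          i++;            /* ASCII byte, always valid */
          continue;
        }
      else if ((b & 0xE0) == 0xC0)
        { n = 1; cp = b & 0x1F; min = 0x80; }
      else if ((b & 0xF0) == 0xE0)
        { n = 2; cp = b & 0x0F; min = 0x800; }
      else if ((b & 0xF8) == 0xF0)
        { n = 3; cp = b & 0x07; min = 0x10000; }
      else
        return false;     /* stray continuation byte or lead byte >= 0xF8 */

      if (len - i <= n)
        return false;     /* sequence is truncated */

      for (size_t k = 1; k <= n; k++)
        {
          unsigned char c = buf[i + k];
          if ((c & 0xC0) != 0x80)
            return false; /* not a continuation byte */
          cp = (cp << 6) | (c & 0x3F);
        }

      /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

      i += n + 1;
    }
  return true;
}
```

A caller could, under these assumptions, pass the bytes through unchanged when the check succeeds and fall back to an explicit encoding step only when it fails. Whether a single validation pass is cheaper than the current coding step is not something the benchmarks above measure.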