From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Philipp Stephani <p.stephani2@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: JSON/YAML/TOML/etc. parsing performance
Date: Sun, 08 Oct 2017 23:04:10 +0000
Message-ID: <CAArVCkQddXEqLYwPU7xXUND0jpQ2dixWpcnDf+08Lc9OO5BuwA@mail.gmail.com>
References: <87poaqhc63.fsf@lifelogs.com>
	<CAArVCkQTLp=Cmh-FM1R-WK=WYFX_hP=6XiUUinKRT17bciL+CQ@mail.gmail.com>
	<CAArVCkTj_1P+fTDCzEY5xG8bBB7B6ctNkQCv+bAxt=N_cuD05Q@mail.gmail.com>
	<CAArVCkS52m8SGeOQt19k+XsfZnxy+bh9LJMyX=h+e67_adP6Mg@mail.gmail.com>
	<8360ceh5f1.fsf@gnu.org>
	<CAArVCkRvSaS-orqHcVPtZ2etUnRiY39okHh+6sYV-mtQQRYs-g@mail.gmail.com>
	<83h8vl5lf9.fsf@gnu.org>
	<CAArVCkRSY+bUiX2n1sysxy4dRTLE5M8R3OZmt+1W0-OfiYGDuA@mail.gmail.com>
	<83r2um3fqi.fsf@gnu.org>
	<CAArVCkSR5w5B4_E0UU8uJtZJtr1oP51md4-i-J+NSRCGpGP_9A@mail.gmail.com>
	<43520b71-9e25-926c-d744-78098dad6441@cs.ucla.edu>
	<83o9pnzddc.fsf@gnu.org>
	<472176ce-846b-1f24-716b-98eb95ceaa47@cs.ucla.edu>
	<83d163z6dy.fsf@gnu.org>
	<73477c99-1600-a53d-d84f-737837d0f91f@cs.ucla.edu>
	<83poa2ya8j.fsf@gnu.org>
	<21b0ba97-ed49-43ae-e86f-63fba762353a@cs.ucla.edu>
	<83lgkqxe3l.fsf@gnu.org>
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="001a113d78f8290b42055b1118f1"
X-Trace: blaine.gmane.org 1507503875 12381 195.159.176.226 (8 Oct 2017 23:04:35 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Sun, 8 Oct 2017 23:04:35 +0000 (UTC)
Cc: emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>, Paul Eggert <eggert@cs.ucla.edu>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Oct 09 01:04:29 2017
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by blaine.gmane.org with esmtp (Exim 4.84_2)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1e1KcW-0002Ro-VP
	for ged-emacs-devel@m.gmane.org; Mon, 09 Oct 2017 01:04:29 +0200
Original-Received: from localhost ([::1]:55374 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1e1Kce-0000Op-BY
	for ged-emacs-devel@m.gmane.org; Sun, 08 Oct 2017 19:04:36 -0400
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:58963)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1e1KcU-0000OS-AI
	for emacs-devel@gnu.org; Sun, 08 Oct 2017 19:04:27 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1e1KcS-0002mr-Mc
	for emacs-devel@gnu.org; Sun, 08 Oct 2017 19:04:26 -0400
Original-Received: from mail-oi0-x22f.google.com ([2607:f8b0:4003:c06::22f]:46634)
	by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
	(Exim 4.71) (envelope-from <p.stephani2@gmail.com>)
	id 1e1KcQ-0002lt-H8; Sun, 08 Oct 2017 19:04:22 -0400
Original-Received: by mail-oi0-x22f.google.com with SMTP id n82so29704456oig.3;
	Sun, 08 Oct 2017 16:04:22 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
	h=mime-version:references:in-reply-to:from:date:message-id:subject:to
	:cc; bh=VYTRXVUmWipP8+g0+Cc/hgMw2JRI+ssDVkOG8d4BDnE=;
	b=k+XvJROuUUW9zQz4LLtLJEAxbzZBNcJeT0Do7hc06bUs565gBmg6YFIYd1qy8NkIAg
	/mIcScFf5anXU6Xb57/o+9yocOlq/RKr1yZA1aoCWDW8hCLqIZQPnieRTqGSWBx7JoxR
	Ilm37olS4K/UNenHWaBJh9Iy/kgsajojEizykbSfHwkb9+3eVTaAAZf5rucJGJUORcwm
	hulsy2lai0GCiBt2WQ4IwL94YhO2VengLDVlPxU+ojvM7PW4tU2LHQWcwbAeS1L60Nqd
	wRevfNpRL2rRwTTB0lYfb961Xaa1e8gNW1mLq3ASMDSE3aVFbU96vSL+qkYWghDufZy8
	tRQg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:mime-version:references:in-reply-to:from:date
	:message-id:subject:to:cc;
	bh=VYTRXVUmWipP8+g0+Cc/hgMw2JRI+ssDVkOG8d4BDnE=;
	b=eGyTWevYnlJlQ/NIZKl5OAYYnLtuHJB9azHdKW+94K1VEfBR/SeXB4QCKrlIx+/n0/
	6QndjMwl5g3/vFbel6RGRjELTG+v18QGs9VF9Ni1AffHPFJG3chuWbcbpMl39ltGlk6b
	jVjyWU20i9BQyn6R1xEJulxrea+7ykgnajWl0jWGJ44NumvjACF/uRA5w5gQJW8oKdBi
	aeAyDatzkBIYdy9AONwcfQy53fZTtSa+w8/R0it5jYJsjrfU6EbNhjvAccqgUfLs7i+d
	SulkPZKO1sLQ4xz/697684j4iVYzxWDvbTvua0I3CD477h+OJmHlSbh4juTRVkGEFfhy
	hVMg==
X-Gm-Message-State: AMCzsaU+f08QOv0/hMhlDmecdXdPTzMw/N2xYG8Qqw6J1XGjgPxxrNWv
	OuLeSflKZW3VSH5o3Wj47p3nyi5wBLQlzEOm4hsFlA==
X-Google-Smtp-Source: AOwi7QAPGpUtp1pR1yznfJG5g91iyd44ju34In/hteY47aEcATRYFtyFrn7GwO1A3zUNTpqXFq3fRiokU8srgX1j/ns=
X-Received: by 10.202.86.206 with SMTP id k197mr4650640oib.254.1507503861466; 
	Sun, 08 Oct 2017 16:04:21 -0700 (PDT)
In-Reply-To: <83lgkqxe3l.fsf@gnu.org>
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-Received-From: 2607:f8b0:4003:c06::22f
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel/>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: "Emacs-devel" <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.devel:219276
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/219276>

--001a113d78f8290b42055b1118f1
Content-Type: text/plain; charset="UTF-8"

Eli Zaretskii <eliz@gnu.org> schrieb am Do., 5. Okt. 2017 um 09:12 Uhr:

> > Cc: p.stephani2@gmail.com, emacs-devel@gnu.org
> > From: Paul Eggert <eggert@cs.ucla.edu>
> > Date: Wed, 4 Oct 2017 14:24:59 -0700
> >
> > On 10/04/2017 12:38 PM, Eli Zaretskii wrote:
> > > if we did use size_t for the arguments which can clearly only be
> > > non-negative, the problems which we are discussing would not have
> > > happened
> > Sure, but we would also have worse problems, as size_t is inherently
> > more error-prone. ptrdiff_t overflows are reliably diagnosed when Emacs
> > is compiled with suitable GCC compiler options. size_t overflows cannot
> > be diagnosed, are all too common, and can cause serious trouble.
>
> If ptrdiff_t overflows are reliably diagnosed, then why do we have to
> test for them explicitly in our code, as in the proposed json.c?
> AFAIU, ptrdiff_t overflows are the _only_ reason for json.c checks
> whether a size_t value is too large, because similar checks for
> ptrdiff_t values are already in the low-level subroutines involved in
> creating Lisp objects.  So why couldn't those checks be avoided by
> simply assigning to a ptrdiff_t variables?


Signed integer overflow is only diagnosed when compiling with -ftrapv or
-fsanitize=undefined, which normally doesn't happen in production builds.
IIUC lossy numeric conversions are never diagnosed because they are not
undefined behavior, so we always need to check for them explicitly.
We also need to strictly distinguish between the case where an overflow is
a bug (incorrect assumptions in the Emacs source code itself) and the case
where it is a legitimate outcome due to user input (in this case, return
values from Jansson that overflow ptrdiff_t). The two cases need to be
treated differently: In the first case we can say it's UB and rely on
(hopefully regular and comprehensive) UBSan runs, in the second case we
need explicit checks and normal signalling as opposed to assertions.


>
> > The Emacs internals occasionally use size_t because underlying
> > primitives like 'malloc' do, so we do make some exceptions. Perhaps
> > there should be an exception here, for convenience with the JSON
> > library. The code snippets I've seen so far in this thread are not
> > enough context to judge whether an exception would be helpful in this
> > case. Generally speaking, though, unsigned types should be avoided
> > because they are more error-prone. This has long been the style in Emacs
> > internals, and it's served us well.
>
> I'm not arguing for general replacement of ptrdiff_t with size_t, only
> for doing that in those primitives where negative values are a clear
> mistake/bug.
>
> For example, let's take this case from your proposed changes:
>
>    static Lisp_Object
>   -json_make_string (const char *data, ptrdiff_t size)
>   +json_make_string (const char *data, size_t size)
>    {
>   +  if (PTRDIFF_MAX < size)
>   +    string_overflow ();
>      return make_specified_string (data, -1, size, true);
>    }
>
> If we were to change make_specified_string (and its subroutines, like
> make_uninit_multibyte_string etc.) to accept a size_t value in its 3rd
> argument, the need for the above check against PTRDIFF_MAX would
> disappear.
>

It wouldn't disappear, it would merely be shifted around. Arguments could
be made for either choice, but somewhere the check needs to happen.

  It would also make the higher-level code more
> reliable, because application-level programmers will not need to
> understand all the non-trivial intricacies of this stuff.


Every C programmer needs to understand these issues, no matter what
codebase they are working with. Lossy integral conversions are fundamental
design choices of the C language that can't be avoided.

C is a nasty language full of traps. You can try to hide some of the traps,
but you can't remove them.
(Arguably I've yet to come across a language that doesn't have nasty
integer types. Python 3, with integers auto-upgrading to bignums, might be
closest.)

--001a113d78f8290b42055b1118f1
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><br><br><div class=3D"gmail_quote"><div dir=3D"ltr">Eli Za=
retskii &lt;<a href=3D"mailto:eliz@gnu.org">eliz@gnu.org</a>&gt; schrieb am=
 Do., 5. Okt. 2017 um 09:12=C2=A0Uhr:<br></div><blockquote class=3D"gmail_q=
uote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1e=
x">&gt; Cc: <a href=3D"mailto:p.stephani2@gmail.com" target=3D"_blank">p.st=
ephani2@gmail.com</a>, <a href=3D"mailto:emacs-devel@gnu.org" target=3D"_bl=
ank">emacs-devel@gnu.org</a><br>
&gt; From: Paul Eggert &lt;<a href=3D"mailto:eggert@cs.ucla.edu" target=3D"=
_blank">eggert@cs.ucla.edu</a>&gt;<br>
&gt; Date: Wed, 4 Oct 2017 14:24:59 -0700<br>
&gt;<br>
&gt; On 10/04/2017 12:38 PM, Eli Zaretskii wrote:<br>
&gt; &gt; if we did use size_t for the arguments which can clearly only be<=
br>
&gt; &gt; non-negative, the problems which we are discussing would not have=
<br>
&gt; &gt; happened<br>
&gt; Sure, but we would also have worse problems, as size_t is inherently<b=
r>
&gt; more error-prone. ptrdiff_t overflows are reliably diagnosed when Emac=
s<br>
&gt; is compiled with suitable GCC compiler options. size_t overflows canno=
t<br>
&gt; be diagnosed, are all too common, and can cause serious trouble.<br>
<br>
If ptrdiff_t overflows are reliably diagnosed, then why do we have to<br>
test for them explicitly in our code, as in the proposed json.c?<br>
AFAIU, ptrdiff_t overflows are the _only_ reason for json.c checks<br>
whether a size_t value is too large, because similar checks for<br>
ptrdiff_t values are already in the low-level subroutines involved in<br>
creating Lisp objects.=C2=A0 So why couldn&#39;t those checks be avoided by=
<br>
simply assigning to a ptrdiff_t variables?</blockquote><div><br></div><div>=
Signed integer overflow is only diagnosed when compiling with -ftrapv or -f=
sanitize=3Dundefined, which normally doesn&#39;t happen in production build=
s.</div><div>IIUC lossy numeric conversions are never diagnosed because the=
y are not undefined behavior, so we always need to check for them explicitl=
y.</div><div>We also need to strictly distinguish between the case where an=
 overflow is a bug (incorrect assumptions in the Emacs source code itself) =
and the case where it is a legitimate outcome due to user input (in this ca=
se, return values from Jansson that overflow ptrdiff_t). The two cases need=
 to be treated differently: In the first case we can say it&#39;s UB and re=
ly on (hopefully regular and comprehensive) UBSan runs, in the second case =
we need explicit checks and normal signalling as opposed to assertions.</di=
v><div>=C2=A0</div><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 =
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
&gt; The Emacs internals occasionally use size_t because underlying<br>
&gt; primitives like &#39;malloc&#39; do, so we do make some exceptions. Pe=
rhaps<br>
&gt; there should be an exception here, for convenience with the JSON<br>
&gt; library. The code snippets I&#39;ve seen so far in this thread are not=
<br>
&gt; enough context to judge whether an exception would be helpful in this<=
br>
&gt; case. Generally speaking, though, unsigned types should be avoided<br>
&gt; because they are more error-prone. This has long been the style in Ema=
cs<br>
&gt; internals, and it&#39;s served us well.<br>
<br>
I&#39;m not arguing for general replacement of ptrdiff_t with size_t, only<=
br>
for doing that in those primitives where negative values are a clear<br>
mistake/bug.<br>
<br>
For example, let&#39;s take this case from your proposed changes:<br>
<br>
=C2=A0 =C2=A0static Lisp_Object<br>
=C2=A0 -json_make_string (const char *data, ptrdiff_t size)<br>
=C2=A0 +json_make_string (const char *data, size_t size)<br>
=C2=A0 =C2=A0{<br>
=C2=A0 +=C2=A0 if (PTRDIFF_MAX &lt; size)<br>
=C2=A0 +=C2=A0 =C2=A0 string_overflow ();<br>
=C2=A0 =C2=A0 =C2=A0return make_specified_string (data, -1, size, true);<br=
>
=C2=A0 =C2=A0}<br>
<br>
If we were to change make_specified_string (and its subroutines, like<br>
make_uninit_multibyte_string etc.) to accept a size_t value in its 3rd<br>
argument, the need for the above check against PTRDIFF_MAX would<br>
disappear.<br></blockquote><div><br></div><div>It wouldn&#39;t disappear, i=
t would merely be shifted around. Arguments could be made for either choice=
, but somewhere the check needs to happen.</div><div><br></div><blockquote =
class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid=
;padding-left:1ex">=C2=A0 It would also make the higher-level code more<br>
reliable, because application-level programmers will not need to<br>
understand all the non-trivial intricacies of this stuff.=C2=A0</blockquote=
><div><br></div><div>Every C programmer needs to understand these issues, n=
o matter what codebase they are working with. Lossy integral conversions ar=
e fundamental design choices of the C language that can&#39;t be avoided.=
=C2=A0</div><div><br></div><div>C is a nasty language full of traps. You ca=
n try to hide some of the traps, but you can&#39;t remove them.</div><div>(=
Arguably I&#39;ve yet to come across a language that doesn&#39;t have nasty=
 integer types. Python 3, with integers auto-upgrading to bignums, might be=
 closest.)</div></div></div>

--001a113d78f8290b42055b1118f1--