From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Philipp Stephani Newsgroups: gmane.emacs.devel Subject: Re: JSON/YAML/TOML/etc. parsing performance Date: Sun, 08 Oct 2017 23:04:10 +0000 Message-ID: References: <87poaqhc63.fsf@lifelogs.com> <8360ceh5f1.fsf@gnu.org> <83h8vl5lf9.fsf@gnu.org> <83r2um3fqi.fsf@gnu.org> <43520b71-9e25-926c-d744-78098dad6441@cs.ucla.edu> <83o9pnzddc.fsf@gnu.org> <472176ce-846b-1f24-716b-98eb95ceaa47@cs.ucla.edu> <83d163z6dy.fsf@gnu.org> <73477c99-1600-a53d-d84f-737837d0f91f@cs.ucla.edu> <83poa2ya8j.fsf@gnu.org> <21b0ba97-ed49-43ae-e86f-63fba762353a@cs.ucla.edu> <83lgkqxe3l.fsf@gnu.org> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="001a113d78f8290b42055b1118f1" X-Trace: blaine.gmane.org 1507503875 12381 195.159.176.226 (8 Oct 2017 23:04:35 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Sun, 8 Oct 2017 23:04:35 +0000 (UTC) Cc: emacs-devel@gnu.org To: Eli Zaretskii , Paul Eggert Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Oct 09 01:04:29 2017 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1e1KcW-0002Ro-VP for ged-emacs-devel@m.gmane.org; Mon, 09 Oct 2017 01:04:29 +0200 Original-Received: from localhost ([::1]:55374 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1e1Kce-0000Op-BY for ged-emacs-devel@m.gmane.org; Sun, 08 Oct 2017 19:04:36 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:58963) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1e1KcU-0000OS-AI for emacs-devel@gnu.org; Sun, 08 Oct 2017 19:04:27 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1e1KcS-0002mr-Mc for emacs-devel@gnu.org; Sun, 08 Oct 2017 19:04:26 -0400 Original-Received: from mail-oi0-x22f.google.com ([2607:f8b0:4003:c06::22f]:46634) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1e1KcQ-0002lt-H8; Sun, 08 Oct 2017 19:04:22 -0400 Original-Received: by mail-oi0-x22f.google.com with SMTP id n82so29704456oig.3; Sun, 08 Oct 2017 16:04:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=VYTRXVUmWipP8+g0+Cc/hgMw2JRI+ssDVkOG8d4BDnE=; b=k+XvJROuUUW9zQz4LLtLJEAxbzZBNcJeT0Do7hc06bUs565gBmg6YFIYd1qy8NkIAg /mIcScFf5anXU6Xb57/o+9yocOlq/RKr1yZA1aoCWDW8hCLqIZQPnieRTqGSWBx7JoxR Ilm37olS4K/UNenHWaBJh9Iy/kgsajojEizykbSfHwkb9+3eVTaAAZf5rucJGJUORcwm hulsy2lai0GCiBt2WQ4IwL94YhO2VengLDVlPxU+ojvM7PW4tU2LHQWcwbAeS1L60Nqd wRevfNpRL2rRwTTB0lYfb961Xaa1e8gNW1mLq3ASMDSE3aVFbU96vSL+qkYWghDufZy8 tRQg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=VYTRXVUmWipP8+g0+Cc/hgMw2JRI+ssDVkOG8d4BDnE=; b=eGyTWevYnlJlQ/NIZKl5OAYYnLtuHJB9azHdKW+94K1VEfBR/SeXB4QCKrlIx+/n0/ 6QndjMwl5g3/vFbel6RGRjELTG+v18QGs9VF9Ni1AffHPFJG3chuWbcbpMl39ltGlk6b jVjyWU20i9BQyn6R1xEJulxrea+7ykgnajWl0jWGJ44NumvjACF/uRA5w5gQJW8oKdBi aeAyDatzkBIYdy9AONwcfQy53fZTtSa+w8/R0it5jYJsjrfU6EbNhjvAccqgUfLs7i+d SulkPZKO1sLQ4xz/697684j4iVYzxWDvbTvua0I3CD477h+OJmHlSbh4juTRVkGEFfhy hVMg== X-Gm-Message-State: AMCzsaU+f08QOv0/hMhlDmecdXdPTzMw/N2xYG8Qqw6J1XGjgPxxrNWv OuLeSflKZW3VSH5o3Wj47p3nyi5wBLQlzEOm4hsFlA== X-Google-Smtp-Source: AOwi7QAPGpUtp1pR1yznfJG5g91iyd44ju34In/hteY47aEcATRYFtyFrn7GwO1A3zUNTpqXFq3fRiokU8srgX1j/ns= X-Received: by 10.202.86.206 with SMTP id k197mr4650640oib.254.1507503861466; Sun, 08 Oct 2017 16:04:21 -0700 (PDT) In-Reply-To: <83lgkqxe3l.fsf@gnu.org> X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:4003:c06::22f X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:219276 Archived-At: --001a113d78f8290b42055b1118f1 Content-Type: text/plain; charset="UTF-8" Eli Zaretskii schrieb am Do., 5. Okt. 2017 um 09:12 Uhr: > > Cc: p.stephani2@gmail.com, emacs-devel@gnu.org > > From: Paul Eggert > > Date: Wed, 4 Oct 2017 14:24:59 -0700 > > > > On 10/04/2017 12:38 PM, Eli Zaretskii wrote: > > > if we did use size_t for the arguments which can clearly only be > > > non-negative, the problems which we are discussing would not have > > > happened > > Sure, but we would also have worse problems, as size_t is inherently > > more error-prone. ptrdiff_t overflows are reliably diagnosed when Emacs > > is compiled with suitable GCC compiler options. size_t overflows cannot > > be diagnosed, are all too common, and can cause serious trouble. > > If ptrdiff_t overflows are reliably diagnosed, then why do we have to > test for them explicitly in our code, as in the proposed json.c? > AFAIU, ptrdiff_t overflows are the _only_ reason for json.c checks > whether a size_t value is too large, because similar checks for > ptrdiff_t values are already in the low-level subroutines involved in > creating Lisp objects. So why couldn't those checks be avoided by > simply assigning to a ptrdiff_t variables? Signed integer overflow is only diagnosed when compiling with -ftrapv or -fsanitize=undefined, which normally doesn't happen in production builds. IIUC lossy numeric conversions are never diagnosed because they are not undefined behavior, so we always need to check for them explicitly. We also need to strictly distinguish between the case where an overflow is a bug (incorrect assumptions in the Emacs source code itself) and the case where it is a legitimate outcome due to user input (in this case, return values from Jansson that overflow ptrdiff_t). The two cases need to be treated differently: In the first case we can say it's UB and rely on (hopefully regular and comprehensive) UBSan runs, in the second case we need explicit checks and normal signalling as opposed to assertions. > > > The Emacs internals occasionally use size_t because underlying > > primitives like 'malloc' do, so we do make some exceptions. Perhaps > > there should be an exception here, for convenience with the JSON > > library. The code snippets I've seen so far in this thread are not > > enough context to judge whether an exception would be helpful in this > > case. Generally speaking, though, unsigned types should be avoided > > because they are more error-prone. This has long been the style in Emacs > > internals, and it's served us well. > > I'm not arguing for general replacement of ptrdiff_t with size_t, only > for doing that in those primitives where negative values are a clear > mistake/bug. > > For example, let's take this case from your proposed changes: > > static Lisp_Object > -json_make_string (const char *data, ptrdiff_t size) > +json_make_string (const char *data, size_t size) > { > + if (PTRDIFF_MAX < size) > + string_overflow (); > return make_specified_string (data, -1, size, true); > } > > If we were to change make_specified_string (and its subroutines, like > make_uninit_multibyte_string etc.) to accept a size_t value in its 3rd > argument, the need for the above check against PTRDIFF_MAX would > disappear. > It wouldn't disappear, it would merely be shifted around. Arguments could be made for either choice, but somewhere the check needs to happen. It would also make the higher-level code more > reliable, because application-level programmers will not need to > understand all the non-trivial intricacies of this stuff. Every C programmer needs to understand these issues, no matter what codebase they are working with. Lossy integral conversions are fundamental design choices of the C language that can't be avoided. C is a nasty language full of traps. You can try to hide some of the traps, but you can't remove them. (Arguably I've yet to come across a language that doesn't have nasty integer types. Python 3, with integers auto-upgrading to bignums, might be closest.) --001a113d78f8290b42055b1118f1 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


Eli Za= retskii <eliz@gnu.org> schrieb am= Do., 5. Okt. 2017 um 09:12=C2=A0Uhr:
> Cc: p.st= ephani2@gmail.com, emacs-devel@gnu.org
> From: Paul Eggert <eggert@cs.ucla.edu>
> Date: Wed, 4 Oct 2017 14:24:59 -0700
>
> On 10/04/2017 12:38 PM, Eli Zaretskii wrote:
> > if we did use size_t for the arguments which can clearly only be<= br> > > non-negative, the problems which we are discussing would not have=
> > happened
> Sure, but we would also have worse problems, as size_t is inherently > more error-prone. ptrdiff_t overflows are reliably diagnosed when Emac= s
> is compiled with suitable GCC compiler options. size_t overflows canno= t
> be diagnosed, are all too common, and can cause serious trouble.

If ptrdiff_t overflows are reliably diagnosed, then why do we have to
test for them explicitly in our code, as in the proposed json.c?
AFAIU, ptrdiff_t overflows are the _only_ reason for json.c checks
whether a size_t value is too large, because similar checks for
ptrdiff_t values are already in the low-level subroutines involved in
creating Lisp objects.=C2=A0 So why couldn't those checks be avoided by=
simply assigning to a ptrdiff_t variables?

= Signed integer overflow is only diagnosed when compiling with -ftrapv or -f= sanitize=3Dundefined, which normally doesn't happen in production build= s.
IIUC lossy numeric conversions are never diagnosed because the= y are not undefined behavior, so we always need to check for them explicitl= y.
We also need to strictly distinguish between the case where an= overflow is a bug (incorrect assumptions in the Emacs source code itself) = and the case where it is a legitimate outcome due to user input (in this ca= se, return values from Jansson that overflow ptrdiff_t). The two cases need= to be treated differently: In the first case we can say it's UB and re= ly on (hopefully regular and comprehensive) UBSan runs, in the second case = we need explicit checks and normal signalling as opposed to assertions.
=C2=A0

> The Emacs internals occasionally use size_t because underlying
> primitives like 'malloc' do, so we do make some exceptions. Pe= rhaps
> there should be an exception here, for convenience with the JSON
> library. The code snippets I've seen so far in this thread are not=
> enough context to judge whether an exception would be helpful in this<= br> > case. Generally speaking, though, unsigned types should be avoided
> because they are more error-prone. This has long been the style in Ema= cs
> internals, and it's served us well.

I'm not arguing for general replacement of ptrdiff_t with size_t, only<= br> for doing that in those primitives where negative values are a clear
mistake/bug.

For example, let's take this case from your proposed changes:

=C2=A0 =C2=A0static Lisp_Object
=C2=A0 -json_make_string (const char *data, ptrdiff_t size)
=C2=A0 +json_make_string (const char *data, size_t size)
=C2=A0 =C2=A0{
=C2=A0 +=C2=A0 if (PTRDIFF_MAX < size)
=C2=A0 +=C2=A0 =C2=A0 string_overflow ();
=C2=A0 =C2=A0 =C2=A0return make_specified_string (data, -1, size, true); =C2=A0 =C2=A0}

If we were to change make_specified_string (and its subroutines, like
make_uninit_multibyte_string etc.) to accept a size_t value in its 3rd
argument, the need for the above check against PTRDIFF_MAX would
disappear.

It wouldn't disappear, i= t would merely be shifted around. Arguments could be made for either choice= , but somewhere the check needs to happen.

=C2=A0 It would also make the higher-level code more
reliable, because application-level programmers will not need to
understand all the non-trivial intricacies of this stuff.=C2=A0

Every C programmer needs to understand these issues, n= o matter what codebase they are working with. Lossy integral conversions ar= e fundamental design choices of the C language that can't be avoided.= =C2=A0

C is a nasty language full of traps. You ca= n try to hide some of the traps, but you can't remove them.
(= Arguably I've yet to come across a language that doesn't have nasty= integer types. Python 3, with integers auto-upgrading to bignums, might be= closest.)
--001a113d78f8290b42055b1118f1--