Re: JSON/YAML/TOML/etc. parsing performance

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

From: Philipp Stephani <p.stephani2@gmail.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: emacs-devel@gnu.org
Subject: Re: JSON/YAML/TOML/etc. parsing performance
Date: Tue, 03 Oct 2017 15:52:03 +0000	[thread overview]
Message-ID: <CAArVCkTKXBnfsxY2e+F_94w+aDkMtc5=wNqm3gKNfB=BMVURkQ@mail.gmail.com> (raw)
In-Reply-To: <83fub01c4m.fsf@gnu.org>

[-- Attachment #1: Type: text/plain, Size: 9519 bytes --]

Eli Zaretskii <eliz@gnu.org> schrieb am Di., 3. Okt. 2017 um 17:32 Uhr:

> >
> > > > +static Lisp_Object
> > > > +internal_catch_all_1 (Lisp_Object (*function) (void *), void
> *argument)
> > >
> > > Can you tell why you needed this (and the similar internal_catch_all)?
> > > Is that only because the callbacks could signal an error, or is there
> > > another reason?  If the former, I'd prefer to simplify the code and
> > > its maintenance by treating the error condition in a less drastic
> > > manner, and avoiding the call to xsignal.
> >
> > The callbacks (especially insert and before-/after-change-hook) can exit
> > nonlocally, but these nonlocal exits may not escape the Jansson callback.
> > Therefore all nonlocal exits must be caught here.
>
> Why can't you use record_unwind_protect, as we normally do in these
> situations?
>

How would that help? record_unwind_protect can't stop nonlocal exits.


>
> > > > +static _GL_ARG_NONNULL ((2)) Lisp_Object
> > > > +lisp_to_json_toplevel_1 (Lisp_Object lisp, json_t **json)
> > > > +{
> > > > +  if (VECTORP (lisp))
> > > > +    {
> > > > +      ptrdiff_t size = ASIZE (lisp);
> > > > +      eassert (size >= 0);
> > > > +      if (size > SIZE_MAX)
> > > > +        xsignal1 (Qoverflow_error, build_string ("vector is too
> long"));
> > >
> > > I think this error text is too vague.  Can we come up with something
> > > that describes the problem more accurately?
> >
> > Maybe, but it's probably not worth it because I don't think we have many
> > architectures where PTRDIFF_MAX > SIZE_MAX.
>
> Then why do we punish all the platforms with this runtime check?
>

If you think this cannot happen we can turn it into a runtime or
compile-time assertion.


>
> > > And btw, how can size be greater than SIZE_MAX in this case?  This is
> > > a valid Lisp object, isn't it?  (There are more such tests in the
> > > patch, e.g. in lisp_to_json, and I think they, too, are redundant.)
> >
> > Depends on the range of ptrdiff_t and size_t. IIUC nothing in the C
> > standard guarantees PTRDIFF_MAX <= SIZE_MAX.
>
> I wasn't talking about PTRDIFF_MAX, I was talking about 'size', which
> is the number of bytes in a Lisp string.  Since that Lisp string is a
> valid Lisp object, how can its size be greater than SIZE_MAX?  I don't
> think there's a way of creating such a Lisp string in Emacs, because
> functions that allocate memory for strings will prevent that.
>

Then I think we should at least add an assertion to document this.


>
> > > > +      *json = json_check (json_array ());
> > > > +      ptrdiff_t count = SPECPDL_INDEX ();
> > > > +      record_unwind_protect_ptr (json_release_object, json);
> > > > +      for (ptrdiff_t i = 0; i < size; ++i)
> > > > +        {
> > > > +          int status
> > > > +            = json_array_append_new (*json, lisp_to_json (AREF
> (lisp,
> > > i)));
> > > > +          if (status == -1)
> > > > +            json_out_of_memory ();
> > > > +          eassert (status == 0);
> > > > +        }
> > > > +      eassert (json_array_size (*json) == size);
> > > > +      clear_unwind_protect (count);
> > > > +      return unbind_to (count, Qnil);
> > >
> > > This, too, sounds more complex than it should: you record
> > > unwind-protect just so lisp_to_json's subroutines could signal an
> > > error due to insufficient memory, right?  Why can't we have the
> > > out-of-memory check only inside this loop, which you already do, and
> > > avoid the checks on lower levels (which undoubtedly cost us extra
> > > cycles)?  What do those extra checks in json_check buy us? the errors
> > > they signal are no more informative than the one in the loop, AFAICT.
> >
> > I don't understand what you mean. We need to check the return values of
> all
> > functions if we want to to use them later.
>
> Yes, but what problems can cause these return value to be invalid?
> AFAICT, only out-of-memory conditions, and that can be checked only
> once, there's no need to check every single allocation, because once
> an allocation fails, all the rest will too.
>

But if the first succeeds, the second can still fail, so we do need to
check all of them.


>
> > > > +static Lisp_Object
> > > > +json_insert (void *data)
> > > > +{
> > > > +  const struct json_buffer_and_size *buffer_and_size = data;
> > > > +  if (buffer_and_size->size > PTRDIFF_MAX)
> > > > +    xsignal1 (Qoverflow_error, build_string ("buffer too large"));
> > > > +  insert (buffer_and_size->buffer, buffer_and_size->size);
> > >
> > > I don't think we need this test here, as 'insert' already has the
> > > equivalent test in one of its subroutines.
> >
> > It can't, because it takes the byte length as ptrdiff_t. We need to check
> > before whether the size is actually in the valid range of ptrdiff_t.
>
> I'm sorry, but I don't see why we should support such exotic
> situations only for this one feature.  In all other cases we use
> either ptrdiff_t type or EMACS_INT type, and these issues disappear
> then.  Trying to support the SIZE_MAX > PTRDIFF_MAX situation causes
> the code to be much more complicated, harder to maintain, and more
> expensive at run time than it should be.


We can't avoid these checks. The API returns size_t, so we can only assume
that the numbers are in the valid range of size_t, which is larger than the
ones for positive ptrdiff_t's. There's no way around this.


> I'm not even sure there are
> such platforms out there that Emacs supports,


All platforms that I know of have SIZE_MAX > PTRDIFF_MAX.


> but if there are, we
> already have a gazillion problems like that all over our code.


Just because other parts of the codebase are buggy doesn't mean we need to
introduce more bugs in new code.


> I
> object to having such code just for this reason, sorry.
>

We can't avoid it.


>
> > > > +    case JSON_INTEGER:
> > > > +      {
> > > > +        json_int_t value = json_integer_value (json);
> > > > +        if (FIXNUM_OVERFLOW_P (value))
> > > > +          xsignal1 (Qoverflow_error,
> > > > +                    build_string ("JSON integer is too large"));
> > > > +        return make_number (value);
> > >
> > > This overflow test is also redundant, as make_number already does it.
> >
> > It can't, because json_int_t can be larger than EMACS_INT.
>
> OK, but then I think we should consider returning a float value, or a
> cons of 2 integers.  If these situations are frequent enough, users
> will thank us, and if they are very infrequent, they will never see
> such values, and we gain code simplicity and less non-local exits.
>

Returning a float (using make_natnum_or_float) might work, but in the end
I've decided against it because it could silently drop precision. I think
that's worse than signaling an error.


>
> > > > +    case JSON_STRING:
> > > > +      {
> > > > +        size_t size = json_string_length (json);
> > > > +        if (FIXNUM_OVERFLOW_P (size))
> > > > +          xsignal1 (Qoverflow_error, build_string ("JSON string is
> too
> > > long"));
> > > > +        return json_make_string (json_string_value (json), size);
> > >
> > > Once again, the overflow test is redundant, as make_specified_string
> > > (called by json_make_string) already includes an equivalent test.
> >
> > And once again, we need to check at least whether the size fits into
> > ptrdiff_t.
>
> No, we don't, as we don't in other similar cases.
>

I don't understand why you think these checks aren't necessary. Converting
between integral types when the number is out of range for the destination
type results in an implementation-defined result, i.e. it's unportable.
Even assuming the GCC convention, performing such conversions results in
dangerously incorrect values.


>
> > > > +    case JSON_ARRAY:
> > > > +      {
> > > > +        if (++lisp_eval_depth > max_lisp_eval_depth)
> > > > +          xsignal0 (Qjson_object_too_deep);
> > > > +        size_t size = json_array_size (json);
> > > > +        if (FIXNUM_OVERFLOW_P (size))
> > > > +          xsignal1 (Qoverflow_error, build_string ("JSON array is
> too
> > > long"));
> > > > +        Lisp_Object result = Fmake_vector (make_natnum (size),
> > > Qunbound);
> > >
> > > Likewise here: Fmake_vector makes sure the size is not larger than
> > > allowed.
> >
> > Same as above: It can't.
>
> It can and it does.
>

No, it can't. make_natnum takes a ptrdiff_t argument, and when passing a
value that's out of range for ptrdiff_t, it will receive an incorrect,
implementation-defined value.


>
> > > > +    case JSON_OBJECT:
> > > > +      {
> > > > +        if (++lisp_eval_depth > max_lisp_eval_depth)
> > > > +          xsignal0 (Qjson_object_too_deep);
> > > > +        size_t size = json_object_size (json);
> > > > +        if (FIXNUM_OVERFLOW_P (size))
> > > > +          xsignal1 (Qoverflow_error,
> > > > +                    build_string ("JSON object has too many
> elements"));
> > > > +        Lisp_Object result = CALLN (Fmake_hash_table, QCtest,
> Qequal,
> > > > +                                    QCsize, make_natnum (size));
> > >
> > > Likewise here: make_natnum does the equivalent test.
> >
> > It doesn't and can't.
>
> Yes, it does:
>
>   INLINE Lisp_Object
>   make_natnum (EMACS_INT n)
>   {
>     eassert (0 <= n && n <= MOST_POSITIVE_FIXNUM);  <<<<<<<<<<<<<<<
>     EMACS_INT int0 = Lisp_Int0;
>

We're not talking about the same thing. What if make_natnum is called with
a value that doesn't fit in EMACS_INT?
Also an assertion is incorrect here because the overflowing value comes
from user data.

[-- Attachment #2: Type: text/html, Size: 13387 bytes --]

next prev parent reply	other threads:[~2017-10-03 15:52 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-09-16 15:54 JSON/YAML/TOML/etc. parsing performance Ted Zlatanov
2017-09-16 16:02 ` Mark Oteiza
2017-09-17  0:02   ` Richard Stallman
2017-09-17  3:13     ` Mark Oteiza
2017-09-18  0:00       ` Richard Stallman
2017-09-17  0:02 ` Richard Stallman
2017-09-18 13:46   ` Ted Zlatanov
2017-09-17 18:46 ` Philipp Stephani
2017-09-17 19:05   ` Eli Zaretskii
2017-09-17 20:27     ` Philipp Stephani
2017-09-17 22:41       ` Mark Oteiza
2017-09-18 13:53       ` Ted Zlatanov
2017-09-17 21:17   ` Speed of Elisp (was: JSON/YAML/TOML/etc. parsing performance) Stefan Monnier
2017-09-18 13:26   ` JSON/YAML/TOML/etc. parsing performance Philipp Stephani
2017-09-18 13:58     ` Mark Oteiza
2017-09-18 14:14       ` Philipp Stephani
2017-09-18 14:28         ` Mark Oteiza
2017-09-18 14:36           ` Philipp Stephani
2017-09-18 15:02             ` Eli Zaretskii
2017-09-18 16:14               ` Philipp Stephani
2017-09-18 17:33                 ` Eli Zaretskii
2017-09-18 19:57                 ` Thien-Thi Nguyen
2017-09-18 14:57     ` Eli Zaretskii
2017-09-18 15:07       ` Mark Oteiza
2017-09-18 15:51         ` Eli Zaretskii
2017-09-18 16:22           ` Philipp Stephani
2017-09-18 18:08             ` Eli Zaretskii
2017-09-19 19:32               ` Richard Stallman
2017-09-18 17:26           ` Glenn Morris
2017-09-18 18:16             ` Eli Zaretskii
2017-09-18 16:08       ` Philipp Stephani
2017-09-19  8:18     ` Philipp Stephani
2017-09-19 19:09       ` Eli Zaretskii
2017-09-28 21:19         ` Philipp Stephani
2017-09-28 21:27           ` Stefan Monnier
2017-09-29 19:55           ` Eli Zaretskii
2017-09-30 22:02             ` Philipp Stephani
2017-10-01 18:06               ` Eli Zaretskii
2017-10-03 12:26                 ` Philipp Stephani
2017-10-03 15:31                   ` Eli Zaretskii
2017-10-03 15:52                     ` Philipp Stephani [this message]
2017-10-03 16:26                       ` Eli Zaretskii
2017-10-03 17:10                         ` Eli Zaretskii
2017-10-03 18:37                           ` Philipp Stephani
2017-10-03 20:52                   ` Paul Eggert
2017-10-04  5:33                     ` Eli Zaretskii
2017-10-04  6:41                       ` Paul Eggert
2017-10-04  8:03                         ` Eli Zaretskii
2017-10-04 17:51                           ` Paul Eggert
2017-10-04 19:38                             ` Eli Zaretskii
2017-10-04 21:24                               ` Paul Eggert
2017-10-05  1:48                                 ` Paul Eggert
2017-10-05  7:14                                   ` Eli Zaretskii
2017-10-08 22:52                                   ` Philipp Stephani
2017-10-09  5:54                                     ` Paul Eggert
2017-10-29 20:48                                       ` Philipp Stephani
2017-10-09  6:38                                     ` Eli Zaretskii
2017-10-05  7:12                                 ` Eli Zaretskii
2017-10-06  1:58                                   ` Paul Eggert
2017-10-06  7:40                                     ` Eli Zaretskii
2017-10-06 19:36                                       ` Paul Eggert
2017-10-06 21:03                                         ` Eli Zaretskii
2017-10-08 23:09                                     ` Philipp Stephani
2017-10-09  6:19                                       ` Paul Eggert
2017-10-29 20:48                                         ` Philipp Stephani
2017-10-29 22:49                                           ` Philipp Stephani
2017-12-09 23:05                                             ` Philipp Stephani
2017-12-10  7:08                                               ` Eli Zaretskii
2017-12-10 13:26                                                 ` Philipp Stephani
2017-12-10 13:32                                                   ` Ted Zlatanov
2017-10-08 23:04                                   ` Philipp Stephani
2017-10-09  6:47                                     ` Eli Zaretskii
2017-10-08 17:58                     ` Philipp Stephani
2017-10-08 18:42                       ` Eli Zaretskii
2017-10-08 23:14                         ` Philipp Stephani
2017-10-09  6:53                           ` Eli Zaretskii
2017-10-29 20:41                             ` Philipp Stephani
2017-10-09  6:22                       ` Paul Eggert
2017-10-01 18:38               ` Eli Zaretskii
2017-10-03 12:12                 ` Philipp Stephani
2017-10-03 14:54                   ` Eli Zaretskii

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAArVCkTKXBnfsxY2e+F_94w+aDkMtc5=wNqm3gKNfB=BMVURkQ@mail.gmail.com' \
    --to=p.stephani2@gmail.com \
    --cc=eliz@gnu.org \
    --cc=emacs-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).