From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: JSON/YAML/TOML/etc. parsing performance
Date: Tue, 03 Oct 2017 18:31:37 +0300
Message-ID: <83fub01c4m.fsf@gnu.org>
References: <87poaqhc63.fsf@lifelogs.com>
	<CAArVCkQTLp=Cmh-FM1R-WK=WYFX_hP=6XiUUinKRT17bciL+CQ@mail.gmail.com>
	<CAArVCkTj_1P+fTDCzEY5xG8bBB7B6ctNkQCv+bAxt=N_cuD05Q@mail.gmail.com>
	<CAArVCkS52m8SGeOQt19k+XsfZnxy+bh9LJMyX=h+e67_adP6Mg@mail.gmail.com>
	<8360ceh5f1.fsf@gnu.org>
	<CAArVCkRvSaS-orqHcVPtZ2etUnRiY39okHh+6sYV-mtQQRYs-g@mail.gmail.com>
	<83h8vl5lf9.fsf@gnu.org>
	<CAArVCkRSY+bUiX2n1sysxy4dRTLE5M8R3OZmt+1W0-OfiYGDuA@mail.gmail.com>
	<83r2um3fqi.fsf@gnu.org>
	<CAArVCkSR5w5B4_E0UU8uJtZJtr1oP51md4-i-J+NSRCGpGP_9A@mail.gmail.com>
Reply-To: Eli Zaretskii <eliz@gnu.org>
NNTP-Posting-Host: blaine.gmane.org
X-Trace: blaine.gmane.org 1507044762 14034 195.159.176.226 (3 Oct 2017 15:32:42 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Tue, 3 Oct 2017 15:32:42 +0000 (UTC)
Cc: emacs-devel@gnu.org
To: Philipp Stephani <p.stephani2@gmail.com>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Oct 03 17:32:37 2017
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by blaine.gmane.org with esmtp (Exim 4.84_2)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1dzPBR-0002l7-8R
	for ged-emacs-devel@m.gmane.org; Tue, 03 Oct 2017 17:32:33 +0200
Original-Received: from localhost ([::1]:59155 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1dzPBY-00053g-LL
	for ged-emacs-devel@m.gmane.org; Tue, 03 Oct 2017 11:32:40 -0400
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:48909)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1dzPAx-00053P-7a
	for emacs-devel@gnu.org; Tue, 03 Oct 2017 11:32:04 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <eliz@gnu.org>) id 1dzPAn-0006Jc-6G
	for emacs-devel@gnu.org; Tue, 03 Oct 2017 11:32:03 -0400
Original-Received: from fencepost.gnu.org ([2001:4830:134:3::e]:34962)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <eliz@gnu.org>)
	id 1dzPAn-0006JJ-2H; Tue, 03 Oct 2017 11:31:53 -0400
Original-Received: from 84.94.185.246.cable.012.net.il ([84.94.185.246]:2805
	helo=home-c4e4a596f7)
	by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256)
	(Exim 4.82) (envelope-from <eliz@gnu.org>)
	id 1dzPAm-0003p5-Fg; Tue, 03 Oct 2017 11:31:52 -0400
In-reply-to: <CAArVCkSR5w5B4_E0UU8uJtZJtr1oP51md4-i-J+NSRCGpGP_9A@mail.gmail.com>
	(message from Philipp Stephani on Tue, 03 Oct 2017 12:26:32 +0000)
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 2001:4830:134:3::e
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel/>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: "Emacs-devel" <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.devel:219039
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/219039>

> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Tue, 03 Oct 2017 12:26:32 +0000
> Cc: emacs-devel@gnu.org
> 
> > > +#if __has_attribute (warn_unused_result)
> > > +# define ATTRIBUTE_WARN_UNUSED_RESULT __attribute__
> > ((__warn_unused_result__))
> > > +#else
> > > +# define ATTRIBUTE_WARN_UNUSED_RESULT
> > > +#endif
> >
> > Hmm... why do we need this attribute?  You use it with 2 static
> > functions, so this sounds like a left-over from the development stage?
> >
> 
> It's not strictly needed (and if you don't like it, I can remove it), but
> it helps catching memory leaks.

No strong feeling here, but I'd be interested in hearing Paul's
opinion on this.

> > > +static Lisp_Object
> > > +internal_catch_all_1 (Lisp_Object (*function) (void *), void *argument)
> >
> > Can you tell why you needed this (and the similar internal_catch_all)?
> > Is that only because the callbacks could signal an error, or is there
> > another reason?  If the former, I'd prefer to simplify the code and
> > its maintenance by treating the error condition in a less drastic
> > manner, and avoiding the call to xsignal.
> 
> The callbacks (especially insert and before-/after-change-hook) can exit
> nonlocally, but these nonlocal exits may not escape the Jansson callback.
> Therefore all nonlocal exits must be caught here.

Why can't you use record_unwind_protect, as we normally do in these
situations?

> > > +static _GL_ARG_NONNULL ((2)) Lisp_Object
> > > +lisp_to_json_toplevel_1 (Lisp_Object lisp, json_t **json)
> > > +{
> > > +  if (VECTORP (lisp))
> > > +    {
> > > +      ptrdiff_t size = ASIZE (lisp);
> > > +      eassert (size >= 0);
> > > +      if (size > SIZE_MAX)
> > > +        xsignal1 (Qoverflow_error, build_string ("vector is too long"));
> >
> > I think this error text is too vague.  Can we come up with something
> > that describes the problem more accurately?
> 
> Maybe, but it's probably not worth it because I don't think we have many
> architectures where PTRDIFF_MAX > SIZE_MAX.

Then why do we punish all the platforms with this runtime check?

> > And btw, how can size be greater than SIZE_MAX in this case?  This is
> > a valid Lisp object, isn't it?  (There are more such tests in the
> > patch, e.g. in lisp_to_json, and I think they, too, are redundant.)
> 
> Depends on the range of ptrdiff_t and size_t. IIUC nothing in the C
> standard guarantees PTRDIFF_MAX <= SIZE_MAX.

I wasn't talking about PTRDIFF_MAX, I was talking about 'size', which
is the number of bytes in a Lisp string.  Since that Lisp string is a
valid Lisp object, how can its size be greater than SIZE_MAX?  I don't
think there's a way of creating such a Lisp string in Emacs, because
functions that allocate memory for strings will prevent that.

> > > +      *json = json_check (json_array ());
> > > +      ptrdiff_t count = SPECPDL_INDEX ();
> > > +      record_unwind_protect_ptr (json_release_object, json);
> > > +      for (ptrdiff_t i = 0; i < size; ++i)
> > > +        {
> > > +          int status
> > > +            = json_array_append_new (*json, lisp_to_json (AREF (lisp,
> > i)));
> > > +          if (status == -1)
> > > +            json_out_of_memory ();
> > > +          eassert (status == 0);
> > > +        }
> > > +      eassert (json_array_size (*json) == size);
> > > +      clear_unwind_protect (count);
> > > +      return unbind_to (count, Qnil);
> >
> > This, too, sounds more complex than it should: you record
> > unwind-protect just so lisp_to_json's subroutines could signal an
> > error due to insufficient memory, right?  Why can't we have the
> > out-of-memory check only inside this loop, which you already do, and
> > avoid the checks on lower levels (which undoubtedly cost us extra
> > cycles)?  What do those extra checks in json_check buy us? the errors
> > they signal are no more informative than the one in the loop, AFAICT.
> 
> I don't understand what you mean. We need to check the return values of all
> functions if we want to to use them later.

Yes, but what problems can cause these return value to be invalid?
AFAICT, only out-of-memory conditions, and that can be checked only
once, there's no need to check every single allocation, because once
an allocation fails, all the rest will too.

> > > +static Lisp_Object
> > > +json_insert (void *data)
> > > +{
> > > +  const struct json_buffer_and_size *buffer_and_size = data;
> > > +  if (buffer_and_size->size > PTRDIFF_MAX)
> > > +    xsignal1 (Qoverflow_error, build_string ("buffer too large"));
> > > +  insert (buffer_and_size->buffer, buffer_and_size->size);
> >
> > I don't think we need this test here, as 'insert' already has the
> > equivalent test in one of its subroutines.
> 
> It can't, because it takes the byte length as ptrdiff_t. We need to check
> before whether the size is actually in the valid range of ptrdiff_t.

I'm sorry, but I don't see why we should support such exotic
situations only for this one feature.  In all other cases we use
either ptrdiff_t type or EMACS_INT type, and these issues disappear
then.  Trying to support the SIZE_MAX > PTRDIFF_MAX situation causes
the code to be much more complicated, harder to maintain, and more
expensive at run time than it should be.  I'm not even sure there are
such platforms out there that Emacs supports, but if there are, we
already have a gazillion problems like that all over our code.  I
object to having such code just for this reason, sorry.

> > > +    case JSON_INTEGER:
> > > +      {
> > > +        json_int_t value = json_integer_value (json);
> > > +        if (FIXNUM_OVERFLOW_P (value))
> > > +          xsignal1 (Qoverflow_error,
> > > +                    build_string ("JSON integer is too large"));
> > > +        return make_number (value);
> >
> > This overflow test is also redundant, as make_number already does it.
> 
> It can't, because json_int_t can be larger than EMACS_INT.

OK, but then I think we should consider returning a float value, or a
cons of 2 integers.  If these situations are frequent enough, users
will thank us, and if they are very infrequent, they will never see
such values, and we gain code simplicity and less non-local exits.

> > > +    case JSON_STRING:
> > > +      {
> > > +        size_t size = json_string_length (json);
> > > +        if (FIXNUM_OVERFLOW_P (size))
> > > +          xsignal1 (Qoverflow_error, build_string ("JSON string is too
> > long"));
> > > +        return json_make_string (json_string_value (json), size);
> >
> > Once again, the overflow test is redundant, as make_specified_string
> > (called by json_make_string) already includes an equivalent test.
> 
> And once again, we need to check at least whether the size fits into
> ptrdiff_t.

No, we don't, as we don't in other similar cases.

> > > +    case JSON_ARRAY:
> > > +      {
> > > +        if (++lisp_eval_depth > max_lisp_eval_depth)
> > > +          xsignal0 (Qjson_object_too_deep);
> > > +        size_t size = json_array_size (json);
> > > +        if (FIXNUM_OVERFLOW_P (size))
> > > +          xsignal1 (Qoverflow_error, build_string ("JSON array is too
> > long"));
> > > +        Lisp_Object result = Fmake_vector (make_natnum (size),
> > Qunbound);
> >
> > Likewise here: Fmake_vector makes sure the size is not larger than
> > allowed.
> 
> Same as above: It can't.

It can and it does.

> > > +    case JSON_OBJECT:
> > > +      {
> > > +        if (++lisp_eval_depth > max_lisp_eval_depth)
> > > +          xsignal0 (Qjson_object_too_deep);
> > > +        size_t size = json_object_size (json);
> > > +        if (FIXNUM_OVERFLOW_P (size))
> > > +          xsignal1 (Qoverflow_error,
> > > +                    build_string ("JSON object has too many elements"));
> > > +        Lisp_Object result = CALLN (Fmake_hash_table, QCtest, Qequal,
> > > +                                    QCsize, make_natnum (size));
> >
> > Likewise here: make_natnum does the equivalent test.
> 
> It doesn't and can't.

Yes, it does:

  INLINE Lisp_Object
  make_natnum (EMACS_INT n)
  {
    eassert (0 <= n && n <= MOST_POSITIVE_FIXNUM);  <<<<<<<<<<<<<<<
    EMACS_INT int0 = Lisp_Int0;

> > > +    /* Adjust point by how much we just read.  Do this here because
> > > +       tokener->char_offset becomes incorrect below.  */
> > > +    bool overflow = INT_ADD_WRAPV (point, error.position, &point);
> > > +    eassert (!overflow);
> > > +    eassert (point <= ZV_BYTE);
> > > +    SET_PT_BOTH (BYTE_TO_CHAR (point), point);
> >
> > It's better to use SET_PT here, I think.
> 
> That's not possible because we don't have the character offset.

Right, sorry, I was confused.

Thanks.