From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Eli Zaretskii Newsgroups: gmane.emacs.devel Subject: Re: JSON/YAML/TOML/etc. parsing performance Date: Tue, 03 Oct 2017 18:31:37 +0300 Message-ID: <83fub01c4m.fsf@gnu.org> References: <87poaqhc63.fsf@lifelogs.com> <8360ceh5f1.fsf@gnu.org> <83h8vl5lf9.fsf@gnu.org> <83r2um3fqi.fsf@gnu.org> Reply-To: Eli Zaretskii NNTP-Posting-Host: blaine.gmane.org X-Trace: blaine.gmane.org 1507044762 14034 195.159.176.226 (3 Oct 2017 15:32:42 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Tue, 3 Oct 2017 15:32:42 +0000 (UTC) Cc: emacs-devel@gnu.org To: Philipp Stephani Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Oct 03 17:32:37 2017 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1dzPBR-0002l7-8R for ged-emacs-devel@m.gmane.org; Tue, 03 Oct 2017 17:32:33 +0200 Original-Received: from localhost ([::1]:59155 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1dzPBY-00053g-LL for ged-emacs-devel@m.gmane.org; Tue, 03 Oct 2017 11:32:40 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:48909) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1dzPAx-00053P-7a for emacs-devel@gnu.org; Tue, 03 Oct 2017 11:32:04 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1dzPAn-0006Jc-6G for emacs-devel@gnu.org; Tue, 03 Oct 2017 11:32:03 -0400 Original-Received: from fencepost.gnu.org ([2001:4830:134:3::e]:34962) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1dzPAn-0006JJ-2H; Tue, 03 Oct 2017 11:31:53 -0400 Original-Received: from 84.94.185.246.cable.012.net.il ([84.94.185.246]:2805 helo=home-c4e4a596f7) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1dzPAm-0003p5-Fg; Tue, 03 Oct 2017 11:31:52 -0400 In-reply-to: (message from Philipp Stephani on Tue, 03 Oct 2017 12:26:32 +0000) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 2001:4830:134:3::e X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:219039 Archived-At: > From: Philipp Stephani > Date: Tue, 03 Oct 2017 12:26:32 +0000 > Cc: emacs-devel@gnu.org > > > > +#if __has_attribute (warn_unused_result) > > > +# define ATTRIBUTE_WARN_UNUSED_RESULT __attribute__ > > ((__warn_unused_result__)) > > > +#else > > > +# define ATTRIBUTE_WARN_UNUSED_RESULT > > > +#endif > > > > Hmm... why do we need this attribute? You use it with 2 static > > functions, so this sounds like a left-over from the development stage? > > > > It's not strictly needed (and if you don't like it, I can remove it), but > it helps catching memory leaks. No strong feeling here, but I'd be interested in hearing Paul's opinion on this. > > > +static Lisp_Object > > > +internal_catch_all_1 (Lisp_Object (*function) (void *), void *argument) > > > > Can you tell why you needed this (and the similar internal_catch_all)? > > Is that only because the callbacks could signal an error, or is there > > another reason? If the former, I'd prefer to simplify the code and > > its maintenance by treating the error condition in a less drastic > > manner, and avoiding the call to xsignal. > > The callbacks (especially insert and before-/after-change-hook) can exit > nonlocally, but these nonlocal exits may not escape the Jansson callback. > Therefore all nonlocal exits must be caught here. Why can't you use record_unwind_protect, as we normally do in these situations? > > > +static _GL_ARG_NONNULL ((2)) Lisp_Object > > > +lisp_to_json_toplevel_1 (Lisp_Object lisp, json_t **json) > > > +{ > > > + if (VECTORP (lisp)) > > > + { > > > + ptrdiff_t size = ASIZE (lisp); > > > + eassert (size >= 0); > > > + if (size > SIZE_MAX) > > > + xsignal1 (Qoverflow_error, build_string ("vector is too long")); > > > > I think this error text is too vague. Can we come up with something > > that describes the problem more accurately? > > Maybe, but it's probably not worth it because I don't think we have many > architectures where PTRDIFF_MAX > SIZE_MAX. Then why do we punish all the platforms with this runtime check? > > And btw, how can size be greater than SIZE_MAX in this case? This is > > a valid Lisp object, isn't it? (There are more such tests in the > > patch, e.g. in lisp_to_json, and I think they, too, are redundant.) > > Depends on the range of ptrdiff_t and size_t. IIUC nothing in the C > standard guarantees PTRDIFF_MAX <= SIZE_MAX. I wasn't talking about PTRDIFF_MAX, I was talking about 'size', which is the number of bytes in a Lisp string. Since that Lisp string is a valid Lisp object, how can its size be greater than SIZE_MAX? I don't think there's a way of creating such a Lisp string in Emacs, because functions that allocate memory for strings will prevent that. > > > + *json = json_check (json_array ()); > > > + ptrdiff_t count = SPECPDL_INDEX (); > > > + record_unwind_protect_ptr (json_release_object, json); > > > + for (ptrdiff_t i = 0; i < size; ++i) > > > + { > > > + int status > > > + = json_array_append_new (*json, lisp_to_json (AREF (lisp, > > i))); > > > + if (status == -1) > > > + json_out_of_memory (); > > > + eassert (status == 0); > > > + } > > > + eassert (json_array_size (*json) == size); > > > + clear_unwind_protect (count); > > > + return unbind_to (count, Qnil); > > > > This, too, sounds more complex than it should: you record > > unwind-protect just so lisp_to_json's subroutines could signal an > > error due to insufficient memory, right? Why can't we have the > > out-of-memory check only inside this loop, which you already do, and > > avoid the checks on lower levels (which undoubtedly cost us extra > > cycles)? What do those extra checks in json_check buy us? the errors > > they signal are no more informative than the one in the loop, AFAICT. > > I don't understand what you mean. We need to check the return values of all > functions if we want to to use them later. Yes, but what problems can cause these return value to be invalid? AFAICT, only out-of-memory conditions, and that can be checked only once, there's no need to check every single allocation, because once an allocation fails, all the rest will too. > > > +static Lisp_Object > > > +json_insert (void *data) > > > +{ > > > + const struct json_buffer_and_size *buffer_and_size = data; > > > + if (buffer_and_size->size > PTRDIFF_MAX) > > > + xsignal1 (Qoverflow_error, build_string ("buffer too large")); > > > + insert (buffer_and_size->buffer, buffer_and_size->size); > > > > I don't think we need this test here, as 'insert' already has the > > equivalent test in one of its subroutines. > > It can't, because it takes the byte length as ptrdiff_t. We need to check > before whether the size is actually in the valid range of ptrdiff_t. I'm sorry, but I don't see why we should support such exotic situations only for this one feature. In all other cases we use either ptrdiff_t type or EMACS_INT type, and these issues disappear then. Trying to support the SIZE_MAX > PTRDIFF_MAX situation causes the code to be much more complicated, harder to maintain, and more expensive at run time than it should be. I'm not even sure there are such platforms out there that Emacs supports, but if there are, we already have a gazillion problems like that all over our code. I object to having such code just for this reason, sorry. > > > + case JSON_INTEGER: > > > + { > > > + json_int_t value = json_integer_value (json); > > > + if (FIXNUM_OVERFLOW_P (value)) > > > + xsignal1 (Qoverflow_error, > > > + build_string ("JSON integer is too large")); > > > + return make_number (value); > > > > This overflow test is also redundant, as make_number already does it. > > It can't, because json_int_t can be larger than EMACS_INT. OK, but then I think we should consider returning a float value, or a cons of 2 integers. If these situations are frequent enough, users will thank us, and if they are very infrequent, they will never see such values, and we gain code simplicity and less non-local exits. > > > + case JSON_STRING: > > > + { > > > + size_t size = json_string_length (json); > > > + if (FIXNUM_OVERFLOW_P (size)) > > > + xsignal1 (Qoverflow_error, build_string ("JSON string is too > > long")); > > > + return json_make_string (json_string_value (json), size); > > > > Once again, the overflow test is redundant, as make_specified_string > > (called by json_make_string) already includes an equivalent test. > > And once again, we need to check at least whether the size fits into > ptrdiff_t. No, we don't, as we don't in other similar cases. > > > + case JSON_ARRAY: > > > + { > > > + if (++lisp_eval_depth > max_lisp_eval_depth) > > > + xsignal0 (Qjson_object_too_deep); > > > + size_t size = json_array_size (json); > > > + if (FIXNUM_OVERFLOW_P (size)) > > > + xsignal1 (Qoverflow_error, build_string ("JSON array is too > > long")); > > > + Lisp_Object result = Fmake_vector (make_natnum (size), > > Qunbound); > > > > Likewise here: Fmake_vector makes sure the size is not larger than > > allowed. > > Same as above: It can't. It can and it does. > > > + case JSON_OBJECT: > > > + { > > > + if (++lisp_eval_depth > max_lisp_eval_depth) > > > + xsignal0 (Qjson_object_too_deep); > > > + size_t size = json_object_size (json); > > > + if (FIXNUM_OVERFLOW_P (size)) > > > + xsignal1 (Qoverflow_error, > > > + build_string ("JSON object has too many elements")); > > > + Lisp_Object result = CALLN (Fmake_hash_table, QCtest, Qequal, > > > + QCsize, make_natnum (size)); > > > > Likewise here: make_natnum does the equivalent test. > > It doesn't and can't. Yes, it does: INLINE Lisp_Object make_natnum (EMACS_INT n) { eassert (0 <= n && n <= MOST_POSITIVE_FIXNUM); <<<<<<<<<<<<<<< EMACS_INT int0 = Lisp_Int0; > > > + /* Adjust point by how much we just read. Do this here because > > > + tokener->char_offset becomes incorrect below. */ > > > + bool overflow = INT_ADD_WRAPV (point, error.position, &point); > > > + eassert (!overflow); > > > + eassert (point <= ZV_BYTE); > > > + SET_PT_BOTH (BYTE_TO_CHAR (point), point); > > > > It's better to use SET_PT here, I think. > > That's not possible because we don't have the character offset. Right, sorry, I was confused. Thanks.