From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Philipp Stephani Newsgroups: gmane.emacs.devel Subject: Re: JSON/YAML/TOML/etc. parsing performance Date: Thu, 28 Sep 2017 21:19:00 +0000 Message-ID: References: <87poaqhc63.fsf@lifelogs.com> <8360ceh5f1.fsf@gnu.org> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="001a113df252a19909055a467509" X-Trace: blaine.gmane.org 1506633562 29935 195.159.176.226 (28 Sep 2017 21:19:22 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Thu, 28 Sep 2017 21:19:22 +0000 (UTC) Cc: emacs-devel@gnu.org To: Eli Zaretskii Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Sep 28 23:19:17 2017 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1dxgDE-0007Bu-VE for ged-emacs-devel@m.gmane.org; Thu, 28 Sep 2017 23:19:17 +0200 Original-Received: from localhost ([::1]:60961 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1dxgDM-0005GB-4O for ged-emacs-devel@m.gmane.org; Thu, 28 Sep 2017 17:19:24 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:38060) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1dxgDE-0005Em-T6 for emacs-devel@gnu.org; Thu, 28 Sep 2017 17:19:18 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1dxgDD-00065W-0F for emacs-devel@gnu.org; Thu, 28 Sep 2017 17:19:16 -0400 Original-Received: from mail-oi0-x231.google.com ([2607:f8b0:4003:c06::231]:48138) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1dxgDA-00063j-Gd; Thu, 28 Sep 2017 17:19:12 -0400 Original-Received: by mail-oi0-x231.google.com with SMTP id 125so4503377oie.5; Thu, 28 Sep 2017 14:19:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Jzhe+XHHg0A5KM/q1caDX8LOjv07va84wBRfrH4jAho=; b=VryJMX1XeL2tul8NK5ooKj2P0z9SSE42MtLOJ5S7h+OfOtWUXfucJ2qsCH0lYv8Iw7 WUIcVO3CEvS01GzqNFi1dWzGSyPzS4psqV8NF0waqWQgVq/3rEPVWBu8U+ZE1tjz8125 TuwEZVhLZloCHKdiF0RZ+XoqsI2/DrIn2LpHWUrM3MFGUYMKxYuHatMH63F3jrS7p4Ti MDEYv2RavhZuNmkIOrM7p/DlCytF+WLtX5+oG5oR7gekn6YyNjqmSpkewnLH276P5v1f 7bBn8Qz4IYsiQ7quMLi8pQVASAYtuMs/iZbvBtL29Q6obUdx7vOS7ul2bfYplRLnWJe4 PPqQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Jzhe+XHHg0A5KM/q1caDX8LOjv07va84wBRfrH4jAho=; b=hXtF5WkXpih0dcHF9QRw8yuUWgM9qaMl8dCXqurXby270z+Typ14fe28KCjhzw0y3c qm1ZbIzC0zproaS//8X4aVoWknbaPwiprW7t7dgD23O74aR47U55lzr0QUmYyWHYcBVG scvCf/iRIpYrUQqZwtqshiC/L6l8oeWC0XqqA5qZZ2At6jpADn26eCxVaeMQpalPrHBL Qr+BfwvNcYFo7WKj+slfKHR6qHJLt3697QB7qNTPbagPbcUTbkq9zkgjjN++mIQdEXYS By0Wvu9KLWChTo5xcr721W6+CNKkDHlhkNhEQ8zpUCjdEVZEro342snMpz32LB59vBhI /w+Q== X-Gm-Message-State: AMCzsaUF6lhVI1DXUB3V2+vu6mmeS7umGAwkND4MycFBlPlFNp9ksSlS +s6X5GDIajUhW98TUUqHG3s5MDfsfzI8KCj/svPuxg== X-Google-Smtp-Source: AOwi7QBABE5P3y669bEW3Q2h3CmhEBdyinj0aFKtPf/osI9dKmnk6uVEAy3WJ608rBj1fC2h3unnV2vcdvg8/ce4jdk= X-Received: by 10.202.86.141 with SMTP id k135mr1347124oib.254.1506633551286; Thu, 28 Sep 2017 14:19:11 -0700 (PDT) In-Reply-To: <8360ceh5f1.fsf@gnu.org> X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:4003:c06::231 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:218869 Archived-At: --001a113df252a19909055a467509 Content-Type: text/plain; charset="UTF-8" Eli Zaretskii schrieb am Di., 19. Sep. 2017 um 21:10 Uhr: > > From: Philipp Stephani > > Date: Tue, 19 Sep 2017 08:18:14 +0000 > > > > Here's a newer version of the patch. The only significant difference is > that now the Lisp values for JSON null > > and false are :null and :false, respectively. Using a dedicated symbol > for :null reduces the mental overhead of > > the triple meaning of nil (null, false, empty list), and is more > future-proof, should we ever want to support lists. > > Thanks, a few comments below. > Thanks for the review. Most of the comments are about converting between C and Lisp strings, so let me summarize my questions here. IIUC Jansson only accepts UTF-8 strings (i.e. it will generate an error some input is not an UTF-8 string), and will only return UTF-8 strings as well. Therefore I think that direct conversion between Lisp strings and C strings (using SDATA etc.) is always correct because the internal Emacs encoding is a superset of UTF-8. Also build_string should always be correct because it will generate a correct multibyte string for an UTF-8 string with non-ASCII characters, and a correct unibyte string for an ASCII string, right? > > > +static _Noreturn void > > +json_parse_error (const json_error_t *error) > > +{ > > + xsignal (Qjson_parse_error, > > + list5 (build_string (error->text), build_string > (error->source), > > + make_natnum (error->line), make_natnum > (error->column), > > + make_natnum (error->position))); > > +} > > I think error->source could include non-ASCII characters, in which > case you need to use make_specified_string with its last argument > non-zero, not build_string, which has its own ideas about when to > produce a multibyte string. > > > +static _GL_ARG_NONNULL ((2)) Lisp_Object > > +lisp_to_json_1 (Lisp_Object lisp, json_t **json) > > +{ > > + if (VECTORP (lisp)) > > + { > > + ptrdiff_t size = ASIZE (lisp); > > + eassert (size >= 0); > > + if (size > SIZE_MAX) > > + xsignal1 (Qoverflow_error, build_pure_c_string ("vector is too > long")); > > I don't think you can allocate pure storage at run time, only at dump > time. (There are more of this elsewhere in the patch.) > OK, will be fixed in the next version. > > > + /* LISP now must be a vector or hashtable. */ > > + if (++lisp_eval_depth > max_lisp_eval_depth) > > + xsignal0 (Qjson_object_too_deep); > > This error could mislead: the problem could be in the nesting of > surrounding Lisp being too deep, and the JSON part could be just fine. > Agreed, but I think it's better to use lisp_eval_depth here because it's the total nesting depth that could cause stack overflows. > > > + Lisp_Object string > > + = make_string (buffer_and_size->buffer, buffer_and_size->size); > > This is arbitrary text, so I'm not sure make_string is appropriate. > Could the text be a byte stream, i.e. not human-readable text? If so, > do we want to create a unibyte string or a multibyte string here? > It should always be UTF-8. > > > + insert_from_string (string, 0, 0, SCHARS (string), SBYTES (string), > false); > > Hmmm... if you want to insert the text into the buffer, you need to > make sure it has the right representation. What kind of text is this? > It probably should be decoded. > > In any case, going through a string sounds gross. You should insert > the text directly into the gap, like we do in a couple of places > already. See insert_from_gap and its users, and maybe also > decode_coding_gap. > OK, I'll have to check that, but it sounds doable. > > > +DEFUN ("json-parse-string", Fjson_parse_string, Sjson_parse_string, 1, > 1, NULL, > > + doc: /* Parse the JSON STRING into a Lisp object. > > +This is essentially the reverse operation of `json-serialize', which > > +see. The returned object will be a vector or hashtable. Its elements > > +will be `:null', `:false', t, numbers, strings, or further vectors and > > +hashtables. If there are duplicate keys in an object, all but the > > +last one are ignored. If STRING doesn't contain a valid JSON object, > > +an error of type `json-parse-error' is signaled. */) > > + (Lisp_Object string) > > +{ > > + ptrdiff_t count = SPECPDL_INDEX (); > > + check_string_without_embedded_nulls (string); > > + > > + json_error_t error; > > + json_t *object = json_loads (SSDATA (string), 0, &error); > > Doesn't json_loads require the string to be encoded in some particular > encoding? If so, passing it our internal representation might not be > TRT. > > > + /* First, parse from point to the gap or the end of the accessible > > + portion, whatever is closer. */ > > + ptrdiff_t point = d->point; > > + ptrdiff_t end; > > + { > > + bool overflow = INT_ADD_WRAPV (BUFFER_CEILING_OF (point), 1, &end); > > + eassert (!overflow); > > + } > > + size_t count; > > + { > > + bool overflow = INT_SUBTRACT_WRAPV (end, point, &count); > > + eassert (!overflow); > > + } > > Why did you need these blocks in braces? > To be able to reuse the "overflow" name/ > > > +(provide 'json-tests) > > +;;; json-tests.el ends here > > IMO, it would be good to test also non-ASCII text in JSON objects. > > Yes, once the patch is in acceptable shape, I plan to add many more tests. --001a113df252a19909055a467509 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


Eli Za= retskii <eliz@gnu.org> schrieb am= Di., 19. Sep. 2017 um 21:10=C2=A0Uhr:
> From: Philipp Stephani <p.stephani2@gmail.com>
> Date: Tue, 19 Sep 2017 08:18:14 +0000
>
> Here's a newer version of the patch. The only significant differen= ce is that now the Lisp values for JSON null
> and false are :null and :false, respectively. Using a dedicated symbol= for :null reduces the mental overhead of
> the triple meaning of nil (null, false, empty list), and is more futur= e-proof, should we ever want to support lists.

Thanks, a few comments below.

Thanks fo= r the review. Most of the comments are about converting between C and Lisp = strings, so let me summarize my questions here.
IIUC Jansson only= accepts UTF-8 strings (i.e. it will generate an error some input is not an= UTF-8 string), and will only return UTF-8 strings as well. Therefore I thi= nk that direct conversion between Lisp strings and C strings (using SDATA e= tc.) is always correct because the internal Emacs encoding is a superset of= UTF-8. Also build_string should always be correct because it will generate= a correct multibyte string for an UTF-8 string with non-ASCII characters, = and a correct unibyte string for an ASCII string, right?
=C2=A0

> +static _Noreturn void
> +json_parse_error (const json_error_t *error)
> +{
> +=C2=A0 xsignal (Qjson_parse_error,
> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0list5 (build_string (error-&= gt;text), build_string (error->source),
> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 make_n= atnum (error->line), make_natnum (error->column),
> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 make_n= atnum (error->position)));
> +}

I think error->source could include non-ASCII characters, in which
case you need to use make_specified_string with its last argument
non-zero, not build_string, which has its own ideas about when to
produce a multibyte string.

> +static=C2=A0 _GL_ARG_NONNULL ((2)) Lisp_Object
> +lisp_to_json_1 (Lisp_Object lisp, json_t **json)
> +{
> +=C2=A0 if (VECTORP (lisp))
> +=C2=A0 =C2=A0 {
> +=C2=A0 =C2=A0 =C2=A0 ptrdiff_t size =3D ASIZE (lisp);
> +=C2=A0 =C2=A0 =C2=A0 eassert (size >=3D 0);
> +=C2=A0 =C2=A0 =C2=A0 if (size > SIZE_MAX)
> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 xsignal1 (Qoverflow_error, build_pure_c_s= tring ("vector is too long"));

I don't think you can allocate pure storage at run time, only at dump time.=C2=A0 (There are more of this elsewhere in the patch.)

OK, will be fixed in the next version.
=C2= =A0

> +=C2=A0 /* LISP now must be a vector or hashtable.=C2=A0 */
> +=C2=A0 if (++lisp_eval_depth > max_lisp_eval_depth)
> +=C2=A0 =C2=A0 xsignal0 (Qjson_object_too_deep);

This error could mislead: the problem could be in the nesting of
surrounding Lisp being too deep, and the JSON part could be just fine.
<= /blockquote>

Agreed, but I think it's better to use = lisp_eval_depth here because it's the total nesting depth that could ca= use stack overflows.
=C2=A0

> +=C2=A0 Lisp_Object string
> +=C2=A0 =C2=A0 =3D make_string (buffer_and_size->buffer, buffer_and= _size->size);

This is arbitrary text, so I'm not sure make_string is appropriate.
Could the text be a byte stream, i.e. not human-readable text?=C2=A0 If so,=
do we want to create a unibyte string or a multibyte string here?

It should always be UTF-8.
=C2=A0

> +=C2=A0 insert_from_string (string, 0, 0, SCHARS (string), SBYTES (str= ing), false);

Hmmm... if you want to insert the text into the buffer, you need to
make sure it has the right representation.=C2=A0 What kind of text is this?=
It probably should be decoded.

In any case, going through a string sounds gross.=C2=A0 You should insert the text directly into the gap, like we do in a couple of places
already.=C2=A0 See insert_from_gap and its users, and maybe also
decode_coding_gap.

OK, I'll have to= check that, but it sounds doable.
=C2=A0

> +DEFUN ("json-parse-string", Fjson_parse_string, Sjson_parse= _string, 1, 1, NULL,
> +=C2=A0 =C2=A0 =C2=A0 =C2=A0doc: /* Parse the JSON STRING into a Lisp = object.
> +This is essentially the reverse operation of `json-serialize', wh= ich
> +see.=C2=A0 The returned object will be a vector or hashtable.=C2=A0 I= ts elements
> +will be `:null', `:false', t, numbers, strings, or further ve= ctors and
> +hashtables.=C2=A0 If there are duplicate keys in an object, all but t= he
> +last one are ignored.=C2=A0 If STRING doesn't contain a valid JSO= N object,
> +an error of type `json-parse-error' is signaled.=C2=A0 */)
> +=C2=A0 (Lisp_Object string)
> +{
> +=C2=A0 ptrdiff_t count =3D SPECPDL_INDEX ();
> +=C2=A0 check_string_without_embedded_nulls (string);
> +
> +=C2=A0 json_error_t error;
> +=C2=A0 json_t *object =3D json_loads (SSDATA (string), 0, &error)= ;

Doesn't json_loads require the string to be encoded in some particular<= br> encoding?=C2=A0 If so, passing it our internal representation might not be<= br> TRT.

> +=C2=A0 /* First, parse from point to the gap or the end of the access= ible
> +=C2=A0 =C2=A0 =C2=A0portion, whatever is closer.=C2=A0 */
> +=C2=A0 ptrdiff_t point =3D d->point;
> +=C2=A0 ptrdiff_t end;
> +=C2=A0 {
> +=C2=A0 =C2=A0 bool overflow =3D INT_ADD_WRAPV (BUFFER_CEILING_OF (poi= nt), 1, &end);
> +=C2=A0 =C2=A0 eassert (!overflow);
> +=C2=A0 }
> +=C2=A0 size_t count;
> +=C2=A0 {
> +=C2=A0 =C2=A0 bool overflow =3D INT_SUBTRACT_WRAPV (end, point, &= count);
> +=C2=A0 =C2=A0 eassert (!overflow);
> +=C2=A0 }

Why did you need these blocks in braces?

To be able to reuse the "overflow" name/
=C2=A0
=

> +(provide 'json-tests)
> +;;; json-tests.el ends here

IMO, it would be good to test also non-ASCII text in JSON objects.


Yes, once the patch is in acceptable s= hape, I plan to add many more tests.=C2=A0
--001a113df252a19909055a467509--