From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED!not-for-mail
From: Philipp Stephani <p.stephani2@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: JSON/YAML/TOML/etc. parsing performance
Date: Thu, 28 Sep 2017 21:19:00 +0000
Message-ID: <CAArVCkRvSaS-orqHcVPtZ2etUnRiY39okHh+6sYV-mtQQRYs-g@mail.gmail.com>
References: <87poaqhc63.fsf@lifelogs.com>
	<CAArVCkQTLp=Cmh-FM1R-WK=WYFX_hP=6XiUUinKRT17bciL+CQ@mail.gmail.com>
	<CAArVCkTj_1P+fTDCzEY5xG8bBB7B6ctNkQCv+bAxt=N_cuD05Q@mail.gmail.com>
	<CAArVCkS52m8SGeOQt19k+XsfZnxy+bh9LJMyX=h+e67_adP6Mg@mail.gmail.com>
	<8360ceh5f1.fsf@gnu.org>
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="001a113df252a19909055a467509"
X-Trace: blaine.gmane.org 1506633562 29935 195.159.176.226 (28 Sep 2017 21:19:22 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Thu, 28 Sep 2017 21:19:22 +0000 (UTC)
Cc: emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Sep 28 23:19:17 2017
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by blaine.gmane.org with esmtp (Exim 4.84_2)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1dxgDE-0007Bu-VE
	for ged-emacs-devel@m.gmane.org; Thu, 28 Sep 2017 23:19:17 +0200
Original-Received: from localhost ([::1]:60961 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1dxgDM-0005GB-4O
	for ged-emacs-devel@m.gmane.org; Thu, 28 Sep 2017 17:19:24 -0400
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:38060)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1dxgDE-0005Em-T6
	for emacs-devel@gnu.org; Thu, 28 Sep 2017 17:19:18 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <p.stephani2@gmail.com>) id 1dxgDD-00065W-0F
	for emacs-devel@gnu.org; Thu, 28 Sep 2017 17:19:16 -0400
Original-Received: from mail-oi0-x231.google.com ([2607:f8b0:4003:c06::231]:48138)
	by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
	(Exim 4.71) (envelope-from <p.stephani2@gmail.com>)
	id 1dxgDA-00063j-Gd; Thu, 28 Sep 2017 17:19:12 -0400
Original-Received: by mail-oi0-x231.google.com with SMTP id 125so4503377oie.5;
	Thu, 28 Sep 2017 14:19:12 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
	h=mime-version:references:in-reply-to:from:date:message-id:subject:to
	:cc; bh=Jzhe+XHHg0A5KM/q1caDX8LOjv07va84wBRfrH4jAho=;
	b=VryJMX1XeL2tul8NK5ooKj2P0z9SSE42MtLOJ5S7h+OfOtWUXfucJ2qsCH0lYv8Iw7
	WUIcVO3CEvS01GzqNFi1dWzGSyPzS4psqV8NF0waqWQgVq/3rEPVWBu8U+ZE1tjz8125
	TuwEZVhLZloCHKdiF0RZ+XoqsI2/DrIn2LpHWUrM3MFGUYMKxYuHatMH63F3jrS7p4Ti
	MDEYv2RavhZuNmkIOrM7p/DlCytF+WLtX5+oG5oR7gekn6YyNjqmSpkewnLH276P5v1f
	7bBn8Qz4IYsiQ7quMLi8pQVASAYtuMs/iZbvBtL29Q6obUdx7vOS7ul2bfYplRLnWJe4
	PPqQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:mime-version:references:in-reply-to:from:date
	:message-id:subject:to:cc;
	bh=Jzhe+XHHg0A5KM/q1caDX8LOjv07va84wBRfrH4jAho=;
	b=hXtF5WkXpih0dcHF9QRw8yuUWgM9qaMl8dCXqurXby270z+Typ14fe28KCjhzw0y3c
	qm1ZbIzC0zproaS//8X4aVoWknbaPwiprW7t7dgD23O74aR47U55lzr0QUmYyWHYcBVG
	scvCf/iRIpYrUQqZwtqshiC/L6l8oeWC0XqqA5qZZ2At6jpADn26eCxVaeMQpalPrHBL
	Qr+BfwvNcYFo7WKj+slfKHR6qHJLt3697QB7qNTPbagPbcUTbkq9zkgjjN++mIQdEXYS
	By0Wvu9KLWChTo5xcr721W6+CNKkDHlhkNhEQ8zpUCjdEVZEro342snMpz32LB59vBhI
	/w+Q==
X-Gm-Message-State: AMCzsaUF6lhVI1DXUB3V2+vu6mmeS7umGAwkND4MycFBlPlFNp9ksSlS
	+s6X5GDIajUhW98TUUqHG3s5MDfsfzI8KCj/svPuxg==
X-Google-Smtp-Source: AOwi7QBABE5P3y669bEW3Q2h3CmhEBdyinj0aFKtPf/osI9dKmnk6uVEAy3WJ608rBj1fC2h3unnV2vcdvg8/ce4jdk=
X-Received: by 10.202.86.141 with SMTP id k135mr1347124oib.254.1506633551286; 
	Thu, 28 Sep 2017 14:19:11 -0700 (PDT)
In-Reply-To: <8360ceh5f1.fsf@gnu.org>
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-Received-From: 2607:f8b0:4003:c06::231
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel/>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: "Emacs-devel" <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.devel:218869
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/218869>

--001a113df252a19909055a467509
Content-Type: text/plain; charset="UTF-8"

Eli Zaretskii <eliz@gnu.org> wrote on Tue, 19 Sep 2017 at 21:10:

> > From: Philipp Stephani <p.stephani2@gmail.com>
> > Date: Tue, 19 Sep 2017 08:18:14 +0000
> >
> > Here's a newer version of the patch. The only significant difference
> > is that now the Lisp values for JSON null and false are :null and
> > :false, respectively. Using a dedicated symbol for :null reduces the
> > mental overhead of the triple meaning of nil (null, false, empty
> > list), and is more future-proof, should we ever want to support lists.
>
> Thanks, a few comments below.
>

Thanks for the review. Most of the comments are about converting between C
and Lisp strings, so let me summarize my questions here.
IIUC Jansson only accepts UTF-8 strings (i.e. it will generate an error if
some input is not a UTF-8 string), and will only return UTF-8 strings as
well. Therefore I think that direct conversion between Lisp strings and C
strings (using SDATA etc.) is always correct because the internal Emacs
encoding is a superset of UTF-8. Also build_string should always be correct
because it will generate a correct multibyte string for a UTF-8 string
with non-ASCII characters, and a correct unibyte string for an ASCII
string, right?
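
To make the "superset" point concrete, here is a minimal standalone sketch
of a strict UTF-8 check. It is not Emacs or Jansson code; strict_utf8_p is
a hypothetical helper invented for illustration. The assumption is that a
strict consumer rejects overlong forms, surrogates, and anything above
U+10FFFF, so any string in a superset encoding that contains such
sequences would fail a check like this one.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of Emacs or Jansson: return true if
   the LEN bytes at S are strictly valid UTF-8 (no overlong forms, no
   surrogates, nothing above U+10FFFF).  */
static bool
strict_utf8_p (const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char c = s[i];
      size_t n;           /* number of continuation bytes expected */
      unsigned long cp;   /* code point being decoded */
      unsigned long min;  /* smallest code point allowed at this length */

      if (c < 0x80)
        {
          i++;            /* plain ASCII byte */
          continue;
        }
      else if ((c & 0xE0) == 0xC0)
        { n = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0)
        { n = 2; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0)
        { n = 3; cp = c & 0x07; min = 0x10000; }
      else
        return false;     /* stray continuation byte or 5+ byte lead */

      if (len - i <= n)
        return false;     /* sequence is truncated */

      for (size_t k = 1; k <= n; k++)
        {
          unsigned char cc = s[i + k];
          if ((cc & 0xC0) != 0x80)
            return false; /* not a continuation byte */
          cp = (cp << 6) | (cc & 0x3F);
        }

      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;     /* overlong, out of range, or surrogate */

      i += n + 1;
    }
  return true;
}
```

With this sketch, the three bytes of "h" followed by U+00E9 pass, while an
overlong two-byte sequence such as C0 80 or a sequence starting with the
lead byte F8 is rejected.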


>
> > +static _Noreturn void
> > +json_parse_error (const json_error_t *error)
> > +{
> > +  xsignal (Qjson_parse_error,
> > +           list5 (build_string (error->text), build_string (error->source),
> > +                  make_natnum (error->line), make_natnum (error->column),
> > +                  make_natnum (error->position)));
> > +}
>
> I think error->source could include non-ASCII characters, in which
> case you need to use make_specified_string with its last argument
> non-zero, not build_string, which has its own ideas about when to
> produce a multibyte string.
>
> > +static  _GL_ARG_NONNULL ((2)) Lisp_Object
> > +lisp_to_json_1 (Lisp_Object lisp, json_t **json)
> > +{
> > +  if (VECTORP (lisp))
> > +    {
> > +      ptrdiff_t size = ASIZE (lisp);
> > +      eassert (size >= 0);
> > +      if (size > SIZE_MAX)
> > +        xsignal1 (Qoverflow_error, build_pure_c_string ("vector is too long"));
>
> I don't think you can allocate pure storage at run time, only at dump
> time.  (There are more of this elsewhere in the patch.)
>

OK, will be fixed in the next version.


>
> > +  /* LISP now must be a vector or hashtable.  */
> > +  if (++lisp_eval_depth > max_lisp_eval_depth)
> > +    xsignal0 (Qjson_object_too_deep);
>
> This error could mislead: the problem could be in the nesting of
> surrounding Lisp being too deep, and the JSON part could be just fine.
>

Agreed, but I think it's better to use lisp_eval_depth here because it's
the total nesting depth that could cause stack overflows.
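
As a rough standalone illustration of that argument (not the patch code;
struct node, depth, max_depth and walk are all hypothetical names), a
single shared counter means that nesting already consumed by the caller
counts against the structure being traversed:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical singly nested structure standing in for nested JSON.  */
struct node
{
  const struct node *child;
};

/* Shared counter, playing the role of a global evaluation depth.  */
static int depth;
static const int max_depth = 4;

/* Walk N and its descendants.  Return false if the total nesting,
   including whatever the caller had already used, exceeds max_depth.
   The counter is restored before returning in either case.  */
static bool
walk (const struct node *n)
{
  if (n == NULL)
    return true;
  if (++depth > max_depth)
    {
      depth--;
      return false;
    }
  bool ok = walk (n->child);
  depth--;
  return ok;
}
```

A chain of three nodes is accepted when the counter starts at 0 and
rejected when it starts at 2, even though the structure itself is the
same. That is exactly the potentially misleading case, but it is also the
case where the stack is closest to overflowing.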


>
> > +  Lisp_Object string
> > +    = make_string (buffer_and_size->buffer, buffer_and_size->size);
>
> This is arbitrary text, so I'm not sure make_string is appropriate.
> Could the text be a byte stream, i.e. not human-readable text?  If so,
> do we want to create a unibyte string or a multibyte string here?
>

It should always be UTF-8.


>
> > +  insert_from_string (string, 0, 0, SCHARS (string), SBYTES (string), false);
>
> Hmmm... if you want to insert the text into the buffer, you need to
> make sure it has the right representation.  What kind of text is this?
> It probably should be decoded.
>
> In any case, going through a string sounds gross.  You should insert
> the text directly into the gap, like we do in a couple of places
> already.  See insert_from_gap and its users, and maybe also
> decode_coding_gap.
>

OK, I'll have to check that, but it sounds doable.


>
> > +DEFUN ("json-parse-string", Fjson_parse_string, Sjson_parse_string, 1, 1, NULL,
> > +       doc: /* Parse the JSON STRING into a Lisp object.
> > +This is essentially the reverse operation of `json-serialize', which
> > +see.  The returned object will be a vector or hashtable.  Its elements
> > +will be `:null', `:false', t, numbers, strings, or further vectors and
> > +hashtables.  If there are duplicate keys in an object, all but the
> > +last one are ignored.  If STRING doesn't contain a valid JSON object,
> > +an error of type `json-parse-error' is signaled.  */)
> > +  (Lisp_Object string)
> > +{
> > +  ptrdiff_t count = SPECPDL_INDEX ();
> > +  check_string_without_embedded_nulls (string);
> > +
> > +  json_error_t error;
> > +  json_t *object = json_loads (SSDATA (string), 0, &error);
>
> Doesn't json_loads require the string to be encoded in some particular
> encoding?  If so, passing it our internal representation might not be
> TRT.
>
> > +  /* First, parse from point to the gap or the end of the accessible
> > +     portion, whatever is closer.  */
> > +  ptrdiff_t point = d->point;
> > +  ptrdiff_t end;
> > +  {
> > +    bool overflow = INT_ADD_WRAPV (BUFFER_CEILING_OF (point), 1, &end);
> > +    eassert (!overflow);
> > +  }
> > +  size_t count;
> > +  {
> > +    bool overflow = INT_SUBTRACT_WRAPV (end, point, &count);
> > +    eassert (!overflow);
> > +  }
>
> Why did you need these blocks in braces?
>

To be able to reuse the "overflow" name.
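
A small standalone sketch of the same pattern, assuming GCC or Clang: the
__builtin_add_overflow and __builtin_sub_overflow builtins stand in for
the INT_ADD_WRAPV and INT_SUBTRACT_WRAPV macros, and span_from_point is a
hypothetical function that returns false instead of asserting:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the quoted code: compute *END = CEILING + 1
   and *COUNT = *END - POINT.  Return false if either result does not
   fit its destination type.  */
static bool
span_from_point (ptrdiff_t point, ptrdiff_t ceiling,
                 ptrdiff_t *end, size_t *count)
{
  {
    /* This "overflow" is visible only inside this block.  */
    bool overflow = __builtin_add_overflow (ceiling, 1, end);
    if (overflow)
      return false;
  }
  {
    /* A second, unrelated "overflow"; no redeclaration error because
       the first one went out of scope with its block.  */
    bool overflow = __builtin_sub_overflow (*end, point, count);
    if (overflow)
      return false;
  }
  return true;
}
```

Each block gets its own short-lived variable, so neither check can
accidentally read the result of the other.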


>
> > +(provide 'json-tests)
> > +;;; json-tests.el ends here
>
> IMO, it would be good to test also non-ASCII text in JSON objects.
>
>
Yes, once the patch is in acceptable shape, I plan to add many more tests.

--001a113df252a19909055a467509--