* I created a faster JSON parser
@ 2024-03-08 10:27 Herman, Géza
2024-03-08 11:41 ` Philip Kaludercic
` (3 more replies)
0 siblings, 4 replies; 51+ messages in thread
From: Herman, Géza @ 2024-03-08 10:27 UTC (permalink / raw)
To: emacs-devel@gnu.org
Hi,
I created a faster JSON parser for emacs, you can check it out
here:
https://github.com/geza-herman/emacs/tree/faster-json-parsing
It replaces json-parse-string and json-parse-buffer functions.
The behavior should be the same as before, with the only exception
that objects with duplicated keys are not detected if :object-type
is not 'hash-table.
This parser runs 8-9x faster than the jansson based parser on my
machine (tested on clangd language server messages). An
additional tiny benefit is that large integers are parsed, instead
of having an "out of range" error.
What do you think?
Geza
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 10:27 I created a faster JSON parser Herman, Géza
@ 2024-03-08 11:41 ` Philip Kaludercic
2024-03-08 12:34 ` Herman, Géza
2024-03-08 12:03 ` Eli Zaretskii
` (2 subsequent siblings)
3 siblings, 1 reply; 51+ messages in thread
From: Philip Kaludercic @ 2024-03-08 11:41 UTC (permalink / raw)
To: Herman, Géza; +Cc: emacs-devel@gnu.org
"Herman, Géza" <geza.herman@gmail.com> writes:
> Hi,
>
> I created a faster JSON parser for emacs, you can check it out here:
>
> https://github.com/geza-herman/emacs/tree/faster-json-parsing
For convenience, this is a URL to the patch:
https://github.com/geza-herman/emacs/commit/d76feeeac3ce397f15aebdde8b9deae676b0bf1e.patch
>
> It replaces json-parse-string and json-parse-buffer functions. The
> behavior should be the same as before, with the only exception that
> objects with duplicated keys are not detected if :object-type is not
> 'hash-table.
Is that a problem?
> This parser runs 8-9x faster than the jansson based parser on my
> machine (tested on clangd language server messages). An additional
> tiny benefit is that large integers are parsed, instead of having an
> "out of range" error.
That sounds interesting, but I am reminded of this article:
https://seriot.ch/projects/parsing_json.html. There seem to be plenty
of difficult edge-cases when dealing with JSON input, that should
probably be tested if Emacs has it's own custom parser built-in.
> What do you think?
>
> Geza
--
Philip Kaludercic on peregrine
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 10:27 I created a faster JSON parser Herman, Géza
2024-03-08 11:41 ` Philip Kaludercic
@ 2024-03-08 12:03 ` Eli Zaretskii
2024-03-08 12:38 ` Herman, Géza
2024-03-08 13:28 ` Po Lu
2024-03-09 20:37 ` Christopher Wellons
3 siblings, 1 reply; 51+ messages in thread
From: Eli Zaretskii @ 2024-03-08 12:03 UTC (permalink / raw)
To: Géza Herman; +Cc: emacs-devel
> From: Herman, Géza <geza.herman@gmail.com>
> Date: Fri, 08 Mar 2024 11:27:16 +0100
>
> This parser runs 8-9x faster than the jansson based parser on my
> machine (tested on clangd language server messages).
How does it do that? Can you summarize the main ideas of your
implementation, which make it so much faster?
Thanks.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 11:41 ` Philip Kaludercic
@ 2024-03-08 12:34 ` Herman, Géza
0 siblings, 0 replies; 51+ messages in thread
From: Herman, Géza @ 2024-03-08 12:34 UTC (permalink / raw)
To: Philip Kaludercic; +Cc: Herman, Géza, emacs-devel@gnu.org
Philip Kaludercic <philipk@posteo.net> writes:
> "Herman, Géza" <geza.herman@gmail.com> writes:
>
>> It replaces json-parse-string and json-parse-buffer
>> functions. The
>> behavior should be the same as before, with the only exception
>> that
>> objects with duplicated keys are not detected if :object-type
>> is not
>> 'hash-table.
>
> Is that a problem?
Not sure. I just mentioned it because it's a behavior change.
But I intentionally designed it this way, because it is faster.
To me, it makes some sense that if the user specifies 'alist or
'plist, then they want to have all the object members, even if the
keys are duplicated. I didn't find a clear direction from JSON
descriptions of how duplicated keys should be handled.
>> This parser runs 8-9x faster than the jansson based parser on
>> my
>> machine (tested on clangd language server messages). An
>> additional
>> tiny benefit is that large integers are parsed, instead of
>> having an
>> "out of range" error.
>
> That sounds interesting, but I am reminded of this article:
> https://seriot.ch/projects/parsing_json.html. There seem to be
> plenty
> of difficult edge-cases when dealing with JSON input, that
> should
> probably be tested if Emacs has it's own custom parser built-in.
I've now run my parser on the tests in this repo, it passes all of
them.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 12:03 ` Eli Zaretskii
@ 2024-03-08 12:38 ` Herman, Géza
2024-03-08 12:59 ` Eli Zaretskii
0 siblings, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-08 12:38 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Herman, Géza <geza.herman@gmail.com>
>> Date: Fri, 08 Mar 2024 11:27:16 +0100
>>
>> This parser runs 8-9x faster than the jansson based parser on
>> my
>> machine (tested on clangd language server messages).
>
> How does it do that? Can you summarize the main ideas of your
> implementation, which make it so much faster?
My parser creates Lisp objects during parsing, there is no
intermediate step as Emacs has with jansson. With jansson, there
are a lot of allocations, which my parser doesn't have (my parser
has only two buffers, which exponentially grow. There are no other
allocations). But even ignoring performance loss because of
mallocs (on my dataset, 40% of CPU time goes into malloc/free), I
think parsing should be faster, so maybe jansson is not a fast
parser in the first place.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 12:38 ` Herman, Géza
@ 2024-03-08 12:59 ` Eli Zaretskii
2024-03-08 13:12 ` Herman, Géza
0 siblings, 1 reply; 51+ messages in thread
From: Eli Zaretskii @ 2024-03-08 12:59 UTC (permalink / raw)
To: Géza Herman; +Cc: emacs-devel
> From: Herman, Géza <geza.herman@gmail.com>
> Cc: Géza Herman <geza.herman@gmail.com>,
> emacs-devel@gnu.org
> Date: Fri, 08 Mar 2024 13:38:48 +0100
>
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> >> From: Herman, Géza <geza.herman@gmail.com>
> >> Date: Fri, 08 Mar 2024 11:27:16 +0100
> >>
> >> This parser runs 8-9x faster than the jansson based parser on
> >> my
> >> machine (tested on clangd language server messages).
> >
> > How does it do that? Can you summarize the main ideas of your
> > implementation, which make it so much faster?
>
> My parser creates Lisp objects during parsing, there is no
> intermediate step as Emacs has with jansson. With jansson, there
> are a lot of allocations, which my parser doesn't have (my parser
> has only two buffers, which exponentially grow. There are no other
> allocations). But even ignoring performance loss because of
> mallocs (on my dataset, 40% of CPU time goes into malloc/free), I
> think parsing should be faster, so maybe jansson is not a fast
> parser in the first place.
Thanks.
So, if you are willing to contribute this code to Emacs, what is left
is:
. clean up the code (e.g., currently it still calls the function
that on MS-Windows loads the jansson DLL, which is now unneeded)
and adjust it to our style and conventions;
. thoroughly test the code on the available test suites (or maybe
you already did);
. for you to assign the copyright to the FSF, without which we
cannot accept such a substantial contribution
As an additional idea: perhaps initially we'd want to have a
configure-time option to either use your parser or jansson, to make
the migration and comparison easier. What do others think about this?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 12:59 ` Eli Zaretskii
@ 2024-03-08 13:12 ` Herman, Géza
2024-03-08 14:10 ` Eli Zaretskii
0 siblings, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-08 13:12 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel
Eli Zaretskii <eliz@gnu.org> writes:
> So, if you are willing to contribute this code to Emacs, what is
> left
> is:
>
> . clean up the code (e.g., currently it still calls the
> function
> that on MS-Windows loads the jansson DLL, which is now
> unneeded)
> and adjust it to our style and conventions;
I can remove the WINDOWSNT related #ifs. Are there any more
comments? I formatted my code with clangd, is there anything else
that should be done?
> . thoroughly test the code on the available test suites (or
> maybe
> you already did);
I tested it on github.com/nst/JSONTestSuite, it passes all "y_"
and "n_" tests in the directory "test_parsing". If you're aware
of other test repositories, I'm happy to test on them as well.
> . for you to assign the copyright to the FSF, without which we
> cannot accept such a substantial contribution
Can you please send me the necessary documents?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 10:27 I created a faster JSON parser Herman, Géza
2024-03-08 11:41 ` Philip Kaludercic
2024-03-08 12:03 ` Eli Zaretskii
@ 2024-03-08 13:28 ` Po Lu
2024-03-08 16:14 ` Herman, Géza
2024-03-09 20:37 ` Christopher Wellons
3 siblings, 1 reply; 51+ messages in thread
From: Po Lu @ 2024-03-08 13:28 UTC (permalink / raw)
To: Herman, Géza; +Cc: emacs-devel@gnu.org
"Herman, Géza" <geza.herman@gmail.com> writes:
> I created a faster JSON parser for emacs, you can check it out here:
>
> https://github.com/geza-herman/emacs/tree/faster-json-parsing
>
> It replaces json-parse-string and json-parse-buffer functions. The
> behavior should be the same as before, with the only exception that
> objects with duplicated keys are not detected if :object-type is not
> 'hash-table.
>
> This parser runs 8-9x faster than the jansson based parser on my
> machine (tested on clangd language server messages). An additional
> tiny benefit is that large integers are parsed, instead of having an
> "out of range" error.
>
> What do you think?
Speed aside, this change eliminates an external dependency from Emacs,
which is a great service to us in its own right. Thank you!
Nevertheless, there are several coding style issues that should be
resolved, viz. the presence of redundant opening braces and ternary or
nested binary operations not enclosed in parentheses, and the size of
the change is such that its copyright must be assigned to the FSF.
Please see that these are addressed, and thanks again.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 13:12 ` Herman, Géza
@ 2024-03-08 14:10 ` Eli Zaretskii
2024-03-08 14:24 ` Collin Funk
2024-03-08 15:20 ` Herman, Géza
0 siblings, 2 replies; 51+ messages in thread
From: Eli Zaretskii @ 2024-03-08 14:10 UTC (permalink / raw)
To: Géza Herman; +Cc: emacs-devel
> From: Herman, Géza <geza.herman@gmail.com>
> Cc: Géza Herman <geza.herman@gmail.com>,
> emacs-devel@gnu.org
> Date: Fri, 08 Mar 2024 14:12:19 +0100
>
>
> > . clean up the code (e.g., currently it still calls the
> > function
> > that on MS-Windows loads the jansson DLL, which is now
> > unneeded)
> > and adjust it to our style and conventions;
> I can remove the WINDOWSNT related #ifs. Are there any more
> comments? I formatted my code with clangd, is there anything else
> that should be done?
The following is based on an initial reading of the patch:
. Redundant braces (for blocks of a single code line) is one issue.
. The way you break a long line at the equals sign '=' is another (we
break after '=', not before).
. The code must not call malloc/realloc/free, but their x* variants,
xmalloc/xrealloc/xfree.
. The names should be changed to remove the "my_" and "My" prefixes.
. Many functions should have a commentary before them explaining
their purpose and use, the exception is short function whose
purpose is clear from reading the code.
. Some of the generic 'json-parse-error' errors should probably be
more specific; for example, UTF-8 encoding problems should be
flagged as such.
. The code which handles integers seems to assume that 'unsigned long'
is a 64-bit type? if so, this is not true on Windows; please see how
we handle this elsewhere in Emacs, in particular in the
WIDE_EMACS_INT case.
A more general comment is that you seem to be parsing buffer text
assuming it's UTF-8? If so, this is not accurate, as the internal
representation is a superset of UTF-8, and can represent characters
above 0x10FFFF.
There could be other issues as well.
> > . thoroughly test the code on the available test suites (or
> > maybe
> > you already did);
> I tested it on github.com/nst/JSONTestSuite, it passes all "y_"
> and "n_" tests in the directory "test_parsing". If you're aware
> of other test repositories, I'm happy to test on them as well.
I don't know about such test suites, but maybe someone else does.
> > . for you to assign the copyright to the FSF, without which we
> > cannot accept such a substantial contribution
> Can you please send me the necessary documents?
Sent off-list.
Thanks.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 14:10 ` Eli Zaretskii
@ 2024-03-08 14:24 ` Collin Funk
2024-03-08 15:20 ` Herman, Géza
1 sibling, 0 replies; 51+ messages in thread
From: Collin Funk @ 2024-03-08 14:24 UTC (permalink / raw)
To: emacs-devel
Hello,
On 3/8/24 6:10 AM, Eli Zaretskii wrote:
> . The code which handles integers seems to assume that 'unsigned long'
> is a 64-bit type? if so, this is not true on Windows; please see how
> we handle this elsewhere in Emacs, in particular in the
> WIDE_EMACS_INT case.
I noticed this in 'my_json_parse_number' as well. Another change that
would be nice there is using ckd_add and ckd_mul from stdckdint.h,
generated from lib/stdckdint.in.h, to check for integer overflows.
Less room for errors compared to checking manually and easier to read.
Collin
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 14:10 ` Eli Zaretskii
2024-03-08 14:24 ` Collin Funk
@ 2024-03-08 15:20 ` Herman, Géza
2024-03-08 16:22 ` Eli Zaretskii
1 sibling, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-08 15:20 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Herman, Géza <geza.herman@gmail.com>
>> Cc: Géza Herman <geza.herman@gmail.com>,
>> emacs-devel@gnu.org
>> Date: Fri, 08 Mar 2024 14:12:19 +0100
> The following is based on an initial reading of the patch:
>
> . Redundant braces (for blocks of a single code line) is one
> issue.
> . The way you break a long line at the equals sign '=' is
> another (we
> break after '=', not before).
I used clang-format to format my code (I use a completely
different coding style). I see that clang-format is configured
this way in Emacs. Shouldn't BreakBeforeBinaryOperators be set to
None or NonAssignment in .clang-format?
> . The code which handles integers seems to assume that
> 'unsigned long'
> is a 64-bit type? if so, this is not true on Windows; please
> see how
> we handle this elsewhere in Emacs, in particular in the
> WIDE_EMACS_INT case.
That was a mistake on my part, though a different (but similar)
one. I originally used a 64-bit type, but then changed it to
long, because of 32-bit architectures. The idea is to use a type
which likely has the same size as a CPU register. So I think long
is OK, I just need to change the thresholds to ULONG_MAX. Or I
think I'll use ckd_* functions as Collin suggested.
> A more general comment is that you seem to be parsing buffer
> text
> assuming it's UTF-8? If so, this is not accurate, as the
> internal
> representation is a superset of UTF-8, and can represent
> characters
> above 0x10FFFF.
When does a buffer have characters above 0x10ffff? I supposed
that a JSON shouldn't contain characters that are out of range.
But if the solution is to just remove the upper-range comparison,
I can do that easily.
>> Can you please send me the necessary documents?
>
> Sent off-list.
Thanks!
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 13:28 ` Po Lu
@ 2024-03-08 16:14 ` Herman, Géza
2024-03-09 1:55 ` Po Lu
0 siblings, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-08 16:14 UTC (permalink / raw)
To: Po Lu; +Cc: Herman, Géza, emacs-devel@gnu.org
Po Lu <luangruo@yahoo.com> writes:
> "Herman, Géza" <geza.herman@gmail.com> writes:
>
>> What do you think?
>
> Speed aside, this change eliminates an external dependency from
> Emacs,
> which is a great service to us in its own right. Thank you!
I think jansson is still used, but only for writing to json.
> Nevertheless, there are several coding style issues that should
> be
> resolved, viz. the presence of redundant opening braces and
> ternary or
> nested binary operations not enclosed in parentheses,
Can you give me some examples reagarding operations need to be
parenthesized? Isn't there a gcc warning for this, so I can
follow the convention more easily?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 15:20 ` Herman, Géza
@ 2024-03-08 16:22 ` Eli Zaretskii
2024-03-08 18:34 ` Herman, Géza
0 siblings, 1 reply; 51+ messages in thread
From: Eli Zaretskii @ 2024-03-08 16:22 UTC (permalink / raw)
To: Géza Herman; +Cc: geza.herman, emacs-devel
> From: Herman, Géza <geza.herman@gmail.com>
> Cc: Géza Herman <geza.herman@gmail.com>,
> emacs-devel@gnu.org
> Date: Fri, 08 Mar 2024 16:20:40 +0100
>
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> > . The way you break a long line at the equals sign '=' is
> > another (we
> > break after '=', not before).
> I used clang-format to format my code (I use a completely
> different coding style). I see that clang-format is configured
> this way in Emacs. Shouldn't BreakBeforeBinaryOperators be set to
> None or NonAssignment in .clang-format?
Actually, I see we use both styles, so I guess you can disregard that
part.
> > . The code which handles integers seems to assume that
> > 'unsigned long'
> > is a 64-bit type? if so, this is not true on Windows; please
> > see how
> > we handle this elsewhere in Emacs, in particular in the
> > WIDE_EMACS_INT case.
> That was a mistake on my part, though a different (but similar)
> one. I originally used a 64-bit type, but then changed it to
> long, because of 32-bit architectures. The idea is to use a type
> which likely has the same size as a CPU register.
If you want to use a 32-bit type, use 'int' or 'unsigned int'.
> > A more general comment is that you seem to be parsing buffer text
> > assuming it's UTF-8? If so, this is not accurate, as the internal
> > representation is a superset of UTF-8, and can represent
> > characters above 0x10FFFF.
>
> When does a buffer have characters above 0x10ffff?
See the node "Text Representations" in the ELisp manual, it explains
that.
> I supposed that a JSON shouldn't contain characters that are out of
> range. But if the solution is to just remove the upper-range
> comparison, I can do that easily.
We need to decide what to do with characters outside of the Unicode
range. The encoding of those characters is different, btw, it doesn't
follow the UTF-8 scheme.
In any case, the error in those cases cannot be just "json parse
error", it must be something more self-explanatory.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 16:22 ` Eli Zaretskii
@ 2024-03-08 18:34 ` Herman, Géza
2024-03-08 19:57 ` Eli Zaretskii
0 siblings, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-08 18:34 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: geza.herman, emacs-devel
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Herman, Géza <geza.herman@gmail.com>
>> That was a mistake on my part, though a different (but similar)
>> one. I originally used a 64-bit type, but then changed it to
>> long, because of 32-bit architectures. The idea is to use a
>> type
>> which likely has the same size as a CPU register.
>
> If you want to use a 32-bit type, use 'int' or 'unsigned int'.
I want to use a datatype which is likely to have the same size as
a CPU register. On 32-bit machines, it should be 32-bit, on 64-bit
architectures it should be 64-bit. This is not a strong
requirement, but it makes the parser faster. During number
parsing, this part of the code calculates the value of the
number. If it overflows, or it ends up being a float, then the
workspace buffer is re-parsed to find the value. But if it is an
integer without overflow, the parser just can use the value
without any further processing (a certain kind of LSP message is
basically a large array of integers, so it makes sense to optimize
this case - this optimization makes parsing such messages 1.5-2x
faster).
> We need to decide what to do with characters outside of the
> Unicode
> range. The encoding of those characters is different, btw, it
> doesn't
> follow the UTF-8 scheme.
>
> In any case, the error in those cases cannot be just "json parse
> error", it must be something more self-explanatory.
I am open to suggestions, as even after reading "Text
Representations" in the manual, I have no idea what to do here. I
didn't find any code which handles this case in the jansson parser
either.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 18:34 ` Herman, Géza
@ 2024-03-08 19:57 ` Eli Zaretskii
2024-03-08 20:22 ` Herman, Géza
0 siblings, 1 reply; 51+ messages in thread
From: Eli Zaretskii @ 2024-03-08 19:57 UTC (permalink / raw)
To: Herman Géza; +Cc: emacs-devel
> From: Herman, Géza <geza.herman@gmail.com>
> Cc: geza.herman@gmail.com, emacs-devel@gnu.org
> Date: Fri, 08 Mar 2024 19:34:28 +0100
>
> > If you want to use a 32-bit type, use 'int' or 'unsigned int'.
> I want to use a datatype which is likely to have the same size as
> a CPU register. On 32-bit machines, it should be 32-bit, on 64-bit
> architectures it should be 64-bit.
Is there a reason for you to want it to be 64-bit type on a 64-bit
machine? If the only bother is efficiency, then you can use 'int'
without fear. But if a 64-bit machine will need the range of values
beyond INT_MAX (does it?), then I suggest to use ptrdiff_t.
> This is not a strong requirement, but it makes the parser faster.
Are you sure? AFAIK, 32-bit arithmetics on a 64-bit machine is as
fast as 64-bit arithmetics. So if efficiency is the only
consideration, using 'int' is OK.
> During number parsing, this part of the code calculates the value of
> the number. If it overflows, or it ends up being a float, then the
> workspace buffer is re-parsed to find the value. But if it is an
> integer without overflow, the parser just can use the value without
> any further processing (a certain kind of LSP message is basically a
> large array of integers, so it makes sense to optimize this case -
> this optimization makes parsing such messages 1.5-2x faster).
If the value needs to fit in a fixnum, use EMACS_INT, which is a type
that already takes the 32-but vs 64-bit nature of the machine into
consideration.
> > We need to decide what to do with characters outside of the
> > Unicode range. The encoding of those characters is different,
> > btw, it doesn't follow the UTF-8 scheme.
> >
> > In any case, the error in those cases cannot be just "json parse
> > error", it must be something more self-explanatory.
> I am open to suggestions, as even after reading "Text
> Representations" in the manual, I have no idea what to do here.
We should begin by deciding whether we want to support characters
outside of the Unicode range. The error message depends on that
decision.
> I didn't find any code which handles this case in the jansson parser
> either.
The jansson code required encoding/decoding strings to make sure we
submit to jansson text that is always valid UTF-8.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 19:57 ` Eli Zaretskii
@ 2024-03-08 20:22 ` Herman, Géza
2024-03-09 6:52 ` Eli Zaretskii
0 siblings, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-08 20:22 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: Herman Géza, emacs-devel
Eli Zaretskii <eliz@gnu.org> writes:
>> I want to use a datatype which is likely to have the same size
>> as
>> a CPU register. On 32-bit machines, it should be 32-bit, on
>> 64-bit
>> architectures it should be 64-bit.
>
> Is there a reason for you to want it to be 64-bit type on a
> 64-bit
> machine? If the only bother is efficiency, then you can use
> 'int'
> without fear. But if a 64-bit machine will need the range of
> values
> beyond INT_MAX (does it?), then I suggest to use ptrdiff_t.
The only reason is if I use a 64-bit number on a 64-bit platform,
then the fast path will be chosen more frequently. So it makes
sense to use a register-sized integer here.
>> This is not a strong requirement, but it makes the parser
>> faster.
>
> Are you sure? AFAIK, 32-bit arithmetics on a 64-bit machine is
> as
> fast as 64-bit arithmetics. So if efficiency is the only
> consideration, using 'int' is OK.
Yes, that's right, but my reasoning is the other way around. I
shouldn't choose a larger integer than the platform natively
support. I could just use a 64-bit integer on all platforms, this
is what I originally had. But then on 32-bit platforms, this part
of the parser will be slower, because it unnecessarily will use
two registers instead of one (for most cases, as numbers are
usually smaller than 32-bit).
This whole thing is not big of a deal, as int is probably fine, as
it covers most of the commonly used range. But if I can just put a
more proper type here, then I think I should do it.
> If the value needs to fit in a fixnum, use EMACS_INT, which is a
> type
> that already takes the 32-but vs 64-bit nature of the machine
> into
> consideration.
Yes, it seems that EMACS_UINT is good for my purpose, thanks for
the suggestion.
> We should begin by deciding whether we want to support
> characters
> outside of the Unicode range. The error message depends on that
> decision.
>
>> I didn't find any code which handles this case in the jansson
>> parser
>> either.
>
> The jansson code required encoding/decoding strings to make sure
> we
> submit to jansson text that is always valid UTF-8.
I tried to use the jansson parser with a unicode 0x333333
character in a string, and it didn't work, it fails with
(json-parse-error "unable to decode byte... message. Also, I see
that json-parse-string calls some utf8 encoding related function
before parsing, but json-parse-buffer doesn't (and it doesn't do
anything encoding related thing in the callback, it just calls
memcpy). So based on these, does it have any benefit of
supporting these? Out of curiosity, what are these extra
characters used for? What is the purpose of the odd special
2-byte encoding of 8-bit characters (I mean where the 1st byte is
C0/C1)? Why don't just use the regular utf-8 encoding for these
values?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 16:14 ` Herman, Géza
@ 2024-03-09 1:55 ` Po Lu
0 siblings, 0 replies; 51+ messages in thread
From: Po Lu @ 2024-03-09 1:55 UTC (permalink / raw)
To: Herman, Géza; +Cc: emacs-devel@gnu.org
"Herman, Géza" <geza.herman@gmail.com> writes:
> I think jansson is still used, but only for writing to json.
That's a disappointment, but would you be willing to solve this as well?
> Can you give me some examples reagarding operations need to be
> parenthesized?
e.g., in
function_call ()
+ function_call_1 ()
* function_call_2 ()
+ function_call_3 ()
the entire statement should be surrounded by parentheses, along with
each group of operators separated by precedence:
(function_call ()
+ (function_call_1 ()
* function_call_2 ())
+ function_call_3 ())
and lastly all ternary operators should be surrounded by parentheses.
> Isn't there a gcc warning for this, so I can follow the convention
> more easily?
Not that I'm aware of.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 20:22 ` Herman, Géza
@ 2024-03-09 6:52 ` Eli Zaretskii
2024-03-09 11:08 ` Herman, Géza
0 siblings, 1 reply; 51+ messages in thread
From: Eli Zaretskii @ 2024-03-09 6:52 UTC (permalink / raw)
To: Géza Herman; +Cc: emacs-devel
> From: Herman, Géza <geza.herman@gmail.com>
> Cc: Herman Géza <geza.herman@gmail.com>,
> emacs-devel@gnu.org
> Date: Fri, 08 Mar 2024 21:22:13 +0100
>
> > Is there a reason for you to want it to be 64-bit type on a 64-bit
> > machine? If the only bother is efficiency, then you can use 'int'
> > without fear. But if a 64-bit machine will need the range of
> > values beyond INT_MAX (does it?), then I suggest to use ptrdiff_t.
>
> The only reason is if I use a 64-bit number on a 64-bit platform,
> then the fast path will be chosen more frequently. So it makes
> sense to use a register-sized integer here.
Then either ptrdiff_t or EMACS_INT should do what you want.
> Yes, it seems that EMACS_UINT is good for my purpose, thanks for
> the suggestion.
Are you sure you need the unsigned variety? If EMACS_INT fits the
bill, then it is a better candidate, since unsigned arithmetics has
its quirks.
> > The jansson code required encoding/decoding strings to make sure
> > we submit to jansson text that is always valid UTF-8.
>
> I tried to use the jansson parser with a unicode 0x333333
> character in a string, and it didn't work, it fails with
> (json-parse-error "unable to decode byte... message.
Well, I didn't say trying an arbitrary codepoint will demonstrate the
issue. Some codepoints above 0x10FFFF indeed cannot be passed to
jansson.
It's okay if the initial version of this parser only handles the
Unicode range and errors out otherwise; we could extend it if needed
later. But the error message should talk specifically about invalid
character or something, not just a generic "parse error".
> Also, I see that json-parse-string calls some utf8 encoding related
> function before parsing, but json-parse-buffer doesn't (and it
> doesn't do anything encoding related thing in the callback, it just
> calls memcpy).
This is a part I was never happy about. But, as I say above, we can
get to handling these rare cases later.
> So based on these, does it have any benefit of supporting these?
Yes, definitely. But it isn't urgent.
> Out of curiosity, what are these extra characters used for?
Raw bytes and characters from charsets that are not (yet) unified with
Unicode.
> What is the purpose of the odd special 2-byte encoding of 8-bit
> characters (I mean where the 1st byte is C0/C1)? Why don't just use
> the regular utf-8 encoding for these values?
I think it's for efficiency: a 2-byte encoding takes much less space
than the 6-byte encoding (using superset of UTF-8) would take.
Imagine the case where a large byte-stream is inserted into a
multibyte buffer, before decoding it, something that happens a lot
when visiting non-ASCII files or reading from a network sub-process.
The regular UTF-8 encoding cannot be used for the raw bytes, because
then we will be unable to distinguish between them and the Unicode
codepoints of the same value. For example, a raw byte whose value is
160 decimal (A0 hex) will be indistinguishable from U+00A0 No-Break
Space character. This is why the "codepoint" corresponding to raw
byte 160 is 0x3FFFA0, see BYTE8_TO_CHAR.
Once again, we can extend the parser for codepoints outside of the
Unicode range later. For now, it's okay to reject them with a
suitable error.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-09 6:52 ` Eli Zaretskii
@ 2024-03-09 11:08 ` Herman, Géza
2024-03-09 12:23 ` Lynn Winebarger
` (2 more replies)
0 siblings, 3 replies; 51+ messages in thread
From: Herman, Géza @ 2024-03-09 11:08 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Herman, Géza <geza.herman@gmail.com>
>> Cc: Herman Géza <geza.herman@gmail.com>,
>> emacs-devel@gnu.org
>> Date: Fri, 08 Mar 2024 21:22:13 +0100
>> Yes, it seems that EMACS_UINT is good for my purpose, thanks
>> for
>> the suggestion.
>
> Are you sure you need the unsigned variety? If EMACS_INT fits
> the
> bill, then it is a better candidate, since unsigned arithmetics
> has
> its quirks.
Yes, I think it's better to use unsigned: read the sign, and then
parse the number as unsigned, and then apply the sign at the
end. If the number is parsed with its sign, it needs an additional
step at each character (the sign needs to be applied to each
digit).
>> Also, I see that json-parse-string calls some utf8 encoding
>> related
>> function before parsing, but json-parse-buffer doesn't (and it
>> doesn't do anything encoding related thing in the callback, it
>> just
>> calls memcpy).
>
> This is a part I was never happy about. But, as I say above, we
> can
> get to handling these rare cases later.
I think this is an additional benefit of my parser: this feature
can be added to it more easily than into jansson.
Even, I'm tempted to say that we could just remove utf-8 checking
from my code, and then Emacs's encoding method should work right
out of the box.
Or, to say that utf-8 handling should stay as is. Because as far
as I understand, if the JSON contains an invalid utf-8 sequence
which is not invalid according to Emacs's character
representation, then this problem won't be detected. So checking
for utf-8 encoding errors shouldn't be the job of the json parser,
but around IO handling, which has the chance to know that the JSON
stream itself must only contain a valid utf-8 encoding.
Or, as the JSON specification explcitly says that the allowed
character range is 0x20 .. 0x10ffff, the current solution is fine,
because it is actually against JSON rules to allow anything else
outside of this range.
> Once again, we can extend the parser for codepoints outside of
> the
> Unicode range later. For now, it's okay to reject them with a
> suitable error.
OK, cool, I added Qjson_utf8_decode_error to indicate decoding
errors.
How can we proceed further? This is the current state of the
patch:
https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch
I think I did everything that was asked for, except Po Lu's
parenthesis-related comment, because I still don't know what to
parenthesize and what not to. I saw a lot of "a + x * y" kind of
expressions in emacs codebase without any parenthesis. Are the
exact rules documented somewhere?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-09 11:08 ` Herman, Géza
@ 2024-03-09 12:23 ` Lynn Winebarger
2024-03-09 12:58 ` Po Lu
2024-03-09 13:13 ` Eli Zaretskii
2 siblings, 0 replies; 51+ messages in thread
From: Lynn Winebarger @ 2024-03-09 12:23 UTC (permalink / raw)
To: Herman, Géza; +Cc: Eli Zaretskii, emacs-devel
[-- Attachment #1: Type: text/plain, Size: 759 bytes --]
On Sat, Mar 9, 2024, 6:39 AM Herman, Géza <geza.herman@gmail.com> wrote:
>
> I think I did everything that was asked for, except Po Lu's
> parenthesis-related comment, because I still don't know what to
> parenthesize and what not to. I saw a lot of "a + x * y" kind of
> expressions in emacs codebase without any parenthesis. Are the
> exact rules documented somewhere?
>
I suspect the rule is to ensure that emacs maintainers can freely convert
functions to C preprocessor macros. So you should follow Po Lu's
description.
If the preprocessor wasn't an issue, such replication of the language's
builtin arithmetic precedence would be undesirable, an possibly repugnant,
depending on your aesthetic sensibilities.
Lynn
Lynn
[-- Attachment #2: Type: text/html, Size: 1380 bytes --]
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-09 11:08 ` Herman, Géza
2024-03-09 12:23 ` Lynn Winebarger
@ 2024-03-09 12:58 ` Po Lu
2024-03-09 13:13 ` Eli Zaretskii
2 siblings, 0 replies; 51+ messages in thread
From: Po Lu @ 2024-03-09 12:58 UTC (permalink / raw)
To: Herman, Géza; +Cc: Eli Zaretskii, emacs-devel
"Herman, Géza" <geza.herman@gmail.com> writes:
> I think I did everything that was asked for, except Po Lu's
> parenthesis-related comment, because I still don't know what to
> parenthesize and what not to. I saw a lot of "a + x * y" kind of
> expressions in emacs codebase without any parenthesis. Are the exact
> rules documented somewhere?
Simple statements as above, which fit in one line, don't require the
extra parentheses. The objective of these parentheses is to enable
indenting statements satisfactorily with CC Mode's indentation commands,
and so that such indentation does not give readers a misleading
impression of the statement's structure.
https://www.gnu.org/prep/standards/standards.html#Formatting
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-09 11:08 ` Herman, Géza
2024-03-09 12:23 ` Lynn Winebarger
2024-03-09 12:58 ` Po Lu
@ 2024-03-09 13:13 ` Eli Zaretskii
2024-03-09 14:00 ` Herman, Géza
2 siblings, 1 reply; 51+ messages in thread
From: Eli Zaretskii @ 2024-03-09 13:13 UTC (permalink / raw)
To: Géza Herman; +Cc: emacs-devel
> From: Herman, Géza <geza.herman@gmail.com>
> Cc: Géza Herman <geza.herman@gmail.com>,
> emacs-devel@gnu.org
> Date: Sat, 09 Mar 2024 12:08:54 +0100
>
> How can we proceed further? This is the current state of the
> patch:
> https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch
What about removing the dependency on jansson altogether?
We also need to wait until your legal paperwork is completed.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-09 13:13 ` Eli Zaretskii
@ 2024-03-09 14:00 ` Herman, Géza
2024-03-09 14:21 ` Eli Zaretskii
0 siblings, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-09 14:00 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Herman, Géza <geza.herman@gmail.com>
>> Cc: Géza Herman <geza.herman@gmail.com>,
>> emacs-devel@gnu.org
>> Date: Sat, 09 Mar 2024 12:08:54 +0100
>>
>> How can we proceed further? This is the current state of the
>> patch:
>> https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch
>
> What about removing the dependency on jansson altogether?
Unless it also turns out to be a serious bottleneck, I'm not too
motivated to work on that.
Based on Po Lu's comments, I added some parentheses, here's the
latest version:
https://github.com/geza-herman/emacs/commit/5e3db3fb06b6393fd1b5d5fd967062ec1909e94a.patch
(I reformatted the code using indent-region, and put parentheses
where it didn't used the correct indentation. I'm not sure
whether this method finds all the cases, but as the intent of
these extra parentheses is to have correct indentation, I think
this approach more-or-less OK)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-09 14:00 ` Herman, Géza
@ 2024-03-09 14:21 ` Eli Zaretskii
0 siblings, 0 replies; 51+ messages in thread
From: Eli Zaretskii @ 2024-03-09 14:21 UTC (permalink / raw)
To: Géza Herman; +Cc: emacs-devel
> From: Herman, Géza <geza.herman@gmail.com>
> Cc: Géza Herman <geza.herman@gmail.com>,
> emacs-devel@gnu.org
> Date: Sat, 09 Mar 2024 15:00:09 +0100
>
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> >> How can we proceed further? This is the current state of the
> >> patch:
> >> https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch
> >
> > What about removing the dependency on jansson altogether?
>
> Unless it also turns out to be a serious bottleneck, I'm not too
> motivated to work on that.
That's a pity, because using jansson for part of the job and our own
code for the other part makes little sense to me. So I hope you will
reconsider, because having our own JSON code is a very attractive
development for Emacs, it opens several possibilities for future
extensions that using an external library cannot.
> Based on Po Lu's comments, I added some parentheses, here's the
> latest version:
Thanks.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-08 10:27 I created a faster JSON parser Herman, Géza
` (2 preceding siblings ...)
2024-03-08 13:28 ` Po Lu
@ 2024-03-09 20:37 ` Christopher Wellons
2024-03-10 6:31 ` Eli Zaretskii
2024-03-10 6:58 ` Herman, Géza
3 siblings, 2 replies; 51+ messages in thread
From: Christopher Wellons @ 2024-03-09 20:37 UTC (permalink / raw)
To: Herman, Géza; +Cc: emacs-devel@gnu.org
> What do you think?
Your new JSON parser appears to carefully written, and has the marks of
thoughtful design. You've avoided the common pitfalls, and that gives me
confidence in it.
In review I noticed a potential pointer overflow in json_parse_string:
parser->input_current + 4 <= parser->input_end
The "+ 4" may push the pointer beyond one-past-the-end not just for the
input, but the buffer itself, overflowing the pointer. To fix, re-arrange
the expression to check a size rather than an address:
parser->input_end - parser->input_current >= 4
In json_make_object_workspace_for and json_byte_workspace_put, a size is
doubled without an overflow check ("new_workspace_size * 2"). The first
could cause an infinite loop, and the second could allocate less than was
expected. Both are minor, and in practice can only affect 32-bit targets,
because you'd need to grow these buffers to the limits before these sizes
could overflow.
Despite the obvious care which with this was written, I personally would
not adopt a JSON parser that had not been thoroughly fuzz tested under
Address Sanitizer and Undefined Behavior Sanitizer. Fuzzing is incredibly
effective at finding defects, and it would be foolish not to use it in its
ideal circumstances. Normally it's not difficult and requires only a few
lines of code. But this JSON parser is tightly coupled with the Emacs Lisp
runtime, which greatly complicates things. I couldn't simply pluck it out
by itself and drop it in, say, AFL++.
As noted earlier, the parser gets its performance edge through skipping
the intermediate steps. This is great! That could still be accomplished
without such tight coupling, allowing for performance *and* an interface
that is testable and fuzzable in relative isolation.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-09 20:37 ` Christopher Wellons
@ 2024-03-10 6:31 ` Eli Zaretskii
2024-03-10 21:39 ` Philip Kaludercic
2024-03-10 6:58 ` Herman, Géza
1 sibling, 1 reply; 51+ messages in thread
From: Eli Zaretskii @ 2024-03-10 6:31 UTC (permalink / raw)
To: Christopher Wellons; +Cc: geza.herman, emacs-devel
> Date: Sat, 9 Mar 2024 15:37:25 -0500
> From: Christopher Wellons <wellons@nullprogram.com>
> Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org>
>
> Despite the obvious care which with this was written, I personally would
> not adopt a JSON parser that had not been thoroughly fuzz tested under
> Address Sanitizer and Undefined Behavior Sanitizer. Fuzzing is incredibly
> effective at finding defects, and it would be foolish not to use it in its
> ideal circumstances. Normally it's not difficult and requires only a few
> lines of code. But this JSON parser is tightly coupled with the Emacs Lisp
> runtime, which greatly complicates things. I couldn't simply pluck it out
> by itself and drop it in, say, AFL++.
That's okay, we can start by making this an optional feature, and
consider making it the default after a couple of major releases;
meanwhile, any problems will be detected and reported.
However, it would make much more sense to consider switching to this
code if it also could handle producing JSON output, thus making
libjansson unnecessary when we decide to switch.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-09 20:37 ` Christopher Wellons
2024-03-10 6:31 ` Eli Zaretskii
@ 2024-03-10 6:58 ` Herman, Géza
2024-03-10 16:54 ` Christopher Wellons
1 sibling, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-10 6:58 UTC (permalink / raw)
To: Christopher Wellons; +Cc: Herman, Géza, emacs-devel@gnu.org
Christopher Wellons <wellons@nullprogram.com> writes:
>> What do you think?
>
> In review I noticed a potential pointer overflow in
> json_parse_string:
>
> parser->input_current + 4 <= parser->input_end
>
> [...]
>
> In json_make_object_workspace_for and json_byte_workspace_put, a
> size
> is doubled without an overflow check ("new_workspace_size * 2").
Thanks for the review and finding these problems! I fixed them:
https://github.com/geza-herman/emacs/commit/cbbf3dd494034750ff324703e64f1125a1056832.patch
> But this JSON parser is tightly
> coupled with the Emacs Lisp runtime, which greatly complicates
> things. I couldn't simply pluck it out by itself and drop it in,
> say,
> AFL++.
Yes, it needs some work. The Lisp Object creation part is only
done at very specific places, it's easy to remove them (actually,
I wrote this parser outside of Emacs, and then just put it in by
adding the necessary Lisp Object creation code). Or, if the
fuzzer needs the actual output (I mean, the result of the
parsing), it shouldn't be too hard to put some code there which
provides the output. The other thing is error handling, but it
also can be easily replaced by using longjmp.
I'm happy to do this work, I'd just need some directions how to do
it. I'm not experienced with fuzzy testing, so if you are, I'd
glad if you can give some advices: which fuzzy-testing framework
to use, which introductory material is worth reading, etc.
> As noted earlier, the parser gets its performance edge through
> skipping the intermediate steps. This is great! That could still
> be
> accomplished without such tight coupling, allowing for
> performance
> *and* an interface that is testable and fuzzable in relative
> isolation.
Yes, I think a SAX parser like interface would have a very little
cost. But honestly, I don't see the point of it. This is a
parser for Emacs only. It has a very specific purpose, to make
JSON parsing fast in Emacs. It is a small module. Input is JSON,
output is Lisp Objects. Working with Lisp Objects inside Emacs is
a natural thing, usually there is no need for intermediate
representations. So if the only reason to have a
Emacs-independent API is to make the parser fuzzy-testable, then
wouldn't it make more sense to make Emacs fuzzy-testable in
general? I find this approach more useful, because I think it's
not just this parser which can be a sensible target for fuzzy
testing.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-10 6:58 ` Herman, Géza
@ 2024-03-10 16:54 ` Christopher Wellons
2024-03-10 20:41 ` Herman, Géza
0 siblings, 1 reply; 51+ messages in thread
From: Christopher Wellons @ 2024-03-10 16:54 UTC (permalink / raw)
To: Herman, Géza; +Cc: emacs-devel@gnu.org
> I'd glad if you can give some advices: which fuzzy-testing framework to
> use, which introductory material is worth reading, etc.
The Jansson repository has a libFuzzer-based fuzz test, which is perhaps a
useful example. In it they define LLVMFuzzerTestOneInput, a function which
accepts a buffer of input (pointer and length), which they feed into the
code under test. That's basically it. In the new parser that buffer would
go into json_parse. The tested code is instrumented, and the fuzz tester
observes the affect inputs have on control flow, using that information to
construct new inputs that explore new execution paths, trying to exercise
as many as possible.
I'm partial to AFL++, and it's what I reach for first. It also works with
GCC. It has two modes, with persistent mode preferred:
https://github.com/AFLplusplus/AFLplusplus/blob/stable/instrumentation/README.persistent_mode.md
Same in principle, but with control inverted. For seed inputs, a few small
JSON documents exercising the parser's features is sufficient. In either
case, use -fsanitize=address,undefined so that defective execution paths
are more likely to be detected. More assertions would help, too, such as
"assert(input_current <= input_end)" in a number of places. Assertions
must abort or trap so that the fuzz tester knows it found a defect.
Fuzz testing works better in a narrow scope. Ideally only the code being
tested is instrumented. If it's running within an Emacs context, and you
instrument all of Emacs, the fuzz tester would explore paths in Emacs
reachable through the JSON parser rather than focus on the parser itself.
That will waste time that could instead be spent exploring the parser.
You don't need to allocate lisp objects during fuzz testing. In fact, you
should avoid it because that would just slow it down. (I even bet it's the
bottleneck in the new parser.) Ideally the core parser consumes bytes and
produces JSON events, and is agnostic to its greater context. To integrate
with Emacs, you'd have additional, separate code that turns JSON events
into lisp objects, and which wouldn't be fuzz tested.
Written that way, I could hook this core up to one of the above fuzz test
interfaces, mock out whatever bits of Emacs might still be there (e.g.
ckd_mul: the isolation need not be perfect), feed it the input, and pump
events until either error (i.e. bad input detected, which is ignored) or
EOF. The fuzz tester uses a timeout to detect infinite loops, which AFL++
will report as "hangs" and save the input for manual investigation. It
should exercise JSON numeric parsing, too, at least to the extent that
it's not punted to Emacs or strtod (mind your locale!). I'd also make
available_depth much smaller so that the fuzzing could exercise failing
checks.
To get the bulk of the value, the fuzz test does not necessarily need to
be checked into source control, or even run as part of a standard test
suite. Given a clean, decoupled interface and implementation, it would
only take a few minutes to hook up a fuzz test. I was hoping to find just
that, but each JSON function has multiple points of contact with Emacs,
most especially json_parse_object.
I've done such ad-hoc fuzz testing on dozens of programs and libraries to
evaluate their quality, and sometimes even improve them. In most cases, if
can be fuzz tested and it's never been fuzz tested before, this technique
finds fresh bugs in a matter of minutes, if not seconds. When I say it's
incredibly effective, I mean it! Case in point from a few weeks ago, under
similar circumstances, which can also serve as a practical example:
https://github.com/editorconfig/editorconfig-core-c/pull/103
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-10 16:54 ` Christopher Wellons
@ 2024-03-10 20:41 ` Herman, Géza
2024-03-10 23:22 ` Christopher Wellons
0 siblings, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-10 20:41 UTC (permalink / raw)
To: Christopher Wellons; +Cc: Herman, Géza, emacs-devel@gnu.org
[-- Attachment #1: Type: text/plain, Size: 1124 bytes --]
Christopher Wellons <wellons@nullprogram.com> writes:
>> I'd glad if you can give some advices: which fuzzy-testing
>> framework
>> to use, which introductory material is worth reading, etc.
>
> I'm partial to AFL++, and it's what I reach for first. It also
> works
> with GCC. It has two modes, with persistent mode preferred:
Thanks so much for the description! I created a standalone
version of my parser (I attached it), and used "afl-clang-fast -o
json json.c -fsanitize=address,undefined" and afl-fuzz to test it.
It's been running for an hour, the tester didn't find any problems
yet.
I discovered a funny clang bug: it incorrectly optimizes around
setjmp in do_test(): when json_parser_init runs, it stores the
workspace pointer in a register. And if there is an error during
JSON parsing, it will always free the pointer which is in that
register. But in the meantime (I mean, after json_parser_init,
and before the error is thrown), the parser could have updated
it. So free() will be called on an already freed block. I had to
add a dummy printf("free!\n"); to circumvent this optimization.
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: json.c --]
[-- Type: text/x-csrc, Size: 29853 bytes --]
#include <stddef.h>
#include <stdlib.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <setjmp.h>
#include <stdio.h>
#include <unistd.h>
#define ckd_add(R, A, B) __builtin_add_overflow((A), (B), (R))
#define ckd_mul(R, A, B) __builtin_mul_overflow((A), (B), (R))
#define bool _Bool
#define true 1
#define false 0
struct json_configuration {
};
typedef struct {
size_t value;
} Lisp_Object;
struct json_parser {
/* Because of a possible gap in the input (an emacs buffer can have
a gap), the input is described by [input_begin;input_end) and
[secondary_input_begin;secondary_input_end). If the input is
continuous, then secondary_input_begin and secondary_input_end
should be NULL */
const unsigned char *input_current;
const unsigned char *input_begin;
const unsigned char *input_end;
const unsigned char *secondary_input_begin;
const unsigned char *secondary_input_end;
int current_line;
int current_column;
/* The parser has a maximum allowed depth. available_depth
decreases at each object/array begin. If reaches zero, then an
error is generated */
int available_depth;
struct json_configuration conf;
size_t additional_bytes_count;
/* Lisp_Objects are collected in this area during object/array
parsing */
Lisp_Object *object_workspace;
Lisp_Object *object_workspace_end;
Lisp_Object *object_workspace_current;
/* String and number parsing uses this workspace */
unsigned char *byte_workspace;
unsigned char *byte_workspace_end;
unsigned char *byte_workspace_current;
};
jmp_buf signal_env;
Lisp_Object Qjson_out_of_memory;
Lisp_Object Qjson_end_of_file;
Lisp_Object Qjson_escape_sequence_error;
Lisp_Object Qjson_utf8_decode_error;
Lisp_Object Qjson_invalid_surrogate_error;
Lisp_Object Qjson_parse_error;
Lisp_Object Qjson_error;
Lisp_Object Qjson_number_out_of_range;
Lisp_Object Qjson_object_too_deep;
Lisp_Object Qjson_trailing_content;
static _Noreturn void json_signal_error(struct json_parser *parser, Lisp_Object error) {
longjmp(signal_env, 1);
/* xsignal2(error, INT_TO_INTEGER(parser->current_line), INT_TO_INTEGER(parser->current_column)); */
}
static void json_parser_init(struct json_parser *parser, struct json_configuration conf, const unsigned char *input,
const unsigned char *input_end, const unsigned char *secondary_input,
const unsigned char *secondary_input_end) {
const int initial_workspace_size = 64;
const int initial_string_workspace_size = 512;
if (secondary_input >= secondary_input_end) {
secondary_input = NULL;
secondary_input_end = NULL;
}
if (input < input_end) {
parser->input_begin = input;
parser->input_end = input_end;
parser->secondary_input_begin = secondary_input;
parser->secondary_input_end = secondary_input_end;
} else {
parser->input_begin = secondary_input;
parser->input_end = secondary_input_end;
parser->secondary_input_begin = NULL;
parser->secondary_input_end = NULL;
}
parser->input_current = parser->input_begin;
parser->current_line = 1;
parser->current_column = 0;
parser->available_depth = 8;
parser->conf = conf;
parser->additional_bytes_count = 0;
parser->object_workspace = malloc(initial_workspace_size * sizeof(Lisp_Object));
parser->object_workspace_end = parser->object_workspace + initial_workspace_size;
parser->object_workspace_current = parser->object_workspace;
parser->byte_workspace = malloc(initial_string_workspace_size);
parser->byte_workspace_end = parser->byte_workspace + initial_string_workspace_size;
}
static void json_parser_done(void *parser) {
struct json_parser *p = (struct json_parser *)parser;
/* printf("free: %p %p\n", p->object_workspace, p->byte_workspace); */
free(p->object_workspace);
free(p->byte_workspace);
}
/* Makes sure that the object_workspace has 'size' available space for
Lisp_Objects */
static void json_make_object_workspace_for(struct json_parser *parser, size_t size) {
size_t available_size = parser->object_workspace_end - parser->object_workspace_current;
if (available_size >= size) {
return;
}
size_t needed_workspace_size = (parser->object_workspace_current - parser->object_workspace + size);
size_t new_workspace_size = parser->object_workspace_end - parser->object_workspace;
while (new_workspace_size < needed_workspace_size) {
if (ckd_mul(&new_workspace_size, new_workspace_size, 2)) {
json_signal_error(parser, Qjson_out_of_memory);
}
}
size_t offset = parser->object_workspace_current - parser->object_workspace;
parser->object_workspace = realloc(parser->object_workspace, new_workspace_size * sizeof(Lisp_Object));
parser->object_workspace_end = parser->object_workspace + new_workspace_size;
parser->object_workspace_current = parser->object_workspace + offset;
}
static void json_byte_workspace_reset(struct json_parser *parser) {
parser->byte_workspace_current = parser->byte_workspace;
}
/* Puts 'value' into the byte_workspace. If there is no space
available, it allocates space */
static void json_byte_workspace_put(struct json_parser *parser, unsigned char value) {
if (parser->byte_workspace_current < parser->byte_workspace_end) {
*parser->byte_workspace_current++ = value;
return;
}
size_t new_workspace_size = parser->byte_workspace_end - parser->byte_workspace;
if (ckd_mul(&new_workspace_size, new_workspace_size, 2)) {
json_signal_error(parser, Qjson_out_of_memory);
}
size_t offset = parser->byte_workspace_current - parser->byte_workspace;
parser->byte_workspace = realloc(parser->byte_workspace, new_workspace_size);
parser->byte_workspace_end = parser->byte_workspace + new_workspace_size;
parser->byte_workspace_current = parser->byte_workspace + offset;
*parser->byte_workspace_current++ = value;
}
static bool json_input_at_eof(struct json_parser *parser) {
if (parser->input_current < parser->input_end)
return false;
return parser->secondary_input_end == NULL;
}
/* If there is a secondary buffer, it switches to it */
static int json_input_switch_to_secondary(struct json_parser *parser) {
if (parser->secondary_input_begin < parser->secondary_input_end) {
parser->additional_bytes_count = parser->input_end - parser->input_begin;
parser->input_begin = parser->secondary_input_begin;
parser->input_end = parser->secondary_input_end;
parser->input_current = parser->secondary_input_begin;
parser->secondary_input_begin = NULL;
parser->secondary_input_end = NULL;
return 0;
} else
return -1;
}
/* Reads a byte from the JSON input stream */
static unsigned char json_input_get(struct json_parser *parser) {
if (parser->input_current >= parser->input_end && json_input_switch_to_secondary(parser) < 0)
json_signal_error(parser, Qjson_end_of_file);
return *parser->input_current++;
}
/* Reads a byte from the JSON input stream, if the stream is not at
* eof. At eof, returns -1 */
static int json_input_get_if_possible(struct json_parser *parser) {
if (parser->input_current >= parser->input_end && json_input_switch_to_secondary(parser) < 0)
return -1;
return *parser->input_current++;
}
/* Puts back the last read input byte. Only one byte can be put back,
because otherwise this code would need to handle switching from
the secondary buffer to the initial */
static void json_input_put_back(struct json_parser *parser) {
parser->input_current--;
}
static bool json_skip_whitespace_internal(struct json_parser *parser, int c) {
parser->current_column++;
if (c == 0x20 || c == 0x09 || c == 0x0d)
return false;
else if (c == 0x0a) {
parser->current_line++;
parser->current_column = 0;
return false;
} else
return true;
}
/* Skips JSON whitespace, and returns with the first non-whitespace
* character */
static int json_skip_whitespace(struct json_parser *parser) {
for (;;) {
int c = json_input_get(parser);
if (json_skip_whitespace_internal(parser, c))
return c;
}
}
/* Skips JSON whitespace, and returns with the first non-whitespace
* character, if possible. If there is no non-whitespace character
* (because we reached the end), it returns -1 */
static int json_skip_whitespace_if_possible(struct json_parser *parser) {
for (;;) {
int c = json_input_get_if_possible(parser);
if (c < 0)
return c;
if (json_skip_whitespace_internal(parser, c))
return c;
}
}
static int json_hex_value(int c) {
if (c >= '0' && c <= '9')
return c - '0';
else if (c >= 'A' && c <= 'F')
return c - 'A' + 10;
else if (c >= 'a' && c <= 'f')
return c - 'a' + 10;
else
return -1;
}
/* Parses the CCCC part of the unicode escape sequence \uCCCC */
static int json_parse_unicode(struct json_parser *parser) {
unsigned char v[4];
for (int i = 0; i < 4; i++) {
int c = json_hex_value(json_input_get(parser));
parser->current_column++;
if (c < 0)
json_signal_error(parser, Qjson_escape_sequence_error);
v[i] = c;
}
return v[0] << 12 | v[1] << 8 | v[2] << 4 | v[3];
}
/* Parses an utf-8 code-point encoding (except the first byte), and
returns the numeric value of the code-point (without considering
the first byte) */
static int json_handle_utf8_tail_bytes(struct json_parser *parser, int n) {
int v = 0;
for (int i = 0; i < n; i++) {
int c = json_input_get(parser);
json_byte_workspace_put(parser, c);
if ((c & 0xc0) != 0x80)
json_signal_error(parser, Qjson_utf8_decode_error);
v = (v << 6) | (c & 0x3f);
}
return v;
}
/* Reads a JSON string, and puts the result into the byte workspace */
static void json_parse_string(struct json_parser *parser) {
/* a single_uninteresting byte can be simply copied from the input
to output, it doesn't need any extra care. */
static const char is_single_uninteresting[256] = {
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};
for (;;) {
/* This if is only here for a possible speedup. If there are 4
bytes available, and all of them are single_uninteresting,
then we can just copy these 4 bytes to output */
if (parser->input_end - parser->input_current >= 4) {
int c0 = parser->input_current[0];
int c1 = parser->input_current[1];
int c2 = parser->input_current[2];
int c3 = parser->input_current[3];
bool v0 = is_single_uninteresting[c0];
bool v1 = is_single_uninteresting[c1];
bool v2 = is_single_uninteresting[c2];
bool v3 = is_single_uninteresting[c3];
if (v0 && v1 && v2 && v3) {
json_byte_workspace_put(parser, c0);
json_byte_workspace_put(parser, c1);
json_byte_workspace_put(parser, c2);
json_byte_workspace_put(parser, c3);
parser->input_current += 4;
parser->current_column += 4;
continue;
}
}
int c = json_input_get(parser);
parser->current_column++;
if (is_single_uninteresting[c]) {
json_byte_workspace_put(parser, c);
continue;
}
if (c == '"')
return;
else if (c & 0x80) {
/* Handle utf-8 encoding */
json_byte_workspace_put(parser, c);
if (c < 0xc0)
json_signal_error(parser, Qjson_utf8_decode_error);
else if (c < 0xe0) {
int n = ((c & 0x1f) << 6 | json_handle_utf8_tail_bytes(parser, 1));
if (n < 0x80)
json_signal_error(parser, Qjson_utf8_decode_error);
} else if (c < 0xf0) {
int n = ((c & 0xf) << 12 | json_handle_utf8_tail_bytes(parser, 2));
if (n < 0x800 || (n >= 0xd800 && n < 0xe000))
json_signal_error(parser, Qjson_utf8_decode_error);
} else if (c < 0xf8) {
int n = ((c & 0x7) << 18 | json_handle_utf8_tail_bytes(parser, 3));
if (n < 0x10000 || n > 0x10ffff)
json_signal_error(parser, Qjson_utf8_decode_error);
} else
json_signal_error(parser, Qjson_utf8_decode_error);
} else if (c == '\\') {
/* Handle escape sequences */
c = json_input_get(parser);
parser->current_column++;
if (c == '"')
json_byte_workspace_put(parser, '"');
else if (c == '\\')
json_byte_workspace_put(parser, '\\');
else if (c == '/')
json_byte_workspace_put(parser, '/');
else if (c == 'b')
json_byte_workspace_put(parser, '\b');
else if (c == 'f')
json_byte_workspace_put(parser, '\f');
else if (c == 'n')
json_byte_workspace_put(parser, '\n');
else if (c == 'r')
json_byte_workspace_put(parser, '\r');
else if (c == 't')
json_byte_workspace_put(parser, '\t');
else if (c == 'u') {
int num = json_parse_unicode(parser);
/* is the first half of the surrogate pair */
if (num >= 0xd800 && num < 0xdc00) {
parser->current_column++;
if (json_input_get(parser) != '\\')
json_signal_error(parser, Qjson_invalid_surrogate_error);
parser->current_column++;
if (json_input_get(parser) != 'u')
json_signal_error(parser, Qjson_invalid_surrogate_error);
int num2 = json_parse_unicode(parser);
if (num2 < 0xdc00 || num2 >= 0xe000)
json_signal_error(parser, Qjson_invalid_surrogate_error);
num = (0x10000 + ((num - 0xd800) << 10 | (num2 - 0xdc00)));
} else if (num >= 0xdc00 && num < 0xe000)
/* is the second half of the surrogate pair without
the first half */
json_signal_error(parser, Qjson_invalid_surrogate_error);
/* utf-8 encode the code-point */
if (num < 0x80)
json_byte_workspace_put(parser, num);
else if (num < 0x800) {
json_byte_workspace_put(parser, 0xc0 | num >> 6);
json_byte_workspace_put(parser, 0x80 | (num & 0x3f));
} else if (num < 0x10000) {
json_byte_workspace_put(parser, 0xe0 | num >> 12);
json_byte_workspace_put(parser, (0x80 | ((num >> 6) & 0x3f)));
json_byte_workspace_put(parser, 0x80 | (num & 0x3f));
} else {
json_byte_workspace_put(parser, 0xf0 | num >> 18);
json_byte_workspace_put(parser, (0x80 | ((num >> 12) & 0x3f)));
json_byte_workspace_put(parser, (0x80 | ((num >> 6) & 0x3f)));
json_byte_workspace_put(parser, 0x80 | (num & 0x3f));
}
} else
json_signal_error(parser, Qjson_escape_sequence_error);
} else
json_signal_error(parser, Qjson_parse_error);
}
}
/* If there was no integer overflow during parsing the integer, this
puts 'value' to the output. Otherwise this calls string_to_number
to parse integer on the byte workspace. This could just always
call string_to_number, but for performance reasons, during parsing
the code tries to calculate the value, so in most cases, we can
save call of string_to_number */
static Lisp_Object json_create_integer(struct json_parser *parser, bool integer_overflow, bool negative,
ulong value) {
Lisp_Object o;
o.value = value;
if (!integer_overflow) {
if (negative) {
uintmax_t v = value;
if (v <= (uintmax_t)INTMAX_MAX + 1)
return o;
} else {
return o;
}
}
json_byte_workspace_put(parser, 0);
ptrdiff_t len = strlen((const char *)parser->byte_workspace);
if (len != parser->byte_workspace_current - parser->byte_workspace - 1)
json_signal_error(parser, Qjson_error);
return o;
}
/* Parses a float using the byte workspace */
static Lisp_Object json_create_float(struct json_parser *parser) {
Lisp_Object o;
json_byte_workspace_put(parser, 0);
errno = 0;
char *e;
double value = strtod((const char *)parser->byte_workspace, &e);
bool out_of_range = (errno != 0 && (value == HUGE_VAL || value == -HUGE_VAL));
if (out_of_range)
json_signal_error(parser, Qjson_number_out_of_range);
else if ((const unsigned char *)e != parser->byte_workspace_current - 1)
json_signal_error(parser, Qjson_error);
else {
o.value = value;
return o;
}
}
/* Parses a number. The first character is the input parameter 'c'.
*/
static Lisp_Object json_parse_number(struct json_parser *parser, int c) {
json_byte_workspace_reset(parser);
json_byte_workspace_put(parser, c);
Lisp_Object o;
o.value = 0;
bool negative = false;
if (c == '-') {
negative = true;
c = json_input_get(parser);
json_byte_workspace_put(parser, c);
parser->current_column++;
}
if (c < '0' || c > '9')
json_signal_error(parser, Qjson_parse_error);
/* The idea is that during finding the last character of the
number, the for loop below also tries to calculate the value. If
the parsed number is an integer which fits into unsigned long,
then the parser can use the value of 'integer' right away,
instead of having to re-parse the byte workspace later.
Ideally, this integer should have the same size as a CPU general
purpose register. */
unsigned long integer = c - '0';
bool integer_overflow = false;
if (integer == 0) {
if (json_input_at_eof(parser))
return o;
c = json_input_get(parser);
} else {
for (;;) {
if (json_input_at_eof(parser))
return json_create_integer(parser, integer_overflow, negative, integer);
c = json_input_get(parser);
if (c < '0' || c > '9')
break;
json_byte_workspace_put(parser, c);
parser->current_column++;
integer_overflow |= ckd_mul(&integer, integer, 10);
integer_overflow |= ckd_add(&integer, integer, c - '0');
}
}
bool is_float = false;
if (c == '.') {
json_byte_workspace_put(parser, c);
parser->current_column++;
is_float = true;
c = json_input_get(parser);
json_byte_workspace_put(parser, c);
parser->current_column++;
if (c < '0' || c > '9')
json_signal_error(parser, Qjson_parse_error);
for (;;) {
if (json_input_at_eof(parser))
return json_create_float(parser);
c = json_input_get(parser);
if (c < '0' || c > '9')
break;
json_byte_workspace_put(parser, c);
parser->current_column++;
}
}
if (c == 'e' || c == 'E') {
json_byte_workspace_put(parser, c);
parser->current_column++;
is_float = true;
c = json_input_get(parser);
json_byte_workspace_put(parser, c);
parser->current_column++;
if (c == '-' || c == '+') {
c = json_input_get(parser);
json_byte_workspace_put(parser, c);
parser->current_column++;
}
if (c < '0' || c > '9')
json_signal_error(parser, Qjson_parse_error);
for (;;) {
if (json_input_at_eof(parser))
return json_create_float(parser);
c = json_input_get(parser);
if (c < '0' || c > '9')
break;
json_byte_workspace_put(parser, c);
parser->current_column++;
}
}
/* 'c' contains a character which is not part of the number,
so it is need to be put back */
json_input_put_back(parser);
if (is_float)
return json_create_float(parser);
else
return json_create_integer(parser, integer_overflow, negative, integer);
}
static Lisp_Object json_parse_value(struct json_parser *parser, int c);
/* Parses a JSON array. */
static Lisp_Object json_parse_array(struct json_parser *parser) {
int c = json_skip_whitespace(parser);
const size_t begin_offset = parser->object_workspace_current - parser->object_workspace;
if (c != ']') {
parser->available_depth--;
if (parser->available_depth < 0)
json_signal_error(parser, Qjson_object_too_deep);
size_t number_of_elements = 0;
/* This loop collects the array elements in the object workspace
*/
for (;;) {
Lisp_Object element = json_parse_value(parser, c);
json_make_object_workspace_for(parser, 1);
*parser->object_workspace_current++ = element;
c = json_skip_whitespace(parser);
number_of_elements++;
if (c == ']') {
parser->available_depth++;
break;
}
if (c != ',')
json_signal_error(parser, Qjson_parse_error);
c = json_skip_whitespace(parser);
}
}
Lisp_Object result;
const Lisp_Object *b = parser->object_workspace + begin_offset;
size_t number_of_elements = parser->object_workspace_current - b;
void *array = malloc(number_of_elements*sizeof(Lisp_Object));
memcpy(array, b, number_of_elements*sizeof(Lisp_Object));
result.value = (size_t)array;
parser->object_workspace_current = parser->object_workspace + begin_offset;
return result;
}
/* Parses a JSON object. */
static Lisp_Object json_parse_object(struct json_parser *parser) {
int c = json_skip_whitespace(parser);
const size_t begin_offset = parser->object_workspace_current - parser->object_workspace;
if (c != '}') {
parser->available_depth--;
if (parser->available_depth < 0)
json_signal_error(parser, Qjson_object_too_deep);
/* This loop collects the object members (key/value pairs) in
* the object workspace */
for (;;) {
if (c != '"')
json_signal_error(parser, Qjson_parse_error);
Lisp_Object key;
json_byte_workspace_reset(parser);
json_parse_string(parser);
if (parser->byte_workspace_current > parser->byte_workspace) {
key.value = parser->byte_workspace[0] + parser->byte_workspace_current[-1];
} else {
key.value = 0;
}
c = json_skip_whitespace(parser);
if (c != ':')
json_signal_error(parser, Qjson_parse_error);
c = json_skip_whitespace(parser);
Lisp_Object value = json_parse_value(parser, c);
json_make_object_workspace_for(parser, 2);
*parser->object_workspace_current++ = key;
*parser->object_workspace_current++ = value;
c = json_skip_whitespace(parser);
if (c == '}') {
parser->available_depth++;
break;
}
if (c != ',')
json_signal_error(parser, Qjson_parse_error);
c = json_skip_whitespace(parser);
}
}
Lisp_Object result;
Lisp_Object *end = parser->object_workspace_current;
Lisp_Object *member = parser->object_workspace + begin_offset;
Lisp_Object *array = malloc((end - member)*sizeof(Lisp_Object));
while (member < end) {
*array++ = member[0];
*array++ = member[1];
member += 2;
}
result.value = (size_t)array;
parser->object_workspace_current = parser->object_workspace + begin_offset;
return result;
}
/* Token-char is not a JSON terminology. When parsing
null/false/true, this function tells the character set that is need
to be considered as part of a token. For example, if the input is
"truesomething", then the parser shouldn't consider it as "true",
and an additional later "something" token. An additional example:
if the input is "truetrue", then calling (json-parse-buffer) twice
shouldn't produce two successful calls which return t, but a
parsing error */
static bool json_is_token_char(int c) {
return ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || (c == '-'));
}
/* This is the entry point to the value parser, this parses a JSON
* value */
Lisp_Object json_parse_value(struct json_parser *parser, int c) {
if (c == '{')
return json_parse_object(parser);
else if (c == '[')
return json_parse_array(parser);
else if (c == '"') {
json_byte_workspace_reset(parser);
json_parse_string(parser);
Lisp_Object result;
if (parser->byte_workspace_current > parser->byte_workspace) {
result.value = parser->byte_workspace[0] + parser->byte_workspace_current[-1];
} else {
result.value = 0;
}
return result;
} else if ((c >= '0' && c <= '9') || (c == '-'))
return json_parse_number(parser, c);
else {
int c2 = json_input_get(parser);
int c3 = json_input_get(parser);
int c4 = json_input_get(parser);
int c5 = json_input_get_if_possible(parser);
if (c == 't' && c2 == 'r' && c3 == 'u' && c4 == 'e' && (c5 < 0 || !json_is_token_char(c5))) {
if (c5 >= 0)
json_input_put_back(parser);
parser->current_column += 4;
Lisp_Object o;
o.value = 0;
return o;
}
if (c == 'n' && c2 == 'u' && c3 == 'l' && c4 == 'l' && (c5 < 0 || !json_is_token_char(c5))) {
if (c5 >= 0)
json_input_put_back(parser);
parser->current_column += 4;
Lisp_Object o;
o.value = 1;
return o;
}
if (c == 'f' && c2 == 'a' && c3 == 'l' && c4 == 's' && c5 == 'e') {
int c6 = json_input_get_if_possible(parser);
if (c6 < 0 || !json_is_token_char(c6)) {
if (c6 >= 0)
json_input_put_back(parser);
parser->current_column += 5;
Lisp_Object o;
o.value = 2;
return o;
}
}
json_signal_error(parser, Qjson_parse_error);
}
}
enum ParseEndBehavior { PARSEENDBEHAVIOR_CheckForGarbage, PARSEENDBEHAVIOR_MovePoint };
static Lisp_Object json_parse(struct json_parser *parser, enum ParseEndBehavior parse_end_behavior) {
int c = json_skip_whitespace(parser);
Lisp_Object result = json_parse_value(parser, c);
switch (parse_end_behavior) {
case PARSEENDBEHAVIOR_CheckForGarbage:
c = json_skip_whitespace_if_possible(parser);
if (c >= 0)
json_signal_error(parser, Qjson_trailing_content);
break;
case PARSEENDBEHAVIOR_MovePoint: {
break;
}
}
return result;
}
void do_test(unsigned char *buffer, int length) {
struct json_parser p;
struct json_configuration conf;
/* unsigned char json_data[] = "[12, 34]"; */
json_parser_init(&p, conf, buffer, buffer + length, NULL, NULL);
int x = setjmp(signal_env);
if (x == 0) {
json_parse(&p, PARSEENDBEHAVIOR_CheckForGarbage);
/* } else { */
/* printf("error\n"); */
}
printf("free!\n");
json_parser_done(&p);
}
__AFL_FUZZ_INIT();
int main() {
#ifdef __AFL_HAVE_MANUAL_CONTROL
__AFL_INIT();
#endif
unsigned char *buf = __AFL_FUZZ_TESTCASE_BUF; // must be after __AFL_INIT
// and before __AFL_LOOP!
while (__AFL_LOOP(10000)) {
int len = __AFL_FUZZ_TESTCASE_LEN; // don't use the macro directly in a
// call!
if (len < 8) continue; // check for a required/useful minimum input length
/* Setup function call, e.g. struct target *tmp = libtarget_init() */
/* Call function to be fuzzed, e.g.: */
do_test(buf, len);
/* Reset state. e.g. libtarget_free(tmp) */
}
return 0;
}
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-10 6:31 ` Eli Zaretskii
@ 2024-03-10 21:39 ` Philip Kaludercic
2024-03-11 13:29 ` Eli Zaretskii
0 siblings, 1 reply; 51+ messages in thread
From: Philip Kaludercic @ 2024-03-10 21:39 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: Christopher Wellons, geza.herman, emacs-devel
Eli Zaretskii <eliz@gnu.org> writes:
>> Date: Sat, 9 Mar 2024 15:37:25 -0500
>> From: Christopher Wellons <wellons@nullprogram.com>
>> Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org>
>>
>> Despite the obvious care which with this was written, I personally would
>> not adopt a JSON parser that had not been thoroughly fuzz tested under
>> Address Sanitizer and Undefined Behavior Sanitizer. Fuzzing is incredibly
>> effective at finding defects, and it would be foolish not to use it in its
>> ideal circumstances. Normally it's not difficult and requires only a few
>> lines of code. But this JSON parser is tightly coupled with the Emacs Lisp
>> runtime, which greatly complicates things. I couldn't simply pluck it out
>> by itself and drop it in, say, AFL++.
>
> That's okay, we can start by making this an optional feature, and
> consider making it the default after a couple of major releases;
> meanwhile, any problems will be detected and reported.
>
> However, it would make much more sense to consider switching to this
> code if it also could handle producing JSON output, thus making
> libjansson unnecessary when we decide to switch.
If libjansson is not available, it should still be possible to use this
faster parser, while falling back to json.el for generating JSON? From
what I understand, the main issue people have with JSON is the parsing
bottleneck, right?
--
Philip Kaludercic on icterid
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-10 20:41 ` Herman, Géza
@ 2024-03-10 23:22 ` Christopher Wellons
2024-03-11 9:34 ` Herman, Géza
0 siblings, 1 reply; 51+ messages in thread
From: Christopher Wellons @ 2024-03-10 23:22 UTC (permalink / raw)
To: Herman, Géza; +Cc: emacs-devel@gnu.org
> It's been running for an hour, the tester didn't find any problems yet.
Except for an overflow assigning huge floats to the dummy Lisp_Object
value — which is a problem with the test, not the parser — this stripped
down version looks robust to me, too. Solid work! I have no further
feedback or commentary.
I made a few tweaks to harden the test, which did not change the results:
* Rather that directly use the AFL input buffer, it uses an exactly-sized
copy so that it could detect out of bounds access on input.
* Used a custom malloc that initializes memory to garbage.
* Used a custom realloc that always moves, and initializes extended memory
to garbage.
> it incorrectly optimizes around setjmp in do_test()
If a program modifies a variable after the first setjmp return and then
accesses it after the second setjmp return, it must be volatile-qualified.
GCC and Clang have some machinery to mitigate a lack of volatile, such as
the returns_twice attribute, but technically it's required. I believe in
practice using either builtin setjmp/longjmp or a memory barrier would be
sufficient, but my system's Clang doesn't exhibit the stale pointer here,
so I can't test that theory.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-10 23:22 ` Christopher Wellons
@ 2024-03-11 9:34 ` Herman, Géza
2024-03-11 13:47 ` Christopher Wellons
0 siblings, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-11 9:34 UTC (permalink / raw)
To: Christopher Wellons; +Cc: Herman, Géza, emacs-devel@gnu.org
Christopher Wellons <wellons@nullprogram.com> writes:
>> It's been running for an hour, the tester didn't find any
>> problems yet.
>
> Except for an overflow assigning huge floats to the dummy
> Lisp_Object
> value — which is a problem with the test, not the parser — this
> stripped down version looks robust to me, too. Solid work! I
> have no
> further feedback or commentary.
Cool! Out of curiosity, how did you find the overflow problem?
Did you just notice it, or did the fuzzer/sanitizer find it?
> I made a few tweaks to harden the test, which did not change the
> results:
Thanks for the work! I think that we can say then that the new
parser works OK, it's unlikely that it has any serious problem.
>> it incorrectly optimizes around setjmp in do_test()
>
> If a program modifies a variable after the first setjmp return
> and
> then accesses it after the second setjmp return, it must be
> volatile-qualified.
Good to know, thanks for the info!
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-10 21:39 ` Philip Kaludercic
@ 2024-03-11 13:29 ` Eli Zaretskii
2024-03-11 14:05 ` Mattias Engdegård
0 siblings, 1 reply; 51+ messages in thread
From: Eli Zaretskii @ 2024-03-11 13:29 UTC (permalink / raw)
To: Philip Kaludercic; +Cc: wellons, geza.herman, emacs-devel
> From: Philip Kaludercic <philipk@posteo.net>
> Cc: Christopher Wellons <wellons@nullprogram.com>, geza.herman@gmail.com,
> emacs-devel@gnu.org
> Date: Sun, 10 Mar 2024 21:39:29 +0000
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> >> Date: Sat, 9 Mar 2024 15:37:25 -0500
> >> From: Christopher Wellons <wellons@nullprogram.com>
> >> Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org>
> >>
> >> Despite the obvious care which with this was written, I personally would
> >> not adopt a JSON parser that had not been thoroughly fuzz tested under
> >> Address Sanitizer and Undefined Behavior Sanitizer. Fuzzing is incredibly
> >> effective at finding defects, and it would be foolish not to use it in its
> >> ideal circumstances. Normally it's not difficult and requires only a few
> >> lines of code. But this JSON parser is tightly coupled with the Emacs Lisp
> >> runtime, which greatly complicates things. I couldn't simply pluck it out
> >> by itself and drop it in, say, AFL++.
> >
> > That's okay, we can start by making this an optional feature, and
> > consider making it the default after a couple of major releases;
> > meanwhile, any problems will be detected and reported.
> >
> > However, it would make much more sense to consider switching to this
> > code if it also could handle producing JSON output, thus making
> > libjansson unnecessary when we decide to switch.
>
> If libjansson is not available, it should still be possible to use this
> faster parser, while falling back to json.el for generating JSON? From
> what I understand, the main issue people have with JSON is the parsing
> bottleneck, right?
What you describe are possible fallbacks, but I would prefer not to
use any fallback at all, but instead have a full C implementation.
(Using json.el has one unfortunate disadvantage that it's incompatible
with the json.c APIs.)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-11 9:34 ` Herman, Géza
@ 2024-03-11 13:47 ` Christopher Wellons
0 siblings, 0 replies; 51+ messages in thread
From: Christopher Wellons @ 2024-03-11 13:47 UTC (permalink / raw)
To: Herman, Géza; +Cc: emacs-devel@gnu.org
> did the fuzzer/sanitizer find it?
Clang, but not GCC, places a UBSan check on float to integer conversions,
and that check was tripped when I fuzzed a Clang build. Example:
int main(void)
{
int x = 1e10;
}
Nothing from GCC:
$ gcc -w -g3 -fsanitize=undefined example.c
$ ./a.out
$
But with Clang:
$ clang -w -g3 -fsanitize=undefined example.c
$ ./a.out
example.c:3:13: runtime error: 1e+10 is outside the range of representable values of type 'int'
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-11 13:29 ` Eli Zaretskii
@ 2024-03-11 14:05 ` Mattias Engdegård
2024-03-11 14:35 ` Herman, Géza
0 siblings, 1 reply; 51+ messages in thread
From: Mattias Engdegård @ 2024-03-11 14:05 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: Philip Kaludercic, wellons, geza.herman, emacs-devel
11 mars 2024 kl. 14.29 skrev Eli Zaretskii <eliz@gnu.org>:
> What you describe are possible fallbacks, but I would prefer not to
> use any fallback at all, but instead have a full C implementation.
Yes, I definitely think we should do that. I'm pretty sure that writing a JSON unparser is a lot easier than doing the parser, and the extra speed we stand to gain from not having the intermediate jansson step is not without interest.
Overall the proposed parser looks fine, nothing terribly wrong that can't be fixed later on. A few minor points:
* The `is_single_uninteresting` array is hard to review and badly formatted. It appears to be 1 for all printable ASCII plus DEL except double-quote and backslash. (Why DEL?)
Happy if you would make that clearer, perhaps by formatting or initialising it at run-time. (I don't think we can use GCC-style array interval designators, alas.)
* Do you really need to maintain line and column during the parse? If you want them for error reporting, you can materialise them from the offset that you already have.
* Are you sure that GC can't run during parsing or that all your Lisp objects are reachable directly from the stack? (It's the `object_workspace` in particular that's worrying me a bit.)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-11 14:05 ` Mattias Engdegård
@ 2024-03-11 14:35 ` Herman, Géza
2024-03-12 9:26 ` Mattias Engdegård
0 siblings, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-11 14:35 UTC (permalink / raw)
To: Mattias Engdegård
Cc: Eli Zaretskii, Philip Kaludercic, wellons, geza.herman,
emacs-devel
Mattias Engdegård <mattias.engdegard@gmail.com> writes:
> 11 mars 2024 kl. 14.29 skrev Eli Zaretskii <eliz@gnu.org>:
>
>> What you describe are possible fallbacks, but I would prefer
>> not to
>> use any fallback at all, but instead have a full C
>> implementation.
>
> Yes, I definitely think we should do that. I'm pretty sure that
> writing a JSON unparser is a lot easier than doing the parser,
> and the
> extra speed we stand to gain from not having the intermediate
> jansson
> step is not without interest.
FYI: I checked out a JSON benchmark, and it turned out that
jansson is not a fast parser, there are faster libraries. If a
library has a SAX interface, that could be a potentially useful
library for Emacs. According to
https://github.com/miloyip/nativejson-benchmark, RapidJSON is at
least 10x faster than jansson. I'm just saying this because Emacs
doesn't have to stick with my parser, there are possible
alternatives, which have JSON serializers as well.
(But note: I am happy to make my parser into a mergeable state,
and if eventually it gets merged then fixing its bugs, but I'm not
motivated to work on integrating other JSON libraries).
> Overall the proposed parser looks fine, nothing terribly wrong
> that can't be fixed later on. A few minor points:
>
> * The `is_single_uninteresting` array is hard to review and
> badly
> formatted. It appears to be 1 for all printable ASCII plus DEL
> except
> double-quote and backslash. (Why DEL?)
Yep, the formatting of that table got destroyed when I reformatted
the code into GNU style. Now I formatted the table back, and
added comments for each row/col. Here's the latest version:
https://github.com/geza-herman/emacs/commit/4b5895636c1ec06e630baf47881b246c198af056.patch
I'm not sure about DEL: I haven't seen anything which says that
it's an invalid character in a string, so the parser currently
allows it.
> * Do you really need to maintain line and column during the
> parse? If
> you want them for error reporting, you can materialise them from
> the
> offset that you already have.
Yeah, I thought of that, but it turned out that maintaining the
line/column doesn't have an impact on performance. I added that
easily, tough admittedly it's a little bit awkward to maintain
these variables. If emacs has a way to tell from the byte-pointer
the line/col position (both for strings and buffers), I am happy
to use that instead. It would be a better solution, because
currently the parser always starts from line 1, col 1, which means
that if json-parse-buffer is used, these numbers will be local to
the current parsing, not actual numbers related to the whole
buffer. But as the jansson based parsed behaves the same, I
thought it's OK.
> * Are you sure that GC can't run during parsing or that all your
> Lisp
> objects are reachable directly from the stack? (It's the
> `object_workspace` in particular that's worrying me a bit.)
That's a very good question. I suppose that object_workspace is
invisible to the Lisp VM, as it is just a malloc'd object. But
I've never seen a problem because of this. What triggers the GC?
Is it possible that for the duration of the whole parsing, GC is
never get triggered? Otherwise it should have GCd the objects in
object_workspace, causing problems (I tried this parser in a loop,
where GC is caused hundreds of times. In the loop, I compared the
result to json-read, everything was fine).
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-11 14:35 ` Herman, Géza
@ 2024-03-12 9:26 ` Mattias Engdegård
2024-03-12 10:20 ` Gerd Möllmann
2024-03-12 10:58 ` Herman, Géza
0 siblings, 2 replies; 51+ messages in thread
From: Mattias Engdegård @ 2024-03-12 9:26 UTC (permalink / raw)
To: "Herman, Géza"
Cc: Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel
11 mars 2024 kl. 15.35 skrev Herman, Géza <geza.herman@gmail.com>:
> According to https://github.com/miloyip/nativejson-benchmark, RapidJSON is at least 10x faster than jansson. I'm just saying this because Emacs doesn't have to stick with my parser, there are possible alternatives, which have JSON serializers as well.
Thanks for the benchmark page reference. Yes, if this turns out to matter more we may consider a faster library. Right now I think your efforts are good enough (at least if we finish the job with a JSON serialiser).
> Yep, the formatting of that table got destroyed when I reformatted the code into GNU style. Now I formatted the table back, and added comments for each row/col. Here's the latest version: https://github.com/geza-herman/emacs/commit/4b5895636c1ec06e630baf47881b246c198af056.patch
Much better, thank you.
>> * Do you really need to maintain line and column during the parse? If
>> you want them for error reporting, you can materialise them from the
>> offset that you already have.
>
> Yeah, I thought of that, but it turned out that maintaining the line/column doesn't have an impact on performance.
That's just because your code isn't fast enough! We are very disappointed. Very.
> I added that easily, tough admittedly it's a little bit awkward to maintain these variables. If emacs has a way to tell from the byte-pointer the line/col position (both for strings and buffers), I am happy to use that instead.
Since error handling isn't performance-critical it doesn't matter if it's a bit slow. (I'd just count newlines.)
>> * Are you sure that GC can't run during parsing or that all your Lisp
>> objects are reachable directly from the stack? (It's the
>> `object_workspace` in particular that's worrying me a bit.)
>
> That's a very good question. I suppose that object_workspace is invisible to the Lisp VM, as it is just a malloc'd object. But I've never seen a problem because of this. What triggers the GC? Is it possible that for the duration of the whole parsing, GC is never get triggered? Otherwise it should have GCd the objects in object_workspace, causing problems (I tried this parser in a loop, where GC is caused hundreds of times. In the loop, I compared the result to json-read, everything was fine).
You can't test that code is GC-safe, you have to show that it's correct by design.
Looking at the code it is quite possible that GC cannot take place. But it can signal errors, and getting into the debugger should open GC windows unless I'm mistaken.
There are some options. `record_unwind_protect_ptr_mark` would be one, and it was made for code like this, but Gerd has been grumbling about it lately. Perhaps it's easier just to disable GC in the dynamic scope (inhibit_garbage_collection).
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-12 9:26 ` Mattias Engdegård
@ 2024-03-12 10:20 ` Gerd Möllmann
2024-03-12 11:14 ` Mattias Engdegård
2024-03-15 13:35 ` Herman, Géza
2024-03-12 10:58 ` Herman, Géza
1 sibling, 2 replies; 51+ messages in thread
From: Gerd Möllmann @ 2024-03-12 10:20 UTC (permalink / raw)
To: Mattias Engdegård
Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons,
emacs-devel
Mattias Engdegård <mattias.engdegard@gmail.com> writes:
> There are some options. `record_unwind_protect_ptr_mark` would be one,
> and it was made for code like this, but Gerd has been grumbling about
> it lately. Perhaps it's easier just to disable GC in the dynamic scope
> (inhibit_garbage_collection).
Yes, I'm grumbling because record_..._mark assumes GC is mark-sweep. And
it's really super ugly to work around.
And if you disable GC, I'll be grumbling because that's not good for a
concurrent GC.
What do I propose: If you can, please make objects reachable from the
control stack. For instance, create a context object on the stack that
has a Lisp vector member. That will prevent GC'ing the vector and by
that its contents, of course. You could pass a pointer to the context
around to get access to the vector in subroutines, replace it with a
larger vector if necessary, and so on.
Howsoever, that needed to be said! ;-).
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-12 9:26 ` Mattias Engdegård
2024-03-12 10:20 ` Gerd Möllmann
@ 2024-03-12 10:58 ` Herman, Géza
2024-03-12 13:11 ` Mattias Engdegård
1 sibling, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-12 10:58 UTC (permalink / raw)
To: Mattias Engdegård
Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons,
emacs-devel
Mattias Engdegård <mattias.engdegard@gmail.com> writes:
>> I added that easily, tough admittedly it's a little bit
>> awkward to
>> maintain these variables. If emacs has a way to tell from the
>> byte-pointer the line/col position (both for strings and
>> buffers), I
>> am happy to use that instead.
>
> Since error handling isn't performance-critical it doesn't
> matter if it's a bit slow. (I'd just count newlines.)
Basically this is what the current code does, it's just not
postponed until an actual error, but registered during parsing.
I'm tempted to keep it as is, as line/col information can be
useful in other circumstances as well. Like, for example, if we
wanted to tag the created objects with their source location.
> You can't test that code is GC-safe, you have to show that it's
> correct by design.
Sure, but there has to be an explanation why the current way
doesn't have any problems. Having a glance at garbage collection
in emacs, it seems that it only runs during elisp code execution
(is this true?). As the parser doesn't call back to any elisp
code which runs the VM, it should be safe. If GC could happen any
time, then I suppose the whole C Emacs code should be checked for
that, because one can never be sure that if something is allocated
at the C side, then at the very next moment it will be immediately
freed by the GC. Conceptually, I see no difference between calling
a single Fcons vs. the what the whole parser does. If calling
Fcons and then using its result is safe, then the parser should
also be safe. Or is there some magic about the GC which makes this
argument false?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-12 10:20 ` Gerd Möllmann
@ 2024-03-12 11:14 ` Mattias Engdegård
2024-03-12 11:33 ` Gerd Möllmann
2024-03-15 13:35 ` Herman, Géza
1 sibling, 1 reply; 51+ messages in thread
From: Mattias Engdegård @ 2024-03-12 11:14 UTC (permalink / raw)
To: Gerd Möllmann
Cc: "Herman, Géza", Eli Zaretskii, Philip Kaludercic,
wellons, emacs-devel
12 mars 2024 kl. 11.20 skrev Gerd Möllmann <gerd.moellmann@gmail.com>:
> Yes, I'm grumbling because record_..._mark assumes GC is mark-sweep. And
> it's really super ugly to work around.
>
> And if you disable GC, I'll be grumbling because that's not good for a
> concurrent GC.
These are problems we clearly need to solve, but I don't think we should hold Géza's improved JSON parser hostage to our desired future GC demands. Let's help him finish his work in the Emacs that we have today and be explicit about our assumptions and needs. The proposed code's requirements are not unique, nor do the solutions need to be.
> What do I propose: If you can, please make objects reachable from the
> control stack. For instance, create a context object on the stack that
> has a Lisp vector member.
The problem is that with the GC that we actually have, xmalloc/xfree is a lot faster than creating a Lisp vector and letting the GC collect it later on.
One possible way out might be providing a way to free a vector ahead of time, akin to free_cons, and create a vector type with a fill pointer so that it does not need to be initialised nor traversed in full if only part of it is used.
The last part is straightforward, although it complicates vectorlike traversal a bit. Ahead-of-time freeing might be unnecessary with a modern collector but that needs to be verified.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-12 11:14 ` Mattias Engdegård
@ 2024-03-12 11:33 ` Gerd Möllmann
0 siblings, 0 replies; 51+ messages in thread
From: Gerd Möllmann @ 2024-03-12 11:33 UTC (permalink / raw)
To: Mattias Engdegård
Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons,
emacs-devel
Mattias Engdegård <mattias.engdegard@gmail.com> writes:
> The problem is that with the GC that we actually have, xmalloc/xfree
> is a lot faster than creating a Lisp vector and letting the GC collect
> it later on.
Will any user ever notice a difference? I doubt it.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-12 10:58 ` Herman, Géza
@ 2024-03-12 13:11 ` Mattias Engdegård
2024-03-12 13:42 ` Mattias Engdegård
2024-03-12 15:23 ` Herman, Géza
0 siblings, 2 replies; 51+ messages in thread
From: Mattias Engdegård @ 2024-03-12 13:11 UTC (permalink / raw)
To: "Herman, Géza"
Cc: Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel,
Gerd Möllmann
12 mars 2024 kl. 11.58 skrev Herman, Géza <geza.herman@gmail.com>:
>> You can't test that code is GC-safe, you have to show that it's correct by design.
>
> Sure, but there has to be an explanation why the current way doesn't have any problems.
It doesn't matter -- we don't need to prove to you why you don't get a segfault, it's you who need to convince us that your code is fine.
Most C code does not need any special attention as long as it only places Lisp roots in local variables, ie, on the C stack and in registers. Your code keeps live roots in heap-allocated memory. It's not alone in doing that, but elsewhere we give those roots special attention, either by explicit GC marking or disabling GC during the lifespan of those roots.
As I said your code is probably safe unless it signals an error.
An alternative is to do as Gerd suggested and only have roots on the stack, by allocating vectors (or lists) for the task. However:
12 mars 2024 kl. 12.33 skrev Gerd Möllmann <gerd.moellmann@gmail.com>:
> Will any user ever notice a difference? I doubt it.
I can't say, and these things are not always easy to measure. A Lisp allocation has the drawbacks:
- Allocation requires initialisation of the entire block, not just the used part.
- Deallocation is delayed until the next GC.
- The cache lines are effectively wasted.
- If the code is run again before next GC, it will have to allocate more Lisp objects.
- GC happens more frequently because allocation advances the GC clock.
- GC takes longer because it has to sweep more dead objects.
One compromise in this case (json_parse_array) would be to collect array elements into a Lisp list which is then turned into a Lisp vector. In the latter case we could use free_cons to undo the consing if we want.
Consing and list traversal is slow but perhaps not bad enough; measurement is needed.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-12 13:11 ` Mattias Engdegård
@ 2024-03-12 13:42 ` Mattias Engdegård
2024-03-12 15:23 ` Herman, Géza
1 sibling, 0 replies; 51+ messages in thread
From: Mattias Engdegård @ 2024-03-12 13:42 UTC (permalink / raw)
To: "Herman, Géza"
Cc: Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel,
Gerd Möllmann
12 mars 2024 kl. 14.11 skrev Mattias Engdegård <mattias.engdegard@gmail.com>:
> As I said your code is probably safe unless it signals an error.
Actually, as long as the heap-allocated scratch space isn't touched during error handling and unwind, that should not be a problem either, so we may be entirely safe here. We should explain why in the code, of course.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-12 13:11 ` Mattias Engdegård
2024-03-12 13:42 ` Mattias Engdegård
@ 2024-03-12 15:23 ` Herman, Géza
2024-03-12 15:39 ` Gerd Möllmann
1 sibling, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-12 15:23 UTC (permalink / raw)
To: Mattias Engdegård
Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons,
emacs-devel, Gerd Möllmann
Mattias Engdegård <mattias.engdegard@gmail.com> writes:
> 12 mars 2024 kl. 11.58 skrev Herman, Géza
> <geza.herman@gmail.com>:
>
>>> You can't test that code is GC-safe, you have to show that
>>> it's correct by design.
>>
>> Sure, but there has to be an explanation why the current way
>> doesn't have any problems.
>
> It doesn't matter -- we don't need to prove to you why you don't
> get a segfault, it's you who need to convince us that your code
> is fine.
>
> Most C code does not need any special attention as long as it
> only
> places Lisp roots in local variables, ie, on the C stack and in
> registers. Your code keeps live roots in heap-allocated
> memory. It's
> not alone in doing that, but elsewhere we give those roots
> special
> attention, either by explicit GC marking or disabling GC during
> the
> lifespan of those roots.
>
> As I said your code is probably safe unless it signals an error.
Sorry, but I don't understand. If my parser puts roots in
heap-allocated memory (that's true), and doesn't give such roots
special attention (that's also true), then why is it probably
safe?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-12 15:23 ` Herman, Géza
@ 2024-03-12 15:39 ` Gerd Möllmann
0 siblings, 0 replies; 51+ messages in thread
From: Gerd Möllmann @ 2024-03-12 15:39 UTC (permalink / raw)
To: Herman, Géza
Cc: Mattias Engdegård, Eli Zaretskii, Philip Kaludercic, wellons,
emacs-devel
"Herman, Géza" <geza.herman@gmail.com> writes:
> Sorry, but I don't understand. If my parser puts roots in
> heap-allocated memory (that's true), and doesn't give such roots
> special attention (that's also true), then why is it probably safe?
Assumin GC doesn't run, nothing can happen.
For me personally, that assumption would be too dangerous. At one point,
Emacs' regexp engine made that assumption, GC did run, because of
something N levels down a call stack, GC compacted strings, and all that
was really nice to debug :-).
My 2 cents :-)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-12 10:20 ` Gerd Möllmann
2024-03-12 11:14 ` Mattias Engdegård
@ 2024-03-15 13:35 ` Herman, Géza
2024-03-15 14:56 ` Gerd Möllmann
2024-03-19 18:49 ` Mattias Engdegård
1 sibling, 2 replies; 51+ messages in thread
From: Herman, Géza @ 2024-03-15 13:35 UTC (permalink / raw)
To: Gerd Möllmann
Cc: Mattias Engdegård, Herman, Géza, Eli Zaretskii,
Philip Kaludercic, wellons, emacs-devel
Gerd Möllmann <gerd.moellmann@gmail.com> writes:
> Mattias Engdegård <mattias.engdegard@gmail.com> writes:
>
>> There are some options. `record_unwind_protect_ptr_mark` would
>> be one,
>> and it was made for code like this, but Gerd has been grumbling
>> about
>> it lately. Perhaps it's easier just to disable GC in the
>> dynamic scope
>> (inhibit_garbage_collection).
>
> Yes, I'm grumbling because record_..._mark assumes GC is
> mark-sweep. And
> it's really super ugly to work around.
>
> And if you disable GC, I'll be grumbling because that's not good
> for a
> concurrent GC.
>
> What do I propose: If you can, please make objects reachable
> from the
> control stack. For instance, create a context object on the
> stack that
> has a Lisp vector member. That will prevent GC'ing the vector
> and by
> that its contents, of course. You could pass a pointer to the
> context
> around to get access to the vector in subroutines, replace it
> with a
> larger vector if necessary, and so on.
I implemented this idea, here is the latest version (now,
object_workspace is a Lisp_Object):
https://github.com/geza-herman/emacs/commit/12decaddc9b8260745be4d5782fea21f4578eab2.patch
Is there anything that need to be done?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-15 13:35 ` Herman, Géza
@ 2024-03-15 14:56 ` Gerd Möllmann
2024-03-19 18:49 ` Mattias Engdegård
1 sibling, 0 replies; 51+ messages in thread
From: Gerd Möllmann @ 2024-03-15 14:56 UTC (permalink / raw)
To: Herman, Géza
Cc: Mattias Engdegård, Eli Zaretskii, Philip Kaludercic, wellons,
emacs-devel
"Herman, Géza" <geza.herman@gmail.com> writes:
>> What do I propose: If you can, please make objects reachable from the
>> control stack. For instance, create a context object on the stack
>> that has a Lisp vector member. That will prevent GC'ing the vector
>> and by that its contents, of course. You could pass a pointer to the
>> context around to get access to the vector in subroutines, replace it
>> with a larger vector if necessary, and so on.
>
> I implemented this idea, here is the latest version (now,
> object_workspace is a Lisp_Object):
>
> https://github.com/geza-herman/emacs/commit/12decaddc9b8260745be4d5782fea21f4578eab2.patch
>
> Is there anything that need to be done?
Thanks, that looks very good, I'm happy :-).
By putting a struct json_parser on the stack as a local variable, its
.object_workspace itself and everything rechable from it are safe from
GC.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-15 13:35 ` Herman, Géza
2024-03-15 14:56 ` Gerd Möllmann
@ 2024-03-19 18:49 ` Mattias Engdegård
2024-03-19 19:05 ` Herman, Géza
2024-03-19 19:13 ` Gerd Möllmann
1 sibling, 2 replies; 51+ messages in thread
From: Mattias Engdegård @ 2024-03-19 18:49 UTC (permalink / raw)
To: "Herman, Géza"
Cc: Gerd Möllmann, Eli Zaretskii, Philip Kaludercic, wellons,
emacs-devel
15 mars 2024 kl. 14.35 skrev Herman, Géza <geza.herman@gmail.com>:
> I implemented this idea, here is the latest version (now, object_workspace is a Lisp_Object):
That's considerably slower than before, especially for small inputs, so it's probably a no-go.
I definitely want to make Gerd's efforts no harder than necessary but really don't think we should do pay the cost up-front like this. If required for a better GC then fine, but let's do that work in the branch for that effort instead of incurring overhead in Emacs now.
> Is there anything that need to be done?
It's probably a good idea to open a bug for this task, since it provides a better and more focussed record of the discussion, with a single bug number.
Here are some remaining tasks:
1. Go back to the previous version which was much faster.
2. Don't allocate any temporary storage before you know that it's actually necessary.
3. Stop using the object_workspace when not required, which is everywhere except possibly when reading arrays into Lisp vectors.
4. Write a JSON serialiser.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-19 18:49 ` Mattias Engdegård
@ 2024-03-19 19:05 ` Herman, Géza
2024-03-19 19:18 ` Gerd Möllmann
2024-03-19 19:13 ` Gerd Möllmann
1 sibling, 1 reply; 51+ messages in thread
From: Herman, Géza @ 2024-03-19 19:05 UTC (permalink / raw)
To: Mattias Engdegård
Cc: Herman, Géza, Gerd Möllmann, Eli Zaretskii,
Philip Kaludercic, wellons, emacs-devel
Mattias Engdegård <mattias.engdegard@gmail.com> writes:
> 15 mars 2024 kl. 14.35 skrev Herman, Géza
> <geza.herman@gmail.com>:
>
>> I implemented this idea, here is the latest version (now,
>> object_workspace is a Lisp_Object):
>
> That's considerably slower than before, especially for small
> inputs, so it's probably a no-go.
Have you benchmarked it? I really doubt that there is a
significant performance difference. Maybe it's possible to find
some case, where this modification matters a bit, but for most
cases, the difference should be negligible. The amount of
allocations done is the same in both cases, which is for large
files, only 20-30 allocations. Compared to the thousands of
allocations for storing the actual Lisp objects.
> Here are some remaining tasks:
>
> 2. Don't allocate any temporary storage before you know that
> it's actually necessary.
Except for the very-very trivial cases, memory allocation is
always necessary. I can lower the allocation sizes, but I don't
think it's worth it. We're only talking about sizes of the KB
range.
> 3. Stop using the object_workspace when not required, which is
> everywhere except possibly when reading arrays into Lisp
> vectors.
object_workspace is necessary whenever a JSON contains an object
or array. So basically always.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-19 18:49 ` Mattias Engdegård
2024-03-19 19:05 ` Herman, Géza
@ 2024-03-19 19:13 ` Gerd Möllmann
1 sibling, 0 replies; 51+ messages in thread
From: Gerd Möllmann @ 2024-03-19 19:13 UTC (permalink / raw)
To: Mattias Engdegård
Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons,
emacs-devel
Mattias Engdegård <mattias.engdegard@gmail.com> writes:
> 15 mars 2024 kl. 14.35 skrev Herman, Géza <geza.herman@gmail.com>:
>
>> I implemented this idea, here is the latest version (now, object_workspace is a Lisp_Object):
>
> That's considerably slower than before, especially for small inputs,
> so it's probably a no-go.
Can you give numbers?
>
> I definitely want to make Gerd's efforts no harder than necessary but
> really don't think we should do pay the cost up-front like this. If
> required for a better GC then fine, but let's do that work in the
> branch for that effort instead of incurring overhead in Emacs now.
Please don't take my GC efforts into consideration. That may succeed or
not. But this is also a matter of good design, using the stack, (which
BTW pdumper does, too), vs. bad design.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser
2024-03-19 19:05 ` Herman, Géza
@ 2024-03-19 19:18 ` Gerd Möllmann
0 siblings, 0 replies; 51+ messages in thread
From: Gerd Möllmann @ 2024-03-19 19:18 UTC (permalink / raw)
To: Herman, Géza
Cc: Mattias Engdegård, Eli Zaretskii, Philip Kaludercic, wellons,
emacs-devel
"Herman, Géza" <geza.herman@gmail.com> writes:
> Mattias Engdegård <mattias.engdegard@gmail.com> writes:
>
>> 15 mars 2024 kl. 14.35 skrev Herman, Géza <geza.herman@gmail.com>:
>>
>>> I implemented this idea, here is the latest version (now,
>>> object_workspace is a Lisp_Object):
>>
>> That's considerably slower than before, especially for small inputs,
>> so it's probably a no-go.
> Have you benchmarked it? I really doubt that there is a significant
> performance difference. Maybe it's possible to find some case, where
> this modification matters a bit, but for most cases, the difference
> should be negligible. The amount of allocations done is the same in
> both cases, which is for large files, only 20-30 allocations.
> Compared to the thousands of allocations for storing the actual Lisp
> objects.
Right, if N is the number of objects allocated for JSON, the usual
strategy of doubling buffers sizes yields Log(N) buffer allocations.
^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2024-03-19 19:18 UTC | newest]
Thread overview: 51+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-03-08 10:27 I created a faster JSON parser Herman, Géza
2024-03-08 11:41 ` Philip Kaludercic
2024-03-08 12:34 ` Herman, Géza
2024-03-08 12:03 ` Eli Zaretskii
2024-03-08 12:38 ` Herman, Géza
2024-03-08 12:59 ` Eli Zaretskii
2024-03-08 13:12 ` Herman, Géza
2024-03-08 14:10 ` Eli Zaretskii
2024-03-08 14:24 ` Collin Funk
2024-03-08 15:20 ` Herman, Géza
2024-03-08 16:22 ` Eli Zaretskii
2024-03-08 18:34 ` Herman, Géza
2024-03-08 19:57 ` Eli Zaretskii
2024-03-08 20:22 ` Herman, Géza
2024-03-09 6:52 ` Eli Zaretskii
2024-03-09 11:08 ` Herman, Géza
2024-03-09 12:23 ` Lynn Winebarger
2024-03-09 12:58 ` Po Lu
2024-03-09 13:13 ` Eli Zaretskii
2024-03-09 14:00 ` Herman, Géza
2024-03-09 14:21 ` Eli Zaretskii
2024-03-08 13:28 ` Po Lu
2024-03-08 16:14 ` Herman, Géza
2024-03-09 1:55 ` Po Lu
2024-03-09 20:37 ` Christopher Wellons
2024-03-10 6:31 ` Eli Zaretskii
2024-03-10 21:39 ` Philip Kaludercic
2024-03-11 13:29 ` Eli Zaretskii
2024-03-11 14:05 ` Mattias Engdegård
2024-03-11 14:35 ` Herman, Géza
2024-03-12 9:26 ` Mattias Engdegård
2024-03-12 10:20 ` Gerd Möllmann
2024-03-12 11:14 ` Mattias Engdegård
2024-03-12 11:33 ` Gerd Möllmann
2024-03-15 13:35 ` Herman, Géza
2024-03-15 14:56 ` Gerd Möllmann
2024-03-19 18:49 ` Mattias Engdegård
2024-03-19 19:05 ` Herman, Géza
2024-03-19 19:18 ` Gerd Möllmann
2024-03-19 19:13 ` Gerd Möllmann
2024-03-12 10:58 ` Herman, Géza
2024-03-12 13:11 ` Mattias Engdegård
2024-03-12 13:42 ` Mattias Engdegård
2024-03-12 15:23 ` Herman, Géza
2024-03-12 15:39 ` Gerd Möllmann
2024-03-10 6:58 ` Herman, Géza
2024-03-10 16:54 ` Christopher Wellons
2024-03-10 20:41 ` Herman, Géza
2024-03-10 23:22 ` Christopher Wellons
2024-03-11 9:34 ` Herman, Géza
2024-03-11 13:47 ` Christopher Wellons
Code repositories for project(s) associated with this public inbox
https://git.savannah.gnu.org/cgit/emacs.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).