* I created a faster JSON parser @ 2024-03-08 10:27 Herman, Géza 2024-03-08 11:41 ` Philip Kaludercic ` (3 more replies) 0 siblings, 4 replies; 51+ messages in thread From: Herman, Géza @ 2024-03-08 10:27 UTC (permalink / raw) To: emacs-devel@gnu.org Hi, I created a faster JSON parser for Emacs, you can check it out here: https://github.com/geza-herman/emacs/tree/faster-json-parsing It replaces the json-parse-string and json-parse-buffer functions. The behavior should be the same as before, with the only exception that objects with duplicated keys are not detected if :object-type is not 'hash-table. This parser runs 8-9x faster than the jansson-based parser on my machine (tested on clangd language server messages). An additional tiny benefit is that large integers are parsed, instead of signaling an "out of range" error. What do you think? Geza ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 10:27 I created a faster JSON parser Herman, Géza @ 2024-03-08 11:41 ` Philip Kaludercic 2024-03-08 12:34 ` Herman, Géza 2024-03-08 12:03 ` Eli Zaretskii ` (2 subsequent siblings) 3 siblings, 1 reply; 51+ messages in thread From: Philip Kaludercic @ 2024-03-08 11:41 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org "Herman, Géza" <geza.herman@gmail.com> writes: > Hi, > > I created a faster JSON parser for emacs, you can check it out here: > > https://github.com/geza-herman/emacs/tree/faster-json-parsing For convenience, this is a URL to the patch: https://github.com/geza-herman/emacs/commit/d76feeeac3ce397f15aebdde8b9deae676b0bf1e.patch > > It replaces json-parse-string and json-parse-buffer functions. The > behavior should be the same as before, with the only exception that > objects with duplicated keys are not detected if :object-type is not > 'hash-table. Is that a problem? > This parser runs 8-9x faster than the jansson based parser on my > machine (tested on clangd language server messages). An additional > tiny benefit is that large integers are parsed, instead of having an > "out of range" error. That sounds interesting, but I am reminded of this article: https://seriot.ch/projects/parsing_json.html. There seem to be plenty of difficult edge cases when dealing with JSON input that should probably be tested if Emacs has its own custom parser built-in. > What do you think? > > Geza -- Philip Kaludercic on peregrine ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 11:41 ` Philip Kaludercic @ 2024-03-08 12:34 ` Herman, Géza 0 siblings, 0 replies; 51+ messages in thread From: Herman, Géza @ 2024-03-08 12:34 UTC (permalink / raw) To: Philip Kaludercic; +Cc: Herman, Géza, emacs-devel@gnu.org Philip Kaludercic <philipk@posteo.net> writes: > "Herman, Géza" <geza.herman@gmail.com> writes: > >> It replaces json-parse-string and json-parse-buffer >> functions. The >> behavior should be the same as before, with the only exception >> that >> objects with duplicated keys are not detected if :object-type >> is not >> 'hash-table. > > Is that a problem? Not sure. I just mentioned it because it's a behavior change. But I intentionally designed it this way, because it is faster. To me, it makes some sense that if the user specifies 'alist or 'plist, then they want to have all the object members, even if the keys are duplicated. I didn't find clear direction in the JSON specifications on how duplicated keys should be handled. >> This parser runs 8-9x faster than the jansson based parser on >> my >> machine (tested on clangd language server messages). An >> additional >> tiny benefit is that large integers are parsed, instead of >> having an >> "out of range" error. > > That sounds interesting, but I am reminded of this article: > https://seriot.ch/projects/parsing_json.html. There seem to be > plenty > of difficult edge-cases when dealing with JSON input, that > should > probably be tested if Emacs has it's own custom parser built-in. I've now run my parser on the tests in this repo; it passes all of them. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 10:27 I created a faster JSON parser Herman, Géza 2024-03-08 11:41 ` Philip Kaludercic @ 2024-03-08 12:03 ` Eli Zaretskii 2024-03-08 12:38 ` Herman, Géza 2024-03-08 13:28 ` Po Lu 2024-03-09 20:37 ` Christopher Wellons 3 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-08 12:03 UTC (permalink / raw) To: Géza Herman; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Date: Fri, 08 Mar 2024 11:27:16 +0100 > > This parser runs 8-9x faster than the jansson based parser on my > machine (tested on clangd language server messages). How does it do that? Can you summarize the main ideas of your implementation, which make it so much faster? Thanks. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 12:03 ` Eli Zaretskii @ 2024-03-08 12:38 ` Herman, Géza 2024-03-08 12:59 ` Eli Zaretskii 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-08 12:38 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Herman, Géza <geza.herman@gmail.com> >> Date: Fri, 08 Mar 2024 11:27:16 +0100 >> >> This parser runs 8-9x faster than the jansson based parser on >> my >> machine (tested on clangd language server messages). > > How does it do that? Can you summarize the main ideas of your > implementation, which make it so much faster? My parser creates Lisp objects during parsing, there is no intermediate step as Emacs has with jansson. With jansson, there are a lot of allocations, which my parser doesn't have (my parser has only two buffers, which exponentially grow. There are no other allocations). But even ignoring performance loss because of mallocs (on my dataset, 40% of CPU time goes into malloc/free), I think parsing should be faster, so maybe jansson is not a fast parser in the first place. ^ permalink raw reply [flat|nested] 51+ messages in thread
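[Editor's note: the "two exponentially growing buffers" strategy described above can be sketched roughly as below. This is a hypothetical illustration, not code from the actual patch; names are invented, and real Emacs code would use xmalloc/xrealloc (which abort on allocation failure) rather than bare malloc/realloc.]

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One reusable workspace that doubles its capacity when full.  Appending
   n bytes total costs O(log n) allocations for the whole document,
   instead of one malloc per parsed token.  */
struct workspace
{
  char *data;
  size_t used;
  size_t capacity;
};

static void
ws_init (struct workspace *ws)
{
  ws->capacity = 64;
  ws->used = 0;
  ws->data = malloc (ws->capacity);   /* Emacs: xmalloc */
}

static void
ws_append (struct workspace *ws, const char *src, size_t len)
{
  if (ws->used + len > ws->capacity)
    {
      /* Exponential growth: amortized O(1) per appended byte.  */
      while (ws->used + len > ws->capacity)
	ws->capacity *= 2;
      ws->data = realloc (ws->data, ws->capacity);   /* Emacs: xrealloc */
    }
  memcpy (ws->data + ws->used, src, len);
  ws->used += len;
}
```

The point of the design is that the buffer is never freed between tokens, so malloc/free disappear from the hot path entirely.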
* Re: I created a faster JSON parser 2024-03-08 12:38 ` Herman, Géza @ 2024-03-08 12:59 ` Eli Zaretskii 2024-03-08 13:12 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-08 12:59 UTC (permalink / raw) To: Géza Herman; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: Géza Herman <geza.herman@gmail.com>, > emacs-devel@gnu.org > Date: Fri, 08 Mar 2024 13:38:48 +0100 > > > Eli Zaretskii <eliz@gnu.org> writes: > > >> From: Herman, Géza <geza.herman@gmail.com> > >> Date: Fri, 08 Mar 2024 11:27:16 +0100 > >> > >> This parser runs 8-9x faster than the jansson based parser on > >> my > >> machine (tested on clangd language server messages). > > > > How does it do that? Can you summarize the main ideas of your > > implementation, which make it so much faster? > > My parser creates Lisp objects during parsing, there is no > intermediate step as Emacs has with jansson. With jansson, there > are a lot of allocations, which my parser doesn't have (my parser > has only two buffers, which exponentially grow. There are no other > allocations). But even ignoring performance loss because of > mallocs (on my dataset, 40% of CPU time goes into malloc/free), I > think parsing should be faster, so maybe jansson is not a fast > parser in the first place. Thanks. So, if you are willing to contribute this code to Emacs, what is left is: . clean up the code (e.g., currently it still calls the function that on MS-Windows loads the jansson DLL, which is now unneeded) and adjust it to our style and conventions; . thoroughly test the code on the available test suites (or maybe you already did); . for you to assign the copyright to the FSF, without which we cannot accept such a substantial contribution As an additional idea: perhaps initially we'd want to have a configure-time option to either use your parser or jansson, to make the migration and comparison easier. What do others think about this? 
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 12:59 ` Eli Zaretskii @ 2024-03-08 13:12 ` Herman, Géza 2024-03-08 14:10 ` Eli Zaretskii 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-08 13:12 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > So, if you are willing to contribute this code to Emacs, what is > left > is: > > . clean up the code (e.g., currently it still calls the > function > that on MS-Windows loads the jansson DLL, which is now > unneeded) > and adjust it to our style and conventions; I can remove the WINDOWSNT-related #ifs. Are there any more comments? I formatted my code with clang-format; is there anything else that should be done? > . thoroughly test the code on the available test suites (or > maybe > you already did); I tested it on github.com/nst/JSONTestSuite; it passes all "y_" and "n_" tests in the directory "test_parsing". If you're aware of other test repositories, I'm happy to test on them as well. > . for you to assign the copyright to the FSF, without which we > cannot accept such a substantial contribution Can you please send me the necessary documents? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 13:12 ` Herman, Géza @ 2024-03-08 14:10 ` Eli Zaretskii 2024-03-08 14:24 ` Collin Funk 2024-03-08 15:20 ` Herman, Géza 0 siblings, 2 replies; 51+ messages in thread From: Eli Zaretskii @ 2024-03-08 14:10 UTC (permalink / raw) To: Géza Herman; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: Géza Herman <geza.herman@gmail.com>, > emacs-devel@gnu.org > Date: Fri, 08 Mar 2024 14:12:19 +0100 > > > > . clean up the code (e.g., currently it still calls the > > function > > that on MS-Windows loads the jansson DLL, which is now > > unneeded) > > and adjust it to our style and conventions; > I can remove the WINDOWSNT-related #ifs. Are there any more > comments? I formatted my code with clang-format, is there anything else > that should be done? The following is based on an initial reading of the patch: . Redundant braces (for blocks of a single code line) are one issue. . The way you break a long line at the equals sign '=' is another (we break after '=', not before). . The code must not call malloc/realloc/free, but their x* variants, xmalloc/xrealloc/xfree. . The names should be changed to remove the "my_" and "My" prefixes. . Many functions should have a commentary before them explaining their purpose and use, the exception being short functions whose purpose is clear from reading the code. . Some of the generic 'json-parse-error' errors should probably be more specific; for example, UTF-8 encoding problems should be flagged as such. . The code which handles integers seems to assume that 'unsigned long' is a 64-bit type? If so, this is not true on Windows; please see how we handle this elsewhere in Emacs, in particular in the WIDE_EMACS_INT case. A more general comment is that you seem to be parsing buffer text assuming it's UTF-8? If so, this is not accurate, as the internal representation is a superset of UTF-8, and can represent characters above 0x10FFFF. There could be other issues as well. > > .
thoroughly test the code on the available test suites (or > > maybe > > you already did); > I tested it on github.com/nst/JSONTestSuite, it passes all "y_" > and "n_" tests in the directory "test_parsing". If you're aware > of other test repositories, I'm happy to test on them as well. I don't know about such test suites, but maybe someone else does. > > . for you to assign the copyright to the FSF, without which we > > cannot accept such a substantial contribution > Can you please send me the necessary documents? Sent off-list. Thanks. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 14:10 ` Eli Zaretskii @ 2024-03-08 14:24 ` Collin Funk 2024-03-08 15:20 ` Herman, Géza 1 sibling, 0 replies; 51+ messages in thread From: Collin Funk @ 2024-03-08 14:24 UTC (permalink / raw) To: emacs-devel Hello, On 3/8/24 6:10 AM, Eli Zaretskii wrote: > . The code which handles integers seems to assume that 'unsigned long' > is a 64-bit type? if so, this is not true on Windows; please see how > we handle this elsewhere in Emacs, in particular in the > WIDE_EMACS_INT case. I noticed this in 'my_json_parse_number' as well. Another change that would be nice there is using ckd_add and ckd_mul from stdckdint.h, generated from lib/stdckdint.in.h, to check for integer overflows. Less room for errors compared to checking manually and easier to read. Collin ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 14:10 ` Eli Zaretskii 2024-03-08 14:24 ` Collin Funk @ 2024-03-08 15:20 ` Herman, Géza 2024-03-08 16:22 ` Eli Zaretskii 1 sibling, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-08 15:20 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Herman, Géza <geza.herman@gmail.com> >> Cc: Géza Herman <geza.herman@gmail.com>, >> emacs-devel@gnu.org >> Date: Fri, 08 Mar 2024 14:12:19 +0100 > The following is based on an initial reading of the patch: > > . Redundant braces (for blocks of a single code line) is one > issue. > . The way you break a long line at the equals sign '=' is > another (we > break after '=', not before). I used clang-format to format my code (I use a completely different coding style). I see that clang-format is configured this way in Emacs. Shouldn't BreakBeforeBinaryOperators be set to None or NonAssignment in .clang-format? > . The code which handles integers seems to assume that > 'unsigned long' > is a 64-bit type? if so, this is not true on Windows; please > see how > we handle this elsewhere in Emacs, in particular in the > WIDE_EMACS_INT case. That was a mistake on my part, though a different (but similar) one. I originally used a 64-bit type, but then changed it to long, because of 32-bit architectures. The idea is to use a type which likely has the same size as a CPU register. So I think long is OK, I just need to change the thresholds to ULONG_MAX. Or I think I'll use ckd_* functions as Collin suggested. > A more general comment is that you seem to be parsing buffer > text > assuming it's UTF-8? If so, this is not accurate, as the > internal > representation is a superset of UTF-8, and can represent > characters > above 0x10FFFF. When does a buffer have characters above 0x10ffff? I supposed that a JSON shouldn't contain characters that are out of range. 
But if the solution is to just remove the upper-range comparison, I can do that easily. >> Can you please send me the necessary documents? > > Sent off-list. Thanks! ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 15:20 ` Herman, Géza @ 2024-03-08 16:22 ` Eli Zaretskii 2024-03-08 18:34 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-08 16:22 UTC (permalink / raw) To: Géza Herman; +Cc: geza.herman, emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: Géza Herman <geza.herman@gmail.com>, > emacs-devel@gnu.org > Date: Fri, 08 Mar 2024 16:20:40 +0100 > > > Eli Zaretskii <eliz@gnu.org> writes: > > > . The way you break a long line at the equals sign '=' is > > another (we > > break after '=', not before). > I used clang-format to format my code (I use a completely > different coding style). I see that clang-format is configured > this way in Emacs. Shouldn't BreakBeforeBinaryOperators be set to > None or NonAssignment in .clang-format? Actually, I see we use both styles, so I guess you can disregard that part. > > . The code which handles integers seems to assume that > > 'unsigned long' > > is a 64-bit type? if so, this is not true on Windows; please > > see how > > we handle this elsewhere in Emacs, in particular in the > > WIDE_EMACS_INT case. > That was a mistake on my part, though a different (but similar) > one. I originally used a 64-bit type, but then changed it to > long, because of 32-bit architectures. The idea is to use a type > which likely has the same size as a CPU register. If you want to use a 32-bit type, use 'int' or 'unsigned int'. > > A more general comment is that you seem to be parsing buffer text > > assuming it's UTF-8? If so, this is not accurate, as the internal > > representation is a superset of UTF-8, and can represent > > characters above 0x10FFFF. > > When does a buffer have characters above 0x10ffff? See the node "Text Representations" in the ELisp manual, it explains that. > I supposed that a JSON shouldn't contain characters that are out of > range. But if the solution is to just remove the upper-range > comparison, I can do that easily. 
We need to decide what to do with characters outside of the Unicode range. The encoding of those characters is different, btw, it doesn't follow the UTF-8 scheme. In any case, the error in those cases cannot be just "json parse error", it must be something more self-explanatory. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 16:22 ` Eli Zaretskii @ 2024-03-08 18:34 ` Herman, Géza 2024-03-08 19:57 ` Eli Zaretskii 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-08 18:34 UTC (permalink / raw) To: Eli Zaretskii; +Cc: geza.herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Herman, Géza <geza.herman@gmail.com> >> That was a mistake on my part, though a different (but similar) >> one. I originally used a 64-bit type, but then changed it to >> long, because of 32-bit architectures. The idea is to use a >> type >> which likely has the same size as a CPU register. > > If you want to use a 32-bit type, use 'int' or 'unsigned int'. I want to use a datatype which is likely to have the same size as a CPU register. On 32-bit machines, it should be 32-bit, on 64-bit architectures it should be 64-bit. This is not a strong requirement, but it makes the parser faster. During number parsing, this part of the code calculates the value of the number. If it overflows, or it ends up being a float, then the workspace buffer is re-parsed to find the value. But if it is an integer without overflow, the parser just can use the value without any further processing (a certain kind of LSP message is basically a large array of integers, so it makes sense to optimize this case - this optimization makes parsing such messages 1.5-2x faster). > We need to decide what to do with characters outside of the > Unicode > range. The encoding of those characters is different, btw, it > doesn't > follow the UTF-8 scheme. > > In any case, the error in those cases cannot be just "json parse > error", it must be something more self-explanatory. I am open to suggestions, as even after reading "Text Representations" in the manual, I have no idea what to do here. I didn't find any code which handles this case in the jansson parser either. ^ permalink raw reply [flat|nested] 51+ messages in thread
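[Editor's note: the fast path described above might look roughly like this. The structure and names are invented for illustration; in the actual patch, the slow path re-parses the workspace buffer rather than receiving a token string.]

```c
#include <stdbool.h>
#include <stddef.h>
#include <limits.h>

/* Try to parse LEN bytes of TOK as a plain unsigned integer into a
   register-sized value.  Return false if the token contains a
   non-digit (float syntax) or would overflow, in which case the caller
   re-parses the token on the slow path.  */
static bool
parse_small_uint (const char *tok, size_t len, unsigned long *out)
{
  unsigned long v = 0;
  for (size_t i = 0; i < len; i++)
    {
      unsigned d = (unsigned) (tok[i] - '0');
      if (d > 9)
	return false;		/* '.', 'e', etc.: not a plain integer.  */
      if (v > (ULONG_MAX - d) / 10)
	return false;		/* Would overflow: use the slow path.  */
      v = v * 10 + d;
    }
  *out = v;
  return true;
}
```

Since clangd-style messages are dominated by small integers, almost every number takes the cheap loop and the fallback is rarely exercised.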
* Re: I created a faster JSON parser 2024-03-08 18:34 ` Herman, Géza @ 2024-03-08 19:57 ` Eli Zaretskii 2024-03-08 20:22 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-08 19:57 UTC (permalink / raw) To: Herman Géza; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: geza.herman@gmail.com, emacs-devel@gnu.org > Date: Fri, 08 Mar 2024 19:34:28 +0100 > > > If you want to use a 32-bit type, use 'int' or 'unsigned int'. > I want to use a datatype which is likely to have the same size as > a CPU register. On 32-bit machines, it should be 32-bit, on 64-bit > architectures it should be 64-bit. Is there a reason for you to want it to be a 64-bit type on a 64-bit machine? If the only bother is efficiency, then you can use 'int' without fear. But if a 64-bit machine will need the range of values beyond INT_MAX (does it?), then I suggest to use ptrdiff_t. > This is not a strong requirement, but it makes the parser faster. Are you sure? AFAIK, 32-bit arithmetic on a 64-bit machine is as fast as 64-bit arithmetic. So if efficiency is the only consideration, using 'int' is OK. > During number parsing, this part of the code calculates the value of > the number. If it overflows, or it ends up being a float, then the > workspace buffer is re-parsed to find the value. But if it is an > integer without overflow, the parser just can use the value without > any further processing (a certain kind of LSP message is basically a > large array of integers, so it makes sense to optimize this case - > this optimization makes parsing such messages 1.5-2x faster). If the value needs to fit in a fixnum, use EMACS_INT, which is a type that already takes the 32-bit vs 64-bit nature of the machine into consideration. > > We need to decide what to do with characters outside of the > > Unicode range. The encoding of those characters is different, > > btw, it doesn't follow the UTF-8 scheme.
> > > > In any case, the error in those cases cannot be just "json parse > > error", it must be something more self-explanatory. > I am open to suggestions, as even after reading "Text > Representations" in the manual, I have no idea what to do here. We should begin by deciding whether we want to support characters outside of the Unicode range. The error message depends on that decision. > I didn't find any code which handles this case in the jansson parser > either. The jansson code required encoding/decoding strings to make sure we submit to jansson text that is always valid UTF-8. ^ permalink raw reply [flat|nested] 51+ messages in thread
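[Editor's note: the range restriction discussed above — accept well-formed UTF-8 but reject "codepoints" beyond U+10FFFF, such as Emacs's internal raw-byte characters — can be sketched as below. This is an illustrative minimum only: a real validator must additionally reject overlong encodings and surrogate codepoints, which this sketch does not do.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Scan LEN bytes of S, checking that every sequence is a structurally
   valid 1..4-byte UTF-8 sequence decoding to a codepoint <= 0x10FFFF.
   Emacs's internal encoding uses 5-byte sequences for larger values,
   which this check rejects via the lead-byte test.  */
static bool
utf8_in_unicode_range (const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned c = s[i], cp;
      size_t n;
      if (c < 0x80)             { n = 1; cp = c; }
      else if ((c & 0xE0) == 0xC0) { n = 2; cp = c & 0x1F; }
      else if ((c & 0xF0) == 0xE0) { n = 3; cp = c & 0x0F; }
      else if ((c & 0xF8) == 0xF0) { n = 4; cp = c & 0x07; }
      else
	return false;		/* 5-byte lead byte: beyond UTF-8 proper.  */
      if (i + n > len)
	return false;		/* Truncated sequence.  */
      for (size_t j = 1; j < n; j++)
	{
	  if ((s[i + j] & 0xC0) != 0x80)
	    return false;	/* Bad continuation byte.  */
	  cp = (cp << 6) | (s[i + j] & 0x3F);
	}
      if (cp > 0x10FFFF)
	return false;		/* Outside the Unicode range.  */
      i += n;
    }
  return true;
}
```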
* Re: I created a faster JSON parser 2024-03-08 19:57 ` Eli Zaretskii @ 2024-03-08 20:22 ` Herman, Géza 2024-03-09 6:52 ` Eli Zaretskii 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-08 20:22 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Herman Géza, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> I want to use a datatype which is likely to have the same size >> as >> a CPU register. On 32-bit machines, it should be 32-bit, on >> 64-bit >> architectures it should be 64-bit. > > Is there a reason for you to want it to be a 64-bit type on a > 64-bit > machine? If the only bother is efficiency, then you can use > 'int' > without fear. But if a 64-bit machine will need the range of > values > beyond INT_MAX (does it?), then I suggest to use ptrdiff_t. The only reason is that if I use a 64-bit number on a 64-bit platform, then the fast path will be chosen more frequently. So it makes sense to use a register-sized integer here. >> This is not a strong requirement, but it makes the parser >> faster. > > Are you sure? AFAIK, 32-bit arithmetic on a 64-bit machine is > as > fast as 64-bit arithmetic. So if efficiency is the only > consideration, using 'int' is OK. Yes, that's right, but my reasoning is the other way around. I shouldn't choose a larger integer than the platform natively supports. I could just use a 64-bit integer on all platforms; this is what I originally had. But then on 32-bit platforms, this part of the parser will be slower, because it will unnecessarily use two registers instead of one (for most cases, as numbers are usually smaller than 32-bit). This whole thing is not a big deal, as int is probably fine, as it covers most of the commonly used range. But if I can just put a more proper type here, then I think I should do it. > If the value needs to fit in a fixnum, use EMACS_INT, which is a > type > that already takes the 32-bit vs 64-bit nature of the machine > into > consideration.
Yes, it seems that EMACS_UINT is good for my purpose, thanks for the suggestion. > We should begin by deciding whether we want to support > characters > outside of the Unicode range. The error message depends on that > decision. > >> I didn't find any code which handles this case in the jansson >> parser >> either. > > The jansson code required encoding/decoding strings to make sure > we > submit to jansson text that is always valid UTF-8. I tried to use the jansson parser with a Unicode 0x333333 character in a string, and it didn't work; it fails with a (json-parse-error "unable to decode byte... message. Also, I see that json-parse-string calls some utf8-encoding-related function before parsing, but json-parse-buffer doesn't (and it doesn't do anything encoding-related in the callback; it just calls memcpy). So based on these, is there any benefit to supporting these? Out of curiosity, what are these extra characters used for? What is the purpose of the odd special 2-byte encoding of 8-bit characters (I mean where the 1st byte is C0/C1)? Why not just use the regular utf-8 encoding for these values? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 20:22 ` Herman, Géza @ 2024-03-09 6:52 ` Eli Zaretskii 2024-03-09 11:08 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-09 6:52 UTC (permalink / raw) To: Géza Herman; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: Herman Géza <geza.herman@gmail.com>, > emacs-devel@gnu.org > Date: Fri, 08 Mar 2024 21:22:13 +0100 > > > Is there a reason for you to want it to be a 64-bit type on a 64-bit > > machine? If the only bother is efficiency, then you can use 'int' > > without fear. But if a 64-bit machine will need the range of > > values beyond INT_MAX (does it?), then I suggest to use ptrdiff_t. > > The only reason is if I use a 64-bit number on a 64-bit platform, > then the fast path will be chosen more frequently. So it makes > sense to use a register-sized integer here. Then either ptrdiff_t or EMACS_INT should do what you want. > Yes, it seems that EMACS_UINT is good for my purpose, thanks for > the suggestion. Are you sure you need the unsigned variety? If EMACS_INT fits the bill, then it is a better candidate, since unsigned arithmetic has its quirks. > > The jansson code required encoding/decoding strings to make sure > > we submit to jansson text that is always valid UTF-8. > > I tried to use the jansson parser with a unicode 0x333333 > character in a string, and it didn't work, it fails with > (json-parse-error "unable to decode byte... message. Well, I didn't say trying an arbitrary codepoint will demonstrate the issue. Some codepoints above 0x10FFFF indeed cannot be passed to jansson. It's okay if the initial version of this parser only handles the Unicode range and errors out otherwise; we could extend it if needed later. But the error message should talk specifically about an invalid character or something, not just a generic "parse error".
> Also, I see that json-parse-string calls some utf8 encoding related > function before parsing, but json-parse-buffer doesn't (and it > doesn't do anything encoding related thing in the callback, it just > calls memcpy). This is a part I was never happy about. But, as I say above, we can get to handling these rare cases later. > So based on these, does it have any benefit of supporting these? Yes, definitely. But it isn't urgent. > Out of curiosity, what are these extra characters used for? Raw bytes and characters from charsets that are not (yet) unified with Unicode. > What is the purpose of the odd special 2-byte encoding of 8-bit > characters (I mean where the 1st byte is C0/C1)? Why don't just use > the regular utf-8 encoding for these values? I think it's for efficiency: a 2-byte encoding takes much less space than the 6-byte encoding (using superset of UTF-8) would take. Imagine the case where a large byte-stream is inserted into a multibyte buffer, before decoding it, something that happens a lot when visiting non-ASCII files or reading from a network sub-process. The regular UTF-8 encoding cannot be used for the raw bytes, because then we will be unable to distinguish between them and the Unicode codepoints of the same value. For example, a raw byte whose value is 160 decimal (A0 hex) will be indistinguishable from U+00A0 No-Break Space character. This is why the "codepoint" corresponding to raw byte 160 is 0x3FFFA0, see BYTE8_TO_CHAR. Once again, we can extend the parser for codepoints outside of the Unicode range later. For now, it's okay to reject them with a suitable error. ^ permalink raw reply [flat|nested] 51+ messages in thread
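[Editor's note: the raw-byte mapping Eli describes above is easy to sketch. This is simplified from the macros in Emacs's src/character.h — the names match, but the real CHAR_BYTE8_P compares against MAX_5_BYTE_CHAR rather than the plain Unicode maximum used here.]

```c
/* Emacs maps raw bytes 0x80..0xFF to "characters" above the Unicode
   range, so that raw byte 0xA0 stays distinguishable from U+00A0
   NO-BREAK SPACE: it becomes 0x3FFFA0 instead.  */

#define MAX_UNICODE_CHAR 0x10FFFF

static int
byte8_to_char (int byte)	/* BYTE in 0x80..0xFF.  */
{
  return byte + 0x3FFF00;
}

static int
char_to_byte8 (int c)		/* Inverse of the above.  */
{
  return c - 0x3FFF00;
}

static int
char_byte8_p (int c)		/* Is C a raw-byte "character"?  */
{
  return c > MAX_UNICODE_CHAR;
}
```

This explains why such characters cannot survive a round trip through strict UTF-8: their codepoints simply have no UTF-8 encoding.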
* Re: I created a faster JSON parser 2024-03-09 6:52 ` Eli Zaretskii @ 2024-03-09 11:08 ` Herman, Géza 2024-03-09 12:23 ` Lynn Winebarger ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Herman, Géza @ 2024-03-09 11:08 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Herman, Géza <geza.herman@gmail.com> >> Cc: Herman Géza <geza.herman@gmail.com>, >> emacs-devel@gnu.org >> Date: Fri, 08 Mar 2024 21:22:13 +0100 >> Yes, it seems that EMACS_UINT is good for my purpose, thanks >> for >> the suggestion. > > Are you sure you need the unsigned variety? If EMACS_INT fits > the > bill, then it is a better candidate, since unsigned arithmetics > has > its quirks. Yes, I think it's better to use unsigned: read the sign, then parse the number as unsigned, and then apply the sign at the end. If the number is parsed with its sign, it needs an additional step at each character (the sign needs to be applied to each digit). >> Also, I see that json-parse-string calls some utf8 encoding >> related >> function before parsing, but json-parse-buffer doesn't (and it >> doesn't do anything encoding related thing in the callback, it >> just >> calls memcpy). > > This is a part I was never happy about. But, as I say above, we > can > get to handling these rare cases later. I think this is an additional benefit of my parser: this feature can be added to it more easily than into jansson. I'm even tempted to say that we could just remove utf-8 checking from my code, and then Emacs's encoding method should work right out of the box. Or, to say that utf-8 handling should stay as is. Because as far as I understand, if the JSON contains an invalid utf-8 sequence which is not invalid according to Emacs's character representation, then this problem won't be detected.
So checking for utf-8 encoding errors shouldn't be the job of the json parser, but of the IO handling, which has the chance to know that the JSON stream itself must only contain a valid utf-8 encoding. Or, as the JSON specification explicitly says that the allowed character range is 0x20 .. 0x10ffff, the current solution is fine, because it is actually against JSON rules to allow anything outside of this range. > Once again, we can extend the parser for codepoints outside of > the > Unicode range later. For now, it's okay to reject them with a > suitable error. OK, cool, I added Qjson_utf8_decode_error to indicate decoding errors. How can we proceed further? This is the current state of the patch: https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch I think I did everything that was asked for, except Po Lu's parenthesis-related comment, because I still don't know what to parenthesize and what not to. I saw a lot of "a + x * y" kind of expressions in the Emacs codebase without any parentheses. Are the exact rules documented somewhere? ^ permalink raw reply [flat|nested] 51+ messages in thread
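[Editor's note: the sign handling Géza describes — read the sign, parse the magnitude as unsigned, negate once at the end — can be sketched like so. This is illustrative; overflow checking and the most-negative-value edge case are deliberately elided, and the name is invented.]

```c
#include <stdbool.h>

/* Parse a decimal integer, handling the sign once instead of at every
   digit: the magnitude loop works purely on an unsigned accumulator,
   and the sign is applied as a final step.  */
static bool
parse_int (const char *s, long *out)
{
  bool negative = (*s == '-');
  if (negative)
    s++;
  if (*s == '\0')
    return false;		/* A lone "-" is not a number.  */
  unsigned long v = 0;
  for (; *s; s++)
    {
      unsigned d = (unsigned) (*s - '0');
      if (d > 9)
	return false;		/* Non-digit character.  */
      v = v * 10 + d;		/* Overflow check elided for brevity.  */
    }
  *out = negative ? -(long) v : (long) v;
  return true;
}
```

Keeping the loop branch-free with respect to the sign is what makes the per-digit work cheaper than sign-aware accumulation.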
* Re: I created a faster JSON parser 2024-03-09 11:08 ` Herman, Géza @ 2024-03-09 12:23 ` Lynn Winebarger 2024-03-09 12:58 ` Po Lu 2024-03-09 13:13 ` Eli Zaretskii 2 siblings, 0 replies; 51+ messages in thread From: Lynn Winebarger @ 2024-03-09 12:23 UTC (permalink / raw) To: Herman, Géza; +Cc: Eli Zaretskii, emacs-devel On Sat, Mar 9, 2024, 6:39 AM Herman, Géza <geza.herman@gmail.com> wrote: > > I think I did everything that was asked for, except Po Lu's > parenthesis-related comment, because I still don't know what to > parenthesize and what not to. I saw a lot of "a + x * y" kind of > expressions in emacs codebase without any parenthesis. Are the > exact rules documented somewhere? > I suspect the rule is to ensure that Emacs maintainers can freely convert functions to C preprocessor macros. So you should follow Po Lu's description. If the preprocessor weren't an issue, such replication of the language's builtin arithmetic precedence would be undesirable, and possibly repugnant, depending on your aesthetic sensibilities. Lynn ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 11:08 ` Herman, Géza 2024-03-09 12:23 ` Lynn Winebarger @ 2024-03-09 12:58 ` Po Lu 2024-03-09 13:13 ` Eli Zaretskii 2 siblings, 0 replies; 51+ messages in thread From: Po Lu @ 2024-03-09 12:58 UTC (permalink / raw) To: Herman, Géza; +Cc: Eli Zaretskii, emacs-devel "Herman, Géza" <geza.herman@gmail.com> writes: > I think I did everything that was asked for, except Po Lu's > parenthesis-related comment, because I still don't know what to > parenthesize and what not to. I saw a lot of "a + x * y" kind of > expressions in emacs codebase without any parenthesis. Are the exact > rules documented somewhere? Simple statements as above, which fit in one line, don't require the extra parentheses. The objective of these parentheses is to enable indenting statements satisfactorily with CC Mode's indentation commands, and so that such indentation does not give readers a misleading impression of the statement's structure. https://www.gnu.org/prep/standards/standards.html#Formatting ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 11:08 ` Herman, Géza 2024-03-09 12:23 ` Lynn Winebarger 2024-03-09 12:58 ` Po Lu @ 2024-03-09 13:13 ` Eli Zaretskii 2024-03-09 14:00 ` Herman, Géza 2 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-09 13:13 UTC (permalink / raw) To: Géza Herman; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: Géza Herman <geza.herman@gmail.com>, > emacs-devel@gnu.org > Date: Sat, 09 Mar 2024 12:08:54 +0100 > > How can we proceed further? This is the current state of the > patch: > https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch What about removing the dependency on jansson altogether? We also need to wait until your legal paperwork is completed. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 13:13 ` Eli Zaretskii @ 2024-03-09 14:00 ` Herman, Géza 2024-03-09 14:21 ` Eli Zaretskii 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-09 14:00 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Herman, Géza <geza.herman@gmail.com> >> Cc: Géza Herman <geza.herman@gmail.com>, >> emacs-devel@gnu.org >> Date: Sat, 09 Mar 2024 12:08:54 +0100 >> >> How can we proceed further? This is the current state of the >> patch: >> https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch > > What about removing the dependency on jansson altogether? Unless it also turns out to be a serious bottleneck, I'm not too motivated to work on that. Based on Po Lu's comments, I added some parentheses, here's the latest version: https://github.com/geza-herman/emacs/commit/5e3db3fb06b6393fd1b5d5fd967062ec1909e94a.patch (I reformatted the code using indent-region, and put parentheses where it didn't use the correct indentation. I'm not sure whether this method finds all the cases, but as the intent of these extra parentheses is to have correct indentation, I think this approach is more-or-less OK) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 14:00 ` Herman, Géza @ 2024-03-09 14:21 ` Eli Zaretskii 0 siblings, 0 replies; 51+ messages in thread From: Eli Zaretskii @ 2024-03-09 14:21 UTC (permalink / raw) To: Géza Herman; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: Géza Herman <geza.herman@gmail.com>, > emacs-devel@gnu.org > Date: Sat, 09 Mar 2024 15:00:09 +0100 > > > Eli Zaretskii <eliz@gnu.org> writes: > > >> How can we proceed further? This is the current state of the > >> patch: > >> https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch > > > > What about removing the dependency on jansson altogether? > > Unless it also turns out to be a serious bottleneck, I'm not too > motivated to work on that. That's a pity, because using jansson for part of the job and our own code for the other part makes little sense to me. So I hope you will reconsider, because having our own JSON code is a very attractive development for Emacs, it opens several possibilities for future extensions that using an external library cannot. > Based on Po Lu's comments, I added some parentheses, here's the > latest version: Thanks. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 10:27 I created a faster JSON parser Herman, Géza 2024-03-08 11:41 ` Philip Kaludercic 2024-03-08 12:03 ` Eli Zaretskii @ 2024-03-08 13:28 ` Po Lu 2024-03-08 16:14 ` Herman, Géza 2024-03-09 20:37 ` Christopher Wellons 3 siblings, 1 reply; 51+ messages in thread From: Po Lu @ 2024-03-08 13:28 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org "Herman, Géza" <geza.herman@gmail.com> writes: > I created a faster JSON parser for emacs, you can check it out here: > > https://github.com/geza-herman/emacs/tree/faster-json-parsing > > It replaces json-parse-string and json-parse-buffer functions. The > behavior should be the same as before, with the only exception that > objects with duplicated keys are not detected if :object-type is not > 'hash-table. > > This parser runs 8-9x faster than the jansson based parser on my > machine (tested on clangd language server messages). An additional > tiny benefit is that large integers are parsed, instead of having an > "out of range" error. > > What do you think? Speed aside, this change eliminates an external dependency from Emacs, which is a great service to us in its own right. Thank you! Nevertheless, there are several coding style issues that should be resolved, viz. the presence of redundant opening braces and ternary or nested binary operations not enclosed in parentheses, and the size of the change is such that its copyright must be assigned to the FSF. Please see that these are addressed, and thanks again. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 13:28 ` Po Lu @ 2024-03-08 16:14 ` Herman, Géza 2024-03-09 1:55 ` Po Lu 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-08 16:14 UTC (permalink / raw) To: Po Lu; +Cc: Herman, Géza, emacs-devel@gnu.org Po Lu <luangruo@yahoo.com> writes: > "Herman, Géza" <geza.herman@gmail.com> writes: > >> What do you think? > > Speed aside, this change eliminates an external dependency from > Emacs, > which is a great service to us in its own right. Thank you! I think jansson is still used, but only for writing JSON. > Nevertheless, there are several coding style issues that should > be > resolved, viz. the presence of redundant opening braces and > ternary or > nested binary operations not enclosed in parentheses, Can you give me some examples regarding operations that need to be parenthesized? Isn't there a gcc warning for this, so I can follow the convention more easily? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 16:14 ` Herman, Géza @ 2024-03-09 1:55 ` Po Lu 0 siblings, 0 replies; 51+ messages in thread From: Po Lu @ 2024-03-09 1:55 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org "Herman, Géza" <geza.herman@gmail.com> writes: > I think jansson is still used, but only for writing to json. That's a disappointment, but would you be willing to solve this as well? > Can you give me some examples reagarding operations need to be > parenthesized? e.g., in function_call () + function_call_1 () * function_call_2 () + function_call_3 () the entire statement should be surrounded by parentheses, along with each group of operators separated by precedence: (function_call () + (function_call_1 () * function_call_2 ()) + function_call_3 ()) and lastly all ternary operators should be surrounded by parentheses. > Isn't there a gcc warning for this, so I can follow the convention > more easily? Not that I'm aware of. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 10:27 I created a faster JSON parser Herman, Géza ` (2 preceding siblings ...) 2024-03-08 13:28 ` Po Lu @ 2024-03-09 20:37 ` Christopher Wellons 2024-03-10 6:31 ` Eli Zaretskii 2024-03-10 6:58 ` Herman, Géza 3 siblings, 2 replies; 51+ messages in thread From: Christopher Wellons @ 2024-03-09 20:37 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org > What do you think? Your new JSON parser appears to be carefully written, and has the marks of thoughtful design. You've avoided the common pitfalls, and that gives me confidence in it. In review I noticed a potential pointer overflow in json_parse_string: parser->input_current + 4 <= parser->input_end The "+ 4" may push the pointer beyond one-past-the-end not just for the input, but the buffer itself, overflowing the pointer. To fix, re-arrange the expression to check a size rather than an address: parser->input_end - parser->input_current >= 4 In json_make_object_workspace_for and json_byte_workspace_put, a size is doubled without an overflow check ("new_workspace_size * 2"). The first could cause an infinite loop, and the second could allocate less than was expected. Both are minor, and in practice can only affect 32-bit targets, because you'd need to grow these buffers to the limits before these sizes could overflow. Despite the obvious care with which this was written, I personally would not adopt a JSON parser that had not been thoroughly fuzz tested under Address Sanitizer and Undefined Behavior Sanitizer. Fuzzing is incredibly effective at finding defects, and it would be foolish not to use it in its ideal circumstances. Normally it's not difficult and requires only a few lines of code. But this JSON parser is tightly coupled with the Emacs Lisp runtime, which greatly complicates things. I couldn't simply pluck it out by itself and drop it in, say, AFL++. As noted earlier, the parser gets its performance edge through skipping the intermediate steps.
This is great! That could still be accomplished without such tight coupling, allowing for performance *and* an interface that is testable and fuzzable in relative isolation. ^ permalink raw reply [flat|nested] 51+ messages in thread
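Both of the defects Wellons points out generalise beyond this parser. A self-contained sketch of the two fixed patterns (the field and function names are illustrative, not the patch's own):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Instead of `p + 4 <= end` -- where `p + 4` may compute a pointer
   more than one past the end of the buffer, which is undefined
   behavior in C even if never dereferenced -- compare distances.
   `end - p` is always defined when both pointers refer to the same
   buffer (or one past its end).  */
static bool
has_at_least (const char *p, const char *end, ptrdiff_t n)
{
  return end - p >= n;
}

/* Doubling with an explicit overflow check, so the size can neither
   wrap to a smaller value nor make the growth loop spin forever on
   32-bit targets.  */
static bool
double_size (size_t *size)
{
  if (*size > SIZE_MAX / 2)
    return false;               /* growth would overflow */
  *size = *size ? *size * 2 : 64;
  return true;
}
```

The initial capacity of 64 here is an arbitrary assumption; the point is only the shape of the checks.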
* Re: I created a faster JSON parser 2024-03-09 20:37 ` Christopher Wellons @ 2024-03-10 6:31 ` Eli Zaretskii 2024-03-10 21:39 ` Philip Kaludercic 2024-03-10 6:58 ` Herman, Géza 1 sibling, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-10 6:31 UTC (permalink / raw) To: Christopher Wellons; +Cc: geza.herman, emacs-devel > Date: Sat, 9 Mar 2024 15:37:25 -0500 > From: Christopher Wellons <wellons@nullprogram.com> > Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org> > > Despite the obvious care which with this was written, I personally would > not adopt a JSON parser that had not been thoroughly fuzz tested under > Address Sanitizer and Undefined Behavior Sanitizer. Fuzzing is incredibly > effective at finding defects, and it would be foolish not to use it in its > ideal circumstances. Normally it's not difficult and requires only a few > lines of code. But this JSON parser is tightly coupled with the Emacs Lisp > runtime, which greatly complicates things. I couldn't simply pluck it out > by itself and drop it in, say, AFL++. That's okay, we can start by making this an optional feature, and consider making it the default after a couple of major releases; meanwhile, any problems will be detected and reported. However, it would make much more sense to consider switching to this code if it also could handle producing JSON output, thus making libjansson unnecessary when we decide to switch. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-10 6:31 ` Eli Zaretskii @ 2024-03-10 21:39 ` Philip Kaludercic 2024-03-11 13:29 ` Eli Zaretskii 0 siblings, 1 reply; 51+ messages in thread From: Philip Kaludercic @ 2024-03-10 21:39 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Christopher Wellons, geza.herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> Date: Sat, 9 Mar 2024 15:37:25 -0500 >> From: Christopher Wellons <wellons@nullprogram.com> >> Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org> >> >> Despite the obvious care which with this was written, I personally would >> not adopt a JSON parser that had not been thoroughly fuzz tested under >> Address Sanitizer and Undefined Behavior Sanitizer. Fuzzing is incredibly >> effective at finding defects, and it would be foolish not to use it in its >> ideal circumstances. Normally it's not difficult and requires only a few >> lines of code. But this JSON parser is tightly coupled with the Emacs Lisp >> runtime, which greatly complicates things. I couldn't simply pluck it out >> by itself and drop it in, say, AFL++. > > That's okay, we can start by making this an optional feature, and > consider making it the default after a couple of major releases; > meanwhile, any problems will be detected and reported. > > However, it would make much more sense to consider switching to this > code if it also could handle producing JSON output, thus making > libjansson unnecessary when we decide to switch. If libjansson is not available, it should still be possible to use this faster parser, while falling back to json.el for generating JSON? From what I understand, the main issue people have with JSON is the parsing bottleneck, right? -- Philip Kaludercic on icterid ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-10 21:39 ` Philip Kaludercic @ 2024-03-11 13:29 ` Eli Zaretskii 2024-03-11 14:05 ` Mattias Engdegård 0 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-11 13:29 UTC (permalink / raw) To: Philip Kaludercic; +Cc: wellons, geza.herman, emacs-devel > From: Philip Kaludercic <philipk@posteo.net> > Cc: Christopher Wellons <wellons@nullprogram.com>, geza.herman@gmail.com, > emacs-devel@gnu.org > Date: Sun, 10 Mar 2024 21:39:29 +0000 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> Date: Sat, 9 Mar 2024 15:37:25 -0500 > >> From: Christopher Wellons <wellons@nullprogram.com> > >> Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org> > >> > >> Despite the obvious care which with this was written, I personally would > >> not adopt a JSON parser that had not been thoroughly fuzz tested under > >> Address Sanitizer and Undefined Behavior Sanitizer. Fuzzing is incredibly > >> effective at finding defects, and it would be foolish not to use it in its > >> ideal circumstances. Normally it's not difficult and requires only a few > >> lines of code. But this JSON parser is tightly coupled with the Emacs Lisp > >> runtime, which greatly complicates things. I couldn't simply pluck it out > >> by itself and drop it in, say, AFL++. > > > > That's okay, we can start by making this an optional feature, and > > consider making it the default after a couple of major releases; > > meanwhile, any problems will be detected and reported. > > > > However, it would make much more sense to consider switching to this > > code if it also could handle producing JSON output, thus making > > libjansson unnecessary when we decide to switch. > > If libjansson is not available, it should still be possible to use this > faster parser, while falling back to json.el for generating JSON? From > what I understand, the main issue people have with JSON is the parsing > bottleneck, right? 
What you describe are possible fallbacks, but I would prefer not to use any fallback at all, but instead have a full C implementation. (Using json.el has one unfortunate disadvantage that it's incompatible with the json.c APIs.) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-11 13:29 ` Eli Zaretskii @ 2024-03-11 14:05 ` Mattias Engdegård 2024-03-11 14:35 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Mattias Engdegård @ 2024-03-11 14:05 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Philip Kaludercic, wellons, geza.herman, emacs-devel 11 mars 2024 kl. 14.29 skrev Eli Zaretskii <eliz@gnu.org>: > What you describe are possible fallbacks, but I would prefer not to > use any fallback at all, but instead have a full C implementation. Yes, I definitely think we should do that. I'm pretty sure that writing a JSON unparser is a lot easier than doing the parser, and the extra speed we stand to gain from not having the intermediate jansson step is not without interest. Overall the proposed parser looks fine, nothing terribly wrong that can't be fixed later on. A few minor points: * The `is_single_uninteresting` array is hard to review and badly formatted. It appears to be 1 for all printable ASCII plus DEL except double-quote and backslash. (Why DEL?) Happy if you would make that clearer, perhaps by formatting or initialising it at run-time. (I don't think we can use GCC-style array interval designators, alas.) * Do you really need to maintain line and column during the parse? If you want them for error reporting, you can materialise them from the offset that you already have. * Are you sure that GC can't run during parsing or that all your Lisp objects are reachable directly from the stack? (It's the `object_workspace` in particular that's worrying me a bit.) ^ permalink raw reply [flat|nested] 51+ messages in thread
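Mattias's suggestion of initialising the table at run time might look roughly like this (a sketch: the table name matches the patch, and its membership -- printable ASCII plus DEL, minus double-quote and backslash -- follows the description above, with the DEL question still open):

```c
/* 1 for bytes the string scanner can copy verbatim: printable ASCII
   plus DEL (0x7f), excluding the two bytes that need special
   handling, '"' and '\\'.  Built at startup instead of being spelled
   out as a hard-to-review 256-entry literal.  */
static unsigned char is_single_uninteresting[256];

static void
init_json_parser_tables (void)
{
  for (int c = 0; c < 256; c++)
    is_single_uninteresting[c]
      = (c >= 0x20 && c <= 0x7f && c != '"' && c != '\\');
}
```

The loop is the documentation: a reviewer checks one condition instead of 256 entries, at the cost of one call during initialisation.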
* Re: I created a faster JSON parser 2024-03-11 14:05 ` Mattias Engdegård @ 2024-03-11 14:35 ` Herman, Géza 2024-03-12 9:26 ` Mattias Engdegård 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-11 14:35 UTC (permalink / raw) To: Mattias Engdegård Cc: Eli Zaretskii, Philip Kaludercic, wellons, geza.herman, emacs-devel Mattias Engdegård <mattias.engdegard@gmail.com> writes: > 11 mars 2024 kl. 14.29 skrev Eli Zaretskii <eliz@gnu.org>: > >> What you describe are possible fallbacks, but I would prefer >> not to >> use any fallback at all, but instead have a full C >> implementation. > > Yes, I definitely think we should do that. I'm pretty sure that > writing a JSON unparser is a lot easier than doing the parser, > and the > extra speed we stand to gain from not having the intermediate > jansson > step is not without interest. FYI: I checked out a JSON benchmark, and it turned out that jansson is not a fast parser, there are faster libraries. If a library has a SAX interface, that could be a potentially useful library for Emacs. According to https://github.com/miloyip/nativejson-benchmark, RapidJSON is at least 10x faster than jansson. I'm just saying this because Emacs doesn't have to stick with my parser, there are possible alternatives, which have JSON serializers as well. (But note: I am happy to make my parser into a mergeable state, and if eventually it gets merged then fixing its bugs, but I'm not motivated to work on integrating other JSON libraries). > Overall the proposed parser looks fine, nothing terribly wrong > that can't be fixed later on. A few minor points: > > * The `is_single_uninteresting` array is hard to review and > badly > formatted. It appears to be 1 for all printable ASCII plus DEL > except > double-quote and backslash. (Why DEL?) Yep, the formatting of that table got destroyed when I reformatted the code into GNU style. Now I formatted the table back, and added comments for each row/col. 
Here's the latest version: https://github.com/geza-herman/emacs/commit/4b5895636c1ec06e630baf47881b246c198af056.patch I'm not sure about DEL: I haven't seen anything which says that it's an invalid character in a string, so the parser currently allows it. > * Do you really need to maintain line and column during the > parse? If > you want them for error reporting, you can materialise them from > the > offset that you already have. Yeah, I thought of that, but it turned out that maintaining the line/column doesn't have an impact on performance. I added that easily, though admittedly it's a little bit awkward to maintain these variables. If Emacs has a way to tell from the byte-pointer the line/col position (both for strings and buffers), I am happy to use that instead. It would be a better solution, because currently the parser always starts from line 1, col 1, which means that if json-parse-buffer is used, these numbers will be local to the current parsing, not actual numbers related to the whole buffer. But as the jansson-based parser behaves the same, I thought it's OK. > * Are you sure that GC can't run during parsing or that all your > Lisp > objects are reachable directly from the stack? (It's the > `object_workspace` in particular that's worrying me a bit.) That's a very good question. I suppose that object_workspace is invisible to the Lisp VM, as it is just a malloc'd object. But I've never seen a problem because of this. What triggers the GC? Is it possible that for the duration of the whole parsing, GC never gets triggered? Otherwise it should have GC'd the objects in object_workspace, causing problems (I tried this parser in a loop, where GC is triggered hundreds of times. In the loop, I compared the result to json-read, everything was fine). ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-11 14:35 ` Herman, Géza @ 2024-03-12 9:26 ` Mattias Engdegård 2024-03-12 10:20 ` Gerd Möllmann 2024-03-12 10:58 ` Herman, Géza 0 siblings, 2 replies; 51+ messages in thread From: Mattias Engdegård @ 2024-03-12 9:26 UTC (permalink / raw) To: "Herman, Géza" Cc: Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel 11 mars 2024 kl. 15.35 skrev Herman, Géza <geza.herman@gmail.com>: > According to https://github.com/miloyip/nativejson-benchmark, RapidJSON is at least 10x faster than jansson. I'm just saying this because Emacs doesn't have to stick with my parser, there are possible alternatives, which have JSON serializers as well. Thanks for the benchmark page reference. Yes, if this turns out to matter more we may consider a faster library. Right now I think your efforts are good enough (at least if we finish the job with a JSON serialiser). > Yep, the formatting of that table got destroyed when I reformatted the code into GNU style. Now I formatted the table back, and added comments for each row/col. Here's the latest version: https://github.com/geza-herman/emacs/commit/4b5895636c1ec06e630baf47881b246c198af056.patch Much better, thank you. >> * Do you really need to maintain line and column during the parse? If >> you want them for error reporting, you can materialise them from the >> offset that you already have. > > Yeah, I thought of that, but it turned out that maintaining the line/column doesn't have an impact on performance. That's just because your code isn't fast enough! We are very disappointed. Very. > I added that easily, tough admittedly it's a little bit awkward to maintain these variables. If emacs has a way to tell from the byte-pointer the line/col position (both for strings and buffers), I am happy to use that instead. Since error handling isn't performance-critical it doesn't matter if it's a bit slow. (I'd just count newlines.) 
>> * Are you sure that GC can't run during parsing or that all your Lisp >> objects are reachable directly from the stack? (It's the >> `object_workspace` in particular that's worrying me a bit.) > > That's a very good question. I suppose that object_workspace is invisible to the Lisp VM, as it is just a malloc'd object. But I've never seen a problem because of this. What triggers the GC? Is it possible that for the duration of the whole parsing, GC is never get triggered? Otherwise it should have GCd the objects in object_workspace, causing problems (I tried this parser in a loop, where GC is caused hundreds of times. In the loop, I compared the result to json-read, everything was fine). You can't test that code is GC-safe, you have to show that it's correct by design. Looking at the code it is quite possible that GC cannot take place. But it can signal errors, and getting into the debugger should open GC windows unless I'm mistaken. There are some options. `record_unwind_protect_ptr_mark` would be one, and it was made for code like this, but Gerd has been grumbling about it lately. Perhaps it's easier just to disable GC in the dynamic scope (inhibit_garbage_collection). ^ permalink raw reply [flat|nested] 51+ messages in thread
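Materialising line and column from the byte offset on the error path, as suggested above, only needs a linear scan at error time, so it costs nothing in the hot path. A byte-oriented sketch (the column here counts bytes; multibyte characters would need extra decoding):

```c
#include <stddef.h>

/* Compute 1-based line and column for byte OFFSET in INPUT by
   counting newlines up to that point.  Runs only when reporting an
   error, so O(n) is acceptable.  */
static void
offset_to_line_col (const char *input, size_t offset,
                    size_t *line, size_t *col)
{
  size_t l = 1, c = 1;
  for (size_t i = 0; i < offset; i++)
    {
      if (input[i] == '\n')
        {
          l++;
          c = 1;
        }
      else
        c++;
    }
  *line = l;
  *col = c;
}
```

For json-parse-buffer, the scan could start from the beginning of the buffer rather than the parse region, which would also fix the "local to the current parsing" caveat mentioned above.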
* Re: I created a faster JSON parser 2024-03-12 9:26 ` Mattias Engdegård @ 2024-03-12 10:20 ` Gerd Möllmann 2024-03-12 11:14 ` Mattias Engdegård 2024-03-15 13:35 ` Herman, Géza 2024-03-12 10:58 ` Herman, Géza 1 sibling, 2 replies; 51+ messages in thread From: Gerd Möllmann @ 2024-03-12 10:20 UTC (permalink / raw) To: Mattias Engdegård Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel Mattias Engdegård <mattias.engdegard@gmail.com> writes: > There are some options. `record_unwind_protect_ptr_mark` would be one, > and it was made for code like this, but Gerd has been grumbling about > it lately. Perhaps it's easier just to disable GC in the dynamic scope > (inhibit_garbage_collection). Yes, I'm grumbling because record_..._mark assumes GC is mark-sweep. And it's really super ugly to work around. And if you disable GC, I'll be grumbling because that's not good for a concurrent GC. What do I propose: If you can, please make objects reachable from the control stack. For instance, create a context object on the stack that has a Lisp vector member. That will prevent GC'ing the vector and by that its contents, of course. You could pass a pointer to the context around to get access to the vector in subroutines, replace it with a larger vector if necessary, and so on. Howsoever, that needed to be said! ;-). ^ permalink raw reply [flat|nested] 51+ messages in thread
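The shape of Gerd's proposal, with a plain malloc'd array standing in for the Lisp vector (in the real patch the member would be a Lisp_Object, kept live because the context struct is itself a local variable visible to Emacs's stack-scanning GC; everything below is an illustrative sketch, not the patch's code):

```c
#include <stdlib.h>

/* Context object that lives on the caller's control stack and is
   passed by pointer to subroutines.  */
struct parser_ctx
{
  int *workspace;        /* in Emacs: a Lisp vector (Lisp_Object) */
  size_t used, capacity;
};

/* Append VALUE, swapping in a larger workspace when full -- the
   "replace it with a larger vector if necessary" step.  */
static void
workspace_push (struct parser_ctx *ctx, int value)
{
  if (ctx->used == ctx->capacity)
    {
      size_t new_cap = ctx->capacity ? 2 * ctx->capacity : 16;
      int *bigger = realloc (ctx->workspace, new_cap * sizeof *bigger);
      if (!bigger)
        abort ();        /* Emacs would signal memory_full instead */
      ctx->workspace = bigger;
      ctx->capacity = new_cap;
    }
  ctx->workspace[ctx->used++] = value;
}
```

With a Lisp vector instead of malloc'd memory, the old, smaller vector would simply be left for GC after the copy rather than freed explicitly.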
* Re: I created a faster JSON parser 2024-03-12 10:20 ` Gerd Möllmann @ 2024-03-12 11:14 ` Mattias Engdegård 2024-03-12 11:33 ` Gerd Möllmann 2024-03-15 13:35 ` Herman, Géza 1 sibling, 1 reply; 51+ messages in thread From: Mattias Engdegård @ 2024-03-12 11:14 UTC (permalink / raw) To: Gerd Möllmann Cc: "Herman, Géza", Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel 12 mars 2024 kl. 11.20 skrev Gerd Möllmann <gerd.moellmann@gmail.com>: > Yes, I'm grumbling because record_..._mark assumes GC is mark-sweep. And > it's really super ugly to work around. > > And if you disable GC, I'll be grumbling because that's not good for a > concurrent GC. These are problems we clearly need to solve, but I don't think we should hold Géza's improved JSON parser hostage to our desired future GC demands. Let's help him finish his work in the Emacs that we have today and be explicit about our assumptions and needs. The proposed code's requirements are not unique, nor do the solutions need to be. > What do I propose: If you can, please make objects reachable from the > control stack. For instance, create a context object on the stack that > has a Lisp vector member. The problem is that with the GC that we actually have, xmalloc/xfree is a lot faster than creating a Lisp vector and letting the GC collect it later on. One possible way out might be providing a way to free a vector ahead of time, akin to free_cons, and create a vector type with a fill pointer so that it does not need to be initialised nor traversed in full if only part of it is used. The last part is straightforward, although it complicates vectorlike traversal a bit. Ahead-of-time freeing might be unnecessary with a modern collector but that needs to be verified. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 11:14 ` Mattias Engdegård @ 2024-03-12 11:33 ` Gerd Möllmann 0 siblings, 0 replies; 51+ messages in thread From: Gerd Möllmann @ 2024-03-12 11:33 UTC (permalink / raw) To: Mattias Engdegård Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel Mattias Engdegård <mattias.engdegard@gmail.com> writes: > The problem is that with the GC that we actually have, xmalloc/xfree > is a lot faster than creating a Lisp vector and letting the GC collect > it later on. Will any user ever notice a difference? I doubt it. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 10:20 ` Gerd Möllmann 2024-03-12 11:14 ` Mattias Engdegård 2024-03-15 13:35 ` Herman, Géza 1 sibling, 2 replies; 51+ messages in thread From: Herman, Géza @ 2024-03-15 13:35 UTC (permalink / raw) To: Gerd Möllmann Cc: Mattias Engdegård, Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel Gerd Möllmann <gerd.moellmann@gmail.com> writes: > Mattias Engdegård <mattias.engdegard@gmail.com> writes: > >> There are some options. `record_unwind_protect_ptr_mark` would >> be one, >> and it was made for code like this, but Gerd has been grumbling >> about >> it lately. Perhaps it's easier just to disable GC in the >> dynamic scope >> (inhibit_garbage_collection). > > Yes, I'm grumbling because record_..._mark assumes GC is > mark-sweep. And > it's really super ugly to work around. > > And if you disable GC, I'll be grumbling because that's not good > for a > concurrent GC. > > What do I propose: If you can, please make objects reachable > from the > control stack. For instance, create a context object on the > stack that > has a Lisp vector member. That will prevent GC'ing the vector > and by > that its contents, of course. You could pass a pointer to the > context > around to get access to the vector in subroutines, replace it > with a > larger vector if necessary, and so on. I implemented this idea, here is the latest version (now, object_workspace is a Lisp_Object): https://github.com/geza-herman/emacs/commit/12decaddc9b8260745be4d5782fea21f4578eab2.patch Is there anything that needs to be done? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-15 13:35 ` Herman, Géza @ 2024-03-15 14:56 ` Gerd Möllmann 2024-03-19 18:49 ` Mattias Engdegård 1 sibling, 0 replies; 51+ messages in thread From: Gerd Möllmann @ 2024-03-15 14:56 UTC (permalink / raw) To: Herman, Géza Cc: Mattias Engdegård, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel "Herman, Géza" <geza.herman@gmail.com> writes: >> What do I propose: If you can, please make objects reachable from the >> control stack. For instance, create a context object on the stack >> that has a Lisp vector member. That will prevent GC'ing the vector >> and by that its contents, of course. You could pass a pointer to the >> context around to get access to the vector in subroutines, replace it >> with a larger vector if necessary, and so on. > > I implemented this idea, here is the latest version (now, > object_workspace is a Lisp_Object): > > https://github.com/geza-herman/emacs/commit/12decaddc9b8260745be4d5782fea21f4578eab2.patch > > Is there anything that need to be done? Thanks, that looks very good, I'm happy :-). By putting a struct json_parser on the stack as a local variable, its .object_workspace itself and everything rechable from it are safe from GC. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-15 13:35 ` Herman, Géza 2024-03-15 14:56 ` Gerd Möllmann @ 2024-03-19 18:49 ` Mattias Engdegård 2024-03-19 19:05 ` Herman, Géza 2024-03-19 19:13 ` Gerd Möllmann 1 sibling, 2 replies; 51+ messages in thread From: Mattias Engdegård @ 2024-03-19 18:49 UTC (permalink / raw) To: "Herman, Géza" Cc: Gerd Möllmann, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel 15 mars 2024 kl. 14.35 skrev Herman, Géza <geza.herman@gmail.com>: > I implemented this idea, here is the latest version (now, object_workspace is a Lisp_Object): That's considerably slower than before, especially for small inputs, so it's probably a no-go. I definitely want to make Gerd's efforts no harder than necessary but really don't think we should pay the cost up-front like this. If required for a better GC then fine, but let's do that work in the branch for that effort instead of incurring overhead in Emacs now. > Is there anything that need to be done? It's probably a good idea to open a bug for this task, since it provides a better and more focussed record of the discussion, with a single bug number. Here are some remaining tasks: 1. Go back to the previous version which was much faster. 2. Don't allocate any temporary storage before you know that it's actually necessary. 3. Stop using the object_workspace when not required, which is everywhere except possibly when reading arrays into Lisp vectors. 4. Write a JSON serialiser. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-19 18:49 ` Mattias Engdegård @ 2024-03-19 19:05 ` Herman, Géza 2024-03-19 19:18 ` Gerd Möllmann 2024-03-19 19:13 ` Gerd Möllmann 1 sibling, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-19 19:05 UTC (permalink / raw) To: Mattias Engdegård Cc: Herman, Géza, Gerd Möllmann, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel Mattias Engdegård <mattias.engdegard@gmail.com> writes: > On 15 March 2024 at 14:35, Herman, Géza > <geza.herman@gmail.com> wrote: > >> I implemented this idea, here is the latest version (now, >> object_workspace is a Lisp_Object): > > That's considerably slower than before, especially for small > inputs, so it's probably a no-go. Have you benchmarked it? I really doubt that there is a significant performance difference. Maybe it's possible to find some case where this modification matters a bit, but for most cases the difference should be negligible. The number of allocations is the same in both cases: even for large files, it is only 20-30 allocations, compared to the thousands of allocations needed to store the actual Lisp objects. > Here are some remaining tasks: > > 2. Don't allocate any temporary storage before you know that > it's actually necessary. Except for the most trivial cases, memory allocation is always necessary. I can lower the allocation sizes, but I don't think it's worth it. We're only talking about sizes in the KB range. > 3. Stop using the object_workspace when not required, which is > everywhere except possibly when reading arrays into Lisp > vectors. object_workspace is necessary whenever a JSON document contains an object or array. So basically always. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-19 19:05 ` Herman, Géza @ 2024-03-19 19:18 ` Gerd Möllmann 0 siblings, 0 replies; 51+ messages in thread From: Gerd Möllmann @ 2024-03-19 19:18 UTC (permalink / raw) To: Herman, Géza Cc: Mattias Engdegård, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel "Herman, Géza" <geza.herman@gmail.com> writes: > Mattias Engdegård <mattias.engdegard@gmail.com> writes: > >> On 15 March 2024 at 14:35, Herman, Géza <geza.herman@gmail.com> wrote: >> >>> I implemented this idea, here is the latest version (now, >>> object_workspace is a Lisp_Object): >> >> That's considerably slower than before, especially for small inputs, >> so it's probably a no-go. > Have you benchmarked it? I really doubt that there is a significant > performance difference. Maybe it's possible to find some case, where > this modification matters a bit, but for most cases, the difference > should be negligible. The amount of allocations done is the same in > both cases, which is for large files, only 20-30 allocations. > Compared to the thousands of allocations for storing the actual Lisp > objects. Right, if N is the number of objects allocated for JSON, the usual strategy of doubling buffer sizes yields log(N) buffer allocations. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-19 18:49 ` Mattias Engdegård 2024-03-19 19:05 ` Herman, Géza @ 2024-03-19 19:13 ` Gerd Möllmann 1 sibling, 0 replies; 51+ messages in thread From: Gerd Möllmann @ 2024-03-19 19:13 UTC (permalink / raw) To: Mattias Engdegård Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel Mattias Engdegård <mattias.engdegard@gmail.com> writes: > 15 mars 2024 kl. 14.35 skrev Herman, Géza <geza.herman@gmail.com>: > >> I implemented this idea, here is the latest version (now, object_workspace is a Lisp_Object): > > That's considerably slower than before, especially for small inputs, > so it's probably a no-go. Can you give numbers? > > I definitely want to make Gerd's efforts no harder than necessary but > really don't think we should do pay the cost up-front like this. If > required for a better GC then fine, but let's do that work in the > branch for that effort instead of incurring overhead in Emacs now. Please don't take my GC efforts into consideration. That may succeed or not. But this is also a matter of good design, using the stack, (which BTW pdumper does, too), vs. bad design. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 9:26 ` Mattias Engdegård 2024-03-12 10:20 ` Gerd Möllmann @ 2024-03-12 10:58 ` Herman, Géza 2024-03-12 13:11 ` Mattias Engdegård 1 sibling, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-12 10:58 UTC (permalink / raw) To: Mattias Engdegård Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel Mattias Engdegård <mattias.engdegard@gmail.com> writes: >> I added that easily, though admittedly it's a little bit >> awkward to >> maintain these variables. If Emacs has a way to tell from the >> byte-pointer the line/col position (both for strings and >> buffers), I >> am happy to use that instead. > > Since error handling isn't performance-critical it doesn't > matter if it's a bit slow. (I'd just count newlines.) Basically this is what the current code does; it's just not postponed until an actual error, but registered during parsing. I'm tempted to keep it as is, as line/col information can be useful in other circumstances as well. For example, if we wanted to tag the created objects with their source location. > You can't test that code is GC-safe, you have to show that it's > correct by design. Sure, but there has to be an explanation why the current way doesn't have any problems. Having a glance at garbage collection in Emacs, it seems that it only runs during elisp code execution (is this true?). As the parser doesn't call back to any elisp code which runs the VM, it should be safe. If GC could happen at any time, then I suppose the whole Emacs C code would have to be checked for that, because one could never be sure that something allocated on the C side wouldn't be freed by the GC at the very next moment. Conceptually, I see no difference between calling a single Fcons vs. what the whole parser does. If calling Fcons and then using its result is safe, then the parser should also be safe. Or is there some magic about the GC which makes this argument false? 
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 10:58 ` Herman, Géza @ 2024-03-12 13:11 ` Mattias Engdegård 2024-03-12 13:42 ` Mattias Engdegård 2024-03-12 15:23 ` Herman, Géza 0 siblings, 2 replies; 51+ messages in thread From: Mattias Engdegård @ 2024-03-12 13:11 UTC (permalink / raw) To: "Herman, Géza" Cc: Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel, Gerd Möllmann On 12 March 2024 at 11:58, Herman, Géza <geza.herman@gmail.com> wrote: >> You can't test that code is GC-safe, you have to show that it's correct by design. > > Sure, but there has to be an explanation why the current way doesn't have any problems. It doesn't matter -- we don't need to prove to you why you don't get a segfault, it's you who need to convince us that your code is fine. Most C code does not need any special attention as long as it only places Lisp roots in local variables, i.e., on the C stack and in registers. Your code keeps live roots in heap-allocated memory. It's not alone in doing that, but elsewhere we give those roots special attention, either by explicit GC marking or disabling GC during the lifespan of those roots. As I said your code is probably safe unless it signals an error. An alternative is to do as Gerd suggested and only have roots on the stack, by allocating vectors (or lists) for the task. However: On 12 March 2024 at 12:33, Gerd Möllmann <gerd.moellmann@gmail.com> wrote: > Will any user ever notice a difference? I doubt it. I can't say, and these things are not always easy to measure. A Lisp allocation has these drawbacks:

- Allocation requires initialisation of the entire block, not just the used part.
- Deallocation is delayed until the next GC.
- The cache lines are effectively wasted.
- If the code is run again before the next GC, it will have to allocate more Lisp objects.
- GC happens more frequently because allocation advances the GC clock.
- GC takes longer because it has to sweep more dead objects.
One compromise in this case (json_parse_array) would be to collect array elements into a Lisp list, which is then turned into a Lisp vector. In the latter case we could use free_cons to undo the consing if we want. Consing and list traversal are slow, but perhaps not prohibitively so; measurement is needed. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 13:11 ` Mattias Engdegård @ 2024-03-12 13:42 ` Mattias Engdegård 2024-03-12 15:23 ` Herman, Géza 1 sibling, 0 replies; 51+ messages in thread From: Mattias Engdegård @ 2024-03-12 13:42 UTC (permalink / raw) To: "Herman, Géza" Cc: Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel, Gerd Möllmann On 12 March 2024 at 14:11, Mattias Engdegård <mattias.engdegard@gmail.com> wrote: > As I said your code is probably safe unless it signals an error. Actually, as long as the heap-allocated scratch space isn't touched during error handling and unwind, that should not be a problem either, so we may be entirely safe here. We should explain why in the code, of course. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 13:11 ` Mattias Engdegård 2024-03-12 13:42 ` Mattias Engdegård @ 2024-03-12 15:23 ` Herman, Géza 2024-03-12 15:39 ` Gerd Möllmann 1 sibling, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-12 15:23 UTC (permalink / raw) To: Mattias Engdegård Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel, Gerd Möllmann Mattias Engdegård <mattias.engdegard@gmail.com> writes: > 12 mars 2024 kl. 11.58 skrev Herman, Géza > <geza.herman@gmail.com>: > >>> You can't test that code is GC-safe, you have to show that >>> it's correct by design. >> >> Sure, but there has to be an explanation why the current way >> doesn't have any problems. > > It doesn't matter -- we don't need to prove to you why you don't > get a segfault, it's you who need to convince us that your code > is fine. > > Most C code does not need any special attention as long as it > only > places Lisp roots in local variables, ie, on the C stack and in > registers. Your code keeps live roots in heap-allocated > memory. It's > not alone in doing that, but elsewhere we give those roots > special > attention, either by explicit GC marking or disabling GC during > the > lifespan of those roots. > > As I said your code is probably safe unless it signals an error. Sorry, but I don't understand. If my parser puts roots in heap-allocated memory (that's true), and doesn't give such roots special attention (that's also true), then why is it probably safe? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 15:23 ` Herman, Géza @ 2024-03-12 15:39 ` Gerd Möllmann 0 siblings, 0 replies; 51+ messages in thread From: Gerd Möllmann @ 2024-03-12 15:39 UTC (permalink / raw) To: Herman, Géza Cc: Mattias Engdegård, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel "Herman, Géza" <geza.herman@gmail.com> writes: > Sorry, but I don't understand. If my parser puts roots in > heap-allocated memory (that's true), and doesn't give such roots > special attention (that's also true), then why is it probably safe? Assuming GC doesn't run, nothing can happen. For me personally, that assumption would be too dangerous. At one point, Emacs' regexp engine made that assumption, GC did run because of something N levels down the call stack, GC compacted strings, and all that was really nice to debug :-). My 2 cents :-) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 20:37 ` Christopher Wellons 2024-03-10 6:31 ` Eli Zaretskii @ 2024-03-10 6:58 ` Herman, Géza 2024-03-10 16:54 ` Christopher Wellons 1 sibling, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-10 6:58 UTC (permalink / raw) To: Christopher Wellons; +Cc: Herman, Géza, emacs-devel@gnu.org Christopher Wellons <wellons@nullprogram.com> writes: >> What do you think? > > In review I noticed a potential pointer overflow in > json_parse_string: > > parser->input_current + 4 <= parser->input_end > > [...] > > In json_make_object_workspace_for and json_byte_workspace_put, a > size > is doubled without an overflow check ("new_workspace_size * 2"). Thanks for the review and finding these problems! I fixed them: https://github.com/geza-herman/emacs/commit/cbbf3dd494034750ff324703e64f1125a1056832.patch > But this JSON parser is tightly > coupled with the Emacs Lisp runtime, which greatly complicates > things. I couldn't simply pluck it out by itself and drop it in, > say, > AFL++. Yes, it needs some work. The Lisp Object creation is only done at very specific places, so it's easy to remove (actually, I wrote this parser outside of Emacs, and then just put it in by adding the necessary Lisp Object creation code). Or, if the fuzzer needs the actual output (I mean, the result of the parsing), it shouldn't be too hard to put some code there which provides the output. The other thing is error handling, but it can also easily be replaced by using longjmp. I'm happy to do this work, I'd just need some direction on how to do it. I'm not experienced with fuzz testing, so if you are, I'd be glad if you could give some advice: which fuzz-testing framework to use, which introductory material is worth reading, etc. > As noted earlier, the parser gets its performance edge through > skipping the intermediate steps. This is great! 
That could still > be > accomplished without such tight coupling, allowing for > performance > *and* an interface that is testable and fuzzable in relative > isolation. Yes, I think a SAX-parser-like interface would have very little cost. But honestly, I don't see the point of it. This is a parser for Emacs only. It has a very specific purpose: to make JSON parsing fast in Emacs. It is a small module. Input is JSON, output is Lisp Objects. Working with Lisp Objects inside Emacs is a natural thing; usually there is no need for intermediate representations. So if the only reason to have an Emacs-independent API is to make the parser fuzz-testable, then wouldn't it make more sense to make Emacs fuzz-testable in general? I find this approach more useful, because I think it's not just this parser which can be a sensible target for fuzz testing. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-10 6:58 ` Herman, Géza @ 2024-03-10 16:54 ` Christopher Wellons 2024-03-10 20:41 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Christopher Wellons @ 2024-03-10 16:54 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org > I'd be glad if you could give some advice: which fuzz-testing framework to > use, which introductory material is worth reading, etc. The Jansson repository has a libFuzzer-based fuzz test, which is perhaps a useful example. In it they define LLVMFuzzerTestOneInput, a function which accepts a buffer of input (pointer and length), which they feed into the code under test. That's basically it. In the new parser that buffer would go into json_parse. The tested code is instrumented, and the fuzz tester observes the effect inputs have on control flow, using that information to construct new inputs that explore new execution paths, trying to exercise as many as possible. I'm partial to AFL++, and it's what I reach for first. It also works with GCC. It has two modes, with persistent mode preferred: https://github.com/AFLplusplus/AFLplusplus/blob/stable/instrumentation/README.persistent_mode.md Same in principle, but with control inverted. For seed inputs, a few small JSON documents exercising the parser's features are sufficient. In either case, use -fsanitize=address,undefined so that defective execution paths are more likely to be detected. More assertions would help, too, such as "assert(input_current <= input_end)" in a number of places. Assertions must abort or trap so that the fuzz tester knows it found a defect. Fuzz testing works better in a narrow scope. Ideally only the code being tested is instrumented. If it's running within an Emacs context, and you instrument all of Emacs, the fuzz tester would explore paths in Emacs reachable through the JSON parser rather than focus on the parser itself. That will waste time that could instead be spent exploring the parser. 
You don't need to allocate lisp objects during fuzz testing. In fact, you should avoid it because that would just slow it down. (I even bet it's the bottleneck in the new parser.) Ideally the core parser consumes bytes and produces JSON events, and is agnostic to its greater context. To integrate with Emacs, you'd have additional, separate code that turns JSON events into lisp objects, and which wouldn't be fuzz tested. Written that way, I could hook this core up to one of the above fuzz test interfaces, mock out whatever bits of Emacs might still be there (e.g. ckd_mul: the isolation need not be perfect), feed it the input, and pump events until either error (i.e. bad input detected, which is ignored) or EOF. The fuzz tester uses a timeout to detect infinite loops, which AFL++ will report as "hangs" and save the input for manual investigation. It should exercise JSON numeric parsing, too, at least to the extent that it's not punted to Emacs or strtod (mind your locale!). I'd also make available_depth much smaller so that the fuzzing could exercise failing checks. To get the bulk of the value, the fuzz test does not necessarily need to be checked into source control, or even run as part of a standard test suite. Given a clean, decoupled interface and implementation, it would only take a few minutes to hook up a fuzz test. I was hoping to find just that, but each JSON function has multiple points of contact with Emacs, most especially json_parse_object. I've done such ad-hoc fuzz testing on dozens of programs and libraries to evaluate their quality, and sometimes even improve them. In most cases, if code can be fuzz tested and has never been fuzz tested before, this technique finds fresh bugs in a matter of minutes, if not seconds. When I say it's incredibly effective, I mean it! 
Case in point from a few weeks ago, under similar circumstances, which can also serve as a practical example: https://github.com/editorconfig/editorconfig-core-c/pull/103 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-10 16:54 ` Christopher Wellons @ 2024-03-10 20:41 ` Herman, Géza 2024-03-10 23:22 ` Christopher Wellons 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-10 20:41 UTC (permalink / raw) To: Christopher Wellons; +Cc: Herman, Géza, emacs-devel@gnu.org [-- Attachment #1: Type: text/plain, Size: 1124 bytes --] Christopher Wellons <wellons@nullprogram.com> writes: >> I'd be glad if you could give some advice: which fuzz-testing >> framework >> to use, which introductory material is worth reading, etc. > > I'm partial to AFL++, and it's what I reach for first. It also > works > with GCC. It has two modes, with persistent mode preferred: Thanks so much for the description! I created a standalone version of my parser (I attached it), and used "afl-clang-fast -o json json.c -fsanitize=address,undefined" and afl-fuzz to test it. It's been running for an hour, and the tester hasn't found any problems yet. I discovered a funny clang bug: it incorrectly optimizes around setjmp in do_test(): when json_parser_init runs, it stores the workspace pointer in a register. And if there is an error during JSON parsing, it will always free the pointer which is in that register. But in the meantime (I mean, after json_parser_init, and before the error is thrown), the parser could have updated it. So free() will be called on an already freed block. I had to add a dummy printf("free!\n"); to circumvent this optimization. 
[-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: json.c --] [-- Type: text/x-csrc, Size: 29853 bytes --] #include <stddef.h> #include <stdlib.h> #include <errno.h> #include <stdint.h> #include <string.h> #include <math.h> #include <setjmp.h> #include <stdio.h> #include <unistd.h> #define ckd_add(R, A, B) __builtin_add_overflow((A), (B), (R)) #define ckd_mul(R, A, B) __builtin_mul_overflow((A), (B), (R)) #define bool _Bool #define true 1 #define false 0 struct json_configuration { }; typedef struct { size_t value; } Lisp_Object; struct json_parser { /* Because of a possible gap in the input (an emacs buffer can have a gap), the input is described by [input_begin;input_end) and [secondary_input_begin;secondary_input_end). If the input is continuous, then secondary_input_begin and secondary_input_end should be NULL */ const unsigned char *input_current; const unsigned char *input_begin; const unsigned char *input_end; const unsigned char *secondary_input_begin; const unsigned char *secondary_input_end; int current_line; int current_column; /* The parser has a maximum allowed depth. available_depth decreases at each object/array begin. 
If reaches zero, then an error is generated */ int available_depth; struct json_configuration conf; size_t additional_bytes_count; /* Lisp_Objects are collected in this area during object/array parsing */ Lisp_Object *object_workspace; Lisp_Object *object_workspace_end; Lisp_Object *object_workspace_current; /* String and number parsing uses this workspace */ unsigned char *byte_workspace; unsigned char *byte_workspace_end; unsigned char *byte_workspace_current; }; jmp_buf signal_env; Lisp_Object Qjson_out_of_memory; Lisp_Object Qjson_end_of_file; Lisp_Object Qjson_escape_sequence_error; Lisp_Object Qjson_utf8_decode_error; Lisp_Object Qjson_invalid_surrogate_error; Lisp_Object Qjson_parse_error; Lisp_Object Qjson_error; Lisp_Object Qjson_number_out_of_range; Lisp_Object Qjson_object_too_deep; Lisp_Object Qjson_trailing_content; static _Noreturn void json_signal_error(struct json_parser *parser, Lisp_Object error) { longjmp(signal_env, 1); /* xsignal2(error, INT_TO_INTEGER(parser->current_line), INT_TO_INTEGER(parser->current_column)); */ } static void json_parser_init(struct json_parser *parser, struct json_configuration conf, const unsigned char *input, const unsigned char *input_end, const unsigned char *secondary_input, const unsigned char *secondary_input_end) { const int initial_workspace_size = 64; const int initial_string_workspace_size = 512; if (secondary_input >= secondary_input_end) { secondary_input = NULL; secondary_input_end = NULL; } if (input < input_end) { parser->input_begin = input; parser->input_end = input_end; parser->secondary_input_begin = secondary_input; parser->secondary_input_end = secondary_input_end; } else { parser->input_begin = secondary_input; parser->input_end = secondary_input_end; parser->secondary_input_begin = NULL; parser->secondary_input_end = NULL; } parser->input_current = parser->input_begin; parser->current_line = 1; parser->current_column = 0; parser->available_depth = 8; parser->conf = conf; 
parser->additional_bytes_count = 0; parser->object_workspace = malloc(initial_workspace_size * sizeof(Lisp_Object)); parser->object_workspace_end = parser->object_workspace + initial_workspace_size; parser->object_workspace_current = parser->object_workspace; parser->byte_workspace = malloc(initial_string_workspace_size); parser->byte_workspace_end = parser->byte_workspace + initial_string_workspace_size; } static void json_parser_done(void *parser) { struct json_parser *p = (struct json_parser *)parser; /* printf("free: %p %p\n", p->object_workspace, p->byte_workspace); */ free(p->object_workspace); free(p->byte_workspace); } /* Makes sure that the object_workspace has 'size' available space for Lisp_Objects */ static void json_make_object_workspace_for(struct json_parser *parser, size_t size) { size_t available_size = parser->object_workspace_end - parser->object_workspace_current; if (available_size >= size) { return; } size_t needed_workspace_size = (parser->object_workspace_current - parser->object_workspace + size); size_t new_workspace_size = parser->object_workspace_end - parser->object_workspace; while (new_workspace_size < needed_workspace_size) { if (ckd_mul(&new_workspace_size, new_workspace_size, 2)) { json_signal_error(parser, Qjson_out_of_memory); } } size_t offset = parser->object_workspace_current - parser->object_workspace; parser->object_workspace = realloc(parser->object_workspace, new_workspace_size * sizeof(Lisp_Object)); parser->object_workspace_end = parser->object_workspace + new_workspace_size; parser->object_workspace_current = parser->object_workspace + offset; } static void json_byte_workspace_reset(struct json_parser *parser) { parser->byte_workspace_current = parser->byte_workspace; } /* Puts 'value' into the byte_workspace. 
If there is no space available, it allocates space */ static void json_byte_workspace_put(struct json_parser *parser, unsigned char value) { if (parser->byte_workspace_current < parser->byte_workspace_end) { *parser->byte_workspace_current++ = value; return; } size_t new_workspace_size = parser->byte_workspace_end - parser->byte_workspace; if (ckd_mul(&new_workspace_size, new_workspace_size, 2)) { json_signal_error(parser, Qjson_out_of_memory); } size_t offset = parser->byte_workspace_current - parser->byte_workspace; parser->byte_workspace = realloc(parser->byte_workspace, new_workspace_size); parser->byte_workspace_end = parser->byte_workspace + new_workspace_size; parser->byte_workspace_current = parser->byte_workspace + offset; *parser->byte_workspace_current++ = value; } static bool json_input_at_eof(struct json_parser *parser) { if (parser->input_current < parser->input_end) return false; return parser->secondary_input_end == NULL; } /* If there is a secondary buffer, it switches to it */ static int json_input_switch_to_secondary(struct json_parser *parser) { if (parser->secondary_input_begin < parser->secondary_input_end) { parser->additional_bytes_count = parser->input_end - parser->input_begin; parser->input_begin = parser->secondary_input_begin; parser->input_end = parser->secondary_input_end; parser->input_current = parser->secondary_input_begin; parser->secondary_input_begin = NULL; parser->secondary_input_end = NULL; return 0; } else return -1; } /* Reads a byte from the JSON input stream */ static unsigned char json_input_get(struct json_parser *parser) { if (parser->input_current >= parser->input_end && json_input_switch_to_secondary(parser) < 0) json_signal_error(parser, Qjson_end_of_file); return *parser->input_current++; } /* Reads a byte from the JSON input stream, if the stream is not at * eof. 
At eof, returns -1 */ static int json_input_get_if_possible(struct json_parser *parser) { if (parser->input_current >= parser->input_end && json_input_switch_to_secondary(parser) < 0) return -1; return *parser->input_current++; } /* Puts back the last read input byte. Only one byte can be put back, because otherwise this code would need to handle switching from the secondary buffer to the initial */ static void json_input_put_back(struct json_parser *parser) { parser->input_current--; } static bool json_skip_whitespace_internal(struct json_parser *parser, int c) { parser->current_column++; if (c == 0x20 || c == 0x09 || c == 0x0d) return false; else if (c == 0x0a) { parser->current_line++; parser->current_column = 0; return false; } else return true; } /* Skips JSON whitespace, and returns with the first non-whitespace * character */ static int json_skip_whitespace(struct json_parser *parser) { for (;;) { int c = json_input_get(parser); if (json_skip_whitespace_internal(parser, c)) return c; } } /* Skips JSON whitespace, and returns with the first non-whitespace * character, if possible. 
If there is no non-whitespace character * (because we reached the end), it returns -1 */ static int json_skip_whitespace_if_possible(struct json_parser *parser) { for (;;) { int c = json_input_get_if_possible(parser); if (c < 0) return c; if (json_skip_whitespace_internal(parser, c)) return c; } } static int json_hex_value(int c) { if (c >= '0' && c <= '9') return c - '0'; else if (c >= 'A' && c <= 'F') return c - 'A' + 10; else if (c >= 'a' && c <= 'f') return c - 'a' + 10; else return -1; } /* Parses the CCCC part of the unicode escape sequence \uCCCC */ static int json_parse_unicode(struct json_parser *parser) { unsigned char v[4]; for (int i = 0; i < 4; i++) { int c = json_hex_value(json_input_get(parser)); parser->current_column++; if (c < 0) json_signal_error(parser, Qjson_escape_sequence_error); v[i] = c; } return v[0] << 12 | v[1] << 8 | v[2] << 4 | v[3]; } /* Parses an utf-8 code-point encoding (except the first byte), and returns the numeric value of the code-point (without considering the first byte) */ static int json_handle_utf8_tail_bytes(struct json_parser *parser, int n) { int v = 0; for (int i = 0; i < n; i++) { int c = json_input_get(parser); json_byte_workspace_put(parser, c); if ((c & 0xc0) != 0x80) json_signal_error(parser, Qjson_utf8_decode_error); v = (v << 6) | (c & 0x3f); } return v; } /* Reads a JSON string, and puts the result into the byte workspace */ static void json_parse_string(struct json_parser *parser) { /* a single_uninteresting byte can be simply copied from the input to output, it doesn't need any extra care. 
*/ static const char is_single_uninteresting[256] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, }; for (;;) { /* This if is only here for a possible speedup. If there are 4 bytes available, and all of them are single_uninteresting, then we can just copy these 4 bytes to output */ if (parser->input_end - parser->input_current >= 4) { int c0 = parser->input_current[0]; int c1 = parser->input_current[1]; int c2 = parser->input_current[2]; int c3 = parser->input_current[3]; bool v0 = is_single_uninteresting[c0]; bool v1 = is_single_uninteresting[c1]; bool v2 = is_single_uninteresting[c2]; bool v3 = is_single_uninteresting[c3]; if (v0 && v1 && v2 && v3) { json_byte_workspace_put(parser, c0); json_byte_workspace_put(parser, c1); json_byte_workspace_put(parser, c2); json_byte_workspace_put(parser, c3); parser->input_current += 4; parser->current_column += 4; continue; } } int c = json_input_get(parser); parser->current_column++; if (is_single_uninteresting[c]) { json_byte_workspace_put(parser, c); continue; } if (c == '"') return; else if (c & 0x80) { /* Handle utf-8 encoding */ json_byte_workspace_put(parser, c); if (c < 0xc0) json_signal_error(parser, Qjson_utf8_decode_error); else if (c < 0xe0) { int n = ((c & 0x1f) << 6 | 
                  json_handle_utf8_tail_bytes (parser, 1));
          if (n < 0x80)
            json_signal_error (parser, Qjson_utf8_decode_error);
        }
      else if (c < 0xf0)
        {
          int n = ((c & 0xf) << 12
                   | json_handle_utf8_tail_bytes (parser, 2));
          if (n < 0x800 || (n >= 0xd800 && n < 0xe000))
            json_signal_error (parser, Qjson_utf8_decode_error);
        }
      else if (c < 0xf8)
        {
          int n = ((c & 0x7) << 18
                   | json_handle_utf8_tail_bytes (parser, 3));
          if (n < 0x10000 || n > 0x10ffff)
            json_signal_error (parser, Qjson_utf8_decode_error);
        }
      else
        json_signal_error (parser, Qjson_utf8_decode_error);
    }
  else if (c == '\\')
    {
      /* Handle escape sequences.  */
      c = json_input_get (parser);
      parser->current_column++;
      if (c == '"')
        json_byte_workspace_put (parser, '"');
      else if (c == '\\')
        json_byte_workspace_put (parser, '\\');
      else if (c == '/')
        json_byte_workspace_put (parser, '/');
      else if (c == 'b')
        json_byte_workspace_put (parser, '\b');
      else if (c == 'f')
        json_byte_workspace_put (parser, '\f');
      else if (c == 'n')
        json_byte_workspace_put (parser, '\n');
      else if (c == 'r')
        json_byte_workspace_put (parser, '\r');
      else if (c == 't')
        json_byte_workspace_put (parser, '\t');
      else if (c == 'u')
        {
          int num = json_parse_unicode (parser);
          /* 'num' is the first half of a surrogate pair.  */
          if (num >= 0xd800 && num < 0xdc00)
            {
              parser->current_column++;
              if (json_input_get (parser) != '\\')
                json_signal_error (parser, Qjson_invalid_surrogate_error);
              parser->current_column++;
              if (json_input_get (parser) != 'u')
                json_signal_error (parser, Qjson_invalid_surrogate_error);
              int num2 = json_parse_unicode (parser);
              if (num2 < 0xdc00 || num2 >= 0xe000)
                json_signal_error (parser, Qjson_invalid_surrogate_error);
              num = (0x10000 + ((num - 0xd800) << 10 | (num2 - 0xdc00)));
            }
          else if (num >= 0xdc00 && num < 0xe000)
            /* 'num' is the second half of a surrogate pair without
               the first half.  */
            json_signal_error (parser, Qjson_invalid_surrogate_error);

          /* UTF-8 encode the code point.  */
          if (num < 0x80)
            json_byte_workspace_put (parser, num);
          else if (num < 0x800)
            {
              json_byte_workspace_put (parser, 0xc0 | num >> 6);
              json_byte_workspace_put (parser, 0x80 | (num & 0x3f));
            }
          else if (num < 0x10000)
            {
              json_byte_workspace_put (parser, 0xe0 | num >> 12);
              json_byte_workspace_put (parser, 0x80 | ((num >> 6) & 0x3f));
              json_byte_workspace_put (parser, 0x80 | (num & 0x3f));
            }
          else
            {
              json_byte_workspace_put (parser, 0xf0 | num >> 18);
              json_byte_workspace_put (parser, 0x80 | ((num >> 12) & 0x3f));
              json_byte_workspace_put (parser, 0x80 | ((num >> 6) & 0x3f));
              json_byte_workspace_put (parser, 0x80 | (num & 0x3f));
            }
        }
      else
        json_signal_error (parser, Qjson_escape_sequence_error);
    }
}

/* If there was no integer overflow while parsing the integer, this
   puts 'value' into the output.  Otherwise it calls string_to_number
   to parse the integer from the byte workspace.  It could always call
   string_to_number, but for performance reasons the parser tries to
   calculate the value while scanning, so in most cases the call to
   string_to_number can be avoided.  */
static Lisp_Object
json_create_integer (struct json_parser *parser, bool integer_overflow,
                     bool negative, ulong value)
{
  Lisp_Object o;
  o.value = value;
  if (!integer_overflow)
    {
      if (negative)
        {
          uintmax_t v = value;
          if (v <= (uintmax_t) INTMAX_MAX + 1)
            return o;
        }
      else
        return o;
    }

  json_byte_workspace_put (parser, 0);
  ptrdiff_t len = strlen ((const char *) parser->byte_workspace);
  if (len != parser->byte_workspace_current - parser->byte_workspace - 1)
    json_signal_error (parser, Qjson_error);
  return o;
}

/* Parses a float using the byte workspace.  */
static Lisp_Object
json_create_float (struct json_parser *parser)
{
  Lisp_Object o;
  json_byte_workspace_put (parser, 0);
  errno = 0;
  char *e;
  double value = strtod ((const char *) parser->byte_workspace, &e);
  bool out_of_range
    = (errno != 0 && (value == HUGE_VAL || value == -HUGE_VAL));
  if (out_of_range)
    json_signal_error (parser, Qjson_number_out_of_range);
  else if ((const unsigned char *) e != parser->byte_workspace_current - 1)
    json_signal_error (parser, Qjson_error);
  else
    {
      o.value = value;
      return o;
    }
}

/* Parses a number.  The first character is the input parameter 'c'.  */
static Lisp_Object
json_parse_number (struct json_parser *parser, int c)
{
  json_byte_workspace_reset (parser);
  json_byte_workspace_put (parser, c);

  Lisp_Object o;
  o.value = 0;

  bool negative = false;
  if (c == '-')
    {
      negative = true;
      c = json_input_get (parser);
      json_byte_workspace_put (parser, c);
      parser->current_column++;
    }
  if (c < '0' || c > '9')
    json_signal_error (parser, Qjson_parse_error);

  /* The idea is that while finding the last character of the number,
     the loop below also tries to calculate the value.  If the parsed
     number is an integer which fits into unsigned long, then the
     parser can use the value of 'integer' right away, instead of
     having to re-parse the byte workspace later.  Ideally, this
     integer should have the same size as a CPU general purpose
     register.  */
  unsigned long integer = c - '0';
  bool integer_overflow = false;

  if (integer == 0)
    {
      if (json_input_at_eof (parser))
        return o;
      c = json_input_get (parser);
    }
  else
    {
      for (;;)
        {
          if (json_input_at_eof (parser))
            return json_create_integer (parser, integer_overflow,
                                        negative, integer);
          c = json_input_get (parser);
          if (c < '0' || c > '9')
            break;
          json_byte_workspace_put (parser, c);
          parser->current_column++;
          integer_overflow |= ckd_mul (&integer, integer, 10);
          integer_overflow |= ckd_add (&integer, integer, c - '0');
        }
    }

  bool is_float = false;
  if (c == '.')
    {
      json_byte_workspace_put (parser, c);
      parser->current_column++;
      is_float = true;
      c = json_input_get (parser);
      json_byte_workspace_put (parser, c);
      parser->current_column++;
      if (c < '0' || c > '9')
        json_signal_error (parser, Qjson_parse_error);
      for (;;)
        {
          if (json_input_at_eof (parser))
            return json_create_float (parser);
          c = json_input_get (parser);
          if (c < '0' || c > '9')
            break;
          json_byte_workspace_put (parser, c);
          parser->current_column++;
        }
    }
  if (c == 'e' || c == 'E')
    {
      json_byte_workspace_put (parser, c);
      parser->current_column++;
      is_float = true;
      c = json_input_get (parser);
      json_byte_workspace_put (parser, c);
      parser->current_column++;
      if (c == '-' || c == '+')
        {
          c = json_input_get (parser);
          json_byte_workspace_put (parser, c);
          parser->current_column++;
        }
      if (c < '0' || c > '9')
        json_signal_error (parser, Qjson_parse_error);
      for (;;)
        {
          if (json_input_at_eof (parser))
            return json_create_float (parser);
          c = json_input_get (parser);
          if (c < '0' || c > '9')
            break;
          json_byte_workspace_put (parser, c);
          parser->current_column++;
        }
    }

  /* 'c' contains a character which is not part of the number, so it
     needs to be put back.  */
  json_input_put_back (parser);

  if (is_float)
    return json_create_float (parser);
  else
    return json_create_integer (parser, integer_overflow, negative, integer);
}

static Lisp_Object json_parse_value (struct json_parser *parser, int c);

/* Parses a JSON array.  */
static Lisp_Object
json_parse_array (struct json_parser *parser)
{
  int c = json_skip_whitespace (parser);

  const size_t begin_offset
    = parser->object_workspace_current - parser->object_workspace;

  if (c != ']')
    {
      parser->available_depth--;
      if (parser->available_depth < 0)
        json_signal_error (parser, Qjson_object_too_deep);

      size_t number_of_elements = 0;
      /* This loop collects the array elements in the object
         workspace.  */
      for (;;)
        {
          Lisp_Object element = json_parse_value (parser, c);
          json_make_object_workspace_for (parser, 1);
          *parser->object_workspace_current++ = element;

          c = json_skip_whitespace (parser);

          number_of_elements++;
          if (c == ']')
            {
              parser->available_depth++;
              break;
            }

          if (c != ',')
            json_signal_error (parser, Qjson_parse_error);

          c = json_skip_whitespace (parser);
        }
    }

  Lisp_Object result;
  const Lisp_Object *b = parser->object_workspace + begin_offset;
  size_t number_of_elements = parser->object_workspace_current - b;
  void *array = malloc (number_of_elements * sizeof (Lisp_Object));
  memcpy (array, b, number_of_elements * sizeof (Lisp_Object));
  result.value = (size_t) array;

  parser->object_workspace_current = parser->object_workspace + begin_offset;

  return result;
}

/* Parses a JSON object.  */
static Lisp_Object
json_parse_object (struct json_parser *parser)
{
  int c = json_skip_whitespace (parser);

  const size_t begin_offset
    = parser->object_workspace_current - parser->object_workspace;

  if (c != '}')
    {
      parser->available_depth--;
      if (parser->available_depth < 0)
        json_signal_error (parser, Qjson_object_too_deep);

      /* This loop collects the object members (key/value pairs) in
         the object workspace.  */
      for (;;)
        {
          if (c != '"')
            json_signal_error (parser, Qjson_parse_error);

          Lisp_Object key;
          json_byte_workspace_reset (parser);
          json_parse_string (parser);
          if (parser->byte_workspace_current > parser->byte_workspace)
            key.value = (parser->byte_workspace[0]
                         + parser->byte_workspace_current[-1]);
          else
            key.value = 0;
          c = json_skip_whitespace (parser);
          if (c != ':')
            json_signal_error (parser, Qjson_parse_error);

          c = json_skip_whitespace (parser);

          Lisp_Object value = json_parse_value (parser, c);
          json_make_object_workspace_for (parser, 2);
          *parser->object_workspace_current++ = key;
          *parser->object_workspace_current++ = value;

          c = json_skip_whitespace (parser);

          if (c == '}')
            {
              parser->available_depth++;
              break;
            }

          if (c != ',')
            json_signal_error (parser, Qjson_parse_error);

          c = json_skip_whitespace (parser);
        }
    }

  Lisp_Object result;
  Lisp_Object *end = parser->object_workspace_current;
  Lisp_Object *member = parser->object_workspace + begin_offset;
  Lisp_Object *array = malloc ((end - member) * sizeof (Lisp_Object));
  while (member < end)
    {
      *array++ = member[0];
      *array++ = member[1];
      member += 2;
    }
  result.value = (size_t) array;

  parser->object_workspace_current = parser->object_workspace + begin_offset;

  return result;
}

/* Token-char is not JSON terminology.  When parsing null/false/true,
   this function tells which characters need to be considered part of
   a token.  For example, if the input is "truesomething", then the
   parser shouldn't treat it as "true" followed by a separate
   "something" token.  Another example: if the input is "truetrue",
   then calling (json-parse-buffer) twice shouldn't produce two
   successful calls which return t, but a parsing error.  */
static bool
json_is_token_char (int c)
{
  return ((c >= 'A' && c <= 'Z')
          || (c >= 'a' && c <= 'z')
          || (c >= '0' && c <= '9')
          || (c == '-'));
}

/* This is the entry point to the value parser; it parses a JSON
   value.  */
Lisp_Object
json_parse_value (struct json_parser *parser, int c)
{
  if (c == '{')
    return json_parse_object (parser);
  else if (c == '[')
    return json_parse_array (parser);
  else if (c == '"')
    {
      json_byte_workspace_reset (parser);
      json_parse_string (parser);
      Lisp_Object result;
      if (parser->byte_workspace_current > parser->byte_workspace)
        result.value = (parser->byte_workspace[0]
                        + parser->byte_workspace_current[-1]);
      else
        result.value = 0;
      return result;
    }
  else if ((c >= '0' && c <= '9') || (c == '-'))
    return json_parse_number (parser, c);
  else
    {
      int c2 = json_input_get (parser);
      int c3 = json_input_get (parser);
      int c4 = json_input_get (parser);
      int c5 = json_input_get_if_possible (parser);

      if (c == 't' && c2 == 'r' && c3 == 'u' && c4 == 'e'
          && (c5 < 0 || !json_is_token_char (c5)))
        {
          if (c5 >= 0)
            json_input_put_back (parser);
          parser->current_column += 4;
          Lisp_Object o;
          o.value = 0;
          return o;
        }
      if (c == 'n' && c2 == 'u' && c3 == 'l' && c4 == 'l'
          && (c5 < 0 || !json_is_token_char (c5)))
        {
          if (c5 >= 0)
            json_input_put_back (parser);
          parser->current_column += 4;
          Lisp_Object o;
          o.value = 1;
          return o;
        }
      if (c == 'f' && c2 == 'a' && c3 == 'l' && c4 == 's' && c5 == 'e')
        {
          int c6 = json_input_get_if_possible (parser);
          if (c6 < 0 || !json_is_token_char (c6))
            {
              if (c6 >= 0)
                json_input_put_back (parser);
              parser->current_column += 5;
              Lisp_Object o;
              o.value = 2;
              return o;
            }
        }

      json_signal_error (parser, Qjson_parse_error);
    }
}

enum ParseEndBehavior
  {
    PARSEENDBEHAVIOR_CheckForGarbage,
    PARSEENDBEHAVIOR_MovePoint
  };

static Lisp_Object
json_parse (struct json_parser *parser,
            enum ParseEndBehavior parse_end_behavior)
{
  int c = json_skip_whitespace (parser);

  Lisp_Object result = json_parse_value (parser, c);

  switch (parse_end_behavior)
    {
    case PARSEENDBEHAVIOR_CheckForGarbage:
      c = json_skip_whitespace_if_possible (parser);
      if (c >= 0)
        json_signal_error (parser, Qjson_trailing_content);
      break;
    case PARSEENDBEHAVIOR_MovePoint:
      break;
    }

  return result;
}

void do_test (unsigned char *buffer, int length)
{
  struct json_parser p;
  struct json_configuration conf;
  /* unsigned char json_data[] = "[12, 34]"; */
  json_parser_init (&p, conf, buffer, buffer + length, NULL, NULL);
  int x = setjmp (signal_env);
  if (x == 0)
    {
      json_parse (&p, PARSEENDBEHAVIOR_CheckForGarbage);
      /* } else { */
      /*   printf("error\n"); */
    }
  printf ("free!\n");
  json_parser_done (&p);
}

__AFL_FUZZ_INIT ();

int main ()
{
#ifdef __AFL_HAVE_MANUAL_CONTROL
  __AFL_INIT ();
#endif

  unsigned char *buf = __AFL_FUZZ_TESTCASE_BUF;  /* must be after __AFL_INIT
                                                    and before __AFL_LOOP!  */

  while (__AFL_LOOP (10000))
    {
      int len = __AFL_FUZZ_TESTCASE_LEN;  /* don't use the macro directly
                                             in a call!  */

      if (len < 8)
        continue;  /* check for a required/useful minimum input length */

      /* Setup function call, e.g. struct target *tmp = libtarget_init () */
      /* Call function to be fuzzed, e.g.: */
      do_test (buf, len);

      /* Reset state, e.g. libtarget_free (tmp) */
    }

  return 0;
}

^ permalink raw reply	[flat|nested] 51+ messages in thread
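[Editorial note: the escape-sequence handler above combines UTF-16 surrogate pairs and hand-encodes the resulting code point as UTF-8. The same arithmetic, pulled out into a standalone sketch for reference (the function names here are mine, not from the patch):

```c
#include <assert.h>

/* Combine a UTF-16 surrogate pair into a code point, as the \u escape
   handler does.  Assumes hi is in [0xd800, 0xdc00) and lo is in
   [0xdc00, 0xe000); the parser signals an error otherwise.  */
static int combine_surrogates (int hi, int lo)
{
  return 0x10000 + ((hi - 0xd800) << 10 | (lo - 0xdc00));
}

/* UTF-8 encode a code point into 'out'; returns the byte count.
   Mirrors the branch structure of the parser's workspace writes.  */
static int utf8_encode (int num, unsigned char *out)
{
  if (num < 0x80)
    {
      out[0] = num;
      return 1;
    }
  else if (num < 0x800)
    {
      out[0] = 0xc0 | num >> 6;
      out[1] = 0x80 | (num & 0x3f);
      return 2;
    }
  else if (num < 0x10000)
    {
      out[0] = 0xe0 | num >> 12;
      out[1] = 0x80 | ((num >> 6) & 0x3f);
      out[2] = 0x80 | (num & 0x3f);
      return 3;
    }
  else
    {
      out[0] = 0xf0 | num >> 18;
      out[1] = 0x80 | ((num >> 12) & 0x3f);
      out[2] = 0x80 | ((num >> 6) & 0x3f);
      out[3] = 0x80 | (num & 0x3f);
      return 4;
    }
}
```

For example, the pair \uD83D\uDE00 combines to U+1F600, which encodes as the four bytes F0 9F 98 80.]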
* Re: I created a faster JSON parser 2024-03-10 20:41 ` Herman, Géza @ 2024-03-10 23:22 ` Christopher Wellons 2024-03-11 9:34 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Christopher Wellons @ 2024-03-10 23:22 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org

> It's been running for an hour, the tester didn't find any problems yet.

Except for an overflow assigning huge floats to the dummy Lisp_Object value — which is a problem with the test, not the parser — this stripped-down version looks robust to me, too. Solid work! I have no further feedback or commentary.

I made a few tweaks to harden the test, which did not change the results:

* Rather than directly using the AFL input buffer, it uses an exactly-sized copy so that it can detect out-of-bounds access on the input.

* Used a custom malloc that initializes memory to garbage.

* Used a custom realloc that always moves, and initializes extended memory to garbage.

> it incorrectly optimizes around setjmp in do_test()

If a program modifies a variable after the first setjmp return and then accesses it after the second setjmp return, the variable must be volatile-qualified. GCC and Clang have some machinery to mitigate a lack of volatile, such as the returns_twice attribute, but technically it's required. I believe in practice using either builtin setjmp/longjmp or a memory barrier would be sufficient, but my system's Clang doesn't exhibit the stale pointer here, so I can't test that theory.

^ permalink raw reply	[flat|nested] 51+ messages in thread
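[Editorial note: the hardening wrappers described above are straightforward to reproduce. A minimal sketch of what a garbage-initializing malloc/realloc pair could look like; the names, the explicit old-size parameter, and the 0xa5 fill byte are my choices, not from the thread:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical hardening wrapper: fill fresh allocations with a junk
   byte so that reads of uninitialized memory are likely to produce
   visibly wrong results instead of accidentally working.  */
static void *garbage_malloc (size_t n)
{
  void *p = malloc (n);
  if (p)
    memset (p, 0xa5, n);
  return p;
}

/* A realloc that always moves: allocate a junk-filled block, copy the
   old contents, free the old block.  Unlike the real realloc, the
   caller must pass the old size, since a wrapper cannot query it.  */
static void *garbage_realloc (void *old, size_t old_n, size_t new_n)
{
  void *p = garbage_malloc (new_n);
  if (p && old)
    {
      memcpy (p, old, old_n < new_n ? old_n : new_n);
      free (old);
    }
  return p;
}
```

Always moving the block means any stale pointer into the old allocation points at freed memory, which tools like ASan can then catch.]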
* Re: I created a faster JSON parser 2024-03-10 23:22 ` Christopher Wellons @ 2024-03-11 9:34 ` Herman, Géza 2024-03-11 13:47 ` Christopher Wellons 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-11 9:34 UTC (permalink / raw) To: Christopher Wellons; +Cc: Herman, Géza, emacs-devel@gnu.org

Christopher Wellons <wellons@nullprogram.com> writes:

>> It's been running for an hour, the tester didn't find any problems yet.
>
> Except for an overflow assigning huge floats to the dummy Lisp_Object
> value — which is a problem with the test, not the parser — this
> stripped-down version looks robust to me, too. Solid work! I have no
> further feedback or commentary.

Cool! Out of curiosity, how did you find the overflow problem? Did you just notice it, or did the fuzzer/sanitizer find it?

> I made a few tweaks to harden the test, which did not change the
> results:

Thanks for the work! I think we can say then that the new parser works OK; it's unlikely that it has any serious problem.

>> it incorrectly optimizes around setjmp in do_test()
>
> If a program modifies a variable after the first setjmp return and
> then accesses it after the second setjmp return, it must be
> volatile-qualified.

Good to know, thanks for the info!

^ permalink raw reply	[flat|nested] 51+ messages in thread
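[Editorial note: the volatile requirement discussed above can be demonstrated in isolation. A minimal sketch, not from the thread, of the pattern that needs the qualifier:

```c
#include <setjmp.h>

static jmp_buf env;

/* Returns the value of a local that is modified between the first and
   second returns of setjmp.  C guarantees the value only because
   'counter' is volatile-qualified; without it, the compiler may keep
   the variable in a register across longjmp, and the second return
   could observe a stale 0.  */
int observed_after_longjmp (void)
{
  volatile int counter = 0;
  if (setjmp (env) == 0)
    {
      counter = 1;        /* modified after the first return */
      longjmp (env, 1);   /* jump back; setjmp now returns 1 */
    }
  return counter;         /* volatile guarantees this is 1 */
}
```

The same reasoning applies to the parser pointer in do_test: any non-volatile local written between setjmp and longjmp has an indeterminate value after the jump.]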
* Re: I created a faster JSON parser 2024-03-11 9:34 ` Herman, Géza @ 2024-03-11 13:47 ` Christopher Wellons 0 siblings, 0 replies; 51+ messages in thread From: Christopher Wellons @ 2024-03-11 13:47 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org

> did the fuzzer/sanitizer find it?

Clang, but not GCC, places a UBSan check on float to integer conversions, and that check was tripped when I fuzzed a Clang build. Example:

    int main(void)
    {
        int x = 1e10;
    }

Nothing from GCC:

    $ gcc -w -g3 -fsanitize=undefined example.c
    $ ./a.out
    $

But with Clang:

    $ clang -w -g3 -fsanitize=undefined example.c
    $ ./a.out
    example.c:3:13: runtime error: 1e+10 is outside the range of
    representable values of type 'int'

^ permalink raw reply	[flat|nested] 51+ messages in thread
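[Editorial note: a defensive alternative to relying on the out-of-range conversion, whose behavior is undefined, is to range-check before casting. A hedged sketch, my code rather than anything from the thread; note the bounds are exact for int because INT_MIN and INT_MAX are exactly representable as doubles:

```c
#include <limits.h>

/* Convert a double to int, returning 'fallback' for NaN and any value
   outside [INT_MIN, INT_MAX], instead of performing an undefined
   out-of-range conversion.  The negated comparison routes NaN to the
   fallback, since all comparisons with NaN are false.  */
static int double_to_int_checked (double d, int fallback)
{
  if (!(d >= (double) INT_MIN && d <= (double) INT_MAX))
    return fallback;
  return (int) d;
}
```

With this guard, double_to_int_checked(1e10, -1) yields -1 instead of tripping UBSan's conversion check.]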
end of thread, other threads:[~2024-03-19 19:18 UTC | newest] Thread overview: 51+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-03-08 10:27 I created a faster JSON parser Herman, Géza 2024-03-08 11:41 ` Philip Kaludercic 2024-03-08 12:34 ` Herman, Géza 2024-03-08 12:03 ` Eli Zaretskii 2024-03-08 12:38 ` Herman, Géza 2024-03-08 12:59 ` Eli Zaretskii 2024-03-08 13:12 ` Herman, Géza 2024-03-08 14:10 ` Eli Zaretskii 2024-03-08 14:24 ` Collin Funk 2024-03-08 15:20 ` Herman, Géza 2024-03-08 16:22 ` Eli Zaretskii 2024-03-08 18:34 ` Herman, Géza 2024-03-08 19:57 ` Eli Zaretskii 2024-03-08 20:22 ` Herman, Géza 2024-03-09 6:52 ` Eli Zaretskii 2024-03-09 11:08 ` Herman, Géza 2024-03-09 12:23 ` Lynn Winebarger 2024-03-09 12:58 ` Po Lu 2024-03-09 13:13 ` Eli Zaretskii 2024-03-09 14:00 ` Herman, Géza 2024-03-09 14:21 ` Eli Zaretskii 2024-03-08 13:28 ` Po Lu 2024-03-08 16:14 ` Herman, Géza 2024-03-09 1:55 ` Po Lu 2024-03-09 20:37 ` Christopher Wellons 2024-03-10 6:31 ` Eli Zaretskii 2024-03-10 21:39 ` Philip Kaludercic 2024-03-11 13:29 ` Eli Zaretskii 2024-03-11 14:05 ` Mattias Engdegård 2024-03-11 14:35 ` Herman, Géza 2024-03-12 9:26 ` Mattias Engdegård 2024-03-12 10:20 ` Gerd Möllmann 2024-03-12 11:14 ` Mattias Engdegård 2024-03-12 11:33 ` Gerd Möllmann 2024-03-15 13:35 ` Herman, Géza 2024-03-15 14:56 ` Gerd Möllmann 2024-03-19 18:49 ` Mattias Engdegård 2024-03-19 19:05 ` Herman, Géza 2024-03-19 19:18 ` Gerd Möllmann 2024-03-19 19:13 ` Gerd Möllmann 2024-03-12 10:58 ` Herman, Géza 2024-03-12 13:11 ` Mattias Engdegård 2024-03-12 13:42 ` Mattias Engdegård 2024-03-12 15:23 ` Herman, Géza 2024-03-12 15:39 ` Gerd Möllmann 2024-03-10 6:58 ` Herman, Géza 2024-03-10 16:54 ` Christopher Wellons 2024-03-10 20:41 ` Herman, Géza 2024-03-10 23:22 ` Christopher Wellons 2024-03-11 9:34 ` Herman, Géza 2024-03-11 13:47 ` Christopher Wellons