* I created a faster JSON parser @ 2024-03-08 10:27 Herman, Géza 2024-03-08 11:41 ` Philip Kaludercic ` (3 more replies) 0 siblings, 4 replies; 51+ messages in thread From: Herman, Géza @ 2024-03-08 10:27 UTC (permalink / raw) To: emacs-devel@gnu.org Hi, I created a faster JSON parser for emacs, you can check it out here: https://github.com/geza-herman/emacs/tree/faster-json-parsing It replaces json-parse-string and json-parse-buffer functions. The behavior should be the same as before, with the only exception that objects with duplicated keys are not detected if :object-type is not 'hash-table. This parser runs 8-9x faster than the jansson based parser on my machine (tested on clangd language server messages). An additional tiny benefit is that large integers are parsed, instead of having an "out of range" error. What do you think? Geza ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 10:27 I created a faster JSON parser Herman, Géza @ 2024-03-08 11:41 ` Philip Kaludercic 2024-03-08 12:34 ` Herman, Géza 2024-03-08 12:03 ` Eli Zaretskii ` (2 subsequent siblings) 3 siblings, 1 reply; 51+ messages in thread From: Philip Kaludercic @ 2024-03-08 11:41 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org "Herman, Géza" <geza.herman@gmail.com> writes: > Hi, > > I created a faster JSON parser for emacs, you can check it out here: > > https://github.com/geza-herman/emacs/tree/faster-json-parsing For convenience, this is a URL to the patch: https://github.com/geza-herman/emacs/commit/d76feeeac3ce397f15aebdde8b9deae676b0bf1e.patch > > It replaces json-parse-string and json-parse-buffer functions. The > behavior should be the same as before, with the only exception that > objects with duplicated keys are not detected if :object-type is not > 'hash-table. Is that a problem? > This parser runs 8-9x faster than the jansson based parser on my > machine (tested on clangd language server messages). An additional > tiny benefit is that large integers are parsed, instead of having an > "out of range" error. That sounds interesting, but I am reminded of this article: https://seriot.ch/projects/parsing_json.html. There seem to be plenty of difficult edge-cases when dealing with JSON input, that should probably be tested if Emacs has it's own custom parser built-in. > What do you think? > > Geza -- Philip Kaludercic on peregrine ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 11:41 ` Philip Kaludercic @ 2024-03-08 12:34 ` Herman, Géza 0 siblings, 0 replies; 51+ messages in thread From: Herman, Géza @ 2024-03-08 12:34 UTC (permalink / raw) To: Philip Kaludercic; +Cc: Herman, Géza, emacs-devel@gnu.org Philip Kaludercic <philipk@posteo.net> writes: > "Herman, Géza" <geza.herman@gmail.com> writes: > >> It replaces json-parse-string and json-parse-buffer >> functions. The >> behavior should be the same as before, with the only exception >> that >> objects with duplicated keys are not detected if :object-type >> is not >> 'hash-table. > > Is that a problem? Not sure. I just mentioned it because it's a behavior change. But I intentionally designed it this way, because it is faster. To me, it makes some sense that if the user specifies 'alist or 'plist, then they want to have all the object members, even if the keys are duplicated. I didn't find a clear direction from JSON descriptions of how duplicated keys should be handled. >> This parser runs 8-9x faster than the jansson based parser on >> my >> machine (tested on clangd language server messages). An >> additional >> tiny benefit is that large integers are parsed, instead of >> having an >> "out of range" error. > > That sounds interesting, but I am reminded of this article: > https://seriot.ch/projects/parsing_json.html. There seem to be > plenty > of difficult edge-cases when dealing with JSON input, that > should > probably be tested if Emacs has it's own custom parser built-in. I've now run my parser on the tests in this repo, it passes all of them. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 10:27 I created a faster JSON parser Herman, Géza 2024-03-08 11:41 ` Philip Kaludercic @ 2024-03-08 12:03 ` Eli Zaretskii 2024-03-08 12:38 ` Herman, Géza 2024-03-08 13:28 ` Po Lu 2024-03-09 20:37 ` Christopher Wellons 3 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-08 12:03 UTC (permalink / raw) To: Géza Herman; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Date: Fri, 08 Mar 2024 11:27:16 +0100 > > This parser runs 8-9x faster than the jansson based parser on my > machine (tested on clangd language server messages). How does it do that? Can you summarize the main ideas of your implementation, which make it so much faster? Thanks. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 12:03 ` Eli Zaretskii @ 2024-03-08 12:38 ` Herman, Géza 2024-03-08 12:59 ` Eli Zaretskii 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-08 12:38 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Herman, Géza <geza.herman@gmail.com> >> Date: Fri, 08 Mar 2024 11:27:16 +0100 >> >> This parser runs 8-9x faster than the jansson based parser on >> my >> machine (tested on clangd language server messages). > > How does it do that? Can you summarize the main ideas of your > implementation, which make it so much faster? My parser creates Lisp objects during parsing, there is no intermediate step as Emacs has with jansson. With jansson, there are a lot of allocations, which my parser doesn't have (my parser has only two buffers, which exponentially grow. There are no other allocations). But even ignoring performance loss because of mallocs (on my dataset, 40% of CPU time goes into malloc/free), I think parsing should be faster, so maybe jansson is not a fast parser in the first place. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 12:38 ` Herman, Géza @ 2024-03-08 12:59 ` Eli Zaretskii 2024-03-08 13:12 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-08 12:59 UTC (permalink / raw) To: Géza Herman; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: Géza Herman <geza.herman@gmail.com>, > emacs-devel@gnu.org > Date: Fri, 08 Mar 2024 13:38:48 +0100 > > > Eli Zaretskii <eliz@gnu.org> writes: > > >> From: Herman, Géza <geza.herman@gmail.com> > >> Date: Fri, 08 Mar 2024 11:27:16 +0100 > >> > >> This parser runs 8-9x faster than the jansson based parser on > >> my > >> machine (tested on clangd language server messages). > > > > How does it do that? Can you summarize the main ideas of your > > implementation, which make it so much faster? > > My parser creates Lisp objects during parsing, there is no > intermediate step as Emacs has with jansson. With jansson, there > are a lot of allocations, which my parser doesn't have (my parser > has only two buffers, which exponentially grow. There are no other > allocations). But even ignoring performance loss because of > mallocs (on my dataset, 40% of CPU time goes into malloc/free), I > think parsing should be faster, so maybe jansson is not a fast > parser in the first place. Thanks. So, if you are willing to contribute this code to Emacs, what is left is: . clean up the code (e.g., currently it still calls the function that on MS-Windows loads the jansson DLL, which is now unneeded) and adjust it to our style and conventions; . thoroughly test the code on the available test suites (or maybe you already did); . for you to assign the copyright to the FSF, without which we cannot accept such a substantial contribution As an additional idea: perhaps initially we'd want to have a configure-time option to either use your parser or jansson, to make the migration and comparison easier. What do others think about this? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 12:59 ` Eli Zaretskii @ 2024-03-08 13:12 ` Herman, Géza 2024-03-08 14:10 ` Eli Zaretskii 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-08 13:12 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > So, if you are willing to contribute this code to Emacs, what is > left > is: > > . clean up the code (e.g., currently it still calls the > function > that on MS-Windows loads the jansson DLL, which is now > unneeded) > and adjust it to our style and conventions; I can remove the WINDOWSNT related #ifs. Are there any more comments? I formatted my code with clangd, is there anything else that should be done? > . thoroughly test the code on the available test suites (or > maybe > you already did); I tested it on github.com/nst/JSONTestSuite, it passes all "y_" and "n_" tests in the directory "test_parsing". If you're aware of other test repositories, I'm happy to test on them as well. > . for you to assign the copyright to the FSF, without which we > cannot accept such a substantial contribution Can you please send me the necessary documents? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 13:12 ` Herman, Géza @ 2024-03-08 14:10 ` Eli Zaretskii 2024-03-08 14:24 ` Collin Funk 2024-03-08 15:20 ` Herman, Géza 0 siblings, 2 replies; 51+ messages in thread From: Eli Zaretskii @ 2024-03-08 14:10 UTC (permalink / raw) To: Géza Herman; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: Géza Herman <geza.herman@gmail.com>, > emacs-devel@gnu.org > Date: Fri, 08 Mar 2024 14:12:19 +0100 > > > > . clean up the code (e.g., currently it still calls the > > function > > that on MS-Windows loads the jansson DLL, which is now > > unneeded) > > and adjust it to our style and conventions; > I can remove the WINDOWSNT related #ifs. Are there any more > comments? I formatted my code with clangd, is there anything else > that should be done? The following is based on an initial reading of the patch: . Redundant braces (for blocks of a single code line) is one issue. . The way you break a long line at the equals sign '=' is another (we break after '=', not before). . The code must not call malloc/realloc/free, but their x* variants, xmalloc/xrealloc/xfree. . The names should be changed to remove the "my_" and "My" prefixes. . Many functions should have a commentary before them explaining their purpose and use, the exception is short function whose purpose is clear from reading the code. . Some of the generic 'json-parse-error' errors should probably be more specific; for example, UTF-8 encoding problems should be flagged as such. . The code which handles integers seems to assume that 'unsigned long' is a 64-bit type? if so, this is not true on Windows; please see how we handle this elsewhere in Emacs, in particular in the WIDE_EMACS_INT case. A more general comment is that you seem to be parsing buffer text assuming it's UTF-8? If so, this is not accurate, as the internal representation is a superset of UTF-8, and can represent characters above 0x10FFFF. There could be other issues as well. > > . thoroughly test the code on the available test suites (or > > maybe > > you already did); > I tested it on github.com/nst/JSONTestSuite, it passes all "y_" > and "n_" tests in the directory "test_parsing". If you're aware > of other test repositories, I'm happy to test on them as well. I don't know about such test suites, but maybe someone else does. > > . for you to assign the copyright to the FSF, without which we > > cannot accept such a substantial contribution > Can you please send me the necessary documents? Sent off-list. Thanks. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 14:10 ` Eli Zaretskii @ 2024-03-08 14:24 ` Collin Funk 2024-03-08 15:20 ` Herman, Géza 1 sibling, 0 replies; 51+ messages in thread From: Collin Funk @ 2024-03-08 14:24 UTC (permalink / raw) To: emacs-devel Hello, On 3/8/24 6:10 AM, Eli Zaretskii wrote: > . The code which handles integers seems to assume that 'unsigned long' > is a 64-bit type? if so, this is not true on Windows; please see how > we handle this elsewhere in Emacs, in particular in the > WIDE_EMACS_INT case. I noticed this in 'my_json_parse_number' as well. Another change that would be nice there is using ckd_add and ckd_mul from stdckdint.h, generated from lib/stdckdint.in.h, to check for integer overflows. Less room for errors compared to checking manually and easier to read. Collin ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 14:10 ` Eli Zaretskii 2024-03-08 14:24 ` Collin Funk @ 2024-03-08 15:20 ` Herman, Géza 2024-03-08 16:22 ` Eli Zaretskii 1 sibling, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-08 15:20 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Herman, Géza <geza.herman@gmail.com> >> Cc: Géza Herman <geza.herman@gmail.com>, >> emacs-devel@gnu.org >> Date: Fri, 08 Mar 2024 14:12:19 +0100 > The following is based on an initial reading of the patch: > > . Redundant braces (for blocks of a single code line) is one > issue. > . The way you break a long line at the equals sign '=' is > another (we > break after '=', not before). I used clang-format to format my code (I use a completely different coding style). I see that clang-format is configured this way in Emacs. Shouldn't BreakBeforeBinaryOperators be set to None or NonAssignment in .clang-format? > . The code which handles integers seems to assume that > 'unsigned long' > is a 64-bit type? if so, this is not true on Windows; please > see how > we handle this elsewhere in Emacs, in particular in the > WIDE_EMACS_INT case. That was a mistake on my part, though a different (but similar) one. I originally used a 64-bit type, but then changed it to long, because of 32-bit architectures. The idea is to use a type which likely has the same size as a CPU register. So I think long is OK, I just need to change the thresholds to ULONG_MAX. Or I think I'll use ckd_* functions as Collin suggested. > A more general comment is that you seem to be parsing buffer > text > assuming it's UTF-8? If so, this is not accurate, as the > internal > representation is a superset of UTF-8, and can represent > characters > above 0x10FFFF. When does a buffer have characters above 0x10ffff? I supposed that a JSON shouldn't contain characters that are out of range. But if the solution is to just remove the upper-range comparison, I can do that easily. >> Can you please send me the necessary documents? > > Sent off-list. Thanks! ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 15:20 ` Herman, Géza @ 2024-03-08 16:22 ` Eli Zaretskii 2024-03-08 18:34 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-08 16:22 UTC (permalink / raw) To: Géza Herman; +Cc: geza.herman, emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: Géza Herman <geza.herman@gmail.com>, > emacs-devel@gnu.org > Date: Fri, 08 Mar 2024 16:20:40 +0100 > > > Eli Zaretskii <eliz@gnu.org> writes: > > > . The way you break a long line at the equals sign '=' is > > another (we > > break after '=', not before). > I used clang-format to format my code (I use a completely > different coding style). I see that clang-format is configured > this way in Emacs. Shouldn't BreakBeforeBinaryOperators be set to > None or NonAssignment in .clang-format? Actually, I see we use both styles, so I guess you can disregard that part. > > . The code which handles integers seems to assume that > > 'unsigned long' > > is a 64-bit type? if so, this is not true on Windows; please > > see how > > we handle this elsewhere in Emacs, in particular in the > > WIDE_EMACS_INT case. > That was a mistake on my part, though a different (but similar) > one. I originally used a 64-bit type, but then changed it to > long, because of 32-bit architectures. The idea is to use a type > which likely has the same size as a CPU register. If you want to use a 32-bit type, use 'int' or 'unsigned int'. > > A more general comment is that you seem to be parsing buffer text > > assuming it's UTF-8? If so, this is not accurate, as the internal > > representation is a superset of UTF-8, and can represent > > characters above 0x10FFFF. > > When does a buffer have characters above 0x10ffff? See the node "Text Representations" in the ELisp manual, it explains that. > I supposed that a JSON shouldn't contain characters that are out of > range. But if the solution is to just remove the upper-range > comparison, I can do that easily. We need to decide what to do with characters outside of the Unicode range. The encoding of those characters is different, btw, it doesn't follow the UTF-8 scheme. In any case, the error in those cases cannot be just "json parse error", it must be something more self-explanatory. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 16:22 ` Eli Zaretskii @ 2024-03-08 18:34 ` Herman, Géza 2024-03-08 19:57 ` Eli Zaretskii 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-08 18:34 UTC (permalink / raw) To: Eli Zaretskii; +Cc: geza.herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Herman, Géza <geza.herman@gmail.com> >> That was a mistake on my part, though a different (but similar) >> one. I originally used a 64-bit type, but then changed it to >> long, because of 32-bit architectures. The idea is to use a >> type >> which likely has the same size as a CPU register. > > If you want to use a 32-bit type, use 'int' or 'unsigned int'. I want to use a datatype which is likely to have the same size as a CPU register. On 32-bit machines, it should be 32-bit, on 64-bit architectures it should be 64-bit. This is not a strong requirement, but it makes the parser faster. During number parsing, this part of the code calculates the value of the number. If it overflows, or it ends up being a float, then the workspace buffer is re-parsed to find the value. But if it is an integer without overflow, the parser just can use the value without any further processing (a certain kind of LSP message is basically a large array of integers, so it makes sense to optimize this case - this optimization makes parsing such messages 1.5-2x faster). > We need to decide what to do with characters outside of the > Unicode > range. The encoding of those characters is different, btw, it > doesn't > follow the UTF-8 scheme. > > In any case, the error in those cases cannot be just "json parse > error", it must be something more self-explanatory. I am open to suggestions, as even after reading "Text Representations" in the manual, I have no idea what to do here. I didn't find any code which handles this case in the jansson parser either. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 18:34 ` Herman, Géza @ 2024-03-08 19:57 ` Eli Zaretskii 2024-03-08 20:22 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-08 19:57 UTC (permalink / raw) To: Herman Géza; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: geza.herman@gmail.com, emacs-devel@gnu.org > Date: Fri, 08 Mar 2024 19:34:28 +0100 > > > If you want to use a 32-bit type, use 'int' or 'unsigned int'. > I want to use a datatype which is likely to have the same size as > a CPU register. On 32-bit machines, it should be 32-bit, on 64-bit > architectures it should be 64-bit. Is there a reason for you to want it to be 64-bit type on a 64-bit machine? If the only bother is efficiency, then you can use 'int' without fear. But if a 64-bit machine will need the range of values beyond INT_MAX (does it?), then I suggest to use ptrdiff_t. > This is not a strong requirement, but it makes the parser faster. Are you sure? AFAIK, 32-bit arithmetics on a 64-bit machine is as fast as 64-bit arithmetics. So if efficiency is the only consideration, using 'int' is OK. > During number parsing, this part of the code calculates the value of > the number. If it overflows, or it ends up being a float, then the > workspace buffer is re-parsed to find the value. But if it is an > integer without overflow, the parser just can use the value without > any further processing (a certain kind of LSP message is basically a > large array of integers, so it makes sense to optimize this case - > this optimization makes parsing such messages 1.5-2x faster). If the value needs to fit in a fixnum, use EMACS_INT, which is a type that already takes the 32-but vs 64-bit nature of the machine into consideration. > > We need to decide what to do with characters outside of the > > Unicode range. The encoding of those characters is different, > > btw, it doesn't follow the UTF-8 scheme. > > > > In any case, the error in those cases cannot be just "json parse > > error", it must be something more self-explanatory. > I am open to suggestions, as even after reading "Text > Representations" in the manual, I have no idea what to do here. We should begin by deciding whether we want to support characters outside of the Unicode range. The error message depends on that decision. > I didn't find any code which handles this case in the jansson parser > either. The jansson code required encoding/decoding strings to make sure we submit to jansson text that is always valid UTF-8. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 19:57 ` Eli Zaretskii @ 2024-03-08 20:22 ` Herman, Géza 2024-03-09 6:52 ` Eli Zaretskii 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-08 20:22 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Herman Géza, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> I want to use a datatype which is likely to have the same size >> as >> a CPU register. On 32-bit machines, it should be 32-bit, on >> 64-bit >> architectures it should be 64-bit. > > Is there a reason for you to want it to be 64-bit type on a > 64-bit > machine? If the only bother is efficiency, then you can use > 'int' > without fear. But if a 64-bit machine will need the range of > values > beyond INT_MAX (does it?), then I suggest to use ptrdiff_t. The only reason is if I use a 64-bit number on a 64-bit platform, then the fast path will be chosen more frequently. So it makes sense to use a register-sized integer here. >> This is not a strong requirement, but it makes the parser >> faster. > > Are you sure? AFAIK, 32-bit arithmetics on a 64-bit machine is > as > fast as 64-bit arithmetics. So if efficiency is the only > consideration, using 'int' is OK. Yes, that's right, but my reasoning is the other way around. I shouldn't choose a larger integer than the platform natively support. I could just use a 64-bit integer on all platforms, this is what I originally had. But then on 32-bit platforms, this part of the parser will be slower, because it unnecessarily will use two registers instead of one (for most cases, as numbers are usually smaller than 32-bit). This whole thing is not big of a deal, as int is probably fine, as it covers most of the commonly used range. But if I can just put a more proper type here, then I think I should do it. > If the value needs to fit in a fixnum, use EMACS_INT, which is a > type > that already takes the 32-but vs 64-bit nature of the machine > into > consideration. Yes, it seems that EMACS_UINT is good for my purpose, thanks for the suggestion. > We should begin by deciding whether we want to support > characters > outside of the Unicode range. The error message depends on that > decision. > >> I didn't find any code which handles this case in the jansson >> parser >> either. > > The jansson code required encoding/decoding strings to make sure > we > submit to jansson text that is always valid UTF-8. I tried to use the jansson parser with a unicode 0x333333 character in a string, and it didn't work, it fails with (json-parse-error "unable to decode byte... message. Also, I see that json-parse-string calls some utf8 encoding related function before parsing, but json-parse-buffer doesn't (and it doesn't do anything encoding related thing in the callback, it just calls memcpy). So based on these, does it have any benefit of supporting these? Out of curiosity, what are these extra characters used for? What is the purpose of the odd special 2-byte encoding of 8-bit characters (I mean where the 1st byte is C0/C1)? Why don't just use the regular utf-8 encoding for these values? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 20:22 ` Herman, Géza @ 2024-03-09 6:52 ` Eli Zaretskii 2024-03-09 11:08 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-09 6:52 UTC (permalink / raw) To: Géza Herman; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: Herman Géza <geza.herman@gmail.com>, > emacs-devel@gnu.org > Date: Fri, 08 Mar 2024 21:22:13 +0100 > > > Is there a reason for you to want it to be 64-bit type on a 64-bit > > machine? If the only bother is efficiency, then you can use 'int' > > without fear. But if a 64-bit machine will need the range of > > values beyond INT_MAX (does it?), then I suggest to use ptrdiff_t. > > The only reason is if I use a 64-bit number on a 64-bit platform, > then the fast path will be chosen more frequently. So it makes > sense to use a register-sized integer here. Then either ptrdiff_t or EMACS_INT should do what you want. > Yes, it seems that EMACS_UINT is good for my purpose, thanks for > the suggestion. Are you sure you need the unsigned variety? If EMACS_INT fits the bill, then it is a better candidate, since unsigned arithmetics has its quirks. > > The jansson code required encoding/decoding strings to make sure > > we submit to jansson text that is always valid UTF-8. > > I tried to use the jansson parser with a unicode 0x333333 > character in a string, and it didn't work, it fails with > (json-parse-error "unable to decode byte... message. Well, I didn't say trying an arbitrary codepoint will demonstrate the issue. Some codepoints above 0x10FFFF indeed cannot be passed to jansson. It's okay if the initial version of this parser only handles the Unicode range and errors out otherwise; we could extend it if needed later. But the error message should talk specifically about invalid character or something, not just a generic "parse error". > Also, I see that json-parse-string calls some utf8 encoding related > function before parsing, but json-parse-buffer doesn't (and it > doesn't do anything encoding related thing in the callback, it just > calls memcpy). This is a part I was never happy about. But, as I say above, we can get to handling these rare cases later. > So based on these, does it have any benefit of supporting these? Yes, definitely. But it isn't urgent. > Out of curiosity, what are these extra characters used for? Raw bytes and characters from charsets that are not (yet) unified with Unicode. > What is the purpose of the odd special 2-byte encoding of 8-bit > characters (I mean where the 1st byte is C0/C1)? Why don't just use > the regular utf-8 encoding for these values? I think it's for efficiency: a 2-byte encoding takes much less space than the 6-byte encoding (using superset of UTF-8) would take. Imagine the case where a large byte-stream is inserted into a multibyte buffer, before decoding it, something that happens a lot when visiting non-ASCII files or reading from a network sub-process. The regular UTF-8 encoding cannot be used for the raw bytes, because then we will be unable to distinguish between them and the Unicode codepoints of the same value. For example, a raw byte whose value is 160 decimal (A0 hex) will be indistinguishable from U+00A0 No-Break Space character. This is why the "codepoint" corresponding to raw byte 160 is 0x3FFFA0, see BYTE8_TO_CHAR. Once again, we can extend the parser for codepoints outside of the Unicode range later. For now, it's okay to reject them with a suitable error. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 6:52 ` Eli Zaretskii @ 2024-03-09 11:08 ` Herman, Géza 2024-03-09 12:23 ` Lynn Winebarger ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Herman, Géza @ 2024-03-09 11:08 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Herman, Géza <geza.herman@gmail.com> >> Cc: Herman Géza <geza.herman@gmail.com>, >> emacs-devel@gnu.org >> Date: Fri, 08 Mar 2024 21:22:13 +0100 >> Yes, it seems that EMACS_UINT is good for my purpose, thanks >> for >> the suggestion. > > Are you sure you need the unsigned variety? If EMACS_INT fits > the > bill, then it is a better candidate, since unsigned arithmetics > has > its quirks. Yes, I think it's better to use unsigned: read the sign, and then parse the number as unsigned, and then apply the sign at the end. If the number is parsed with its sign, it needs an additional step at each character (the sign needs to be applied to each digit). >> Also, I see that json-parse-string calls some utf8 encoding >> related >> function before parsing, but json-parse-buffer doesn't (and it >> doesn't do anything encoding related thing in the callback, it >> just >> calls memcpy). > > This is a part I was never happy about. But, as I say above, we > can > get to handling these rare cases later. I think this is an additional benefit of my parser: this feature can be added to it more easily than into jansson. Even, I'm tempted to say that we could just remove utf-8 checking from my code, and then Emacs's encoding method should work right out of the box. Or, to say that utf-8 handling should stay as is. Because as far as I understand, if the JSON contains an invalid utf-8 sequence which is not invalid according to Emacs's character representation, then this problem won't be detected. So checking for utf-8 encoding errors shouldn't be the job of the json parser, but around IO handling, which has the chance to know that the JSON stream itself must only contain a valid utf-8 encoding. Or, as the JSON specification explcitly says that the allowed character range is 0x20 .. 0x10ffff, the current solution is fine, because it is actually against JSON rules to allow anything else outside of this range. > Once again, we can extend the parser for codepoints outside of > the > Unicode range later. For now, it's okay to reject them with a > suitable error. OK, cool, I added Qjson_utf8_decode_error to indicate decoding errors. How can we proceed further? This is the current state of the patch: https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch I think I did everything that was asked for, except Po Lu's parenthesis-related comment, because I still don't know what to parenthesize and what not to. I saw a lot of "a + x * y" kind of expressions in emacs codebase without any parenthesis. Are the exact rules documented somewhere? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 11:08 ` Herman, Géza @ 2024-03-09 12:23 ` Lynn Winebarger 2024-03-09 12:58 ` Po Lu 2024-03-09 13:13 ` Eli Zaretskii 2 siblings, 0 replies; 51+ messages in thread From: Lynn Winebarger @ 2024-03-09 12:23 UTC (permalink / raw) To: Herman, Géza; +Cc: Eli Zaretskii, emacs-devel [-- Attachment #1: Type: text/plain, Size: 759 bytes --] On Sat, Mar 9, 2024, 6:39 AM Herman, Géza <geza.herman@gmail.com> wrote: > > I think I did everything that was asked for, except Po Lu's > parenthesis-related comment, because I still don't know what to > parenthesize and what not to. I saw a lot of "a + x * y" kind of > expressions in emacs codebase without any parenthesis. Are the > exact rules documented somewhere? > I suspect the rule is to ensure that emacs maintainers can freely convert functions to C preprocessor macros. So you should follow Po Lu's description. If the preprocessor wasn't an issue, such replication of the language's builtin arithmetic precedence would be undesirable, an possibly repugnant, depending on your aesthetic sensibilities. Lynn Lynn [-- Attachment #2: Type: text/html, Size: 1380 bytes --] ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 11:08 ` Herman, Géza 2024-03-09 12:23 ` Lynn Winebarger @ 2024-03-09 12:58 ` Po Lu 2024-03-09 13:13 ` Eli Zaretskii 2 siblings, 0 replies; 51+ messages in thread From: Po Lu @ 2024-03-09 12:58 UTC (permalink / raw) To: Herman, Géza; +Cc: Eli Zaretskii, emacs-devel "Herman, Géza" <geza.herman@gmail.com> writes: > I think I did everything that was asked for, except Po Lu's > parenthesis-related comment, because I still don't know what to > parenthesize and what not to. I saw a lot of "a + x * y" kind of > expressions in emacs codebase without any parenthesis. Are the exact > rules documented somewhere? Simple statements as above, which fit in one line, don't require the extra parentheses. The objective of these parentheses is to enable indenting statements satisfactorily with CC Mode's indentation commands, and so that such indentation does not give readers a misleading impression of the statement's structure. https://www.gnu.org/prep/standards/standards.html#Formatting ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 11:08 ` Herman, Géza 2024-03-09 12:23 ` Lynn Winebarger 2024-03-09 12:58 ` Po Lu @ 2024-03-09 13:13 ` Eli Zaretskii 2024-03-09 14:00 ` Herman, Géza 2 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-09 13:13 UTC (permalink / raw) To: Géza Herman; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: Géza Herman <geza.herman@gmail.com>, > emacs-devel@gnu.org > Date: Sat, 09 Mar 2024 12:08:54 +0100 > > How can we proceed further? This is the current state of the > patch: > https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch What about removing the dependency on jansson altogether? We also need to wait until your legal paperwork is completed. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 13:13 ` Eli Zaretskii @ 2024-03-09 14:00 ` Herman, Géza 2024-03-09 14:21 ` Eli Zaretskii 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-09 14:00 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Géza Herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Herman, Géza <geza.herman@gmail.com> >> Cc: Géza Herman <geza.herman@gmail.com>, >> emacs-devel@gnu.org >> Date: Sat, 09 Mar 2024 12:08:54 +0100 >> >> How can we proceed further? This is the current state of the >> patch: >> https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch > > What about removing the dependency on jansson altogether? Unless it also turns out to be a serious bottleneck, I'm not too motivated to work on that. Based on Po Lu's comments, I added some parentheses, here's the latest version: https://github.com/geza-herman/emacs/commit/5e3db3fb06b6393fd1b5d5fd967062ec1909e94a.patch (I reformatted the code using indent-region, and put parentheses where it didn't used the correct indentation. I'm not sure whether this method finds all the cases, but as the intent of these extra parentheses is to have correct indentation, I think this approach more-or-less OK) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 14:00 ` Herman, Géza @ 2024-03-09 14:21 ` Eli Zaretskii 0 siblings, 0 replies; 51+ messages in thread From: Eli Zaretskii @ 2024-03-09 14:21 UTC (permalink / raw) To: Géza Herman; +Cc: emacs-devel > From: Herman, Géza <geza.herman@gmail.com> > Cc: Géza Herman <geza.herman@gmail.com>, > emacs-devel@gnu.org > Date: Sat, 09 Mar 2024 15:00:09 +0100 > > > Eli Zaretskii <eliz@gnu.org> writes: > > >> How can we proceed further? This is the current state of the > >> patch: > >> https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch > > > > What about removing the dependency on jansson altogether? > > Unless it also turns out to be a serious bottleneck, I'm not too > motivated to work on that. That's a pity, because using jansson for part of the job and our own code for the other part makes little sense to me. So I hope you will reconsider, because having our own JSON code is a very attractive development for Emacs, it opens several possibilities for future extensions that using an external library cannot. > Based on Po Lu's comments, I added some parentheses, here's the > latest version: Thanks. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 10:27 I created a faster JSON parser Herman, Géza 2024-03-08 11:41 ` Philip Kaludercic 2024-03-08 12:03 ` Eli Zaretskii @ 2024-03-08 13:28 ` Po Lu 2024-03-08 16:14 ` Herman, Géza 2024-03-09 20:37 ` Christopher Wellons 3 siblings, 1 reply; 51+ messages in thread From: Po Lu @ 2024-03-08 13:28 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org "Herman, Géza" <geza.herman@gmail.com> writes: > I created a faster JSON parser for emacs, you can check it out here: > > https://github.com/geza-herman/emacs/tree/faster-json-parsing > > It replaces json-parse-string and json-parse-buffer functions. The > behavior should be the same as before, with the only exception that > objects with duplicated keys are not detected if :object-type is not > 'hash-table. > > This parser runs 8-9x faster than the jansson based parser on my > machine (tested on clangd language server messages). An additional > tiny benefit is that large integers are parsed, instead of having an > "out of range" error. > > What do you think? Speed aside, this change eliminates an external dependency from Emacs, which is a great service to us in its own right. Thank you! Nevertheless, there are several coding style issues that should be resolved, viz. the presence of redundant opening braces and ternary or nested binary operations not enclosed in parentheses, and the size of the change is such that its copyright must be assigned to the FSF. Please see that these are addressed, and thanks again. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 13:28 ` Po Lu @ 2024-03-08 16:14 ` Herman, Géza 2024-03-09 1:55 ` Po Lu 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-08 16:14 UTC (permalink / raw) To: Po Lu; +Cc: Herman, Géza, emacs-devel@gnu.org Po Lu <luangruo@yahoo.com> writes: > "Herman, Géza" <geza.herman@gmail.com> writes: > >> What do you think? > > Speed aside, this change eliminates an external dependency from > Emacs, > which is a great service to us in its own right. Thank you! I think jansson is still used, but only for writing to json. > Nevertheless, there are several coding style issues that should > be > resolved, viz. the presence of redundant opening braces and > ternary or > nested binary operations not enclosed in parentheses, Can you give me some examples reagarding operations need to be parenthesized? Isn't there a gcc warning for this, so I can follow the convention more easily? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 16:14 ` Herman, Géza @ 2024-03-09 1:55 ` Po Lu 0 siblings, 0 replies; 51+ messages in thread From: Po Lu @ 2024-03-09 1:55 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org "Herman, Géza" <geza.herman@gmail.com> writes: > I think jansson is still used, but only for writing to json. That's a disappointment, but would you be willing to solve this as well? > Can you give me some examples reagarding operations need to be > parenthesized? e.g., in function_call () + function_call_1 () * function_call_2 () + function_call_3 () the entire statement should be surrounded by parentheses, along with each group of operators separated by precedence: (function_call () + (function_call_1 () * function_call_2 ()) + function_call_3 ()) and lastly all ternary operators should be surrounded by parentheses. > Isn't there a gcc warning for this, so I can follow the convention > more easily? Not that I'm aware of. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-08 10:27 I created a faster JSON parser Herman, Géza ` (2 preceding siblings ...) 2024-03-08 13:28 ` Po Lu @ 2024-03-09 20:37 ` Christopher Wellons 2024-03-10 6:31 ` Eli Zaretskii 2024-03-10 6:58 ` Herman, Géza 3 siblings, 2 replies; 51+ messages in thread From: Christopher Wellons @ 2024-03-09 20:37 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org > What do you think? Your new JSON parser appears to carefully written, and has the marks of thoughtful design. You've avoided the common pitfalls, and that gives me confidence in it. In review I noticed a potential pointer overflow in json_parse_string: parser->input_current + 4 <= parser->input_end The "+ 4" may push the pointer beyond one-past-the-end not just for the input, but the buffer itself, overflowing the pointer. To fix, re-arrange the expression to check a size rather than an address: parser->input_end - parser->input_current >= 4 In json_make_object_workspace_for and json_byte_workspace_put, a size is doubled without an overflow check ("new_workspace_size * 2"). The first could cause an infinite loop, and the second could allocate less than was expected. Both are minor, and in practice can only affect 32-bit targets, because you'd need to grow these buffers to the limits before these sizes could overflow. Despite the obvious care which with this was written, I personally would not adopt a JSON parser that had not been thoroughly fuzz tested under Address Sanitizer and Undefined Behavior Sanitizer. Fuzzing is incredibly effective at finding defects, and it would be foolish not to use it in its ideal circumstances. Normally it's not difficult and requires only a few lines of code. But this JSON parser is tightly coupled with the Emacs Lisp runtime, which greatly complicates things. I couldn't simply pluck it out by itself and drop it in, say, AFL++. As noted earlier, the parser gets its performance edge through skipping the intermediate steps. This is great! That could still be accomplished without such tight coupling, allowing for performance *and* an interface that is testable and fuzzable in relative isolation. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 20:37 ` Christopher Wellons @ 2024-03-10 6:31 ` Eli Zaretskii 2024-03-10 21:39 ` Philip Kaludercic 2024-03-10 6:58 ` Herman, Géza 1 sibling, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-10 6:31 UTC (permalink / raw) To: Christopher Wellons; +Cc: geza.herman, emacs-devel > Date: Sat, 9 Mar 2024 15:37:25 -0500 > From: Christopher Wellons <wellons@nullprogram.com> > Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org> > > Despite the obvious care which with this was written, I personally would > not adopt a JSON parser that had not been thoroughly fuzz tested under > Address Sanitizer and Undefined Behavior Sanitizer. Fuzzing is incredibly > effective at finding defects, and it would be foolish not to use it in its > ideal circumstances. Normally it's not difficult and requires only a few > lines of code. But this JSON parser is tightly coupled with the Emacs Lisp > runtime, which greatly complicates things. I couldn't simply pluck it out > by itself and drop it in, say, AFL++. That's okay, we can start by making this an optional feature, and consider making it the default after a couple of major releases; meanwhile, any problems will be detected and reported. However, it would make much more sense to consider switching to this code if it also could handle producing JSON output, thus making libjansson unnecessary when we decide to switch. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-10 6:31 ` Eli Zaretskii @ 2024-03-10 21:39 ` Philip Kaludercic 2024-03-11 13:29 ` Eli Zaretskii 0 siblings, 1 reply; 51+ messages in thread From: Philip Kaludercic @ 2024-03-10 21:39 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Christopher Wellons, geza.herman, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> Date: Sat, 9 Mar 2024 15:37:25 -0500 >> From: Christopher Wellons <wellons@nullprogram.com> >> Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org> >> >> Despite the obvious care which with this was written, I personally would >> not adopt a JSON parser that had not been thoroughly fuzz tested under >> Address Sanitizer and Undefined Behavior Sanitizer. Fuzzing is incredibly >> effective at finding defects, and it would be foolish not to use it in its >> ideal circumstances. Normally it's not difficult and requires only a few >> lines of code. But this JSON parser is tightly coupled with the Emacs Lisp >> runtime, which greatly complicates things. I couldn't simply pluck it out >> by itself and drop it in, say, AFL++. > > That's okay, we can start by making this an optional feature, and > consider making it the default after a couple of major releases; > meanwhile, any problems will be detected and reported. > > However, it would make much more sense to consider switching to this > code if it also could handle producing JSON output, thus making > libjansson unnecessary when we decide to switch. If libjansson is not available, it should still be possible to use this faster parser, while falling back to json.el for generating JSON? From what I understand, the main issue people have with JSON is the parsing bottleneck, right? -- Philip Kaludercic on icterid ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-10 21:39 ` Philip Kaludercic @ 2024-03-11 13:29 ` Eli Zaretskii 2024-03-11 14:05 ` Mattias Engdegård 0 siblings, 1 reply; 51+ messages in thread From: Eli Zaretskii @ 2024-03-11 13:29 UTC (permalink / raw) To: Philip Kaludercic; +Cc: wellons, geza.herman, emacs-devel > From: Philip Kaludercic <philipk@posteo.net> > Cc: Christopher Wellons <wellons@nullprogram.com>, geza.herman@gmail.com, > emacs-devel@gnu.org > Date: Sun, 10 Mar 2024 21:39:29 +0000 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> Date: Sat, 9 Mar 2024 15:37:25 -0500 > >> From: Christopher Wellons <wellons@nullprogram.com> > >> Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org> > >> > >> Despite the obvious care which with this was written, I personally would > >> not adopt a JSON parser that had not been thoroughly fuzz tested under > >> Address Sanitizer and Undefined Behavior Sanitizer. Fuzzing is incredibly > >> effective at finding defects, and it would be foolish not to use it in its > >> ideal circumstances. Normally it's not difficult and requires only a few > >> lines of code. But this JSON parser is tightly coupled with the Emacs Lisp > >> runtime, which greatly complicates things. I couldn't simply pluck it out > >> by itself and drop it in, say, AFL++. > > > > That's okay, we can start by making this an optional feature, and > > consider making it the default after a couple of major releases; > > meanwhile, any problems will be detected and reported. > > > > However, it would make much more sense to consider switching to this > > code if it also could handle producing JSON output, thus making > > libjansson unnecessary when we decide to switch. > > If libjansson is not available, it should still be possible to use this > faster parser, while falling back to json.el for generating JSON? From > what I understand, the main issue people have with JSON is the parsing > bottleneck, right? What you describe are possible fallbacks, but I would prefer not to use any fallback at all, but instead have a full C implementation. (Using json.el has one unfortunate disadvantage that it's incompatible with the json.c APIs.) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-11 13:29 ` Eli Zaretskii @ 2024-03-11 14:05 ` Mattias Engdegård 2024-03-11 14:35 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Mattias Engdegård @ 2024-03-11 14:05 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Philip Kaludercic, wellons, geza.herman, emacs-devel 11 mars 2024 kl. 14.29 skrev Eli Zaretskii <eliz@gnu.org>: > What you describe are possible fallbacks, but I would prefer not to > use any fallback at all, but instead have a full C implementation. Yes, I definitely think we should do that. I'm pretty sure that writing a JSON unparser is a lot easier than doing the parser, and the extra speed we stand to gain from not having the intermediate jansson step is not without interest. Overall the proposed parser looks fine, nothing terribly wrong that can't be fixed later on. A few minor points: * The `is_single_uninteresting` array is hard to review and badly formatted. It appears to be 1 for all printable ASCII plus DEL except double-quote and backslash. (Why DEL?) Happy if you would make that clearer, perhaps by formatting or initialising it at run-time. (I don't think we can use GCC-style array interval designators, alas.) * Do you really need to maintain line and column during the parse? If you want them for error reporting, you can materialise them from the offset that you already have. * Are you sure that GC can't run during parsing or that all your Lisp objects are reachable directly from the stack? (It's the `object_workspace` in particular that's worrying me a bit.) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-11 14:05 ` Mattias Engdegård @ 2024-03-11 14:35 ` Herman, Géza 2024-03-12 9:26 ` Mattias Engdegård 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-11 14:35 UTC (permalink / raw) To: Mattias Engdegård Cc: Eli Zaretskii, Philip Kaludercic, wellons, geza.herman, emacs-devel Mattias Engdegård <mattias.engdegard@gmail.com> writes: > 11 mars 2024 kl. 14.29 skrev Eli Zaretskii <eliz@gnu.org>: > >> What you describe are possible fallbacks, but I would prefer >> not to >> use any fallback at all, but instead have a full C >> implementation. > > Yes, I definitely think we should do that. I'm pretty sure that > writing a JSON unparser is a lot easier than doing the parser, > and the > extra speed we stand to gain from not having the intermediate > jansson > step is not without interest. FYI: I checked out a JSON benchmark, and it turned out that jansson is not a fast parser, there are faster libraries. If a library has a SAX interface, that could be a potentially useful library for Emacs. According to https://github.com/miloyip/nativejson-benchmark, RapidJSON is at least 10x faster than jansson. I'm just saying this because Emacs doesn't have to stick with my parser, there are possible alternatives, which have JSON serializers as well. (But note: I am happy to make my parser into a mergeable state, and if eventually it gets merged then fixing its bugs, but I'm not motivated to work on integrating other JSON libraries). > Overall the proposed parser looks fine, nothing terribly wrong > that can't be fixed later on. A few minor points: > > * The `is_single_uninteresting` array is hard to review and > badly > formatted. It appears to be 1 for all printable ASCII plus DEL > except > double-quote and backslash. (Why DEL?) Yep, the formatting of that table got destroyed when I reformatted the code into GNU style. Now I formatted the table back, and added comments for each row/col. Here's the latest version: https://github.com/geza-herman/emacs/commit/4b5895636c1ec06e630baf47881b246c198af056.patch I'm not sure about DEL: I haven't seen anything which says that it's an invalid character in a string, so the parser currently allows it. > * Do you really need to maintain line and column during the > parse? If > you want them for error reporting, you can materialise them from > the > offset that you already have. Yeah, I thought of that, but it turned out that maintaining the line/column doesn't have an impact on performance. I added that easily, tough admittedly it's a little bit awkward to maintain these variables. If emacs has a way to tell from the byte-pointer the line/col position (both for strings and buffers), I am happy to use that instead. It would be a better solution, because currently the parser always starts from line 1, col 1, which means that if json-parse-buffer is used, these numbers will be local to the current parsing, not actual numbers related to the whole buffer. But as the jansson based parsed behaves the same, I thought it's OK. > * Are you sure that GC can't run during parsing or that all your > Lisp > objects are reachable directly from the stack? (It's the > `object_workspace` in particular that's worrying me a bit.) That's a very good question. I suppose that object_workspace is invisible to the Lisp VM, as it is just a malloc'd object. But I've never seen a problem because of this. What triggers the GC? Is it possible that for the duration of the whole parsing, GC is never get triggered? Otherwise it should have GCd the objects in object_workspace, causing problems (I tried this parser in a loop, where GC is caused hundreds of times. In the loop, I compared the result to json-read, everything was fine). ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-11 14:35 ` Herman, Géza @ 2024-03-12 9:26 ` Mattias Engdegård 2024-03-12 10:20 ` Gerd Möllmann 2024-03-12 10:58 ` Herman, Géza 0 siblings, 2 replies; 51+ messages in thread From: Mattias Engdegård @ 2024-03-12 9:26 UTC (permalink / raw) To: "Herman, Géza" Cc: Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel 11 mars 2024 kl. 15.35 skrev Herman, Géza <geza.herman@gmail.com>: > According to https://github.com/miloyip/nativejson-benchmark, RapidJSON is at least 10x faster than jansson. I'm just saying this because Emacs doesn't have to stick with my parser, there are possible alternatives, which have JSON serializers as well. Thanks for the benchmark page reference. Yes, if this turns out to matter more we may consider a faster library. Right now I think your efforts are good enough (at least if we finish the job with a JSON serialiser). > Yep, the formatting of that table got destroyed when I reformatted the code into GNU style. Now I formatted the table back, and added comments for each row/col. Here's the latest version: https://github.com/geza-herman/emacs/commit/4b5895636c1ec06e630baf47881b246c198af056.patch Much better, thank you. >> * Do you really need to maintain line and column during the parse? If >> you want them for error reporting, you can materialise them from the >> offset that you already have. > > Yeah, I thought of that, but it turned out that maintaining the line/column doesn't have an impact on performance. That's just because your code isn't fast enough! We are very disappointed. Very. > I added that easily, tough admittedly it's a little bit awkward to maintain these variables. If emacs has a way to tell from the byte-pointer the line/col position (both for strings and buffers), I am happy to use that instead. Since error handling isn't performance-critical it doesn't matter if it's a bit slow. (I'd just count newlines.) >> * Are you sure that GC can't run during parsing or that all your Lisp >> objects are reachable directly from the stack? (It's the >> `object_workspace` in particular that's worrying me a bit.) > > That's a very good question. I suppose that object_workspace is invisible to the Lisp VM, as it is just a malloc'd object. But I've never seen a problem because of this. What triggers the GC? Is it possible that for the duration of the whole parsing, GC is never get triggered? Otherwise it should have GCd the objects in object_workspace, causing problems (I tried this parser in a loop, where GC is caused hundreds of times. In the loop, I compared the result to json-read, everything was fine). You can't test that code is GC-safe, you have to show that it's correct by design. Looking at the code it is quite possible that GC cannot take place. But it can signal errors, and getting into the debugger should open GC windows unless I'm mistaken. There are some options. `record_unwind_protect_ptr_mark` would be one, and it was made for code like this, but Gerd has been grumbling about it lately. Perhaps it's easier just to disable GC in the dynamic scope (inhibit_garbage_collection). ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 9:26 ` Mattias Engdegård @ 2024-03-12 10:20 ` Gerd Möllmann 2024-03-12 11:14 ` Mattias Engdegård 2024-03-15 13:35 ` Herman, Géza 2024-03-12 10:58 ` Herman, Géza 1 sibling, 2 replies; 51+ messages in thread From: Gerd Möllmann @ 2024-03-12 10:20 UTC (permalink / raw) To: Mattias Engdegård Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel Mattias Engdegård <mattias.engdegard@gmail.com> writes: > There are some options. `record_unwind_protect_ptr_mark` would be one, > and it was made for code like this, but Gerd has been grumbling about > it lately. Perhaps it's easier just to disable GC in the dynamic scope > (inhibit_garbage_collection). Yes, I'm grumbling because record_..._mark assumes GC is mark-sweep. And it's really super ugly to work around. And if you disable GC, I'll be grumbling because that's not good for a concurrent GC. What do I propose: If you can, please make objects reachable from the control stack. For instance, create a context object on the stack that has a Lisp vector member. That will prevent GC'ing the vector and by that its contents, of course. You could pass a pointer to the context around to get access to the vector in subroutines, replace it with a larger vector if necessary, and so on. Howsoever, that needed to be said! ;-). ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 10:20 ` Gerd Möllmann @ 2024-03-12 11:14 ` Mattias Engdegård 2024-03-12 11:33 ` Gerd Möllmann 2024-03-15 13:35 ` Herman, Géza 1 sibling, 1 reply; 51+ messages in thread From: Mattias Engdegård @ 2024-03-12 11:14 UTC (permalink / raw) To: Gerd Möllmann Cc: "Herman, Géza", Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel 12 mars 2024 kl. 11.20 skrev Gerd Möllmann <gerd.moellmann@gmail.com>: > Yes, I'm grumbling because record_..._mark assumes GC is mark-sweep. And > it's really super ugly to work around. > > And if you disable GC, I'll be grumbling because that's not good for a > concurrent GC. These are problems we clearly need to solve, but I don't think we should hold Géza's improved JSON parser hostage to our desired future GC demands. Let's help him finish his work in the Emacs that we have today and be explicit about our assumptions and needs. The proposed code's requirements are not unique, nor do the solutions need to be. > What do I propose: If you can, please make objects reachable from the > control stack. For instance, create a context object on the stack that > has a Lisp vector member. The problem is that with the GC that we actually have, xmalloc/xfree is a lot faster than creating a Lisp vector and letting the GC collect it later on. One possible way out might be providing a way to free a vector ahead of time, akin to free_cons, and create a vector type with a fill pointer so that it does not need to be initialised nor traversed in full if only part of it is used. The last part is straightforward, although it complicates vectorlike traversal a bit. Ahead-of-time freeing might be unnecessary with a modern collector but that needs to be verified. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 11:14 ` Mattias Engdegård @ 2024-03-12 11:33 ` Gerd Möllmann 0 siblings, 0 replies; 51+ messages in thread From: Gerd Möllmann @ 2024-03-12 11:33 UTC (permalink / raw) To: Mattias Engdegård Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel Mattias Engdegård <mattias.engdegard@gmail.com> writes: > The problem is that with the GC that we actually have, xmalloc/xfree > is a lot faster than creating a Lisp vector and letting the GC collect > it later on. Will any user ever notice a difference? I doubt it. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 10:20 ` Gerd Möllmann 2024-03-12 11:14 ` Mattias Engdegård @ 2024-03-15 13:35 ` Herman, Géza 2024-03-15 14:56 ` Gerd Möllmann 2024-03-19 18:49 ` Mattias Engdegård 1 sibling, 2 replies; 51+ messages in thread From: Herman, Géza @ 2024-03-15 13:35 UTC (permalink / raw) To: Gerd Möllmann Cc: Mattias Engdegård, Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel Gerd Möllmann <gerd.moellmann@gmail.com> writes: > Mattias Engdegård <mattias.engdegard@gmail.com> writes: > >> There are some options. `record_unwind_protect_ptr_mark` would >> be one, >> and it was made for code like this, but Gerd has been grumbling >> about >> it lately. Perhaps it's easier just to disable GC in the >> dynamic scope >> (inhibit_garbage_collection). > > Yes, I'm grumbling because record_..._mark assumes GC is > mark-sweep. And > it's really super ugly to work around. > > And if you disable GC, I'll be grumbling because that's not good > for a > concurrent GC. > > What do I propose: If you can, please make objects reachable > from the > control stack. For instance, create a context object on the > stack that > has a Lisp vector member. That will prevent GC'ing the vector > and by > that its contents, of course. You could pass a pointer to the > context > around to get access to the vector in subroutines, replace it > with a > larger vector if necessary, and so on. I implemented this idea, here is the latest version (now, object_workspace is a Lisp_Object): https://github.com/geza-herman/emacs/commit/12decaddc9b8260745be4d5782fea21f4578eab2.patch Is there anything that need to be done? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-15 13:35 ` Herman, Géza @ 2024-03-15 14:56 ` Gerd Möllmann 2024-03-19 18:49 ` Mattias Engdegård 1 sibling, 0 replies; 51+ messages in thread From: Gerd Möllmann @ 2024-03-15 14:56 UTC (permalink / raw) To: Herman, Géza Cc: Mattias Engdegård, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel "Herman, Géza" <geza.herman@gmail.com> writes: >> What do I propose: If you can, please make objects reachable from the >> control stack. For instance, create a context object on the stack >> that has a Lisp vector member. That will prevent GC'ing the vector >> and by that its contents, of course. You could pass a pointer to the >> context around to get access to the vector in subroutines, replace it >> with a larger vector if necessary, and so on. > > I implemented this idea, here is the latest version (now, > object_workspace is a Lisp_Object): > > https://github.com/geza-herman/emacs/commit/12decaddc9b8260745be4d5782fea21f4578eab2.patch > > Is there anything that need to be done? Thanks, that looks very good, I'm happy :-). By putting a struct json_parser on the stack as a local variable, its .object_workspace itself and everything rechable from it are safe from GC. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-15 13:35 ` Herman, Géza 2024-03-15 14:56 ` Gerd Möllmann @ 2024-03-19 18:49 ` Mattias Engdegård 2024-03-19 19:05 ` Herman, Géza 2024-03-19 19:13 ` Gerd Möllmann 1 sibling, 2 replies; 51+ messages in thread From: Mattias Engdegård @ 2024-03-19 18:49 UTC (permalink / raw) To: "Herman, Géza" Cc: Gerd Möllmann, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel 15 mars 2024 kl. 14.35 skrev Herman, Géza <geza.herman@gmail.com>: > I implemented this idea, here is the latest version (now, object_workspace is a Lisp_Object): That's considerably slower than before, especially for small inputs, so it's probably a no-go. I definitely want to make Gerd's efforts no harder than necessary but really don't think we should do pay the cost up-front like this. If required for a better GC then fine, but let's do that work in the branch for that effort instead of incurring overhead in Emacs now. > Is there anything that need to be done? It's probably a good idea to open a bug for this task, since it provides a better and more focussed record of the discussion, with a single bug number. Here are some remaining tasks: 1. Go back to the previous version which was much faster. 2. Don't allocate any temporary storage before you know that it's actually necessary. 3. Stop using the object_workspace when not required, which is everywhere except possibly when reading arrays into Lisp vectors. 4. Write a JSON serialiser. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-19 18:49 ` Mattias Engdegård @ 2024-03-19 19:05 ` Herman, Géza 2024-03-19 19:18 ` Gerd Möllmann 2024-03-19 19:13 ` Gerd Möllmann 1 sibling, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-19 19:05 UTC (permalink / raw) To: Mattias Engdegård Cc: Herman, Géza, Gerd Möllmann, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel Mattias Engdegård <mattias.engdegard@gmail.com> writes: > 15 mars 2024 kl. 14.35 skrev Herman, Géza > <geza.herman@gmail.com>: > >> I implemented this idea, here is the latest version (now, >> object_workspace is a Lisp_Object): > > That's considerably slower than before, especially for small > inputs, so it's probably a no-go. Have you benchmarked it? I really doubt that there is a significant performance difference. Maybe it's possible to find some case, where this modification matters a bit, but for most cases, the difference should be negligible. The amount of allocations done is the same in both cases, which is for large files, only 20-30 allocations. Compared to the thousands of allocations for storing the actual Lisp objects. > Here are some remaining tasks: > > 2. Don't allocate any temporary storage before you know that > it's actually necessary. Except for the very-very trivial cases, memory allocation is always necessary. I can lower the allocation sizes, but I don't think it's worth it. We're only talking about sizes of the KB range. > 3. Stop using the object_workspace when not required, which is > everywhere except possibly when reading arrays into Lisp > vectors. object_workspace is necessary whenever a JSON contains an object or array. So basically always. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-19 19:05 ` Herman, Géza @ 2024-03-19 19:18 ` Gerd Möllmann 0 siblings, 0 replies; 51+ messages in thread From: Gerd Möllmann @ 2024-03-19 19:18 UTC (permalink / raw) To: Herman, Géza Cc: Mattias Engdegård, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel "Herman, Géza" <geza.herman@gmail.com> writes: > Mattias Engdegård <mattias.engdegard@gmail.com> writes: > >> 15 mars 2024 kl. 14.35 skrev Herman, Géza <geza.herman@gmail.com>: >> >>> I implemented this idea, here is the latest version (now, >>> object_workspace is a Lisp_Object): >> >> That's considerably slower than before, especially for small inputs, >> so it's probably a no-go. > Have you benchmarked it? I really doubt that there is a significant > performance difference. Maybe it's possible to find some case, where > this modification matters a bit, but for most cases, the difference > should be negligible. The amount of allocations done is the same in > both cases, which is for large files, only 20-30 allocations. > Compared to the thousands of allocations for storing the actual Lisp > objects. Right, if N is the number of objects allocated for JSON, the usual strategy of doubling buffers sizes yields Log(N) buffer allocations. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-19 18:49 ` Mattias Engdegård 2024-03-19 19:05 ` Herman, Géza @ 2024-03-19 19:13 ` Gerd Möllmann 1 sibling, 0 replies; 51+ messages in thread From: Gerd Möllmann @ 2024-03-19 19:13 UTC (permalink / raw) To: Mattias Engdegård Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel Mattias Engdegård <mattias.engdegard@gmail.com> writes: > 15 mars 2024 kl. 14.35 skrev Herman, Géza <geza.herman@gmail.com>: > >> I implemented this idea, here is the latest version (now, object_workspace is a Lisp_Object): > > That's considerably slower than before, especially for small inputs, > so it's probably a no-go. Can you give numbers? > > I definitely want to make Gerd's efforts no harder than necessary but > really don't think we should do pay the cost up-front like this. If > required for a better GC then fine, but let's do that work in the > branch for that effort instead of incurring overhead in Emacs now. Please don't take my GC efforts into consideration. That may succeed or not. But this is also a matter of good design, using the stack, (which BTW pdumper does, too), vs. bad design. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 9:26 ` Mattias Engdegård 2024-03-12 10:20 ` Gerd Möllmann @ 2024-03-12 10:58 ` Herman, Géza 2024-03-12 13:11 ` Mattias Engdegård 1 sibling, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-12 10:58 UTC (permalink / raw) To: Mattias Engdegård Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel Mattias Engdegård <mattias.engdegard@gmail.com> writes: >> I added that easily, tough admittedly it's a little bit >> awkward to >> maintain these variables. If emacs has a way to tell from the >> byte-pointer the line/col position (both for strings and >> buffers), I >> am happy to use that instead. > > Since error handling isn't performance-critical it doesn't > matter if it's a bit slow. (I'd just count newlines.) Basically this is what the current code does, it's just not postponed until an actual error, but registered during parsing. I'm tempted to keep it as is, as line/col information can be useful in other circumstances as well. Like, for example, if we wanted to tag the created objects with their source location. > You can't test that code is GC-safe, you have to show that it's > correct by design. Sure, but there has to be an explanation why the current way doesn't have any problems. Having a glance at garbage collection in emacs, it seems that it only runs during elisp code execution (is this true?). As the parser doesn't call back to any elisp code which runs the VM, it should be safe. If GC could happen any time, then I suppose the whole C Emacs code should be checked for that, because one can never be sure that if something is allocated at the C side, then at the very next moment it will be immediately freed by the GC. Conceptually, I see no difference between calling a single Fcons vs. the what the whole parser does. If calling Fcons and then using its result is safe, then the parser should also be safe. Or is there some magic about the GC which makes this argument false? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 10:58 ` Herman, Géza @ 2024-03-12 13:11 ` Mattias Engdegård 2024-03-12 13:42 ` Mattias Engdegård 2024-03-12 15:23 ` Herman, Géza 0 siblings, 2 replies; 51+ messages in thread From: Mattias Engdegård @ 2024-03-12 13:11 UTC (permalink / raw) To: "Herman, Géza" Cc: Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel, Gerd Möllmann 12 mars 2024 kl. 11.58 skrev Herman, Géza <geza.herman@gmail.com>: >> You can't test that code is GC-safe, you have to show that it's correct by design. > > Sure, but there has to be an explanation why the current way doesn't have any problems. It doesn't matter -- we don't need to prove to you why you don't get a segfault, it's you who need to convince us that your code is fine. Most C code does not need any special attention as long as it only places Lisp roots in local variables, ie, on the C stack and in registers. Your code keeps live roots in heap-allocated memory. It's not alone in doing that, but elsewhere we give those roots special attention, either by explicit GC marking or disabling GC during the lifespan of those roots. As I said your code is probably safe unless it signals an error. An alternative is to do as Gerd suggested and only have roots on the stack, by allocating vectors (or lists) for the task. However: 12 mars 2024 kl. 12.33 skrev Gerd Möllmann <gerd.moellmann@gmail.com>: > Will any user ever notice a difference? I doubt it. I can't say, and these things are not always easy to measure. A Lisp allocation has the drawbacks: - Allocation requires initialisation of the entire block, not just the used part. - Deallocation is delayed until the next GC. - The cache lines are effectively wasted. - If the code is run again before next GC, it will have to allocate more Lisp objects. - GC happens more frequently because allocation advances the GC clock. - GC takes longer because it has to sweep more dead objects. One compromise in this case (json_parse_array) would be to collect array elements into a Lisp list which is then turned into a Lisp vector. In the latter case we could use free_cons to undo the consing if we want. Consing and list traversal is slow but perhaps not bad enough; measurement is needed. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 13:11 ` Mattias Engdegård @ 2024-03-12 13:42 ` Mattias Engdegård 2024-03-12 15:23 ` Herman, Géza 1 sibling, 0 replies; 51+ messages in thread From: Mattias Engdegård @ 2024-03-12 13:42 UTC (permalink / raw) To: "Herman, Géza" Cc: Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel, Gerd Möllmann 12 mars 2024 kl. 14.11 skrev Mattias Engdegård <mattias.engdegard@gmail.com>: > As I said your code is probably safe unless it signals an error. Actually, as long as the heap-allocated scratch space isn't touched during error handling and unwind, that should not be a problem either, so we may be entirely safe here. We should explain why in the code, of course. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 13:11 ` Mattias Engdegård 2024-03-12 13:42 ` Mattias Engdegård @ 2024-03-12 15:23 ` Herman, Géza 2024-03-12 15:39 ` Gerd Möllmann 1 sibling, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-12 15:23 UTC (permalink / raw) To: Mattias Engdegård Cc: Herman, Géza, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel, Gerd Möllmann Mattias Engdegård <mattias.engdegard@gmail.com> writes: > 12 mars 2024 kl. 11.58 skrev Herman, Géza > <geza.herman@gmail.com>: > >>> You can't test that code is GC-safe, you have to show that >>> it's correct by design. >> >> Sure, but there has to be an explanation why the current way >> doesn't have any problems. > > It doesn't matter -- we don't need to prove to you why you don't > get a segfault, it's you who need to convince us that your code > is fine. > > Most C code does not need any special attention as long as it > only > places Lisp roots in local variables, ie, on the C stack and in > registers. Your code keeps live roots in heap-allocated > memory. It's > not alone in doing that, but elsewhere we give those roots > special > attention, either by explicit GC marking or disabling GC during > the > lifespan of those roots. > > As I said your code is probably safe unless it signals an error. Sorry, but I don't understand. If my parser puts roots in heap-allocated memory (that's true), and doesn't give such roots special attention (that's also true), then why is it probably safe? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-12 15:23 ` Herman, Géza @ 2024-03-12 15:39 ` Gerd Möllmann 0 siblings, 0 replies; 51+ messages in thread From: Gerd Möllmann @ 2024-03-12 15:39 UTC (permalink / raw) To: Herman, Géza Cc: Mattias Engdegård, Eli Zaretskii, Philip Kaludercic, wellons, emacs-devel "Herman, Géza" <geza.herman@gmail.com> writes: > Sorry, but I don't understand. If my parser puts roots in > heap-allocated memory (that's true), and doesn't give such roots > special attention (that's also true), then why is it probably safe? Assumin GC doesn't run, nothing can happen. For me personally, that assumption would be too dangerous. At one point, Emacs' regexp engine made that assumption, GC did run, because of something N levels down a call stack, GC compacted strings, and all that was really nice to debug :-). My 2 cents :-) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-09 20:37 ` Christopher Wellons 2024-03-10 6:31 ` Eli Zaretskii @ 2024-03-10 6:58 ` Herman, Géza 2024-03-10 16:54 ` Christopher Wellons 1 sibling, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-10 6:58 UTC (permalink / raw) To: Christopher Wellons; +Cc: Herman, Géza, emacs-devel@gnu.org Christopher Wellons <wellons@nullprogram.com> writes: >> What do you think? > > In review I noticed a potential pointer overflow in > json_parse_string: > > parser->input_current + 4 <= parser->input_end > > [...] > > In json_make_object_workspace_for and json_byte_workspace_put, a > size > is doubled without an overflow check ("new_workspace_size * 2"). Thanks for the review and finding these problems! I fixed them: https://github.com/geza-herman/emacs/commit/cbbf3dd494034750ff324703e64f1125a1056832.patch > But this JSON parser is tightly > coupled with the Emacs Lisp runtime, which greatly complicates > things. I couldn't simply pluck it out by itself and drop it in, > say, > AFL++. Yes, it needs some work. The Lisp Object creation part is only done at very specific places, it's easy to remove them (actually, I wrote this parser outside of Emacs, and then just put it in by adding the necessary Lisp Object creation code). Or, if the fuzzer needs the actual output (I mean, the result of the parsing), it shouldn't be too hard to put some code there which provides the output. The other thing is error handling, but it also can be easily replaced by using longjmp. I'm happy to do this work, I'd just need some directions how to do it. I'm not experienced with fuzzy testing, so if you are, I'd glad if you can give some advices: which fuzzy-testing framework to use, which introductory material is worth reading, etc. > As noted earlier, the parser gets its performance edge through > skipping the intermediate steps. This is great! That could still > be > accomplished without such tight coupling, allowing for > performance > *and* an interface that is testable and fuzzable in relative > isolation. Yes, I think a SAX parser like interface would have a very little cost. But honestly, I don't see the point of it. This is a parser for Emacs only. It has a very specific purpose, to make JSON parsing fast in Emacs. It is a small module. Input is JSON, output is Lisp Objects. Working with Lisp Objects inside Emacs is a natural thing, usually there is no need for intermediate representations. So if the only reason to have a Emacs-independent API is to make the parser fuzzy-testable, then wouldn't it make more sense to make Emacs fuzzy-testable in general? I find this approach more useful, because I think it's not just this parser which can be a sensible target for fuzzy testing. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-10 6:58 ` Herman, Géza @ 2024-03-10 16:54 ` Christopher Wellons 2024-03-10 20:41 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Christopher Wellons @ 2024-03-10 16:54 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org > I'd glad if you can give some advices: which fuzzy-testing framework to > use, which introductory material is worth reading, etc. The Jansson repository has a libFuzzer-based fuzz test, which is perhaps a useful example. In it they define LLVMFuzzerTestOneInput, a function which accepts a buffer of input (pointer and length), which they feed into the code under test. That's basically it. In the new parser that buffer would go into json_parse. The tested code is instrumented, and the fuzz tester observes the affect inputs have on control flow, using that information to construct new inputs that explore new execution paths, trying to exercise as many as possible. I'm partial to AFL++, and it's what I reach for first. It also works with GCC. It has two modes, with persistent mode preferred: https://github.com/AFLplusplus/AFLplusplus/blob/stable/instrumentation/README.persistent_mode.md Same in principle, but with control inverted. For seed inputs, a few small JSON documents exercising the parser's features is sufficient. In either case, use -fsanitize=address,undefined so that defective execution paths are more likely to be detected. More assertions would help, too, such as "assert(input_current <= input_end)" in a number of places. Assertions must abort or trap so that the fuzz tester knows it found a defect. Fuzz testing works better in a narrow scope. Ideally only the code being tested is instrumented. If it's running within an Emacs context, and you instrument all of Emacs, the fuzz tester would explore paths in Emacs reachable through the JSON parser rather than focus on the parser itself. That will waste time that could instead be spent exploring the parser. You don't need to allocate lisp objects during fuzz testing. In fact, you should avoid it because that would just slow it down. (I even bet it's the bottleneck in the new parser.) Ideally the core parser consumes bytes and produces JSON events, and is agnostic to its greater context. To integrate with Emacs, you'd have additional, separate code that turns JSON events into lisp objects, and which wouldn't be fuzz tested. Written that way, I could hook this core up to one of the above fuzz test interfaces, mock out whatever bits of Emacs might still be there (e.g. ckd_mul: the isolation need not be perfect), feed it the input, and pump events until either error (i.e. bad input detected, which is ignored) or EOF. The fuzz tester uses a timeout to detect infinite loops, which AFL++ will report as "hangs" and save the input for manual investigation. It should exercise JSON numeric parsing, too, at least to the extent that it's not punted to Emacs or strtod (mind your locale!). I'd also make available_depth much smaller so that the fuzzing could exercise failing checks. To get the bulk of the value, the fuzz test does not necessarily need to be checked into source control, or even run as part of a standard test suite. Given a clean, decoupled interface and implementation, it would only take a few minutes to hook up a fuzz test. I was hoping to find just that, but each JSON function has multiple points of contact with Emacs, most especially json_parse_object. I've done such ad-hoc fuzz testing on dozens of programs and libraries to evaluate their quality, and sometimes even improve them. In most cases, if can be fuzz tested and it's never been fuzz tested before, this technique finds fresh bugs in a matter of minutes, if not seconds. When I say it's incredibly effective, I mean it! Case in point from a few weeks ago, under similar circumstances, which can also serve as a practical example: https://github.com/editorconfig/editorconfig-core-c/pull/103 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-10 16:54 ` Christopher Wellons @ 2024-03-10 20:41 ` Herman, Géza 2024-03-10 23:22 ` Christopher Wellons 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-10 20:41 UTC (permalink / raw) To: Christopher Wellons; +Cc: Herman, Géza, emacs-devel@gnu.org [-- Attachment #1: Type: text/plain, Size: 1124 bytes --] Christopher Wellons <wellons@nullprogram.com> writes: >> I'd glad if you can give some advices: which fuzzy-testing >> framework >> to use, which introductory material is worth reading, etc. > > I'm partial to AFL++, and it's what I reach for first. It also > works > with GCC. It has two modes, with persistent mode preferred: Thanks so much for the description! I created a standalone version of my parser (I attached it), and used "afl-clang-fast -o json json.c -fsanitize=address,undefined" and afl-fuzz to test it. It's been running for an hour, the tester didn't find any problems yet. I discovered a funny clang bug: it incorrectly optimizes around setjmp in do_test(): when json_parser_init runs, it stores the workspace pointer in a register. And if there is an error during JSON parsing, it will always free the pointer which is in that register. But in the meantime (I mean, after json_parser_init, and before the error is thrown), the parser could have updated it. So free() will be called on an already freed block. I had to add a dummy printf("free!\n"); to circumvent this optimization. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: json.c --] [-- Type: text/x-csrc, Size: 29853 bytes --] #include <stddef.h> #include <stdlib.h> #include <errno.h> #include <stdint.h> #include <string.h> #include <math.h> #include <setjmp.h> #include <stdio.h> #include <unistd.h> #define ckd_add(R, A, B) __builtin_add_overflow((A), (B), (R)) #define ckd_mul(R, A, B) __builtin_mul_overflow((A), (B), (R)) #define bool _Bool #define true 1 #define false 0 struct json_configuration { }; typedef struct { size_t value; } Lisp_Object; struct json_parser { /* Because of a possible gap in the input (an emacs buffer can have a gap), the input is described by [input_begin;input_end) and [secondary_input_begin;secondary_input_end). If the input is continuous, then secondary_input_begin and secondary_input_end should be NULL */ const unsigned char *input_current; const unsigned char *input_begin; const unsigned char *input_end; const unsigned char *secondary_input_begin; const unsigned char *secondary_input_end; int current_line; int current_column; /* The parser has a maximum allowed depth. available_depth decreases at each object/array begin. If reaches zero, then an error is generated */ int available_depth; struct json_configuration conf; size_t additional_bytes_count; /* Lisp_Objects are collected in this area during object/array parsing */ Lisp_Object *object_workspace; Lisp_Object *object_workspace_end; Lisp_Object *object_workspace_current; /* String and number parsing uses this workspace */ unsigned char *byte_workspace; unsigned char *byte_workspace_end; unsigned char *byte_workspace_current; }; jmp_buf signal_env; Lisp_Object Qjson_out_of_memory; Lisp_Object Qjson_end_of_file; Lisp_Object Qjson_escape_sequence_error; Lisp_Object Qjson_utf8_decode_error; Lisp_Object Qjson_invalid_surrogate_error; Lisp_Object Qjson_parse_error; Lisp_Object Qjson_error; Lisp_Object Qjson_number_out_of_range; Lisp_Object Qjson_object_too_deep; Lisp_Object Qjson_trailing_content; static _Noreturn void json_signal_error(struct json_parser *parser, Lisp_Object error) { longjmp(signal_env, 1); /* xsignal2(error, INT_TO_INTEGER(parser->current_line), INT_TO_INTEGER(parser->current_column)); */ } static void json_parser_init(struct json_parser *parser, struct json_configuration conf, const unsigned char *input, const unsigned char *input_end, const unsigned char *secondary_input, const unsigned char *secondary_input_end) { const int initial_workspace_size = 64; const int initial_string_workspace_size = 512; if (secondary_input >= secondary_input_end) { secondary_input = NULL; secondary_input_end = NULL; } if (input < input_end) { parser->input_begin = input; parser->input_end = input_end; parser->secondary_input_begin = secondary_input; parser->secondary_input_end = secondary_input_end; } else { parser->input_begin = secondary_input; parser->input_end = secondary_input_end; parser->secondary_input_begin = NULL; parser->secondary_input_end = NULL; } parser->input_current = parser->input_begin; parser->current_line = 1; parser->current_column = 0; parser->available_depth = 8; parser->conf = conf; parser->additional_bytes_count = 0; parser->object_workspace = malloc(initial_workspace_size * sizeof(Lisp_Object)); parser->object_workspace_end = parser->object_workspace + initial_workspace_size; parser->object_workspace_current = parser->object_workspace; parser->byte_workspace = malloc(initial_string_workspace_size); parser->byte_workspace_end = parser->byte_workspace + initial_string_workspace_size; } static void json_parser_done(void *parser) { struct json_parser *p = (struct json_parser *)parser; /* printf("free: %p %p\n", p->object_workspace, p->byte_workspace); */ free(p->object_workspace); free(p->byte_workspace); } /* Makes sure that the object_workspace has 'size' available space for Lisp_Objects */ static void json_make_object_workspace_for(struct json_parser *parser, size_t size) { size_t available_size = parser->object_workspace_end - parser->object_workspace_current; if (available_size >= size) { return; } size_t needed_workspace_size = (parser->object_workspace_current - parser->object_workspace + size); size_t new_workspace_size = parser->object_workspace_end - parser->object_workspace; while (new_workspace_size < needed_workspace_size) { if (ckd_mul(&new_workspace_size, new_workspace_size, 2)) { json_signal_error(parser, Qjson_out_of_memory); } } size_t offset = parser->object_workspace_current - parser->object_workspace; parser->object_workspace = realloc(parser->object_workspace, new_workspace_size * sizeof(Lisp_Object)); parser->object_workspace_end = parser->object_workspace + new_workspace_size; parser->object_workspace_current = parser->object_workspace + offset; } static void json_byte_workspace_reset(struct json_parser *parser) { parser->byte_workspace_current = parser->byte_workspace; } /* Puts 'value' into the byte_workspace. If there is no space available, it allocates space */ static void json_byte_workspace_put(struct json_parser *parser, unsigned char value) { if (parser->byte_workspace_current < parser->byte_workspace_end) { *parser->byte_workspace_current++ = value; return; } size_t new_workspace_size = parser->byte_workspace_end - parser->byte_workspace; if (ckd_mul(&new_workspace_size, new_workspace_size, 2)) { json_signal_error(parser, Qjson_out_of_memory); } size_t offset = parser->byte_workspace_current - parser->byte_workspace; parser->byte_workspace = realloc(parser->byte_workspace, new_workspace_size); parser->byte_workspace_end = parser->byte_workspace + new_workspace_size; parser->byte_workspace_current = parser->byte_workspace + offset; *parser->byte_workspace_current++ = value; } static bool json_input_at_eof(struct json_parser *parser) { if (parser->input_current < parser->input_end) return false; return parser->secondary_input_end == NULL; } /* If there is a secondary buffer, it switches to it */ static int json_input_switch_to_secondary(struct json_parser *parser) { if (parser->secondary_input_begin < parser->secondary_input_end) { parser->additional_bytes_count = parser->input_end - parser->input_begin; parser->input_begin = parser->secondary_input_begin; parser->input_end = parser->secondary_input_end; parser->input_current = parser->secondary_input_begin; parser->secondary_input_begin = NULL; parser->secondary_input_end = NULL; return 0; } else return -1; } /* Reads a byte from the JSON input stream */ static unsigned char json_input_get(struct json_parser *parser) { if (parser->input_current >= parser->input_end && json_input_switch_to_secondary(parser) < 0) json_signal_error(parser, Qjson_end_of_file); return *parser->input_current++; } /* Reads a byte from the JSON input stream, if the stream is not at * eof. At eof, returns -1 */ static int json_input_get_if_possible(struct json_parser *parser) { if (parser->input_current >= parser->input_end && json_input_switch_to_secondary(parser) < 0) return -1; return *parser->input_current++; } /* Puts back the last read input byte. Only one byte can be put back, because otherwise this code would need to handle switching from the secondary buffer to the initial */ static void json_input_put_back(struct json_parser *parser) { parser->input_current--; } static bool json_skip_whitespace_internal(struct json_parser *parser, int c) { parser->current_column++; if (c == 0x20 || c == 0x09 || c == 0x0d) return false; else if (c == 0x0a) { parser->current_line++; parser->current_column = 0; return false; } else return true; } /* Skips JSON whitespace, and returns with the first non-whitespace * character */ static int json_skip_whitespace(struct json_parser *parser) { for (;;) { int c = json_input_get(parser); if (json_skip_whitespace_internal(parser, c)) return c; } } /* Skips JSON whitespace, and returns with the first non-whitespace * character, if possible. If there is no non-whitespace character * (because we reached the end), it returns -1 */ static int json_skip_whitespace_if_possible(struct json_parser *parser) { for (;;) { int c = json_input_get_if_possible(parser); if (c < 0) return c; if (json_skip_whitespace_internal(parser, c)) return c; } } static int json_hex_value(int c) { if (c >= '0' && c <= '9') return c - '0'; else if (c >= 'A' && c <= 'F') return c - 'A' + 10; else if (c >= 'a' && c <= 'f') return c - 'a' + 10; else return -1; } /* Parses the CCCC part of the unicode escape sequence \uCCCC */ static int json_parse_unicode(struct json_parser *parser) { unsigned char v[4]; for (int i = 0; i < 4; i++) { int c = json_hex_value(json_input_get(parser)); parser->current_column++; if (c < 0) json_signal_error(parser, Qjson_escape_sequence_error); v[i] = c; } return v[0] << 12 | v[1] << 8 | v[2] << 4 | v[3]; } /* Parses an utf-8 code-point encoding (except the first byte), and returns the numeric value of the code-point (without considering the first byte) */ static int json_handle_utf8_tail_bytes(struct json_parser *parser, int n) { int v = 0; for (int i = 0; i < n; i++) { int c = json_input_get(parser); json_byte_workspace_put(parser, c); if ((c & 0xc0) != 0x80) json_signal_error(parser, Qjson_utf8_decode_error); v = (v << 6) | (c & 0x3f); } return v; } /* Reads a JSON string, and puts the result into the byte workspace */ static void json_parse_string(struct json_parser *parser) { /* a single_uninteresting byte can be simply copied from the input to output, it doesn't need any extra care. */ static const char is_single_uninteresting[256] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, }; for (;;) { /* This if is only here for a possible speedup. If there are 4 bytes available, and all of them are single_uninteresting, then we can just copy these 4 bytes to output */ if (parser->input_end - parser->input_current >= 4) { int c0 = parser->input_current[0]; int c1 = parser->input_current[1]; int c2 = parser->input_current[2]; int c3 = parser->input_current[3]; bool v0 = is_single_uninteresting[c0]; bool v1 = is_single_uninteresting[c1]; bool v2 = is_single_uninteresting[c2]; bool v3 = is_single_uninteresting[c3]; if (v0 && v1 && v2 && v3) { json_byte_workspace_put(parser, c0); json_byte_workspace_put(parser, c1); json_byte_workspace_put(parser, c2); json_byte_workspace_put(parser, c3); parser->input_current += 4; parser->current_column += 4; continue; } } int c = json_input_get(parser); parser->current_column++; if (is_single_uninteresting[c]) { json_byte_workspace_put(parser, c); continue; } if (c == '"') return; else if (c & 0x80) { /* Handle utf-8 encoding */ json_byte_workspace_put(parser, c); if (c < 0xc0) json_signal_error(parser, Qjson_utf8_decode_error); else if (c < 0xe0) { int n = ((c & 0x1f) << 6 | json_handle_utf8_tail_bytes(parser, 1)); if (n < 0x80) json_signal_error(parser, Qjson_utf8_decode_error); } else if (c < 0xf0) { int n = ((c & 0xf) << 12 | json_handle_utf8_tail_bytes(parser, 2)); if (n < 0x800 || (n >= 0xd800 && n < 0xe000)) json_signal_error(parser, Qjson_utf8_decode_error); } else if (c < 0xf8) { int n = ((c & 0x7) << 18 | json_handle_utf8_tail_bytes(parser, 3)); if (n < 0x10000 || n > 0x10ffff) json_signal_error(parser, Qjson_utf8_decode_error); } else json_signal_error(parser, Qjson_utf8_decode_error); } else if (c == '\\') { /* Handle escape sequences */ c = json_input_get(parser); parser->current_column++; if (c == '"') json_byte_workspace_put(parser, '"'); else if (c == '\\') json_byte_workspace_put(parser, '\\'); else if (c == '/') json_byte_workspace_put(parser, '/'); else if (c == 'b') json_byte_workspace_put(parser, '\b'); else if (c == 'f') json_byte_workspace_put(parser, '\f'); else if (c == 'n') json_byte_workspace_put(parser, '\n'); else if (c == 'r') json_byte_workspace_put(parser, '\r'); else if (c == 't') json_byte_workspace_put(parser, '\t'); else if (c == 'u') { int num = json_parse_unicode(parser); /* is the first half of the surrogate pair */ if (num >= 0xd800 && num < 0xdc00) { parser->current_column++; if (json_input_get(parser) != '\\') json_signal_error(parser, Qjson_invalid_surrogate_error); parser->current_column++; if (json_input_get(parser) != 'u') json_signal_error(parser, Qjson_invalid_surrogate_error); int num2 = json_parse_unicode(parser); if (num2 < 0xdc00 || num2 >= 0xe000) json_signal_error(parser, Qjson_invalid_surrogate_error); num = (0x10000 + ((num - 0xd800) << 10 | (num2 - 0xdc00))); } else if (num >= 0xdc00 && num < 0xe000) /* is the second half of the surrogate pair without the first half */ json_signal_error(parser, Qjson_invalid_surrogate_error); /* utf-8 encode the code-point */ if (num < 0x80) json_byte_workspace_put(parser, num); else if (num < 0x800) { json_byte_workspace_put(parser, 0xc0 | num >> 6); json_byte_workspace_put(parser, 0x80 | (num & 0x3f)); } else if (num < 0x10000) { json_byte_workspace_put(parser, 0xe0 | num >> 12); json_byte_workspace_put(parser, (0x80 | ((num >> 6) & 0x3f))); json_byte_workspace_put(parser, 0x80 | (num & 0x3f)); } else { json_byte_workspace_put(parser, 0xf0 | num >> 18); json_byte_workspace_put(parser, (0x80 | ((num >> 12) & 0x3f))); json_byte_workspace_put(parser, (0x80 | ((num >> 6) & 0x3f))); json_byte_workspace_put(parser, 0x80 | (num & 0x3f)); } } else json_signal_error(parser, Qjson_escape_sequence_error); } else json_signal_error(parser, Qjson_parse_error); } } /* If there was no integer overflow during parsing the integer, this puts 'value' to the output. Otherwise this calls string_to_number to parse integer on the byte workspace. This could just always call string_to_number, but for performance reasons, during parsing the code tries to calculate the value, so in most cases, we can save call of string_to_number */ static Lisp_Object json_create_integer(struct json_parser *parser, bool integer_overflow, bool negative, ulong value) { Lisp_Object o; o.value = value; if (!integer_overflow) { if (negative) { uintmax_t v = value; if (v <= (uintmax_t)INTMAX_MAX + 1) return o; } else { return o; } } json_byte_workspace_put(parser, 0); ptrdiff_t len = strlen((const char *)parser->byte_workspace); if (len != parser->byte_workspace_current - parser->byte_workspace - 1) json_signal_error(parser, Qjson_error); return o; } /* Parses a float using the byte workspace */ static Lisp_Object json_create_float(struct json_parser *parser) { Lisp_Object o; json_byte_workspace_put(parser, 0); errno = 0; char *e; double value = strtod((const char *)parser->byte_workspace, &e); bool out_of_range = (errno != 0 && (value == HUGE_VAL || value == -HUGE_VAL)); if (out_of_range) json_signal_error(parser, Qjson_number_out_of_range); else if ((const unsigned char *)e != parser->byte_workspace_current - 1) json_signal_error(parser, Qjson_error); else { o.value = value; return o; } } /* Parses a number. The first character is the input parameter 'c'. */ static Lisp_Object json_parse_number(struct json_parser *parser, int c) { json_byte_workspace_reset(parser); json_byte_workspace_put(parser, c); Lisp_Object o; o.value = 0; bool negative = false; if (c == '-') { negative = true; c = json_input_get(parser); json_byte_workspace_put(parser, c); parser->current_column++; } if (c < '0' || c > '9') json_signal_error(parser, Qjson_parse_error); /* The idea is that during finding the last character of the number, the for loop below also tries to calculate the value. If the parsed number is an integer which fits into unsigned long, then the parser can use the value of 'integer' right away, instead of having to re-parse the byte workspace later. Ideally, this integer should have the same size as a CPU general purpose register. */ unsigned long integer = c - '0'; bool integer_overflow = false; if (integer == 0) { if (json_input_at_eof(parser)) return o; c = json_input_get(parser); } else { for (;;) { if (json_input_at_eof(parser)) return json_create_integer(parser, integer_overflow, negative, integer); c = json_input_get(parser); if (c < '0' || c > '9') break; json_byte_workspace_put(parser, c); parser->current_column++; integer_overflow |= ckd_mul(&integer, integer, 10); integer_overflow |= ckd_add(&integer, integer, c - '0'); } } bool is_float = false; if (c == '.') { json_byte_workspace_put(parser, c); parser->current_column++; is_float = true; c = json_input_get(parser); json_byte_workspace_put(parser, c); parser->current_column++; if (c < '0' || c > '9') json_signal_error(parser, Qjson_parse_error); for (;;) { if (json_input_at_eof(parser)) return json_create_float(parser); c = json_input_get(parser); if (c < '0' || c > '9') break; json_byte_workspace_put(parser, c); parser->current_column++; } } if (c == 'e' || c == 'E') { json_byte_workspace_put(parser, c); parser->current_column++; is_float = true; c = json_input_get(parser); json_byte_workspace_put(parser, c); parser->current_column++; if (c == '-' || c == '+') { c = json_input_get(parser); json_byte_workspace_put(parser, c); parser->current_column++; } if (c < '0' || c > '9') json_signal_error(parser, Qjson_parse_error); for (;;) { if (json_input_at_eof(parser)) return json_create_float(parser); c = json_input_get(parser); if (c < '0' || c > '9') break; json_byte_workspace_put(parser, c); parser->current_column++; } } /* 'c' contains a character which is not part of the number, so it is need to be put back */ json_input_put_back(parser); if (is_float) return json_create_float(parser); else return json_create_integer(parser, integer_overflow, negative, integer); } static Lisp_Object json_parse_value(struct json_parser *parser, int c); /* Parses a JSON array. */ static Lisp_Object json_parse_array(struct json_parser *parser) { int c = json_skip_whitespace(parser); const size_t begin_offset = parser->object_workspace_current - parser->object_workspace; if (c != ']') { parser->available_depth--; if (parser->available_depth < 0) json_signal_error(parser, Qjson_object_too_deep); size_t number_of_elements = 0; /* This loop collects the array elements in the object workspace */ for (;;) { Lisp_Object element = json_parse_value(parser, c); json_make_object_workspace_for(parser, 1); *parser->object_workspace_current++ = element; c = json_skip_whitespace(parser); number_of_elements++; if (c == ']') { parser->available_depth++; break; } if (c != ',') json_signal_error(parser, Qjson_parse_error); c = json_skip_whitespace(parser); } } Lisp_Object result; const Lisp_Object *b = parser->object_workspace + begin_offset; size_t number_of_elements = parser->object_workspace_current - b; void *array = malloc(number_of_elements*sizeof(Lisp_Object)); memcpy(array, b, number_of_elements*sizeof(Lisp_Object)); result.value = (size_t)array; parser->object_workspace_current = parser->object_workspace + begin_offset; return result; } /* Parses a JSON object. */ static Lisp_Object json_parse_object(struct json_parser *parser) { int c = json_skip_whitespace(parser); const size_t begin_offset = parser->object_workspace_current - parser->object_workspace; if (c != '}') { parser->available_depth--; if (parser->available_depth < 0) json_signal_error(parser, Qjson_object_too_deep); /* This loop collects the object members (key/value pairs) in * the object workspace */ for (;;) { if (c != '"') json_signal_error(parser, Qjson_parse_error); Lisp_Object key; json_byte_workspace_reset(parser); json_parse_string(parser); if (parser->byte_workspace_current > parser->byte_workspace) { key.value = parser->byte_workspace[0] + parser->byte_workspace_current[-1]; } else { key.value = 0; } c = json_skip_whitespace(parser); if (c != ':') json_signal_error(parser, Qjson_parse_error); c = json_skip_whitespace(parser); Lisp_Object value = json_parse_value(parser, c); json_make_object_workspace_for(parser, 2); *parser->object_workspace_current++ = key; *parser->object_workspace_current++ = value; c = json_skip_whitespace(parser); if (c == '}') { parser->available_depth++; break; } if (c != ',') json_signal_error(parser, Qjson_parse_error); c = json_skip_whitespace(parser); } } Lisp_Object result; Lisp_Object *end = parser->object_workspace_current; Lisp_Object *member = parser->object_workspace + begin_offset; Lisp_Object *array = malloc((end - member)*sizeof(Lisp_Object)); while (member < end) { *array++ = member[0]; *array++ = member[1]; member += 2; } result.value = (size_t)array; parser->object_workspace_current = parser->object_workspace + begin_offset; return result; } /* Token-char is not a JSON terminology. When parsing null/false/true, this function tells the character set that is need to be considered as part of a token. For example, if the input is "truesomething", then the parser shouldn't consider it as "true", and an additional later "something" token. An additional example: if the input is "truetrue", then calling (json-parse-buffer) twice shouldn't produce two successful calls which return t, but a parsing error */ static bool json_is_token_char(int c) { return ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || (c == '-')); } /* This is the entry point to the value parser, this parses a JSON * value */ Lisp_Object json_parse_value(struct json_parser *parser, int c) { if (c == '{') return json_parse_object(parser); else if (c == '[') return json_parse_array(parser); else if (c == '"') { json_byte_workspace_reset(parser); json_parse_string(parser); Lisp_Object result; if (parser->byte_workspace_current > parser->byte_workspace) { result.value = parser->byte_workspace[0] + parser->byte_workspace_current[-1]; } else { result.value = 0; } return result; } else if ((c >= '0' && c <= '9') || (c == '-')) return json_parse_number(parser, c); else { int c2 = json_input_get(parser); int c3 = json_input_get(parser); int c4 = json_input_get(parser); int c5 = json_input_get_if_possible(parser); if (c == 't' && c2 == 'r' && c3 == 'u' && c4 == 'e' && (c5 < 0 || !json_is_token_char(c5))) { if (c5 >= 0) json_input_put_back(parser); parser->current_column += 4; Lisp_Object o; o.value = 0; return o; } if (c == 'n' && c2 == 'u' && c3 == 'l' && c4 == 'l' && (c5 < 0 || !json_is_token_char(c5))) { if (c5 >= 0) json_input_put_back(parser); parser->current_column += 4; Lisp_Object o; o.value = 1; return o; } if (c == 'f' && c2 == 'a' && c3 == 'l' && c4 == 's' && c5 == 'e') { int c6 = json_input_get_if_possible(parser); if (c6 < 0 || !json_is_token_char(c6)) { if (c6 >= 0) json_input_put_back(parser); parser->current_column += 5; Lisp_Object o; o.value = 2; return o; } } json_signal_error(parser, Qjson_parse_error); } } enum ParseEndBehavior { PARSEENDBEHAVIOR_CheckForGarbage, PARSEENDBEHAVIOR_MovePoint }; static Lisp_Object json_parse(struct json_parser *parser, enum ParseEndBehavior parse_end_behavior) { int c = json_skip_whitespace(parser); Lisp_Object result = json_parse_value(parser, c); switch (parse_end_behavior) { case PARSEENDBEHAVIOR_CheckForGarbage: c = json_skip_whitespace_if_possible(parser); if (c >= 0) json_signal_error(parser, Qjson_trailing_content); break; case PARSEENDBEHAVIOR_MovePoint: { break; } } return result; } void do_test(unsigned char *buffer, int length) { struct json_parser p; struct json_configuration conf; /* unsigned char json_data[] = "[12, 34]"; */ json_parser_init(&p, conf, buffer, buffer + length, NULL, NULL); int x = setjmp(signal_env); if (x == 0) { json_parse(&p, PARSEENDBEHAVIOR_CheckForGarbage); /* } else { */ /* printf("error\n"); */ } printf("free!\n"); json_parser_done(&p); } __AFL_FUZZ_INIT(); int main() { #ifdef __AFL_HAVE_MANUAL_CONTROL __AFL_INIT(); #endif unsigned char *buf = __AFL_FUZZ_TESTCASE_BUF; // must be after __AFL_INIT // and before __AFL_LOOP! while (__AFL_LOOP(10000)) { int len = __AFL_FUZZ_TESTCASE_LEN; // don't use the macro directly in a // call! if (len < 8) continue; // check for a required/useful minimum input length /* Setup function call, e.g. struct target *tmp = libtarget_init() */ /* Call function to be fuzzed, e.g.: */ do_test(buf, len); /* Reset state. e.g. libtarget_free(tmp) */ } return 0; } ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-10 20:41 ` Herman, Géza @ 2024-03-10 23:22 ` Christopher Wellons 2024-03-11 9:34 ` Herman, Géza 0 siblings, 1 reply; 51+ messages in thread From: Christopher Wellons @ 2024-03-10 23:22 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org > It's been running for an hour, the tester didn't find any problems yet. Except for an overflow assigning huge floats to the dummy Lisp_Object value — which is a problem with the test, not the parser — this stripped down version looks robust to me, too. Solid work! I have no further feedback or commentary. I made a few tweaks to harden the test, which did not change the results: * Rather that directly use the AFL input buffer, it uses an exactly-sized copy so that it could detect out of bounds access on input. * Used a custom malloc that initializes memory to garbage. * Used a custom realloc that always moves, and initializes extended memory to garbage. > it incorrectly optimizes around setjmp in do_test() If a program modifies a variable after the first setjmp return and then accesses it after the second setjmp return, it must be volatile-qualified. GCC and Clang have some machinery to mitigate a lack of volatile, such as the returns_twice attribute, but technically it's required. I believe in practice using either builtin setjmp/longjmp or a memory barrier would be sufficient, but my system's Clang doesn't exhibit the stale pointer here, so I can't test that theory. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-10 23:22 ` Christopher Wellons @ 2024-03-11 9:34 ` Herman, Géza 2024-03-11 13:47 ` Christopher Wellons 0 siblings, 1 reply; 51+ messages in thread From: Herman, Géza @ 2024-03-11 9:34 UTC (permalink / raw) To: Christopher Wellons; +Cc: Herman, Géza, emacs-devel@gnu.org Christopher Wellons <wellons@nullprogram.com> writes: >> It's been running for an hour, the tester didn't find any >> problems yet. > > Except for an overflow assigning huge floats to the dummy > Lisp_Object > value — which is a problem with the test, not the parser — this > stripped down version looks robust to me, too. Solid work! I > have no > further feedback or commentary. Cool! Out of curiosity, how did you find the overflow problem? Did you just notice it, or did the fuzzer/sanitizer find it? > I made a few tweaks to harden the test, which did not change the > results: Thanks for the work! I think that we can say then that the new parser works OK, it's unlikely that it has any serious problem. >> it incorrectly optimizes around setjmp in do_test() > > If a program modifies a variable after the first setjmp return > and > then accesses it after the second setjmp return, it must be > volatile-qualified. Good to know, thanks for the info! ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: I created a faster JSON parser 2024-03-11 9:34 ` Herman, Géza @ 2024-03-11 13:47 ` Christopher Wellons 0 siblings, 0 replies; 51+ messages in thread From: Christopher Wellons @ 2024-03-11 13:47 UTC (permalink / raw) To: Herman, Géza; +Cc: emacs-devel@gnu.org > did the fuzzer/sanitizer find it? Clang, but not GCC, places a UBSan check on float to integer conversions, and that check was tripped when I fuzzed a Clang build. Example: int main(void) { int x = 1e10; } Nothing from GCC: $ gcc -w -g3 -fsanitize=undefined example.c $ ./a.out $ But with Clang: $ clang -w -g3 -fsanitize=undefined example.c $ ./a.out example.c:3:13: runtime error: 1e+10 is outside the range of representable values of type 'int' ^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2024-03-19 19:18 UTC | newest] Thread overview: 51+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-03-08 10:27 I created a faster JSON parser Herman, Géza 2024-03-08 11:41 ` Philip Kaludercic 2024-03-08 12:34 ` Herman, Géza 2024-03-08 12:03 ` Eli Zaretskii 2024-03-08 12:38 ` Herman, Géza 2024-03-08 12:59 ` Eli Zaretskii 2024-03-08 13:12 ` Herman, Géza 2024-03-08 14:10 ` Eli Zaretskii 2024-03-08 14:24 ` Collin Funk 2024-03-08 15:20 ` Herman, Géza 2024-03-08 16:22 ` Eli Zaretskii 2024-03-08 18:34 ` Herman, Géza 2024-03-08 19:57 ` Eli Zaretskii 2024-03-08 20:22 ` Herman, Géza 2024-03-09 6:52 ` Eli Zaretskii 2024-03-09 11:08 ` Herman, Géza 2024-03-09 12:23 ` Lynn Winebarger 2024-03-09 12:58 ` Po Lu 2024-03-09 13:13 ` Eli Zaretskii 2024-03-09 14:00 ` Herman, Géza 2024-03-09 14:21 ` Eli Zaretskii 2024-03-08 13:28 ` Po Lu 2024-03-08 16:14 ` Herman, Géza 2024-03-09 1:55 ` Po Lu 2024-03-09 20:37 ` Christopher Wellons 2024-03-10 6:31 ` Eli Zaretskii 2024-03-10 21:39 ` Philip Kaludercic 2024-03-11 13:29 ` Eli Zaretskii 2024-03-11 14:05 ` Mattias Engdegård 2024-03-11 14:35 ` Herman, Géza 2024-03-12 9:26 ` Mattias Engdegård 2024-03-12 10:20 ` Gerd Möllmann 2024-03-12 11:14 ` Mattias Engdegård 2024-03-12 11:33 ` Gerd Möllmann 2024-03-15 13:35 ` Herman, Géza 2024-03-15 14:56 ` Gerd Möllmann 2024-03-19 18:49 ` Mattias Engdegård 2024-03-19 19:05 ` Herman, Géza 2024-03-19 19:18 ` Gerd Möllmann 2024-03-19 19:13 ` Gerd Möllmann 2024-03-12 10:58 ` Herman, Géza 2024-03-12 13:11 ` Mattias Engdegård 2024-03-12 13:42 ` Mattias Engdegård 2024-03-12 15:23 ` Herman, Géza 2024-03-12 15:39 ` Gerd Möllmann 2024-03-10 6:58 ` Herman, Géza 2024-03-10 16:54 ` Christopher Wellons 2024-03-10 20:41 ` Herman, Géza 2024-03-10 23:22 ` Christopher Wellons 2024-03-11 9:34 ` Herman, Géza 2024-03-11 13:47 ` Christopher Wellons
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).