unofficial mirror of emacs-devel@gnu.org 
From: Christopher Wellons <wellons@nullprogram.com>
To: "Herman, Géza" <geza.herman@gmail.com>
Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org>
Subject: Re: I created a faster JSON parser
Date: Sun, 10 Mar 2024 12:54:13 -0400
Message-ID: <20240310165413.35pszp3b37m3y2kh@nullprogram.com>
In-Reply-To: <878r2qfrs5.fsf@gmail.com>

> I'd glad if you can give some advices: which fuzzy-testing framework to 
> use, which introductory material is worth reading, etc.

The Jansson repository has a libFuzzer-based fuzz test, which is perhaps a 
useful example. In it they define LLVMFuzzerTestOneInput, a function which 
accepts a buffer of input (pointer and length), which they feed into the 
code under test. That's basically it. In the new parser that buffer would 
go into json_parse. The tested code is instrumented, and the fuzz tester 
observes the effect inputs have on control flow, using that information to 
construct new inputs that explore new execution paths, trying to exercise 
as many as possible.
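
As a rough sketch of such a harness: parse_under_test below is a made-up 
stand-in (a bracket-nesting checker), not the new parser's API. In a real 
harness that call would go to json_parse instead. Only the 
LLVMFuzzerTestOneInput signature comes from libFuzzer itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the code under test. Returns 0 if square
   and curly brackets nest correctly within 64 levels, -1 otherwise. */
static int parse_under_test(const unsigned char *buf, size_t len)
{
    unsigned char stack[64];
    size_t depth = 0;
    for (size_t i = 0; i < len; i++) {
        assert(depth <= sizeof stack);  /* aborts if the invariant breaks */
        switch (buf[i]) {
        case '[': case '{':
            if (depth == sizeof stack) return -1;  /* too deep: reject */
            stack[depth++] = buf[i];
            break;
        case ']': case '}':
            if (depth == 0) return -1;
            depth--;
            if (stack[depth] != (buf[i] == ']' ? '[' : '{')) return -1;
            break;
        }
    }
    return depth == 0 ? 0 : -1;
}

/* libFuzzer calls this once per generated input. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    /* The result is ignored: rejecting bad input is not a defect.
       Only crashes, sanitizer reports, and timeouts count. */
    parse_under_test(data, size);
    return 0;
}
```

With Clang this would typically be built using something like 
-fsanitize=fuzzer,address,undefined, which links in libFuzzer's own main.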

I'm partial to AFL++, and it's what I reach for first. It also works with 
GCC. It has two modes, with persistent mode preferred:

https://github.com/AFLplusplus/AFLplusplus/blob/stable/instrumentation/README.persistent_mode.md

It's the same in principle, but with control inverted: your own main runs 
the loop and fetches each input from the fuzzer. For seed inputs, a few 
small JSON documents exercising the parser's features are sufficient. In either 
case, use -fsanitize=address,undefined so that defective execution paths 
are more likely to be detected. More assertions would help, too, such as 
"assert(input_current <= input_end)" in a number of places. Assertions 
must abort or trap so that the fuzz tester knows it found a defect.
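
A minimal sketch of both ideas together, assuming the macros described in 
the persistent mode README linked above. The skip_whitespace function is 
hypothetical and only stands in for parser code that advances a cursor. 
The main function is compiled only when afl-cc supplies the macros.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scanner standing in for the parser: skips JSON
   whitespace and returns the number of bytes skipped. Do not build
   with NDEBUG, or the assertions disappear. */
static size_t skip_whitespace(const unsigned char *input_current,
                              const unsigned char *input_end)
{
    const unsigned char *start = input_current;
    assert(input_current <= input_end);
    while (input_current < input_end &&
           (*input_current == ' '  || *input_current == '\t' ||
            *input_current == '\n' || *input_current == '\r')) {
        input_current++;
    }
    assert(input_current <= input_end);  /* cursor never passes the end */
    return (size_t)(input_current - start);
}

#ifdef __AFL_FUZZ_TESTCASE_LEN  /* provided by afl-cc, absent elsewhere */
#include <unistd.h>

__AFL_FUZZ_INIT();

int main(void)
{
    __AFL_INIT();
    unsigned char *buf = __AFL_FUZZ_TESTCASE_BUF;
    while (__AFL_LOOP(10000)) {          /* one iteration per input */
        int len = __AFL_FUZZ_TESTCASE_LEN;
        skip_whitespace(buf, buf + len);
    }
    return 0;
}
#endif
```

Here the program owns the loop and asks AFL++ for each test case, where 
libFuzzer instead calls into a function you supply.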

Fuzz testing works better in a narrow scope. Ideally only the code being 
tested is instrumented. If it's running within an Emacs context, and you 
instrument all of Emacs, the fuzz tester would explore paths in Emacs 
reachable through the JSON parser rather than focus on the parser itself. 
That will waste time that could instead be spent exploring the parser.

You don't need to allocate lisp objects during fuzz testing. In fact, you 
should avoid it because that would just slow it down. (I even bet it's the 
bottleneck in the new parser.) Ideally the core parser consumes bytes and 
produces JSON events, and is agnostic to its greater context. To integrate 
with Emacs, you'd have additional, separate code that turns JSON events 
into lisp objects, and which wouldn't be fuzz tested.

Written that way, I could hook this core up to one of the above fuzz test 
interfaces, mock out whatever bits of Emacs might still be there (e.g. 
ckd_mul: the isolation need not be perfect), feed it the input, and pump 
events until either error (i.e. bad input detected, which is ignored) or 
EOF. The fuzz tester uses a timeout to detect infinite loops, which AFL++ 
reports as "hangs", saving the input for manual investigation. It 
should exercise JSON numeric parsing, too, at least to the extent that 
it's not punted to Emacs or strtod (mind your locale!). I'd also make 
available_depth much smaller so that the fuzzing could exercise failing 
checks.
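
To illustrate the shape I have in mind, here is a deliberately simplified 
sketch. All names (struct parser, parser_next, pump, the EV_ constants) 
are invented for the example and are not from the patch. It tracks 
nesting only and treats any other run of bytes as one scalar, so it is 
not a JSON validator. The point is the interface: bytes in, events out, 
no Emacs in sight, and a depth limit the harness can set very low.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds. */
enum ev { EV_EOF, EV_ERROR, EV_ARRAY_BEGIN, EV_ARRAY_END,
          EV_OBJECT_BEGIN, EV_OBJECT_END, EV_SCALAR };

struct parser {
    const unsigned char *cur, *end;
    int depth, max_depth;
};

static void parser_init(struct parser *p, const void *buf, size_t len,
                        int max_depth)
{
    p->cur = buf;
    p->end = p->cur + len;
    p->depth = 0;
    p->max_depth = max_depth;
}

/* Whitespace, ',' and ':' are all skipped as separators here. */
static int is_sep(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' ||
           c == ',' || c == ':';
}

static int is_bracket(unsigned char c)
{
    return c == '[' || c == ']' || c == '{' || c == '}';
}

/* Consume some input and return the next event. */
static enum ev parser_next(struct parser *p)
{
    assert(p->cur <= p->end);
    while (p->cur < p->end && is_sep(*p->cur)) p->cur++;
    if (p->cur == p->end) return p->depth ? EV_ERROR : EV_EOF;
    unsigned char c = *p->cur++;
    switch (c) {
    case '[': case '{':
        if (p->depth == p->max_depth) return EV_ERROR;  /* depth check */
        p->depth++;
        return c == '[' ? EV_ARRAY_BEGIN : EV_OBJECT_BEGIN;
    case ']': case '}':
        if (p->depth == 0) return EV_ERROR;
        p->depth--;
        return c == ']' ? EV_ARRAY_END : EV_OBJECT_END;
    default:
        while (p->cur < p->end && !is_sep(*p->cur) && !is_bracket(*p->cur))
            p->cur++;
        return EV_SCALAR;
    }
}

/* Pump events until error or EOF, as a fuzz harness would. Returns the
   number of events seen before EOF, or -1 on error. */
static long pump(const void *buf, size_t len, int max_depth)
{
    struct parser p;
    parser_init(&p, buf, len, max_depth);
    for (long n = 0;; n++) {
        switch (parser_next(&p)) {
        case EV_EOF:   return n;
        case EV_ERROR: return -1;
        default:       break;   /* discard the event, keep going */
        }
    }
}
```

A harness would call something like pump(data, size, 2) from either fuzz 
test interface, with the small max_depth making the depth check easy for 
the fuzzer to reach.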

To get the bulk of the value, the fuzz test does not necessarily need to 
be checked into source control, or even run as part of a standard test 
suite. Given a clean, decoupled interface and implementation, it would 
only take a few minutes to hook up a fuzz test. I was hoping to find just 
that, but each JSON function has multiple points of contact with Emacs, 
most especially json_parse_object.

I've done such ad-hoc fuzz testing on dozens of programs and libraries to 
evaluate their quality, and sometimes even improve them. In most cases, if 
a program can be fuzz tested and it's never been fuzz tested before, this 
technique finds fresh bugs in a matter of minutes, if not seconds. When I say it's 
incredibly effective, I mean it! Case in point from a few weeks ago, under 
similar circumstances, which can also serve as a practical example:

https://github.com/editorconfig/editorconfig-core-c/pull/103



