From: Christopher Wellons <wellons@nullprogram.com>
To: "Herman, Géza" <geza.herman@gmail.com>
Cc: "emacs-devel@gnu.org" <emacs-devel@gnu.org>
Subject: Re: I created a faster JSON parser
Date: Sun, 10 Mar 2024 12:54:13 -0400
Message-ID: <20240310165413.35pszp3b37m3y2kh@nullprogram.com>
In-Reply-To: <878r2qfrs5.fsf@gmail.com>
> I'd glad if you can give some advices: which fuzzy-testing framework to
> use, which introductory material is worth reading, etc.
The Jansson repository has a libFuzzer-based fuzz test, which is perhaps a
useful example. In it they define LLVMFuzzerTestOneInput, a function which
accepts a buffer of input (pointer and length), which they feed into the
code under test. That's basically it. In the new parser that buffer would
go into json_parse. The tested code is instrumented, and the fuzz tester
observes the effect inputs have on control flow, using that information to
construct new inputs that explore new execution paths, trying to exercise
as many as possible.
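
As a rough illustration, a libFuzzer harness can be as small as the sketch below. Note that toy_parse is a hypothetical stand-in I made up so the example compiles on its own (it only recognizes the three JSON literals); in a real harness the call would go to the actual entry point, e.g. json_parse, whose signature I am not assuming here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the code under test: accepts only the JSON
   literals "true", "false" and "null". Returns 1 on accept, 0 on reject. */
static int toy_parse(const uint8_t *buf, size_t len)
{
    static const char *const words[] = { "true", "false", "null" };
    for (size_t i = 0; i < sizeof words / sizeof *words; i++) {
        size_t n = strlen(words[i]);
        if (len == n && memcmp(buf, words[i], n) == 0)
            return 1;
    }
    return 0;
}

/* libFuzzer calls this once per generated input. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    int result = toy_parse(data, size);
    /* Rejecting bad input is not a defect; a broken invariant is. */
    assert(result == 0 || result == 1);
    return 0;
}
```

With Clang this would typically be built with something like "clang -g -fsanitize=fuzzer,address,undefined harness.c"; libFuzzer supplies main itself.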
I'm partial to AFL++, and it's what I reach for first. It also works with
GCC. It has two modes, with persistent mode preferred:
https://github.com/AFLplusplus/AFLplusplus/blob/stable/instrumentation/README.persistent_mode.md
It's the same in principle, but with control inverted: your own main loop
requests each input. For seed inputs, a few small
JSON documents exercising the parser's features are sufficient. In either
case, use -fsanitize=address,undefined so that defective execution paths
are more likely to be detected. More assertions would help, too, such as
"assert(input_current <= input_end)" in a number of places. Assertions
must abort or trap so that the fuzz tester knows it found a defect.
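
A minimal sketch of an AFL++ persistent-mode harness with such an assertion might look like the following. The function skip_whitespace is a hypothetical piece of "code under test" invented for this example; the __AFL_* macros are the ones described in the persistent mode README linked above. The main function is guarded so that the file still compiles with an ordinary compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical code under test: counts leading JSON whitespace bytes. */
static size_t skip_whitespace(const unsigned char *buf, size_t len)
{
    const unsigned char *input_current = buf;
    const unsigned char *input_end = buf + len;
    while (input_current < input_end
           && (*input_current == ' ' || *input_current == '\t'
               || *input_current == '\r' || *input_current == '\n')) {
        input_current++;
        assert(input_current <= input_end);  /* aborts if violated */
    }
    return (size_t)(input_current - buf);
}

#ifdef __AFL_HAVE_MANUAL_CONTROL  /* only set by the AFL++ compilers */
__AFL_FUZZ_INIT();

int main(void)
{
    __AFL_INIT();
    unsigned char *shared = __AFL_FUZZ_TESTCASE_BUF;
    while (__AFL_LOOP(10000)) {
        int len = __AFL_FUZZ_TESTCASE_LEN;
        /* Copy into an exactly sized heap buffer so that ASan can catch
           reads past the end of the input. */
        unsigned char *copy = malloc(len > 0 ? (size_t)len : 1);
        if (!copy)
            return 1;
        memcpy(copy, shared, (size_t)len);
        skip_whitespace(copy, (size_t)len);
        free(copy);
    }
    return 0;
}
#endif
```

Assuming a standard AFL++ install, this would be built with e.g. "afl-gcc-fast -g3 -fsanitize=address,undefined harness.c" and run with "afl-fuzz -i seeds -o results ./a.out", where seeds is a directory holding the sample JSON documents.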
Fuzz testing works better in a narrow scope. Ideally only the code being
tested is instrumented. If it's running within an Emacs context, and you
instrument all of Emacs, the fuzz tester would explore paths in Emacs
reachable through the JSON parser rather than focus on the parser itself.
That will waste time that could instead be spent exploring the parser.
You don't need to allocate lisp objects during fuzz testing. In fact, you
should avoid it because that would just slow it down. (I even bet it's the
bottleneck in the new parser.) Ideally the core parser consumes bytes and
produces JSON events, and is agnostic to its greater context. To integrate
with Emacs, you'd have additional, separate code that turns JSON events
into lisp objects, and which wouldn't be fuzz tested.
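
To make the idea concrete, here is a sketch of what such a core interface could look like. Every name in it (json_lexer, json_next, the event enum) is hypothetical and not taken from the new parser. It is deliberately simplified: it works at the token level, treats commas and colons like whitespace, and does not validate the grammar. Events point back into the input rather than allocating anything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum json_event {
    JSON_EOF, JSON_ERROR, JSON_ARRAY_BEGIN, JSON_ARRAY_END,
    JSON_OBJECT_BEGIN, JSON_OBJECT_END,
    JSON_STRING, JSON_NUMBER, JSON_TRUE, JSON_FALSE, JSON_NULL
};

struct json_lexer {
    const unsigned char *cur;  /* next unread byte */
    const unsigned char *end;  /* one past the last byte */
    const unsigned char *tok;  /* start of the most recent token */
    size_t toklen;             /* length of that token in bytes */
};

static int is_separator(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n'
        || c == ',' || c == ':';
}

static enum json_event literal(struct json_lexer *lx, const char *word,
                               enum json_event ev)
{
    size_t n = strlen(word);
    if ((size_t)(lx->end - lx->tok) < n || memcmp(lx->tok, word, n) != 0)
        return JSON_ERROR;
    lx->cur = lx->tok + n;
    lx->toklen = n;
    return ev;
}

/* Consume bytes, produce the next event. No allocation, no Emacs. */
static enum json_event json_next(struct json_lexer *lx)
{
    assert(lx->cur <= lx->end);
    while (lx->cur < lx->end && is_separator(*lx->cur))
        lx->cur++;
    if (lx->cur == lx->end)
        return JSON_EOF;
    lx->tok = lx->cur;
    lx->toklen = 1;
    unsigned char c = *lx->cur++;
    switch (c) {
    case '[': return JSON_ARRAY_BEGIN;
    case ']': return JSON_ARRAY_END;
    case '{': return JSON_OBJECT_BEGIN;
    case '}': return JSON_OBJECT_END;
    case 't': return literal(lx, "true", JSON_TRUE);
    case 'f': return literal(lx, "false", JSON_FALSE);
    case 'n': return literal(lx, "null", JSON_NULL);
    case '"':
        while (lx->cur < lx->end && *lx->cur != '"') {
            if (*lx->cur == '\\' && lx->end - lx->cur > 1)
                lx->cur++;  /* skip the escaped byte */
            lx->cur++;
        }
        if (lx->cur == lx->end)
            return JSON_ERROR;  /* unterminated string */
        lx->cur++;  /* closing quote */
        lx->toklen = (size_t)(lx->cur - lx->tok);
        return JSON_STRING;
    default:
        if (c != '-' && (c < '0' || c > '9'))
            return JSON_ERROR;
        while (lx->cur < lx->end
               && ((*lx->cur >= '0' && *lx->cur <= '9') || *lx->cur == '.'
                   || *lx->cur == 'e' || *lx->cur == 'E'
                   || *lx->cur == '+' || *lx->cur == '-'))
            lx->cur++;
        lx->toklen = (size_t)(lx->cur - lx->tok);
        return JSON_NUMBER;
    }
}
```

The Emacs side would then be a separate loop that calls json_next and builds lisp objects from tok/toklen, while a fuzz harness calls the same function and simply discards the events.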
Written that way, I could hook this core up to one of the above fuzz test
interfaces, mock out whatever bits of Emacs might still be there (e.g.
ckd_mul: the isolation need not be perfect), feed it the input, and pump
events until either error (i.e. bad input detected, which is ignored) or
EOF. The fuzz tester uses a timeout to detect infinite loops, which AFL++
reports as "hangs", saving the input for manual investigation. It
should exercise JSON numeric parsing, too, at least to the extent that
it's not punted to Emacs or strtod (mind your locale!). I'd also make
available_depth much smaller so that the fuzzing could exercise failing
checks.
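
Roughly, the glue described above could be sketched like this. The step function is a hypothetical stand-in for the decoupled core (it only tracks nesting and non-negative integers), AVAILABLE_DEPTH is a made-up constant standing in for available_depth, and the ckd_mul/ckd_add mocks assume GCC or Clang, whose overflow builtins return true on overflow. Treating integer overflow as an error is also just a choice for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock: outside Emacs there may be no <stdckdint.h>, so map the checked
   arithmetic onto the GCC/Clang builtins (true means overflow). */
#ifndef ckd_mul
#define ckd_mul(r, a, b) __builtin_mul_overflow((a), (b), (r))
#define ckd_add(r, a, b) __builtin_add_overflow((a), (b), (r))
#endif

/* Tiny on purpose, so the fuzzer actually reaches the failing check. */
enum { AVAILABLE_DEPTH = 4 };

enum status { ST_MORE, ST_EOF, ST_ERROR };

struct state {
    const unsigned char *cur, *end;
    int depth;
};

/* Hypothetical core step: consumes at least one byte per call. */
static enum status step(struct state *s)
{
    assert(s->cur <= s->end);
    if (s->cur == s->end)
        return s->depth == 0 ? ST_EOF : ST_ERROR;
    unsigned char c = *s->cur++;
    if (c == '[' || c == '{') {
        if (s->depth == AVAILABLE_DEPTH)
            return ST_ERROR;  /* nesting too deep */
        s->depth++;
    } else if (c == ']' || c == '}') {
        if (s->depth == 0)
            return ST_ERROR;  /* unbalanced close */
        s->depth--;
    } else if (c >= '0' && c <= '9') {
        int64_t value = c - '0';
        while (s->cur < s->end && *s->cur >= '0' && *s->cur <= '9') {
            int digit = *s->cur++ - '0';
            if (ckd_mul(&value, value, 10) || ckd_add(&value, value, digit))
                return ST_ERROR;  /* integer overflow */
        }
    }
    return ST_MORE;
}

/* Pump until error or EOF. Errors just mean "bad input" and are ignored
   by the caller; only crashes, traps and hangs interest the fuzzer. */
static enum status pump(const unsigned char *buf, size_t len)
{
    struct state s = { buf, buf + len, 0 };
    enum status st;
    while ((st = step(&s)) == ST_MORE) {
    }
    return st;
}
```

A fuzz entry point then reduces to a single call to pump with the buffer and length it was handed.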
To get the bulk of the value, the fuzz test does not necessarily need to
be checked into source control, or even run as part of a standard test
suite. Given a clean, decoupled interface and implementation, it would
only take a few minutes to hook up a fuzz test. I was hoping to find just
that, but each JSON function has multiple points of contact with Emacs,
most especially json_parse_object.
I've done such ad-hoc fuzz testing on dozens of programs and libraries to
evaluate their quality, and sometimes even improve them. In most cases, if
something can be fuzz tested and it's never been fuzz tested before, this technique
finds fresh bugs in a matter of minutes, if not seconds. When I say it's
incredibly effective, I mean it! Case in point from a few weeks ago, under
similar circumstances, which can also serve as a practical example:
https://github.com/editorconfig/editorconfig-core-c/pull/103