Re: I created a faster JSON parser

From: Eli Zaretskii <eliz@gnu.org>
To: "Géza Herman" <geza.herman@gmail.com>
Cc: emacs-devel@gnu.org
Subject: Re: I created a faster JSON parser
Date: Sat, 09 Mar 2024 08:52:36 +0200
Message-ID: <86a5n7zykr.fsf@gnu.org>
In-Reply-To: <87a5n8to8m.fsf@gmail.com> (message from Herman, Géza on Fri, 08 Mar 2024 21:22:13 +0100)

> From: Herman, Géza <geza.herman@gmail.com>
> Cc: Herman Géza <geza.herman@gmail.com>,
>  emacs-devel@gnu.org
> Date: Fri, 08 Mar 2024 21:22:13 +0100
> 
> > Is there a reason for you to want it to be 64-bit type on a 64-bit
> > machine?  If the only bother is efficiency, then you can use 'int'
> > without fear.  But if a 64-bit machine will need the range of
> > values beyond INT_MAX (does it?), then I suggest using ptrdiff_t.
>
> The only reason is that if I use a 64-bit number on a 64-bit
> platform, then the fast path will be chosen more frequently. So it
> makes sense to use a register-sized integer here.

Then either ptrdiff_t or EMACS_INT should do what you want.

> Yes, it seems that EMACS_UINT is good for my purpose, thanks for 
> the suggestion.

Are you sure you need the unsigned variety?  If EMACS_INT fits the
bill, then it is a better candidate, since unsigned arithmetic has
its quirks.
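
One such quirk, as a minimal sketch: a subtraction that would go
negative wraps around to a huge value in an unsigned type, so a bounds
check can silently pass.  The helper names below are hypothetical and
are not taken from the parser; ptrdiff_t stands in for EMACS_INT so
that the example compiles outside Emacs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: are at least NEED bytes left after POS in a
   buffer of LEN bytes?  Unsigned version.  */
static bool
has_room_unsigned (size_t len, size_t pos, size_t need)
{
  /* If POS > LEN, LEN - POS wraps around to a very large value,
     so this comparison succeeds even though no room is left.  */
  return len - pos >= need;
}

/* The same check with a signed type.  */
static bool
has_room_signed (ptrdiff_t len, ptrdiff_t pos, ptrdiff_t need)
{
  assert (need >= 0);
  /* LEN - POS is simply negative when POS > LEN, so the
     comparison fails as intended.  */
  return len - pos >= need;
}
```

With len = 4 and pos = 5, the unsigned version reports that there is
room for one more byte, while the signed version does not.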

> > The jansson code required encoding/decoding strings to make sure
> > we submit to jansson text that is always valid UTF-8.
> 
> I tried to use the jansson parser with a Unicode 0x333333
> character in a string, and it didn't work; it fails with a
> (json-parse-error "unable to decode byte... message.

Well, I didn't say trying an arbitrary codepoint will demonstrate the
issue.  Some codepoints above 0x10FFFF indeed cannot be passed to
jansson.

It's okay if the initial version of this parser only handles the
Unicode range and errors out otherwise; we could extend it if needed
later.  But the error message should talk specifically about invalid
character or something, not just a generic "parse error".
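
A minimal sketch of such a check, assuming the parser has the decoded
codepoint in hand as an integer.  The enum, the function names and the
message text are hypothetical placeholders, not the parser's actual
API; only the 0x10FFFF limit comes from the discussion above.

```c
#include <assert.h>

/* Largest codepoint in the Unicode range; Emacs characters can be
   larger than this.  */
enum { MAX_UNICODE_CODEPOINT = 0x10FFFF };

/* Hypothetical status codes, not the parser's real ones.  */
enum parse_status
{
  PARSE_OK,
  PARSE_INVALID_CHARACTER
};

/* Accept only codepoints inside the Unicode range.  */
static enum parse_status
check_codepoint (long c)
{
  if (c < 0 || c > MAX_UNICODE_CODEPOINT)
    return PARSE_INVALID_CHARACTER;
  return PARSE_OK;
}

/* Map a status to a specific message rather than a generic
   "parse error".  */
static const char *
parse_status_message (enum parse_status status)
{
  assert (status == PARSE_OK || status == PARSE_INVALID_CHARACTER);
  return (status == PARSE_INVALID_CHARACTER
          ? "character outside the Unicode range"
          : "no error");
}
```

With this, both 0x333333 and a raw-byte codepoint such as 0x3FFFA0
would be rejected with the specific message.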

> Also, I see that json-parse-string calls some UTF-8 encoding-related
> function before parsing, but json-parse-buffer doesn't (and it
> doesn't do anything encoding-related in the callback, it just
> calls memcpy).

This is a part I was never happy about.  But, as I say above, we can
get to handling these rare cases later.

> So based on these, is there any benefit to supporting these?

Yes, definitely.  But it isn't urgent.

> Out of curiosity, what are these extra characters used for?

Raw bytes and characters from charsets that are not (yet) unified with
Unicode.

> What is the purpose of the odd special 2-byte encoding of 8-bit
> characters (I mean where the 1st byte is C0/C1)? Why not just use
> the regular UTF-8 encoding for these values?

I think it's for efficiency: a 2-byte encoding takes much less space
than the 5-byte encoding (using a superset of UTF-8) would take.
Imagine the case where a large byte-stream is inserted into a
multibyte buffer, before decoding it, something that happens a lot
when visiting non-ASCII files or reading from a network sub-process.

The regular UTF-8 encoding cannot be used for the raw bytes, because
then we will be unable to distinguish between them and the Unicode
codepoints of the same value.  For example, a raw byte whose value is
160 decimal (A0 hex) will be indistinguishable from U+00A0 No-Break
Space character.  This is why the "codepoint" corresponding to raw
byte 160 is 0x3FFFA0, see BYTE8_TO_CHAR.
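
The distinction can be sketched as follows.  This is an illustration
written from my understanding of the internal representation, not a
copy of the Emacs sources: the function names and the 0x3FFF00 offset
constant are stand-ins for what BYTE8_TO_CHAR and related macros do.

```c
#include <assert.h>

/* Assumed offset between a raw byte and its "codepoint":
   raw byte 0xA0 maps to 0x3FFFA0.  */
enum { RAW_BYTE_OFFSET = 0x3FFF00 };

/* Hypothetical equivalent of BYTE8_TO_CHAR for bytes 0x80..0xFF.  */
static int
raw_byte_to_char (unsigned char byte)
{
  assert (byte >= 0x80);
  return byte + RAW_BYTE_OFFSET;
}

/* Special 2-byte form of a raw byte: the leading byte is C0 or C1,
   which regular UTF-8 never produces.  Returns the byte count.  */
static int
encode_raw_byte (unsigned char byte, unsigned char out[2])
{
  assert (byte >= 0x80);
  out[0] = 0xC0 | ((byte >> 6) & 0x01);
  out[1] = 0x80 | (byte & 0x3F);
  return 2;
}

/* Regular 2-byte UTF-8 for a codepoint in U+0080..U+07FF.  */
static int
encode_utf8_2byte (int c, unsigned char out[2])
{
  assert (c >= 0x80 && c <= 0x7FF);
  out[0] = 0xC0 | (c >> 6);
  out[1] = 0x80 | (c & 0x3F);
  return 2;
}
```

Under these assumptions, raw byte A0 is stored as the bytes C0 A0,
while U+00A0 is stored as C2 A0, so the two never collide.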

Once again, we can extend the parser for codepoints outside of the
Unicode range later.  For now, it's okay to reject them with a
suitable error.
