From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: I created a faster JSON parser
Date: Sat, 09 Mar 2024 08:52:36 +0200
Message-ID: <86a5n7zykr.fsf@gnu.org>
References: <87a5n96mb5.fsf@gmail.com> <861q8l0w2c.fsf@gnu.org>
 <878r2s99j0.fsf@gmail.com> <86y1aszxom.fsf@gnu.org>
 <874jdg97xm.fsf@gmail.com> <86ttlgzuew.fsf@gnu.org>
 <875xxw3f3a.fsf@gmail.com> <86plw4zo9u.fsf@gnu.org>
 <87edcktumt.fsf@gmail.com> <86cys4zec7.fsf@gnu.org> <87a5n8to8m.fsf@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="15770"; mail-complaints-to="usenet@ciao.gmane.io"
Cc: emacs-devel@gnu.org
To: =?iso-8859-1?Q?G=E9za_Herman?= <geza.herman@gmail.com>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Sat Mar 09 07:53:50 2024
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1riqac-0003x2-BT
	for ged-emacs-devel@m.gmane-mx.org; Sat, 09 Mar 2024 07:53:50 +0100
Original-Received: from localhost ([::1] helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <emacs-devel-bounces@gnu.org>)
	id 1riqZg-0001ql-9Y; Sat, 09 Mar 2024 01:52:52 -0500
Original-Received: from eggs.gnu.org ([2001:470:142:3::10])
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <eliz@gnu.org>) id 1riqZY-0001qJ-Vg
 for emacs-devel@gnu.org; Sat, 09 Mar 2024 01:52:45 -0500
Original-Received: from fencepost.gnu.org ([2001:470:142:3::e])
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <eliz@gnu.org>)
 id 1riqZT-0007gL-Rd; Sat, 09 Mar 2024 01:52:44 -0500
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org;
 s=fencepost-gnu-org; h=MIME-version:References:Subject:In-Reply-To:To:From:
 Date; bh=69NnHhqk40jxidPnT1DtIJjXb/1Cxyfj1sAjpjqgseA=; b=Z9sqKMk2RaNXYNKycyiW
 hKoTlQHD5OJYbJLgQ4/WWh7eBZKaxnTt+cLoveBywoQBU8LArns67ryksOzkIWSsSMGWAQrsszR/e
 IVm4fIIKllzsOBNdfukkTs9yC6CnVpS/Kiy2DslPKeKmGCGNp3gdN1JscXyx6F1dumN6qrRrE3C7c
 clOnE1PvJeCSqU2QDYvrJR3AfrvVZBlqF5kCrLU3kblvFMrn4QFvk/PIsZ8AM0w90Z4J7lSFa5L/p
 T2ZGdoOqDPjys1L86LT5cPNXJbMKfoEpnf/Qkd+PtHEgzY3KSZ/gwFOXgWvppSl8wLQeO5CRIlrC7
 0Rc8tk3ISy33aA==;
In-Reply-To: <87a5n8to8m.fsf@gmail.com> (message from Herman,
 =?iso-8859-1?Q?G=E9za?= on Fri, 08 Mar 2024 21:22:13 +0100)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org
Xref: news.gmane.io gmane.emacs.devel:316934
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/316934>

> From: Herman, Géza <geza.herman@gmail.com>
> Cc: Herman Géza <geza.herman@gmail.com>,
>  emacs-devel@gnu.org
> Date: Fri, 08 Mar 2024 21:22:13 +0100
> 
> > Is there a reason for you to want it to be 64-bit type on a 64-bit
> > machine?  If the only bother is efficiency, then you can use 'int'
> > without fear.  But if a 64-bit machine will need the range of
> > values beyond INT_MAX (does it?), then I suggest to use ptrdiff_t.
>
> The only reason is if I use a 64-bit number on a 64-bit platform, 
> then the fast path will be chosen more frequently. So it makes 
> sense to use a register-sized integer here.

Then either ptrdiff_t or EMACS_INT should do what you want.

> Yes, it seems that EMACS_UINT is good for my purpose, thanks for 
> the suggestion.

Are you sure you need the unsigned variety?  If EMACS_INT fits the
bill, then it is a better candidate, since unsigned arithmetics has
its quirks.

> > The jansson code required encoding/decoding strings to make sure
> > we submit to jansson text that is always valid UTF-8.
> 
> I tried to use the jansson parser with a unicode 0x333333 
> character in a string, and it didn't work, it fails with 
> (json-parse-error "unable to decode byte... message.

Well, I didn't say trying an arbitrary codepoint will demonstrate the
issue.  Some codepoints above 0x10FFFF indeed cannot be passed to
jansson.

It's okay if the initial version of this parser only handles the
Unicode range and errors out otherwise; we could extend it if needed
later.  But the error message should talk specifically about invalid
character or something, not just a generic "parse error".

> Also, I see that json-parse-string calls some utf8 encoding related
> function before parsing, but json-parse-buffer doesn't (and it
> doesn't do anything encoding related thing in the callback, it just
> calls memcpy).

This is a part I was never happy about.  But, as I say above, we can
get to handling these rare cases later.

> So based on these, does it have any benefit of supporting these?

Yes, definitely.  But it isn't urgent.

> Out of curiosity, what are these extra characters used for?

Raw bytes and characters from charsets that are not (yet) unified with
Unicode.

> What is the purpose of the odd special 2-byte encoding of 8-bit
> characters (I mean where the 1st byte is C0/C1)? Why don't just use
> the regular utf-8 encoding for these values?

I think it's for efficiency: a 2-byte encoding takes much less space
than the 6-byte encoding (using superset of UTF-8) would take.
Imagine the case where a large byte-stream is inserted into a
multibyte buffer, before decoding it, something that happens a lot
when visiting non-ASCII files or reading from a network sub-process.

The regular UTF-8 encoding cannot be used for the raw bytes, because
then we will be unable to distinguish between them and the Unicode
codepoints of the same value.  For example, a raw byte whose value is
160 decimal (A0 hex) will be indistinguishable from U+00A0 No-Break
Space character.  This is why the "codepoint" corresponding to raw
byte 160 is 0x3FFFA0, see BYTE8_TO_CHAR.

Once again, we can extend the parser for codepoints outside of the
Unicode range later.  For now, it's okay to reject them with a
suitable error.