From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: I created a faster JSON parser
Date: Sat, 09 Mar 2024 08:52:36 +0200
Message-ID: <86a5n7zykr.fsf@gnu.org>
References: <87a5n96mb5.fsf@gmail.com> <861q8l0w2c.fsf@gnu.org>
 <878r2s99j0.fsf@gmail.com> <86y1aszxom.fsf@gnu.org>
 <874jdg97xm.fsf@gmail.com> <86ttlgzuew.fsf@gnu.org>
 <875xxw3f3a.fsf@gmail.com> <86plw4zo9u.fsf@gnu.org>
 <87edcktumt.fsf@gmail.com> <86cys4zec7.fsf@gnu.org>
 <87a5n8to8m.fsf@gmail.com>
In-Reply-To: <87a5n8to8m.fsf@gmail.com> (message from Herman, Géza on
 Fri, 08 Mar 2024 21:22:13 +0100)
To: Géza Herman
Cc: emacs-devel@gnu.org

> From: Herman, Géza
> Cc: Herman Géza, emacs-devel@gnu.org
> Date: Fri, 08 Mar 2024 21:22:13 +0100
>
> > Is there a reason for you to want it to be 64-bit type on a 64-bit
> > machine?  If the only bother is efficiency, then you can use 'int'
> > without fear.  But if a 64-bit machine will need the range of
> > values beyond INT_MAX (does it?), then I suggest to use ptrdiff_t.
>
> The only reason is if I use a 64-bit number on a 64-bit platform,
> then the fast path will be chosen more frequently.  So it makes
> sense to use a register-sized integer here.

Then either ptrdiff_t or EMACS_INT should do what you want.

> Yes, it seems that EMACS_UINT is good for my purpose, thanks for
> the suggestion.

Are you sure you need the unsigned variety?  If EMACS_INT fits the
bill, then it is a better candidate, since unsigned arithmetic has
its quirks.

> > The jansson code required encoding/decoding strings to make sure
> > we submit to jansson text that is always valid UTF-8.
>
> I tried to use the jansson parser with a unicode 0x333333
> character in a string, and it didn't work, it fails with
> (json-parse-error "unable to decode byte... message.

Well, I didn't say trying an arbitrary codepoint will demonstrate the
issue.  Some codepoints above 0x10FFFF indeed cannot be passed to
jansson.  It's okay if the initial version of this parser only handles
the Unicode range and errors out otherwise; we could extend it if
needed later.  But the error message should talk specifically about
an invalid character or something similar, not just a generic "parse
error".

> Also, I see that json-parse-string calls some utf8 encoding related
> function before parsing, but json-parse-buffer doesn't (and it
> doesn't do anything encoding related thing in the callback, it just
> calls memcpy).

This is a part I was never happy about.  But, as I say above, we can
get to handling these rare cases later.

> So based on these, does it have any benefit of supporting these?

Yes, definitely.  But it isn't urgent.

> Out of curiosity, what are these extra characters used for?

Raw bytes and characters from charsets that are not (yet) unified with
Unicode.

> What is the purpose of the odd special 2-byte encoding of 8-bit
> characters (I mean where the 1st byte is C0/C1)?  Why don't just use
> the regular utf-8 encoding for these values?

I think it's for efficiency: a 2-byte encoding takes much less space
than the 6-byte encoding (using a superset of UTF-8) would take.
Imagine the case where a large byte-stream is inserted into a
multibyte buffer before decoding it, something that happens a lot
when visiting non-ASCII files or reading from a network sub-process.

The regular UTF-8 encoding cannot be used for the raw bytes, because
then we would be unable to distinguish between them and the Unicode
codepoints of the same value.  For example, a raw byte whose value is
160 decimal (A0 hex) would be indistinguishable from the U+00A0
No-Break Space character.  This is why the "codepoint" corresponding
to raw byte 160 is 0x3FFFA0, see BYTE8_TO_CHAR.

Once again, we can extend the parser for codepoints outside of the
Unicode range later.  For now, it's okay to reject them with a
suitable error.
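
To make the raw-byte point concrete, here is a rough sketch in C.  It
is not the Emacs source: the helper names are made up for this
example, and the only things taken from the discussion above are the
0x3FFFA0 value for raw byte 160, the C0/C1 lead byte of the 2-byte
form, and the 0x10FFFF upper limit of Unicode.  The exact bit layout
of the 2-byte form is my assumption about how the C0/C1 encoding is
laid out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants: the offset is inferred from "raw byte 160 is
   0x3FFFA0"; 0x10FFFF is the upper limit of the Unicode range.  */
enum { RAW_BYTE_OFFSET = 0x3FFF00, MAX_UNICODE_CHAR = 0x10FFFF };

/* Hypothetical helper: "codepoint" of a raw 8-bit byte (0x80..0xFF).  */
static int32_t
raw_byte_to_char (uint8_t byte)
{
  assert (byte >= 0x80);
  return (int32_t) byte + RAW_BYTE_OFFSET;
}

/* Hypothetical helper: the check a Unicode-only parser could use to
   reject anything outside the Unicode range.  */
static int
is_unicode_char (int32_t c)
{
  return c >= 0 && c <= MAX_UNICODE_CHAR;
}

/* Hypothetical helper: special 2-byte form of a raw byte; the lead
   byte is always C0 or C1, which valid UTF-8 never uses.  */
static void
raw_byte_to_two_bytes (uint8_t byte, uint8_t out[2])
{
  assert (byte >= 0x80);
  out[0] = (uint8_t) (0xC0 | ((byte >> 6) & 0x01));
  out[1] = (uint8_t) (0x80 | (byte & 0x3F));
}

/* Regular 2-byte UTF-8 for U+0080..U+07FF, for comparison.  */
static void
utf8_encode_two_bytes (int32_t c, uint8_t out[2])
{
  assert (c >= 0x80 && c <= 0x7FF);
  out[0] = (uint8_t) (0xC0 | (c >> 6));
  out[1] = (uint8_t) (0x80 | (c & 0x3F));
}
```

With these helpers, raw byte A0 becomes the bytes C0 A0, while U+00A0
becomes C2 A0 in regular UTF-8, so the two stay distinguishable; and
is_unicode_char rejects 0x3FFFA0, which is the kind of check a parser
limited to the Unicode range would perform before signaling its
"invalid character" error.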