From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.bugs
Subject: bug#70007: [PATCH] native JSON encoder
Date: Fri, 29 Mar 2024 09:04:21 +0300
Message-ID: <86cyrdfuai.fsf@gnu.org>
References: <1BF559D1-DB9F-4FEB-90ED-72E0EFD76424@gmail.com>
 <86wmpphrg7.fsf@gnu.org> <C6944977-0CF9-4D19-9D92-EA1F086700D7@gmail.com>
 <4589243D-C11A-45C1-AF3E-6F4A5BADEB54@gmail.com> <864jcrindg.fsf@gnu.org>
 <291DD5F1-85B8-4647-A40A-EBBD4C51E253@gmail.com> <8634sbijfx.fsf@gnu.org>
 <2CF47DA5-A65B-47C4-A28A-6FEE1469BD13@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="21611"; mail-complaints-to="usenet@ciao.gmane.io"
Cc: casouri@gmail.com, 70007@debbugs.gnu.org
To: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= <mattias.engdegard@gmail.com>
X-Loop: help-debbugs@gnu.org
Resent-From: Eli Zaretskii <eliz@gnu.org>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
Resent-CC: bug-gnu-emacs@gnu.org
Resent-Date: Fri, 29 Mar 2024 06:05:02 +0000
Resent-Message-ID: <handler.70007.B70007.171169227527086@debbugs.gnu.org>
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: followup 70007
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: patch
In-Reply-To: <2CF47DA5-A65B-47C4-A28A-6FEE1469BD13@gmail.com> (message from
 Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= on Thu, 28 Mar 2024 21:59:38 +0100)
Xref: news.gmane.io gmane.emacs.bugs:282268
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/282268>

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Thu, 28 Mar 2024 21:59:38 +0100
> Cc: casouri@gmail.com,
>  70007@debbugs.gnu.org
> 
> 27 mars 2024 kl. 20.05 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> >>> This rejects unibyte non-ASCII strings, AFAIU, in which case I suggest
> >>> to think whether we really want that.  E.g., why is it wrong to encode
> >>> a string to UTF-8, and then send it to JSON?
> >> 
> >> The way I see it, that would break the JSON abstraction: it transports strings of Unicode characters, not strings of bytes.
> > 
> > What's the difference?  AFAIU, JSON expects UTF-8 encoded strings, and
> > whether that is used as a sequence of bytes or a sequence of
> > characters is in the eyes of the beholder: the bytestream is the same,
> > only the interpretation changes.
> 
> Well no -- JSON transports Unicode strings: the JSON serialiser takes a Unicode string as input and outputs a byte sequence; the JSON parser takes a byte sequence and returns a Unicode string (assuming we are just interested in strings).
> 
> That the transport format uses UTF-8 is unrelated;

It is not unrelated.  A JSON stream is AFAIK supposed to have strings
represented in UTF-8 encoding.  When a Lisp program produces a JSON
stream, all that should matter to it is that every string there is a
valid UTF-8 sequence; where and how that sequence was obtained is of
secondary importance.
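
To illustrate the point outside Emacs, here is a minimal Python sketch.
It is not the code from the patch: json_quote is a hypothetical toy
serializer for a single JSON string.  It assumes that character input
is encoded to UTF-8 and that byte input is passed through without
validation, and it shows that both routes yield the same bytes.

```python
import json

def json_quote(s):
    """Hypothetical toy serializer for one JSON string; returns bytes."""
    # str input is encoded to UTF-8; bytes input is passed through as-is,
    # with no validation, the same way pure-ASCII input would be.
    raw = s.encode("utf-8") if isinstance(s, str) else bytes(s)
    out = bytearray(b'"')
    for b in raw:
        if b == 0x22:            # double quote
            out += b'\\"'
        elif b == 0x5C:          # backslash
            out += b"\\\\"
        elif b < 0x20:           # control characters must be escaped
            out += b"\\u%04x" % b
        else:                    # everything else, including bytes >= 0x80
            out.append(b)
    out += b'"'
    return bytes(out)

text = "Engdegård"
from_chars = json_quote(text)                    # character-string input
from_bytes = json_quote(text.encode("utf-8"))    # pre-encoded byte input
decoded = json.loads(from_chars.decode("utf-8")) # what a receiver would see
```

Here from_chars and from_bytes are the same byte sequence, and a
receiver that parses it gets the original text back either way.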

> if the user hands an encoded byte sequence to us then it seems more likely that it's a mistake.

We don't know that.  Since Emacs lets Lisp programs produce unibyte
UTF-8 encoded strings very easily, a program could do just that, for
whatever reasons.  Unless we have very serious reasons not to allow
UTF-8 sequences produced by something other than the JSON serializer
itself (and I think we don't), we should not prohibit it.  The Emacs
spirit is to give bad Lisp programs enough rope to hang themselves if
that allows legitimate programs to do their job more easily and flexibly.

> After all, it cannot have come from a received JSON message.

It could have, if it was encoded by the calling Lisp program.  It
could also have been received from another source, in unibyte form
that is nonetheless valid UTF-8.  If we force non-ASCII strings to be
multibyte, Lisp programs will be unable to take a unibyte UTF-8 string
received from an external source and plug it directly into an object
to be serialized into JSON; instead, they will have to decode the
string, then let the serializer encode it back -- a clear waste of CPU
cycles.
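
As a rough illustration in Python rather than Lisp (round_trip is a
hypothetical helper, and the sample text merely stands in for bytes
received from an external source), the forced decode followed by
re-encode gives back exactly the bytes we started with when the input
is valid UTF-8:

```python
def round_trip(raw: bytes) -> bytes:
    """Decode UTF-8 bytes to a character string, then encode them again."""
    return raw.decode("utf-8").encode("utf-8")

# Stands in for a unibyte UTF-8 string obtained from an external source.
received = "naïve café".encode("utf-8")
result = round_trip(received)
```

Since result equals received, the two conversions only spend time
without changing what ends up in the JSON stream.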

> I think it was just another artefact of the old implementation. That code incorrectly used encode_string_utf_8 even on non-ASCII unibyte strings and trusted Jansson to validate the result. That resulted in a lot of wasted work and some strange strings getting accepted.

I'm not talking about the old implementation.  I was not completely
happy with it, either, and in particular with its insistence on
signaling errors due to encoding issues.  I think this is not our
business in this case: the responsibility for submitting a valid UTF-8
sequence, when we get a unibyte string, is on the caller.

> While it's theoretically possible that there are users with code relying on this behaviour, I can't find any evidence for it in the packages that I've looked at.

Once again, my concern is not about some code that expects us to encode
UTF-8 byte sequences -- doing that is definitely not TRT.  What I
would like to see is that unibyte strings are passed through
unchanged, so that valid UTF-8 strings will be okay, and invalid ones
will produce invalid JSON.  This is better than signaling errors,
IMNSHO, and in particular is more in line with how Emacs handles
unibyte strings elsewhere.
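
A small Python sketch of what pass-through means for the receiving
side.  Both functions are hypothetical: pass_through stands in for the
serializer and assumes the bytes need no escaping, and receiver_accepts
models a strict receiver that insists on UTF-8.

```python
import json

def pass_through(raw: bytes) -> bytes:
    # Hypothetical stand-in for the serializer.  Assumes RAW contains no
    # quotes, backslashes or control characters, so nothing needs escaping.
    return b'"' + raw + b'"'

def receiver_accepts(message: bytes) -> bool:
    """Return True if a strict receiver can parse MESSAGE as UTF-8 JSON."""
    try:
        json.loads(message.decode("utf-8"))
        return True
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False

good = pass_through("å".encode("utf-8"))  # valid UTF-8 from the caller
bad = pass_through(b"\xe5")               # Latin-1 byte, not valid UTF-8
```

With these definitions receiver_accepts(good) is true and
receiver_accepts(bad) is false: the serializer signals nothing, and
the invalid input shows up as invalid JSON at the other end.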

> > I didn't suggest to decode the input string, not at all.  I suggested
> > to allow unibyte strings, and process them just like you process
> > pure-ASCII strings, leaving it to the caller to make sure the string
> > has only valid UTF-8 sequences.
> 
> Users of this raw-bytes-input feature (if they exist at all) previously had their input validated by Jansson. While mistakes would probably be detected at the other end I'm not sure it's a good idea.

Why not?  Once again, if we get a unibyte string, the onus is on the
caller to verify it's valid UTF-8, or suffer the consequences.

> >  Forcing callers to decode such
> > strings is IMO too harsh and largely unjustified.
> 
> We usually force them to do so in most other contexts. To take a random example, `princ` doesn't work with encoded strings. But it's rarely a problem.

There are many examples to the contrary.  For example, primitives that
deal with file names can accept both multibyte and unibyte encoded
strings.

> Let's see how testing goes. We'll find a solution no matter what, pass-through or separate slow-path validation, if it turns out that we really need to after all.

OK.  FTR, I'm not in favor of validation of unibyte strings, I just
suggest that we treat them as plain-ASCII: pass them through without
any validation, leaving the validation to the callers.