From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#70007: [PATCH] native JSON encoder
Date: Fri, 29 Mar 2024 09:04:21 +0300
Message-ID: <86cyrdfuai.fsf@gnu.org>
References: <1BF559D1-DB9F-4FEB-90ED-72E0EFD76424@gmail.com>
 <86wmpphrg7.fsf@gnu.org>
 <4589243D-C11A-45C1-AF3E-6F4A5BADEB54@gmail.com>
 <864jcrindg.fsf@gnu.org>
 <291DD5F1-85B8-4647-A40A-EBBD4C51E253@gmail.com>
 <8634sbijfx.fsf@gnu.org>
 <2CF47DA5-A65B-47C4-A28A-6FEE1469BD13@gmail.com>
In-Reply-To: <2CF47DA5-A65B-47C4-A28A-6FEE1469BD13@gmail.com> (message from Mattias Engdegård on Thu, 28 Mar 2024 21:59:38 +0100)
To: Mattias Engdegård
Cc: casouri@gmail.com, 70007@debbugs.gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-GNU-PR-Message: followup 70007
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: patch
List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors"
Xref: news.gmane.io gmane.emacs.bugs:282268

> From: Mattias Engdegård
> Date: Thu, 28 Mar 2024 21:59:38 +0100
> Cc: casouri@gmail.com,
>  70007@debbugs.gnu.org
>
> On 27 March 2024 at 20:05, Eli Zaretskii wrote:
>
> >>> This rejects unibyte non-ASCII strings, AFAIU, in which case I suggest
> >>> to think whether we really want that. E.g., why is it wrong to encode
> >>> a string to UTF-8, and then send it to JSON?
> >>
> >> The way I see it, that would break the JSON abstraction: it
> >> transports strings of Unicode characters, not strings of bytes.
> >
> > What's the difference? AFAIU, JSON expects UTF-8 encoded strings, and
> > whether that is used as a sequence of bytes or a sequence of
> > characters is in the eyes of the beholder: the bytestream is the same,
> > only the interpretation changes.
>
> Well no -- JSON transports Unicode strings: the JSON serialiser takes
> a Unicode string as input and outputs a byte sequence; the JSON parser
> takes a byte sequence and returns a Unicode string (assuming we are
> just interested in strings).
>
> That the transport format uses UTF-8 is unrelated;

It is not unrelated. A JSON stream is AFAIK supposed to have strings
represented in UTF-8 encoding. When a Lisp program produces a JSON
stream, all that should matter to it is that any string there has a
valid UTF-8 sequence; where and how that sequence was obtained is of
secondary importance.

> if the user hands an encoded byte sequence to us then it seems more
> likely that it's a mistake.

We don't know that.
Since Emacs lets Lisp programs produce unibyte UTF-8 encoded
strings very easily, a program could do just that, for whatever
reasons. Unless we have very serious reasons not to allow UTF-8
sequences produced by something other than the JSON serializer itself
(and I think we don't), we should not prohibit it. The Emacs spirit
is to give bad Lisp programs enough rope to hang themselves if that
allows legitimate programs to do their job more easily and flexibly.

> After all, it cannot have come from a received JSON message.

It could have, if it was encoded by the calling Lisp program. It
could also have been received from another source, in unibyte form
that is nonetheless valid UTF-8. If we force non-ASCII strings to be
multibyte, Lisp programs will be unable to take a unibyte UTF-8 string
received from an external source and plug it directly into an object
to be serialized into JSON; instead, they will have to decode the
string, then let the serializer encode it back -- a clear waste of CPU
cycles.

> I think it was just another artefact of the old implementation. That
> code incorrectly used encode_string_utf_8 even on non-ASCII unibyte
> strings and trusted Jansson to validate the result. That resulted in a
> lot of wasted work and some strange strings getting accepted.

I'm not talking about the old implementation. I was not completely
happy with it, either, and in particular with its insistence on
signaling errors due to encoding issues. I think this is not our
business in this case: the responsibility for submitting a valid UTF-8
sequence, when we get a unibyte string, is on the caller.

> While it's theoretically possible that there are users with code
> relying on this behaviour, I can't find any evidence for it in the
> packages that I've looked at.

Once again, my bother is not about some code that expects us to encode
UTF-8 byte sequences -- doing that is definitely not TRT.
What I would like to see is that unibyte strings are passed through
unchanged, so that valid UTF-8 strings will be okay, and invalid ones
will produce invalid JSON. This is better than signaling errors,
IMNSHO, and in particular is more in line with how Emacs handles
unibyte strings elsewhere.

> > I didn't suggest to decode the input string, not at all. I suggested
> > to allow unibyte strings, and process them just like you process
> > pure-ASCII strings, leaving it to the caller to make sure the string
> > has only valid UTF-8 sequences.
>
> Users of this raw-bytes-input feature (if they exist at all)
> previously had their input validated by Jansson. While mistakes would
> probably be detected at the other end I'm not sure it's a good idea.

Why not? Once again, if we get a unibyte string, the onus is on the
caller to verify it's valid UTF-8, or suffer the consequences.

> > Forcing callers to decode such
> > strings is IMO too harsh and largely unjustified.
>
> We usually force them to do so in most other contexts. To take a
> random example, `princ` doesn't work with encoded strings. But it's
> rarely a problem.

There are many examples to the contrary. For example, primitives that
deal with file names can accept both multibyte and unibyte encoded
strings.

> Let's see how testing goes. We'll find a solution no matter what,
> pass-through or separate slow-path validation, if it turns out that we
> really need to after all.

OK. FTR, I'm not in favor of validation of unibyte strings, I just
suggest that we treat them as plain ASCII: pass them through without
any validation, leaving the validation to the callers.
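
To make the pass-through proposal concrete, here is a minimal sketch
in Python. It is only an illustration and not the code in the patch:
the names serialize_string and _escape_ascii are hypothetical, Python
str stands in for a multibyte Lisp string, and Python bytes stands in
for a unibyte one. The assumption is that unibyte input gets the same
treatment as pure ASCII: only the characters JSON requires to be
escaped are touched, and bytes >= 0x80 are copied without decoding or
validation.

```python
def _escape_ascii(data: bytes) -> bytes:
    """Escape only what JSON requires: quote, backslash, controls < 0x20."""
    out = bytearray()
    for b in data:
        if b == 0x22:            # double quote
            out += b'\\"'
        elif b == 0x5C:          # backslash
            out += b"\\\\"
        elif b < 0x20:           # ASCII control character
            out += b"\\u%04x" % b
        else:
            # Everything else, including bytes >= 0x80, is copied
            # untouched: no decoding and no UTF-8 validation.
            out.append(b)
    return bytes(out)


def serialize_string(value) -> bytes:
    """Serialize one string to JSON bytes (hypothetical helper).

    str   -> stands in for a multibyte string; encoded to UTF-8 here.
    bytes -> stands in for a unibyte string; passed through as-is, so
             the caller is responsible for it being valid UTF-8.
    """
    raw = value.encode("utf-8") if isinstance(value, str) else bytes(value)
    return b'"' + _escape_ascii(raw) + b'"'


# A caller that already holds UTF-8 bytes gets the same stream as one
# that passes the decoded text, without a decode/encode round trip.
text = "na\u00efve"
same = serialize_string(text) == serialize_string(text.encode("utf-8"))

# Invalid input is not rejected; it simply yields invalid JSON.
bad = serialize_string(b"\xff")
```

Under these assumptions, valid unibyte UTF-8 produces exactly the
bytes that the multibyte path would produce, and an invalid sequence
such as a lone 0xFF byte ends up in the output for the receiving end
to reject. A validating variant would differ only in checking the
bytes before copying them.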