From: Herman, Géza
Newsgroups: gmane.emacs.devel
Subject: Re: I created a faster JSON parser
Date: Sat, 09 Mar 2024 12:08:54 +0100
Message-ID: <874jdfocst.fsf@gmail.com>
In-reply-to: <86a5n7zykr.fsf@gnu.org>
To: Eli Zaretskii
Cc: Géza Herman, emacs-devel@gnu.org

Eli Zaretskii writes:

>> From: Herman, Géza
>> Cc: Herman Géza, emacs-devel@gnu.org
>> Date: Fri, 08 Mar 2024 21:22:13 +0100
>>
>> Yes, it seems that EMACS_UINT is good for my purpose, thanks for
>> the suggestion.
>
> Are you sure you need the unsigned variety? If EMACS_INT fits the
> bill, then it is a better candidate, since unsigned arithmetics has
> its quirks.

Yes, I think it's better to use unsigned: read the sign, and then
parse the number as unsigned, and then apply the sign at the end. If
the number is parsed with its sign, it needs an additional step at
each character (the sign needs to be applied to each digit).
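
To illustrate the idea, here is a minimal sketch. It is not the code from the patch: the function name parse_signed_decimal is hypothetical, and it uses plain unsigned long long / long long as stand-ins for EMACS_UINT / EMACS_INT. It also skips JSON details such as rejecting leading zeros.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper, not taken from the patch: parse an optional '-'
   followed by decimal digits.  Returns false on empty input, on a
   stray character, or when the value does not fit in long long.  */
static bool
parse_signed_decimal (const char *s, long long *out)
{
  /* Read the sign first and remember it.  */
  bool negative = (*s == '-');
  if (negative)
    s++;
  if (*s < '0' || *s > '9')
    return false;

  /* Accumulate the magnitude as unsigned; the sign is not involved in
     the per-digit step.  */
  unsigned long long magnitude = 0;
  for (; *s >= '0' && *s <= '9'; s++)
    {
      unsigned digit = *s - '0';
      if (magnitude > (ULLONG_MAX - digit) / 10)
        return false;		/* would overflow the unsigned type */
      magnitude = magnitude * 10 + digit;
    }
  if (*s != '\0')
    return false;

  /* Apply the sign once, at the end.  The negative range is one larger
     than the positive range.  */
  unsigned long long limit
    = (unsigned long long) LLONG_MAX + (negative ? 1 : 0);
  if (magnitude > limit)
    return false;
  if (negative)
    *out = magnitude == limit ? LLONG_MIN : -(long long) magnitude;
  else
    *out = (long long) magnitude;
  return true;
}
```

With this shape the loop body is the same for positive and negative numbers, and the only sign-dependent work is the range check and the final negation.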

>> Also, I see that json-parse-string calls some utf8 encoding related
>> function before parsing, but json-parse-buffer doesn't (and it
>> doesn't do anything encoding related thing in the callback, it just
>> calls memcpy).
>
> This is a part I was never happy about. But, as I say above, we can
> get to handling these rare cases later.

I think this is an additional benefit of my parser: this feature can
be added to it more easily than to jansson.

I'm even tempted to say that we could just remove utf-8 checking from
my code, and then Emacs's encoding method should work right out of
the box.

Or, to say that utf-8 handling should stay as is. Because as far as I
understand, if the JSON contains an invalid utf-8 sequence which is
not invalid according to Emacs's character representation, then this
problem won't be detected. So checking for utf-8 encoding errors
shouldn't be the job of the json parser, but of the IO handling
around it, which has the chance to know that the JSON stream itself
must only contain a valid utf-8 encoding.

Or, as the JSON specification explicitly says that the allowed
character range is 0x20 .. 0x10ffff, the current solution is fine,
because it is actually against JSON rules to allow anything outside
of this range.

> Once again, we can extend the parser for codepoints outside of the
> Unicode range later. For now, it's okay to reject them with a
> suitable error.

OK, cool, I added Qjson_utf8_decode_error to indicate decoding
errors.

How can we proceed further? This is the current state of the patch:
https://github.com/geza-herman/emacs/commit/ce5d990776a1ccdfd0b6d9c4d5e5e5df55245672.patch

I think I did everything that was asked for, except Po Lu's
parenthesis-related comment, because I still don't know what to
parenthesize and what not to. I saw a lot of "a + x * y" kind of
expressions in the Emacs codebase without any parentheses.
Are the exact rules documented somewhere?