From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: JSON/YAML/TOML/etc. parsing performance
Date: Fri, 29 Sep 2017 22:55:54 +0300
Message-ID: <83h8vl5lf9.fsf@gnu.org>
References: <87poaqhc63.fsf@lifelogs.com> <8360ceh5f1.fsf@gnu.org>
Reply-To: Eli Zaretskii
To: Philipp Stephani
Cc: emacs-devel@gnu.org
In-reply-to: (message from Philipp Stephani on Thu, 28 Sep 2017 21:19:00 +0000)

> From: Philipp Stephani
> Date: Thu, 28 Sep 2017 21:19:00 +0000
> Cc: emacs-devel@gnu.org
>
> IIUC Jansson only accepts UTF-8 strings (i.e. it will generate an error if some input is not a UTF-8 string), and
> will only return UTF-8 strings as well. Therefore I think that direct conversion between Lisp strings and C
> strings (using SDATA etc.) is always correct because the internal Emacs encoding is a superset of UTF-8.
> Also build_string should always be correct because it will generate a correct multibyte string for a UTF-8
> string with non-ASCII characters, and a correct unibyte string for an ASCII string, right?

I don't think it's a good idea to write code which has such
assumptions embedded in it.  We don't do that in other cases, although
UTF-8 based systems are widespread nowadays.  Instead, we make sure
that encoding and decoding of UTF-8 byte streams is implemented
efficiently and, when possible, simply reuses the same string data.

Besides, these assumptions are not always true.  For example:

  . The Emacs internal representation could include raw bytes, whose
    representations (both of them) are not valid UTF-8;
  . Strings we receive from the library could be invalid UTF-8, in
    which case putting them into a buffer or string without decoding
    will mean trouble for programs that will try to process them.

So I think decoding and encoding any string passed to/from Jansson is
better for stability and future maintenance.
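To make the two bullet points above concrete, here is a minimal sketch of a strict UTF-8 well-formedness check.  It is an illustration only: utf8_valid_p is a hypothetical helper, not a function from Emacs or Jansson, and it says nothing about how Emacs actually decodes text.  It shows the kind of byte sequences (a lone 0x80, or an overlong pair such as C0 80) that a strict UTF-8 consumer rejects.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of Emacs or Jansson.  Return true if
   the SIZE bytes at S are well-formed UTF-8: no stray continuation
   bytes, no truncated or overlong sequences, no surrogates, and
   nothing above U+10FFFF.  */
bool
utf8_valid_p (const unsigned char *s, size_t size)
{
  size_t i = 0;

  while (i < size)
    {
      unsigned char c = s[i];
      size_t len;               /* expected sequence length */
      unsigned long min;        /* smallest code point for that length */
      unsigned long cp;         /* code point being assembled */

      if (c < 0x80)
        {
          i++;                  /* plain ASCII */
          continue;
        }
      else if ((c & 0xE0) == 0xC0)
        len = 2, min = 0x80, cp = c & 0x1F;
      else if ((c & 0xF0) == 0xE0)
        len = 3, min = 0x800, cp = c & 0x0F;
      else if ((c & 0xF8) == 0xF0)
        len = 4, min = 0x10000, cp = c & 0x07;
      else
        return false;           /* stray continuation or invalid lead byte */

      if (size - i < len)
        return false;           /* sequence runs past the end */

      for (size_t k = 1; k < len; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return false;       /* not a continuation byte */
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }

      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;           /* overlong, out of range, or surrogate */

      i += len;
    }
  return true;
}
```

A check like this only detects the problem; what to do with text that fails it is the job of the decoding and encoding step.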
If you worry about performance, you shouldn't: we convert UTF-8 into
our internal representation as efficiently as possible.

> > +  /* LISP now must be a vector or hashtable.  */
> > +  if (++lisp_eval_depth > max_lisp_eval_depth)
> > +    xsignal0 (Qjson_object_too_deep);
>
> This error could mislead: the problem could be in the nesting of
> surrounding Lisp being too deep, and the JSON part could be just fine.
>
> Agreed, but I think it's better to use lisp_eval_depth here because it's the total nesting depth that could cause
> stack overflows.

Well, at least the error message should not point exclusively to a
JSON problem; it should mention the possibility of a Lisp eval depth
overflow as well.

> > +  Lisp_Object string
> > +    = make_string (buffer_and_size->buffer, buffer_and_size->size);
>
> This is arbitrary text, so I'm not sure make_string is appropriate.
> Could the text be a byte stream, i.e. not human-readable text?  If so,
> do we want to create a unibyte string or a multibyte string here?
>
> It should always be UTF-8.

How does JSON express byte streams, then?  Doesn't it support data
(as opposed to text)?

> > +  {
> > +    bool overflow = INT_ADD_WRAPV (BUFFER_CEILING_OF (point), 1, &end);
> > +    eassert (!overflow);
> > +  }
> > +  size_t count;
> > +  {
> > +    bool overflow = INT_SUBTRACT_WRAPV (end, point, &count);
> > +    eassert (!overflow);
> > +  }
>
> Why did you need these blocks in braces?
>
> To be able to reuse the "overflow" name.

Why can't you reuse it without the braces?

Thanks.
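On the last question, a minimal sketch of the brace-free form: declare the flag once and assign to it twice.  This is not the patch under review.  bytes_to_ceiling is a hypothetical function, plain assert stands in for eassert, and the GCC/Clang builtins __builtin_add_overflow and __builtin_sub_overflow are used here as an assumption in place of gnulib's INT_ADD_WRAPV and INT_SUBTRACT_WRAPV, so that the example compiles outside the Emacs tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the quoted hunk: return the number of
   bytes from POINT up to and including CEILING.  */
size_t
bytes_to_ceiling (long point, long ceiling)
{
  long end;
  size_t count;
  bool overflow;        /* declared once, assigned twice: no braces needed */

  /* end = ceiling + 1, reporting overflow instead of wrapping silently.  */
  overflow = __builtin_add_overflow (ceiling, 1, &end);
  assert (!overflow);

  /* count = end - point; also flags a negative result, since COUNT
     is unsigned.  */
  overflow = __builtin_sub_overflow (end, point, &count);
  assert (!overflow);

  return count;
}
```

The braces in the quoted hunk are only needed if each check insists on declaring and initializing its own `bool overflow'; a single declaration followed by assignments avoids the redeclaration without them.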