From: Eli Zaretskii
To: Paul Eggert
Cc: p.stephani2@gmail.com, emacs-devel@gnu.org
Newsgroups: gmane.emacs.devel
Subject: Re: JSON/YAML/TOML/etc. parsing performance
Date: Thu, 05 Oct 2017 10:12:30 +0300
Message-ID: <83lgkqxe3l.fsf@gnu.org>
References: <87poaqhc63.fsf@lifelogs.com> <8360ceh5f1.fsf@gnu.org>
 <83h8vl5lf9.fsf@gnu.org> <83r2um3fqi.fsf@gnu.org>
 <43520b71-9e25-926c-d744-78098dad6441@cs.ucla.edu>
 <83o9pnzddc.fsf@gnu.org>
 <472176ce-846b-1f24-716b-98eb95ceaa47@cs.ucla.edu>
 <83d163z6dy.fsf@gnu.org>
 <73477c99-1600-a53d-d84f-737837d0f91f@cs.ucla.edu>
 <83poa2ya8j.fsf@gnu.org>
 <21b0ba97-ed49-43ae-e86f-63fba762353a@cs.ucla.edu>
In-reply-to: <21b0ba97-ed49-43ae-e86f-63fba762353a@cs.ucla.edu> (message
 from Paul Eggert on Wed, 4 Oct 2017 14:24:59 -0700)

> Cc: p.stephani2@gmail.com, emacs-devel@gnu.org
> From: Paul Eggert
> Date: Wed, 4 Oct 2017 14:24:59 -0700
>
> On 10/04/2017 12:38 PM, Eli Zaretskii wrote:
> > if we did use size_t for the arguments which can clearly only be
> > non-negative, the problems which we are discussing would not have
> > happened
> Sure, but we would also have worse problems, as size_t is inherently
> more error-prone. ptrdiff_t overflows are reliably diagnosed when Emacs
> is compiled with suitable GCC compiler options. size_t overflows cannot
> be diagnosed, are all too common, and can cause serious trouble.

If ptrdiff_t overflows are reliably diagnosed, then why do we have to
test for them explicitly in our code, as in the proposed json.c?
AFAIU, ptrdiff_t overflows are the _only_ reason json.c checks
whether a size_t value is too large, because similar checks for
ptrdiff_t values are already in the low-level subroutines involved in
creating Lisp objects.  So why couldn't those checks be avoided by
simply assigning to ptrdiff_t variables?

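
For readers following the thread, here is a minimal sketch of the
distinction drawn in the quoted text. It is not Emacs code. Unsigned
arithmetic on size_t wraps around by definition in C, so there is no
undefined behavior for a compiler to trap, whereas signed overflow on
ptrdiff_t can be caught. The function names are hypothetical, and
__builtin_add_overflow is a GCC/Clang extension used here only to make
the signed check explicit; which GCC options the quoted text has in
mind is not stated, though options such as -fsanitize=undefined or
-ftrapv are plausible candidates.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add two size_t values.  Unsigned arithmetic is
   defined to wrap modulo SIZE_MAX + 1, so an overflow here silently
   yields a small result.  */
static size_t
add_sizes (size_t a, size_t b)
{
  return a + b;
}

/* Hypothetical helper: add two ptrdiff_t values, reporting overflow.
   Returns true and stores the sum in *SUM when it is representable,
   false otherwise.  Uses the GCC/Clang builtin __builtin_add_overflow.  */
static bool
add_ptrdiff_checked (ptrdiff_t a, ptrdiff_t b, ptrdiff_t *sum)
{
  return !__builtin_add_overflow (a, b, sum);
}
```

With these definitions, add_sizes (SIZE_MAX, 1) returns 0 without any
diagnostic, while add_ptrdiff_checked (PTRDIFF_MAX, 1, &sum) returns
false.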
> The Emacs internals occasionally use size_t because underlying
> primitives like 'malloc' do, so we do make some exceptions. Perhaps
> there should be an exception here, for convenience with the JSON
> library. The code snippets I've seen so far in this thread are not
> enough context to judge whether an exception would be helpful in this
> case. Generally speaking, though, unsigned types should be avoided
> because they are more error-prone. This has long been the style in Emacs
> internals, and it's served us well.

I'm not arguing for general replacement of ptrdiff_t with size_t, only
for doing that in those primitives where negative values are a clear
mistake/bug.

For example, let's take this case from your proposed changes:

   static Lisp_Object
  -json_make_string (const char *data, ptrdiff_t size)
  +json_make_string (const char *data, size_t size)
   {
  +  if (PTRDIFF_MAX < size)
  +    string_overflow ();
     return make_specified_string (data, -1, size, true);
   }

If we were to change make_specified_string (and its subroutines, like
make_uninit_multibyte_string etc.) to accept a size_t value in its 3rd
argument, the need for the above check against PTRDIFF_MAX would
disappear.  Another such case is 'insert', which is also used in
json.c, and requires a similar check:

  void
  insert (const char *string, ptrdiff_t nbytes)
  {
    if (nbytes > 0)
      {
        ptrdiff_t len = chars_in_text ((unsigned char *) string, nbytes), opoint;
        insert_1_both (string, len, nbytes, 0, 1, 0);
        opoint = PT - len;
        signal_after_change (opoint, 0, len);
        update_compositions (opoint, PT, CHECK_BORDER);
      }
  }

It clearly ignores negative values of nbytes, as expected.  So why not
make nbytes a size_t argument?  (We will probably need some low-level
changes inside the subroutines of insert_1_both, like move_gap, to
reject too large size_t values before we convert them to signed
values, but that's hardly rocket science.)

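
As a rough sketch of the check under discussion (again, not Emacs
code): the test against PTRDIFF_MAX in json_make_string amounts to a
checked conversion from size_t to ptrdiff_t. The helper name
size_to_ptrdiff is hypothetical, and the sketch assumes the usual
situation where SIZE_MAX is larger than PTRDIFF_MAX.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an Emacs function: convert SIZE to
   ptrdiff_t.  Returns true and stores the value in *OUT when SIZE
   fits; returns false and leaves *OUT untouched when SIZE is larger
   than PTRDIFF_MAX.  */
static bool
size_to_ptrdiff (size_t size, ptrdiff_t *out)
{
  if (size > (size_t) PTRDIFF_MAX)
    return false;
  *out = (ptrdiff_t) size;
  return true;
}
```

A caller holding a size_t length from an external library would
perform this check once and signal an error when it fails. Whether
that single check belongs in each caller, as in the proposed json.c,
or inside the primitives themselves is the question raised above.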
I envision that all the Fmake_SOMETHING primitives could use similar
changes to have the size specified as size_t, because it can never be
negative.  E.g., Fmake_vector is used by json.c and currently requires
a similar check because its size argument is a signed type.

IOW, I'm saying that using size_t judiciously, in a small number of
places, would make a lot of sense and allow us to simplify
higher-level code, and make it faster by avoiding duplicate checks of
the same values.  It would also make the higher-level code more
reliable, because application-level programmers will not need to
understand all the non-trivial intricacies of this stuff.

As Emacs starts using more and more external libraries, whether
built-in or via modules, the issue of size_t vs ptrdiff_t will become
more and more important, and a source for more and more error-prone
code.  Why not fix that in advance in our primitives?

> (Ironically, just last week I was telling beginning students to beware
> unsigned types, with (0u < -1) as an example....)

Well, "kids, don't do that at home -- we are trained professionals"
seems to apply here ;-)
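
For anyone who wants to try the (0u < -1) example quoted above, here
is a minimal sketch. The function names are hypothetical, and the
second function assumes size_t and ptrdiff_t have the same width, as
on common platforms. In both cases the signed operand is converted to
the unsigned type before the comparison, which is the pitfall that
GCC's -Wsign-compare warns about.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical demo: the usual arithmetic conversions turn -1 into
   UINT_MAX before comparing, so this returns true.  */
static bool
unsigned_compare_surprise (void)
{
  return 0u < -1;
}

/* Hypothetical demo: N is converted to size_t before the comparison,
   so a negative N becomes a huge value and the test fails even though
   N is mathematically below LIMIT.  */
static bool
naive_below_limit (ptrdiff_t n, size_t limit)
{
  return n < limit;
}
```

Under those assumptions, unsigned_compare_surprise () returns true and
naive_below_limit (-1, 10) returns false.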