From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Ihor Radchenko Newsgroups: gmane.emacs.bugs Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Date: Tue, 02 May 2023 21:21:39 +0000 Message-ID: <87wn1qqvj0.fsf@localhost> References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="13679"; mail-complaints-to="usenet@ciao.gmane.io" Cc: 63225@debbugs.gnu.org To: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Tue May 02 23:19:30 2023 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1ptxPG-0003H7-CC for geb-bug-gnu-emacs@m.gmane-mx.org; Tue, 02 May 2023 23:19:30 +0200 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1ptxP1-0002en-Qv; Tue, 02 May 2023 17:19:15 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ptxOz-0002eU-K8 for bug-gnu-emacs@gnu.org; Tue, 02 May 2023 17:19:13 -0400 Original-Received: from debbugs.gnu.org ([209.51.188.43]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1ptxOo-000058-Va for bug-gnu-emacs@gnu.org; Tue, 02 May 2023 17:19:11 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1ptxOo-0003xT-C3 for bug-gnu-emacs@gnu.org; Tue, 02 May 2023 17:19:02 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Ihor Radchenko Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Tue, 02 May 2023 21:19:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 63225 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: patch Original-Received: via spool by 63225-submit@debbugs.gnu.org id=B63225.168306232115178 (code B ref 63225); Tue, 02 May 2023 21:19:02 +0000 Original-Received: (at 63225) by debbugs.gnu.org; 2 May 2023 21:18:41 +0000 Original-Received: from localhost ([127.0.0.1]:45162 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptxOS-0003wk-OD for submit@debbugs.gnu.org; Tue, 02 May 2023 17:18:41 -0400 Original-Received: from mout01.posteo.de ([185.67.36.65]:48647) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptxOO-0003wJ-7x for 63225@debbugs.gnu.org; Tue, 02 May 2023 17:18:39 -0400 Original-Received: from submission (posteo.de [185.67.36.169]) by mout01.posteo.de (Postfix) with ESMTPS id 4EAB32403AD for <63225@debbugs.gnu.org>; Tue, 2 May 2023 23:18:30 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.net; s=2017; t=1683062310; bh=XDlJFdqCLvJ4VzD2S9WZBaLOnQwwzVfVn/xvhrEIwNY=; h=From:To:Cc:Subject:Date:From; b=aNWgBWMLeioftZqn7d3fPpghXRTK7LG1EvbBDSq7AUrWgY/CtT6buC4omuPlXIVvA v2YlDssRmAeHxNB6DVUURO7V/kDQ4fXW6nKPL/bPzor1pcNY1L/UKxtDm/grVZNpl4 AZLgOX5LJGh9+jC54DsJpN83ccDm1wzc6cqcptgco5BqNr0HefVszeTMPWBabgH5vV mfZ+z91FYW6BY/6D1fwAU2CKlWKN1VGZQXaVOTyeI/9ySqZ4ooUc4vQOPkPdxQM7mg 8bLIf36o6VqE//F6brj77JVRq9KlVYeatn4C59uS8LsbHvc7T3aBzC4kvleqBveNw0 R4TgCr4meSqHw== Original-Received: from customer (localhost [127.0.0.1]) by submission (posteo.de) with ESMTPSA id 4Q9tGs3p9gz6tm4; Tue, 2 May 2023 23:18:29 +0200 (CEST) In-Reply-To: <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Original-Sender: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.bugs:260959 Archived-At: Mattias Engdeg=C3=A5rd writes: >> The Org parser is basically a giant `cond' of a number of regexp >> matches. See `org-element--current-element'. > > A common way to handle this is to build a big regexp to match many cases = at the same time, essentially transforming > > (cond ((looking-at RE1) ...) > ((looking-at RE2) ...) > ...) > > to > > (looking-at (rx (or (group RE1) (group RE2) ...))) > (cond ((match-beginning 1) ...) > ((match-beginning 2) ...) > ...) > > This reduces the number of regexps used and is also typically faster. > (Essentially this is what `syntax-propertize-rules` does but in a more sp= ecialised context.) I tried this, and it did not give any noticeable improvement. Most likely, because the actual `cond' is (cond ((looking-at "foo) ()) ...) Actually, I started looking into C-level perf data right after trying to consolidate the regexps into one giant looking-at form and not seeing any improvement. At that point, I already cached most of the dynamically generated regexps in there and ran out of reasonable ideas. > Using tree-sitter for this could very well be even faster but it's not gu= aranteed to be available. The available tree-sitter grammars for Org are about 1.5-2x faster in my previous tests, but they do less granular parsing of fields. And not complete. Org is not context-free and does not fit well into GLR. And we are not going to use tree sitter for development to avoid increasing contribution barriers. > Otherwise it's very much a matter of optimisation of everything, includin= g regexps. Minimise backtracking. > If you want to match five or more dashes, use "------*" instead of "-\\{5= ,\\}". And so on. This example sounds like something that regexp compilation should be able to optimize, no? I do not easily see why the latter should cause more CPU time compared to the former. > It's also obviously a good idea not to generate regexps dynamically each = time if you can help it, and minimise consing in general. Sure. I was able to shave a few seconds off using this idea. Other than regexp compilation hotspot, I only see re-writing parser non-recursively (`org-element--parse-elements' and `org-element--parse-objects'). >>> Introducing regexp objects that could store compiled regexps and be >>> used instead of strings would be quite some work but probably >>> worthwhile. >>=20 >> What exactly needs to be done? Assuming that regexp objects are not >> going to be readable, for simplicity. > > A proper design, for starters. For example, we probably want them to be u= sable in customised variables which calls for them to be readable. Or, alternatively, the parsed regexps can be attached to string objects internally. Then, regexp cache lookup will degenerate to looking into a string object slot. --=20 Ihor Radchenko // yantar92, Org mode contributor, Learn more about Org mode at . Support Org development at , or support my work at