From: Ihor Radchenko
Newsgroups: gmane.emacs.bugs
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Mon, 08 May 2023 11:58:05 +0000
Message-ID: <87zg6fjar6.fsf@localhost>
To: Mattias Engdegård
Cc: 63225@debbugs.gnu.org
In-Reply-To: <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com>

Mattias Engdegård writes:

> What I meant was that the consolidated root regexp could just match the
> initial :BEGIN: line and then dispatch to different branches for parsers
> specific to the drawer type. That would reduce complexity and time spent
> at the critical parser root.

Let me elaborate on what I mean. The relevant `cond' branches are:

    (and (pcase mode
           (`planning (eq ?* (char-after (line-beginning-position 0))))
           ((or `property-drawer `top-comment)
            (save-excursion
              (beginning-of-line 0)
              (not (looking-at-p "[[:blank:]]*$"))))
           (_ nil))
         (looking-at-p org-property-drawer-re))

where org-property-drawer-re is

    (rx ;; Drawer begin line.
        bol (0+ (in " \t")) ":PROPERTIES:" (0+ (in " \t")) "\n"
        ;; Node properties.
        (*? (0+ (in " \t")) ":" (+ (not (in " \t\n:"))) ":" (* nonl) "\n")
        ;; Drawer end line.
        (0+ (in " \t")) ":END:" (0+ (in " \t")) eol)

Note that this regexp is only tried when certain conditions are met, and
the beginning of the property drawer regexp matches the less general
":PROPERTIES:".

and [now part of the giant rx]

    (rx line-start (0+ (any ?\s ?\t))
        ":" (1+ (any ?- ?_ word)) ":"
        (0+ (any ?\s ?\t)) line-end)

matches the more general ":[-_[:word:]]+:".
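
To make the overlap concrete, here is a minimal sketch (not Org code: the
`my-' names are hypothetical, and the two patterns are copied from the
forms above). It reports which of the two patterns matches at the start
of a string, assuming the default syntax table where ASCII letters are
word constituents:

```elisp
;; Sketch only: `my-property-drawer-re', `my-drawer-begin-re' and
;; `my-drawer-kinds' are hypothetical names, not part of Org.
(require 'rx)

(defconst my-property-drawer-re
  (rx ;; Drawer begin line.
      bol (0+ (in " \t")) ":PROPERTIES:" (0+ (in " \t")) "\n"
      ;; Node properties.
      (*? (0+ (in " \t")) ":" (+ (not (in " \t\n:"))) ":" (* nonl) "\n")
      ;; Drawer end line.
      (0+ (in " \t")) ":END:" (0+ (in " \t")) eol)
  "Whole property drawer, as quoted above.")

(defconst my-drawer-begin-re
  (rx line-start (0+ (any ?\s ?\t))
      ":" (1+ (any ?- ?_ word)) ":"
      (0+ (any ?\s ?\t)) line-end)
  "First line of any drawer, as quoted above.")

(defun my-drawer-kinds (text)
  "Return a plist saying which pattern matches at the start of TEXT."
  (let ((case-fold-search nil))
    (list :general (eql 0 (string-match-p my-drawer-begin-re text))
          :property (eql 0 (string-match-p my-property-drawer-re text)))))

;; A property drawer satisfies both patterns; any other drawer only
;; satisfies the general one.
(my-drawer-kinds ":PROPERTIES:\n:ID: abc\n:END:")
(my-drawer-kinds ":LOGBOOK:\n- note\n:END:")
```

So a root that matches the general pattern still has to tell a property
drawer apart from the rest afterwards.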
If we make ":[-_[:word:]]+:" a new root, how will it be helpful?

>> This will account for Org syntax change, so no.
>
> Don't dismiss it out of hand. I'm not trying to optimise a few regexps,
> but to use examples to illustrate some useful principles that would help
> you improve many of them yourself.
>
> When matching something terminated by a specific character, it's
> particularly useful if the regexp engine can be made to understand that
> the terminator doesn't occur in what precedes it, as that enables it to
> omit backtracking points. For example, in "a*b", the engine doesn't need
> to save backtracking points for each 'a' matched since the sets {a} and
> {b} are obviously disjoint.
>
> In this case, the
>
>   (group (+ (| wordchar (in "_-"))))
>
> part is unnecessarily slow because it's an or-pattern, which also
> inhibits that optimisation.
> Fortunately it can easily be rewritten as
>
>   (group (+ (in "_-" word)))
>
> which solves both problems.

But why? Aren't (in word ?_ ?-) and (or word ?_ ?-) the same?

>> Slight improvement in performance cannot justify syntax changes.
>
> Always question your assumptions. A slight change of spec may not be so
> bad after all if it buys speed and/or improves our understanding of the
> code. Do you know what characters have 'word' syntax in org-mode? If not,
> better be careful before using them in regexps.

Honestly, I have no clue how syntax tables work in Org mode. I once
tried to alter the syntax table inside code blocks and Org got completely
broken. So, I simply do not dare to touch syntax-dependent matches.

Ideally, Org should use dedicated, immutable syntax tables when parsing.
The difficulty, though, is that we need to support various languages,
including CJK text, where syntax expectations are quite different from
Latin.

Your suggestions about using (not (in ...)) in place of (in ...) are
good, but I am afraid of breaking CJK cases where people can use an
unexpected set of characters.
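
As an illustration of why syntax-dependent matches are delicate, here is
a minimal sketch (hypothetical `my-' names, not Org code) of binding a
dedicated syntax table around a match. It assumes only `make-syntax-table',
`modify-syntax-entry' and `with-syntax-table' from Emacs itself; the same
drawer-name class accepts or rejects "@" depending on which table is
active:

```elisp
;; Sketch only: these names are hypothetical and not part of Org.
(require 'rx)

(defvar my-parse-syntax-table
  (let ((table (make-syntax-table)))    ; inherits from the standard table
    (modify-syntax-entry ?@ "." table)  ; pin "@" to punctuation
    table)
  "Hypothetical fixed syntax table to bind while parsing.")

(defvar my-mutated-syntax-table
  (let ((table (make-syntax-table)))
    (modify-syntax-entry ?@ "w" table)  ; "@" became a word constituent
    table)
  "Hypothetical table imitating one where \"@\" was added to word chars.")

(defun my-drawer-name-p (name table)
  "Return non-nil if NAME is a valid drawer name under syntax TABLE.
The character class is the one from the drawer regexp quoted earlier."
  (with-syntax-table table
    (and (string-match-p
          (rx string-start (1+ (any ?- ?_ word)) string-end)
          name)
         t)))

;; Same regexp, same string, different answers:
(my-drawer-name-p "LOG@BOOK" my-parse-syntax-table)
(my-drawer-name-p "LOG@BOOK" my-mutated-syntax-table)
```

Each table here is configured once, before it is first used for matching;
what a real parser table should contain for CJK text is exactly the open
question.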
I have been bitten by such breakage in the past.

And there is the consideration of not breaking the syntax. It is not just
about the Elisp part: Org is a markup standard, and for drawers in
particular we have defined the syntax as follows
(https://orgmode.org/worg/org-syntax.html#Drawers):

    :NAME:
    CONTENTS
    :end:

    NAME
      A string consisting of word-constituent characters, hyphens and
      underscores (-_).

To change this, we should weigh the possible impact on external Org
parsers that are not implemented in Elisp.

> (Looks like org-tags-expand permanently adds @ and _ to the set of word
> chars. A bug, surely?)

Yeah. Fixed now.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=6e6354c07

>> (defvar org--item-re-cache nil
>>   "Results cache for `org-item-re'.")
>> (defsubst org-item-re ()
>>   ...
>> It should not give much overhead.
>
> Maybe, but you still cons each time. (And remember that the plist-get
> equality funarg is new in Emacs 29.)

Sure it does. It is just one of the variable parts of Org syntax that
might be changed. There are ways to make this into a constant, but it is a
fragile area of the code that I do not want to touch without a reason.
(Especially given that I am not familiar with org-list.el)

> Also make sure that if the same regexp is used in multiple places, it
> should always use the same `case-fold-search` value or they will be
> considered different for cache purposes.

I hope we do.
If only Emacs had a way to define `case-fold-search' right within the
regexp itself.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at ,
or support my work at