From: Mattias Engdegård
To: Ihor Radchenko
Cc: 63225@debbugs.gnu.org
Newsgroups: gmane.emacs.bugs
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Sun, 7 May 2023 12:32:52 +0200
Message-ID: <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com>
In-Reply-To: <87y1m1oa01.fsf@localhost>

On 6 May 2023 at 15:38, Ihor Radchenko wrote:

> I may, but it will be even more complex regexp. Currently, ordinary
> drawers have somewhat complex :BEGIN: line, because they can have any
> word there, while property drawers require very complex match for the
> lines inside. Also, property drawers only occur right after headings, as
> marked by appropriate parser flag. So, matching property drawers mostly
> happens what they are supposed to be. If we try to match ordinary
> drawers at the same time, it will actually be slower in practice.

What I meant was that the consolidated root regexp could just match the
initial :BEGIN: line and then dispatch to different branches for parsers
specific to the drawer type. That would reduce complexity and time spent
at the critical parser root.

> This will account for Org syntax change, so no.

Don't dismiss it out of hand. I'm not trying to optimise a few regexps,
but to use examples to illustrate some useful principles that would help
you improve many of them yourself.

When matching something terminated by a specific character, it's
particularly useful if the regexp engine can be made to understand that
the terminator doesn't occur in what precedes it, as that enables it to
omit backtracking points. For example, in "a*b", the engine doesn't need
to save backtracking points for each 'a' matched since the sets {a} and
{b} are obviously disjoint.

In this case, the

  (group (+ (| wordchar (in "_-"))))

part is unnecessarily slow because it's an or-pattern, which also
inhibits that optimisation. Fortunately it can easily be rewritten as

  (group (+ (in "_-" word)))

which solves both problems.

> Slight improvement in performance cannot justify syntax changes.

Always question your assumptions. A slight change of spec may not be so
bad after all if it buys speed and/or improves our understanding of the
code.

Do you know what characters have 'word' syntax in org-mode? If not,
better be careful before using them in regexps.
(Looks like org-tags-expand permanently adds @ and _ to the set of word
chars. A bug, surely?)

> (defvar org--item-re-cache nil
>   "Results cache for `org-item-re'.")
> (defsubst org-item-re ()
>   "Return the correct regular expression for plain lists."
>   (or (plist-get
>        org--item-re-cache
>        (cons org-list-allow-alphabetical
>              org-plain-list-ordered-item-terminator)
>        #'equal)
>       ...))
>
> It should not give much overhead.

Maybe, but you still cons each time. (And remember that the plist-get
equality funarg is new in Emacs 29.)

> A larger number of regexps is matched in the individual
> element parsers. They just don't contribute as much as
> `org-element--current-element' individually and thus do not show up high
> in the profiler.

Still, if called often enough they do outsized damage by evicting
regexps used elsewhere.
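
Going back to the `org-item-re' cache for a moment: below is a rough
sketch of a lookup that allocates nothing on a hit and doesn't need the
Emacs 29 `plist-get' predicate argument. All the `my-' names are made up
for illustration, the regexp builder is a simplified stand-in rather than
Org's actual one, and it assumes the two settings can be compared with
`eq' (a boolean, and a character or t).

```elisp
;; Hypothetical stand-ins for the two user options that select the regexp.
(defvar my-allow-alphabetical nil
  "Non-nil means single letters may be used as list bullets (illustrative).")
(defvar my-item-terminator t
  "Terminator for ordered items: ?., ?\\), or t for both (illustrative).")

(defvar my--item-re-cache nil
  "Either nil or a vector [ALPHA TERMINATOR REGEXP] from the last build.")

(defvar my--item-re-builds 0
  "Number of times the regexp has been rebuilt; only here for the demo.")

(defun my--build-item-re (alpha terminator)
  "Build a simplified list-item regexp for ALPHA and TERMINATOR.
This is a placeholder, not the regexp Org actually uses."
  (setq my--item-re-builds (1+ my--item-re-builds))
  (let ((term (cond ((eq terminator ?.) "\\.")
                    ((eq terminator ?\)) ")")
                    (t "[.)]")))
        (bullet (if alpha "\\(?:[0-9]+\\|[A-Za-z]\\)" "[0-9]+")))
    (concat "^[ \t]*\\(?:[-+*]\\|" bullet term "\\)[ \t]")))

(defun my-item-re ()
  "Return the list-item regexp for the current settings.
A cache hit does two `eq' comparisons and allocates nothing."
  (let ((cache my--item-re-cache))
    (if (and cache
             (eq (aref cache 0) my-allow-alphabetical)
             (eq (aref cache 1) my-item-terminator))
        (aref cache 2)
      ;; Miss: rebuild and remember the settings the result belongs to.
      (let ((re (my--build-item-re my-allow-alphabetical
                                   my-item-terminator)))
        (setq my--item-re-cache
              (vector my-allow-alphabetical my-item-terminator re))
        re))))
```

This only remembers the most recent combination of settings, which
should be enough if they rarely change; a caller that flips them back
and forth would need more than one slot.
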
Also make sure that if the same regexp is used in multiple places, it
always uses the same `case-fold-search` value; otherwise the uses will be
considered different for cache purposes.

> [ now we are just 2x slower than tree-sitter rather than 2.5x :) ]

Progress!
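
To make the earlier character-class point concrete, here is a small
sketch comparing the two `rx' forms. The trailing ":" terminator, the
sample string and the `my-' names are invented for the example; the only
thing it checks is that both forms accept the same text, and it says
nothing about their speed.

```elisp
(require 'rx)

;; Or-pattern: each repetition chooses between two alternatives.
(defconst my-name-re-alt
  (rx (group (+ (| wordchar (in "_-")))) ":")
  "Name followed by a colon, written with an or-pattern (illustrative).")

;; Single character set: same language, one bracket expression.
(defconst my-name-re-set
  (rx (group (+ (in "_-" word))) ":")
  "Name followed by a colon, written as one character set (illustrative).")

(defun my-match-name (regexp string)
  "Return the name matched by REGEXP at the start of STRING, or nil.
Runs in a temporary buffer so that the standard syntax table decides
what counts as a word character."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (and (looking-at regexp)
         (match-string 1))))

;; Both calls return the same captured name for this input.
(my-match-name my-name-re-alt "foo-bar_baz: value")
(my-match-name my-name-re-set "foo-bar_baz: value")
```

Evaluating the two constants side by side shows how differently the
forms translate to regexp strings, which is a quick way to spot
or-patterns that could be folded into a single set.
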