From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= Newsgroups: gmane.emacs.bugs Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Date: Fri, 5 May 2023 18:26:52 +0200 Message-ID: References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> <87wn1qqvj0.fsf@localhost> <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com> <87wn1psqny.fsf@localhost> <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com> <87zg6lfobh.fsf@localhost> <281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com> <87o7n1v1w3.fsf@localhost> <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com> <87ednvul22.fsf@localhost> Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.15\)) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="19267"; mail-complaints-to="usenet@ciao.gmane.io" Cc: 63225@debbugs.gnu.org To: Ihor Radchenko Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Fri May 05 18:28:20 2023 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1puyI7-0004i1-Sg for geb-bug-gnu-emacs@m.gmane-mx.org; Fri, 05 May 2023 18:28:19 +0200 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1puyHr-0005un-US; Fri, 05 May 2023 12:28:03 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1puyHq-0005uA-Q5 for bug-gnu-emacs@gnu.org; Fri, 05 May 2023 12:28:02 -0400 Original-Received: from debbugs.gnu.org ([209.51.188.43]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1puyHq-0003eG-IP for bug-gnu-emacs@gnu.org; Fri, 05 May 2023 12:28:02 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1puyHq-0005Zr-1V for bug-gnu-emacs@gnu.org; Fri, 05 May 2023 12:28:02 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Fri, 05 May 2023 16:28:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 63225 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: patch Original-Received: via spool by 63225-submit@debbugs.gnu.org id=B63225.168330402621351 (code B ref 63225); Fri, 05 May 2023 16:28:02 +0000 Original-Received: (at 63225) by debbugs.gnu.org; 5 May 2023 16:27:06 +0000 Original-Received: from localhost ([127.0.0.1]:57317 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1puyGv-0005YH-Dw for submit@debbugs.gnu.org; Fri, 05 May 2023 12:27:05 -0400 Original-Received: from mail-lf1-f41.google.com ([209.85.167.41]:55471) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1puyGp-0005Xl-Uk for 63225@debbugs.gnu.org; Fri, 05 May 2023 12:27:03 -0400 Original-Received: by mail-lf1-f41.google.com with SMTP id 2adb3069b0e04-4f14ea058dcso315448e87.2 for <63225@debbugs.gnu.org>; Fri, 05 May 2023 09:26:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1683304014; x=1685896014; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:from:to:cc:subject :date:message-id:reply-to; bh=R8RoTyCQ6Edy1gdG8/zPqi/m0z0u8F8LfitKn/6ybdo=; b=em8R14OOT2oOw+vuvJeAgxdh6qRJGPioVmA8oSuWQj/E71NHFI4nCjqCpx53XoOc38 1I/pG+YFbBFoJgl7R2CO90RFzYSfE+/qamcze3eLubdhCZMgghx6SMiPIl4wWTx9zeij FvZGuH2rbaI2vD2eXWreVvzTfcd2+sZLudXqO16Wqq111e2gElgdgrpXFuATEFBinNf8 Tv/Sp45MGs9d3ggO00+2s7i4G2JlFjmI2okLZtM9zbefoitb9vVPRisMr0xU2fnoexFm sHFQlyHXEA70+O3tfGHy3HpF8URmWIll9gfF28BiAKRP1V1nWlxc/UEjLW/j+M5fttZ6 u+bA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1683304014; x=1685896014; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=R8RoTyCQ6Edy1gdG8/zPqi/m0z0u8F8LfitKn/6ybdo=; b=A/P2XXk21e8m2X3OcN/JPNHC+RJk0lY1QDxE2rzfHa+WQv9vE7hUl6Y/lEjQ12sxcI iZAybt0c/RbP+P5grMJ6QG3IyuDeur7LTPk5Pkwb1YNf16TY4L2Ciyd/P8GeFdsMPkkR IlPTcebH6x5cVBkNeWV1NWT1TjR/3hDnN/XPURQUUzCja1dDevpobTTl7HrjfqBCfhYw m5m3G78G8CvHFJo8Hpnhmk6btsNGk5qBTHynSQPsaIg/fy7lpwg8KjJBQtwlSfU1dCo4 KuQdRrWsV8ehUs/SKwv90ymaiFjV7825uFnet+TOgHPN9DcebW/jSWtOdRyP3XXBgjTw axyQ== X-Gm-Message-State: AC+VfDw93jWBNJ2meobouiSfpjBjr9ztNz7sJ9P/WqIDjlcyd60dvRSR dP40bY434AU9z2QR9IM5Ffw= X-Google-Smtp-Source: ACHHUZ6p0gNwnsUzi7FBxo+2r0zGHk3DhjsvfIFD8v/F8Lolg56g4Tpvc5EuXJ/c2tOe+xJ/MDir5A== X-Received: by 2002:a19:f70c:0:b0:4f0:6aa3:d860 with SMTP id z12-20020a19f70c000000b004f06aa3d860mr705473lfe.39.1683304013925; Fri, 05 May 2023 09:26:53 -0700 (PDT) Original-Received: from smtpclient.apple (c188-150-165-235.bredband.tele2.se. [188.150.165.235]) by smtp.gmail.com with ESMTPSA id d18-20020a05651221d200b004edc76b6aabsm339284lft.209.2023.05.05.09.26.53 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 05 May 2023 09:26:53 -0700 (PDT) In-Reply-To: <87ednvul22.fsf@localhost> X-Mailer: Apple Mail (2.3654.120.0.1.15) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Original-Sender: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.bugs:261118 Archived-At: 5 maj 2023 kl. 12.31 skrev Ihor Radchenko : > Not exactly. The actual statistics is the following (of course, it is = a > subject of the actual parsed file structure). >=20 > Below, I measured time spent in different branches of cond. This is useful. It looks like drawers consume a lot of time, and list = items. I know very little about Org, but from afar it looks like all = drawers have the same basic form. Can't you recognise them with a single = regexp and then branch on the drawer type for subtype-specific = treatment? There are micro-inefficiencies in the regexps here and there that you = might want to try fixing (although I can't promise any noticeable gain = from doing so): (defconst org-property-drawer-re (concat "^[ \t]*:PROPERTIES:[ \t]*\n" "\\(?:[ \t]*:\\S-+:\\(?:[ \t].*\\)?[ \t]*\n\\)*?" "[ \t]*:END:[ \t]*$") Look at the middle line in particular. Translated to rx, that part = becomes (*? (* (in "\t ")) ":" (+ (not (syntax whitespace))) ":" (? (in "\t ") (* nonl)) (* (in "\t ")) "\n") There are too many ways this could match. Maybe you could change it to (*? (* (in " \t")) ":" (+ (not (in " \t\n:"))) ":" (* nonl) "\n") which prevents a lot of unnecessary backtracking and does away with = parsing structure that doesn't matter here. Another example: (defconst org-drawer-regexp "^[ \t]*:\\(\\(?:\\w\\|[-_]\\)+\\):[ \t]*$" which is (: bol (* (in "\t ")) ":" (group (+ (| wordchar (in "_-")))) ; <-- ":" (* (in "\t ")) eol) Making reasonable assumptions about characters, the line marked with an = arrow could become (group (+ (not (in " \t\n:")))) but it's fine if you want to exclude more characters here, as long as = you avoid leaving backtrack points everywhere. (Character syntax is kind = of expensive too.) Regarding list items, are you still calling (org-item-re) each time? >> Now if as you suggest the parsing is dominated by sequences of = regexps in the branches, it prompts the questions: which branches, what = regexps, why are there so many of them, and is there anything that can = be done to reduce their number? >=20 > Oh. No. The parsing is dominated by `org-element--current-element'. I > can clearly see it because the profiler hits > `org-element--current-element', not the branches. Well there must be regexps being matched elsewhere since you did show = early on the working set to be above 40, not the ca. 20 in = org-element--current-element. > I just had no idea what to make of your suggestion about >=20 > Run on a reduced dataset, and see if the sequence of regexps being > exercised, and their frequencies, are consistent with what you > expect. Stupid printf-debugging actually, nothing fancier than that. I'll see if I can put together a patch for you a bit later on. > (looking-at > (rx > (or > (group-n 1 (regexp org-element--latex-begin-environment)) > (group-n 2 (regexp org-element-drawer-re)) > (group-n 3 (regexp "[ \t]*:\\( \\|$\\)")) > (group-n 7 (regexp org-element-dynamic-block-open-re)) > (seq (group-n 4 (regexp "[ \t]*#\\+")) > (or > (seq "BEGIN_" (group-n 5 (1+ (not space)))) > (group-n 6 "CALL:") > (group-n 8 (1+ (not space)) ":") > )) > (group-n 9 (regexp org-footnote-definition-re)) > (group-n 10 (regexp "[ \t]*-\\{5,\\}[ \t]*$")) > (group-n 11 "%%(")))) This actually incurs some unnecessary run-time cost: the (regexp ...) = forms make this expand to a `concat` call to construct this rather long = regexp each time. Either only recompute it when any of the variables = (org-element--latex-begin-environment etc) change, or if you intend them = to be compile-time constants, make sure they are expanded as such. > is actually slightly slower overall compared to a series of = `looking-at-p'. > AFAIU, because the `looking-at' needs to allocate match-data vector = for > all these 11 groups, which leads to > ;; 6.78% emacs emacs [.] = process_mark_stack > floating up in the perf top. Quite sure that's the concat calls. Match data doesn't actually = contribute to any GC-level consing unless you reify it by calling = `match-data`, or indirectly through `safe-match-data` (which I see that = you are using in several places -- try not to).