From: Ihor Radchenko
Newsgroups: gmane.emacs.bugs
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Sat, 06 May 2023 13:38:38 +0000
Message-ID: <87y1m1oa01.fsf@localhost>
To: Mattias Engdegård
Cc: 63225@debbugs.gnu.org

Mattias Engdegård writes:

>> Below, I measured time spent in different branches of cond.
>
> This is useful. It looks like drawers consume a lot of time, and list items. I know very little about Org, but from afar it looks like all drawers have the same basic form. Can't you recognise them with a single regexp and then branch on the drawer type for subtype-specific treatment?

I could, but it would be an even more complex regexp. Currently,
ordinary drawers have a somewhat complex :BEGIN: line, because they can
have any word there, while property drawers require a very complex match
for the lines inside.

Also, property drawers only occur right after headings, as marked by
the appropriate parser flag. So, we mostly try to match property drawers
only where they are supposed to be. If we try to match ordinary drawers
at the same time, it will actually be slower in practice.

> There are micro-inefficiencies in the regexps here and there that you might want to try fixing (although I can't promise any noticeable gain from doing so):
>
> (defconst org-property-drawer-re
>   (concat "^[ \t]*:PROPERTIES:[ \t]*\n"
>           "\\(?:[ \t]*:\\S-+:\\(?:[ \t].*\\)?[ \t]*\n\\)*?"
>           "[ \t]*:END:[ \t]*$")
> ...
> There are too many ways this could match. Maybe you could change it to
>
>   (*? (* (in " \t"))
>       ":" (+ (not (in " \t\n:"))) ":"
>       (* nonl)
>       "\n")

Sure. Thanks!
It was a sub-second improvement, but an improvement.

> Another example:
>
> (defconst org-drawer-regexp "^[ \t]*:\\(\\(?:\\w\\|[-_]\\)+\\):[ \t]*$"
> ...
> Making reasonable assumptions about characters, the line marked with an arrow could become
>
>   (group (+ (not (in " \t\n:"))))

This would amount to an Org syntax change, so no. A slight improvement
in performance cannot justify syntax changes.

> Regarding list items, are you still calling (org-item-re) each time?

Yes and no. `org-item-re' now looks like

(defvar org--item-re-cache nil
  "Results cache for `org-item-re'.")
(defsubst org-item-re ()
  "Return the correct regular expression for plain lists."
  (or (plist-get
       org--item-re-cache
       (cons org-list-allow-alphabetical
             org-plain-list-ordered-item-terminator)
       #'equal)
      ...))

It should not give much overhead.

>> Oh. No. The parsing is dominated by `org-element--current-element'. I
>> can clearly see it because the profiler hits
>> `org-element--current-element', not the branches.
>
> Well there must be regexps being matched elsewhere since you did show early on the working set to be above 40, not the ca. 20 in org-element--current-element.

Of course. A larger number of regexps is matched in the individual
element parsers. They just don't contribute as much as
`org-element--current-element' individually and thus do not show up high
in the profiler.

For reference, I calculated the time taken in
`org-element--current-element' to decide about parsing a specific element
type (Time/Avg) vs. the time taken to actually parse it (Time2/Avg2).
(Note that the data below is for my WIP parser refactoring branch at
https://git.sr.ht/~yantar92/org-mode/tree/feature/org-element-ast/item/lisp/org-element.el;
The original, e.g.
headline will be way slower)

| Depth |  Count | Time, msec | Time2, msec | Avg, μsec | Avg2, μsec | Type         |
|-------+--------+------------+-------------+-----------+------------+--------------|
|     0 |  89729 |  30.714894 |   1339.9075 |      0.34 |      14.93 | item         |
|     1 |   2074 |   0.779739 |   19.040295 |      0.38 |       9.18 | table row    |
|     2 | 207365 |   37.53366 |   1970.9524 |      0.18 |       9.50 | node         |
|     3 |  72849 |  303.36754 |   2448.6616 |      4.16 |      33.61 | headline     |
|     4 |  56076 |  33.117519 |   763.41927 |      0.59 |      13.61 | section      |
|     5 |    291 |   0.258913 |    2.622451 |      0.89 |       9.01 | comment      |
|     6 |   8247 |   23.15524 |   224.61437 |      2.81 |      27.24 | planning     |
|     7 |  54924 |  362.36612 |   523.11581 |      6.60 |       9.52 | prop drawer  |
|     8 |  89647 |  69.361279 |   761.29519 |      0.77 |       8.49 | paragraph    |
|     9 |  29652 |  67.658072 |   829.21937 |      2.28 |      27.97 | clock        |
|    10 |    231 |   1.285224 |    3.832217 |      5.56 |      16.59 | inlinetask   |
|    11 |      0 |          0 |           0 |      0.00 |       0.00 | keyword      |
|    12 |     30 |   0.059978 |    0.413909 |      2.00 |      13.80 | latex env    |
|    13 |  45401 |  159.57401 |   515.15776 |      3.51 |      11.35 | drawer       |
|    14 |     21 |   0.265039 |    0.265754 |     12.62 |      12.65 | fixed width  |
|    15 |    913 |   5.597659 |   17.326571 |      6.13 |      18.98 | block        |
|    16 |     53 |   0.355013 |    1.329438 |      6.70 |      25.08 | call         |
|    17 |      0 |          0 |           0 |      0.00 |       0.00 | dynblock     |
|    18 |     29 |   0.365553 |    0.494062 |     12.61 |      17.04 | keyword      |
|    19 |      0 |          0 |           0 |      0.00 |       0.00 | paragraph    |
|    20 |      0 |          0 |           0 |      0.00 |       0.00 | footnote def |
|    21 |      0 |          0 |           0 |      0.00 |       0.00 | hrule        |
|    22 |      0 |          0 |           0 |      0.00 |       0.00 | diary sexp   |
|    23 |     69 |   0.739084 |    1.459472 |     10.71 |      21.15 | table        |
|    24 |  41586 |  281.42632 |   1327.9897 |      6.77 |      31.93 | plain list   |
|    25 |   5370 |  36.202665 |   66.853115 |      6.74 |      12.45 | paragraph    |
#+TBLFM: $5=1000.0*$3/$2;%.2f::$6=1000.0*$4/$2;%.2f

>> I just had no idea what to make of your suggestion about
>>
>> Run on a reduced dataset, and see if the sequence of regexps being
>> exercised, and their frequencies, are consistent
>> with what you
>> expect.
>
> Stupid printf-debugging actually, nothing fancier than that.
> I'll see if I can put together a patch for you a bit later on.

I once tried #define REGEX_EMACS_DEBUG + regex_emacs_debug = 100000, but
it produced so much output that I could not even open Emacs in a
reasonable time because of the wall of output in the terminal.

>> (looking-at
>>  (rx
>>   (or
>>    ...
>>    (group-n 11 "%%("))))
>
> This actually incurs some unnecessary run-time cost: the (regexp ...) forms make this expand to a `concat` call to construct this rather long regexp each time. Either only recompute it when any of the variables (org-element--latex-begin-environment etc) change, or if you intend them to be compile-time constants, make sure they are expanded as such.
>
>> is actually slightly slower overall compared to a series of `looking-at-p'.
>> AFAIU, because the `looking-at' needs to allocate match-data vector for
>> all these 11 groups, which leads to
>> ;; 6.78% emacs emacs [.] process_mark_stack
>> floating up in the perf top.
>
> Quite sure that's the concat calls. Match data doesn't actually contribute to any GC-level consing unless you reify it by calling `match-data`, or indirectly through `safe-match-data` (which I see that you are using in several places -- try not to).

After moving that giant rx into a defconst, the parsing time no longer
grows significantly:

;; No rx: 17.947628s (1.373926s in 2 GCs)
;; rx:    18.058193s (1.379169s in 2 GCs)

But there is no improvement either...

[ now we are just 2x slower than tree-sitter rather than 2.5x :) ]

-- 
Ihor Radchenko // yantar92,
Org mode contributor