bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)

all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed

From: Ihor Radchenko <yantar92@posteo.net>
To: "Mattias Engdegård" <mattias.engdegard@gmail.com>
Cc: 63225@debbugs.gnu.org
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Fri, 05 May 2023 10:31:01 +0000	[thread overview]
Message-ID: <87ednvul22.fsf@localhost> (raw)
In-Reply-To: <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com>

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> What is the aim of instrumenting regexp engine in this scenario?
>> I already know that additional regexps will be tested by individual
>> `org-element-X-parser' functions.
>
> I got the impression that the 'spine' of the parser, the sequence of `looking-at` calls in `org-element--current-element`, would frequently be run through in its entirety which means that consolidating these would reduce the number of working regexps by about 20 (if I'm counting correctly).

Not exactly. The actual statistics is the following (of course, it is a
subject of the actual parsed file structure).

Below, I measured time spent in different branches of cond.
Note describes the cond type.

| Depth |  count |     Time, msec | Note                              | Avg time, μsec/count | Element type  |
|-------+--------+----------------+-----------------------------------+----------------------+---------------|
|     0 |  89592 |         31.094 | eq                                |           0.43315819 | item          |
|     1 |   1984 |           0.68 | eq                                |           0.45513710 | table row     |
|     2 | 206607 |          43.23 | eq                                |           0.24850257 | node-property |
|     3 |  72770 |         302.95 | looking-at-p, skip-chars          |            4.8545025 | headline      |
|     4 |  56000 |      39.190916 | memq                              |           0.69983779 | section       |
|     5 |   8231 |      26.129109 | looking-at-p, lookback            |            3.1744756 | planning      |
|     6 |  54852 |      503.97346 | looking-at-p, multiline, lookback |            9.1878776 | prop drawer   |
|     7 |  89510 |      78.514284 | bolp                              |           0.87715656 | paragraph     |
|     8 |  29610 |      79.589466 | looking-at-p                      |            2.6879252 | clock         |
|     9 |    231 |       1.644304 | eq                                |            7.1181991 | inlinetask    |
|    10 |      0 | tot: 1173 msec | eq                                |                  0/0 | affiliated    |
|-------+--------+----------------+-----------------------------------+----------------------+---------------|
|    11 |     30 |       0.060081 | looking-at-p                      |               2.0027 | latex env     |
|    12 |  45443 |      187.41703 | looking-at-p                      |            4.1242222 | drawer        |
|    13 |     21 |       0.255528 | looking-at-p                      |               12.168 | fixed width   |
|    14 |    967 |        6.67522 | looking-at                        |            6.9030196 | block         |
|    15 |     53 |       0.342144 | looking-at-p                      |            6.4555472 | call          |
|    16 |      0 |              0 | looking-at-p                      |                  0/0 | dynblock      |
|    17 |     29 |       0.361915 | looking-at-p                      |            12.479828 | keyword       |
|    18 |      0 |              0 | eq                                |                  0/0 | paragraph     |
|    19 |      0 |              0 | looking-at-p                      |                  0/0 | footnote def  |
|    20 |      0 |              0 | looking-at-p                      |                  0/0 | rule          |
|    21 |      0 |              0 | looking-at-p                      |                  0/0 | diary sexp    |
|-------+--------+----------------+-----------------------------------+----------------------+---------------|
|    22 |     66 |       0.752823 | looking-at-p, re-search-forward   |            11.406409 | table         |
|    23 |  41509 |      303.39472 | looking-at-p                      |            7.3091310 | item          |
|    24 |   5340 |      41.188231 | t                                 |            7.7131519 | paragraph     |
|       |        | tot: 1713 msec |                                   |                      |               |

If I try to group regexps into one giant rx form and then compare time
spend in different cond branches, I get the following.
(I grouped the regexps between horizontal rules in the above table)

I tried two different files: (1) notes.org that is heavy on headlines;
(2) org-manual that is heavy on actual text.

Grouping with rx gives no noticeable impact. 

| Depth |  Avg time, μs | Avg time, μs |   Avg time, μs | Avg time, μs |
|       | (notes+no rx) |   (notes+rx) | (manual+no rx) |  (manual+rx) |
|-------+---------------+--------------+----------------+--------------|
|     0 |    0.34576248 |   0.35948186 |     0.43996679 |   0.44675874 |
|     1 |    0.35749752 |   0.37239325 |     0.44559585 |   0.43868739 |
|     2 |    0.18958309 |   0.20197035 |     0.29960921 |   0.29960921 |
|     3 |     4.1282904 |    4.2407582 |      4.4482968 |    4.4711219 |
|     4 |    0.61503580 |   0.59914459 |     0.64377158 |   0.63460540 |
|     5 |    0.88028169 |   0.83916084 |      1.2820513 |    1.2820513 |
|     6 |     2.6515244 |    2.6348024 |      2.6795055 |    2.7579648 |
|     7 |     7.8175124 |    7.8262918 |      7.1999256 |    7.1996154 |
|     8 |    0.75458424 |   0.75368242 |     0.70958084 |   0.72455090 |
|     9 |     2.1446653 |    2.1466905 |            10. |          10. |
|    10 |     5.2813853 |    5.2813853 |      5.4761905 |    6.5476190 |
|    11 |          0./0 |         0./0 |           0./0 |         0./0 |
|    12 |            2. |    2.3333333 |           0./0 |         0./0 |
|    13 |     3.5030250 |    4.6581886 |      4.0623783 |    5.8718692 |
|    14 |     11.428571 |    10.952381 |      2.6970634 |    3.3307573 |
|    15 |     5.6508264 |    4.6177686 |      5.1308629 |    4.3741902 |
|    16 |     6.2264151 |    4.1509434 |           0./0 |         0./0 |
|    17 |          0./0 |         0./0 |           0./0 |         0./0 |
|    18 |     10.689655 |          12. |      5.7134386 |    3.7413831 |
|    19 |          0./0 |         0./0 |           0./0 |         0./0 |
|    20 |          0./0 |         0./0 |      2.8888889 |    2.9444444 |
|    21 |          0./0 |         0./0 |           0./0 |         0./0 |
|    22 |          0./0 |         0./0 |           0./0 |         0./0 |
|    23 |     10.746269 |    9.4029851 |      6.2695313 |    6.1328125 |
|    24 |     6.4371193 |    6.2419339 |      6.0138782 |    5.8558211 |
|    25 |     6.4154824 |    6.3855647 |      4.9707695 |    4.7727182 |
|    26 |          0./0 |         0./0 |           0./0 |         0./0 |

> Now if as you suggest the parsing is dominated by sequences of regexps in the branches, it prompts the questions: which branches, what regexps, why are there so many of them, and is there anything that can be done to reduce their number?

Oh. No. The parsing is dominated by `org-element--current-element'. I
can clearly see it because the profiler hits
`org-element--current-element', not the branches.

I just had no idea what to make of your suggestion about

    Run on a reduced dataset, and see if the sequence of regexps being
    exercised, and their frequencies, are consistent with what you
    expect.

Also, my testing showed that

(looking-at
  (rx
   (or
    (group-n 1 (regexp org-element--latex-begin-environment))
    (group-n 2 (regexp org-element-drawer-re))
    (group-n 3 (regexp "[ \t]*:\\( \\|$\\)"))
    (group-n 7 (regexp org-element-dynamic-block-open-re))
    (seq (group-n 4 (regexp "[ \t]*#\\+"))
         (or
          (seq "BEGIN_" (group-n 5 (1+ (not space))))
          (group-n 6 "CALL:")
          (group-n 8 (1+ (not space)) ":")
          ))
    (group-n 9 (regexp org-footnote-definition-re))
    (group-n 10 (regexp "[ \t]*-\\{5,\\}[ \t]*$"))
    (group-n 11 "%%("))))

is actually slightly slower overall compared to a series of `looking-at-p'.
AFAIU, because the `looking-at' needs to allocate match-data vector for
all these 11 groups, which leads to
;;  6.78%  emacs         emacs                                  [.] process_mark_stack
floating up in the perf top.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

next prev parent reply	other threads:[~2023-05-05 10:31 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-02  7:37 bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-02 14:33 ` Mattias Engdegård
2023-05-02 15:25   ` Eli Zaretskii
2023-05-02 15:28     ` Mattias Engdegård
2023-05-02 17:30       ` Eli Zaretskii
2023-05-02 17:58         ` Ihor Radchenko
2023-05-02 16:14   ` Ihor Radchenko
2023-05-02 21:00     ` Mattias Engdegård
2023-05-02 21:21       ` Ihor Radchenko
2023-05-03  8:39         ` Mattias Engdegård
2023-05-03  9:36           ` Ihor Radchenko
2023-05-03 13:59             ` Mattias Engdegård
2023-05-03 15:05               ` Ihor Radchenko
2023-05-03 15:20                 ` Mattias Engdegård
2023-05-03 16:02                   ` Ihor Radchenko
2023-05-04  9:24                     ` Mattias Engdegård
2023-05-05 10:31                       ` Ihor Radchenko [this message]
2023-05-05 16:26                         ` Mattias Engdegård
2023-05-06 13:38                           ` Ihor Radchenko
2023-05-07 10:32                             ` Mattias Engdegård
2023-05-08 11:58                               ` Ihor Radchenko
2023-05-08 18:21                                 ` Mattias Engdegård
2023-05-08 19:38                                   ` Ihor Radchenko
2023-05-08 19:53                                     ` Mattias Engdegård
2023-05-09  8:36                                       ` bug#63225: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)) Ihor Radchenko
2023-05-09 12:02                                       ` bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-09 15:05                                         ` Mattias Engdegård
2023-05-09 15:56                                           ` Ihor Radchenko
2023-05-09 15:57                                             ` Mattias Engdegård
2023-05-07 12:45                           ` Mattias Engdegård
2023-05-08 13:56                             ` Ihor Radchenko
2023-05-08 19:32                               ` Mattias Engdegård
2023-05-08 19:44                                 ` Ihor Radchenko
2023-05-04 12:58               ` Ihor Radchenko
2023-05-02 23:36   ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ednvul22.fsf@localhost \
    --to=yantar92@posteo.net \
    --cc=63225@debbugs.gnu.org \
    --cc=mattias.engdegard@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.