unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
From: Ihor Radchenko <yantar92@posteo.net>
To: "Mattias Engdegård" <mattias.engdegard@gmail.com>
Cc: 63225@debbugs.gnu.org
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Sat, 06 May 2023 13:38:38 +0000	[thread overview]
Message-ID: <87y1m1oa01.fsf@localhost> (raw)
In-Reply-To: <EFFDF31B-2B58-44E8-9B05-6039A98331D3@gmail.com>

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> Below, I measured time spent in different branches of cond.
>
> This is useful. It looks like drawers consume a lot of time, and list items. I know very little about Org, but from afar it looks like all drawers have the same basic form. Can't you recognise them with a single regexp and then branch on the drawer type for subtype-specific treatment?

I may, but it will be even more complex regexp. Currently, ordinary
drawers have somewhat complex :BEGIN: line, because they can have any
word there, while property drawers require very complex match for the
lines inside. Also, property drawers only occur right after headings, as
marked by appropriate parser flag. So, matching property drawers mostly
happens what they are supposed to be. If we try to match ordinary
drawers at the same time, it will actually be slower in practice.

> There are micro-inefficiencies in the regexps here and there that you might want to try fixing (although I can't promise any noticeable gain from doing so):
>
> (defconst org-property-drawer-re
>   (concat "^[ \t]*:PROPERTIES:[ \t]*\n"
>           "\\(?:[ \t]*:\\S-+:\\(?:[ \t].*\\)?[ \t]*\n\\)*?"
>           "[ \t]*:END:[ \t]*$")
> ...
> There are too many ways this could match. Maybe you could change it to
>
> (*? (* (in " \t"))
>     ":" (+ (not (in " \t\n:"))) ":"
>     (* nonl)
>     "\n")

Sure. Thanks! It was a sub-second improvement, but an improvement.

> Another example:
>
> (defconst org-drawer-regexp "^[ \t]*:\\(\\(?:\\w\\|[-_]\\)+\\):[ \t]*$"
> ...
> Making reasonable assumptions about characters, the line marked with an arrow could become
>
>    (group (+ (not (in " \t\n:"))))

This will account for Org syntax change, so no.
Slight improvement in performance cannot justify syntax changes.
 
> Regarding list items, are you still calling (org-item-re) each time?

Yes and no.
`org-item-re' now looks like

(defvar org--item-re-cache nil
  "Results cache for `org-item-re'.")
(defsubst org-item-re ()
  "Return the correct regular expression for plain lists."
  (or (plist-get
       org--item-re-cache
       (cons org-list-allow-alphabetical
             org-plain-list-ordered-item-terminator)
       #'equal)
      ...))

It should not give much overhead.

>> Oh. No. The parsing is dominated by `org-element--current-element'. I
>> can clearly see it because the profiler hits
>> `org-element--current-element', not the branches.
>
> Well there must be regexps being matched elsewhere since you did show early on the working set to be above 40, not the ca. 20 in org-element--current-element.

Of course. A larger number of regexps is matched in the individual
element parsers. They just don't contribute as much as
`org-element--current-element' individually and thus do not show up high
in the profiler.

For reference, I calculated the time taken in
`org-element--current-element' to decide about parsing specific element
type (Time/Avg) vs. time taken to actual parse it (Time2/Avg2).
(Note that the data below is for my WIP parser refactoring branch at
https://git.sr.ht/~yantar92/org-mode/tree/feature/org-element-ast/item/lisp/org-element.el;
The original, e.g. headline will be way slower)

| Depth |  Count | Time, msec | Time2, msec | Avg, μsec | Avg2, μsec | Type         |
|-------+--------+------------+-------------+-----------+------------+--------------|
|     0 |  89729 |  30.714894 |   1339.9075 |      0.34 |      14.93 | item         |
|     1 |   2074 |   0.779739 |   19.040295 |      0.38 |       9.18 | table row    |
|     2 | 207365 |   37.53366 |   1970.9524 |      0.18 |       9.50 | node         |
|     3 |  72849 |  303.36754 |   2448.6616 |      4.16 |      33.61 | headline     |
|     4 |  56076 |  33.117519 |   763.41927 |      0.59 |      13.61 | section      |
|     5 |    291 |   0.258913 |    2.622451 |      0.89 |       9.01 | comment      |
|     6 |   8247 |   23.15524 |   224.61437 |      2.81 |      27.24 | planning     |
|     7 |  54924 |  362.36612 |   523.11581 |      6.60 |       9.52 | prop drawer  |
|     8 |  89647 |  69.361279 |   761.29519 |      0.77 |       8.49 | paragraph    |
|     9 |  29652 |  67.658072 |   829.21937 |      2.28 |      27.97 | clock        |
|    10 |    231 |   1.285224 |    3.832217 |      5.56 |      16.59 | inlinetask   |
|    11 |      0 |          0 |           0 |      0.00 |       0.00 | keyword      |
|    12 |     30 |   0.059978 |    0.413909 |      2.00 |      13.80 | latex env    |
|    13 |  45401 |  159.57401 |   515.15776 |      3.51 |      11.35 | drawer       |
|    14 |     21 |   0.265039 |    0.265754 |     12.62 |      12.65 | fixed width  |
|    15 |    913 |   5.597659 |   17.326571 |      6.13 |      18.98 | block        |
|    16 |     53 |   0.355013 |    1.329438 |      6.70 |      25.08 | call         |
|    17 |      0 |          0 |           0 |      0.00 |       0.00 | dynblock     |
|    18 |     29 |   0.365553 |    0.494062 |     12.61 |      17.04 | keyword      |
|    19 |      0 |          0 |           0 |      0.00 |       0.00 | paragraph    |
|    20 |      0 |          0 |           0 |      0.00 |       0.00 | footnote def |
|    21 |      0 |          0 |           0 |      0.00 |       0.00 | hrule        |
|    22 |      0 |          0 |           0 |      0.00 |       0.00 | diary sexp   |
|    23 |     69 |   0.739084 |    1.459472 |     10.71 |      21.15 | table        |
|    24 |  41586 |  281.42632 |   1327.9897 |      6.77 |      31.93 | plain list   |
|    25 |   5370 |  36.202665 |   66.853115 |      6.74 |      12.45 | paragraph    |
#+TBLFM: $5=1000.0*$3/$2;%.2f::$6=1000.0*$4/$2;%.2f

>> I just had no idea what to make of your suggestion about
>> 
>>    Run on a reduced dataset, and see if the sequence of regexps being
>>    exercised, and their frequencies, are consistent with what you
>>    expect.
>
> Stupid printf-debugging actually, nothing fancier than that.
> I'll see if I can put together a patch for you a bit later on.

I once tried #defun REGEX_EMACS_DEBUG  + regex_emacs_debug = 100000, but
it produced so much output that I cannot even open Emacs in reasonable
time because of the wall of output in terminal.

>> (looking-at
>>  (rx
>>   (or
>> ...
>>    (group-n 11 "%%("))))
>
> This actually incurs some unnecessary run-time cost: the (regexp ...) forms make this expand to a `concat` call to construct this rather long regexp each time. Either only recompute it when any of the variables (org-element--latex-begin-environment etc) change, or if you intend them to be compile-time constants, make sure they are expanded as such.
>
>> is actually slightly slower overall compared to a series of `looking-at-p'.
>> AFAIU, because the `looking-at' needs to allocate match-data vector for
>> all these 11 groups, which leads to
>> ;;  6.78%  emacs         emacs                                  [.] process_mark_stack
>> floating up in the perf top.
>
> Quite sure that's the concat calls. Match data doesn't actually contribute to any GC-level consing unless you reify it by calling `match-data`, or indirectly through `safe-match-data` (which I see that you are using in several places -- try not to).

After moving that giant rx into defconst, the parsing time is not
growing significantly anymore:

;; No rx:  17.947628s (1.373926s in 2 GCs)
;; rx: 18.058193s (1.379169s in 2 GCs)

But there is no improvement either...

[ now we are just 2x slower than tree-sitter rather than 2.5x :) ]

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





  reply	other threads:[~2023-05-06 13:38 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-02  7:37 bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-02 14:33 ` Mattias Engdegård
2023-05-02 15:25   ` Eli Zaretskii
2023-05-02 15:28     ` Mattias Engdegård
2023-05-02 17:30       ` Eli Zaretskii
2023-05-02 17:58         ` Ihor Radchenko
2023-05-02 16:14   ` Ihor Radchenko
2023-05-02 21:00     ` Mattias Engdegård
2023-05-02 21:21       ` Ihor Radchenko
2023-05-03  8:39         ` Mattias Engdegård
2023-05-03  9:36           ` Ihor Radchenko
2023-05-03 13:59             ` Mattias Engdegård
2023-05-03 15:05               ` Ihor Radchenko
2023-05-03 15:20                 ` Mattias Engdegård
2023-05-03 16:02                   ` Ihor Radchenko
2023-05-04  9:24                     ` Mattias Engdegård
2023-05-05 10:31                       ` Ihor Radchenko
2023-05-05 16:26                         ` Mattias Engdegård
2023-05-06 13:38                           ` Ihor Radchenko [this message]
2023-05-07 10:32                             ` Mattias Engdegård
2023-05-08 11:58                               ` Ihor Radchenko
2023-05-08 18:21                                 ` Mattias Engdegård
2023-05-08 19:38                                   ` Ihor Radchenko
2023-05-08 19:53                                     ` Mattias Engdegård
2023-05-09  8:36                                       ` bug#63225: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)) Ihor Radchenko
2023-05-09 12:02                                       ` bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-09 15:05                                         ` Mattias Engdegård
2023-05-09 15:56                                           ` Ihor Radchenko
2023-05-09 15:57                                             ` Mattias Engdegård
2023-05-07 12:45                           ` Mattias Engdegård
2023-05-08 13:56                             ` Ihor Radchenko
2023-05-08 19:32                               ` Mattias Engdegård
2023-05-08 19:44                                 ` Ihor Radchenko
2023-05-04 12:58               ` Ihor Radchenko
2023-05-02 23:36   ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87y1m1oa01.fsf@localhost \
    --to=yantar92@posteo.net \
    --cc=63225@debbugs.gnu.org \
    --cc=mattias.engdegard@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).