unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
From: Ihor Radchenko <yantar92@posteo.net>
To: "Mattias Engdegård" <mattias.engdegard@gmail.com>
Cc: 63225@debbugs.gnu.org
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Tue, 02 May 2023 21:21:39 +0000	[thread overview]
Message-ID: <87wn1qqvj0.fsf@localhost> (raw)
In-Reply-To: <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com>

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> The Org parser is basically a giant `cond' of a number of regexp
>> matches. See `org-element--current-element'.
>
> A common way to handle this is to build a big regexp to match many cases at the same time, essentially transforming
>
> (cond ((looking-at RE1) ...)
>       ((looking-at RE2) ...)
>       ...)
>
> to
>
>     (looking-at (rx (or (group RE1) (group RE2) ...)))
>     (cond ((match-beginning 1) ...)
>           ((match-beginning 2) ...)
>           ...)
>
> This reduces the number of regexps used and is also typically faster.
> (Essentially this is what `syntax-propertize-rules` does but in a more specialised context.)

I tried this, and it did not give any noticeable improvement.
Most likely, because the actual `cond' is

  (cond ((looking-at "foo) (<parser function calling many more regex searches>)) ...)

Actually, I started looking into C-level perf data right after trying to
consolidate the regexps into one giant looking-at form and not seeing
any improvement. At that point, I already cached most of the
dynamically generated regexps in there and ran out of reasonable ideas.

> Using tree-sitter for this could very well be even faster but it's not guaranteed to be available.

The available tree-sitter grammars for Org are about 1.5-2x faster in my
previous tests, but they do less granular parsing of fields. And not
complete. Org is not context-free and does not fit well into GLR.

And we are not going to use tree sitter for development to avoid
increasing contribution barriers.

> Otherwise it's very much a matter of optimisation of everything, including regexps. Minimise backtracking.
> If you want to match five or more dashes, use "------*" instead of "-\\{5,\\}". And so on.

This example sounds like something that regexp compilation should be
able to optimize, no? I do not easily see why the latter should cause
more CPU time compared to the former.

> It's also obviously a good idea not to generate regexps dynamically each time if you can help it, and minimise consing in general.

Sure. I was able to shave a few seconds off using this idea.
Other than regexp compilation hotspot, I only see re-writing parser
non-recursively (`org-element--parse-elements' and
`org-element--parse-objects').

>>> Introducing regexp objects that could store compiled regexps and be
>>> used instead of strings would be quite some work but probably
>>> worthwhile.
>> 
>> What exactly needs to be done? Assuming that regexp objects are not
>> going to be readable, for simplicity.
>
> A proper design, for starters. For example, we probably want them to be usable in customised variables which calls for them to be readable.

Or, alternatively, the parsed regexps can be attached to string objects
internally. Then, regexp cache lookup will degenerate to looking into a
string object slot.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





  reply	other threads:[~2023-05-02 21:21 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-02  7:37 bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-02 14:33 ` Mattias Engdegård
2023-05-02 15:25   ` Eli Zaretskii
2023-05-02 15:28     ` Mattias Engdegård
2023-05-02 17:30       ` Eli Zaretskii
2023-05-02 17:58         ` Ihor Radchenko
2023-05-02 16:14   ` Ihor Radchenko
2023-05-02 21:00     ` Mattias Engdegård
2023-05-02 21:21       ` Ihor Radchenko [this message]
2023-05-03  8:39         ` Mattias Engdegård
2023-05-03  9:36           ` Ihor Radchenko
2023-05-03 13:59             ` Mattias Engdegård
2023-05-03 15:05               ` Ihor Radchenko
2023-05-03 15:20                 ` Mattias Engdegård
2023-05-03 16:02                   ` Ihor Radchenko
2023-05-04  9:24                     ` Mattias Engdegård
2023-05-05 10:31                       ` Ihor Radchenko
2023-05-05 16:26                         ` Mattias Engdegård
2023-05-06 13:38                           ` Ihor Radchenko
2023-05-07 10:32                             ` Mattias Engdegård
2023-05-08 11:58                               ` Ihor Radchenko
2023-05-08 18:21                                 ` Mattias Engdegård
2023-05-08 19:38                                   ` Ihor Radchenko
2023-05-08 19:53                                     ` Mattias Engdegård
2023-05-09  8:36                                       ` bug#63225: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)) Ihor Radchenko
2023-05-09 12:02                                       ` bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-09 15:05                                         ` Mattias Engdegård
2023-05-09 15:56                                           ` Ihor Radchenko
2023-05-09 15:57                                             ` Mattias Engdegård
2023-05-07 12:45                           ` Mattias Engdegård
2023-05-08 13:56                             ` Ihor Radchenko
2023-05-08 19:32                               ` Mattias Engdegård
2023-05-08 19:44                                 ` Ihor Radchenko
2023-05-04 12:58               ` Ihor Radchenko
2023-05-02 23:36   ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87wn1qqvj0.fsf@localhost \
    --to=yantar92@posteo.net \
    --cc=63225@debbugs.gnu.org \
    --cc=mattias.engdegard@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).