unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
From: Ihor Radchenko <yantar92@posteo.net>
To: "Mattias Engdegård" <mattias.engdegard@gmail.com>
Cc: 63225@debbugs.gnu.org
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Wed, 03 May 2023 09:36:01 +0000	[thread overview]
Message-ID: <87wn1psqny.fsf@localhost> (raw)
In-Reply-To: <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com>

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> I tried this, and it did not give any noticeable improvement.
>> Most likely, because the actual `cond' is
>> 
>>  (cond ((looking-at "foo) (<parser function calling many more regex searches>)) ...)
>
> I see, so it doesn't run through all top-level cases very often then? I thought that would be the common path (plain text).

You are indeed right. Top-level cases are ran very often. So, what I
said does not make much sense.

Yet, in my tests, I am unable to see any improvement when I consolidate
the regexps.

If I do (progn (set-regexp-cache-size 50) (org-element-parse-buffer) nil)

Without consolidation, but using `looking-at-p' as much as possible:

Profiler top
        ;; 4160  21% + org-element--current-element
        ;; 2100  10% + org-element--parse-elements
        ;; 1894   9% + org-element--parse-objects
        ;; 1422   7%   Automatic GC
        ;;  871   4% + org-element--headline-deferred
        ;;  806   4% + apply
        ;;  796   4% + org-element-create
        ;;  638   3% + org-element--list-struct

Perf top
    ;; 16.72%  emacs         emacs                                  [.] re_match_2_internal
    ;;  7.16%  emacs         emacs                                  [.] exec_byte_code
    ;;  4.08%  emacs         emacs                                  [.] funcall_subr
    ;;  4.06%  emacs         emacs                                  [.] re_search_2

With consolidation into a giant rx (or ...) with groups:

        ;; 4158  21% + org-element--current-element
        ;; 2163  11% + org-element--parse-objects
        ;; 1796   9% + org-element--parse-elements
        ;; 1276   6%   Automatic GC
        ;;  921   4% + org-element--headline-deferred
        ;;  833   4% + apply
        ;;  793   4% + org-element-create
        ;;  660   3% + org-element--list-struct

    ;; 16.44%  emacs         emacs                                  [.] re_match_2_internal
    ;;  7.03%  emacs         emacs                                  [.] exec_byte_code
    ;;  6.78%  emacs         emacs                                  [.] process_mark_stack
    ;;  4.05%  emacs         emacs                                  [.] re_search_2
    ;;  4.02%  emacs         emacs                                  [.] funcall_subr

The version with giant single rx form is actually slower overall (!),
making no difference at all in `org-element--current-element'.

> Perhaps you just don't see much improvement until the working set of regexps fits in the cache.

As you see, I now increased cache size to 50. No improvement. Same with
my observations on current master.

> The regexp compiler doesn't do much optimisation in order to keep the translation fast. It doesn't even convert "[a]" to "a".

I guess that it is another thing that could be improved if we were to
have compiled regexp objects. Compilation time would not matter as much.

Ideally, the compiler should do something similar to
what https://www.colm.net/open-source/ragel/ does.

>> Or, alternatively, the parsed regexps can be attached to string objects
>> internally. Then, regexp cache lookup will degenerate to looking into a
>> string object slot.
>
> That would work too but we really don't want to make our strings any fancier, they are already much too big and slow.

Then, what about making compiled regexp object similar to string, but
with plist slot replaced by compiled regexp slot? Maybe some other slots
removed (I am not very familiar with specific of internal string
representation)

AFAIU, compiled regexp read/write syntax can be uniquely represented
simply by a string. Something like #r"[a-z]+" (maybe even with special
handling for backslashes, like proposed in
https://yhetil.org/emacs-devel/4209edd83cfee7c84b2d75ebfcd38784fa21b23c.camel@crossproduct.net)

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





  reply	other threads:[~2023-05-03  9:36 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-02  7:37 bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-02 14:33 ` Mattias Engdegård
2023-05-02 15:25   ` Eli Zaretskii
2023-05-02 15:28     ` Mattias Engdegård
2023-05-02 17:30       ` Eli Zaretskii
2023-05-02 17:58         ` Ihor Radchenko
2023-05-02 16:14   ` Ihor Radchenko
2023-05-02 21:00     ` Mattias Engdegård
2023-05-02 21:21       ` Ihor Radchenko
2023-05-03  8:39         ` Mattias Engdegård
2023-05-03  9:36           ` Ihor Radchenko [this message]
2023-05-03 13:59             ` Mattias Engdegård
2023-05-03 15:05               ` Ihor Radchenko
2023-05-03 15:20                 ` Mattias Engdegård
2023-05-03 16:02                   ` Ihor Radchenko
2023-05-04  9:24                     ` Mattias Engdegård
2023-05-05 10:31                       ` Ihor Radchenko
2023-05-05 16:26                         ` Mattias Engdegård
2023-05-06 13:38                           ` Ihor Radchenko
2023-05-07 10:32                             ` Mattias Engdegård
2023-05-08 11:58                               ` Ihor Radchenko
2023-05-08 18:21                                 ` Mattias Engdegård
2023-05-08 19:38                                   ` Ihor Radchenko
2023-05-08 19:53                                     ` Mattias Engdegård
2023-05-09  8:36                                       ` bug#63225: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)) Ihor Radchenko
2023-05-09 12:02                                       ` bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-09 15:05                                         ` Mattias Engdegård
2023-05-09 15:56                                           ` Ihor Radchenko
2023-05-09 15:57                                             ` Mattias Engdegård
2023-05-07 12:45                           ` Mattias Engdegård
2023-05-08 13:56                             ` Ihor Radchenko
2023-05-08 19:32                               ` Mattias Engdegård
2023-05-08 19:44                                 ` Ihor Radchenko
2023-05-04 12:58               ` Ihor Radchenko
2023-05-02 23:36   ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87wn1psqny.fsf@localhost \
    --to=yantar92@posteo.net \
    --cc=63225@debbugs.gnu.org \
    --cc=mattias.engdegard@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).