bug#63225: Compiling regexp patterns (and REGEXP_CACHE

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
@ 2023-05-02  7:37 Ihor Radchenko
  2023-05-02 14:33 ` Mattias Engdegård
  0 siblings, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-02  7:37 UTC (permalink / raw)
  To: 63225

[-- Attachment #1: Type: text/plain, Size: 1308 bytes --]

Tags: patch

Hello,

I am now studying the performance of Org mode parser on huge Org files.

I noticed that `org-element-parse-buffer' spends a significant (~10%)
fraction of CPU time simply compiling regexp patterns.

This happens because Org parser performs a huge number repeated regexp
searches as it incrementally parses the buffer. The searches happen on a
fixed set of regexp patterns (several dozens).

I was able to get rid of the regex compilation-related slowdown simply
by increasing REGEXP_CACHE_SIZE 10x (see the attached patch).

Does anyone know if there are potential side effects of this increase if
applied across Emacs? Or, alternatively, may Emacs provide an ability to
store compiled regexp patterns from Elisp (similar to what
`treesit-query-compile' does)?

I suspect that storing pre-compiled patterns may benefit a number of
major modes that have to perform complex regexp matching.

Best,
Ihor

In GNU Emacs 30.0.50 (build 4, x86_64-pc-linux-gnu, GTK+ Version
 3.24.37, cairo version 1.17.8) of 2023-05-02 built on localhost
Repository revision: a0a71ca12d585bca5173775f08eabae553e15659
Repository branch: master
Windowing system distributor 'The X.Org Foundation', version 11.0.12101008
System Description: Gentoo Linux

Configured using:
 'configure --with-native-compilation'


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-src-search.c-REGEXP_CACHE_SIZE-Increase-to-200.patch --]
[-- Type: text/patch, Size: 816 bytes --]

From f0d9c814ff0601fa3487ad3da1ae4dcdf511d850 Mon Sep 17 00:00:00 2001
Message-Id: <f0d9c814ff0601fa3487ad3da1ae4dcdf511d850.1683012524.git.yantar92@posteo.net>
From: Ihor Radchenko <yantar92@posteo.net>
Date: Tue, 2 May 2023 09:28:21 +0200
Subject: [PATCH] * src/search.c (REGEXP_CACHE_SIZE): Increase to 200

---
 src/search.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/search.c b/src/search.c
index 0bb52c03eef..cfcce3c7293 100644
--- a/src/search.c
+++ b/src/search.c
@@ -34,7 +34,7 @@ Copyright (C) 1985-1987, 1993-1994, 1997-1999, 2001-2023 Free Software
 
 #include "regex-emacs.h"
 
-#define REGEXP_CACHE_SIZE 20
+#define REGEXP_CACHE_SIZE 200
 
 /* If the regexp is non-nil, then the buffer contains the compiled form
    of that regexp, suitable for searching.  */
-- 
2.40.0


[-- Attachment #3: Type: text/plain, Size: 224 bytes --]


-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-02  7:37 bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
@ 2023-05-02 14:33 ` Mattias Engdegård
  2023-05-02 15:25   ` Eli Zaretskii
                     ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-02 14:33 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]

> I was able to get rid of the regex compilation-related slowdown simply
> by increasing REGEXP_CACHE_SIZE 10x (see the attached patch).

Indeed it sounds like you are suffering from regexp cache thrashing. I'm attaching two patches: one to measure the cache miss rate, and one that allows the regexp cache size to be changed at run time.

That should let you find the working set size for your application, and ideally come up with a way to reduce it. Perhaps you could give us an idea of what these regexps look like and how they are used?

> Does anyone know if there are potential side effects of this increase if
> applied across Emacs? Or, alternatively, may Emacs provide an ability to
> store compiled regexp patterns from Elisp (similar to what
> `treesit-query-compile' does)?

I don't think it's necessarily a good idea to increase the size to 200 right away because of the linear cache lookup mechanism. Allowing the size to be changed at run time is probably less controversial (but arguably just as much of a crutch).

Introducing regexp objects that could store compiled regexps and be used instead of strings would be quite some work but probably worthwhile.


[-- Attachment #2: 0001-Add-regexp-cache-hit-miss-counters.patch --]
[-- Type: application/octet-stream, Size: 1785 bytes --]

From f1246af3cc558bd38527f320964bb0e0a1e74de0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sat, 7 Nov 2020 17:00:53 +0100
Subject: [PATCH 1/2] Add regexp cache hit/miss counters

---
 src/search.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/search.c b/src/search.c
index 0bb52c03eef..6f71f3d16c1 100644
--- a/src/search.c
+++ b/src/search.c
@@ -220,7 +220,10 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
 	      || EQ (cp->syntax_table, BVAR (current_buffer, syntax_table)))
 	  && !NILP (Fequal (cp->f_whitespace_regexp, Vsearch_spaces_regexp))
 	  && cp->buf.charset_unibyte == charset_unibyte)
-	break;
+        {
+          regexp_cache_hit++;
+          break;
+        }
 
       /* If we're at the end of the cache, compile into the last
 	 (least recently used) non-busy cell in the cache.  */
@@ -232,6 +235,7 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
           cp = *cpp;
 	compile_it:
           eassert (!cp->busy);
+          regexp_cache_miss++;
 	  compile_pattern_1 (cp, pattern, translate, posix);
 	  break;
 	}
@@ -3431,6 +3435,13 @@ syms_of_search (void)
 is to bind it with `let' around a small expression.  */);
   Vinhibit_changing_match_data = Qnil;
 
+  DEFVAR_INT("regexp-cache-hit", regexp_cache_hit,
+             doc: /* Regexp cache hit count.  Internal use only. */);
+  regexp_cache_hit = 0;
+  DEFVAR_INT("regexp-cache-miss", regexp_cache_miss,
+             doc: /* Regexp cache miss count.  Internal use only. */);
+  regexp_cache_miss = 0;
+
   defsubr (&Slooking_at);
   defsubr (&Sposix_looking_at);
   defsubr (&Sstring_match);
-- 
2.32.0 (Apple Git-132)


[-- Attachment #3: 0002-Make-regexp-cache-size-adjustable-at-run-time.patch --]
[-- Type: application/octet-stream, Size: 5606 bytes --]

From 81461c26b7e40e7c27804407571199a5247815d4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 2 May 2023 16:23:11 +0200
Subject: [PATCH 2/2] Make regexp cache size adjustable at run-time

---
 src/alloc.c  |  2 ++
 src/lisp.h   |  1 +
 src/search.c | 85 +++++++++++++++++++++++++++++++++++++++-------------
 3 files changed, 67 insertions(+), 21 deletions(-)

diff --git a/src/alloc.c b/src/alloc.c
index d09fc41dec6..0b11b5469a9 100644
--- a/src/alloc.c
+++ b/src/alloc.c
@@ -6451,6 +6451,8 @@ garbage_collect (void)
   mark_terminals ();
   mark_kboards ();
   mark_threads ();
+  mark_regexp_cache ();
+
 #ifdef HAVE_PGTK
   mark_pgtkterm ();
 #endif
diff --git a/src/lisp.h b/src/lisp.h
index ab66109d5df..8b2beadefd6 100644
--- a/src/lisp.h
+++ b/src/lisp.h
@@ -4745,6 +4745,7 @@ XMODULE_FUNCTION (Lisp_Object o)
 extern void syms_of_fileio (void);
 
 /* Defined in search.c.  */
+extern void mark_regexp_cache (void);
 extern void shrink_regexp_cache (void);
 extern void restore_search_regs (void);
 extern void update_search_regs (ptrdiff_t oldstart,
diff --git a/src/search.c b/src/search.c
index 6f71f3d16c1..c454d5e1ca9 100644
--- a/src/search.c
+++ b/src/search.c
@@ -34,7 +34,7 @@ Copyright (C) 1985-1987, 1993-1994, 1997-1999, 2001-2023 Free Software
 
 #include "regex-emacs.h"
 
-#define REGEXP_CACHE_SIZE 20
+#define DEFAULT_REGEXP_CACHE_SIZE 20
 
 /* If the regexp is non-nil, then the buffer contains the compiled form
    of that regexp, suitable for searching.  */
@@ -55,7 +55,9 @@ #define REGEXP_CACHE_SIZE 20
 };
 
 /* The instances of that struct.  */
-static struct regexp_cache searchbufs[REGEXP_CACHE_SIZE];
+static struct regexp_cache *searchbufs;
+/* Allocated size of searchbufs array, in elements.  */
+static int regexp_cache_size = 0;
 
 /* The head of the linked list; points to the most recently used buffer.  */
 static struct regexp_cache *searchbuf_head;
@@ -158,7 +160,7 @@ clear_regexp_cache (void)
 {
   int i;
 
-  for (i = 0; i < REGEXP_CACHE_SIZE; ++i)
+  for (i = 0; i < regexp_cache_size; ++i)
     /* It's tempting to compare with the syntax-table we've actually changed,
        but it's not sufficient because char-table inheritance means that
        modifying one syntax-table can change others at the same time.  */
@@ -3376,18 +3378,68 @@ DEFUN ("newline-cache-check", Fnewline_cache_check, Snewline_cache_check,
 }
 \f
 
-static void syms_of_search_for_pdumper (void);
+static void
+set_regexp_cache_size (int n)
+{
+  for (int i = 0; i < regexp_cache_size; i++)
+    xfree (searchbufs[i].buf.buffer);
+
+  size_t bytes = n * sizeof *searchbufs;
+  searchbufs = xrealloc (searchbufs, bytes);
+  memset (searchbufs, 0, bytes);
+  regexp_cache_size = n;
+
+  for (int i = 0; i < n; i++)
+    {
+      searchbufs[i].buf.allocated = 100;
+      searchbufs[i].buf.buffer = xmalloc (100);
+      searchbufs[i].buf.fastmap = searchbufs[i].fastmap;
+      searchbufs[i].regexp = Qnil;
+      searchbufs[i].f_whitespace_regexp = Qnil;
+      searchbufs[i].busy = false;
+      searchbufs[i].syntax_table = Qnil;
+      searchbufs[i].next = (i == regexp_cache_size-1 ? 0 : &searchbufs[i+1]);
+    }
+  searchbuf_head = &searchbufs[0];
+}
+
+DEFUN ("regexp-cache-size", Fregexp_cache_size, Sregexp_cache_size,
+       0, 0, 0,
+       doc: /* Current regexp cache size.  Internal use only.  */)
+  (void)
+{
+  return make_int (regexp_cache_size);
+}
+
+DEFUN ("set-regexp-cache-size", Fset_regexp_cache_size, Sset_regexp_cache_size,
+       1, 1, 0,
+       doc: /* Set the regexp cache size to N elements.  Internal use only.  */)
+  (Lisp_Object n)
+{
+  CHECK_FIXNUM (n);
+  EMACS_INT nelems = XFIXNUM (n);
+  if (nelems <= 0)
+    error ("regexp cache size must be positive");
+  set_regexp_cache_size (nelems);
+  return Qnil;
+}
 
 void
-syms_of_search (void)
+mark_regexp_cache (void)
 {
-  for (int i = 0; i < REGEXP_CACHE_SIZE; ++i)
+  for (int i = 0; i < regexp_cache_size; ++i)
     {
-      staticpro (&searchbufs[i].regexp);
-      staticpro (&searchbufs[i].f_whitespace_regexp);
-      staticpro (&searchbufs[i].syntax_table);
+      mark_object (searchbufs[i].regexp);
+      mark_object (searchbufs[i].f_whitespace_regexp);
+      mark_object (searchbufs[i].syntax_table);
     }
+}
 
+static void syms_of_search_for_pdumper (void);
+
+void
+syms_of_search (void)
+{
   /* Error condition used for failing searches.  */
   DEFSYM (Qsearch_failed, "search-failed");
 
@@ -3460,6 +3512,8 @@ syms_of_search (void)
   defsubr (&Smatch_data__translate);
   defsubr (&Sregexp_quote);
   defsubr (&Snewline_cache_check);
+  defsubr (&Sregexp_cache_size);
+  defsubr (&Sset_regexp_cache_size);
 
   pdumper_do_now_and_after_load (syms_of_search_for_pdumper);
 }
@@ -3467,16 +3521,5 @@ syms_of_search (void)
 static void
 syms_of_search_for_pdumper (void)
 {
-  for (int i = 0; i < REGEXP_CACHE_SIZE; ++i)
-    {
-      searchbufs[i].buf.allocated = 100;
-      searchbufs[i].buf.buffer = xmalloc (100);
-      searchbufs[i].buf.fastmap = searchbufs[i].fastmap;
-      searchbufs[i].regexp = Qnil;
-      searchbufs[i].f_whitespace_regexp = Qnil;
-      searchbufs[i].busy = false;
-      searchbufs[i].syntax_table = Qnil;
-      searchbufs[i].next = (i == REGEXP_CACHE_SIZE-1 ? 0 : &searchbufs[i+1]);
-    }
-  searchbuf_head = &searchbufs[0];
+  set_regexp_cache_size (DEFAULT_REGEXP_CACHE_SIZE);
 }
-- 
2.32.0 (Apple Git-132)


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-02 14:33 ` Mattias Engdegård
@ 2023-05-02 15:25   ` Eli Zaretskii
  2023-05-02 15:28     ` Mattias Engdegård
  2023-05-02 16:14   ` Ihor Radchenko
  2023-05-02 23:36   ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2023-05-02 15:25 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225, yantar92

> Cc: 63225@debbugs.gnu.org
> From: Mattias Engdegård <mattiase@acm.org>
> Date: Tue, 2 May 2023 16:33:58 +0200
> 
> Indeed it sounds like you are suffering from regexp cache thrashing. I'm attaching two patches: one to measure the cache miss rate, and one that allows the regexp cache size to be changed at run time.

Thanks, but the new primitives need to be documented (why do the doc
strings say "internal use only"?), and also we should include the
cache size in the output of memory-report.

> +static void
> +set_regexp_cache_size (int n)
> +{
> +  for (int i = 0; i < regexp_cache_size; i++)
> +    xfree (searchbufs[i].buf.buffer);
> +
> +  size_t bytes = n * sizeof *searchbufs;
> +  searchbufs = xrealloc (searchbufs, bytes);
> +  memset (searchbufs, 0, bytes);
> +  regexp_cache_size = n;
> +

Should this first check that the new size is identical to the old one,
and if so, do nothing?  Come to think of this, do we really need to
realloc if the new size is smaller?  And why zero out the cache when
changing the size?

> +DEFUN ("regexp-cache-size", Fregexp_cache_size, Sregexp_cache_size,
> +       0, 0, 0,
> +       doc: /* Current regexp cache size.  Internal use only.  */)
> +  (void)
> +{
> +  return make_int (regexp_cache_size);
> +}

Since the size of the cache is a fixnum, why not use make_fixnum here?
it's a tad faster.





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-02 15:25   ` Eli Zaretskii
@ 2023-05-02 15:28     ` Mattias Engdegård
  2023-05-02 17:30       ` Eli Zaretskii
  0 siblings, 1 reply; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-02 15:28 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 63225, yantar92

2 maj 2023 kl. 17.25 skrev Eli Zaretskii <eliz@gnu.org>:

> Thanks, but the new primitives need to be documented

These patches were not proposed for inclusion in Emacs but to help Ihor solve his problems in other ways. Sorry about not making it clear.






^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-02 14:33 ` Mattias Engdegård
  2023-05-02 15:25   ` Eli Zaretskii
@ 2023-05-02 16:14   ` Ihor Radchenko
  2023-05-02 21:00     ` Mattias Engdegård
  2023-05-02 23:36   ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2 siblings, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-02 16:14 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattiase@acm.org> writes:

>> I was able to get rid of the regex compilation-related slowdown simply
>> by increasing REGEXP_CACHE_SIZE 10x (see the attached patch).
>
> Indeed it sounds like you are suffering from regexp cache thrashing. I'm attaching two patches: one to measure the cache miss rate, and one that allows the regexp cache size to be changed at run time.

Here are the results:

Command: (benchmark-progn
           (setq regexp-cache-hit 0
                 regexp-cache-miss 0)
           (set-regexp-cache-size 42)
           (org-element-parse-buffer)
           nil)
Buffer size: 22Mb
|   Cache size |     Hit |    Miss | % miss from total | ~org-element-parse-buffer~ time   |
|--------------+---------+---------+-------------------+---------------------------------|
| 20 (default) | 3219470 | 1491165 |             31.66 | 21.035765s (1.091127s in 2 GCs) |
|           40 | 4418377 |  293805 |              6.24 | 18.294018s (1.123854s in 2 GCs) |
|           42 | 4550483 |  161820 |              3.43 | 17.946184s (1.073528s in 2 GCs) |
|           45 | 4636222 |   76582 |              1.62 | 18.410150s (1.078844s in 2 GCs) |
|           50 | 4693497 |   44174 |              0.93 | 17.896177s (1.082944s in 2 GCs) |
|           60 | 4734712 |   10807 |              0.23 | 18.011224s (1.097961s in 2 GCs) |
|           80 | 4710155 |    1386 |              0.03 | 18.047544s (1.103518s in 2 GCs) |
|          100 | 4711821 |     399 |              0.01 | 17.880491s (1.102658s in 2 GCs) |
|          160 | 4711895 |     160 |              0.00 | 17.950772s (1.068975s in 2 GCs) |
|          320 | 4737968 |     393 |              0.01 | 17.773617s (1.089100s in 2 GCs) |
|          640 | 4737388 |     320 |              0.01 | 18.225701s (1.097688s in 2 GCs) |
|         1280 | 4711353 |     160 |              0.00 | 17.847522s (1.099575s in 2 GCs) |
|         2560 | 4711898 |     160 |              0.00 | 18.168488s (1.082394s in 2 GCs) |
|         5120 | 4711835 |     160 |              0.00 | 17.797036s (1.097445s in 2 GCs) |
#+TBLFM: $4=100*$3/($3+$2);%.2f

> That should let you find the working set size for your application,
> and ideally come up with a way to reduce it. Perhaps you could give us
> an idea of what these regexps look like and how they are used?

The Org parser is basically a giant `cond' of a number of regexp
matches. See `org-element--current-element'. It is called repeatedly on
every syntax element in Org buffer (like heading, table, paragraph,
etc). Each clause in the `cond' additionally calls for more complex
series regexps to look into smaller components of the parsed syntax
elements. For example, see `org-element-keyword-parser'.

So, we are cycling across several dozens (more than regexp cache size)
of regexps repeatedly.

>> Does anyone know if there are potential side effects of this increase if
>> applied across Emacs? Or, alternatively, may Emacs provide an ability to
>> store compiled regexp patterns from Elisp (similar to what
>> `treesit-query-compile' does)?
>
> I don't think it's necessarily a good idea to increase the size to 200
> right away because of the linear cache lookup mechanism. Allowing the
> size to be changed at run time is probably less controversial (but
> arguably just as much of a crutch).

Fair point. Although overshooting within a single command does not
appear to do much as long as we really re-use these regexps - everything
gets cached.

> Introducing regexp objects that could store compiled regexps and be
> used instead of strings would be quite some work but probably
> worthwhile.

What exactly needs to be done? Assuming that regexp objects are not
going to be readable, for simplicity.

> +DEFUN ("set-regexp-cache-size", Fset_regexp_cache_size, Sset_regexp_cache_size,
> +       1, 1, 0,
> +       doc: /* Set the regexp cache size to N elements.  Internal use only.  */)

If this is something to be used in practice, it will be more convenient
to provide a macro like (with-regexp-cache-size N <body>).

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-02 15:28     ` Mattias Engdegård
@ 2023-05-02 17:30       ` Eli Zaretskii
  2023-05-02 17:58         ` Ihor Radchenko
  0 siblings, 1 reply; 35+ messages in thread
From: Eli Zaretskii @ 2023-05-02 17:30 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225, yantar92

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Tue, 2 May 2023 17:28:31 +0200
> Cc: yantar92@posteo.net,
>  63225@debbugs.gnu.org
> 
> 2 maj 2023 kl. 17.25 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > Thanks, but the new primitives need to be documented
> 
> These patches were not proposed for inclusion in Emacs but to help Ihor solve his problems in other ways. Sorry about not making it clear.

I thought you said that exposing the cache size to Lisp _was_ the way
to better support Lisp programs that need to use huge amounts of
regular expressions, no?  If not, how will those patches be able to
help Ihor, given that he wants to solve them in Emacs?





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-02 17:30       ` Eli Zaretskii
@ 2023-05-02 17:58         ` Ihor Radchenko
  0 siblings, 0 replies; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-02 17:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 63225, Mattias Engdegård

Eli Zaretskii <eliz@gnu.org> writes:

>> These patches were not proposed for inclusion in Emacs but to help Ihor solve his problems in other ways. Sorry about not making it clear.
>
> I thought you said that exposing the cache size to Lisp _was_ the way
> to better support Lisp programs that need to use huge amounts of
> regular expressions, no?  If not, how will those patches be able to
> help Ihor, given that he wants to solve them in Emacs?

Well. My original simplistic proposal was to increase REGEXP_CACHE_SIZE.
Exposing this to Elisp certainly gives more flexibility, but also has
downsides (due to linear search across the regexp cache).
Ultimately, compiled regexp objects should be much more universal and
will not require extensive benchmarking to balance between regexp
compilation and increasing regexp cache search latency. But adding a new
Elisp object type is a big deal.

For now, we should probably study what is going on in my use case more
generally and maybe figure out if some alternative approach could be
better. If a simple solution will do, it may be good enough.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-02 16:14   ` Ihor Radchenko
@ 2023-05-02 21:00     ` Mattias Engdegård
  2023-05-02 21:21       ` Ihor Radchenko
  0 siblings, 1 reply; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-02 21:00 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

2 maj 2023 kl. 18.14 skrev Ihor Radchenko <yantar92@posteo.net>:

> |   Cache size |     Hit |    Miss | % miss from total | ~org-element-parse-buffer~ time   |
> |--------------+---------+---------+-------------------+---------------------------------|
> | 20 (default) | 3219470 | 1491165 |             31.66 | 21.035765s (1.091127s in 2 GCs) |
> |           40 | 4418377 |  293805 |              6.24 | 18.294018s (1.123854s in 2 GCs) |
> |           42 | 4550483 |  161820 |              3.43 | 17.946184s (1.073528s in 2 GCs) |
> |           45 | 4636222 |   76582 |              1.62 | 18.410150s (1.078844s in 2 GCs) |

Good, this quite solidly puts the working set size at 40-odd elements.

> The Org parser is basically a giant `cond' of a number of regexp
> matches. See `org-element--current-element'.

A common way to handle this is to build a big regexp to match many cases at the same time, essentially transforming

(cond ((looking-at RE1) ...)
      ((looking-at RE2) ...)
      ...)

to

    (looking-at (rx (or (group RE1) (group RE2) ...)))
    (cond ((match-beginning 1) ...)
          ((match-beginning 2) ...)
          ...)

This reduces the number of regexps used and is also typically faster.
(Essentially this is what `syntax-propertize-rules` does but in a more specialised context.)

Using tree-sitter for this could very well be even faster but it's not guaranteed to be available.

Otherwise it's very much a matter of optimisation of everything, including regexps. Minimise backtracking.
If you want to match five or more dashes, use "------*" instead of "-\\{5,\\}". And so on.
It's also obviously a good idea not to generate regexps dynamically each time if you can help it, and minimise consing in general.

>> Introducing regexp objects that could store compiled regexps and be
>> used instead of strings would be quite some work but probably
>> worthwhile.
> 
> What exactly needs to be done? Assuming that regexp objects are not
> going to be readable, for simplicity.

A proper design, for starters. For example, we probably want them to be usable in customised variables which calls for them to be readable.

> If this is something to be used in practice, it will be more convenient
> to provide a macro like (with-regexp-cache-size N <body>).

Maybe, we'll see if it's something we need to add.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-02 21:00     ` Mattias Engdegård
@ 2023-05-02 21:21       ` Ihor Radchenko
  2023-05-03  8:39         ` Mattias Engdegård
  0 siblings, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-02 21:21 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> The Org parser is basically a giant `cond' of a number of regexp
>> matches. See `org-element--current-element'.
>
> A common way to handle this is to build a big regexp to match many cases at the same time, essentially transforming
>
> (cond ((looking-at RE1) ...)
>       ((looking-at RE2) ...)
>       ...)
>
> to
>
>     (looking-at (rx (or (group RE1) (group RE2) ...)))
>     (cond ((match-beginning 1) ...)
>           ((match-beginning 2) ...)
>           ...)
>
> This reduces the number of regexps used and is also typically faster.
> (Essentially this is what `syntax-propertize-rules` does but in a more specialised context.)

I tried this, and it did not give any noticeable improvement.
Most likely, because the actual `cond' is

  (cond ((looking-at "foo) (<parser function calling many more regex searches>)) ...)

Actually, I started looking into C-level perf data right after trying to
consolidate the regexps into one giant looking-at form and not seeing
any improvement. At that point, I already cached most of the
dynamically generated regexps in there and ran out of reasonable ideas.

> Using tree-sitter for this could very well be even faster but it's not guaranteed to be available.

The available tree-sitter grammars for Org are about 1.5-2x faster in my
previous tests, but they do less granular parsing of fields. And not
complete. Org is not context-free and does not fit well into GLR.

And we are not going to use tree sitter for development to avoid
increasing contribution barriers.

> Otherwise it's very much a matter of optimisation of everything, including regexps. Minimise backtracking.
> If you want to match five or more dashes, use "------*" instead of "-\\{5,\\}". And so on.

This example sounds like something that regexp compilation should be
able to optimize, no? I do not easily see why the latter should cause
more CPU time compared to the former.

> It's also obviously a good idea not to generate regexps dynamically each time if you can help it, and minimise consing in general.

Sure. I was able to shave a few seconds off using this idea.
Other than regexp compilation hotspot, I only see re-writing parser
non-recursively (`org-element--parse-elements' and
`org-element--parse-objects').

>>> Introducing regexp objects that could store compiled regexps and be
>>> used instead of strings would be quite some work but probably
>>> worthwhile.
>> 
>> What exactly needs to be done? Assuming that regexp objects are not
>> going to be readable, for simplicity.
>
> A proper design, for starters. For example, we probably want them to be usable in customised variables which calls for them to be readable.

Or, alternatively, the parsed regexps can be attached to string objects
internally. Then, regexp cache lookup will degenerate to looking into a
string object slot.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-02 14:33 ` Mattias Engdegård
  2023-05-02 15:25   ` Eli Zaretskii
  2023-05-02 16:14   ` Ihor Radchenko
@ 2023-05-02 23:36   ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2 siblings, 0 replies; 35+ messages in thread
From: Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-05-02 23:36 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225, Ihor Radchenko

Mattias Engdegård <mattiase@acm.org> writes:

>> I was able to get rid of the regex compilation-related slowdown simply
>> by increasing REGEXP_CACHE_SIZE 10x (see the attached patch).
>
> Indeed it sounds like you are suffering from regexp cache thrashing. I'm attaching two patches: one to measure the cache miss rate, and one that allows the regexp cache size to be changed at run time.
>
> That should let you find the working set size for your application, and ideally come up with a way to reduce it. Perhaps you could give us an idea of what these regexps look like and how they are used?
>
>> Does anyone know if there are potential side effects of this increase if
>> applied across Emacs? Or, alternatively, may Emacs provide an ability to
>> store compiled regexp patterns from Elisp (similar to what
>> `treesit-query-compile' does)?
>
> I don't think it's necessarily a good idea to increase the size to 200
> right away because of the linear cache lookup mechanism. Allowing the
> size to be changed at run time is probably less controversial (but
> arguably just as much of a crutch).
>
> Introducing regexp objects that could store compiled regexps and be used instead of strings would be quite some work but probably worthwhile.

Thanks for curing this instance of C programmer's disease.

> From f1246af3cc558bd38527f320964bb0e0a1e74de0 Mon Sep 17 00:00:00 2001
> From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
> Date: Sat, 7 Nov 2020 17:00:53 +0100
> Subject: [PATCH 1/2] Add regexp cache hit/miss counters
>
> ---
>  src/search.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/src/search.c b/src/search.c
> index 0bb52c03eef..6f71f3d16c1 100644
> --- a/src/search.c
> +++ b/src/search.c
> @@ -220,7 +220,10 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
>  	      || EQ (cp->syntax_table, BVAR (current_buffer, syntax_table)))
>  	  && !NILP (Fequal (cp->f_whitespace_regexp, Vsearch_spaces_regexp))
>  	  && cp->buf.charset_unibyte == charset_unibyte)
> -	break;
> +        {
> +          regexp_cache_hit++;
> +          break;
> +        }
>  
>        /* If we're at the end of the cache, compile into the last
>  	 (least recently used) non-busy cell in the cache.  */
> @@ -232,6 +235,7 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
>            cp = *cpp;
>  	compile_it:
>            eassert (!cp->busy);
> +          regexp_cache_miss++;
>  	  compile_pattern_1 (cp, pattern, translate, posix);
>  	  break;
>  	}
> @@ -3431,6 +3435,13 @@ syms_of_search (void)
>  is to bind it with `let' around a small expression.  */);
>    Vinhibit_changing_match_data = Qnil;
>  
> +  DEFVAR_INT("regexp-cache-hit", regexp_cache_hit,
> +             doc: /* Regexp cache hit count.  Internal use only. */);
> +  regexp_cache_hit = 0;
> +  DEFVAR_INT("regexp-cache-miss", regexp_cache_miss,
> +             doc: /* Regexp cache miss count.  Internal use only. */);
> +  regexp_cache_miss = 0;

Please put a space between `DEFVAR_INT' and `('.





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-02 21:21       ` Ihor Radchenko
@ 2023-05-03  8:39         ` Mattias Engdegård
  2023-05-03  9:36           ` Ihor Radchenko
  0 siblings, 1 reply; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-03  8:39 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

2 maj 2023 kl. 23.21 skrev Ihor Radchenko <yantar92@posteo.net>:

> I tried this, and it did not give any noticeable improvement.
> Most likely, because the actual `cond' is
> 
>  (cond ((looking-at "foo) (<parser function calling many more regex searches>)) ...)

I see, so it doesn't run through all top-level cases very often then? I thought that would be the common path (plain text).

Would consolidating some of the secondary regexps help at all? What are the most frequent branches in the parser?

Perhaps you just don't see much improvement until the working set of regexps fits in the cache.

>> Otherwise it's very much a matter of optimisation of everything, including regexps. Minimise backtracking.
>> If you want to match five or more dashes, use "------*" instead of "-\\{5,\\}". And so on.
> 
> This example sounds like something that regexp compilation should be
> able to optimize, no? I do not easily see why the latter should cause
> more CPU time compared to the former.

It's a trivial point and definitely not the source of your problems, sorry! (Counted repetitions are slightly less efficient because they need to maintain the counter, it's all done in a terrible way.)
The regexp compiler doesn't do much optimisation in order to keep the translation fast. It doesn't even convert "[a]" to "a".

> Or, alternatively, the parsed regexps can be attached to string objects
> internally. Then, regexp cache lookup will degenerate to looking into a
> string object slot.

That would work too but we really don't want to make our strings any fancier, they are already much too big and slow.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-03  8:39         ` Mattias Engdegård
@ 2023-05-03  9:36           ` Ihor Radchenko
  2023-05-03 13:59             ` Mattias Engdegård
  0 siblings, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-03  9:36 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> I tried this, and it did not give any noticeable improvement.
>> Most likely, because the actual `cond' is
>> 
>>  (cond ((looking-at "foo) (<parser function calling many more regex searches>)) ...)
>
> I see, so it doesn't run through all top-level cases very often then? I thought that would be the common path (plain text).

You are indeed right. Top-level cases are ran very often. So, what I
said does not make much sense.

Yet, in my tests, I am unable to see any improvement when I consolidate
the regexps.

If I do (progn (set-regexp-cache-size 50) (org-element-parse-buffer) nil)

Without consolidation, but using `looking-at-p' as much as possible:

Profiler top
        ;; 4160  21% + org-element--current-element
        ;; 2100  10% + org-element--parse-elements
        ;; 1894   9% + org-element--parse-objects
        ;; 1422   7%   Automatic GC
        ;;  871   4% + org-element--headline-deferred
        ;;  806   4% + apply
        ;;  796   4% + org-element-create
        ;;  638   3% + org-element--list-struct

Perf top
    ;; 16.72%  emacs         emacs                                  [.] re_match_2_internal
    ;;  7.16%  emacs         emacs                                  [.] exec_byte_code
    ;;  4.08%  emacs         emacs                                  [.] funcall_subr
    ;;  4.06%  emacs         emacs                                  [.] re_search_2

With consolidation into a giant rx (or ...) with groups:

        ;; 4158  21% + org-element--current-element
        ;; 2163  11% + org-element--parse-objects
        ;; 1796   9% + org-element--parse-elements
        ;; 1276   6%   Automatic GC
        ;;  921   4% + org-element--headline-deferred
        ;;  833   4% + apply
        ;;  793   4% + org-element-create
        ;;  660   3% + org-element--list-struct

    ;; 16.44%  emacs         emacs                                  [.] re_match_2_internal
    ;;  7.03%  emacs         emacs                                  [.] exec_byte_code
    ;;  6.78%  emacs         emacs                                  [.] process_mark_stack
    ;;  4.05%  emacs         emacs                                  [.] re_search_2
    ;;  4.02%  emacs         emacs                                  [.] funcall_subr

The version with giant single rx form is actually slower overall (!),
making no difference at all in `org-element--current-element'.

> Perhaps you just don't see much improvement until the working set of regexps fits in the cache.

As you see, I now increased cache size to 50. No improvement. Same with
my observations on current master.

> The regexp compiler doesn't do much optimisation in order to keep the translation fast. It doesn't even convert "[a]" to "a".

I guess that it is another thing that could be improved if we were to
have compiled regexp objects. Compilation time would not matter as much.

Ideally, the compiler should do something similar to
what https://www.colm.net/open-source/ragel/ does.

>> Or, alternatively, the parsed regexps can be attached to string objects
>> internally. Then, regexp cache lookup will degenerate to looking into a
>> string object slot.
>
> That would work too but we really don't want to make our strings any fancier, they are already much too big and slow.

Then, what about making compiled regexp object similar to string, but
with plist slot replaced by compiled regexp slot? Maybe some other slots
removed (I am not very familiar with specific of internal string
representation)

AFAIU, compiled regexp read/write syntax can be uniquely represented
simply by a string. Something like #r"[a-z]+" (maybe even with special
handling for backslashes, like proposed in
https://yhetil.org/emacs-devel/4209edd83cfee7c84b2d75ebfcd38784fa21b23c.camel@crossproduct.net)

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-03  9:36           ` Ihor Radchenko
@ 2023-05-03 13:59             ` Mattias Engdegård
  2023-05-03 15:05               ` Ihor Radchenko
  2023-05-04 12:58               ` Ihor Radchenko
  0 siblings, 2 replies; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-03 13:59 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

3 maj 2023 kl. 11.36 skrev Ihor Radchenko <yantar92@posteo.net>:

> Yet, in my tests, I am unable to see any improvement when I consolidate
> the regexps.

That's odd, but do you get a better cache hit rate (assuming a cache size of 20)?

> The version with giant single rx form is actually slower overall (!),
> making no difference at all in `org-element--current-element'.

Can't say what's going on here, really. Normally a combined regexp shouldn't be slower. Are you sure you get the same parse?

> Ideally, the compiler should do something similar to
> what https://www.colm.net/open-source/ragel/ does.

Yes, constructing a DFA would be more realistic when it's less in danger of being thrown away at any time.






^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-03 13:59             ` Mattias Engdegård
@ 2023-05-03 15:05               ` Ihor Radchenko
  2023-05-03 15:20                 ` Mattias Engdegård
  2023-05-04 12:58               ` Ihor Radchenko
  1 sibling, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-03 15:05 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> 3 maj 2023 kl. 11.36 skrev Ihor Radchenko <yantar92@posteo.net>:
>
>> Yet, in my tests, I am unable to see any improvement when I consolidate
>> the regexps.
>
> That's odd, but do you get a better cache hit rate (assuming a cache size of 20)?

With the default cache size of 20,
(benchmark-progn
  (setq regexp-cache-hit 0 regexp-cache-miss 0)
  (set-regexp-cache-size 20)
  (org-element-parse-buffer)
  nil)

(cond ((looking-at-p ...) ...)) gives

misses: 1493570
hits: 3225203
% misses from total: 31%

giant rx + looking-at gives

misses: 1177242
hits: 3233553
% misses from total: 27%

>> The version with giant single rx form is actually slower overall (!),
>> making no difference at all in `org-element--current-element'.
>
> Can't say what's going on here, really. Normally a combined regexp
> shouldn't be slower. Are you sure you get the same parse?

All the tests are passing...

Note that I am using `looking-at-p'

I now also tried replacing `looking-at-p' with `looking-at' and I get

        4880  21% + org-element--current-element

        (previous data with `looking-at-p')
        4160  21% + org-element--current-element 

with total time increasing compared to the version with `looking-at-p'
(21.743226s (1.364015s in 2 GCs) compared to 21.035765s (1.091127s in 2 GCs))

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>






^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-03 15:05               ` Ihor Radchenko
@ 2023-05-03 15:20                 ` Mattias Engdegård
  2023-05-03 16:02                   ` Ihor Radchenko
  0 siblings, 1 reply; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-03 15:20 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

3 maj 2023 kl. 17.05 skrev Ihor Radchenko <yantar92@posteo.net>:

> (cond ((looking-at-p ...) ...)) gives
> 
> misses: 1493570
> hits: 3225203
> % misses from total: 31%
> 
> giant rx + looking-at gives
> 
> misses: 1177242
> hits: 3233553
> % misses from total: 27%

Maybe you should instrument the regexp engine and log the pattern and whether compilation was needed to a file. Run on a reduced dataset, and see if the sequence of regexps being exercised, and their frequencies, are consistent with what you expect.







^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-03 15:20                 ` Mattias Engdegård
@ 2023-05-03 16:02                   ` Ihor Radchenko
  2023-05-04  9:24                     ` Mattias Engdegård
  0 siblings, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-03 16:02 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> Maybe you should instrument the regexp engine and log the pattern and whether compilation was needed to a file. Run on a reduced dataset, and see if the sequence of regexps being exercised, and their frequencies, are consistent with what you expect.

Sorry, but I am starting to lose track of the purpose here.
What is the aim of instrumenting regexp engine in this scenario?
I already know that additional regexps will be tested by individual
`org-element-X-parser' functions.
I am also not sure how to instrument the regexp engine and what I can
see there.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-03 16:02                   ` Ihor Radchenko
@ 2023-05-04  9:24                     ` Mattias Engdegård
  2023-05-05 10:31                       ` Ihor Radchenko
  0 siblings, 1 reply; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-04  9:24 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

3 maj 2023 kl. 18.02 skrev Ihor Radchenko <yantar92@posteo.net>:

> What is the aim of instrumenting regexp engine in this scenario?
> I already know that additional regexps will be tested by individual
> `org-element-X-parser' functions.

I got the impression that the 'spine' of the parser, the sequence of `looking-at` calls in `org-element--current-element`, would frequently be run through in its entirety which means that consolidating these would reduce the number of working regexps by about 20 (if I'm counting correctly).

Now if as you suggest the parsing is dominated by sequences of regexps in the branches, it prompts the questions: which branches, what regexps, why are there so many of them, and is there anything that can be done to reduce their number?

> I am also not sure how to instrument the regexp engine and what I can
> see there.

Sorry, it is just what I who know nothing about the structure of Org would do to get a better view. You may find it easier to work at the Lisp level.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-03 13:59             ` Mattias Engdegård
  2023-05-03 15:05               ` Ihor Radchenko
@ 2023-05-04 12:58               ` Ihor Radchenko
  1 sibling, 0 replies; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-04 12:58 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> Ideally, the compiler should do something similar to
>> what https://www.colm.net/open-source/ragel/ does.
>
> Yes, constructing a DFA would be more realistic when it's less in danger of being thrown away at any time.

I tried to look closer into regex-emacs.c and I note that the regexp
compiler basically generates a specialized bytecode to be executed by
regexp matcher later.

So, may the existing vector type be reused to store compiled regexp
pattern objects?

AFIU, pvec_type will need to have one more item and reader/printer will
need to be modified for a specialized print representation. The compiled
regexp itself will be stored as unibyte string, containing the
instructions.

This will also open a possibility to compile regexp patterns from Elisp,
if necessary.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-04  9:24                     ` Mattias Engdegård
@ 2023-05-05 10:31                       ` Ihor Radchenko
  2023-05-05 16:26                         ` Mattias Engdegård
  0 siblings, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-05 10:31 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> What is the aim of instrumenting regexp engine in this scenario?
>> I already know that additional regexps will be tested by individual
>> `org-element-X-parser' functions.
>
> I got the impression that the 'spine' of the parser, the sequence of `looking-at` calls in `org-element--current-element`, would frequently be run through in its entirety which means that consolidating these would reduce the number of working regexps by about 20 (if I'm counting correctly).

Not exactly. The actual statistics is the following (of course, it is a
subject of the actual parsed file structure).

Below, I measured time spent in different branches of cond.
Note describes the cond type.

| Depth |  count |     Time, msec | Note                              | Avg time, μsec/count | Element type  |
|-------+--------+----------------+-----------------------------------+----------------------+---------------|
|     0 |  89592 |         31.094 | eq                                |           0.43315819 | item          |
|     1 |   1984 |           0.68 | eq                                |           0.45513710 | table row     |
|     2 | 206607 |          43.23 | eq                                |           0.24850257 | node-property |
|     3 |  72770 |         302.95 | looking-at-p, skip-chars          |            4.8545025 | headline      |
|     4 |  56000 |      39.190916 | memq                              |           0.69983779 | section       |
|     5 |   8231 |      26.129109 | looking-at-p, lookback            |            3.1744756 | planning      |
|     6 |  54852 |      503.97346 | looking-at-p, multiline, lookback |            9.1878776 | prop drawer   |
|     7 |  89510 |      78.514284 | bolp                              |           0.87715656 | paragraph     |
|     8 |  29610 |      79.589466 | looking-at-p                      |            2.6879252 | clock         |
|     9 |    231 |       1.644304 | eq                                |            7.1181991 | inlinetask    |
|    10 |      0 | tot: 1173 msec | eq                                |                  0/0 | affiliated    |
|-------+--------+----------------+-----------------------------------+----------------------+---------------|
|    11 |     30 |       0.060081 | looking-at-p                      |               2.0027 | latex env     |
|    12 |  45443 |      187.41703 | looking-at-p                      |            4.1242222 | drawer        |
|    13 |     21 |       0.255528 | looking-at-p                      |               12.168 | fixed width   |
|    14 |    967 |        6.67522 | looking-at                        |            6.9030196 | block         |
|    15 |     53 |       0.342144 | looking-at-p                      |            6.4555472 | call          |
|    16 |      0 |              0 | looking-at-p                      |                  0/0 | dynblock      |
|    17 |     29 |       0.361915 | looking-at-p                      |            12.479828 | keyword       |
|    18 |      0 |              0 | eq                                |                  0/0 | paragraph     |
|    19 |      0 |              0 | looking-at-p                      |                  0/0 | footnote def  |
|    20 |      0 |              0 | looking-at-p                      |                  0/0 | rule          |
|    21 |      0 |              0 | looking-at-p                      |                  0/0 | diary sexp    |
|-------+--------+----------------+-----------------------------------+----------------------+---------------|
|    22 |     66 |       0.752823 | looking-at-p, re-search-forward   |            11.406409 | table         |
|    23 |  41509 |      303.39472 | looking-at-p                      |            7.3091310 | item          |
|    24 |   5340 |      41.188231 | t                                 |            7.7131519 | paragraph     |
|       |        | tot: 1713 msec |                                   |                      |               |

If I try to group regexps into one giant rx form and then compare time
spend in different cond branches, I get the following.
(I grouped the regexps between horizontal rules in the above table)

I tried two different files: (1) notes.org that is heavy on headlines;
(2) org-manual that is heavy on actual text.

Grouping with rx gives no noticeable impact. 

| Depth |  Avg time, μs | Avg time, μs |   Avg time, μs | Avg time, μs |
|       | (notes+no rx) |   (notes+rx) | (manual+no rx) |  (manual+rx) |
|-------+---------------+--------------+----------------+--------------|
|     0 |    0.34576248 |   0.35948186 |     0.43996679 |   0.44675874 |
|     1 |    0.35749752 |   0.37239325 |     0.44559585 |   0.43868739 |
|     2 |    0.18958309 |   0.20197035 |     0.29960921 |   0.29960921 |
|     3 |     4.1282904 |    4.2407582 |      4.4482968 |    4.4711219 |
|     4 |    0.61503580 |   0.59914459 |     0.64377158 |   0.63460540 |
|     5 |    0.88028169 |   0.83916084 |      1.2820513 |    1.2820513 |
|     6 |     2.6515244 |    2.6348024 |      2.6795055 |    2.7579648 |
|     7 |     7.8175124 |    7.8262918 |      7.1999256 |    7.1996154 |
|     8 |    0.75458424 |   0.75368242 |     0.70958084 |   0.72455090 |
|     9 |     2.1446653 |    2.1466905 |            10. |          10. |
|    10 |     5.2813853 |    5.2813853 |      5.4761905 |    6.5476190 |
|    11 |          0./0 |         0./0 |           0./0 |         0./0 |
|    12 |            2. |    2.3333333 |           0./0 |         0./0 |
|    13 |     3.5030250 |    4.6581886 |      4.0623783 |    5.8718692 |
|    14 |     11.428571 |    10.952381 |      2.6970634 |    3.3307573 |
|    15 |     5.6508264 |    4.6177686 |      5.1308629 |    4.3741902 |
|    16 |     6.2264151 |    4.1509434 |           0./0 |         0./0 |
|    17 |          0./0 |         0./0 |           0./0 |         0./0 |
|    18 |     10.689655 |          12. |      5.7134386 |    3.7413831 |
|    19 |          0./0 |         0./0 |           0./0 |         0./0 |
|    20 |          0./0 |         0./0 |      2.8888889 |    2.9444444 |
|    21 |          0./0 |         0./0 |           0./0 |         0./0 |
|    22 |          0./0 |         0./0 |           0./0 |         0./0 |
|    23 |     10.746269 |    9.4029851 |      6.2695313 |    6.1328125 |
|    24 |     6.4371193 |    6.2419339 |      6.0138782 |    5.8558211 |
|    25 |     6.4154824 |    6.3855647 |      4.9707695 |    4.7727182 |
|    26 |          0./0 |         0./0 |           0./0 |         0./0 |

> Now if as you suggest the parsing is dominated by sequences of regexps in the branches, it prompts the questions: which branches, what regexps, why are there so many of them, and is there anything that can be done to reduce their number?

Oh. No. The parsing is dominated by `org-element--current-element'. I
can clearly see it because the profiler hits
`org-element--current-element', not the branches.

I just had no idea what to make of your suggestion about

    Run on a reduced dataset, and see if the sequence of regexps being
    exercised, and their frequencies, are consistent with what you
    expect.

Also, my testing showed that

(looking-at
  (rx
   (or
    (group-n 1 (regexp org-element--latex-begin-environment))
    (group-n 2 (regexp org-element-drawer-re))
    (group-n 3 (regexp "[ \t]*:\\( \\|$\\)"))
    (group-n 7 (regexp org-element-dynamic-block-open-re))
    (seq (group-n 4 (regexp "[ \t]*#\\+"))
         (or
          (seq "BEGIN_" (group-n 5 (1+ (not space))))
          (group-n 6 "CALL:")
          (group-n 8 (1+ (not space)) ":")
          ))
    (group-n 9 (regexp org-footnote-definition-re))
    (group-n 10 (regexp "[ \t]*-\\{5,\\}[ \t]*$"))
    (group-n 11 "%%("))))

is actually slightly slower overall compared to a series of `looking-at-p'.
AFAIU, because the `looking-at' needs to allocate match-data vector for
all these 11 groups, which leads to
;;  6.78%  emacs         emacs                                  [.] process_mark_stack
floating up in the perf top.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-05 10:31                       ` Ihor Radchenko
@ 2023-05-05 16:26                         ` Mattias Engdegård
  2023-05-06 13:38                           ` Ihor Radchenko
  2023-05-07 12:45                           ` Mattias Engdegård
  0 siblings, 2 replies; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-05 16:26 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

5 maj 2023 kl. 12.31 skrev Ihor Radchenko <yantar92@posteo.net>:

> Not exactly. The actual statistics is the following (of course, it is a
> subject of the actual parsed file structure).
> 
> Below, I measured time spent in different branches of cond.

This is useful. It looks like drawers consume a lot of time, and list items. I know very little about Org, but from afar it looks like all drawers have the same basic form. Can't you recognise them with a single regexp and then branch on the drawer type for subtype-specific treatment?

There are micro-inefficiencies in the regexps here and there that you might want to try fixing (although I can't promise any noticeable gain from doing so):

(defconst org-property-drawer-re
  (concat "^[ \t]*:PROPERTIES:[ \t]*\n"
          "\\(?:[ \t]*:\\S-+:\\(?:[ \t].*\\)?[ \t]*\n\\)*?"
          "[ \t]*:END:[ \t]*$")

Look at the middle line in particular. Translated to rx, that part becomes

(*? (* (in "\t "))
    ":" (+ (not (syntax whitespace))) ":"
    (? (in "\t ") (* nonl))
    (* (in "\t "))
    "\n")

There are too many ways this could match. Maybe you could change it to

(*? (* (in " \t"))
    ":" (+ (not (in " \t\n:"))) ":"
    (* nonl)
    "\n")

which prevents a lot of unnecessary backtracking and does away with parsing structure that doesn't matter here.
Another example:

(defconst org-drawer-regexp "^[ \t]*:\\(\\(?:\\w\\|[-_]\\)+\\):[ \t]*$"

which is

(: bol
   (* (in "\t "))
   ":"
   (group (+ (| wordchar (in "_-"))))  ; <--
   ":"
   (* (in "\t "))
   eol)

Making reasonable assumptions about characters, the line marked with an arrow could become

   (group (+ (not (in " \t\n:"))))

but it's fine if you want to exclude more characters here, as long as you avoid leaving backtrack points everywhere. (Character syntax is kind of expensive too.)

Regarding list items, are you still calling (org-item-re) each time?

>> Now if as you suggest the parsing is dominated by sequences of regexps in the branches, it prompts the questions: which branches, what regexps, why are there so many of them, and is there anything that can be done to reduce their number?
> 
> Oh. No. The parsing is dominated by `org-element--current-element'. I
> can clearly see it because the profiler hits
> `org-element--current-element', not the branches.

Well there must be regexps being matched elsewhere since you did show early on the working set to be above 40, not the ca. 20 in org-element--current-element.

> I just had no idea what to make of your suggestion about
> 
>    Run on a reduced dataset, and see if the sequence of regexps being
>    exercised, and their frequencies, are consistent with what you
>    expect.

Stupid printf-debugging actually, nothing fancier than that.
I'll see if I can put together a patch for you a bit later on.

> (looking-at
>  (rx
>   (or
>    (group-n 1 (regexp org-element--latex-begin-environment))
>    (group-n 2 (regexp org-element-drawer-re))
>    (group-n 3 (regexp "[ \t]*:\\( \\|$\\)"))
>    (group-n 7 (regexp org-element-dynamic-block-open-re))
>    (seq (group-n 4 (regexp "[ \t]*#\\+"))
>         (or
>          (seq "BEGIN_" (group-n 5 (1+ (not space))))
>          (group-n 6 "CALL:")
>          (group-n 8 (1+ (not space)) ":")
>          ))
>    (group-n 9 (regexp org-footnote-definition-re))
>    (group-n 10 (regexp "[ \t]*-\\{5,\\}[ \t]*$"))
>    (group-n 11 "%%("))))

This actually incurs some unnecessary run-time cost: the (regexp ...) forms make this expand to a `concat` call to construct this rather long regexp each time. Either only recompute it when any of the variables (org-element--latex-begin-environment etc) change, or if you intend them to be compile-time constants, make sure they are expanded as such.

> is actually slightly slower overall compared to a series of `looking-at-p'.
> AFAIU, because the `looking-at' needs to allocate match-data vector for
> all these 11 groups, which leads to
> ;;  6.78%  emacs         emacs                                  [.] process_mark_stack
> floating up in the perf top.

Quite sure that's the concat calls. Match data doesn't actually contribute to any GC-level consing unless you reify it by calling `match-data`, or indirectly through `safe-match-data` (which I see that you are using in several places -- try not to).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-05 16:26                         ` Mattias Engdegård
@ 2023-05-06 13:38                           ` Ihor Radchenko
  2023-05-07 10:32                             ` Mattias Engdegård
  2023-05-07 12:45                           ` Mattias Engdegård
  1 sibling, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-06 13:38 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> Below, I measured time spent in different branches of cond.
>
> This is useful. It looks like drawers consume a lot of time, and list items. I know very little about Org, but from afar it looks like all drawers have the same basic form. Can't you recognise them with a single regexp and then branch on the drawer type for subtype-specific treatment?

I may, but it will be even more complex regexp. Currently, ordinary
drawers have somewhat complex :BEGIN: line, because they can have any
word there, while property drawers require very complex match for the
lines inside. Also, property drawers only occur right after headings, as
marked by appropriate parser flag. So, matching property drawers mostly
happens what they are supposed to be. If we try to match ordinary
drawers at the same time, it will actually be slower in practice.

> There are micro-inefficiencies in the regexps here and there that you might want to try fixing (although I can't promise any noticeable gain from doing so):
>
> (defconst org-property-drawer-re
>   (concat "^[ \t]*:PROPERTIES:[ \t]*\n"
>           "\\(?:[ \t]*:\\S-+:\\(?:[ \t].*\\)?[ \t]*\n\\)*?"
>           "[ \t]*:END:[ \t]*$")
> ...
> There are too many ways this could match. Maybe you could change it to
>
> (*? (* (in " \t"))
>     ":" (+ (not (in " \t\n:"))) ":"
>     (* nonl)
>     "\n")

Sure. Thanks! It was a sub-second improvement, but an improvement.

> Another example:
>
> (defconst org-drawer-regexp "^[ \t]*:\\(\\(?:\\w\\|[-_]\\)+\\):[ \t]*$"
> ...
> Making reasonable assumptions about characters, the line marked with an arrow could become
>
>    (group (+ (not (in " \t\n:"))))

This will account for Org syntax change, so no.
Slight improvement in performance cannot justify syntax changes.
 
> Regarding list items, are you still calling (org-item-re) each time?

Yes and no.
`org-item-re' now looks like

(defvar org--item-re-cache nil
  "Results cache for `org-item-re'.")
(defsubst org-item-re ()
  "Return the correct regular expression for plain lists."
  (or (plist-get
       org--item-re-cache
       (cons org-list-allow-alphabetical
             org-plain-list-ordered-item-terminator)
       #'equal)
      ...))

It should not give much overhead.

>> Oh. No. The parsing is dominated by `org-element--current-element'. I
>> can clearly see it because the profiler hits
>> `org-element--current-element', not the branches.
>
> Well there must be regexps being matched elsewhere since you did show early on the working set to be above 40, not the ca. 20 in org-element--current-element.

Of course. A larger number of regexps is matched in the individual
element parsers. They just don't contribute as much as
`org-element--current-element' individually and thus do not show up high
in the profiler.

For reference, I calculated the time taken in
`org-element--current-element' to decide about parsing specific element
type (Time/Avg) vs. time taken to actual parse it (Time2/Avg2).
(Note that the data below is for my WIP parser refactoring branch at
https://git.sr.ht/~yantar92/org-mode/tree/feature/org-element-ast/item/lisp/org-element.el;
The original, e.g. headline will be way slower)

| Depth |  Count | Time, msec | Time2, msec | Avg, μsec | Avg2, μsec | Type         |
|-------+--------+------------+-------------+-----------+------------+--------------|
|     0 |  89729 |  30.714894 |   1339.9075 |      0.34 |      14.93 | item         |
|     1 |   2074 |   0.779739 |   19.040295 |      0.38 |       9.18 | table row    |
|     2 | 207365 |   37.53366 |   1970.9524 |      0.18 |       9.50 | node         |
|     3 |  72849 |  303.36754 |   2448.6616 |      4.16 |      33.61 | headline     |
|     4 |  56076 |  33.117519 |   763.41927 |      0.59 |      13.61 | section      |
|     5 |    291 |   0.258913 |    2.622451 |      0.89 |       9.01 | comment      |
|     6 |   8247 |   23.15524 |   224.61437 |      2.81 |      27.24 | planning     |
|     7 |  54924 |  362.36612 |   523.11581 |      6.60 |       9.52 | prop drawer  |
|     8 |  89647 |  69.361279 |   761.29519 |      0.77 |       8.49 | paragraph    |
|     9 |  29652 |  67.658072 |   829.21937 |      2.28 |      27.97 | clock        |
|    10 |    231 |   1.285224 |    3.832217 |      5.56 |      16.59 | inlinetask   |
|    11 |      0 |          0 |           0 |      0.00 |       0.00 | keyword      |
|    12 |     30 |   0.059978 |    0.413909 |      2.00 |      13.80 | latex env    |
|    13 |  45401 |  159.57401 |   515.15776 |      3.51 |      11.35 | drawer       |
|    14 |     21 |   0.265039 |    0.265754 |     12.62 |      12.65 | fixed width  |
|    15 |    913 |   5.597659 |   17.326571 |      6.13 |      18.98 | block        |
|    16 |     53 |   0.355013 |    1.329438 |      6.70 |      25.08 | call         |
|    17 |      0 |          0 |           0 |      0.00 |       0.00 | dynblock     |
|    18 |     29 |   0.365553 |    0.494062 |     12.61 |      17.04 | keyword      |
|    19 |      0 |          0 |           0 |      0.00 |       0.00 | paragraph    |
|    20 |      0 |          0 |           0 |      0.00 |       0.00 | footnote def |
|    21 |      0 |          0 |           0 |      0.00 |       0.00 | hrule        |
|    22 |      0 |          0 |           0 |      0.00 |       0.00 | diary sexp   |
|    23 |     69 |   0.739084 |    1.459472 |     10.71 |      21.15 | table        |
|    24 |  41586 |  281.42632 |   1327.9897 |      6.77 |      31.93 | plain list   |
|    25 |   5370 |  36.202665 |   66.853115 |      6.74 |      12.45 | paragraph    |
#+TBLFM: $5=1000.0*$3/$2;%.2f::$6=1000.0*$4/$2;%.2f

>> I just had no idea what to make of your suggestion about
>> 
>>    Run on a reduced dataset, and see if the sequence of regexps being
>>    exercised, and their frequencies, are consistent with what you
>>    expect.
>
> Stupid printf-debugging actually, nothing fancier than that.
> I'll see if I can put together a patch for you a bit later on.

I once tried #defun REGEX_EMACS_DEBUG  + regex_emacs_debug = 100000, but
it produced so much output that I cannot even open Emacs in reasonable
time because of the wall of output in terminal.

>> (looking-at
>>  (rx
>>   (or
>> ...
>>    (group-n 11 "%%("))))
>
> This actually incurs some unnecessary run-time cost: the (regexp ...) forms make this expand to a `concat` call to construct this rather long regexp each time. Either only recompute it when any of the variables (org-element--latex-begin-environment etc) change, or if you intend them to be compile-time constants, make sure they are expanded as such.
>
>> is actually slightly slower overall compared to a series of `looking-at-p'.
>> AFAIU, because the `looking-at' needs to allocate match-data vector for
>> all these 11 groups, which leads to
>> ;;  6.78%  emacs         emacs                                  [.] process_mark_stack
>> floating up in the perf top.
>
> Quite sure that's the concat calls. Match data doesn't actually contribute to any GC-level consing unless you reify it by calling `match-data`, or indirectly through `safe-match-data` (which I see that you are using in several places -- try not to).

After moving that giant rx into defconst, the parsing time is not
growing significantly anymore:

;; No rx:  17.947628s (1.373926s in 2 GCs)
;; rx: 18.058193s (1.379169s in 2 GCs)

But there is no improvement either...

[ now we are just 2x slower than tree-sitter rather than 2.5x :) ]

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-06 13:38                           ` Ihor Radchenko
@ 2023-05-07 10:32                             ` Mattias Engdegård
  2023-05-08 11:58                               ` Ihor Radchenko
  0 siblings, 1 reply; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-07 10:32 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

6 maj 2023 kl. 15.38 skrev Ihor Radchenko <yantar92@posteo.net>:

> I may, but it will be even more complex regexp. Currently, ordinary
> drawers have somewhat complex :BEGIN: line, because they can have any
> word there, while property drawers require very complex match for the
> lines inside. Also, property drawers only occur right after headings, as
> marked by appropriate parser flag. So, matching property drawers mostly
> happens what they are supposed to be. If we try to match ordinary
> drawers at the same time, it will actually be slower in practice.

What I meant was that the consolidated root regexp could just match the initial :BEGIN: line and then dispatch to different branches for parsers specific to the drawer type. That would reduce complexity and time spent at the critical parser root.

> This will account for Org syntax change, so no.

Don't dismiss it out of hand. I'm not trying to optimise a few regexps, but to use examples to illustrate some useful principles that would help you improve many of them yourself.

When matching something terminated by a specific character, it's particularly useful if the regexp engine can be made to understand that the terminator doesn't occur in what precedes it, as that enables it to omit backtracking points. For example, in "a*b", the engine doesn't need to save backtracking points for each 'a' matched since the sets {a} and {b} are obviously disjoint.

In this case, the

   (group (+ (| wordchar (in "_-"))))

part is unnecessarily slow because it's an or-pattern, which also inhibits that optimisation. Fortunately it can easily be rewritten as

   (group (+ (in "_-" word)))

which solves both problems.

> Slight improvement in performance cannot justify syntax changes.

Always question your assumptions. A slight change of spec may not be so bad after all if it buys speed and/or improves our understanding of the code. Do you know what characters have 'word' syntax in org-mode? If not, better be careful before using them in regexps.

(Looks like org-tags-expand permanently adds @ and _ to the set of word chars. A bug, surely?)

> (defvar org--item-re-cache nil
>  "Results cache for `org-item-re'.")
> (defsubst org-item-re ()
>  "Return the correct regular expression for plain lists."
>  (or (plist-get
>       org--item-re-cache
>       (cons org-list-allow-alphabetical
>             org-plain-list-ordered-item-terminator)
>       #'equal)
>      ...))
> 
> It should not give much overhead.

Maybe, but you still cons each time. (And remember that the plist-get equality funarg is new in Emacs 29.)

> A larger number of regexps is matched in the individual
> element parsers. They just don't contribute as much as
> `org-element--current-element' individually and thus do not show up high
> in the profiler.

Still, if called often enough they do outsized damage by evicting regexps used elsewhere.

Also make sure that if the same regexp is used in multiple places, it should always use the same `case-fold-search` value or they will be considered different for cache purposes.

> [ now we are just 2x slower than tree-sitter rather than 2.5x :) ]

Progress!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-05 16:26                         ` Mattias Engdegård
  2023-05-06 13:38                           ` Ihor Radchenko
@ 2023-05-07 12:45                           ` Mattias Engdegård
  2023-05-08 13:56                             ` Ihor Radchenko
  1 sibling, 1 reply; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-07 12:45 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

[-- Attachment #1: Type: text/plain, Size: 591 bytes --]

5 maj 2023 kl. 18.26 skrev Mattias Engdegård <mattias.engdegard@gmail.com>:

> Stupid printf-debugging actually, nothing fancier than that.

Here is some of that stupidity I promised. You probably want to use it with

  (set-regexp-trace-file "re.log")
  (unwind-protect
      (do-something-interesting)
    (set-regexp-trace-file nil))

so that you don't trace more than necessary. The file may become large, but it's useful data for off-line analysis, scripted or just looking at it in an editor.
The first letter of each line indicates a regexp cache hit (H) or miss (M).


[-- Attachment #2: 0003-Regexp-tracing-add-set-regexp-trace-file.patch --]
[-- Type: application/octet-stream, Size: 3353 bytes --]

From cd66a560a74d2ed94202cab278455544f0c9337c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sun, 7 May 2023 14:05:31 +0200
Subject: [PATCH 3/3] Regexp tracing: add set-regexp-trace-file

---
 src/search.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/src/search.c b/src/search.c
index c454d5e1ca9..b378db152a2 100644
--- a/src/search.c
+++ b/src/search.c
@@ -34,6 +34,10 @@ Copyright (C) 1985-1987, 1993-1994, 1997-1999, 2001-2023 Free Software
 
 #include "regex-emacs.h"
 
+#include <stdio.h>
+
+static FILE *regexp_trace_file = NULL;
+
 #define DEFAULT_REGEXP_CACHE_SIZE 20
 
 /* If the regexp is non-nil, then the buffer contains the compiled form
@@ -200,6 +204,7 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
 {
   struct regexp_cache *cp, **cpp, **lru_nonbusy;
 
+  bool cache_hit = false;
   for (cpp = &searchbuf_head, lru_nonbusy = NULL; ; cpp = &cp->next)
     {
       cp = *cpp;
@@ -224,6 +229,7 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
 	  && cp->buf.charset_unibyte == charset_unibyte)
         {
           regexp_cache_hit++;
+	  cache_hit = true;
           break;
         }
 
@@ -243,6 +249,26 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
 	}
     }
 
+  if (regexp_trace_file) {
+    fprintf(regexp_trace_file, "%c \"", cache_hit ? 'H' : 'M');
+    ptrdiff_t n = SBYTES (pattern);
+    for (ptrdiff_t i = 0; i < n; i++) {
+      unsigned char c = SREF (pattern, i);
+      switch (c) {
+      case '"': case '\\': fprintf(regexp_trace_file, "\\%c", c); break;
+      case '\n': fprintf(regexp_trace_file, "\\n"); break;
+      case '\t': fprintf(regexp_trace_file, "\\t"); break;
+      default:
+	if (c < 32 || c == 127)
+	  fprintf(regexp_trace_file, "\\x%02x", c);
+	else
+	  putc(c, regexp_trace_file);
+	break;
+      }
+    }
+    fprintf(regexp_trace_file, "\"\n");
+  }
+
   /* When we get here, cp (aka *cpp) contains the compiled pattern,
      either because we found it in the cache or because we just compiled it.
      Move it to the front of the queue to mark it as most recently used.  */
@@ -3424,6 +3450,27 @@ DEFUN ("set-regexp-cache-size", Fset_regexp_cache_size, Sset_regexp_cache_size,
   return Qnil;
 }
 
+DEFUN ("set-regexp-trace-file", Fset_regexp_trace_file, Sset_regexp_trace_file,
+       1, 1, 0,
+       doc: /* Set the regexp trace file to FILE.  Internal use only.
+Use `nil' as argument to stop tracing.  */)
+  (Lisp_Object file)
+{
+  if (NILP (file)) {
+    fclose (regexp_trace_file);
+    regexp_trace_file = NULL;
+  } else {
+    CHECK_STRING (file);
+    if (regexp_trace_file)
+      Fset_regexp_trace_file (Qnil);
+    FILE *f = fopen (SSDATA (file), "a");
+    if (!f)
+      report_file_error ("opening regexp trace file", file);
+    regexp_trace_file = f;
+  }
+  return Qnil;
+}
+
 void
 mark_regexp_cache (void)
 {
@@ -3514,6 +3561,7 @@ syms_of_search (void)
   defsubr (&Snewline_cache_check);
   defsubr (&Sregexp_cache_size);
   defsubr (&Sset_regexp_cache_size);
+  defsubr (&Sset_regexp_trace_file);
 
   pdumper_do_now_and_after_load (syms_of_search_for_pdumper);
 }
-- 
2.32.0 (Apple Git-132)


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-07 10:32                             ` Mattias Engdegård
@ 2023-05-08 11:58                               ` Ihor Radchenko
  2023-05-08 18:21                                 ` Mattias Engdegård
  0 siblings, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-08 11:58 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> What I meant was that the consolidated root regexp could just match the initial :BEGIN: line and then dispatch to different branches for parsers specific to the drawer type. That would reduce complexity and time spent at the critical parser root.

Let me elaborate on what I mean.

The relevant `cond' branches are:

(and (pcase mode
	        (`planning (eq ?* (char-after (line-beginning-position 0))))
	        ((or `property-drawer `top-comment)
		 (save-excursion
		   (beginning-of-line 0)
		   (not (looking-at-p "[[:blank:]]*$"))))
	        (_ nil))
	      (looking-at-p org-property-drawer-re))

where org-property-drawer-re is

(rx
   ;; Drawer begin line.
   bol (0+ (in " \t")) ":PROPERTIES:" (0+ (in " \t")) "\n"
   ;; Node properties.
   (*? (0+ (in " \t")) ":" (+ (not (in " \t\n:"))) ":" (* nonl) "\n")
   ;; Drawer end line.
   (0+ (in " \t")) ":END:" (0+ (in " \t")) eol)

Note that this regexp is only matches when certain conditions are met
and the beginning of the property drawer regexp matches for less general
":PROPERTIES:".

and

[now part of the giant rx]

(rx line-start (0+ (any ?\s ?\t))
      ":" (1+ (any ?- ?_ word)) ":"
      (0+ (any ?\s ?\t)) line-end)

matches for more general ":[-_[:word:]]+:".

If we make ":[-_[:word:]]+:" a new root, how will it be helpful?

>> This will account for Org syntax change, so no.
>
> Don't dismiss it out of hand. I'm not trying to optimise a few regexps, but to use examples to illustrate some useful principles that would help you improve many of them yourself.
>
> When matching something terminated by a specific character, it's particularly useful if the regexp engine can be made to understand that the terminator doesn't occur in what precedes it, as that enables it to omit backtracking points. For example, in "a*b", the engine doesn't need to save backtracking points for each 'a' matched since the sets {a} and {b} are obviously disjoint.
>
> In this case, the
>
>    (group (+ (| wordchar (in "_-"))))
>
> part is unnecessarily slow because it's an or-pattern, which also
> inhibits that optimisation.

> Fortunately it can easily be rewritten as
>
>    (group (+ (in "_-" word)))
>
> which solves both problems.

But why? Aren't (in word ?_ ?-) and (or word ?_ ?-) not the same?

>> Slight improvement in performance cannot justify syntax changes.
>
> Always question your assumptions. A slight change of spec may not be so bad after all if it buys speed and/or improves our understanding of the code. Do you know what characters have 'word' syntax in org-mode? If not, better be careful before using them in regexps.

Honestly, I have no clue how syntax tables in Org mode are working.
I once tried to alter syntax table inside code blocks and Org got
completely broken. So, I simply do not dare to touch syntax-dependent
matches.

Ideally, Org should use dedicated, unmutable, syntax tables when
parsing. The difficulty though is that we need to support various
languages, including CJK syntax where syntax expectation quite different
from Latin.

Your suggestions about using (not (in ...)) in place of (in ...) are
good, but I am afraid to break CJK cases where people can use unexpected
set of characters. I was bitten by this in the past.

And there is consideration about not breaking the syntax.
It is not just about Elisp part - Org is a markup standard and for
drawers in particular we have defined the syntax like

https://orgmode.org/worg/org-syntax.html#Drawers

:NAME:
CONTENTS
:end:

NAME
A string consisting of word-constituent characters, hyphens and underscores (-_).

To change this, we should weigh on the possible impact on the external
Org parsers, not implemented in Elisp.

> (Looks like org-tags-expand permanently adds @ and _ to the set of word chars. A bug, surely?)

Yeah. Fixed now.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=6e6354c07

>> (defvar org--item-re-cache nil
>>  "Results cache for `org-item-re'.")
>> (defsubst org-item-re ()
>> ...
>> It should not give much overhead.
>
> Maybe, but you still cons each time. (And remember that the plist-get equality funarg is new in Emacs 29.)

Sure it does.
It is just one of the variable parts of Org syntax that might be
changed. There are ways to make this into constant, but it is a fragile
area of the code that I do not want to touch without a reason.
(Especially given that I am not familiar with org-list.el)

> Also make sure that if the same regexp is used in multiple places, it should always use the same `case-fold-search` value or they will be considered different for cache purposes.

I hope we do. If only Emacs had a way to define `case-fold-search' right
within the regexp itself.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-07 12:45                           ` Mattias Engdegård
@ 2023-05-08 13:56                             ` Ihor Radchenko
  2023-05-08 19:32                               ` Mattias Engdegård
  0 siblings, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-08 13:56 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

[-- Attachment #1: Type: text/plain, Size: 947 bytes --]

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> 5 maj 2023 kl. 18.26 skrev Mattias Engdegård <mattias.engdegard@gmail.com>:
>
>> Stupid printf-debugging actually, nothing fancier than that.
>
> Here is some of that stupidity I promised. You probably want to use it with
>
>   (set-regexp-trace-file "re.log")
>   (unwind-protect
>       (do-something-interesting)
>     (set-regexp-trace-file nil))

Thanks!

> so that you don't trace more than necessary. The file may become large, but it's useful data for off-line analysis, scripted or just looking at it in an editor.
> The first letter of each line indicates a regexp cache hit (H) or miss (M).

I am not sure what I can make out of hits/misses, but I am at least able
to look into frequency data, via sort re.log  | uniq -c > re-freq.log

It would be even nicer if apart from frequency, there was information
about time taken to search for each regexp.


[-- Attachment #2: re2-large.log --]
[-- Type: text/plain, Size: 34397 bytes --]

 624713 H "^\\(?4:[ \t]*\\)\\(?1::\\(?2:\\S-+\\):\\)\\(?:\\(?3:$\\)\\|[ \t]+\\(?3:.*?\\)\\)\\(?5:[ \t]*\\)$"
 293473 H "^\\*+ "
 271943 H "\\(?:[_^][-{(*+.,[:alnum:]]\\)\\|[*+/=_~][^[:space:]]\\|\\<\\(?:A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:desc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:search\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|notecite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\):\\|\\[\\(?:cite[:/]\\|fn:\\|\\(?:[0-9]\\|\\(?:%\\|/[0-9]*\\)\\]\\)\\|\\[\\)\\|@@\\|{{{\\|<\\(?:%%\\|<\\|[0-9]\\|\\(?:A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:desc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:search\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|notecite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\)\\)\\|\\$\\|\\\\\\(?:[a-zA-Z[(]\\|\\\\[ \t]*$\\|_ +\\)\\|\\(?:call\\|src\\)_"
 260619 H "\\(?:^\\|[^[:alnum:]]\\|\\c|\\)\\(foobar\\)\\(?:$\\|[^[:alnum:]]\\|\\c|\\)"
 222386 H "^[\t ]*#\\(?: \\|$\\)"
 179559 H "^[ \t]*\\(\\(?:[-+*]\\|\\(?:[0-9]+\\|[A-Za-z]\\)[.)]\\)\\(?:[ \t]+\\|$\\)\\)\\(?:\\[@\\(?:start:\\)?\\([0-9]+\\|[A-Za-z]\\)\\][ \t]*\\)?\\(?:\\(\\[[ X-]\\]\\)\\(?:[ \t]+\\|$\\)\\)?\\(?:\\(.*\\)[ \t]+::\\(?:[ \t]+\\|$\\)\\)?"
 152627 H "\\([ \t]*\\([-+]\\|\\(\\([0-9]+\\)[.)]\\)\\)\\|[ \t]+\\*\\)\\([ \t]+\\|$\\)"
 145600 H "\\(?:^[\t ]*\\(\\(?:\\(?:CLOSED\\|DEADLINE\\|SCHEDULED\\):\\)\\)\\)"
 125515 H "\\(\\([0-9]\\{4\\}\\)-\\([0-9]\\{2\\}\\)-\\([0-9]\\{2\\}\\)\\( +[^]+0-9>\x0d\n -]+\\)?\\( +\\([0-9]\\{1,2\\}\\):\\([0-9]\\{2\\}\\)\\)?\\)"
 117890 H ":"
 105789 H "^[ \t]*\n[ \t]*\n"
 102357 H "^[\t ]*:PROPERTIES:[\t ]*\n\\(?:[\t ]*:[^\t\n :]+:.*\n\\)*?[\t ]*:END:[\t ]*$"
 100268 H "\\(?:^[\t ]*CLOCK: \\(?:\\[\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\\)\\(?:--\\(?:\\[\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\\)[\t ]+=>[\t ]+[[:digit:]]+:[[:digit:]][[:digit:]]\\)?[\t ]*$\\)"
  96479 H "[[<]\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)[]>]\\|\\(?:<[0-9]+-[0-9]+-[0-9]+[^>\n]+?\\+[0-9]+[dwmy]>\\)\\|\\(?:<%%\\(?:([^>\n]+)\\)>\\)"
  95804 H "\\([<[]\\(%%\\)?.*?\\)[]>]\\(?:--\\([[<]\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)[]>]\\)\\)?"
  95801 H "\\([.+]?\\+\\)\\([0-9]+\\)\\([hdwmy]\\)"
  95801 H "\\(-\\)?-\\([0-9]+\\)\\([hdwmy]\\)"
  95801 H "[012]?[0-9]:[0-5][0-9]\\(-\\([012]?[0-9]\\):\\([0-5][0-9]\\)\\)"
  94944 H "^\\(?:\\*+ \\|\\[fn:[-_[:word:]]+\\]\\|%%(\\|[ \t]*\\(?:$\\||\\|\\+\\(?:-+\\+\\)+[ \t]*$\\|#\\(?: \\|$\\|\\+\\(?:BEGIN_\\S-+\\|\\S-+\\(?:\\[.*\\]\\)?:[ \t]*\\)\\)\\|:\\(?: \\|$\\|[-_[:word:]]+:[ \t]*$\\)\\|-\\{5,\\}[ \t]*$\\|\\\\begin{\\([A-Za-z0-9*]+\\)}\\|\\(?:^[\t ]*CLOCK: \\(?:\\[\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\\)\\(?:--\\(?:\\[\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\\)[\t ]+=>[\t ]+[[:digit:]]+:[[:digit:]][[:digit:]]\\)?[\t ]*$\\)\\|\\(?:[-+*]\\|\\(?:[0-9]+\\)[.)]\\)\\(?:[ \t]\\|$\\)\\)\\)"
  89719 H "[-+*]"
  70668 H "[ \t]*#\\+\\(?:\\(?1:\\(?:CAPTION\\|RESULTS\\)\\)\\(?:\\[\\(.*\\)\\]\\)?\\|\\(?1:\\(?:DATA\\|HEADERS?\\|LABEL\\|NAME\\|PLOT\\|RES\\(?:NAME\\|ULT\\)\\|\\(?:S\\(?:OURC\\|RCNAM\\)\\|TBLNAM\\)E\\)\\)\\|\\(?1:ATTR_[-_A-Za-z0-9]+\\)\\):[ \t]*"
  70409 H "\\(?1:^[ \t]*\\\\begin{[A-Za-z0-9*]+}\\)\\|\\(?2:^[\t ]*:[_[:word:]-]+:[\t ]*$\\)\\|\\(?3:[ \t]*:\\( \\|$\\)\\)\\|\\(?7:^[\t ]*#\\+BEGIN:[\t ]*\\)\\|\\(?4:[ \t]*#\\+\\)\\(?:BEGIN_\\(?5:[^[:space:]]+\\)\\|\\(?6:CALL:\\)\\|\\(?8:[^[:space:]]+:\\)\\)\\|\\(?9:^\\[fn:\\([-_[:word:]]+\\)\\]\\)\\|\\(?10:[ \t]*-----+[ \t]*$\\)\\|\\(?11:%%(\\)"
  70409 H "[ \t]*$"
  49780 H "^[ \t]*:END:[ \t]*$"
  46940 H "[ \t]*|"
  46800 H "[\t ]*\\+\\(?:-+\\+\\)+[\t ]*$"
  43578 H "\\`[ \t\n\x0d]+"
  43491 H "[ \t\n\x0d]+\\'"
  43276 H "\\[#.\\][ \t]*"
  43276 H "\\(CANCELLED\\|DO\\(?:ING\\|NE\\)\\|F\\(?:AILED\\|ROZEN\\)\\|HOLD\\|MERGED\\|NEXT\\|REVIEW\\|SOMEDAY\\|T\\(?:ICKLER\\|ODO\\)\\|WAITING\\)\\(?: \\|$\\)"
  43276 H "\\(:[[:alnum:]_@#%:]+:\\)[ \t]*$"
  43276 H "COMMENT\\(?: \\|$\\)"
  41509 H "[ \t]*[A-Za-z0-9]"
  33419 H "^[\t ]*:\\([_[:word:]-]+\\):[\t ]*$"
  28976 H "\\(\\S-+\\)[ \t]*$"
  27833 H "\\(?:^\\*\\{1,17\\} \\)"
  20981 H "\\[\\[\\(\\(?:[^][\\]\\|\\\\\\(?:\\\\\\\\\\)*[][]\\|\\\\+[^][]\\)+\\)]\\(?:\\[\\([^z-a]+?\\)]\\)?]"
  20701 H "\\`file\\(?:\\+\\(.+\\)\\)?\\'"
  19234 H "\\(\\\\+\\)\\(?:\\'\\|[][]\\)"
  19064 H "[ \t]*\n[ \t]*"
  19058 H "^\\([^:]*\\)\\(::?\\(.*\\)\\)?$"
  19057 H "\\`\\.\\.?/"
  19057 H "\\`\\(A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:desc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:search\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|notecite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\):"
  17440 H "\\(?:\\(?:CLOSED\\|DEADLINE\\|SCHEDULED\\):\\)"
  15376 H "^[ \t]*$"
  14348 H "^\\*\\{18,100\\}+ "
  13471 H "\\(?:^\\*\\{1,7\\} \\)"
  10158 H "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)/[^[:space:]]\\)"
   9779 H "\\(?:^\\*\\{1,6\\} \\)"
   9122 H "\\(?:^\\*\\{1,5\\} \\)"
   7577 H "[ \t]*#\\+BEGIN_\\(\\S-+\\)"
   7531 H "^[ \t]*\\\\begin{\\([A-Za-z0-9*]+\\)}"
   7494 H "[ \t]*#\\+\\(\\S-+\\)\\[.*\\]:"
   6442 H "[ \t]*\\(.*?\\)[ \t]*\\(?:|\\|$\\)"
   5151 H "\\(?:^\\*\\{1,4\\} \\)"
   3732 H "\\(\\S-\\)\\([_^]\\)\\(\\(?:{\\([^{}]*?\\|\\(?:[^{}]*?{[^{}]*?}\\)+[^{}]*?\\|\\(?:[^{}]*?{\\(?:[^{}]*?{[^{}]*?}\\)+[^{}]*?}\\)+[^{}]*?\\)}\\)\\|\\(?:(\\([^()]*?\\|\\(?:[^()]*?([^()]*?)\\)+[^()]*?\\|\\(?:[^()]*?(\\(?:[^()]*?([^()]*?)\\)+[^()]*?)\\)+[^()]*?\\))\\)\\|\\(?:\\*\\|[+-]?[[:alnum:].,\\]*[[:alnum:]]\\)\\)"
   3494 H "[[:blank:]]*$"
   3359 H "[.)]"
   3159 H "\\(?:^\\*\\{1,8\\} \\)"
   2154 H "[ \t]*#\\+BEGIN\\(:\\|_\\S-+\\)"
   2065 H "^[ \t]*|-"
   1871 H "\\(?:[^[:space:]]\\(/\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\)"
   1643 H "\\(?:\\<\\(?:\\(A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:desc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:search\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|notecite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\)\\):\\(\\(?:[^][ \t\n()<>]\\|(\\(?:[^][ \t\n()<>]\\|([^][ \t\n()<>]*)\\)*)\\)+\\(?:[^[:punct:] \t\n]\\|/\\|(\\(?:[^][ \t\n()<>]\\|([^][ \t\n()<>]*)\\)*)\\)\\)\\)"
   1557 H "\\\\\\(?:\\(?1:_ +\\)\\|\\(?1:there4\\|sup[123]\\|frac[13][24]\\|[a-zA-Z]+\\)\\(?2:$\\|{}\\|[^[:alpha:]]\\)\\)"
    963 H "\\\\\\\\[ \t]*$"
    876 H "^[ \t]*,*\\(,\\)\\(?:\\*\\|#\\+\\)"
    838 M "\\(\\([0-9]\\{4\\}\\)-\\([0-9]\\{2\\}\\)-\\([0-9]\\{2\\}\\)\\( +[^]+0-9>\x0d\n -]+\\)?\\( +\\([0-9]\\{1,2\\}\\):\\([0-9]\\{2\\}\\)\\)?\\)"
    838 M "\\([<[]\\(%%\\)?.*?\\)[]>]\\(?:--\\([[<]\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)[]>]\\)\\)?"
    838 M "\\([.+]?\\+\\)\\([0-9]+\\)\\([hdwmy]\\)"
    838 M "\\(-\\)?-\\([0-9]+\\)\\([hdwmy]\\)"
    838 M "[012]?[0-9]:[0-5][0-9]\\(-\\([012]?[0-9]\\):\\([0-5][0-9]\\)\\)"
    815 H "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)\\*[^[:space:]]\\)"
    811 M "[[<]\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)[]>]\\|\\(?:<[0-9]+-[0-9]+-[0-9]+[^>\n]+?\\+[0-9]+[dwmy]>\\)\\|\\(?:<%%\\(?:([^>\n]+)\\)>\\)"
    775 H "\\(?:[^[:space:]]\\(\\*\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\)"
    758 M "[ \t]*\n[ \t]*"
    755 M "^\\([^:]*\\)\\(::?\\(.*\\)\\)?$"
    755 M "\\`\\.\\.?/"
    755 M "\\`\\(A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:desc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:search\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|notecite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\):"
    755 M "\\(\\\\+\\)\\(?:\\'\\|[][]\\)"
    752 H "[^ \x0d\t\n]"
    751 H "\\(?:^\\*\\{1,9\\} \\)"
    735 M "\\(\\S-+\\)[ \t]*$"
    725 M "\\`file\\(?:\\+\\(.+\\)\\)?\\'"
    714 M "\\[\\[\\(\\(?:[^][\\]\\|\\\\\\(?:\\\\\\\\\\)*[][]\\|\\\\+[^][]\\)+\\)]\\(?:\\[\\([^z-a]+?\\)]\\)?]"
    662 M "^\\*\\{18,100\\}+ "
    641 H "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)\\+[^[:space:]]\\)"
    635 M "[[:blank:]]*$"
    601 H "^[ \t]*#\\+END_SRC[ \t]*$"
    600 M "^[ \t]*$"
    596 H "^[ \t]*#\\+BEGIN_SRC\\(?: +\\(\\S-+\\)\\)?\\(\\(?: +\\(?:-\\(?:l \".+\"\\|[ikr]\\)\\|[-+]n\\(?: *[0-9]+\\)?\\)\\)+\\)?\\(.*\\)[ \t]*$"
    586 M "\\(?:\\(?:CLOSED\\|DEADLINE\\|SCHEDULED\\):\\)"
    579 M "[ \t]*#\\+BEGIN_\\(\\S-+\\)"
    578 M "^[ \t]*\\\\begin{\\([A-Za-z0-9*]+\\)}"
    578 M "[ \t]*#\\+\\(\\S-+\\)\\[.*\\]:"
    549 H "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)=[^[:space:]]\\)"
    515 M "[ \t]*#\\+BEGIN\\(:\\|_\\S-+\\)"
    507 M "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)/[^[:space:]]\\)"
    482 M "[.)]"
    438 M "\\(?:^[\t ]*\\(\\(?:\\(?:CLOSED\\|DEADLINE\\|SCHEDULED\\):\\)\\)\\)"
    430 M "^\\(?4:[ \t]*\\)\\(?1::\\(?2:\\S-+\\):\\)\\(?:\\(?3:$\\)\\|[ \t]+\\(?3:.*?\\)\\)\\(?5:[ \t]*\\)$"
    428 H "\\`("
    421 M "\\(?:^\\*\\{1,17\\} \\)"
    411 M "^[\t ]*:PROPERTIES:[\t ]*\n\\(?:[\t ]*:[^\t\n :]+:.*\n\\)*?[\t ]*:END:[\t ]*$"
    404 H "\\[[0-9]*\\(%\\|/[0-9]*\\)\\]"
    396 M "\\(\\S-\\)\\([_^]\\)\\(\\(?:{\\([^{}]*?\\|\\(?:[^{}]*?{[^{}]*?}\\)+[^{}]*?\\|\\(?:[^{}]*?{\\(?:[^{}]*?{[^{}]*?}\\)+[^{}]*?}\\)+[^{}]*?\\)}\\)\\|\\(?:(\\([^()]*?\\|\\(?:[^()]*?([^()]*?)\\)+[^()]*?\\|\\(?:[^()]*?(\\(?:[^()]*?([^()]*?)\\)+[^()]*?)\\)+[^()]*?\\))\\)\\|\\(?:\\*\\|[+-]?[[:alnum:].,\\]*[[:alnum:]]\\)\\)"
    395 H "\\`///*\\(.:\\)?/"
    395 H "::\\(.*\\)\\'"
    389 M "\\\\\\\\[ \t]*$"
    340 M "^[ \t]*:END:[ \t]*$"
    270 M "\\(?:^\\*\\{1,5\\} \\)"
    260 M "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)\\*[^[:space:]]\\)"
    257 H "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)~[^[:space:]]\\)"
    253 M "\\(?:^\\*\\{1,6\\} \\)"
    253 M "\\(?:[^[:space:]]\\(\\*\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\)"
    240 M "\\(?:^\\*\\{1,4\\} \\)"
    239 M "\\(?:\\<\\(?:\\(A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:desc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:search\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|notecite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\)\\):\\(\\(?:[^][ \t\n()<>]\\|(\\(?:[^][ \t\n()<>]\\|([^][ \t\n()<>]*)\\)*)\\)+\\(?:[^[:punct:] \t\n]\\|/\\|(\\(?:[^][ \t\n()<>]\\|([^][ \t\n()<>]*)\\)*)\\)\\)\\)"
    210 M "\\(?:^\\*\\{1,7\\} \\)"
    193 H "\\(?:[^[:space:]]\\(=\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\)"
    192 M "\\\\\\(?:\\(?1:_ +\\)\\|\\(?1:there4\\|sup[123]\\|frac[13][24]\\|[a-zA-Z]+\\)\\(?2:$\\|{}\\|[^[:alpha:]]\\)\\)"
    186 H "<\\(A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:desc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:search\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|notecite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\):\\([^>\n]*\\(?:\n[ \t]*[^> \t\n][^>\n]*\\)*\\)>"
    161 H "\\(?:^\\*\\{1,3\\} \\)"
    156 M "\\(?:^\\*\\{1,8\\} \\)"
    154 H "^ATTR_"
    149 H "\\(?:^\\*\\{1,10\\} \\)"
    147 M "[^ \x0d\t\n]"
    144 M "^[ \t]*#\\+END_SRC[ \t]*$"
    141 M "^[ \t]*#\\+BEGIN_SRC\\(?: +\\(\\S-+\\)\\)?\\(\\(?: +\\(?:-\\(?:l \".+\"\\|[ikr]\\)\\|[-+]n\\(?: *[0-9]+\\)?\\)\\)+\\)?\\(.*\\)[ \t]*$"
    135 H "\\(?:[^[:space:]]\\(~\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\)"
    122 M "\\(?:^\\*\\{1,3\\} \\)"
    118 M "[-+*]"
    117 M "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)=[^[:space:]]\\)"
    111 M "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)~[^[:space:]]\\)"
    108 M "[ \t]*[A-Za-z0-9]"
    104 M "\\[[0-9]*\\(%\\|/[0-9]*\\)\\]"
    103 M "\\(?:[^[:space:]]\\(/\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\)"
    101 M "^[ \t]*\n[ \t]*\n"
     93 H "\\(\\s.\\|\\s-\\|\\s(\\|\\s)\\|\\s\"\\|'\\|$\\)"
     92 M "<\\(A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:desc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:search\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|notecite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\):\\([^>\n]*\\(?:\n[ \t]*[^> \t\n][^>\n]*\\)*\\)>"
     92 H "[ \t]*END[ \t]*$"
     89 M "\\`[ \t\n\x0d]+"
     89 M "[ \t\n\x0d]+\\'"
     83 M "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)\\+[^[:space:]]\\)"
     83 M "\\(?:[^[:space:]]\\(~\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\)"
     77 M "^ATTR_"
     77 M "\\`///*\\(.:\\)?/"
     77 M "::\\(.*\\)\\'"
     57 H "\\\\[a-zA-Z]+\\*?\\(\\(\\[[^][\n{}]*\\]\\)\\|\\({[^{}\n]*}\\)\\)*"
     52 M "\\`("
     52 M "\\(?:^\\*\\{1,2\\} \\)"
     51 H "^[ \t]*#\\+END_src[ \t]*$"
     50 H "[ \t]*#\\+TBLFM: +\\(.*\\)[ \t]*$"
     46 M "\\(?:[^[:space:]]\\(=\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\)"
     46 M "[ \t]*END[ \t]*$"
     39 M "^[ \t]*|-"
     39 M "^[ \t]*\\($\\|[^| \t]\\)"
     39 M "[ \t]*\\(.*?\\)[ \t]*\\(?:|\\|$\\)"
     39 M "[ \t]*#\\+TBLFM: +\\(.*\\)[ \t]*$"
     37 M "\\(?:^\\*\\{1,9\\} \\)"
     36 M "^[ \t]*\\(\\(?:[-+*]\\|\\(?:[0-9]+\\|[A-Za-z]\\)[.)]\\)\\(?:[ \t]+\\|$\\)\\)\\(?:\\[@\\(?:start:\\)?\\([0-9]+\\|[A-Za-z]\\)\\][ \t]*\\)?\\(?:\\(\\[[ X-]\\]\\)\\(?:[ \t]+\\|$\\)\\)?\\(?:\\(.*\\)[ \t]+::\\(?:[ \t]+\\|$\\)\\)?"
     35 M "^[ \t]*#\\+END_QUOTE[ \t]*$"
     33 H "^[ \t]*: ?"
     32 H "\\\\end{equation}[ \t]*$"
     29 H "^[ \t]*\\($\\|[^| \t]\\)"
     29 H "\\(?:[^[:space:]]\\(\\+\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\)"
     26 M "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)_[^[:space:]]\\)"
     23 M "^[ \t]*#\\+END_src[ \t]*$"
     22 M "\\(?:[^[:space:]]\\(\\+\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\)"
     21 M "\\(\\s.\\|\\s-\\|\\s(\\|\\s)\\|\\s\"\\|'\\|$\\)"
     21 M "\\(?:[^[:space:]]\\(_\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\)"
     21 H "\\(?:^\\*\\{1,2\\} \\)"
     20 H "[ \t]*:\\( \\|$\\)"
     19 M "\\(?:^\\*\\{1,10\\} \\)"
     19 M "[\t ]*\\+\\(?:-+\\+\\)+[\t ]*$"
     18 H "END[ \t]*$"
     16 M "^[ \t]*#\\+END_quote[ \t]*$"
     15 M "[ \t]*|"
     15 H "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)_[^[:space:]]\\)"
     14 M "END[ \t]*$"
     13 M "\\(?:^[\t ]*CLOCK: \\(?:\\[\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\\)\\(?:--\\(?:\\[\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\\)[\t ]+=>[\t ]+[[:digit:]]+:[[:digit:]][[:digit:]]\\)?[\t ]*$\\)"
     13 M "\\(?1:^[ \t]*\\\\begin{[A-Za-z0-9*]+}\\)\\|\\(?2:^[\t ]*:[_[:word:]-]+:[\t ]*$\\)\\|\\(?3:[ \t]*:\\( \\|$\\)\\)\\|\\(?7:^[\t ]*#\\+BEGIN:[\t ]*\\)\\|\\(?4:[ \t]*#\\+\\)\\(?:BEGIN_\\(?5:[^[:space:]]+\\)\\|\\(?6:CALL:\\)\\|\\(?8:[^[:space:]]+:\\)\\)\\|\\(?9:^\\[fn:\\([-_[:word:]]+\\)\\]\\)\\|\\(?10:[ \t]*-----+[ \t]*$\\)\\|\\(?11:%%(\\)"
     13 M "[ \t]*$"
     13 M "[ \t]*#\\+\\(?:\\(?1:\\(?:CAPTION\\|RESULTS\\)\\)\\(?:\\[\\(.*\\)\\]\\)?\\|\\(?1:\\(?:DATA\\|HEADERS?\\|LABEL\\|NAME\\|PLOT\\|RES\\(?:NAME\\|ULT\\)\\|\\(?:S\\(?:OURC\\|RCNAM\\)\\|TBLNAM\\)E\\)\\)\\|\\(?1:ATTR_[-_A-Za-z0-9]+\\)\\):[ \t]*"
     13 H "<<\\([^<>\n\x0d \t]\\|[^<>\n\x0d \t][^<>\n\x0d]*[^<>\n\x0d \t]\\)>>"
     13 H "<<<\\([^<>\n\x0d \t]\\|[^<>\n\x0d \t][^<>\n\x0d]*[^<>\n\x0d \t]\\)>>>"
     11 M "\\\\[a-zA-Z]+\\*?\\(\\(\\[[^][\n{}]*\\]\\)\\|\\({[^{}\n]*}\\)\\)*"
     11 M "<<<\\([^<>\n\x0d \t]\\|[^<>\n\x0d \t][^<>\n\x0d]*[^<>\n\x0d \t]\\)>>>"
     11 H "\\(?:[^[:space:]]\\(_\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\)"
     11 H "[ \t]*#\\+\\(\\S-*\\):"
     10 M "^[ \t]*: ?"
     10 M "[ \t]*:\\( \\|$\\)"
     10 M "<<\\([^<>\n\x0d \t]\\|[^<>\n\x0d \t][^<>\n\x0d]*[^<>\n\x0d \t]\\)>>"
      9 H "^[ \t]*#\\+END_QUOTE[ \t]*$"
      6 M "^\\(?:\\*+ \\|\\[fn:[-_[:word:]]+\\]\\|%%(\\|[ \t]*\\(?:$\\||\\|\\+\\(?:-+\\+\\)+[ \t]*$\\|#\\(?: \\|$\\|\\+\\(?:BEGIN_\\S-+\\|\\S-+\\(?:\\[.*\\]\\)?:[ \t]*\\)\\)\\|:\\(?: \\|$\\|[-_[:word:]]+:[ \t]*$\\)\\|-\\{5,\\}[ \t]*$\\|\\\\begin{\\([A-Za-z0-9*]+\\)}\\|\\(?:^[\t ]*CLOCK: \\(?:\\[\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\\)\\(?:--\\(?:\\[\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\\)[\t ]+=>[\t ]+[[:digit:]]+:[[:digit:]][[:digit:]]\\)?[\t ]*$\\)\\|\\(?:[-+*]\\|\\(?:[0-9]+\\)[.)]\\)\\(?:[ \t]\\|$\\)\\)\\)"
      6 M "[ \t]*#\\+\\(\\S-*\\):"
      6 H "^[ \t]*#\\+END_quote[ \t]*$"
      5 M "^[\t ]*:\\([_[:word:]-]+\\):[\t ]*$"
      5 M "\\([ \t]*\\([-+]\\|\\(\\([0-9]+\\)[.)]\\)\\)\\|[ \t]+\\*\\)\\([ \t]+\\|$\\)"
      5 M "\\(?:^\\*\\{1\\} \\)"
      5 H "\\`\\(?:\\)?/\\(?:\\(?:\\(-\\|[[:alnum:]]\\{2,\\}\\)\\(?::\\)\\(?:\\([^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(\\(?:[%._[:alnum:]-]+\\|\\(?:\\[\\)\\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)?\\(?:]\\)\\)\\(?:\\(?:#\\)\\(?:[[:digit:]]+\\)\\)?\\)?\\)\\(?:|\\)\\)*\\(?:\\(?:-\\|[[:alnum:]]+\\)\\(?:\\(?::\\)\\(?:\\(?:[^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(?:[%._[:alnum:]-]+\\|\\(?:\\[\\)\\(?:\\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)\\(?:]\\)?\\)?\\)?\\)?\\)?\\'"
      5 H "\\`\\(.+\\.\\(?:7z\\|CAB\\|LZH\\|MSU\\|ZIP\\|a\\(?:pk\\|r\\)\\|c\\(?:ab\\|pio\\|rate\\)\\|de\\(?:b\\|pot\\)\\|e\\(?:pub\\|xe\\)\\|iso\\|jar\\|lzh\\|m\\(?:su\\|tree\\)\\|od[bfgpst]\\|pax\\|r\\(?:ar\\|pm\\)\\|shar\\|t\\(?:ar\\|bz\\|gz\\|lz\\|xz\\|zst\\)\\|warc\\|x\\(?:ar\\|p[is]\\)\\|zip\\)\\(?:\\.\\(?:Z\\|bz2\\|gz\\|l\\(?:rz\\|z\\(?:ma\\|[4o]\\)?\\)\\|uu\\|xz\\|zst\\)\\)?\\)\\(/.*\\)\\'"
      5 H "\\`/:"
      5 H "\\.gpg\\(~\\|\\.~[0-9]+~\\)?\\'"
      5 H "\\(?:^/\\)\\(\\(?:\\(?:\\(-\\|[[:alnum:]]\\{2,\\}\\)\\(?::\\)\\(?:\\([^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(\\(?:[%._[:alnum:]-]+\\|\\(?:\\[\\)\\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)?\\(?:]\\)\\)\\(?:\\(?:#\\)\\(?:[[:digit:]]+\\)\\)?\\)?\\)\\(?:|\\)\\)+\\)?\\(?:\\(-\\|[[:alnum:]]\\{2,\\}\\)\\(?::\\)\\(?:\\([^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(\\(?:[%._[:alnum:]-]+\\|\\(?:\\[\\)\\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)?\\(?:]\\)\\)\\(?:\\(?:#\\)\\(?:[[:digit:]]+\\)\\)?\\)?\\)\\(?::\\)\\([^\n\x0d]*\\'\\)"
      5 H "\\(?:\\.tzst\\|\\.zst\\|\\.dz\\|\\.txz\\|\\.xz\\|\\.lzma\\|\\.lz\\|\\.g?z\\|\\.\\(?:tgz\\|svgz\\|sifz\\)\\|\\.tbz2?\\|\\.bz2\\|\\.Z\\)\\(?:~\\|\\.~[-[:alnum:]:#@^._]+\\(?:~[[:digit:]]+\\)?~\\)?\\'"
      3 M "^[ \t]*#\\+END_equation[ \t]*$"
      3 M "\\`\\(?:\\)?/\\(?:\\(?:\\(-\\|[[:alnum:]]\\{2,\\}\\)\\(?::\\)\\(?:\\([^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(\\(?:[%._[:alnum:]-]+\\|\\(?:\\[\\)\\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)?\\(?:]\\)\\)\\(?:\\(?:#\\)\\(?:[[:digit:]]+\\)\\)?\\)?\\)\\(?:|\\)\\)*\\(?:\\(?:-\\|[[:alnum:]]+\\)\\(?:\\(?::\\)\\(?:\\(?:[^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(?:[%._[:alnum:]-]+\\|\\(?:\\[\\)\\(?:\\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)\\(?:]\\)?\\)?\\)?\\)?\\)?\\'"
      3 M "\\`\\(.+\\.\\(?:7z\\|CAB\\|LZH\\|MSU\\|ZIP\\|a\\(?:pk\\|r\\)\\|c\\(?:ab\\|pio\\|rate\\)\\|de\\(?:b\\|pot\\)\\|e\\(?:pub\\|xe\\)\\|iso\\|jar\\|lzh\\|m\\(?:su\\|tree\\)\\|od[bfgpst]\\|pax\\|r\\(?:ar\\|pm\\)\\|shar\\|t\\(?:ar\\|bz\\|gz\\|lz\\|xz\\|zst\\)\\|warc\\|x\\(?:ar\\|p[is]\\)\\|zip\\)\\(?:\\.\\(?:Z\\|bz2\\|gz\\|l\\(?:rz\\|z\\(?:ma\\|[4o]\\)?\\)\\|uu\\|xz\\|zst\\)\\)?\\)\\(/.*\\)\\'"
      3 M "\\`/:"
      3 M "\\\\end{equation}[ \t]*$"
      3 M "\\.gpg\\(~\\|\\.~[0-9]+~\\)?\\'"
      3 M "\\(?:^/\\)\\(\\(?:\\(?:\\(-\\|[[:alnum:]]\\{2,\\}\\)\\(?::\\)\\(?:\\([^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(\\(?:[%._[:alnum:]-]+\\|\\(?:\\[\\)\\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)?\\(?:]\\)\\)\\(?:\\(?:#\\)\\(?:[[:digit:]]+\\)\\)?\\)?\\)\\(?:|\\)\\)+\\)?\\(?:\\(-\\|[[:alnum:]]\\{2,\\}\\)\\(?::\\)\\(?:\\([^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(\\(?:[%._[:alnum:]-]+\\|\\(?:\\[\\)\\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)?\\(?:]\\)\\)\\(?:\\(?:#\\)\\(?:[[:digit:]]+\\)\\)?\\)?\\)\\(?::\\)\\([^\n\x0d]*\\'\\)"
      3 M "\\(?:\\.tzst\\|\\.zst\\|\\.dz\\|\\.txz\\|\\.xz\\|\\.lzma\\|\\.lz\\|\\.g?z\\|\\.\\(?:tgz\\|svgz\\|sifz\\)\\|\\.tbz2?\\|\\.bz2\\|\\.Z\\)\\(?:~\\|\\.~[-[:alnum:]:#@^._]+\\(?:~[[:digit:]]+\\)?~\\)?\\'"
      3 M "[A-Za-z]"
      3 M "[0-9]+"
      3 M "[ \t]*#\\+BEGIN_\\(\\S-+\\)[ \t]*\\(.*\\)[ \t]*$"
      3 H "\\+$"
      2 M "^[ \t]*#\\+END_EXAMPLE[ \t]*$"
      2 M "^[ \t]*#\\+BEGIN_EXAMPLE\\(?: +\\(.*\\)\\)?"
      2 M "\\[cite\\(?:/\\([/_[:alnum:]-]+\\)\\)?:[\t\n ]*"
      2 M "\\+$"
      2 M "@\\([!#-+./:<>-@^-`{-~[:word:]-]+\\)"
      2 H "@\\([!#-+./:<>-@^-`{-~[:word:]-]+\\)"
      1 M "^\\*+ "
      1 M "^[\t ]*#\\(?: \\|$\\)"
      1 M "^[ \t]*,*\\(,\\)\\(?:\\*\\|#\\+\\)"
      1 M "^[ \t]*#\\+END_export[ \t]*$"
      1 M "^[ \t]*#\\+END_EXPORT[ \t]*$"
      1 M "^[ \t]*#\\+END_COMMENT[ \t]*$"
      1 M "^[ \t]*#\\+CATEGORY:"
      1 M "\\\\end{array}[ \t]*$"
      1 M "\\[fn:\\(?:\\(?1:[-_[:word:]]+\\)?\\(:\\)\\|\\(?1:[-_[:word:]]+\\)\\]\\)"
      1 M "\\[#.\\][ \t]*"
      1 M "\\.[^.]*\\'"
      1 M "\\(CANCELLED\\|DO\\(?:ING\\|NE\\)\\|F\\(?:AILED\\|ROZEN\\)\\|HOLD\\|MERGED\\|NEXT\\|REVIEW\\|SOMEDAY\\|T\\(?:ICKLER\\|ODO\\)\\|WAITING\\)\\(?: \\|$\\)"
      1 M "\\(?:~\\|\\.~[-[:alnum:]:#@^._]+\\(?:~[[:digit:]]+\\)?~\\)\\'"
      1 M "\\(?:^\\|[^[:alnum:]]\\|\\c|\\)\\(foobar\\)\\(?:$\\|[^[:alnum:]]\\|\\c|\\)"
      1 M "\\(?:^\\*\\{1,11\\} \\)"
      1 M "\\(?:[_^][-{(*+.,[:alnum:]]\\)\\|[*+/=_~][^[:space:]]\\|\\<\\(?:A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:desc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:search\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|notecite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\):\\|\\[\\(?:cite[:/]\\|fn:\\|\\(?:[0-9]\\|\\(?:%\\|/[0-9]*\\)\\]\\)\\|\\[\\)\\|@@\\|{{{\\|<\\(?:%%\\|<\\|[0-9]\\|\\(?:A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:desc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:search\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|notecite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\)\\)\\|\\$\\|\\\\\\(?:[a-zA-Z[(]\\|\\\\[ \t]*$\\|_ +\\)\\|\\(?:call\\|src\\)_"
      1 M "\\(:[[:alnum:]_@#%:]+:\\)[ \t]*$"
      1 M "[ \t]*#\\+BEGIN_EXPORT\\(?:[ \t]+\\(\\S-+\\)\\)?[ \t]*$"
      1 M "COMMENT\\(?: \\|$\\)"
      1 M ":"
      1 H "^[ \t]*#\\+END_EXPORT[ \t]*$"
      1 H "\\\\end{array}[ \t]*$"
      1 H "\\[fn:\\(?:\\(?1:[-_[:word:]]+\\)?\\(:\\)\\|\\(?1:[-_[:word:]]+\\)\\]\\)"
      1 H "\\(?:^\\*\\{1\\} \\)"
      1 H "[ \t]*#\\+BEGIN_EXPORT\\(?:[ \t]+\\(\\S-+\\)\\)?[ \t]*$"

[-- Attachment #3: Type: text/plain, Size: 224 bytes --]


-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-08 11:58                               ` Ihor Radchenko
@ 2023-05-08 18:21                                 ` Mattias Engdegård
  2023-05-08 19:38                                   ` Ihor Radchenko
  0 siblings, 1 reply; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-08 18:21 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

8 maj 2023 kl. 13.58 skrev Ihor Radchenko <yantar92@posteo.net>:

> 		 (save-excursion
> 		   (beginning-of-line 0)
> 		   (not (looking-at-p "[[:blank:]]*$"))))

I wonder if that last part isn't better written as

  (save-excursion
    (forward-line 0)   ; faster than beginning-of-line
    (skip-chars-forward "[:blank:]") ; faster than looking-at-p
    (not (eolp)))   ; very cheap

which doesn't use regexps at all. Worth a try.

But yes, I sort of understand what you are getting at (except the business with the MODE parameter which is still a bit mysterious to me).

> [now part of the giant rx]
> 
> (rx line-start (0+ (any ?\s ?\t))
>      ":" (1+ (any ?- ?_ word)) ":"
>      (0+ (any ?\s ?\t)) line-end)

Any reason you don't capture the part between the colons here, so that you don't need to match it later on?

> But why? Aren't (in word ?_ ?-) and (or word ?_ ?-) not the same?

"[-_[:word:]]" and "\\w\\|[_-]" indeed match the same thing but they don't generate the same regexp bytecode -- the former is faster. (In this case rx makes a literal translation to those strings but we should probably make it optimise to the faster regexp.)

There is a regexp disassembler for the really curious but it doesn't come with Emacs.

> Your suggestions about using (not (in ...)) in place of (in ...) are
> good, but I am afraid to break CJK cases where people can use unexpected
> set of characters. I was bitten by this in the past.

In this case there's no need since you could gain some speed by the simple rewrite (or -> in) above, but there may be others where conditions are different.

>> Maybe, but you still cons each time. (And remember that the plist-get equality funarg is new in Emacs 29.)
> 
> Sure it does.
> It is just one of the variable parts of Org syntax that might be
> changed. There are ways to make this into constant, but it is a fragile
> area of the code that I do not want to touch without a reason.
> (Especially given that I am not familiar with org-list.el)

So it's fine to use elisp constructs new in Emacs 29 in Org? Then the line

 ;; Package-Requires: ((emacs "26.1"))

in org.el should probably be updated, right?

> I hope we do. If only Emacs had a way to define `case-fold-search' right
> within the regexp itself.

I would like that too, but changing that isn't easy.
By the way, it seems that org-element-node-property-parser binds case-fold-search without actually using it. Bug?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-08 13:56                             ` Ihor Radchenko
@ 2023-05-08 19:32                               ` Mattias Engdegård
  2023-05-08 19:44                                 ` Ihor Radchenko
  0 siblings, 1 reply; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-08 19:32 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

8 maj 2023 kl. 15.56 skrev Ihor Radchenko <yantar92@posteo.net>:

> I am not sure what I can make out of hits/misses, but I am at least able
> to look into frequency data, via sort re.log  | uniq -c > re-freq.log

I'm mostly curious about the regexp cache behaviour. What cache size did you use in this run?
Hardly 20, given the low miss rate? It would be interesting to see what sequence of regexps most commonly cause thrashing.

> It would be even nicer if apart from frequency, there was information
> about time taken to search for each regexp.

That's a bit messier but could be done if really needed.






^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-08 18:21                                 ` Mattias Engdegård
@ 2023-05-08 19:38                                   ` Ihor Radchenko
  2023-05-08 19:53                                     ` Mattias Engdegård
  0 siblings, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-08 19:38 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> 		 (save-excursion
>> 		   (beginning-of-line 0)
>> 		   (not (looking-at-p "[[:blank:]]*$"))))
>
> I wonder if that last part isn't better written as
>
>   (save-excursion
>     (forward-line 0)   ; faster than beginning-of-line

Fair point. It is probably worth looking through Org sources and
replacing all those `beginning-of-line' and `end-of-line' calls. I doubt
that we even intend to use fields for real.

>     (skip-chars-forward "[:blank:]") ; faster than looking-at-p
>     (not (eolp)))   ; very cheap

Hmm. I feel confused.
Does this imply that simple regexps like

(looking-at-p (rx (seq bol (zero-or-more (any "\t ")) "#" (or " " eol))))

should better be implemented as the following?

(and (bolp)
     (skip-chars-forward " \t")
     (eq (char-after) ?#) (forward-char)
     (or (eolp) (eq (char-after) ?\s)))

(I now start thinking that it might be more efficient to create a bunch
or char tables and step over them to match "regexp", just like any
finite automaton would)

> But yes, I sort of understand what you are getting at (except the business with the MODE parameter which is still a bit mysterious to me).

MODE parameter is used because we constrain what kinds of syntax
elements are allowed inside other. For example, see
`org-element-object-restrictions'. And within
`org-element--current-element', MODE is used, for example, to constrain
planning/property drawer to be only the first child of a parent heading,
parsed earlier.

>> [now part of the giant rx]
>> 
>> (rx line-start (0+ (any ?\s ?\t))
>>      ":" (1+ (any ?- ?_ word)) ":"
>>      (0+ (any ?\s ?\t)) line-end)
>
> Any reason you don't capture the part between the colons here, so that you don't need to match it later on?

That would demand all the callers of `org-element-drawer-parser' to set
match data appropriately. Which is doable, but headache for maintenance.
We sometimes do call parsers explicitly not from inside
`org-element--current-element'.

>> But why? Aren't (in word ?_ ?-) and (or word ?_ ?-) not the same?
>
> "[-_[:word:]]" and "\\w\\|[_-]" indeed match the same thing but they don't generate the same regexp bytecode -- the former is faster. (In this case rx makes a literal translation to those strings but we should probably make it optimise to the faster regexp.)
>
> There is a regexp disassembler for the really curious but it doesn't come with Emacs.

I really hope that I did not need to do all these workarounds specific to
current implementation pitfalls of Emacs regexp compiler.

>>> Maybe, but you still cons each time. (And remember that the plist-get equality funarg is new in Emacs 29.)
>> 
>> Sure it does.
>> It is just one of the variable parts of Org syntax that might be
>> changed. There are ways to make this into constant, but it is a fragile
>> area of the code that I do not want to touch without a reason.
>> (Especially given that I am not familiar with org-list.el)
>
> So it's fine to use elisp constructs new in Emacs 29 in Org? Then the line
>
>  ;; Package-Requires: ((emacs "26.1"))
>
> in org.el should probably be updated, right?

Nope. It is just that the link I shared is for WIP branch I am
developing using Emacs master. I will ensure backwards compatibility
later. For example, by converting `plist-get' to `assoc'.

>> I hope we do. If only Emacs had a way to define `case-fold-search' right
>> within the regexp itself.
>
> I would like that too, but changing that isn't easy.

I am sure that it is easy.
For example, regexp command may optionally accept a vector with first
element being regexp and second element setting a flag for
`case-fold-search'.

Of course, it is just one, maybe not the best way to implement this.

My suggestion in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=63225#56
is also compatible with this kind of approach.

> By the way, it seems that org-element-node-property-parser binds case-fold-search without actually using it. Bug?

It actually does nothing given that `org-property-re' does not contain
letters. I will remove it.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-08 19:32                               ` Mattias Engdegård
@ 2023-05-08 19:44                                 ` Ihor Radchenko
  0 siblings, 0 replies; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-08 19:44 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> 8 maj 2023 kl. 15.56 skrev Ihor Radchenko <yantar92@posteo.net>:
>
>> I am not sure what I can make out of hits/misses, but I am at least able
>> to look into frequency data, via sort re.log  | uniq -c > re-freq.log
>
> I'm mostly curious about the regexp cache behaviour. What cache size
> did you use in this run?

50

> Hardly 20, given the low miss rate? It would be interesting to see what sequence of regexps most commonly cause thrashing.

Here is the log:
https://0x0.st/HZgH.log

>> It would be even nicer if apart from frequency, there was information
>> about time taken to search for each regexp.
>
> That's a bit messier but could be done if really needed.

From this discussion, I am, so far, having an impression that Elisp
regexps can various non-obvious pitfalls that may need to be considered.
However, Org uses so many regexps that optimizing them all is not a
viable option, especially when the optimization may involve changing the
syntax.  Having the data on the major bottlenecks would at least allow
us to focus on the regexps that really slow things down in practice.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-08 19:38                                   ` Ihor Radchenko
@ 2023-05-08 19:53                                     ` Mattias Engdegård
  2023-05-09  8:36                                       ` bug#63225: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)) Ihor Radchenko
  2023-05-09 12:02                                       ` bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
  0 siblings, 2 replies; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-08 19:53 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

8 maj 2023 kl. 21.38 skrev Ihor Radchenko <yantar92@posteo.net>:

> Does this imply that simple regexps like
> 
> (looking-at-p (rx (seq bol (zero-or-more (any "\t ")) "#" (or " " eol))))
> 
> should better be implemented as the following?

Obviously this isn't practical or beneficial for any but the simplest patterns. Benchmark if you are concerned.
It is useful to know when (not) to use regexps.

> (I now start thinking that it might be more efficient to create a bunch
> or char tables and step over them to match "regexp", just like any
> finite automaton would)

I wish elisp were fast enough that we could do that in general.

> I really hope that I did not need to do all these workarounds specific to
> current implementation pitfalls of Emacs regexp compiler.

Some of them. We program for the system we have. Emacs is a very slowly moving target.

>> I would like that too, but changing that isn't easy.
> 
> I am sure that it is easy.

I didn't mean technically. Code is easy to write.

> It actually does nothing given that `org-property-re' does not contain
> letters. I will remove it.

It doesn't even matter what org-property-re contains because it isn't used with a changed case-fold-search (which is bound using let, not let*).






^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c))
  2023-05-08 19:53                                     ` Mattias Engdegård
@ 2023-05-09  8:36                                       ` Ihor Radchenko
  2023-05-09 12:02                                       ` bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
  1 sibling, 0 replies; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-09  8:36 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

>> (I now start thinking that it might be more efficient to create a bunch
>> or char tables and step over them to match "regexp", just like any
>> finite automaton would)
>
> I wish elisp were fast enough that we could do that in general.

I tried the following simple implementation for searching via char
tables:

(defun yant/make-char-re (string)
  "Create a regexp matching STRING using char tables.
Return entry char table. The values are either char tables or return value for
terminals.  This is a dumb version not trying to account for substring cycles."
  (let (root (table (make-char-table 'regexp-table)) (idx 0))
    (setq root table)
    (set-char-table-range root t root)
    (seq-map
     (lambda (char)
       (let ((next-table (make-char-table 'regexp-table root)))
	 (aset table char next-table)
         (setq table next-table)))
     string)
    (set-char-table-range table t t)
    root))

(defun yant/search-forward (regexp-table)
  "Move point after next string matching REGEXP-TABLE."
  (let ((pos (point-marker)))
    (while (and (< pos (point-max))
		(char-table-p
		 (setq regexp-table
		       (aref regexp-table (char-after pos))))
		(move-marker pos (1+ pos))))
    (unless (char-table-p regexp-table) (goto-char pos))))

DEFUN ("auto-search-forward", Fauto_search_forward, Sauto_search_forward, 1, 1, 0,
       doc: /* Search forward using CHAR-TABLE. (WIP) */)
  (Lisp_Object table)
{
  CHECK_TYPE (CHAR_TABLE_P (table), Qchar_table_p, table);
  ptrdiff_t pos_byte = PT_BYTE;
  ptrdiff_t pos_char = PT;
  if (pos_byte < BEGV_BYTE || pos_byte >= ZV_BYTE)
    return Qnil;
  table = CHAR_TABLE_REF (table, FETCH_CHAR (pos_byte));
  while (pos_byte < ZV_BYTE && CHAR_TABLE_P (table))
    {
      pos_byte += next_char_len(pos_byte);
      pos_char ++;
      table = CHAR_TABLE_REF (table, FETCH_CHAR (pos_byte));
    }
  if (pos_byte >= ZV_BYTE)
    return Qnil;
  else
    {
      SET_PT (pos_char);
      return Qt;
    }
}

(setq yant/test (yant/make-char-re ":ID:"))
(benchmark-run 10 (org-with-wide-buffer (goto-char 1) (while (yant/search-forward yant/test))))
(benchmark-run 10 (org-with-wide-buffer (goto-char 1) (while (auto-search-forward yant/test))))
(benchmark-run 10 (org-with-wide-buffer (goto-char (point-min)) (while (re-search-forward ":ID:" nil t))))

- yant/search-forward :: (24.677463432 0 0.0)
- re-search-forward   :: (0.569693718 0 0.0)
- auto-search-forward   :: (1.177098129 0 0.0)

The lisp version suffers from (1) Extra redirection when checking
function symbol value; (2) moving markers, as there is no efficient way
to move point forward from Elisp; `forward-char' is even slower.

auto-search-forward is twice slower than re-search-forward, but I expect
this to change if we ask for a more complex regexp. auto-search-forward
will not care about the complexity of the finite-state machine provided.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-08 19:53                                     ` Mattias Engdegård
  2023-05-09  8:36                                       ` bug#63225: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)) Ihor Radchenko
@ 2023-05-09 12:02                                       ` Ihor Radchenko
  2023-05-09 15:05                                         ` Mattias Engdegård
  1 sibling, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-09 12:02 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> 8 maj 2023 kl. 21.38 skrev Ihor Radchenko <yantar92@posteo.net>:
>
>> Does this imply that simple regexps like
>> 
>> (looking-at-p (rx (seq bol (zero-or-more (any "\t ")) "#" (or " " eol))))
>> 
>> should better be implemented as the following?
>
> Obviously this isn't practical or beneficial for any but the simplest patterns. Benchmark if you are concerned.
> It is useful to know when (not) to use regexps.

I did some benchmarking and your code does provide >2x improvement. Most
of it is coming from using `forward-line' instead of
`beginning-of-line'.

I benchmarked different variations:


(defun yant/test1 ()
  (save-excursion
    (forward-line 0)   ; faster than beginning-of-line
    (skip-chars-forward "[:blank:]") ; faster than looking-at-p
    (not (eolp))))   ; very cheap

(defun yant/test2 ()
  (save-excursion
    (beginning-of-line 0)
    (not (looking-at-p "[[:blank:]]*$"))))

(defun yant/test3 ()
  (save-excursion
    (beginning-of-line 0)
    (skip-chars-forward "[:blank:]") ; faster than looking-at-p
    (not (eolp))))

(defun yant/test4 ()
  (save-excursion
    (forward-line 0)   ; faster than beginning-of-line
    (not (looking-at-p "[[:blank:]]*$"))))

- forward-line + skip-chars-forward      :: (2.980 2 0.648)
- beginning-of-line + looking-at-p       :: (7.189 2 0.653)
- beginning-of-line + skip-chars-forward :: (6.833 2 0.634)
- forward-line + looking-at-p            :: (3.180 2 0.663)

>> I really hope that I did not need to do all these workarounds specific to
>> current implementation pitfalls of Emacs regexp compiler.
>
> Some of them. We program for the system we have. Emacs is a very slowly moving target.

Will it make sense to use a combination of char-after and
skip-chars-forward every single time?

I am thinking about something among the lines of

(defconst org-fancy-re
  (propertize
   (rx bol (1+ "*") " ")
   'org-re-direct
   '(progn
      (and
       (bolp)
       (> (skip-chars-forward "*") 0)
       (prog1 (eq ?\s (char-after)) (forward-char))))))

(define-inline org-fancy-looking-at-p (regexp)
  "Like `looking-at-p', but try `org-re-direct' property."
  (let ((sexp (and (inline-const-p regexp)
		   (get-text-property
		    'org-re-direct
		    (inline-const-val regexp)))))
    (if sexp (inline-quote (save-excursion ,sexp))
      (inline-quote (looking-at-p ,regexp)))))

>>> I would like that too, but changing that isn't easy.
>> 
>> I am sure that it is easy.
>
> I didn't mean technically. Code is easy to write.

May you elaborate what is the blocker then?

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-09 12:02                                       ` bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
@ 2023-05-09 15:05                                         ` Mattias Engdegård
  2023-05-09 15:56                                           ` Ihor Radchenko
  0 siblings, 1 reply; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-09 15:05 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

9 maj 2023 kl. 14.02 skrev Ihor Radchenko <yantar92@posteo.net>:

> - forward-line + skip-chars-forward      :: (2.980 2 0.648)
> - beginning-of-line + looking-at-p       :: (7.189 2 0.653)
> - beginning-of-line + skip-chars-forward :: (6.833 2 0.634)
> - forward-line + looking-at-p            :: (3.180 2 0.663)

Thanks for measuring. (The regexp cache usage is a secondary cost to looking-at-p that isn't covered by your benchmark.)

You may want to try the small improvement to skip-chars-forward that just arrived on master.

> Will it make sense to use a combination of char-after and
> skip-chars-forward every single time?

Maybe, depending on how complex that combination would be. Applications under regexp cache pressure (like Org) may gain more from it, but of course it's also a question of programming convenience.

> May you elaborate what is the blocker then?

Mostly time, but also coming up with a design that is compatible and reasonably future-safe, and convincing people that it's a good way forward (assuming it actually is). Emacs is a collaborative effort, after all.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-09 15:05                                         ` Mattias Engdegård
@ 2023-05-09 15:56                                           ` Ihor Radchenko
  2023-05-09 15:57                                             ` Mattias Engdegård
  0 siblings, 1 reply; 35+ messages in thread
From: Ihor Radchenko @ 2023-05-09 15:56 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 63225

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> 9 maj 2023 kl. 14.02 skrev Ihor Radchenko <yantar92@posteo.net>:
>
>> - forward-line + skip-chars-forward      :: (2.980 2 0.648)
>> - beginning-of-line + looking-at-p       :: (7.189 2 0.653)
>> - beginning-of-line + skip-chars-forward :: (6.833 2 0.634)
>> - forward-line + looking-at-p            :: (3.180 2 0.663)
>
> You may want to try the small improvement to skip-chars-forward that just arrived on master.

I did not get anything meaningful here. Likely because my benchmark is
not very stable (the above results did not stay the same for different
Emacs session for example, except relative numbers).

[in the same order]
(4.171 2 0.420)
(6.740 2 0.419)
(5.977 1 0.210)
(4.262 2 0.431)

>> May you elaborate what is the blocker then?
>
> Mostly time, but also coming up with a design that is compatible and reasonably future-safe, and convincing people that it's a good way forward (assuming it actually is). Emacs is a collaborative effort, after all.

Then it is not a blocker, but rather "let's discuss it first in a
dedicated, clearly marked thread".

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





^ permalink raw reply	[flat|nested] 35+ messages in thread

* bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
  2023-05-09 15:56                                           ` Ihor Radchenko
@ 2023-05-09 15:57                                             ` Mattias Engdegård
  0 siblings, 0 replies; 35+ messages in thread
From: Mattias Engdegård @ 2023-05-09 15:57 UTC (permalink / raw)
  To: Ihor Radchenko; +Cc: 63225

9 maj 2023 kl. 17.56 skrev Ihor Radchenko <yantar92@posteo.net>:

> I did not get anything meaningful here. Likely because my benchmark is
> not very stable (the above results did not stay the same for different
> Emacs session for example, except relative numbers).

Oh well. It should get rid of some stupid consing though, that was the main objective.






^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2023-05-09 15:57 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-02  7:37 bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-02 14:33 ` Mattias Engdegård
2023-05-02 15:25   ` Eli Zaretskii
2023-05-02 15:28     ` Mattias Engdegård
2023-05-02 17:30       ` Eli Zaretskii
2023-05-02 17:58         ` Ihor Radchenko
2023-05-02 16:14   ` Ihor Radchenko
2023-05-02 21:00     ` Mattias Engdegård
2023-05-02 21:21       ` Ihor Radchenko
2023-05-03  8:39         ` Mattias Engdegård
2023-05-03  9:36           ` Ihor Radchenko
2023-05-03 13:59             ` Mattias Engdegård
2023-05-03 15:05               ` Ihor Radchenko
2023-05-03 15:20                 ` Mattias Engdegård
2023-05-03 16:02                   ` Ihor Radchenko
2023-05-04  9:24                     ` Mattias Engdegård
2023-05-05 10:31                       ` Ihor Radchenko
2023-05-05 16:26                         ` Mattias Engdegård
2023-05-06 13:38                           ` Ihor Radchenko
2023-05-07 10:32                             ` Mattias Engdegård
2023-05-08 11:58                               ` Ihor Radchenko
2023-05-08 18:21                                 ` Mattias Engdegård
2023-05-08 19:38                                   ` Ihor Radchenko
2023-05-08 19:53                                     ` Mattias Engdegård
2023-05-09  8:36                                       ` bug#63225: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)) Ihor Radchenko
2023-05-09 12:02                                       ` bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-09 15:05                                         ` Mattias Engdegård
2023-05-09 15:56                                           ` Ihor Radchenko
2023-05-09 15:57                                             ` Mattias Engdegård
2023-05-07 12:45                           ` Mattias Engdegård
2023-05-08 13:56                             ` Ihor Radchenko
2023-05-08 19:32                               ` Mattias Engdegård
2023-05-08 19:44                                 ` Ihor Radchenko
2023-05-04 12:58               ` Ihor Radchenko
2023-05-02 23:36   ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).