unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
From: Ihor Radchenko <yantar92@posteo.net>
To: "Mattias Engdegård" <mattias.engdegard@gmail.com>
Cc: 63225@debbugs.gnu.org
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Mon, 08 May 2023 11:58:05 +0000	[thread overview]
Message-ID: <87zg6fjar6.fsf@localhost> (raw)
In-Reply-To: <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com>

Mattias Engdegård <mattias.engdegard@gmail.com> writes:

> What I meant was that the consolidated root regexp could just match the initial :BEGIN: line and then dispatch to different branches for parsers specific to the drawer type. That would reduce complexity and time spent at the critical parser root.

Let me elaborate on what I mean.

The relevant `cond' branches are:

(and (pcase mode
	        (`planning (eq ?* (char-after (line-beginning-position 0))))
	        ((or `property-drawer `top-comment)
		 (save-excursion
		   (beginning-of-line 0)
		   (not (looking-at-p "[[:blank:]]*$"))))
	        (_ nil))
	      (looking-at-p org-property-drawer-re))

where org-property-drawer-re is

(rx
   ;; Drawer begin line.
   bol (0+ (in " \t")) ":PROPERTIES:" (0+ (in " \t")) "\n"
   ;; Node properties.
   (*? (0+ (in " \t")) ":" (+ (not (in " \t\n:"))) ":" (* nonl) "\n")
   ;; Drawer end line.
   (0+ (in " \t")) ":END:" (0+ (in " \t")) eol)

Note that this regexp is only matches when certain conditions are met
and the beginning of the property drawer regexp matches for less general
":PROPERTIES:".

and

[now part of the giant rx]

(rx line-start (0+ (any ?\s ?\t))
      ":" (1+ (any ?- ?_ word)) ":"
      (0+ (any ?\s ?\t)) line-end)

matches for more general ":[-_[:word:]]+:".

If we make ":[-_[:word:]]+:" a new root, how will it be helpful?

>> This will account for Org syntax change, so no.
>
> Don't dismiss it out of hand. I'm not trying to optimise a few regexps, but to use examples to illustrate some useful principles that would help you improve many of them yourself.
>
> When matching something terminated by a specific character, it's particularly useful if the regexp engine can be made to understand that the terminator doesn't occur in what precedes it, as that enables it to omit backtracking points. For example, in "a*b", the engine doesn't need to save backtracking points for each 'a' matched since the sets {a} and {b} are obviously disjoint.
>
> In this case, the
>
>    (group (+ (| wordchar (in "_-"))))
>
> part is unnecessarily slow because it's an or-pattern, which also
> inhibits that optimisation.

> Fortunately it can easily be rewritten as
>
>    (group (+ (in "_-" word)))
>
> which solves both problems.

But why? Aren't (in word ?_ ?-) and (or word ?_ ?-) not the same?

>> Slight improvement in performance cannot justify syntax changes.
>
> Always question your assumptions. A slight change of spec may not be so bad after all if it buys speed and/or improves our understanding of the code. Do you know what characters have 'word' syntax in org-mode? If not, better be careful before using them in regexps.

Honestly, I have no clue how syntax tables in Org mode are working.
I once tried to alter syntax table inside code blocks and Org got
completely broken. So, I simply do not dare to touch syntax-dependent
matches.

Ideally, Org should use dedicated, unmutable, syntax tables when
parsing. The difficulty though is that we need to support various
languages, including CJK syntax where syntax expectation quite different
from Latin.

Your suggestions about using (not (in ...)) in place of (in ...) are
good, but I am afraid to break CJK cases where people can use unexpected
set of characters. I was bitten by this in the past.

And there is consideration about not breaking the syntax.
It is not just about Elisp part - Org is a markup standard and for
drawers in particular we have defined the syntax like

https://orgmode.org/worg/org-syntax.html#Drawers

:NAME:
CONTENTS
:end:

NAME
A string consisting of word-constituent characters, hyphens and underscores (-_).

To change this, we should weigh on the possible impact on the external
Org parsers, not implemented in Elisp.

> (Looks like org-tags-expand permanently adds @ and _ to the set of word chars. A bug, surely?)

Yeah. Fixed now.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=6e6354c07

>> (defvar org--item-re-cache nil
>>  "Results cache for `org-item-re'.")
>> (defsubst org-item-re ()
>> ...
>> It should not give much overhead.
>
> Maybe, but you still cons each time. (And remember that the plist-get equality funarg is new in Emacs 29.)

Sure it does.
It is just one of the variable parts of Org syntax that might be
changed. There are ways to make this into constant, but it is a fragile
area of the code that I do not want to touch without a reason.
(Especially given that I am not familiar with org-list.el)

> Also make sure that if the same regexp is used in multiple places, it should always use the same `case-fold-search` value or they will be considered different for cache purposes.

I hope we do. If only Emacs had a way to define `case-fold-search' right
within the regexp itself.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>





  reply	other threads:[~2023-05-08 11:58 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-02  7:37 bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-02 14:33 ` Mattias Engdegård
2023-05-02 15:25   ` Eli Zaretskii
2023-05-02 15:28     ` Mattias Engdegård
2023-05-02 17:30       ` Eli Zaretskii
2023-05-02 17:58         ` Ihor Radchenko
2023-05-02 16:14   ` Ihor Radchenko
2023-05-02 21:00     ` Mattias Engdegård
2023-05-02 21:21       ` Ihor Radchenko
2023-05-03  8:39         ` Mattias Engdegård
2023-05-03  9:36           ` Ihor Radchenko
2023-05-03 13:59             ` Mattias Engdegård
2023-05-03 15:05               ` Ihor Radchenko
2023-05-03 15:20                 ` Mattias Engdegård
2023-05-03 16:02                   ` Ihor Radchenko
2023-05-04  9:24                     ` Mattias Engdegård
2023-05-05 10:31                       ` Ihor Radchenko
2023-05-05 16:26                         ` Mattias Engdegård
2023-05-06 13:38                           ` Ihor Radchenko
2023-05-07 10:32                             ` Mattias Engdegård
2023-05-08 11:58                               ` Ihor Radchenko [this message]
2023-05-08 18:21                                 ` Mattias Engdegård
2023-05-08 19:38                                   ` Ihor Radchenko
2023-05-08 19:53                                     ` Mattias Engdegård
2023-05-09  8:36                                       ` bug#63225: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)) Ihor Radchenko
2023-05-09 12:02                                       ` bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-09 15:05                                         ` Mattias Engdegård
2023-05-09 15:56                                           ` Ihor Radchenko
2023-05-09 15:57                                             ` Mattias Engdegård
2023-05-07 12:45                           ` Mattias Engdegård
2023-05-08 13:56                             ` Ihor Radchenko
2023-05-08 19:32                               ` Mattias Engdegård
2023-05-08 19:44                                 ` Ihor Radchenko
2023-05-04 12:58               ` Ihor Radchenko
2023-05-02 23:36   ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87zg6fjar6.fsf@localhost \
    --to=yantar92@posteo.net \
    --cc=63225@debbugs.gnu.org \
    --cc=mattias.engdegard@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).