From: Ihor Radchenko <yantar92@posteo.net>
To: "Mattias Engdegård" <mattias.engdegard@gmail.com>
Cc: 63225@debbugs.gnu.org
Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Mon, 08 May 2023 19:38:59 +0000 [thread overview]
Message-ID: <875y923964.fsf@localhost> (raw)
In-Reply-To: <C037659A-DBF7-4071-A54F-84CC1D7D18AF@gmail.com>
Mattias Engdegård <mattias.engdegard@gmail.com> writes:
>> (save-excursion
>> (beginning-of-line 0)
>> (not (looking-at-p "[[:blank:]]*$"))))
>
> I wonder if that last part isn't better written as
>
> (save-excursion
> (forward-line 0) ; faster than beginning-of-line
Fair point. It is probably worth looking through Org sources and
replacing all those `beginning-of-line' and `end-of-line' calls. I doubt
that we even intend to use fields for real.
> (skip-chars-forward "[:blank:]") ; faster than looking-at-p
> (not (eolp))) ; very cheap
Hmm. I feel confused.
Does this imply that simple regexps like
(looking-at-p (rx (seq bol (zero-or-more (any "\t ")) "#" (or " " eol))))
should better be implemented as the following?
(and (bolp)
(skip-chars-forward " \t")
(eq (char-after) ?#) (forward-char)
(or (eolp) (eq (char-after) ?\s)))
(I now start thinking that it might be more efficient to create a bunch
or char tables and step over them to match "regexp", just like any
finite automaton would)
> But yes, I sort of understand what you are getting at (except the business with the MODE parameter which is still a bit mysterious to me).
MODE parameter is used because we constrain what kinds of syntax
elements are allowed inside other. For example, see
`org-element-object-restrictions'. And within
`org-element--current-element', MODE is used, for example, to constrain
planning/property drawer to be only the first child of a parent heading,
parsed earlier.
>> [now part of the giant rx]
>>
>> (rx line-start (0+ (any ?\s ?\t))
>> ":" (1+ (any ?- ?_ word)) ":"
>> (0+ (any ?\s ?\t)) line-end)
>
> Any reason you don't capture the part between the colons here, so that you don't need to match it later on?
That would demand all the callers of `org-element-drawer-parser' to set
match data appropriately. Which is doable, but headache for maintenance.
We sometimes do call parsers explicitly not from inside
`org-element--current-element'.
>> But why? Aren't (in word ?_ ?-) and (or word ?_ ?-) not the same?
>
> "[-_[:word:]]" and "\\w\\|[_-]" indeed match the same thing but they don't generate the same regexp bytecode -- the former is faster. (In this case rx makes a literal translation to those strings but we should probably make it optimise to the faster regexp.)
>
> There is a regexp disassembler for the really curious but it doesn't come with Emacs.
I really hope that I did not need to do all these workarounds specific to
current implementation pitfalls of Emacs regexp compiler.
>>> Maybe, but you still cons each time. (And remember that the plist-get equality funarg is new in Emacs 29.)
>>
>> Sure it does.
>> It is just one of the variable parts of Org syntax that might be
>> changed. There are ways to make this into constant, but it is a fragile
>> area of the code that I do not want to touch without a reason.
>> (Especially given that I am not familiar with org-list.el)
>
> So it's fine to use elisp constructs new in Emacs 29 in Org? Then the line
>
> ;; Package-Requires: ((emacs "26.1"))
>
> in org.el should probably be updated, right?
Nope. It is just that the link I shared is for WIP branch I am
developing using Emacs master. I will ensure backwards compatibility
later. For example, by converting `plist-get' to `assoc'.
>> I hope we do. If only Emacs had a way to define `case-fold-search' right
>> within the regexp itself.
>
> I would like that too, but changing that isn't easy.
I am sure that it is easy.
For example, regexp command may optionally accept a vector with first
element being regexp and second element setting a flag for
`case-fold-search'.
Of course, it is just one, maybe not the best way to implement this.
My suggestion in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=63225#56
is also compatible with this kind of approach.
> By the way, it seems that org-element-node-property-parser binds case-fold-search without actually using it. Bug?
It actually does nothing given that `org-property-re' does not contain
letters. I will remove it.
--
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>
next prev parent reply other threads:[~2023-05-08 19:38 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-02 7:37 bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-02 14:33 ` Mattias Engdegård
2023-05-02 15:25 ` Eli Zaretskii
2023-05-02 15:28 ` Mattias Engdegård
2023-05-02 17:30 ` Eli Zaretskii
2023-05-02 17:58 ` Ihor Radchenko
2023-05-02 16:14 ` Ihor Radchenko
2023-05-02 21:00 ` Mattias Engdegård
2023-05-02 21:21 ` Ihor Radchenko
2023-05-03 8:39 ` Mattias Engdegård
2023-05-03 9:36 ` Ihor Radchenko
2023-05-03 13:59 ` Mattias Engdegård
2023-05-03 15:05 ` Ihor Radchenko
2023-05-03 15:20 ` Mattias Engdegård
2023-05-03 16:02 ` Ihor Radchenko
2023-05-04 9:24 ` Mattias Engdegård
2023-05-05 10:31 ` Ihor Radchenko
2023-05-05 16:26 ` Mattias Engdegård
2023-05-06 13:38 ` Ihor Radchenko
2023-05-07 10:32 ` Mattias Engdegård
2023-05-08 11:58 ` Ihor Radchenko
2023-05-08 18:21 ` Mattias Engdegård
2023-05-08 19:38 ` Ihor Radchenko [this message]
2023-05-08 19:53 ` Mattias Engdegård
2023-05-09 8:36 ` bug#63225: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)) Ihor Radchenko
2023-05-09 12:02 ` bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Ihor Radchenko
2023-05-09 15:05 ` Mattias Engdegård
2023-05-09 15:56 ` Ihor Radchenko
2023-05-09 15:57 ` Mattias Engdegård
2023-05-07 12:45 ` Mattias Engdegård
2023-05-08 13:56 ` Ihor Radchenko
2023-05-08 19:32 ` Mattias Engdegård
2023-05-08 19:44 ` Ihor Radchenko
2023-05-04 12:58 ` Ihor Radchenko
2023-05-02 23:36 ` Po Lu via Bug reports for GNU Emacs, the Swiss army knife of text editors
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/emacs/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=875y923964.fsf@localhost \
--to=yantar92@posteo.net \
--cc=63225@debbugs.gnu.org \
--cc=mattias.engdegard@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://git.savannah.gnu.org/cgit/emacs.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).