* How are regexen implemented in Emacs?
@ 2022-12-12 17:15 Marcin Borkowski
2022-12-12 18:16 ` Stefan Monnier via Users list for the GNU Emacs text editor
2022-12-15 3:41 ` Emanuel Berg
0 siblings, 2 replies; 24+ messages in thread
From: Marcin Borkowski @ 2022-12-12 17:15 UTC (permalink / raw)
To: Help Gnu Emacs mailing list
Hi all,
some time ago I read (well, skimmed) this article:
https://swtch.com/~rsc/regexp/regexp1.html
I looked at
https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Problems.html
and started to wonder if the hints there mean that Emacs has a "naive",
backtracking regex engine or a FA-based one?
Before I get yelled at, let me make it clear: I'm not accusing Emacs
devs of sloppy programming. In fact, my confusion might result from me
not understanding something. But the warnings in the manual about some
regexen being slow worry me a tiny little bit.
Best,
--
Marcin Borkowski
http://mbork.pl
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 17:15 How are regexen implemented in Emacs? Marcin Borkowski
@ 2022-12-12 18:16 ` Stefan Monnier via Users list for the GNU Emacs text editor
2022-12-12 19:09 ` Marcin Borkowski
2022-12-15 3:41 ` Emanuel Berg
1 sibling, 1 reply; 24+ messages in thread
From: Stefan Monnier via Users list for the GNU Emacs text editor @ 2022-12-12 18:16 UTC (permalink / raw)
To: help-gnu-emacs
> https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Problems.html
> and started to wonder if the hints there mean that Emacs has a "naive",
> backtracking regex engine or a FA-based one?
Naive!
Stefan
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 18:16 ` Stefan Monnier via Users list for the GNU Emacs text editor
@ 2022-12-12 19:09 ` Marcin Borkowski
2022-12-12 19:17 ` Stefan Monnier via Users list for the GNU Emacs text editor
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Marcin Borkowski @ 2022-12-12 19:09 UTC (permalink / raw)
To: Stefan Monnier; +Cc: help-gnu-emacs
On 2022-12-12, at 19:16, Stefan Monnier via Users list for the GNU Emacs text editor <help-gnu-emacs@gnu.org> wrote:
>> https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Problems.html
>> and started to wonder if the hints there mean that Emacs has a "naive",
>> backtracking regex engine or a FA-based one?
>
> Naive!
And what are the reasons? Out of curiosity: would implementing
a FA-based one be a very big undertaking? (And no, I won't do it, if
only because I don't know C.)
Best,
--
Marcin Borkowski
http://mbork.pl
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 19:09 ` Marcin Borkowski
@ 2022-12-12 19:17 ` Stefan Monnier via Users list for the GNU Emacs text editor
2022-12-12 20:31 ` Marcin Borkowski
2022-12-12 20:01 ` tomas
2022-12-15 3:45 ` Emanuel Berg
2 siblings, 1 reply; 24+ messages in thread
From: Stefan Monnier via Users list for the GNU Emacs text editor @ 2022-12-12 19:17 UTC (permalink / raw)
To: help-gnu-emacs
>>> https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Problems.html
>>> and started to wonder if the hints there mean that Emacs has a "naive",
>>> backtracking regex engine or a FA-based one?
>> Naive!
> And what are the reasons?
Largely historical, but changing it is a lot of work with uncertain
outcome:
- Reproducing the exact behavior of the naive match with a non-naive
match can be tricky.
- The naive approach sucks when the match fails in many
different ways, but it's quite efficient when the match succeeds
without any backtracking.
Ideally, we'd like to replace the regexp engine with a "standard one",
but it's not easy to find one that satisfies all the requirements
(e.g. ability to add Emacs-specific elements like \_<, \s, and \c,
ability to operate on "two-part strings" (i.e. with a gap in the
middle), ...).
Stefan
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 19:17 ` Stefan Monnier via Users list for the GNU Emacs text editor
@ 2022-12-12 20:31 ` Marcin Borkowski
2022-12-12 21:19 ` Stefan Monnier
0 siblings, 1 reply; 24+ messages in thread
From: Marcin Borkowski @ 2022-12-12 20:31 UTC (permalink / raw)
To: Stefan Monnier; +Cc: help-gnu-emacs
On 2022-12-12, at 20:17, Stefan Monnier via Users list for the GNU Emacs text editor <help-gnu-emacs@gnu.org> wrote:
>>>> https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Problems.html
>>>> and started to wonder if the hints there mean that Emacs has a "naive",
>>>> backtracking regex engine or a FA-based one?
>>> Naive!
>> And what are the reasons?
>
> Largely historical, but changing it is a lot of work with uncertain
> outcome:
> - Reproducing the exact behavior of the naive match with a non-naive
> match can be tricky.
> - The naive approach sucks when the match fails in many
> different ways, but it's quite efficient when the match succeeds
> without any backtracking.
> Ideally, we'd like to replace the regexp engine with a "standard one",
> but it's not easy to find one that satisfies all the requirements
> (e.g. ability to add Emacs-specific elements like \_<, \s, and \c,
> ability to operate on "two-part strings" (i.e. with a gap in the
> middle), ...).
Thanks, that's pretty interesting.
The linked article sounded a bit like "only idiots implement regexen
in the naive way", and I was pretty sure Emacs devs are not idiots - but
now I understand the reasons better.
Best,
--
Marcin Borkowski
http://mbork.pl
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 20:31 ` Marcin Borkowski
@ 2022-12-12 21:19 ` Stefan Monnier
2022-12-13 5:01 ` tomas
0 siblings, 1 reply; 24+ messages in thread
From: Stefan Monnier @ 2022-12-12 21:19 UTC (permalink / raw)
To: Marcin Borkowski; +Cc: help-gnu-emacs
> The linked article sounded a bit like "only idiots implement regexen
> in the naive way", and I was pretty sure Emacs devs are not idiots - but
> now I understand the reasons better.
I fully agree that it doesn't make sense to *start* with
a backtracking implementation, yes.
But once you've invested in one, it's harder to move to
something better.
This said, it *would* be better. Not only in terms of eliminating the
pathological blow ups, but it also offers opportunity to get new
functionality, such as the ability to capture the state of a regexp
match at a specific buffer position (so you can perform a multiline
regexp match one line at a time). It could also make it much more
reasonable to add the possibility to run ELisp code from within the
regexp match engine (e.g. add a \p(NAME) entry which calls the NAME
ELisp function).
Stefan
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 21:19 ` Stefan Monnier
@ 2022-12-13 5:01 ` tomas
2022-12-13 16:16 ` Stefan Monnier via Users list for the GNU Emacs text editor
2022-12-15 5:22 ` Emanuel Berg
0 siblings, 2 replies; 24+ messages in thread
From: tomas @ 2022-12-13 5:01 UTC (permalink / raw)
To: help-gnu-emacs
[-- Attachment #1: Type: text/plain, Size: 1576 bytes --]
On Mon, Dec 12, 2022 at 04:19:19PM -0500, Stefan Monnier wrote:
> > The linked article sounded a bit like "only idiots implement regexen
> > in the naive way", and I was pretty sure Emacs devs are not idiots - but
> > now I understand the reasons better.
There comes my favourite motto: "all generalizations suck" (Michael
Heerdegen enjoyed it a couple of threads back).
Whoever uses a library these days uses PCRE, and this is, AFAIK, a
DFA-with-backtracking thingy. Note that I haven't read the code,
so I might well be wrong.
> I fully agree that it doesn't make sense to *start* with
> a backtracking implementation, yes.
> But once you've invested in one, it's harder to move to
> something better.
>
> This said, it *would* be better. Not only in terms of eliminating the
> pathological blow ups, but it also offers opportunity to get new
> functionality, such as the ability to capture the state of a regexp
> match at a specific buffer position (so you can perform a multiline
> regexp match one line at a time). It could also make it much more
> reasonable to add the possibility to run ELisp code from within the
> regexp match engine (e.g. add a \p(NAME) entry which calls the NAME
> ELisp function).
Perl does the latter. That has saved my bacon from time to time.
So that seems possible with a backtracker, too. Don't ask me how,
though :-)
Saving state at any point would be cool -- you could easily invert
control (feeding the regexp machine a spoonful at a time). Think
network or an abstract buffer.
Cheers
--
t
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 19:09 ` Marcin Borkowski
2022-12-12 19:17 ` Stefan Monnier via Users list for the GNU Emacs text editor
@ 2022-12-12 20:01 ` tomas
2022-12-12 20:32 ` Marcin Borkowski
` (2 more replies)
2022-12-15 3:45 ` Emanuel Berg
2 siblings, 3 replies; 24+ messages in thread
From: tomas @ 2022-12-12 20:01 UTC (permalink / raw)
To: help-gnu-emacs
[-- Attachment #1: Type: text/plain, Size: 901 bytes --]
On Mon, Dec 12, 2022 at 08:09:42PM +0100, Marcin Borkowski wrote:
>
> On 2022-12-12, at 19:16, Stefan Monnier via Users list for the GNU Emacs text editor <help-gnu-emacs@gnu.org> wrote:
>
> >> https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Problems.html
> >> and started to wonder if the hints there mean that Emacs has a "naive",
> >> backtracking regex engine or a FA-based one?
> >
> > Naive!
>
> And what are the reasons? Out of curiosity: would implementing
> a FA-based one be a very big undertaking? (And no, I won't do it, if
> only because I don't know C.)
Nitpick: The Emacs regexp engine is a FA based one, but a DFA,
not a NFA (the famous Thompson NFA).
NFAs seem to be quite the beast, judging by how few implementations
are out there. I don't even know how easy it is to implement, e.g.
capture or (gasp!) back references,
Cheers
--
t
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 20:01 ` tomas
@ 2022-12-12 20:32 ` Marcin Borkowski
2022-12-12 21:24 ` Stefan Monnier via Users list for the GNU Emacs text editor
2022-12-15 5:18 ` Emanuel Berg
2022-12-15 16:46 ` Stefan Monnier via Users list for the GNU Emacs text editor
2 siblings, 1 reply; 24+ messages in thread
From: Marcin Borkowski @ 2022-12-12 20:32 UTC (permalink / raw)
To: tomas; +Cc: help-gnu-emacs
On 2022-12-12, at 21:01, tomas@tuxteam.de wrote:
> On Mon, Dec 12, 2022 at 08:09:42PM +0100, Marcin Borkowski wrote:
>>
>> On 2022-12-12, at 19:16, Stefan Monnier via Users list for the GNU Emacs text editor <help-gnu-emacs@gnu.org> wrote:
>>
>> >> https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Problems.html
>> >> and started to wonder if the hints there mean that Emacs has a "naive",
>> >> backtracking regex engine or a FA-based one?
>> >
>> > Naive!
>>
>> And what are the reasons? Out of curiosity: would implementing
>> a FA-based one be a very big undertaking? (And no, I won't do it, if
>> only because I don't know C.)
>
> Nitpick: The Emacs regexp engine is a FA based one, but a DFA,
> not a NFA (the famous Thompson NFA).
>
> NFAs seem to be quite the beast, judging by how few implementations
> are out there. I don't even know how easy it is to implement, e.g.
> capture or (gasp!) back references,
Very interesting. Re: back references, are even "regexen" with back
references still real regular expressions?
Best,
--
Marcin Borkowski
http://mbork.pl
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 20:32 ` Marcin Borkowski
@ 2022-12-12 21:24 ` Stefan Monnier via Users list for the GNU Emacs text editor
0 siblings, 0 replies; 24+ messages in thread
From: Stefan Monnier via Users list for the GNU Emacs text editor @ 2022-12-12 21:24 UTC (permalink / raw)
To: help-gnu-emacs
> Very interesting. Re: back references, are even "regexen" with back
> references still real regular expressions?
No, back references aren't part of the "regular language" family.
They can't easily be accommodated by the usual DFA/NFA
non-backtracking matchers (and IIRC they are fundamentally
exponential-complexity in the worst case), tho some do handle them
(by falling back to a backtracking algorithm, AFAIK).
Stefan
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 20:01 ` tomas
2022-12-12 20:32 ` Marcin Borkowski
@ 2022-12-15 5:18 ` Emanuel Berg
2022-12-15 13:38 ` tomas
2022-12-15 16:46 ` Stefan Monnier via Users list for the GNU Emacs text editor
2 siblings, 1 reply; 24+ messages in thread
From: Emanuel Berg @ 2022-12-15 5:18 UTC (permalink / raw)
To: help-gnu-emacs
tomas wrote:
> Nitpick: The Emacs regexp engine is a FA based one, but
> a DFA, not a NFA (the famous Thompson NFA).
DFA = Deterministic finite automaton
NFA = Nondeterministic finite automaton
Finite - it's a state machine, finite means the number of
states are not infinite.
A set of rules crunches the input and determines transitions
from one state to another, I can imagine deterministic such
rules - but nondeterministic, what does that mean?
Maybe that from one state S and input I there are not one but
several ways to go, one can go to S' but also S'' and both
"moves" are legal?
--
underground experts united
https://dataswamp.org/~incal
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 20:01 ` tomas
2022-12-12 20:32 ` Marcin Borkowski
2022-12-15 5:18 ` Emanuel Berg
@ 2022-12-15 16:46 ` Stefan Monnier via Users list for the GNU Emacs text editor
2022-12-15 17:56 ` tomas
2 siblings, 1 reply; 24+ messages in thread
From: Stefan Monnier via Users list for the GNU Emacs text editor @ 2022-12-15 16:46 UTC (permalink / raw)
To: help-gnu-emacs
> Nitpick: The Emacs regexp engine is a FA based one, but a DFA,
> not a NFA (the famous Thompson NFA).
No, it's definitely not DFA because it uses a stack to implement
the backtracking.
It's usually considered to be an NFA that's processed via backtracking
rather than via "parallel execution" as "Thompson's NFA" does.
[ For Haskeller's out there, it's a bit like the difference between the
Cont monad and the List monad with the added twist that "Thompson's
NFA" replaces the List monad with a "Set monad", which is what changes
its complexity. ]
> NFAs seem to be quite the beast, judging by how few implementations
> are out there. I don't even know how easy it is to implement, e.g.
> capture or (gasp!) back references,
It's not hard to implement, no.
At least not until you consider things like *? and backreferences.
Stefan
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-15 16:46 ` Stefan Monnier via Users list for the GNU Emacs text editor
@ 2022-12-15 17:56 ` tomas
2022-12-17 0:13 ` Emanuel Berg
0 siblings, 1 reply; 24+ messages in thread
From: tomas @ 2022-12-15 17:56 UTC (permalink / raw)
To: help-gnu-emacs
[-- Attachment #1: Type: text/plain, Size: 1267 bytes --]
On Thu, Dec 15, 2022 at 11:46:13AM -0500, Stefan Monnier via Users list for the GNU Emacs text editor wrote:
> > Nitpick: The Emacs regexp engine is a FA based one, but a DFA,
> > not a NFA (the famous Thompson NFA).
>
> No, it's definitely not DFA because it uses a stack to implement
> the backtracking.
>
> It's usually considered to be an NFA that's processed via backtracking
> rather than via "parallel execution" as "Thompson's NFA" does.
I stand corrected. Thanks.
> [ For Haskeller's out there, it's a bit like the difference between the
> Cont monad and the List monad with the added twist that "Thompson's
> NFA" replaces the List monad with a "Set monad", which is what changes
> its complexity. ]
Shudder :-)
> > NFAs seem to be quite the beast, judging by how few implementations
> > are out there. I don't even know how easy it is to implement, e.g.
> > capture or (gasp!) back references,
>
> It's not hard to implement, no.
> At least not until you consider things like *? and backreferences.
Interestingly, Tcl seems to "do" Thompson (the article mentioned
upthread has Tcl as one of the examples, AFAIR), and they do back
references and non-greedy repetitions. There seems to be a way.
Cheers
--
t
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 19:09 ` Marcin Borkowski
2022-12-12 19:17 ` Stefan Monnier via Users list for the GNU Emacs text editor
2022-12-12 20:01 ` tomas
@ 2022-12-15 3:45 ` Emanuel Berg
2022-12-15 13:22 ` tomas
2 siblings, 1 reply; 24+ messages in thread
From: Emanuel Berg @ 2022-12-15 3:45 UTC (permalink / raw)
To: help-gnu-emacs
Marcin Borkowski wrote:
>> Naive!
>
> And what are the reasons? Out of curiosity: would
> implementing a FA-based one be a very big undertaking? (And
> no, I won't do it, if only because I don't know C.)
C is not so difficult, but probably someone else did it
already so it would be a matter of integrating and reusing
that rather than writing it all over from scratch interesting
as that may sound ...
--
underground experts united
https://dataswamp.org/~incal
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-15 3:45 ` Emanuel Berg
@ 2022-12-15 13:22 ` tomas
0 siblings, 0 replies; 24+ messages in thread
From: tomas @ 2022-12-15 13:22 UTC (permalink / raw)
To: help-gnu-emacs
[-- Attachment #1: Type: text/plain, Size: 1035 bytes --]
On Thu, Dec 15, 2022 at 04:45:05AM +0100, Emanuel Berg wrote:
> Marcin Borkowski wrote:
>
> >> Naive!
> >
> > And what are the reasons? Out of curiosity: would
> > implementing a FA-based one be a very big undertaking? (And
> > no, I won't do it, if only because I don't know C.)
>
> C is not so difficult, but probably someone else did it
> already so it would be a matter of integrating and reusing
> that rather than writing it all over from scratch interesting
> as that may sound ...
The devil is in the details: each long-living RE implementation
develops its own useful warts, and they are all different. Stefan
already hinted at it, but you can't underestimate the weight of
(a) Emacs's existing corpus of code hidden away in REs and (b)
the one hidden away in its diverse user base's brains and
muscle memories.
So you better come up with some fully backward compatible thing,
whatever the implementation. I think the easiest would be to
substitute one little piece at a time :-)
Cheers
--
t
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: How are regexen implemented in Emacs?
2022-12-12 17:15 How are regexen implemented in Emacs? Marcin Borkowski
2022-12-12 18:16 ` Stefan Monnier via Users list for the GNU Emacs text editor
@ 2022-12-15 3:41 ` Emanuel Berg
1 sibling, 0 replies; 24+ messages in thread
From: Emanuel Berg @ 2022-12-15 3:41 UTC (permalink / raw)
To: help-gnu-emacs
Marcin Borkowski wrote:
> I looked at [...] and started to wonder if the hints there
> mean that Emacs has a "naive", backtracking regex engine or
> a FA-based one?
FA = Finite Automaton, buzzword from Computer Science
Automata theory ...
https://en.wikipedia.org/wiki/Automata_theory
--
underground experts united
https://dataswamp.org/~incal
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2022-12-19 20:31 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-12-12 17:15 How are regexen implemented in Emacs? Marcin Borkowski
2022-12-12 18:16 ` Stefan Monnier via Users list for the GNU Emacs text editor
2022-12-12 19:09 ` Marcin Borkowski
2022-12-12 19:17 ` Stefan Monnier via Users list for the GNU Emacs text editor
2022-12-12 20:31 ` Marcin Borkowski
2022-12-12 21:19 ` Stefan Monnier
2022-12-13 5:01 ` tomas
2022-12-13 16:16 ` Stefan Monnier via Users list for the GNU Emacs text editor
2022-12-15 5:22 ` Emanuel Berg
2022-12-16 17:39 ` Akib Azmain Turja
2022-12-17 0:31 ` Emanuel Berg
2022-12-18 0:54 ` Jean Louis
2022-12-19 20:31 ` Emanuel Berg
2022-12-12 20:01 ` tomas
2022-12-12 20:32 ` Marcin Borkowski
2022-12-12 21:24 ` Stefan Monnier via Users list for the GNU Emacs text editor
2022-12-15 5:18 ` Emanuel Berg
2022-12-15 13:38 ` tomas
2022-12-15 16:46 ` Stefan Monnier via Users list for the GNU Emacs text editor
2022-12-15 17:56 ` tomas
2022-12-17 0:13 ` Emanuel Berg
2022-12-15 3:45 ` Emanuel Berg
2022-12-15 13:22 ` tomas
2022-12-15 3:41 ` Emanuel Berg
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/emacs.git
https://git.savannah.gnu.org/cgit/emacs/org-mode.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.