unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
@ 2014-09-28  8:55 Alan Mackenzie
  2014-09-28 10:56 ` Andreas Schwab
  0 siblings, 1 reply; 13+ messages in thread
From: Alan Mackenzie @ 2014-09-28  8:55 UTC (permalink / raw)
  To: 18577

Hi, Emacs.

In the trunk, emacs -Q.  Visit our favourite big file, xdisp.c.

With point at BOB, do C-M-s and enter this regular expression at the
prompt:

/\*\(\([^'*]\|\*[^/']\)*\*?'\([^'*]\|\*[^/']\)*\*?'\)*\([^'*]\|\*[^'/]\)*\*?'\([^'*]\|\*[^/']\)*\*?\*/

Press C-s.  This gives the error message "[(error Stack overflow in
regexp matcher)]".

This feels like a bug.  Regexp searching with this expression (which
finds a block comment containing an odd number of apostrophes) works fine
in the rest of the buffer.  Only in the second (large) comment in xdisp.c
does it trigger this error.  Surely an iterative regexp (which all
regexps are, surely?) shouldn't be triggering unbounded recursive
behaviour in the search engine.

[N.B. the regexp when parsed a bit looks like this:

/\*\(                                              \)*\(     \|       \)*\*?'\(     \|       \)*\*?\*/
     \(     \|       \)*\*?'\(     \|       \)*\*?'     [^'*]  \*[^'/]         [^'*]  \*[^/']
       [^'*]  \*[^/']         [^'*]  \*[^/']

].  

-- 
Alan Mackenzie (Nuremberg, Germany).





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
  2014-09-28  8:55 bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)] Alan Mackenzie
@ 2014-09-28 10:56 ` Andreas Schwab
  2014-09-28 12:37   ` Alan Mackenzie
  0 siblings, 1 reply; 13+ messages in thread
From: Andreas Schwab @ 2014-09-28 10:56 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: 18577

Alan Mackenzie <acm@muc.de> writes:

> With point at BOB, do C-M-s and enter this regular expression at the
> prompt:
>
> /\*\(\([^'*]\|\*[^/']\)*\*?'\([^'*]\|\*[^/']\)*\*?'\)*\([^'*]\|\*[^'/]\)*\*?'\([^'*]\|\*[^/']\)*\*?\*/

\(...\(...\)*...\)* is bad.  \(...\)*\(...\)* is also bad.  Both can cause
combinatorial explosions in backtracking.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
  2014-09-28 10:56 ` Andreas Schwab
@ 2014-09-28 12:37   ` Alan Mackenzie
  2014-09-28 12:48     ` Andreas Schwab
  0 siblings, 1 reply; 13+ messages in thread
From: Alan Mackenzie @ 2014-09-28 12:37 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: 18577

Good afternoon, Andreas!

On Sun, Sep 28, 2014 at 12:56:48PM +0200, Andreas Schwab wrote:
> Alan Mackenzie <acm@muc.de> writes:

> > With point at BOB, do C-M-s and enter this regular expression at the
> > prompt:

> > /\*\(\([^'*]\|\*[^/']\)*\*?'\([^'*]\|\*[^/']\)*\*?'\)*\([^'*]\|\*[^'/]\)*\*?'\([^'*]\|\*[^/']\)*\*?\*/

> \(...\(...\)*...\)* is bad.  \(...\)*\(...\)* is also bad.

But they both seem essential to the regexp's purpose.

> Both can cause combinatorial explosions in backtracking.

Is this a defect in my regexp or in the regexp engine?  If the former,
how could I rewrite the regexp so that it would not hit these problems?

> Andreas.

> -- 
> Andreas Schwab, schwab@linux-m68k.org
> GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
> "And now for something completely different."

-- 
Alan Mackenzie (Nuremberg, Germany).





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
  2014-09-28 12:37   ` Alan Mackenzie
@ 2014-09-28 12:48     ` Andreas Schwab
  2014-09-28 17:35       ` Stefan Monnier
  0 siblings, 1 reply; 13+ messages in thread
From: Andreas Schwab @ 2014-09-28 12:48 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: 18577

Alan Mackenzie <acm@muc.de> writes:

> Is this a defect in my regexp or in the regexp engine?

It is fundamental to the way regexp matching works.

> If the former, how could I rewrite the regexp so that it would not hit
> these problems?

Avoid matching the null string in infinite ways.  If you need to match
the null string, make sure it is anchored with non-null matches.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
  2014-09-28 12:48     ` Andreas Schwab
@ 2014-09-28 17:35       ` Stefan Monnier
  2014-11-27  8:44         ` Tassilo Horn
  2021-10-23  2:47         ` Stefan Kangas
  0 siblings, 2 replies; 13+ messages in thread
From: Stefan Monnier @ 2014-09-28 17:35 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Alan Mackenzie, 18577

>> Is this a defect in my regexp or in the regexp engine?
> It is fundamental to the way regexp matching works.

To clarify: it is fundamental to the way *our* regexp engine works.

As long as the regexp doesn't use backrefs, it can be matched
efficiently, without backtracking.  Of course using \(..\) (as opposed
to using \(?:..\)) can also make the problem harder since the various
different (but largely equivalent) ways to match might need to be
distinguishable via match-data.

But even tho your regexp doesn't use backrefs, and even if you replace
all \(..\) with \(?:..\), your regexp will still cause problems because
our regexp engine does not try to optimize these kinds of cases.

So you have to do it by hand.

>> If the former, how could I rewrite the regexp so that it would not hit
>> these problems?

Maybe something like:

/\*\(<insidecomment>\)*\*+/

where <insidecomment> is something like

   [^'*]\|\*+\([^/'*]\|'<afterquote>\)\|'<afterquote>

where <afterquote> is something like

   \([^'*]\|\*+[^/'*]\)*'

Tho this will still push a backtrack point for every character.
Maybe better would be something like

/\*[^'*]*\(<insidecomment>\)*\*+/

where <insidecomment> is something like

   \(\*+[^/'*]\|\**'<afterquote>\)[^'*]*

where <afterquote> is still something like

   \([^'*]\|\*+[^/'*]\)*'

so that we should only push a backtrace point when we see a * or a ' in
the comment.


        Stefan





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
  2014-09-28 17:35       ` Stefan Monnier
@ 2014-11-27  8:44         ` Tassilo Horn
  2021-10-23  2:47         ` Stefan Kangas
  1 sibling, 0 replies; 13+ messages in thread
From: Tassilo Horn @ 2014-11-27  8:44 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 18577, Andreas Schwab, Alan Mackenzie

We have a similar bug report for AUCTeX/RefTeX, although the regex
doesn't look that bad as Alan's.

Too reproduce, create a file with 10000 lines with the word foo, and
evaluate this:

  (re-search-forward "^[^%]*\\\\usepackage.*{biblatex}" nil t)

Here, there's no double-* or nested-* but still you get a stack overflow
in the regexp matcher.

Ah, the problem here is that [^%] also matches \n.  I've just seen that
this regex has already been fixed to

  "^[^%\n]*?\\\\usepackage.*{biblatex}"

on master, and I've backported the fix to the emacs-24 branch.

Bye,
Tassilo





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
  2014-09-28 17:35       ` Stefan Monnier
  2014-11-27  8:44         ` Tassilo Horn
@ 2021-10-23  2:47         ` Stefan Kangas
  2021-10-23  7:32           ` Eli Zaretskii
  1 sibling, 1 reply; 13+ messages in thread
From: Stefan Kangas @ 2021-10-23  2:47 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Alan Mackenzie, Andreas Schwab, 18577

Stefan Monnier <monnier@IRO.UMontreal.CA> writes:

>>> Is this a defect in my regexp or in the regexp engine?
>> It is fundamental to the way regexp matching works.
>
> To clarify: it is fundamental to the way *our* regexp engine works.
>
> As long as the regexp doesn't use backrefs, it can be matched
> efficiently, without backtracking.  Of course using \(..\) (as opposed
> to using \(?:..\)) can also make the problem harder since the various
> different (but largely equivalent) ways to match might need to be
> distinguishable via match-data.
>
> But even tho your regexp doesn't use backrefs, and even if you replace
> all \(..\) with \(?:..\), your regexp will still cause problems because
> our regexp engine does not try to optimize these kinds of cases.
>
> So you have to do it by hand.
>
>>> If the former, how could I rewrite the regexp so that it would not hit
>>> these problems?
>
> Maybe something like:
>
> /\*\(<insidecomment>\)*\*+/
>
> where <insidecomment> is something like
>
>    [^'*]\|\*+\([^/'*]\|'<afterquote>\)\|'<afterquote>
>
> where <afterquote> is something like
>
>    \([^'*]\|\*+[^/'*]\)*'
>
> Tho this will still push a backtrack point for every character.
> Maybe better would be something like
>
> /\*[^'*]*\(<insidecomment>\)*\*+/
>
> where <insidecomment> is something like
>
>    \(\*+[^/'*]\|\**'<afterquote>\)[^'*]*
>
> where <afterquote> is still something like
>
>    \([^'*]\|\*+[^/'*]\)*'
>
> so that we should only push a backtrace point when we see a * or a ' in
> the comment.

Should we do anything about this, like document it in etc/PROBLEMS, or
should this bug just be closed?





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
  2021-10-23  2:47         ` Stefan Kangas
@ 2021-10-23  7:32           ` Eli Zaretskii
  2021-10-23  8:30             ` Stefan Kangas
  0 siblings, 1 reply; 13+ messages in thread
From: Eli Zaretskii @ 2021-10-23  7:32 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: acm, schwab, monnier, 18577

> From: Stefan Kangas <stefan@marxist.se>
> Date: Fri, 22 Oct 2021 19:47:53 -0700
> Cc: Alan Mackenzie <acm@muc.de>, Andreas Schwab <schwab@linux-m68k.org>,
>  18577@debbugs.gnu.org
> 
> Should we do anything about this, like document it in etc/PROBLEMS, or
> should this bug just be closed?

If we can give useful advice in PROBLEMS, adding that can never hurt.





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
  2021-10-23  7:32           ` Eli Zaretskii
@ 2021-10-23  8:30             ` Stefan Kangas
  2021-10-23  8:39               ` Eli Zaretskii
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Kangas @ 2021-10-23  8:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: acm, schwab, monnier, 18577

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stefan Kangas <stefan@marxist.se>
>> Date: Fri, 22 Oct 2021 19:47:53 -0700
>> Cc: Alan Mackenzie <acm@muc.de>, Andreas Schwab <schwab@linux-m68k.org>,
>>  18577@debbugs.gnu.org
>>
>> Should we do anything about this, like document it in etc/PROBLEMS, or
>> should this bug just be closed?
>
> If we can give useful advice in PROBLEMS, adding that can never hurt.

Perhaps we could just say this (which makes it easy to search for as the
headline matches the error message):

diff --git a/etc/PROBLEMS b/etc/PROBLEMS
index ede83a6e7c..1b10e20272 100644
--- a/etc/PROBLEMS
+++ b/etc/PROBLEMS
@@ -742,6 +742,15 @@ completed" message that tls.el relies upon,
causing affected Emacs
 functions to hang.  To work around the problem, use older or newer
 versions of gnutls-cli, or use Emacs's built-in gnutls support.

+*** Stack overflow in regexp matcher.
+Due to fundamental limitations in the way Emacs' regular expression
+engine is designed, you might run into combinatorial explosions in
+backtracking with certain regexps.
+
+See Emacs Bug#18577 for a discussion and ideas for how to solve it:
+
+    https://debbugs.gnu.org/18557
+
 * Runtime problems related to font handling

 ** Characters are displayed as empty boxes or with wrong font under X.





^ permalink raw reply related	[flat|nested] 13+ messages in thread

* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
  2021-10-23  8:30             ` Stefan Kangas
@ 2021-10-23  8:39               ` Eli Zaretskii
  2021-10-23  9:32                 ` Stefan Kangas
  0 siblings, 1 reply; 13+ messages in thread
From: Eli Zaretskii @ 2021-10-23  8:39 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: acm, schwab, monnier, 18577

> From: Stefan Kangas <stefan@marxist.se>
> Date: Sat, 23 Oct 2021 01:30:07 -0700
> Cc: acm@muc.de, schwab@linux-m68k.org, monnier@iro.umontreal.ca, 
> 	18577@debbugs.gnu.org
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > If we can give useful advice in PROBLEMS, adding that can never hurt.
> 
> Perhaps we could just say this (which makes it easy to search for as the
> headline matches the error message):

Thanks, but pointing to bug discussions in PROBLEMS is IMO not useful
enough.  PROBLEMS is for people who look for quick solutions to their
immediate problems, so asking them to wade through discussions by
developers is sub-optimal.

Would it be possible to copy the relevant recommendations from that
discussion to the PROBLEMS entry?  If shown in a cookbook manner, that
could be very helpful in practice, I think.





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
  2021-10-23  8:39               ` Eli Zaretskii
@ 2021-10-23  9:32                 ` Stefan Kangas
  2021-10-23 11:27                   ` Eli Zaretskii
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Kangas @ 2021-10-23  9:32 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: acm, schwab, monnier, 18577

[-- Attachment #1: Type: text/plain, Size: 244 bytes --]

Eli Zaretskii <eliz@gnu.org> writes:

> Would it be possible to copy the relevant recommendations from that
> discussion to the PROBLEMS entry?  If shown in a cookbook manner, that
> could be very helpful in practice, I think.

How about this?

[-- Attachment #2: 0001-etc-PROBLEMS-Mention-problems-with-regexp-engine.-Bu.patch --]
[-- Type: text/x-diff, Size: 1300 bytes --]

From 91db453ff2236bae5c73c0f42b105737532206de Mon Sep 17 00:00:00 2001
From: Stefan Kangas <stefan@marxist.se>
Date: Sat, 23 Oct 2021 10:29:30 +0200
Subject: [PATCH] * etc/PROBLEMS: Mention problems with regexp engine. 
 (Bug#18577)

---
 etc/PROBLEMS | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/etc/PROBLEMS b/etc/PROBLEMS
index ede83a6e7c..daff102a0d 100644
--- a/etc/PROBLEMS
+++ b/etc/PROBLEMS
@@ -742,6 +742,18 @@ completed" message that tls.el relies upon, causing affected Emacs
 functions to hang.  To work around the problem, use older or newer
 versions of gnutls-cli, or use Emacs's built-in gnutls support.
 
+*** Stack overflow in regexp matcher.
+Due to fundamental limitations in the way Emacs' regular expression
+engine is designed, you might run into combinatorial explosions in
+backtracking with certain regexps.
+
+Avoid "\(...\(...\)*...\)*" and "\(...\)*\(...\)*".  Look for a way to
+anchor your regular expression, to avoid matching the null string in
+infinite ways.  The latter is what creates backtrack points, and
+eventual overflow in practice.
+
+(Also prefer "\(?:...\)" to "\(...\)" unless you need the latter.)
+
 * Runtime problems related to font handling
 
 ** Characters are displayed as empty boxes or with wrong font under X.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
  2021-10-23  9:32                 ` Stefan Kangas
@ 2021-10-23 11:27                   ` Eli Zaretskii
  2021-10-24 22:08                     ` Stefan Kangas
  0 siblings, 1 reply; 13+ messages in thread
From: Eli Zaretskii @ 2021-10-23 11:27 UTC (permalink / raw)
  To: Stefan Kangas; +Cc: acm, schwab, monnier, 18577

> From: Stefan Kangas <stefan@marxist.se>
> Date: Sat, 23 Oct 2021 02:32:23 -0700
> Cc: acm@muc.de, schwab@linux-m68k.org, monnier@iro.umontreal.ca, 
> 	18577@debbugs.gnu.org
> 
> > Would it be possible to copy the relevant recommendations from that
> > discussion to the PROBLEMS entry?  If shown in a cookbook manner, that
> > could be very helpful in practice, I think.
> 
> How about this?

LGTM, thanks.





^ permalink raw reply	[flat|nested] 13+ messages in thread

* bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)]
  2021-10-23 11:27                   ` Eli Zaretskii
@ 2021-10-24 22:08                     ` Stefan Kangas
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Kangas @ 2021-10-24 22:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: acm, schwab, monnier, 18577

close 18577 28.1
thanks

Eli Zaretskii <eliz@gnu.org> writes:

> LGTM, thanks.

Pushed to emacs-28.  If anyone has any further tweaks or improvements,
please reply here and I'll take a look.  Thanks in advance.





^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-10-24 22:08 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-09-28  8:55 bug#18577: Regexp I-search: [(error Stack overflow in regexp matcher)] Alan Mackenzie
2014-09-28 10:56 ` Andreas Schwab
2014-09-28 12:37   ` Alan Mackenzie
2014-09-28 12:48     ` Andreas Schwab
2014-09-28 17:35       ` Stefan Monnier
2014-11-27  8:44         ` Tassilo Horn
2021-10-23  2:47         ` Stefan Kangas
2021-10-23  7:32           ` Eli Zaretskii
2021-10-23  8:30             ` Stefan Kangas
2021-10-23  8:39               ` Eli Zaretskii
2021-10-23  9:32                 ` Stefan Kangas
2021-10-23 11:27                   ` Eli Zaretskii
2021-10-24 22:08                     ` Stefan Kangas

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).