unofficial mirror of help-gnu-emacs@gnu.org
 help / color / mirror / Atom feed
* Stack overflow in regexp matcher
@ 2009-12-16 17:15 akaiser
  2009-12-17 17:01 ` Barry Margolin
  0 siblings, 1 reply; 12+ messages in thread
From: akaiser @ 2009-12-16 17:15 UTC (permalink / raw)
  To: help-gnu-emacs

A function of mine gets "Stack overflow in regexp matcher" on a certain 
file using the regexp

     "^< \\(.+].+=|\\)"

The file is a single line of about 73000 characters, but the regexp matches 
ending at character 56.

Is this a bug?  And if not, why not?

djc


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stack overflow in regexp matcher
  2009-12-16 17:15 akaiser
@ 2009-12-17 17:01 ` Barry Margolin
  2009-12-17 22:10   ` Ilya Zakharevich
  2009-12-17 22:13   ` akaiser
  0 siblings, 2 replies; 12+ messages in thread
From: Barry Margolin @ 2009-12-17 17:01 UTC (permalink / raw)
  To: help-gnu-emacs

In article <8u-dncYjy6Q3iLTWnZ2dnUVZ8nKdnZ2d@posted.visi>,
 "akaiser@visi.com" <akaiser@visi.com> wrote:

> A function of mine gets "Stack overflow in regexp matcher" on a certain 
> file using the regexp
> 
>      "^< \\(.+].+=|\\)"
> 
> The file is a single line of about 73000 characters, but the regexp matches 
> ending at character 56.
> 
> Is this a bug?  And if not, why not?
> 
> djc

The problem is that + is greedy, so this has to scan the entire line 
looking for the last "]", then see if there's an "=" somewhere later.  
If it can't find an "=" it has to back up to the previous "]" and search 
again, and so on.  If there are lots of "]" characters, this has to save 
the state of each of them in the matching stack.

-- 
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stack overflow in regexp matcher
  2009-12-17 17:01 ` Barry Margolin
@ 2009-12-17 22:10   ` Ilya Zakharevich
  2009-12-17 22:13   ` akaiser
  1 sibling, 0 replies; 12+ messages in thread
From: Ilya Zakharevich @ 2009-12-17 22:10 UTC (permalink / raw)
  To: help-gnu-emacs

On 2009-12-17, Barry Margolin <barmar@alum.mit.edu> wrote:
>> A function of mine gets "Stack overflow in regexp matcher" on a certain 
>> file using the regexp
>> 
>>      "^< \\(.+].+=|\\)"

> The problem is that + is greedy, so this has to scan the entire line 
> looking for the last "]", then see if there's an "=" somewhere later.  
> If it can't find an "=" it has to back up to the previous "]" and search 
> again, and so on.  If there are lots of "]" characters, this has to save 
> the state of each of them in the matching stack.

What for?  There is no need to keep state of unsuccessful matches...

So the "next-]" state should replace the older one.

Yours,
Ilya


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stack overflow in regexp matcher
  2009-12-17 17:01 ` Barry Margolin
  2009-12-17 22:10   ` Ilya Zakharevich
@ 2009-12-17 22:13   ` akaiser
  1 sibling, 0 replies; 12+ messages in thread
From: akaiser @ 2009-12-17 22:13 UTC (permalink / raw)
  To: help-gnu-emacs

Barry Margolin wrote:
> The problem is that + is greedy, so this has to scan the entire line
> looking for the last "]", then see if there's an "=" somewhere later.
> If it can't find an "=" it has to back up to the previous "]" and search
> again, and so on.  If there are lots of "]" characters, this has to save
> the state of each of them in the matching stack.

That's what I first suspected too.  But (a) the same thing happens with

     "^< \\([^]]+].+=|\\)"

which constrains the search to find only the first "]", and (b) there's 
only one "]" in that line -- within the first 45 characters -- as well as 
only a single "=|".

Any other thoughts?

djc


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Stack overflow in regexp matcher
@ 2011-02-06 10:47 Dan Davison
  2011-02-06 13:31 ` Stephen Berman
  0 siblings, 1 reply; 12+ messages in thread
From: Dan Davison @ 2011-02-06 10:47 UTC (permalink / raw)
  To: help-gnu-emacs

The following fails with "Stack overflow in regexp matcher" in emacs 23
and 24:

(string-match
 "^\\[.+\\]$"
 (concat
  "["
  (mapconcat (lambda (i) "x") (number-sequence 1 33500) "")
  "]"))

This surprised me; I assumed that the ^ and $ anchors, and the simple
".+" requirement in the middle would result in a simple, efficient
regexp.

In this case I can replace this with a different test using `substring',
but I'm curious: was the above obviously unwise to someone who knows
about emacs regexps? What would be a better regexp be to use in this
sutuation?  Fwiw, The same match seems to work instantaneously in
e.g. perl.

Dan




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stack overflow in regexp matcher
  2011-02-06 10:47 Stack overflow in regexp matcher Dan Davison
@ 2011-02-06 13:31 ` Stephen Berman
  2011-02-06 14:01   ` guivho
  2011-02-06 14:17   ` Eli Zaretskii
  0 siblings, 2 replies; 12+ messages in thread
From: Stephen Berman @ 2011-02-06 13:31 UTC (permalink / raw)
  To: help-gnu-emacs

On Sun, 06 Feb 2011 10:47:42 +0000 Dan Davison <dandavison7@gmail.com> wrote:

> The following fails with "Stack overflow in regexp matcher" in emacs 23
> and 24:
>
> (string-match
>  "^\\[.+\\]$"
>  (concat
>   "["
>   (mapconcat (lambda (i) "x") (number-sequence 1 33500) "")
>   "]"))
>
> This surprised me; I assumed that the ^ and $ anchors, and the simple
> ".+" requirement in the middle would result in a simple, efficient
> regexp.

It does not fail on my GNU Emacs 24.0.50.1 (i686-pc-linux-gnu, GTK+
Version 2.20.1) of 2011-01-13, but returns, as expected, 0.

Steve Berman




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stack overflow in regexp matcher
  2011-02-06 13:31 ` Stephen Berman
@ 2011-02-06 14:01   ` guivho
  2011-02-06 14:17   ` Eli Zaretskii
  1 sibling, 0 replies; 12+ messages in thread
From: guivho @ 2011-02-06 14:01 UTC (permalink / raw)
  To: help-gnu-emacs

It does fail with Debugger entered--Lisp error: (error "Stack overflow
in regexp matcher")
on GNU Emacs 23.2.92.1 (x86_64-apple-darwin, NS apple-appkit-1038.35)
of 2011-01-15 on black.porkrind.org

Guivho



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stack overflow in regexp matcher
  2011-02-06 13:31 ` Stephen Berman
  2011-02-06 14:01   ` guivho
@ 2011-02-06 14:17   ` Eli Zaretskii
  2011-02-06 14:30     ` Dan Davison
  2011-02-06 16:02     ` Stephen Berman
  1 sibling, 2 replies; 12+ messages in thread
From: Eli Zaretskii @ 2011-02-06 14:17 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Stephen Berman <stephen.berman@gmx.net>
> Date: Sun, 06 Feb 2011 14:31:43 +0100
> 
> On Sun, 06 Feb 2011 10:47:42 +0000 Dan Davison <dandavison7@gmail.com> wrote:
> 
> > The following fails with "Stack overflow in regexp matcher" in emacs 23
> > and 24:
> >
> > (string-match
> >  "^\\[.+\\]$"
> >  (concat
> >   "["
> >   (mapconcat (lambda (i) "x") (number-sequence 1 33500) "")
> >   "]"))
> >
> > This surprised me; I assumed that the ^ and $ anchors, and the simple
> > ".+" requirement in the middle would result in a simple, efficient
> > regexp.
> 
> It does not fail on my GNU Emacs 24.0.50.1 (i686-pc-linux-gnu, GTK+
> Version 2.20.1) of 2011-01-13, but returns, as expected, 0.

Nor does it fail here:

   GNU Emacs 23.2.91.1 (i386-mingw-nt5.1.2600) of 2010-12-11 on 3249CTO

but does enter the debugger here:

   GNU Emacs 23.2.93.2 (x86_64-unknown-linux-gnu, GTK+ Version 2.20.1) of 2011-02-03 on fencepost

I guess it depends on how large is the available stack space.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stack overflow in regexp matcher
  2011-02-06 14:17   ` Eli Zaretskii
@ 2011-02-06 14:30     ` Dan Davison
  2011-02-06 16:02     ` Stephen Berman
  1 sibling, 0 replies; 12+ messages in thread
From: Dan Davison @ 2011-02-06 14:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: help-gnu-emacs

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Berman <stephen.berman@gmx.net>
>> Date: Sun, 06 Feb 2011 14:31:43 +0100
>> 
>> On Sun, 06 Feb 2011 10:47:42 +0000 Dan Davison <dandavison7@gmail.com> wrote:
>> 
>> > The following fails with "Stack overflow in regexp matcher" in emacs 23
>> > and 24:
>> >
>> > (string-match
>> >  "^\\[.+\\]$"
>> >  (concat
>> >   "["
>> >   (mapconcat (lambda (i) "x") (number-sequence 1 33500) "")
>> >   "]"))
>> >
>> > This surprised me; I assumed that the ^ and $ anchors, and the simple
>> > ".+" requirement in the middle would result in a simple, efficient
>> > regexp.
>> 
>> It does not fail on my GNU Emacs 24.0.50.1 (i686-pc-linux-gnu, GTK+
>> Version 2.20.1) of 2011-01-13, but returns, as expected, 0.
>
> Nor does it fail here:
>
>    GNU Emacs 23.2.91.1 (i386-mingw-nt5.1.2600) of 2010-12-11 on 3249CTO
>
> but does enter the debugger here:
>
>    GNU Emacs 23.2.93.2 (x86_64-unknown-linux-gnu, GTK+ Version 2.20.1) of 2011-02-03 on fencepost

It fails on these (I didn't find any machines/builds where it worked):

* OS X
** 23
*** local build
   GNU Emacs 23.2.1 (x86_64-apple-darwin10.6.0, NS
   apple-appkit-1038.35) of 2011-01-20 on <my OS X 10.6.6 machine>
*** aquamacs
   GNU Emacs 23.2.91.1 (x86_64-apple-darwin10.6.0, NS
   apple-appkit-1038.35) of 2011-02-01 on
   94.196.104.92.threembb.co.uk - Aquamacs Distribution 2.2dev
*** emacsformacosx
   GNU Emacs 23.2.1 (x86_64-apple-darwin, NS apple-appkit-1038.29) of
   2010-05-09 on black.local
** 24
*** local build
    GNU Emacs 24.0.50.1 (x86_64-apple-darwin10.6.0, NS
    apple-appkit-1038.35) of 2011-01-13 on <my OS X 10.6.6 machine>
* Ubuntu
** 23
   GNU Emacs 23.1.1 (x86_64-pc-linux-gnu, GTK+ Version 2.20.0) of
   2010-03-29 on yellow, modified by Debian


> I guess it depends on how large is the available stack space.




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stack overflow in regexp matcher
  2011-02-06 14:17   ` Eli Zaretskii
  2011-02-06 14:30     ` Dan Davison
@ 2011-02-06 16:02     ` Stephen Berman
  1 sibling, 0 replies; 12+ messages in thread
From: Stephen Berman @ 2011-02-06 16:02 UTC (permalink / raw)
  To: help-gnu-emacs

On Sun, 06 Feb 2011 09:17:50 -0500 Eli Zaretskii <eliz@gnu.org> wrote:

>> From: Stephen Berman <stephen.berman@gmx.net>
>> Date: Sun, 06 Feb 2011 14:31:43 +0100
>> 
>> On Sun, 06 Feb 2011 10:47:42 +0000 Dan Davison <dandavison7@gmail.com> wrote:
>> 
>> > The following fails with "Stack overflow in regexp matcher" in emacs 23
>> > and 24:
>> >
>> > (string-match
>> >  "^\\[.+\\]$"
>> >  (concat
>> >   "["
>> >   (mapconcat (lambda (i) "x") (number-sequence 1 33500) "")
>> >   "]"))
>> >
>> > This surprised me; I assumed that the ^ and $ anchors, and the simple
>> > ".+" requirement in the middle would result in a simple, efficient
>> > regexp.
>> 
>> It does not fail on my GNU Emacs 24.0.50.1 (i686-pc-linux-gnu, GTK+
>> Version 2.20.1) of 2011-01-13, but returns, as expected, 0.
>
> Nor does it fail here:
>
>    GNU Emacs 23.2.91.1 (i386-mingw-nt5.1.2600) of 2010-12-11 on 3249CTO
>
> but does enter the debugger here:
>
>    GNU Emacs 23.2.93.2 (x86_64-unknown-linux-gnu, GTK+ Version 2.20.1) of
> 2011-02-03 on fencepost
>
> I guess it depends on how large is the available stack space.

That indeed seems to be the case: on my system, it succeeds with
(number-sequence 1 66665) but fails with (number-sequence 1 66666).

Steve Berman




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stack overflow in regexp matcher
       [not found] <mailman.0.1296989279.10345.help-gnu-emacs@gnu.org>
@ 2011-02-07 20:24 ` Stefan Monnier
  2011-02-08 22:58   ` Dan Davison
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Monnier @ 2011-02-07 20:24 UTC (permalink / raw)
  To: help-gnu-emacs

> The following fails with "Stack overflow in regexp matcher" in emacs 23
> and 24:

> (string-match
>  "^\\[.+\\]$"
>  (concat
>   "["
>   (mapconcat (lambda (i) "x") (number-sequence 1 33500) "")
>   "]"))

> This surprised me; I assumed that the ^ and $ anchors, and the simple
> ".+" requirement in the middle would result in a simple, efficient
> regexp.

The problem is that Emacs's regexp engine is not very clever.

Basically, each time . succeeds we first recurse, trying to match the
rest against ".*\\]$", and if that fails then we try to match
"\\]$" instead.  The fact that at each step we have this "fork" of two
different possible ways to match, means that at each step we have to
push something on the stack, so the longer the string against which we
match, the deeper the stack we need.

You can fix it by using a regexp that does not need this backtracking.
E.g. "^\\[[^]\n]+\\]$".  If you do that, then Emacs's regexp engine will
notice that when [^]\n] matches, then "\\]$" can't match, so there's no
point pushing something on the stack to try the second path when the
first fails.  I.e. it will do the whole match with a constant amount of
stack space.

This relates also to the recent discussion with Ilya in the thread
titled "will we ever have zero width assertions in regexps?".


        Stefan


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Stack overflow in regexp matcher
  2011-02-07 20:24 ` Stefan Monnier
@ 2011-02-08 22:58   ` Dan Davison
  0 siblings, 0 replies; 12+ messages in thread
From: Dan Davison @ 2011-02-08 22:58 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: help-gnu-emacs

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> The following fails with "Stack overflow in regexp matcher" in emacs 23
>> and 24:
>
>> (string-match
>>  "^\\[.+\\]$"
>>  (concat
>>   "["
>>   (mapconcat (lambda (i) "x") (number-sequence 1 33500) "")
>>   "]"))
>
>> This surprised me; I assumed that the ^ and $ anchors, and the simple
>> ".+" requirement in the middle would result in a simple, efficient
>> regexp.
>
> The problem is that Emacs's regexp engine is not very clever.
>
> Basically, each time . succeeds we first recurse, trying to match the
> rest against ".*\\]$", and if that fails then we try to match
> "\\]$" instead.  The fact that at each step we have this "fork" of two
> different possible ways to match, means that at each step we have to
> push something on the stack, so the longer the string against which we
> match, the deeper the stack we need.
>
> You can fix it by using a regexp that does not need this backtracking.
> E.g. "^\\[[^]\n]+\\]$".  If you do that, then Emacs's regexp engine will
> notice that when [^]\n] matches, then "\\]$" can't match, so there's no
> point pushing something on the stack to try the second path when the
> first fails.  I.e. it will do the whole match with a constant amount of
> stack space.
>
> This relates also to the recent discussion with Ilya in the thread
> titled "will we ever have zero width assertions in regexps?".

Hi Stefan,

Thanks for the explanation.

Dan

>
>
>         Stefan



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2011-02-08 22:58 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-02-06 10:47 Stack overflow in regexp matcher Dan Davison
2011-02-06 13:31 ` Stephen Berman
2011-02-06 14:01   ` guivho
2011-02-06 14:17   ` Eli Zaretskii
2011-02-06 14:30     ` Dan Davison
2011-02-06 16:02     ` Stephen Berman
     [not found] <mailman.0.1296989279.10345.help-gnu-emacs@gnu.org>
2011-02-07 20:24 ` Stefan Monnier
2011-02-08 22:58   ` Dan Davison
  -- strict thread matches above, loose matches on Subject: below --
2009-12-16 17:15 akaiser
2009-12-17 17:01 ` Barry Margolin
2009-12-17 22:10   ` Ilya Zakharevich
2009-12-17 22:13   ` akaiser

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).