unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
* (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
@ 2020-03-15 10:39 Alan Mackenzie
  2020-03-15 12:22 ` Mattias Engdegård
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Alan Mackenzie @ 2020-03-15 10:39 UTC (permalink / raw)
  To: emacs-devel

Hello, Emacs.

I'm not sure whether I've tripped over bugs or not here, so I'm posting
to emacs-devel.

1. Start the Emacs-27 pretest or master with -Q.
2. Visit the file given for bug #40052, namely:

    http://hg.openjdk.java.net/jdk/jdk/raw-file/29edf1cb3c02/src/hotspot/share/runtime/globals.hpp

.  (Note: this file is one gigantic #define macro and is thus slow to
scroll, etc.  That is the topic of bug #40052.)
3. M-: (setq debug-on-error t)
4. move point to the #define at line 115.
5. Type a space.

This causes a stack overflow error in the regexp engine, producing this
backtrace:

Debugger entered--Lisp error: (error "Stack overflow in regexp matcher")
  re-search-forward("\\(\\\\\\(.\\|\n\\)\\|[^\\\n\15]\\)*" nil t)
  c-before-change-check-unbalanced-strings(5717 5717)
  #f(compiled-function (fn) #<bytecode 0x158b20edb5bd>)(c-before-change-check-unbalanced-strings)
  mapc(#f(compiled-function (fn) #<bytecode 0x158b20edb5bd>) (c-extend-region-for-CPP c-before-change-check-raw-strings c-before-change-check-<>-operators c-depropertize-CPP c-invalidate-macro-cache c-truncate-bs-cache c-before-change-chec$
  c-before-change(5717 5717)
  self-insert-command(1 32)
  funcall-interactively(self-insert-command 1 32)
  call-interactively(self-insert-command nil nil)
  command-execute(self-insert-command)

First of all, note the regexp, "\\(\\\\\\(.\\|\n\\)\\|[^\\\n\15]\\)*"
                                                            ^^^
In the source, the "\15" is "\r".  Why is this substitution being made
for the backtrace?  Is it intentional (in which case, why not do the
same to the "\n"?), or is it a bug?  To me, it is more like a bug.

More importantly, why is there a stack overflow here at all?  Even
though the regexp matcher has a long, long piece of buffer to scan over,
the regexp is a simple linear search, without any nesting to speak of.
There would appear to be no need for any backtracking.

-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
  2020-03-15 10:39 (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace Alan Mackenzie
@ 2020-03-15 12:22 ` Mattias Engdegård
  2020-03-15 12:33   ` Mattias Engdegård
  2020-03-15 16:57   ` Alan Mackenzie
  2020-03-15 14:21 ` Noam Postavsky
  2020-03-15 16:35 ` Stefan Monnier
  2 siblings, 2 replies; 13+ messages in thread
From: Mattias Engdegård @ 2020-03-15 12:22 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

15 mars 2020 kl. 11.39 skrev Alan Mackenzie <acm@muc.de>:

> Hello, Emacs.

Hello Alan. Thanks for the nice example!

> First of all, note the regexp, "\\(\\\\\\(.\\|\n\\)\\|[^\\\n\15]\\)*"
>                                                            ^^^
> In the source, the "\15" is "\r".  Why is this substitution being made
> for the backtrace?  Is it intentional (in which case, why not do the
> same to the "\n"?), or is it a bug?  To me, it is more like a bug.

I agree; there are some ad-hoc switches like print-escape-newlines (which only works on \n and \f) and print-escape-control-characters (which produces octal), but nothing that gives human-friendly escapes for other known control characters.

> More importantly, why is there a stack overflow here at all?  Even
> though the regexp matcher has a long, long piece of buffer to scan over,
> the regexp is a simple linear search, without any nesting to speak of.

Let's ask xr for help:

(xr-pp "\\(\\\\\\(.\\|\n\\)\\|[^\\\n\15]\\)*")
=>
(zero-or-more
 (group
  (or (seq "\\"
           (group anything))
      (not (any "\n\r\\")))))

(note that xr pretty-prints \r properly)

There are two capture groups here, neither of which are actually used. Remove them (the outer one in particular) and the regexp no longer overflows. Navigating the file also becomes noticeably faster. Like this:

(rx (zero-or-more
     (or (seq "\\" anything)
         (not (any "\n\r\\")))))

(rx will use a slightly more efficient rendition of 'anything', but that isn't actually important in this case.)




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
  2020-03-15 12:22 ` Mattias Engdegård
@ 2020-03-15 12:33   ` Mattias Engdegård
  2020-03-15 16:57   ` Alan Mackenzie
  1 sibling, 0 replies; 13+ messages in thread
From: Mattias Engdegård @ 2020-03-15 12:33 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

15 mars 2020 kl. 13.22 skrev Mattias Engdegård <mattiase@acm.org>:

> Navigating the file also becomes noticeably faster.

Sorry, that probably isn't true after all -- my mistake.




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
  2020-03-15 10:39 (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace Alan Mackenzie
  2020-03-15 12:22 ` Mattias Engdegård
@ 2020-03-15 14:21 ` Noam Postavsky
  2020-03-15 14:56   ` Eli Zaretskii
  2020-03-15 16:35 ` Stefan Monnier
  2 siblings, 1 reply; 13+ messages in thread
From: Noam Postavsky @ 2020-03-15 14:21 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Emacs developers

On Sun, 15 Mar 2020 at 06:40, Alan Mackenzie <acm@muc.de> wrote:

> First of all, note the regexp, "\\(\\\\\\(.\\|\n\\)\\|[^\\\n\15]\\)*"
>                                                             ^^^
> In the source, the "\15" is "\r".  Why is this substitution being made
> for the backtrace?  Is it intentional (in which case, why not do the
> same to the "\n"?), or is it a bug?  To me, it is more like a bug.

It's not really a bug (perhaps a missing feature), it's due to
print-escape-control-characters being non-nil. Newline is printed as
"\n" because print-escape-newlines is non-nil, but there is no
equivalent print-escape-carriage-returns.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
  2020-03-15 14:21 ` Noam Postavsky
@ 2020-03-15 14:56   ` Eli Zaretskii
  0 siblings, 0 replies; 13+ messages in thread
From: Eli Zaretskii @ 2020-03-15 14:56 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: acm, emacs-devel

> From: Noam Postavsky <npostavs@gmail.com>
> Date: Sun, 15 Mar 2020 10:21:51 -0400
> Cc: Emacs developers <emacs-devel@gnu.org>
> 
> It's not really a bug (perhaps a missing feature), it's due to
> print-escape-control-characters being non-nil. Newline is printed as
> "\n" because print-escape-newlines is non-nil, but there is no
> equivalent print-escape-carriage-returns.

Does it really make sense for this variable to convert \n and \r into
octal escapes?



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
  2020-03-15 10:39 (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace Alan Mackenzie
  2020-03-15 12:22 ` Mattias Engdegård
  2020-03-15 14:21 ` Noam Postavsky
@ 2020-03-15 16:35 ` Stefan Monnier
  2020-03-15 17:32   ` Alan Mackenzie
  2 siblings, 1 reply; 13+ messages in thread
From: Stefan Monnier @ 2020-03-15 16:35 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

> First of all, note the regexp, "\\(\\\\\\(.\\|\n\\)\\|[^\\\n\15]\\)*"
>                                                             ^^^
> In the source, the "\15" is "\r".  Why is this substitution being made
> for the backtrace?

The string doesn't keep track of whether it was written in the source
code as "\r" or "\^M" or with the actual ^M character or with "\015"
etc... so it's no wonder the printout is not exactly the same as what
you had in the source.  I do wonder why it says \15 instead of \015, \r,
or something else: I've almost never seen 2-digit-long octal for chars,
(only single-digit for NUL and 3-digit for other things), so it's
admittedly a poor choice.


        Stefan




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
  2020-03-15 12:22 ` Mattias Engdegård
  2020-03-15 12:33   ` Mattias Engdegård
@ 2020-03-15 16:57   ` Alan Mackenzie
  2020-03-15 20:06     ` Mattias Engdegård
  1 sibling, 1 reply; 13+ messages in thread
From: Alan Mackenzie @ 2020-03-15 16:57 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: emacs-devel

Hello, Mattias.

On Sun, Mar 15, 2020 at 13:22:20 +0100, Mattias Engdegård wrote:
> 15 mars 2020 kl. 11.39 skrev Alan Mackenzie <acm@muc.de>:

> Hello Alan. Thanks for the nice example!

> > First of all, note the regexp, "\\(\\\\\\(.\\|\n\\)\\|[^\\\n\15]\\)*"

> > In the source, the "\15" is "\r".  Why is this substitution being made
> > for the backtrace?  Is it intentional (in which case, why not do the
> > same to the "\n"?), or is it a bug?  To me, it is more like a bug.

> I agree; there are some ad-hoc switches like print-escape-newlines
> (which only works on \n and \f) and print-escape-control-characters
> (which produces octal), but nothing that gives human-friendly escapes
> for other known control characters.

OK.

> > More importantly, why is there a stack overflow here at all?  Even
> > though the regexp matcher has a long, long piece of buffer to scan over,
> > the regexp is a simple linear search, without any nesting to speak of.

> Let's ask xr for help:

> (xr-pp "\\(\\\\\\(.\\|\n\\)\\|[^\\\n\15]\\)*")
> =>
> (zero-or-more
>  (group
>   (or (seq "\\"
>            (group anything))
>       (not (any "\n\r\\")))))

> (note that xr pretty-prints \r properly)

> There are two capture groups here, neither of which are actually used.
> Remove them (the outer one in particular) and the regexp no longer
> overflows.

I agree (having tried "\\(?:" in place of "\\("), but why?  What is
causing the recursion here?  Each of the two groups need only remember
the latest string matching it.  Surely?  I'd like some insight into
this, so as to avoid it happening again.

[ .... ]

I actually changed the regexp to one which searches for what I'm looking
for (a non-escaped newline or EOB) from the regexp here (which matches
everything which I'm not looking for).  I might even time the two
approaches and see which is faster.

-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
  2020-03-15 16:35 ` Stefan Monnier
@ 2020-03-15 17:32   ` Alan Mackenzie
  2020-03-15 17:38     ` Mattias Engdegård
  0 siblings, 1 reply; 13+ messages in thread
From: Alan Mackenzie @ 2020-03-15 17:32 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Hello, Stefan.

On Sun, Mar 15, 2020 at 12:35:04 -0400, Stefan Monnier wrote:
> > First of all, note the regexp, "\\(\\\\\\(.\\|\n\\)\\|[^\\\n\15]\\)*"
> >                                                             ^^^
> > In the source, the "\15" is "\r".  Why is this substitution being made
> > for the backtrace?

> The string doesn't keep track of whether it was written in the source
> code as "\r" or "\^M" or with the actual ^M character or with "\015"
> etc... so it's no wonder the printout is not exactly the same as what
> you had in the source.  I do wonder why it says \15 instead of \015, \r,
> or something else: I've almost never seen 2-digit-long octal for chars,
> (only single-digit for NUL and 3-digit for other things), so it's
> admittedly a poor choice.

A very poor choice.  What is one to make of an output such as

    [^\\\n\151]?

>         Stefan

-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
  2020-03-15 17:32   ` Alan Mackenzie
@ 2020-03-15 17:38     ` Mattias Engdegård
  2020-03-15 17:46       ` Eli Zaretskii
  0 siblings, 1 reply; 13+ messages in thread
From: Mattias Engdegård @ 2020-03-15 17:38 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Stefan Monnier, emacs-devel

15 mars 2020 kl. 18.32 skrev Alan Mackenzie <acm@muc.de>:

> A very poor choice.  What is one to make of an output such as
> 
>    [^\\\n\151]?

The octal number is padded to 3 digits (the maximum) if a valid octal digit follows:

(let ((print-escape-control-characters t))
  (print "\r1,\r8"))
=>
"\0151,\158"




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
  2020-03-15 17:38     ` Mattias Engdegård
@ 2020-03-15 17:46       ` Eli Zaretskii
  2020-03-15 18:02         ` Andreas Schwab
  0 siblings, 1 reply; 13+ messages in thread
From: Eli Zaretskii @ 2020-03-15 17:46 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: acm, monnier, emacs-devel

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sun, 15 Mar 2020 18:38:28 +0100
> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
> 
> (let ((print-escape-control-characters t))
>   (print "\r1,\r8"))
> =>
> "\0151,\158"

That's terrible, IMO.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
  2020-03-15 17:46       ` Eli Zaretskii
@ 2020-03-15 18:02         ` Andreas Schwab
  2020-03-16  0:32           ` Paul Eggert
  0 siblings, 1 reply; 13+ messages in thread
From: Andreas Schwab @ 2020-03-15 18:02 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Mattias Engdegård, emacs-devel, monnier, acm

On Mär 15 2020, Eli Zaretskii wrote:

>> From: Mattias Engdegård <mattiase@acm.org>
>> Date: Sun, 15 Mar 2020 18:38:28 +0100
>> Cc: Stefan Monnier <monnier@iro.umontreal.ca>, emacs-devel@gnu.org
>> 
>> (let ((print-escape-control-characters t))
>>   (print "\r1,\r8"))
>> =>
>> "\0151,\158"
>
> That's terrible, IMO.

It changed in commit a4605cd60d.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
  2020-03-15 16:57   ` Alan Mackenzie
@ 2020-03-15 20:06     ` Mattias Engdegård
  0 siblings, 0 replies; 13+ messages in thread
From: Mattias Engdegård @ 2020-03-15 20:06 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

15 mars 2020 kl. 17.57 skrev Alan Mackenzie <acm@muc.de>:

> I agree (having tried "\\(?:" in place of "\\("), but why?  What is
> causing the recursion here?  Each of the two groups need only remember
> the latest string matching it.  Surely?  I'd like some insight into
> this, so as to avoid it happening again.

Each time a capture group is entered, the old extent of that group is pushed onto the stack so that it can be reloaded if a match failure occurs and the engine needs to backtrack. This means that removing the unnecessary group is not an asymptotic improvement in space here, but reduces the constant factor so that you can match larger strings. It's still O(N) space, N being the string size.

You could probably rewrite the regexp to improve the constant further by taking advantage of the fact that the [^\\\n\r] branch matches much more often than the other. Something like

(rx (* (seq "\\" anything))
    (* (+ (not (any "\n\r\\")))
       (seq "\\" anything))
    (* (not (any "\n\r\\"))))

which could perhaps be trimmed a bit. In any case, removing unnecessary capture groups everywhere is a cheap and easy improvement to start with. (Another benefit of rx, by the way -- there tends to be a lot fewer of those.)




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace
  2020-03-15 18:02         ` Andreas Schwab
@ 2020-03-16  0:32           ` Paul Eggert
  0 siblings, 0 replies; 13+ messages in thread
From: Paul Eggert @ 2020-03-16  0:32 UTC (permalink / raw)
  To: Andreas Schwab, Eli Zaretskii
  Cc: Mattias Engdegård, acm, monnier, emacs-devel

On 3/15/20 11:02 AM, Andreas Schwab wrote:
>> That's terrible, IMO.
> It changed in commit a4605cd60d.

Guilty as charged. :-)

If it were up to me I'd rationalize that maze of special variables 
controlling the printing of escapes. However, the dead hand of backward 
compatibility means there's no simple answer here. Plus I'm too tied up 
now with Covid-19.



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-03-16  0:32 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-03-15 10:39 (error "Stack overflow in regexp matcher") and (?)wrong display of regexp in backtrace Alan Mackenzie
2020-03-15 12:22 ` Mattias Engdegård
2020-03-15 12:33   ` Mattias Engdegård
2020-03-15 16:57   ` Alan Mackenzie
2020-03-15 20:06     ` Mattias Engdegård
2020-03-15 14:21 ` Noam Postavsky
2020-03-15 14:56   ` Eli Zaretskii
2020-03-15 16:35 ` Stefan Monnier
2020-03-15 17:32   ` Alan Mackenzie
2020-03-15 17:38     ` Mattias Engdegård
2020-03-15 17:46       ` Eli Zaretskii
2020-03-15 18:02         ` Andreas Schwab
2020-03-16  0:32           ` Paul Eggert

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).