unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
* bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50
@ 2022-02-01  7:37 Christian Johansson
  2022-02-01 11:00 ` Lars Ingebrigtsen
  2022-02-01 12:56 ` Mattias Engdegård
  0 siblings, 2 replies; 9+ messages in thread
From: Christian Johansson @ 2022-02-01  7:37 UTC (permalink / raw)
  To: 53680

Hello

Some context first, since I needed a multi-line regexp matcher that 
could tell the difference of end-of-string vs newline ($ vs \n) in a 
string I replace occurrences of \n with \r in a string.

In a peculiar case a endless loop happens and I can reproduce this bug 
in Emacs 27.2 and 28.0.50 (have not tested any other versions), ok try 
to eval following snippet:

(string-match-p·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·") 


or

(string-match·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·")


-- 
Hälsningar / Best Regards
Christian






^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50
  2022-02-01  7:37 bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50 Christian Johansson
@ 2022-02-01 11:00 ` Lars Ingebrigtsen
  2022-02-01 11:08   ` Lars Ingebrigtsen
  2022-02-01 11:15   ` Andreas Schwab
  2022-02-01 12:56 ` Mattias Engdegård
  1 sibling, 2 replies; 9+ messages in thread
From: Lars Ingebrigtsen @ 2022-02-01 11:00 UTC (permalink / raw)
  To: Christian Johansson; +Cc: 53680

Christian Johansson <christian@cvj.se> writes:

> In a peculiar case a endless loop happens and I can reproduce this bug
> in Emacs 27.2 and 28.0.50 (have not tested any other versions), ok try
> to eval following snippet:
>
> (string-match-p·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·")
> or

I guess the · are supposed to be spaces, so it's:

(string-match
 "[\r\t ]*implements[\r\t ]+\\([\r\t ]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t ]*{$"
 "ariable implements \\Magento\\Framework\\Event\\OberserverInterface\r{\r    public function __construct()\r ")

And I can reproduce this on master, too.  Is it a problem with excessive
backtracking, perhaps?  I let it run for a minute, and it didn't finish
in that time.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50
  2022-02-01 11:00 ` Lars Ingebrigtsen
@ 2022-02-01 11:08   ` Lars Ingebrigtsen
  2022-02-01 11:25     ` Andreas Schwab
  2022-02-01 11:15   ` Andreas Schwab
  1 sibling, 1 reply; 9+ messages in thread
From: Lars Ingebrigtsen @ 2022-02-01 11:08 UTC (permalink / raw)
  To: Christian Johansson; +Cc: 53680

Lars Ingebrigtsen <larsi@gnus.org> writes:

> And I can reproduce this on master, too.  Is it a problem with excessive
> backtracking, perhaps?  I let it run for a minute, and it didn't finish
> in that time.

This simplified version also infloops:

(string-match
 "implements\\([\r\t ]*[\\a-zA-Z0-9_]+,?\\)+{$"
 "ariable implements \\Magento\\Framework\\Event\\OberserverInterface\r{\r    public function __construct()\r ")


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50
  2022-02-01 11:00 ` Lars Ingebrigtsen
  2022-02-01 11:08   ` Lars Ingebrigtsen
@ 2022-02-01 11:15   ` Andreas Schwab
  2022-02-01 12:16     ` Lars Ingebrigtsen
  1 sibling, 1 reply; 9+ messages in thread
From: Andreas Schwab @ 2022-02-01 11:15 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: Christian Johansson, 53680

On Feb 01 2022, Lars Ingebrigtsen wrote:

> I guess the · are supposed to be spaces, so it's:
>
> (string-match
>  "[\r\t ]*implements[\r\t ]+\\([\r\t ]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t ]*{$"
>  "ariable implements \\Magento\\Framework\\Event\\OberserverInterface\r{\r    public function __construct()\r ")
>
> And I can reproduce this on master, too.  Is it a problem with excessive
> backtracking, perhaps?

It sure is.  The nesting of the + operator without proper anchoring
makes the search space explode.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."





^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50
  2022-02-01 11:08   ` Lars Ingebrigtsen
@ 2022-02-01 11:25     ` Andreas Schwab
  0 siblings, 0 replies; 9+ messages in thread
From: Andreas Schwab @ 2022-02-01 11:25 UTC (permalink / raw)
  To: Lars Ingebrigtsen; +Cc: Christian Johansson, 53680

On Feb 01 2022, Lars Ingebrigtsen wrote:

> This simplified version also infloops:
>
> (string-match
>  "implements\\([\r\t ]*[\\a-zA-Z0-9_]+,?\\)+{$"

This has the same search space explosion.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."





^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50
  2022-02-01 11:15   ` Andreas Schwab
@ 2022-02-01 12:16     ` Lars Ingebrigtsen
  0 siblings, 0 replies; 9+ messages in thread
From: Lars Ingebrigtsen @ 2022-02-01 12:16 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Christian Johansson, 53680

Andreas Schwab <schwab@linux-m68k.org> writes:

> It sure is.  The nesting of the + operator without proper anchoring
> makes the search space explode.

Yup.  So something like 

(string-match
 "[\r\t ]*implements\\([\r\t ]+[\\a-zA-Z_0-9_]+,?\\)+[\r\t ]*{$"
 "ariable implements \\Magento\\Framework\\Event\\OberserverInterface\r{\r    public function __construct()\r ")

won't have the same explosion (and indeed finished immediately). 

So I don't think this is an Emacs bug, just a very tough regexp to
match.  There's been some talk about replacing Emacs' regexp motor with
something that has better backtracking characteristics, but I don't
recall if anybody has actually made any moves to make that happen.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50
  2022-02-01  7:37 bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50 Christian Johansson
  2022-02-01 11:00 ` Lars Ingebrigtsen
@ 2022-02-01 12:56 ` Mattias Engdegård
  2022-02-01 13:12   ` Christian Johansson
  1 sibling, 1 reply; 9+ messages in thread
From: Mattias Engdegård @ 2022-02-01 12:56 UTC (permalink / raw)
  To: Christian Johansson; +Cc: Lars Ingebrigtsen, Andreas Schwab, 53680

> (string-match·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·") 

The diagnostics by Lars and Andreas is correct. Let's look at it more closely, first translating the regexp to rx for ease of reasoning, and see if we can make it work:

(rx (* (in "\t\r "))
    "implements"
    (+ (in "\t\r "))
    (+ (group
        (* (in "\t\r "))
        (+ (in "0-9A-Za-z" "\\_"))
        (? ",")))
    (* (in "\t\r "))
    "{" eol)

The first line is meaningless since it can match the empty string, but you probably want to anchor the start of "implements" so that it doesn't match "house_implements". Let's also drop the capture group, and we get:

(rx symbol-start "implements"
    (+ (in "\t\r "))
    (+ (* (in "\t\r "))
       (+ (in "0-9A-Za-z" "\\_"))
       (? ","))
    (* (in "\t\r "))
    "{" eol)

You clearly want to match a non-empty sequence of 'words' separated with whitespace and/or commas, but the pattern is ambiguous --  all inter-word separators are optional. Let's make them mandatory:

 (rx symbol-start "implements"
     ;; mandatory whitespace
     (+ (in "\t\r "))
     ;; then a word
     (+ (in "0-9A-Za-z" "\\_"))
     ;; then maybe more words, each prefixed by spaces or comma
     (* (+ (in "\t\r ,"))   ; fast and loose
        (+ (in "0-9A-Za-z" "\\_")))
     ;; finally whitespace before the curly bracket
     (* (in "\t\r "))
     "{" eol)

which is reasonably efficient, since all ambiguity is now gone: the regexp can (almost) only match in one way.

Note the "fast and loose" pattern where we accept any number of spaces or commas. Here it depends on your grammar but if you want exactly one comma separating each word, that subexpression would be something like

  (* (in "\t\r "))
  ","
  (* (in "\t\r "))

instead.






^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50
  2022-02-01 12:56 ` Mattias Engdegård
@ 2022-02-01 13:12   ` Christian Johansson
       [not found]     ` <AB866492-4CC0-418A-8C9E-CAAB2C522CDA@acm.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Johansson @ 2022-02-01 13:12 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Lars Ingebrigtsen, Andreas Schwab, 53680

Alright, thanks for helping me find the correct regexp, this issue only appeared on some peculiar strings

> 1 feb. 2022 kl. 13:56 skrev Mattias Engdegård <mattiase@acm.org>:
> 
> 
>> 
>> (string-match·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·") 
> 
> The diagnostics by Lars and Andreas is correct. Let's look at it more closely, first translating the regexp to rx for ease of reasoning, and see if we can make it work:
> 
> (rx (* (in "\t\r "))
>    "implements"
>    (+ (in "\t\r "))
>    (+ (group
>        (* (in "\t\r "))
>        (+ (in "0-9A-Za-z" "\\_"))
>        (? ",")))
>    (* (in "\t\r "))
>    "{" eol)
> 
> The first line is meaningless since it can match the empty string, but you probably want to anchor the start of "implements" so that it doesn't match "house_implements". Let's also drop the capture group, and we get:
> 
> (rx symbol-start "implements"
>    (+ (in "\t\r "))
>    (+ (* (in "\t\r "))
>       (+ (in "0-9A-Za-z" "\\_"))
>       (? ","))
>    (* (in "\t\r "))
>    "{" eol)
> 
> You clearly want to match a non-empty sequence of 'words' separated with whitespace and/or commas, but the pattern is ambiguous --  all inter-word separators are optional. Let's make them mandatory:
> 
> (rx symbol-start "implements"
>     ;; mandatory whitespace
>     (+ (in "\t\r "))
>     ;; then a word
>     (+ (in "0-9A-Za-z" "\\_"))
>     ;; then maybe more words, each prefixed by spaces or comma
>     (* (+ (in "\t\r ,"))   ; fast and loose
>        (+ (in "0-9A-Za-z" "\\_")))
>     ;; finally whitespace before the curly bracket
>     (* (in "\t\r "))
>     "{" eol)
> 
> which is reasonably efficient, since all ambiguity is now gone: the regexp can (almost) only match in one way.
> 
> Note the "fast and loose" pattern where we accept any number of spaces or commas. Here it depends on your grammar but if you want exactly one comma separating each word, that subexpression would be something like
> 
>  (* (in "\t\r "))
>  ","
>  (* (in "\t\r "))
> 
> instead.
> 





^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50
       [not found]     ` <AB866492-4CC0-418A-8C9E-CAAB2C522CDA@acm.org>
@ 2022-02-01 16:23       ` Christian Johansson
  0 siblings, 0 replies; 9+ messages in thread
From: Christian Johansson @ 2022-02-01 16:23 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Lars Ingebrigtsen, Andreas Schwab, 53680

I see, thanks for the tip, I got it working now without endless-loop and 
without the CR work-around

Hälsningar / Best Regards
Christian

On 01/02/2022 14:39, Mattias Engdegård wrote:
> 1 feb. 2022 kl. 14.12 skrev Christian Johansson <christian@cvj.se>:
>> Alright, thanks for helping me find the correct regexp, this issue only appeared on some peculiar strings
> By the way, your translation of LF into CR in order to use "$" as end-of-string anchor is misguided: to match at the beginning or end of a string or buffer, use \` and \' (in rx, string-start and string-end or bos and eos) respectively.
>





^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-02-01 16:23 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-02-01  7:37 bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50 Christian Johansson
2022-02-01 11:00 ` Lars Ingebrigtsen
2022-02-01 11:08   ` Lars Ingebrigtsen
2022-02-01 11:25     ` Andreas Schwab
2022-02-01 11:15   ` Andreas Schwab
2022-02-01 12:16     ` Lars Ingebrigtsen
2022-02-01 12:56 ` Mattias Engdegård
2022-02-01 13:12   ` Christian Johansson
     [not found]     ` <AB866492-4CC0-418A-8C9E-CAAB2C522CDA@acm.org>
2022-02-01 16:23       ` Christian Johansson

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).