unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
* bug#58726: 29.0.50; Bug in regexp matching with shy groups
@ 2022-10-23  1:41 Michael Heerdegen
  2022-10-23 13:50 ` Mattias Engdegård
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Heerdegen @ 2022-10-23  1:41 UTC (permalink / raw)
  To: 58726


Hello,

  (string-match-p "\\`\\(?:ab\\)*\\'" "a") ==> 0

That's wrong, the expected result is nil.  The language matched by that
regexp is {"", "ab", "abab", "ababab", ...}.

Changing to a non-shy group doesn't exploit the issue:

  (string-match-p "\\`\\(ab\\)*\\'" "a")  ==> nil

as expected.

I've been told (emacs-help, Bruno Barbier) that the problem exists at
least in emacs 27, emacs 28 and emacs 29.


TIA,

Michael.







^ permalink raw reply	[flat|nested] 6+ messages in thread

* bug#58726: 29.0.50; Bug in regexp matching with shy groups
  2022-10-23  1:41 bug#58726: 29.0.50; Bug in regexp matching with shy groups Michael Heerdegen
@ 2022-10-23 13:50 ` Mattias Engdegård
  2022-10-24  2:38   ` Michael Heerdegen
  0 siblings, 1 reply; 6+ messages in thread
From: Mattias Engdegård @ 2022-10-23 13:50 UTC (permalink / raw)
  To: Michael Heerdegen; +Cc: 58726

[-- Attachment #1: Type: text/plain, Size: 1015 bytes --]

Michael, thank you for finding this amusing bug!

>   (string-match-p "\\`\\(?:ab\\)*\\'" "a") ==> 0

With a bit of help from the regexp-disasm package, we see that this compiles to

    0  begbuf
    1  on-failure-jump-smart to 11
    4  exact "ab"
    8  jump to 1
   11  endbuf
   12  succeed

where the on-failure-jump-smart op turns into on-failure-keep-string-jump the first time it's executed.

This gives us a clue about what is wrong: when there is a failure inside an 'exact' string match, the target pointer should be reset to the start of that string ("ab" here) before jumping to the failure location.

Reading the source it becomes clear that this is done correctly when there is a mismatch, but not if the target string ends prematurely because PREFETCH() has no idea that it should reset the target pointer! Easy enough to fix.

Please try the attached patch. (The patch takes care of counted repetitions for good measure although I wasn't able to provoke a failure directly.)


[-- Attachment #2: 0001-Fix-regexp-matching-with-atomic-strings-and-optimise.patch --]
[-- Type: application/octet-stream, Size: 3240 bytes --]

From a1bc10533625bde326434325bc75cf1934895472 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sun, 23 Oct 2022 15:40:37 +0200
Subject: [PATCH] Fix regexp matching with atomic strings and optimised
 backtracking

This bug occurs when an atomic pattern is matched at the end of
a string and the on-failure-keep-string-jump optimisation is
in effect, as in:

  (string-match "\\'\\(?:ab\\)*\\'" "a")

which succeeded but clearly should not (bug#58726).

Reported by Michael Heerdegen.

* src/regex-emacs.c (PREFETCH): Add reset parameter.
(re_match_2_internal): Use it for proper atomic pattern treatment.
* test/src/regex-emacs-tests.el (regexp-atomic-failure): New test.
---
 src/regex-emacs.c             | 14 +++++++++-----
 test/src/regex-emacs-tests.el |  5 +++++
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 9b2c14c413..626560911f 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3446,14 +3446,18 @@ #define POINTER_TO_OFFSET(ptr)			\
 
 /* Call before fetching a character with *d.  This switches over to
    string2 if necessary.
+   `reset' is executed before backtracking if there are no more characters.
    Check re_match_2_internal for a discussion of why end_match_2 might
    not be within string2 (but be equal to end_match_1 instead).  */
-#define PREFETCH()							\
+#define PREFETCH(reset)							\
   while (d == dend)							\
     {									\
       /* End of string2 => fail.  */					\
       if (dend == end_match_2)						\
-	goto fail;							\
+        {								\
+	  reset;							\
+	  goto fail;							\
+	}								\
       /* End of string1 => advance to string2.  */			\
       d = string2;							\
       dend = end_match_2;						\
@@ -4252,7 +4256,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp,
 		int pat_charlen, buf_charlen;
 		int pat_ch, buf_ch;
 
-		PREFETCH ();
+		PREFETCH (d = dfail);
 		if (multibyte)
 		  pat_ch = string_char_and_length (p, &pat_charlen);
 		else
@@ -4280,7 +4284,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp,
 		int pat_charlen;
 		int pat_ch, buf_ch;
 
-		PREFETCH ();
+		PREFETCH (d = dfail);
 		if (multibyte)
 		  {
 		    pat_ch = string_char_and_length (p, &pat_charlen);
@@ -4486,7 +4490,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp,
 		if (d2 == dend2) break;
 
 		/* If necessary, advance to next segment in data.  */
-		PREFETCH ();
+		PREFETCH (d = dfail);
 
 		/* How many characters left in this segment to match.  */
 		dcnt = dend - d;
diff --git a/test/src/regex-emacs-tests.el b/test/src/regex-emacs-tests.el
index ff0d6be3f5..b323f592dc 100644
--- a/test/src/regex-emacs-tests.el
+++ b/test/src/regex-emacs-tests.el
@@ -867,4 +867,9 @@ regexp-eszett
     (should (equal (string-match "[[:lower:]]" "ẞ") 0))
     (should (equal (string-match "[[:upper:]]" "ẞ") 0))))
 
+(ert-deftest regexp-atomic-failure ()
+  "Bug#58726."
+  (should (equal (string-match "\\`\\(?:ab\\)*\\'" "a") nil))
+  (should (equal (string-match "\\`a\\{2\\}*\\'" "a") nil)))
+
 ;;; regex-emacs-tests.el ends here
-- 
2.32.0 (Apple Git-132)


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* bug#58726: 29.0.50; Bug in regexp matching with shy groups
  2022-10-23 13:50 ` Mattias Engdegård
@ 2022-10-24  2:38   ` Michael Heerdegen
  2022-10-24 10:55     ` Mattias Engdegård
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Heerdegen @ 2022-10-24  2:38 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58726

Mattias Engdegård <mattiase@acm.org> writes:

> Please try the attached patch. (The patch takes care of counted
> repetitions for good measure although I wasn't able to provoke a
> failure directly.)

Yes, works for me, thanks.

Unfortunately I can't estimate whether your fix is correct and the right
thing to do, so all I have to offer is that I will run Emacs with your
patch installed and watch for any problems that it may have introduced.

Thanks again,

Michael.






^ permalink raw reply	[flat|nested] 6+ messages in thread

* bug#58726: 29.0.50; Bug in regexp matching with shy groups
  2022-10-24  2:38   ` Michael Heerdegen
@ 2022-10-24 10:55     ` Mattias Engdegård
  2022-10-24 11:17       ` Michael Heerdegen
  0 siblings, 1 reply; 6+ messages in thread
From: Mattias Engdegård @ 2022-10-24 10:55 UTC (permalink / raw)
  To: Michael Heerdegen; +Cc: 58726

24 okt. 2022 kl. 04.38 skrev Michael Heerdegen <michael_heerdegen@web.de>:

> Unfortunately I can't estimate whether your fix is correct and the right
> thing to do, so all I have to offer is that I will run Emacs with your
> patch installed and watch for any problems that it may have introduced.

Thanks for testing! I'm fairly certain of its correctness, but there could be other places with a similar bug that I didn't find.
Nevertheless this should do it for now -- pushed to master.






^ permalink raw reply	[flat|nested] 6+ messages in thread

* bug#58726: 29.0.50; Bug in regexp matching with shy groups
  2022-10-24 10:55     ` Mattias Engdegård
@ 2022-10-24 11:17       ` Michael Heerdegen
  2022-10-24 11:28         ` Mattias Engdegård
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Heerdegen @ 2022-10-24 11:17 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 58726

Mattias Engdegård <mattiase@acm.org> writes:

> Thanks for testing! I'm fairly certain of its correctness, but there
> could be other places with a similar bug that I didn't find.
> Nevertheless this should do it for now -- pushed to master.

Ok, thanks.  Can we close this one?

Side note: I'm sorry to tell you, but you messed up the example in the
commit message (that one evals to 1).


Michael.





^ permalink raw reply	[flat|nested] 6+ messages in thread

* bug#58726: 29.0.50; Bug in regexp matching with shy groups
  2022-10-24 11:17       ` Michael Heerdegen
@ 2022-10-24 11:28         ` Mattias Engdegård
  0 siblings, 0 replies; 6+ messages in thread
From: Mattias Engdegård @ 2022-10-24 11:28 UTC (permalink / raw)
  To: Michael Heerdegen; +Cc: 58726-done

24 okt. 2022 kl. 13.17 skrev Michael Heerdegen <michael_heerdegen@web.de>:

> Ok, thanks.  Can we close this one?

Of course, done.

> I'm sorry to tell you, but you messed up the example in the
> commit message (that one evals to 1).

Oh dear, sorry about that! At least there is a reference to this bug in case someone wonders.






^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-10-24 11:28 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-23  1:41 bug#58726: 29.0.50; Bug in regexp matching with shy groups Michael Heerdegen
2022-10-23 13:50 ` Mattias Engdegård
2022-10-24  2:38   ` Michael Heerdegen
2022-10-24 10:55     ` Mattias Engdegård
2022-10-24 11:17       ` Michael Heerdegen
2022-10-24 11:28         ` Mattias Engdegård

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).