all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: "Mattias Engdegård" <mattiase@acm.org>
To: Michael Heerdegen <michael_heerdegen@web.de>
Cc: 58726@debbugs.gnu.org
Subject: bug#58726: 29.0.50; Bug in regexp matching with shy groups
Date: Sun, 23 Oct 2022 15:50:41 +0200	[thread overview]
Message-ID: <7E80A46A-DB9F-407F-B3F1-33E1DA5689EF@acm.org> (raw)
In-Reply-To: <87y1t72u62.fsf@web.de>

[-- Attachment #1: Type: text/plain, Size: 1015 bytes --]

Michael, thank you for finding this amusing bug!

>   (string-match-p "\\`\\(?:ab\\)*\\'" "a") ==> 0

With a bit of help from the regexp-disasm package, we see that this compiles to

    0  begbuf
    1  on-failure-jump-smart to 11
    4  exact "ab"
    8  jump to 1
   11  endbuf
   12  succeed

where the on-failure-jump-smart op turns into on-failure-keep-string-jump the first time it's executed.

This gives us a clue about what is wrong: when there is a failure inside an 'exact' string match, the target pointer should be reset to the start of that string ("ab" here) before jumping to the failure location.

Reading the source it becomes clear that this is done correctly when there is a mismatch, but not if the target string ends prematurely because PREFETCH() has no idea that it should reset the target pointer! Easy enough to fix.

Please try the attached patch. (The patch takes care of counted repetitions for good measure although I wasn't able to provoke a failure directly.)


[-- Attachment #2: 0001-Fix-regexp-matching-with-atomic-strings-and-optimise.patch --]
[-- Type: application/octet-stream, Size: 3240 bytes --]

From a1bc10533625bde326434325bc75cf1934895472 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sun, 23 Oct 2022 15:40:37 +0200
Subject: [PATCH] Fix regexp matching with atomic strings and optimised
 backtracking

This bug occurs when an atomic pattern is matched at the end of
a string and the on-failure-keep-string-jump optimisation is
in effect, as in:

  (string-match "\\'\\(?:ab\\)*\\'" "a")

which succeeded but clearly should not (bug#58726).

Reported by Michael Heerdegen.

* src/regex-emacs.c (PREFETCH): Add reset parameter.
(re_match_2_internal): Use it for proper atomic pattern treatment.
* test/src/regex-emacs-tests.el (regexp-atomic-failure): New test.
---
 src/regex-emacs.c             | 14 +++++++++-----
 test/src/regex-emacs-tests.el |  5 +++++
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index 9b2c14c413..626560911f 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -3446,14 +3446,18 @@ #define POINTER_TO_OFFSET(ptr)			\
 
 /* Call before fetching a character with *d.  This switches over to
    string2 if necessary.
+   `reset' is executed before backtracking if there are no more characters.
    Check re_match_2_internal for a discussion of why end_match_2 might
    not be within string2 (but be equal to end_match_1 instead).  */
-#define PREFETCH()							\
+#define PREFETCH(reset)							\
   while (d == dend)							\
     {									\
       /* End of string2 => fail.  */					\
       if (dend == end_match_2)						\
-	goto fail;							\
+        {								\
+	  reset;							\
+	  goto fail;							\
+	}								\
       /* End of string1 => advance to string2.  */			\
       d = string2;							\
       dend = end_match_2;						\
@@ -4252,7 +4256,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp,
 		int pat_charlen, buf_charlen;
 		int pat_ch, buf_ch;
 
-		PREFETCH ();
+		PREFETCH (d = dfail);
 		if (multibyte)
 		  pat_ch = string_char_and_length (p, &pat_charlen);
 		else
@@ -4280,7 +4284,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp,
 		int pat_charlen;
 		int pat_ch, buf_ch;
 
-		PREFETCH ();
+		PREFETCH (d = dfail);
 		if (multibyte)
 		  {
 		    pat_ch = string_char_and_length (p, &pat_charlen);
@@ -4486,7 +4490,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp,
 		if (d2 == dend2) break;
 
 		/* If necessary, advance to next segment in data.  */
-		PREFETCH ();
+		PREFETCH (d = dfail);
 
 		/* How many characters left in this segment to match.  */
 		dcnt = dend - d;
diff --git a/test/src/regex-emacs-tests.el b/test/src/regex-emacs-tests.el
index ff0d6be3f5..b323f592dc 100644
--- a/test/src/regex-emacs-tests.el
+++ b/test/src/regex-emacs-tests.el
@@ -867,4 +867,9 @@ regexp-eszett
     (should (equal (string-match "[[:lower:]]" "ẞ") 0))
     (should (equal (string-match "[[:upper:]]" "ẞ") 0))))
 
+(ert-deftest regexp-atomic-failure ()
+  "Bug#58726."
+  (should (equal (string-match "\\`\\(?:ab\\)*\\'" "a") nil))
+  (should (equal (string-match "\\`a\\{2\\}*\\'" "a") nil)))
+
 ;;; regex-emacs-tests.el ends here
-- 
2.32.0 (Apple Git-132)


  reply	other threads:[~2022-10-23 13:50 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-23  1:41 bug#58726: 29.0.50; Bug in regexp matching with shy groups Michael Heerdegen
2022-10-23 13:50 ` Mattias Engdegård [this message]
2022-10-24  2:38   ` Michael Heerdegen
2022-10-24 10:55     ` Mattias Engdegård
2022-10-24 11:17       ` Michael Heerdegen
2022-10-24 11:28         ` Mattias Engdegård

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7E80A46A-DB9F-407F-B3F1-33E1DA5689EF@acm.org \
    --to=mattiase@acm.org \
    --cc=58726@debbugs.gnu.org \
    --cc=michael_heerdegen@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.