all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: "Mattias Engdegård" <mattias.engdegard@gmail.com>
To: Stefan Monnier <monnier@iro.umontreal.ca>
Cc: Eli Zaretskii <eliz@gnu.org>, Paul Eggert <eggert@cs.ucla.edu>,
	64128@debbugs.gnu.org
Subject: bug#64128: regexp parser zero-width assertion bugs
Date: Tue, 20 Jun 2023 13:36:42 +0200	[thread overview]
Message-ID: <9F76C15B-0038-4DFE-B2DE-F35EC1D87C93@gmail.com> (raw)
In-Reply-To: <jwv5y7jw74n.fsf-monnier+emacs@gnu.org>

[-- Attachment #1: Type: text/plain, Size: 1509 bytes --]

19 juni 2023 kl. 22.08 skrev Stefan Monnier <monnier@iro.umontreal.ca>:

> Hmm... maybe it's less wrong, but I'd rather make it behave like
> AB\(\b\)*C, which is, I'd argue, even less wrong.

I agree, and you are probably right that it's safe to do that.

> Or maybe make it signal an error: I can't imagine that the current
> behavior is used by very much code at all, seeing how it's so
> seriously non-intuitive.

That might be even better if we can get away with it.

19 juni 2023 kl. 22.40 skrev Paul Eggert <eggert@cs.ucla.edu>:

> In other words, how about if we change the groups from your list:
> 
> Group A: ^ $ \` \' \b \B
> Group B: \< \> \_< \_> \=
> 
> to this:
> 
> Group A: ^ \`
> Group B: $ \' \b \B \< \> \_< \_> \=
> 
> where "*" is ordinary after Group A, and special after Group B and there is no other squirrelly behavior. And similarly for the other repetition operators.

Sounds fine, with the option to go full error on group B if we agree that that's even better.

> Attached is a proposed doc change for this, which I have not installed.

Thank you, it has been incorporated in the attached patch which follows your suggestions above.

Your previous regexp doc updates are most appreciated. I still think the whole chapter needs a reform from the sheer weight of organic growth over the years. In particular, the division between "regexp special" and "regexp backslash" is purely syntactical, not semantic, and groups things in the wrong way.



[-- Attachment #2: 0001-Straighten-regexp-postfix-operator-after-zero-width-.patch --]
[-- Type: application/octet-stream, Size: 8621 bytes --]

From ef54f07ca78b2eef5181deb4f28deab100be2a75 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 20 Jun 2023 12:12:50 +0200
Subject: [PATCH] Straighten regexp postfix operator after zero-width assertion
 parse

The zero-width assertions \` \' \b \B were parsed in a sloppy way so
that a following postfix repetition operator could yield surprising
results.  For instance, "\\b*" would act as "\\b\\*", and "xy\\b*"
would act as "\\(?:xy\\b\\)*".

Except for \` and ^, any following postfix operator now applies to the
zero-width assertion itself only which is predictable and consistent with
other assertions, although useless in practice.
For historical compatibility, an operator character following \` and ^
always becomes a literal. (Bug#64128)

* src/regex-emacs.c (regex_compile):
Set `laststart` appropriately for each zero-width assertion instead
of leaving it with whatever value it had before.
* test/src/regex-emacs-tests.el
(regexp-tests-zero-width-assertion-repetition): New test.
* doc/lispref/searching.texi (Regexp Special):
Say that repetition operators are not special after \`,
and that they work as expected after other backslash escapes.
* etc/NEWS: Announce.
---
 doc/lispref/searching.texi    |  6 +---
 etc/NEWS                      |  8 +++++
 src/regex-emacs.c             | 15 +++++++--
 test/src/regex-emacs-tests.el | 63 +++++++++++++++++++++++++++++++++++
 4 files changed, 85 insertions(+), 7 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 28230cea643..7c9893054d9 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -546,15 +546,11 @@ Regexp Special
 
 For historical compatibility, a repetition operator is treated as ordinary
 if it appears at the start of a regular expression
-or after @samp{^}, @samp{\(}, @samp{\(?:} or @samp{\|}.
+or after @samp{^}, @samp{\`}, @samp{\(}, @samp{\(?:} or @samp{\|}.
 For example, @samp{*foo} is treated as @samp{\*foo}, and
 @samp{two\|^\@{2\@}} is treated as @samp{two\|^@{2@}}.
 It is poor practice to depend on this behavior; use proper backslash
 escaping anyway, regardless of where the repetition operator appears.
-Also, a repetition operator should not immediately follow a backslash escape
-that matches only empty strings, as Emacs has bugs in this area.
-For example, it is unwise to use @samp{\b*}, which can be omitted
-without changing the documented meaning of the regular expression.
 
 As a @samp{\} is not special inside a bracket expression, it can
 never remove the special meaning of @samp{-}, @samp{^} or @samp{]}.
diff --git a/etc/NEWS b/etc/NEWS
index 2170323e74a..faf1f73b143 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -470,6 +470,14 @@ symbol, and either that symbol is ':eval' and the second element of
 the list evaluates to 'nil' or the symbol's value as a variable is
 'nil' or void.
 
++++
+** Regexp zero-width assertions followed by operators are better defined.
+Previously, regexps such as "xy\\B*" would have ill-defined behaviour.
+Now any operator following a zero-width assertion applies to that
+assertion only (which is useless).  For historical compatibility, an
+operator character following '^' or '\`' becomes literal, but we
+advise against relying on this.
+
 \f
 * Lisp Changes in Emacs 30.1
 
diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index fea34df991b..b02554791ce 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -1716,7 +1716,8 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
   /* Address of start of the most recently finished expression.
      This tells, e.g., postfix * where to find the start of its
-     operand.  Reset at the beginning of groups and alternatives.  */
+     operand.  Reset at the beginning of groups and alternatives,
+     and after ^ and \` for dusty-deck compatibility.  */
   unsigned char *laststart = 0;
 
   /* Address of beginning of regexp, or inside of last group.  */
@@ -1847,12 +1848,16 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 	case '^':
 	  if (! (p == pattern + 1 || at_begline_loc_p (pattern, p)))
 	    goto normal_char;
+	  /* Special case for compatibility: postfix ops after \` become
+	     literals.  */
+	  laststart = 0;
 	  BUF_PUSH (begline);
 	  break;
 
 	case '$':
 	  if (! (p == pend || at_endline_loc_p (p, pend)))
 	    goto normal_char;
+	  laststart = b;
 	  BUF_PUSH (endline);
 	  break;
 
@@ -1892,7 +1897,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 	    /* Star, etc. applied to an empty pattern is equivalent
 	       to an empty pattern.  */
-	    if (!laststart || laststart == b)
+	    if (laststart == b)
 	      break;
 
 	    /* Now we know whether or not zero matches is allowed
@@ -2544,18 +2549,24 @@ regex_compile (re_char *pattern, ptrdiff_t size,
               break;
 
 	    case 'b':
+	      laststart = b;
 	      BUF_PUSH (wordbound);
 	      break;
 
 	    case 'B':
+	      laststart = b;
 	      BUF_PUSH (notwordbound);
 	      break;
 
 	    case '`':
+	      /* Special case for compatibility: postfix ops after \` become
+		 literals, as for ^ (see above).  */
+	      laststart = 0;
 	      BUF_PUSH (begbuf);
 	      break;
 
 	    case '\'':
+	      laststart = b;
 	      BUF_PUSH (endbuf);
 	      break;
 
diff --git a/test/src/regex-emacs-tests.el b/test/src/regex-emacs-tests.el
index 52d43775b8e..e739e2b28a6 100644
--- a/test/src/regex-emacs-tests.el
+++ b/test/src/regex-emacs-tests.el
@@ -883,4 +883,67 @@ regexp-tests-backtrack-optimization
     (should (looking-at "x*\\(=\\|:\\)*"))
     (should (looking-at "x*=*?"))))
 
+(ert-deftest regexp-tests-zero-width-assertion-repetition ()
+  ;; Check compatibility behaviour with repetition operators after
+  ;; certain zero-width assertions (bug#64128).
+
+  ;; Postfix operators after ^ and \` become literals, for historical
+  ;; compatibility.  Only the first character of a lazy operator (like *?)
+  ;; becomes a literal.
+  (should (equal (string-match "^*a" "x\n*a") 2))
+  (should (equal (string-match "^*?a" "x\n*a") 2))
+  (should (equal (string-match "^*?a" "x\na") 2))
+  (should (equal (string-match "^*?a" "x\n**a") nil))
+
+  (should (equal (string-match "\\`*a" "*a") 0))
+  (should (equal (string-match "\\`*?a" "*a") 0))
+  (should (equal (string-match "\\`*?a" "a") 0))
+  (should (equal (string-match "\\`*?a" "**a") nil))
+
+  ;; Other zero-width assertions are treated as normal elements, so postfix
+  ;; operators apply to them alone (which is pointless but valid).
+  (should (equal (string-match "\\b*!" "*!") 1))
+  (should (equal (string-match "!\\b+;" "!;") nil))
+  (should (equal (string-match "!\\b+a" "!a") 0))
+
+  (should (equal (string-match "\\B*!" "*!") 1))
+  (should (equal (string-match "!\\B+;" "!;") 0))
+  (should (equal (string-match "!\\B+a" "!a") nil))
+
+  (should (equal (string-match "\\<*b" "*b") 1))
+  (should (equal (string-match "a\\<*b" "ab") 0))
+  (should (equal (string-match ";\\<*b" ";b") 0))
+  (should (equal (string-match "a\\<+b" "ab") nil))
+  (should (equal (string-match ";\\<+b" ";b") 0))
+
+  (should (equal (string-match "\\>*;" "*;") 1))
+  (should (equal (string-match "a\\>*b" "ab") 0))
+  (should (equal (string-match "a\\>*;" "a;") 0))
+  (should (equal (string-match "a\\>+b" "ab") nil))
+  (should (equal (string-match "a\\>+;" "a;") 0))
+
+  (should (equal (string-match "a\\'" "ab") nil))
+  (should (equal (string-match "b\\'" "ab") 1))
+  (should (equal (string-match "a\\'*b" "ab") 0))
+  (should (equal (string-match "a\\'+" "ab") nil))
+  (should (equal (string-match "b\\'+" "ab") 1))
+  (should (equal (string-match "\\'+" "+") 1))
+
+  (should (equal (string-match "\\_<*b" "*b") 1))
+  (should (equal (string-match "a\\_<*b" "ab") 0))
+  (should (equal (string-match " \\_<*b" " b") 0))
+  (should (equal (string-match "a\\_<+b" "ab") nil))
+  (should (equal (string-match " \\_<+b" " b") 0))
+
+  (should (equal (string-match "\\_>*;" "*;") 1))
+  (should (equal (string-match "a\\_>*b" "ab") 0))
+  (should (equal (string-match "a\\_>* " "a ") 0))
+  (should (equal (string-match "a\\_>+b" "ab") nil))
+  (should (equal (string-match "a\\_>+ " "a ") 0))
+
+  (should (equal (string-match "\\=*b" "*b") 1))
+  (should (equal (string-match "a\\=*b" "a*b") nil))
+  (should (equal (string-match "a\\=*b" "ab") 0))
+  )
+
 ;;; regex-emacs-tests.el ends here
-- 
2.32.0 (Apple Git-132)


  reply	other threads:[~2023-06-20 11:36 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-17 12:20 bug#64128: regexp parser zero-width assertion bugs Mattias Engdegård
2023-06-17 18:44 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-06-17 20:07   ` Mattias Engdegård
2023-06-17 22:18     ` Paul Eggert
2023-06-18  4:55       ` Eli Zaretskii
2023-06-18 20:26         ` Mattias Engdegård
2023-06-19  3:04           ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-06-19  8:44             ` Mattias Engdegård
2023-06-19 12:54               ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-06-19 18:34                 ` Mattias Engdegård
2023-06-19 19:21                   ` Paul Eggert
2023-06-19 19:52                     ` Mattias Engdegård
2023-06-19 20:08                       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-06-20 11:36                         ` Mattias Engdegård [this message]
2023-06-21  6:08                           ` Paul Eggert
2023-06-21 15:57                             ` Mattias Engdegård
2023-06-19 20:40                       ` Paul Eggert
2023-06-19 18:14           ` Paul Eggert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9F76C15B-0038-4DFE-B2DE-F35EC1D87C93@gmail.com \
    --to=mattias.engdegard@gmail.com \
    --cc=64128@debbugs.gnu.org \
    --cc=eggert@cs.ucla.edu \
    --cc=eliz@gnu.org \
    --cc=monnier@iro.umontreal.ca \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.