bug#64128: regexp parser zero-width assertion bugs

* bug#64128: regexp parser zero-width assertion bugs
@ 2023-06-17 12:20 Mattias Engdegård
  2023-06-17 18:44 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 18+ messages in thread
From: Mattias Engdegård @ 2023-06-17 12:20 UTC (permalink / raw)
  To: 64128; +Cc: Paul Eggert, Stefan Monnier

[-- Attachment #1: Type: text/plain, Size: 1385 bytes --]

In Emacs regexps, some but not all zero-width assertions have a special property: they are not treated as an operand by an immediately following ?, * or +. For example,

  \b*

matches a literal asterisk at a word boundary -- the `*` becomes literal because it is treated as if there were nothing for it to act upon. Even stranger:

  xy\b*

is parsed as, in rx syntax, (* "xy" word-boundary) which is remarkable: the repetition operator encompasses several elements even though there are no brackets given. Demo:

(and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
     (match-data))
=> (0 18)

Zero-width assertions that have the property:
^ (bol), $ (eol), \` (bos), \' (eos), \b (word-boundary), \B (not-word-boundary)

Zero-width assertions that do not have the property (and are treated as any other element):
\< (bow), \> (eow), \_< (symbol-start), \_> (symbol-end), \= (point)

These regexp patterns should be very rare in practice: they should always be a mistake, but it would be nice if they behaved in a way that makes some kind of sense.

A modest improvement would be to make operators become literal after any zero-width assertion, so that

  \<*

becomes (: word-start "*") instead of (* word-start), and

  xy\b*

becomes (: "xy" word-boundary "*") instead of (* "xy" word-boundary).

Suggested patch attached.
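
Independently of how the parser ends up treating these patterns, the intent can be written out unambiguously. A minimal sketch, using only `rx` and explicit escaping (the sample strings are made up for illustration):

```elisp
;; Literal asterisk at a word boundary: escape the asterisk rather than
;; relying on how an operator after \b is parsed.
(rx word-boundary "*")                      ; => "\\b\\*"
(string-match "\\b\\*" "ab*")               ; => 2

;; Repetition spanning several elements: use an explicit group.
(rx (* "quack," word-boundary))             ; => "\\(?:quack,\\b\\)*"
(and (string-match "\\(?:quack,\\b\\)*" "quack,quack,quack,quaaaack!")
     (match-data))                          ; => (0 18)
```

The second form asks explicitly for what the "quack,\\b*" demo produces by accident, so it does not depend on the parsing quirk described above.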


[-- Attachment #2: regexp-zero-width-assertion-bug.diff --]
[-- Type: application/octet-stream, Size: 2857 bytes --]

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index e3237cd425a..120a727cf74 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -1716,7 +1716,9 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
   /* Address of start of the most recently finished expression.
      This tells, e.g., postfix * where to find the start of its
-     operand.  Reset at the beginning of groups and alternatives.  */
+     operand.  Reset at the beginning of groups and alternatives,
+     and after any zero-width assertion (which should not be the target
+     of any postfix repetition operators).  */
   unsigned char *laststart = 0;
 
   /* Address of beginning of regexp, or inside of last group.  */
@@ -1847,12 +1849,14 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 	case '^':
 	  if (! (p == pattern + 1 || at_begline_loc_p (pattern, p)))
 	    goto normal_char;
+	  laststart = 0;
 	  BUF_PUSH (begline);
 	  break;
 
 	case '$':
 	  if (! (p == pend || at_endline_loc_p (p, pend)))
 	    goto normal_char;
+	  laststart = 0;
 	  BUF_PUSH (endline);
 	  break;
 
@@ -1892,7 +1896,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 	    /* Star, etc. applied to an empty pattern is equivalent
 	       to an empty pattern.  */
-	    if (!laststart || laststart == b)
+	    if (laststart == b)
 	      break;
 
 	    /* Now we know whether or not zero matches is allowed
@@ -2482,7 +2486,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 	       goto normal_char;
 
 	    case '=':
-	      laststart = b;
+	      laststart = 0;
 	      BUF_PUSH (at_dot);
 	      break;
 
@@ -2523,17 +2527,17 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 
 	    case '<':
-	      laststart = b;
+	      laststart = 0;
 	      BUF_PUSH (wordbeg);
 	      break;
 
 	    case '>':
-	      laststart = b;
+	      laststart = 0;
 	      BUF_PUSH (wordend);
 	      break;
 
 	    case '_':
-              laststart = b;
+              laststart = 0;
               PATFETCH (c);
               if (c == '<')
                 BUF_PUSH (symbeg);
@@ -2544,18 +2548,22 @@ regex_compile (re_char *pattern, ptrdiff_t size,
               break;
 
 	    case 'b':
+	      laststart = 0;
 	      BUF_PUSH (wordbound);
 	      break;
 
 	    case 'B':
+	      laststart = 0;
 	      BUF_PUSH (notwordbound);
 	      break;
 
 	    case '`':
+	      laststart = 0;
 	      BUF_PUSH (begbuf);
 	      break;
 
 	    case '\'':
+	      laststart = 0;
 	      BUF_PUSH (endbuf);
 	      break;
 
@@ -2597,7 +2605,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 	      /* If followed by a repetition operator.  */
 	      || (p != pend
-		  && (*p == '*' || *p == '+' || *p == '?' || *p == '^'))
+		  && (*p == '*' || *p == '+' || *p == '?'))
 	      || (p + 1 < pend && p[0] == '\\' && p[1] == '{'))
 	    {
 	      /* Start building a new exactn.  */


* bug#64128: regexp parser zero-width assertion bugs
  2023-06-17 12:20 bug#64128: regexp parser zero-width assertion bugs Mattias Engdegård
@ 2023-06-17 18:44 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-06-17 20:07   ` Mattias Engdegård
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-06-17 18:44 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 64128, eggert

> (and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
>      (match-data))
> => (0 18)

That's so bizarre that it feels like we really should try and preserve
it for posterity.
Not.

> These regexp patterns should be very rare in practice: they should
> always be a mistake, but it would be nice if they behaved in a way
> that makes some kind of sense.
>
> A modest improvement would be to make operators become literal after
> any zero-width assertion, so that

I think the behavior that makes most sense is to signal an error when
compiling the regexp.


        Stefan







* bug#64128: regexp parser zero-width assertion bugs
  2023-06-17 18:44 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-06-17 20:07   ` Mattias Engdegård
  2023-06-17 22:18     ` Paul Eggert
  0 siblings, 1 reply; 18+ messages in thread
From: Mattias Engdegård @ 2023-06-17 20:07 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 64128, eggert

On 17 June 2023 at 20:44, Stefan Monnier <monnier@iro.umontreal.ca> wrote:

> I think the behavior that makes most sense is to signal an error when
> compiling the regexp.

Clearly, but some behaviour needs to be preserved for compatibility.
Regexps like "^*" aren't uncommon. Can it be generalised in a useful way?







* bug#64128: regexp parser zero-width assertion bugs
  2023-06-17 20:07   ` Mattias Engdegård
@ 2023-06-17 22:18     ` Paul Eggert
  2023-06-18  4:55       ` Eli Zaretskii
  0 siblings, 1 reply; 18+ messages in thread
From: Paul Eggert @ 2023-06-17 22:18 UTC (permalink / raw)
  To: Mattias Engdegård, Stefan Monnier; +Cc: 64128

[-- Attachment #1: Type: text/plain, Size: 1023 bytes --]

On 2023-06-17 13:07, Mattias Engdegård wrote:
> On 17 June 2023 at 20:44, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> 
>> I think the behavior that makes most sense is to signal an error when
>> compiling the regexp.
> 
> Clearly, but some behaviour needs to be preserved for compatibility.
> Regexps like "^*" aren't uncommon. Can it be generalised in a useful way?
> 

doc/lispref/searching.texi says that "*" is treated as an ordinary 
character if it is in a context where its special meaning makes no 
sense, giving "*foo" as an example. If we break with this tradition by 
making "\b*" an error instead of being equivalent to "\b\*", we should 
update that part of the manual.
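
The documented behaviour referred to here can be checked directly. A small sketch (the sample strings are arbitrary); each unescaped form is expected to behave like its escaped spelling:

```elisp
;; A leading `*' has nothing to act on, so it matches itself.
(string-match "*foo" "a*foo")       ; => 1
(string-match "\\*foo" "a*foo")     ; => 1

;; The same holds for `*' directly after `^'.
(string-match "^*" "a\n*b")         ; => 2
(string-match "^\\*" "a\n*b")       ; => 2
```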

One possible way forward is to update doc/lispref/searching.texi to 
specify what we want. Then we can modify the code to match the updated 
documentation.

In my experience, modifying the doc is often the hard part, so I took a 
crack at that in the draft proposed patch, which I have not installed.

Comments?

[-- Attachment #2: 0001-Document-that-b-etc-are-now-invalid-regexps.patch --]
[-- Type: text/x-patch, Size: 3904 bytes --]

From e4fc369a624d85027d39a424a507507da00f26aa Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Sat, 17 Jun 2023 15:05:42 -0700
Subject: [PROPOSED] Document that \b* etc are now invalid regexps

---
 doc/lispref/searching.texi | 24 ++++++++++++++++--------
 etc/NEWS                   |  6 ++++++
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index b8d9094b28..fd4dfcbd71 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -332,6 +332,10 @@ Regexp Special
 expression.  Thus, @samp{fo*} has a repeating @samp{o}, not a repeating
 @samp{fo}.  It matches @samp{f}, @samp{fo}, @samp{foo}, and so on.
 
+@samp{*} cannot immediately follow a backslash escape that matches
+only empty strings, as this is too likely to be a typo.  For example,
+@samp{\<*} is invalid.
+
 @cindex backtracking and regular expressions
 The matcher processes a @samp{*} construct by matching, immediately, as
 many repetitions as can be found.  Then it continues with the rest of
@@ -505,9 +509,10 @@ Regexp Special
 When matching a string instead of a buffer, @samp{^} matches at the
 beginning of the string or after a newline character.
 
-For historical compatibility reasons, @samp{^} can be used only at the
+For historical compatibility, @samp{^} is special only at the
 beginning of the regular expression, or after @samp{\(}, @samp{\(?:}
-or @samp{\|}.
+or @samp{\|}.  In other contexts it is an ordinary character, except
+for its special meaning at the start of a character alternative.
 
 @item @samp{$}
 @cindex @samp{$} in regexp
@@ -519,8 +524,9 @@ Regexp Special
 When matching a string instead of a buffer, @samp{$} matches at the end
 of the string or before a newline character.
 
-For historical compatibility reasons, @samp{$} can be used only at the
+For historical compatibility, @samp{$} is special only at the
 end of the regular expression, or before @samp{\)} or @samp{\|}.
+In other contexts it is an ordinary character.
 
 @item @samp{\}
 @cindex @samp{\} in regexp
@@ -541,11 +547,13 @@ Regexp Special
 @end table
 
 @strong{Please note:} For historical compatibility, special characters
-are treated as ordinary ones if they are in contexts where their special
-meanings make no sense.  For example, @samp{*foo} treats @samp{*} as
-ordinary since there is no preceding expression on which the @samp{*}
-can act.  It is poor practice to depend on this behavior; quote the
-special character anyway, regardless of where it appears.
+are treated as ordinary ones if they would otherwise start repetition
+operators either at the start of a regular expression, or after
+@samp{^}, @samp{\(}, @samp{\(?:} or @samp{\|}.  For example,
+@samp{*foo} is treated as @samp{\*foo}, and @samp{two\|^\@{2\@}} is
+treated as @samp{two\|^@{2@}}.  It is poor practice to depend on this
+behavior; use proper backslash escaping anyway, regardless of where
+the special character appears.
 
 As a @samp{\} is not special inside a character alternative, it can
 never remove the special meaning of @samp{-}, @samp{^} or @samp{]}.
diff --git a/etc/NEWS b/etc/NEWS
index 61e6e16166..0c4889f9a6 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -436,6 +436,12 @@ Previously, '\x' without at least one hex digit denoted character code
 zero (NUL) but as this was neither intended nor documented or even
 known by anyone, it is now treated as an error by the Lisp reader.
 
+===
+** In regular expressions, zero-width backslash escapes can no longer
+be followed by repetition operators.  For example, '\b*' is no longer
+a valid regular expression.  Previously the behavior was erratic for
+these constructs, and they were typically typos anyway.
+
 ---
 ** Connection-local variables are applied in buffers visiting a remote file.
 This overrides possible directory-local or file-local variables with
-- 
2.39.2



* bug#64128: regexp parser zero-width assertion bugs
  2023-06-17 22:18     ` Paul Eggert
@ 2023-06-18  4:55       ` Eli Zaretskii
  2023-06-18 20:26         ` Mattias Engdegård
  0 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2023-06-18  4:55 UTC (permalink / raw)
  To: Paul Eggert; +Cc: mattias.engdegard, monnier, 64128

> Cc: 64128@debbugs.gnu.org
> Date: Sat, 17 Jun 2023 15:18:00 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> 
> > Clearly, but some behaviour needs to be preserved for compatibility.
> > Regexps like "^*" aren't uncommon. Can it be generalised in a useful way?
> > 
> 
> doc/lispref/searching.texi says that "*" is treated as an ordinary 
> character if it is in a context where its special meaning makes no 
> sense, giving "*foo" as an example. If we break with this tradition by 
> making "\b*" an error instead of being equivalent to "\b\*", we should 
> update that part of the manual.
> 
> One possible way forward is to update doc/lispref/searching.texi to 
> specify what we want. Then we can modify the code to match the updated 
> documentation.
> 
> In my experience, modifying the doc is often the hard part, so I took a 
> crack at that in the draft proposed patch, which I have not installed.
> 
> Comments?

My comment is that since this was a documented feature, I'm not
interested in making it an error.






* bug#64128: regexp parser zero-width assertion bugs
  2023-06-18  4:55       ` Eli Zaretskii
@ 2023-06-18 20:26         ` Mattias Engdegård
  2023-06-19  3:04           ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-06-19 18:14           ` Paul Eggert
  0 siblings, 2 replies; 18+ messages in thread
From: Mattias Engdegård @ 2023-06-18 20:26 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Paul Eggert, monnier, 64128

On 18 June 2023 at 06:55, Eli Zaretskii <eliz@gnu.org> wrote:

> My comment is that since this was a documented feature, I'm not
> interested in making it an error.

Yes, it would be unwise to raise an error for "^*" or the like; it's in active use.
The manual is a bit hazy about what we actually promise, though.

As Paul notes, we must be able to document it and that might not be easy, so perhaps we shouldn't even try (to change, or document)?

To make everything clear, we have two groups of zero-width assertions:

Group A: ^ $ \` \' \b \B
Group B: \< \> \_< \_> \=

Group B assertions work like ordinary elements, syntactically and semantically. Simple, predictable, but also useless.

Group A assertions are more interesting, and there are two cases. If nothing precedes a train of such assertions, as in

   "^\\`\\b\\`*?"

then the first character of the operator becomes a literal (and a second operator character, if present, becomes an operator acting on that literal).
If something does precede the assertions, the operator acts on the last element before them, except that consecutive literal characters coalesce into a single element. A further exception: an out-of-place `^` among the literal characters splits the sequence into separate segments, though not exactly where you would expect.
For example,

  "abc^def\\B\\B+?"

means, I think,

  (seq "ab" (+? "c^def" not-word-boundary not-word-boundary))
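
One way to see which assertions fall into which group on a given Emacs build is to probe whether a following `*` is taken literally. A rough sketch; `my-star-literal-after-p` is a hypothetical helper, and it relies on the observation that a repetition can always match the empty string while a literal asterisk cannot:

```elisp
(defun my-star-literal-after-p (assertion)
  "Return non-nil if `*' right after ASSERTION is parsed as a literal.
ASSERTION is a regexp string such as \"\\\\b\" or \"^\"."
  ;; A repetition operator allows zero repetitions and therefore matches
  ;; the empty string; a literal asterisk needs an actual `*' to match.
  (not (string-match-p (concat assertion "*") "")))

;; List the result for every assertion named above.  The output depends
;; on the Emacs version, so no expected values are given here.
(mapcar (lambda (a) (cons a (my-star-literal-after-p a)))
        '("^" "$" "\\`" "\\'" "\\b" "\\B" "\\<" "\\>" "\\_<" "\\_>" "\\="))
```

The probe only distinguishes "literal" from "some kind of repetition"; it does not reveal what the repetition acts on, which is the stranger half of the problem.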


* bug#64128: regexp parser zero-width assertion bugs
  2023-06-18 20:26         ` Mattias Engdegård
@ 2023-06-19  3:04           ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-06-19  8:44             ` Mattias Engdegård
  2023-06-19 18:14           ` Paul Eggert
  1 sibling, 1 reply; 18+ messages in thread
From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-06-19  3:04 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Eli Zaretskii, Paul Eggert, 64128

> To make everything clear, we have two groups of zero-width assertions:
>
> Group A: ^ $ \` \' \b \B

IIRC `^` is only special if it's at the beginning of a group, so `^*` will
always treat this * as a literal, right?
"Similarly" `$` is only special if it's at the end of a group, so `$*` will
always be a repetition of the $ character no?

So the remaining problematic elements are \` \' \b and \B

I suspect if we don't want to signal errors, the next best thing is to
treat them like group B.
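
A quick check of these two cases, as a sketch with made-up sample strings:

```elisp
;; `^' at the start of a regexp is an anchor, and the `*' after it is a
;; literal, so an asterisk must be present for a match.
(string-match "^*x" "*x")           ; => 0
(string-match "^*x" "x")            ; => nil

;; `$' in the middle of a regexp is an ordinary character, so `$*'
;; repeats a literal dollar sign (zero or more times).
(string-match "a$*b" "a$$$b")       ; => 0
(match-end 0)                       ; => 5
(string-match "a$*b" "ab")          ; => 0
```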


        Stefan







* bug#64128: regexp parser zero-width assertion bugs
  2023-06-19  3:04           ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-06-19  8:44             ` Mattias Engdegård
  2023-06-19 12:54               ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 18+ messages in thread
From: Mattias Engdegård @ 2023-06-19  8:44 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Eli Zaretskii, Paul Eggert, 64128

On 19 June 2023 at 05:04, Stefan Monnier <monnier@iro.umontreal.ca> wrote:

> `^` is only special if it's at the beginning of a group, so `^*` will
> always treat this * as a literal, right?
> "Similarly" `$` is only special if it's at the end of a group, so `$*` will
> always be a repetition of the $ character no?

Yes, ^ and $ have additional rules for when they are plain literals and not subject to these bugs at all.

The literal-splitting powers of ^ have now (075e77ac44) been removed.

> So the remaining problematic elements are \` \' \b and \B

\`* has been observed, so we probably need to keep that working as well.

> I suspect if we don't want to signal errors, the next best thing is to
> treat them like group B.

Yes, maybe; they are less likely to be followed by an operator-literal, but it would also be good to have all zero-width assertions work the same way.
On the other hand, it can't be worse than we have now, as long as we get rid of the "quack,\\b*" semantics.


* bug#64128: regexp parser zero-width assertion bugs
  2023-06-19  8:44             ` Mattias Engdegård
@ 2023-06-19 12:54               ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-06-19 18:34                 ` Mattias Engdegård
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-06-19 12:54 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Eli Zaretskii, Paul Eggert, 64128

I wish there was a way to emit warnings about oddball constructs
(starting with the "* is literal when encountered at the beginning of
a regexp").


        Stefan


Mattias Engdegård [2023-06-19 10:44:04] wrote:

> On 19 June 2023 at 05:04, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>
>> `^` is only special if it's at the beginning of a group, so `^*` will
>> always treat this * as a literal, right?
>> "Similarly" `$` is only special if it's at the end of a group, so `$*` will
>> always be a repetition of the $ character no?
>
> Yes, ^ and $ have additional rules for when they are plain literals and not
> subject to these bugs at all.
>
> The literal-splitting powers of ^ have now (075e77ac44) been removed.
>
>> So the remaining problematic elements are \` \' \b and \B
>
> \`* has been observed, so we probably need to keep that working as well.
>
>> I suspect if we don't want to signal errors, the next best thing is to
>> treat them like group B.
>
> Yes, maybe; they are less likely to be followed by an operator-literal, but
> it would also be good to have all zero-width assertions work the same way.
> On the other hand, it can't be worse than we have now, as long as we get rid
> of the "quack,\\b*" semantics.







* bug#64128: regexp parser zero-width assertion bugs
  2023-06-18 20:26         ` Mattias Engdegård
  2023-06-19  3:04           ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-06-19 18:14           ` Paul Eggert
  1 sibling, 0 replies; 18+ messages in thread
From: Paul Eggert @ 2023-06-19 18:14 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Eli Zaretskii, monnier, 64128

[-- Attachment #1: Type: text/plain, Size: 982 bytes --]

On 2023-06-18 13:26, Mattias Engdegård wrote:
> The manual is a bit hazy about what we actually promise, though.
> 
> As Paul notes, we must be able to document it and that might not be easy, so perhaps we shouldn't even try (to change, or document)?

Although it's not easy to document, we should do better. I gave that a 
shot by installing the attached patches into the master branch. These 
patches try to document current behavior, including warning about the 
squirrelly behavior you mention. If/when we fix the squirrelly behavior 
we can change that part of the manual accordingly.

The last of the three patches is merely a terminology change: it 
standardizes on the term "bracket expression" for regexps like [a-z]. 
Formerly the doc and comments were inconsistent about the terminology. 
It's better to stick with the POSIX term here, to avoid confusion. I 
myself got confused about this when editing the other two patches.

Comments welcome as usual.

[-- Attachment #2: 0001-Document-regular-expression-special-cases-better.patch --]
[-- Type: text/x-patch, Size: 3091 bytes --]

From d84b026dbefce6604a35a83131649291a74fda67 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Mon, 19 Jun 2023 11:09:00 -0700
Subject: [PATCH 1/3] Document regular expression special cases better

In particular, document that escape sequences like \b*
are currently buggy.
---
 doc/lispref/searching.texi | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index b8d9094b28d..3970faebbf3 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -505,9 +505,10 @@ Regexp Special
 When matching a string instead of a buffer, @samp{^} matches at the
 beginning of the string or after a newline character.
 
-For historical compatibility reasons, @samp{^} can be used only at the
-beginning of the regular expression, or after @samp{\(}, @samp{\(?:}
-or @samp{\|}.
+For historical compatibility, @samp{^} is special only at the beginning
+of the regular expression, or after @samp{\(}, @samp{\(?:} or @samp{\|}.
+Although @samp{^} is an ordinary character in other contexts,
+it is good practice to use @samp{\^} even then.
 
 @item @samp{$}
 @cindex @samp{$} in regexp
@@ -519,8 +520,10 @@ Regexp Special
 When matching a string instead of a buffer, @samp{$} matches at the end
 of the string or before a newline character.
 
-For historical compatibility reasons, @samp{$} can be used only at the
+For historical compatibility, @samp{$} is special only at the
 end of the regular expression, or before @samp{\)} or @samp{\|}.
+Although @samp{$} is an ordinary character in other contexts,
+it is good practice to use @samp{\$} even then.
 
 @item @samp{\}
 @cindex @samp{\} in regexp
@@ -540,12 +543,17 @@ Regexp Special
 @samp{\} is @code{"\\\\"}.
 @end table
 
-@strong{Please note:} For historical compatibility, special characters
-are treated as ordinary ones if they are in contexts where their special
-meanings make no sense.  For example, @samp{*foo} treats @samp{*} as
-ordinary since there is no preceding expression on which the @samp{*}
-can act.  It is poor practice to depend on this behavior; quote the
-special character anyway, regardless of where it appears.
+For historical compatibility, a repetition operator is treated as ordinary
+if it appears at the start of a regular expression
+or after @samp{^}, @samp{\(}, @samp{\(?:} or @samp{\|}.
+For example, @samp{*foo} is treated as @samp{\*foo}, and
+@samp{two\|^\@{2\@}} is treated as @samp{two\|^@{2@}}.
+It is poor practice to depend on this behavior; use proper backslash
+escaping anyway, regardless of where the repetition operator appears.
+Also, a repetition operator should not immediately follow a backslash escape
+that matches only empty strings, as Emacs has bugs in this area.
+For example, it is unwise to use @samp{\b*}, which can be omitted
+without changing the documented meaning of the regular expression.
 
 As a @samp{\} is not special inside a character alternative, it can
 never remove the special meaning of @samp{-}, @samp{^} or @samp{]}.
-- 
2.39.2


[-- Attachment #3: 0002-Document-Emacs-vs-POSIX-REs.patch --]
[-- Type: text/x-patch, Size: 6664 bytes --]

From 5dfe3f21d12a107055fb447be58b94be98c2f628 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Mon, 19 Jun 2023 11:09:00 -0700
Subject: [PATCH 2/3] Document Emacs vs POSIX REs

* doc/lispref/searching.texi (Longest Match):
Rename from POSIX Regexps, as this section
is about longest-match functions, not about POSIX regexps.
(POSIX Regexps): New section.
---
 doc/lispref/searching.texi | 105 +++++++++++++++++++++++++++++++++++--
 1 file changed, 101 insertions(+), 4 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 3970faebbf3..608abae762c 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -18,11 +18,12 @@ Searching and Matching
 * Searching and Case::    Case-independent or case-significant searching.
 * Regular Expressions::   Describing classes of strings.
 * Regexp Search::         Searching for a match for a regexp.
-* POSIX Regexps::         Searching POSIX-style for the longest match.
+* Longest Match::         Searching for the longest match.
 * Match Data::            Finding out which part of the text matched,
                             after a string or regexp search.
 * Search and Replace::    Commands that loop, searching and replacing.
 * Standard Regexps::      Useful regexps for finding sentences, pages,...
+* POSIX Regexps::         Emacs regexps vs POSIX regexps.
 @end menu
 
   The @samp{skip-chars@dots{}} functions also perform a kind of searching.
@@ -2201,8 +2202,8 @@ Regexp Search
 a part of the code.
 @end defvar
 
-@node POSIX Regexps
-@section POSIX Regular Expression Searching
+@node Longest Match
+@section Longest-match searching for regular expression matches
 
 @cindex backtracking and POSIX regular expressions
   The usual regular expression functions do backtracking when necessary
@@ -2217,7 +2218,9 @@ POSIX Regexps
 match, as required by POSIX@.  This is much slower, so use these
 functions only when you really need the longest match.
 
-  The POSIX search and match functions do not properly support the
+  Despite their names, the POSIX search and match functions
+use Emacs regular expressions, not POSIX regular expressions.
+@xref{POSIX Regexps}.  Also, they do not properly support the
 non-greedy repetition operators (@pxref{Regexp Special, non-greedy}).
 This is because POSIX backtracking conflicts with the semantics of
 non-greedy repetition.
@@ -2965,3 +2968,97 @@ Standard Regexps
 @code{sentence-end-without-period}, and
 @code{sentence-end-without-space}.
 @end defun
+
+@node POSIX Regexps
+@section Emacs versus POSIX Regular Expressions
+@cindex POSIX regular expressions
+
+Regular expression syntax varies significantly among computer programs.
+When writing Elisp code that generates regular expressions for use by other
+programs, it is helpful to know how syntax variants differ.
+To give a feel for the variation, this section discusses how
+Emacs regular expressions differ from two syntax variants standardized by POSIX:
+basic regular expressions (BREs) and extended regular expressions (EREs).
+Plain @command{grep} uses BREs, and @samp{grep -E} uses EREs.
+
+Emacs regular expressions have a syntax closer to EREs than to BREs,
+with some extensions.  Here is a summary of how POSIX BREs and EREs
+differ from Emacs regular expressions.
+
+@itemize @bullet
+@item
+In POSIX BREs @samp{+} and @samp{?} are not special.
+The only backslash escape sequences are @samp{\(@dots{}\)},
+@samp{\@{@dots{}\@}}, @samp{\1} through @samp{\9}, along with the
+escaped special characters @samp{\$}, @samp{\*}, @samp{\.}, @samp{\[},
+@samp{\\}, and @samp{\^}.
+Therefore @samp{\(?:} acts like @samp{\([?]:}.
+POSIX does not define how other BRE escapes behave;
+for example, GNU @command{grep} treats @samp{\|} like Emacs does,
+but does not support all the Emacs escapes.
+
+@item
+In POSIX EREs @samp{@{}, @samp{(} and @samp{|} are special,
+and @samp{)} is special when matched with a preceding @samp{(}.
+These special characters do not use preceding backslashes;
+@samp{(?} produces undefined results.
+The only backslash escape sequences are the escaped special characters
+@samp{\$}, @samp{\(}, @samp{\)}, @samp{\*}, @samp{\+}, @samp{\.},
+@samp{\?}, @samp{\[}, @samp{\\}, @samp{\^}, @samp{\@{} and @samp{\|}.
+POSIX does not define how other ERE escapes behave;
+for example, GNU @samp{grep -E} treats @samp{\1} like Emacs does,
+but does not support all the Emacs escapes.
+
+@item
+In POSIX BREs, it is an implementation option whether @samp{^} is special
+after @samp{\(}; GNU @command{grep} treats it like Emacs does.
+In POSIX EREs, @samp{^} is always special outside of character alternatives,
+which means the ERE @samp{x^} never matches.
+In Emacs regular expressions, @samp{^} is special only at the
+beginning of the regular expression, or after @samp{\(}, @samp{\(?:}
+or @samp{\|}.
+
+@item
+In POSIX BREs, it is an implementation option whether @samp{$} is special
+before @samp{\)}; GNU @command{grep} treats it like Emacs does.
+In POSIX EREs, @samp{$} is always special outside of character alternatives,
+which means the ERE @samp{$x} never matches.
+In Emacs regular expressions, @samp{$} is special only at the
+end of the regular expression, or before @samp{\)} or @samp{\|}.
+
+@item
+In POSIX BREs and EREs, undefined results are produced by repetition
+operators at the start of a regular expression or subexpression
+(possibly preceded by @samp{^}), except that the repetition operator
+@samp{*} has the same behavior in BREs as in Emacs.
+In Emacs, these operators are treated as ordinary.
+
+@item
+In BREs and EREs, undefined results are produced by two repetition
+operators in sequence.  In Emacs, these have well-defined behavior,
+e.g., @samp{a**} is equivalent to @samp{a*}.
+
+@item
+In BREs and EREs, undefined results are produced by empty regular
+expressions or subexpressions.  In Emacs these have well-defined
+behavior, e.g., @samp{\(\)*} matches the empty string.
+
+@item
+In BREs and EREs, undefined results are produced for the named
+character classes @samp{[:ascii:]}, @samp{[:multibyte:]},
+@samp{[:nonascii:]}, @samp{[:unibyte:]}, and @samp{[:word:]}.
+
+@item
+BRE and ERE alternatives can contain collating symbols and equivalence
+class expressions, e.g., @samp{[[.ch.]d[=a=]]}.
+Emacs regular expressions do not support this.
+
+@item
+BREs, EREs, and the strings they match cannot contain encoding errors
+or NUL bytes.  In Emacs these constructs simply match themselves.
+
+@item
+BRE and ERE searching always finds the longest match.
+Emacs searching by default does not necessarily do so.
+@xref{Longest Match}.
+@end itemize
-- 
2.39.2
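
Several of the Emacs-side claims in the list above can be checked with `string-match`. A small sketch with arbitrary sample strings; it exercises only the Emacs behaviour, not the POSIX side:

```elisp
;; Two repetition operators in sequence: a** behaves like a*.
(string-match "a**" "aaa")          ; => 0
(match-end 0)                       ; => 3

;; An empty group under repetition matches the empty string.
(string-match "\\(\\)*" "abc")      ; => 0
(match-end 0)                       ; => 0

;; `^' and `$' away from their special positions are ordinary characters.
(string-match "x^" "ax^")           ; => 1
(string-match "$x" "a$x")           ; => 1
```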


[-- Attachment #4: 0003-Call-them-bracket-expressions-more-consistently.patch --]
[-- Type: text/x-patch, Size: 17021 bytes --]

From 94d8eeeff4ae99cb12718dab7cf7fdc52de77b6e Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Mon, 19 Jun 2023 11:09:00 -0700
Subject: [PATCH 3/3] Call them “bracket expressions” more consistently
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Emacs comments and doc were inconsistent about the name used for
regexps like [a-z].  Sometimes it called them “character
alternatives”, sometimes “character sets”, sometimes “bracket
expressions”.  Prefer “bracket expressions” as it is less confusing:
POSIX and most other programs’ doc uses “bracket expressions”,
“alternative” is also used in the Emacs documentation to talk about
...\|... in regexps, and “character set” normally has a different
meaning in Emacs.
---
 doc/emacs/search.texi        | 12 +++---
 doc/lispref/searching.texi   | 74 ++++++++++++++++++------------------
 lisp/emacs-lisp/lisp-mode.el |  2 +-
 lisp/textmodes/picture.el    |  2 +-
 4 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/doc/emacs/search.texi b/doc/emacs/search.texi
index 45378d95f65..2a816221235 100644
--- a/doc/emacs/search.texi
+++ b/doc/emacs/search.texi
@@ -950,8 +950,8 @@ Regexps
 @dfn{special constructs} and the rest are @dfn{ordinary}.  An ordinary
 character matches that same character and nothing else.  The special
 characters are @samp{$^.*+?[\}.  The character @samp{]} is special if
-it ends a character alternative (see below).  The character @samp{-}
-is special inside a character alternative.  Any other character
+it ends a bracket expression (see below).  The character @samp{-}
+is special inside a bracket expression.  Any other character
 appearing in a regular expression is ordinary, unless a @samp{\}
 precedes it.  (When you use regular expressions in a Lisp program,
 each @samp{\} must be doubled, see the example near the end of this
@@ -1033,11 +1033,11 @@ Regexps
 a newline, it matches the whole string.  Since it @emph{can} match
 starting at the first @samp{a}, it does.
 
+@cindex bracket expression
 @cindex set of alternative characters, in regular expressions
 @cindex character set, in regular expressions
 @item @kbd{[ @dots{} ]}
-is a @dfn{set of alternative characters}, or a @dfn{character set},
-beginning with @samp{[} and terminated by @samp{]}.
+is a @dfn{bracket expression}, which matches one of a set of characters.
 
 In the simplest case, the characters between the two brackets are what
 this set can match.  Thus, @samp{[ad]} matches either one @samp{a} or
@@ -1057,7 +1057,7 @@ Regexps
 @cindex character classes, in regular expressions
 You can also include certain special @dfn{character classes} in a
 character set.  A @samp{[:} and balancing @samp{:]} enclose a
-character class inside a set of alternative characters.  For instance,
+character class inside a bracket expression.  For instance,
 @samp{[[:alnum:]]} matches any letter or digit.  @xref{Char Classes,,,
 elisp, The Emacs Lisp Reference Manual}, for a list of character
 classes.
@@ -1125,7 +1125,7 @@ Regexps
 to depend on this behavior; it is better to quote the special character anyway,
 regardless of where it appears.
 
-As a @samp{\} is not special inside a set of alternative characters, it can
+As a @samp{\} is not special inside a bracket expression, it can
 never remove the special meaning of @samp{-}, @samp{^} or @samp{]}.
 You should not quote these characters when they have no special
 meaning.  This would not clarify anything, since backslashes
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 608abae762c..28230cea643 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -278,10 +278,10 @@ Syntax of Regexps
 and nothing else.  The special characters are @samp{.}, @samp{*},
 @samp{+}, @samp{?}, @samp{[}, @samp{^}, @samp{$}, and @samp{\}; no new
 special characters will be defined in the future.  The character
-@samp{]} is special if it ends a character alternative (see later).
-The character @samp{-} is special inside a character alternative.  A
+@samp{]} is special if it ends a bracket expression (see later).
+The character @samp{-} is special inside a bracket expression.  A
 @samp{[:} and balancing @samp{:]} enclose a character class inside a
-character alternative.  Any other character appearing in a regular
+bracket expression.  Any other character appearing in a regular
 expression is ordinary, unless a @samp{\} precedes it.
 
   For example, @samp{f} is not a special character, so it is ordinary, and
@@ -374,19 +374,19 @@ Regexp Special
 permits the whole expression to match is @samp{d}.)
 
 @item @samp{[ @dots{} ]}
-@cindex character alternative (in regexp)
+@cindex bracket expression (in regexp)
 @cindex @samp{[} in regexp
 @cindex @samp{]} in regexp
-is a @dfn{character alternative}, which begins with @samp{[} and is
+is a @dfn{bracket expression}, which begins with @samp{[} and is
 terminated by @samp{]}.  In the simplest case, the characters between
-the two brackets are what this character alternative can match.
+the two brackets are what this bracket expression can match.
 
 Thus, @samp{[ad]} matches either one @samp{a} or one @samp{d}, and
 @samp{[ad]*} matches any string composed of just @samp{a}s and @samp{d}s
 (including the empty string).  It follows that @samp{c[ad]*r}
 matches @samp{cr}, @samp{car}, @samp{cdr}, @samp{caddaar}, etc.
 
-You can also include character ranges in a character alternative, by
+You can also include character ranges in a bracket expression, by
 writing the starting and ending characters with a @samp{-} between them.
 Thus, @samp{[a-z]} matches any lower-case @acronym{ASCII} letter.
 Ranges may be intermixed freely with individual characters, as in
@@ -395,7 +395,7 @@ Regexp Special
 range should not be the starting point of another one; for example,
 @samp{[a-m-z]} should be avoided.
 
-A character alternative can also specify named character classes
+A bracket expression can also specify named character classes
 (@pxref{Char Classes}).  For example, @samp{[[:ascii:]]} matches any
 @acronym{ASCII} character.  Using a character class is equivalent to
 mentioning each of the characters in that class; but the latter is not
@@ -404,9 +404,9 @@ Regexp Special
 lower or upper bound of a range.
 
 The usual regexp special characters are not special inside a
-character alternative.  A completely different set of characters is
+bracket expression.  A completely different set of characters is
 special: @samp{]}, @samp{-} and @samp{^}.
-To include @samp{]} in a character alternative, put it at the
+To include @samp{]} in a bracket expression, put it at the
 beginning.  To include @samp{^}, put it anywhere but at the beginning.
 To include @samp{-}, put it at the end.  Thus, @samp{[]^-]} matches
 all three of these special characters.  You cannot use @samp{\} to
@@ -444,7 +444,7 @@ Regexp Special
 feature is intended for searching text in unibyte buffers and strings.
 @end enumerate
 
-Some kinds of character alternatives are not the best style even
+Some kinds of bracket expressions are not the best style even
 though they have a well-defined meaning in Emacs.  They include:
 
 @enumerate
@@ -458,7 +458,7 @@ Regexp Special
 @samp{[ก-ฺ฿-๛]} is less clear than @samp{[\u0E01-\u0E3A\u0E3F-\u0E5B]}.
 
 @item
-Although a character alternative can include duplicates, it is better
+Although a bracket expression can include duplicates, it is better
 style to avoid them.  For example, @samp{[XYa-yYb-zX]} is less clear
 than @samp{[XYa-z]}.
 
@@ -469,30 +469,30 @@ Regexp Special
 than @samp{[ij]}, and @samp{[i-k]} is less clear than @samp{[ijk]}.
 
 @item
-Although a @samp{-} can appear at the beginning of a character
-alternative or as the upper bound of a range, it is better style to
-put @samp{-} by itself at the end of a character alternative.  For
+Although a @samp{-} can appear at the beginning of a bracket
+expression or as the upper bound of a range, it is better style to
+put @samp{-} by itself at the end of a bracket expression.  For
 example, although @samp{[-a-z]} is valid, @samp{[a-z-]} is better
 style; and although @samp{[*--]} is valid, @samp{[*+,-]} is clearer.
 @end enumerate
 
 @item @samp{[^ @dots{} ]}
 @cindex @samp{^} in regexp
-@samp{[^} begins a @dfn{complemented character alternative}.  This
+@samp{[^} begins a @dfn{complemented bracket expression}.  This
 matches any character except the ones specified.  Thus,
 @samp{[^a-z0-9A-Z]} matches all characters @emph{except} ASCII letters and
 digits.
 
-@samp{^} is not special in a character alternative unless it is the first
+@samp{^} is not special in a bracket expression unless it is the first
 character.  The character following the @samp{^} is treated as if it
 were first (in other words, @samp{-} and @samp{]} are not special there).
 
-A complemented character alternative can match a newline, unless newline is
+A complemented bracket expression can match a newline, unless newline is
 mentioned as one of the characters not to match.  This is in contrast to
 the handling of regexps in programs such as @code{grep}.
 
-You can specify named character classes, just like in character
-alternatives.  For instance, @samp{[^[:ascii:]]} matches any
+You can specify named character classes, just like in bracket
+expressions.  For instance, @samp{[^[:ascii:]]} matches any
 non-@acronym{ASCII} character.  @xref{Char Classes}.
 
 @item @samp{^}
@@ -556,7 +556,7 @@ Regexp Special
 For example, it is unwise to use @samp{\b*}, which can be omitted
 without changing the documented meaning of the regular expression.
 
-As a @samp{\} is not special inside a character alternative, it can
+As a @samp{\} is not special inside a bracket expression, it can
 never remove the special meaning of @samp{-}, @samp{^} or @samp{]}.
 You should not quote these characters when they have no special
 meaning.  This would not clarify anything, since backslashes
@@ -565,23 +565,23 @@ Regexp Special
 syntax), which matches any single character except a backslash.
 
 In practice, most @samp{]} that occur in regular expressions close a
-character alternative and hence are special.  However, occasionally a
+bracket expression and hence are special.  However, occasionally a
 regular expression may try to match a complex pattern of literal
 @samp{[} and @samp{]}.  In such situations, it sometimes may be
 necessary to carefully parse the regexp from the start to determine
-which square brackets enclose a character alternative.  For example,
-@samp{[^][]]} consists of the complemented character alternative
+which square brackets enclose a bracket expression.  For example,
+@samp{[^][]]} consists of the complemented bracket expression
 @samp{[^][]} (which matches any single character that is not a square
 bracket), followed by a literal @samp{]}.
 
 The exact rules are that at the beginning of a regexp, @samp{[} is
 special and @samp{]} not.  This lasts until the first unquoted
-@samp{[}, after which we are in a character alternative; @samp{[} is
+@samp{[}, after which we are in a bracket expression; @samp{[} is
 no longer special (except when it starts a character class) but @samp{]}
 is special, unless it immediately follows the special @samp{[} or that
 @samp{[} followed by a @samp{^}.  This lasts until the next special
-@samp{]} that does not end a character class.  This ends the character
-alternative and restores the ordinary syntax of regular expressions;
+@samp{]} that does not end a character class.  This ends the bracket
+expression and restores the ordinary syntax of regular expressions;
 an unquoted @samp{[} is special again and a @samp{]} not.
 
 @node Char Classes
@@ -592,8 +592,8 @@ Char Classes
 @cindex alpha character class, regexp
 @cindex xdigit character class, regexp
 
-  Below is a table of the classes you can use in a character
-alternative, and what they mean.  Note that the @samp{[} and @samp{]}
+  Below is a table of the classes you can use in a bracket
+expression, and what they mean.  Note that the @samp{[} and @samp{]}
 characters that enclose the class name are part of the name, so a
 regular expression using these classes needs one more pair of
 brackets.  For example, a regular expression matching a sequence of
@@ -920,7 +920,7 @@ Regexp Backslash
 
 @kindex invalid-regexp
   Not every string is a valid regular expression.  For example, a string
-that ends inside a character alternative without a terminating @samp{]}
+that ends inside a bracket expression without a terminating @samp{]}
 is invalid, and so is a string that ends with a single @samp{\}.  If
 an invalid regular expression is passed to any of the search functions,
 an @code{invalid-regexp} error is signaled.
@@ -957,7 +957,7 @@ Regexp Example
 
 @table @code
 @item [.?!]
-The first part of the pattern is a character alternative that matches
+The first part of the pattern is a bracket expression that matches
 any one of three characters: period, question mark, and exclamation
 mark.  The match must begin with one of these three characters.  (This
 is one point where the new default regexp used by Emacs differs from
@@ -969,7 +969,7 @@ Regexp Example
 marks, zero or more of them, that may follow the period, question mark
 or exclamation mark.  The @code{\"} is Lisp syntax for a double-quote in
 a string.  The @samp{*} at the end indicates that the immediately
-preceding regular expression (a character alternative, in this case) may be
+preceding regular expression (a bracket expression, in this case) may be
 repeated zero or more times.
 
 @item \\($\\|@ $\\|\t\\|@ @ \\)
@@ -1920,7 +1920,7 @@ Regexp Problems
 causing a match to fail early.
 
 @item
-Avoid or-patterns in favor of character alternatives: write
+Avoid or-patterns in favor of bracket expressions: write
 @samp{[ab]} instead of @samp{a\|b}.  Recall that @samp{\s-} and @samp{\sw}
 are equivalent to @samp{[[:space:]]} and @samp{[[:word:]]}, respectively.
 
@@ -3012,7 +3012,7 @@ POSIX Regexps
 @item
 In POSIX BREs, it is an implementation option whether @samp{^} is special
 after @samp{\(}; GNU @command{grep} treats it like Emacs does.
-In POSIX EREs, @samp{^} is always special outside of character alternatives,
+In POSIX EREs, @samp{^} is always special outside of bracket expressions,
 which means the ERE @samp{x^} never matches.
 In Emacs regular expressions, @samp{^} is special only at the
 beginning of the regular expression, or after @samp{\(}, @samp{\(?:}
@@ -3021,7 +3021,7 @@ POSIX Regexps
 @item
 In POSIX BREs, it is an implementation option whether @samp{$} is special
 before @samp{\)}; GNU @command{grep} treats it like Emacs does.
-In POSIX EREs, @samp{$} is always special outside of character alternatives,
+In POSIX EREs, @samp{$} is always special outside of bracket expressions,
 which means the ERE @samp{$x} never matches.
 In Emacs regular expressions, @samp{$} is special only at the
 end of the regular expression, or before @samp{\)} or @samp{\|}.
@@ -3049,8 +3049,8 @@ POSIX Regexps
 @samp{[:nonascii:]}, @samp{[:unibyte:]}, and @samp{[:word:]}.
 
 @item
-BRE and ERE alternatives can contain collating symbols and equivalence
-class expressions, e.g., @samp{[[.ch.]d[=a=]]}.
+BREs and EREs can contain collating symbols and equivalence
+class expressions within bracket expressions, e.g., @samp{[[.ch.]d[=a=]]}.
 Emacs regular expressions do not support this.
 
 @item
diff --git a/lisp/emacs-lisp/lisp-mode.el b/lisp/emacs-lisp/lisp-mode.el
index 9914ededb85..1990630608d 100644
--- a/lisp/emacs-lisp/lisp-mode.el
+++ b/lisp/emacs-lisp/lisp-mode.el
@@ -1453,7 +1453,7 @@ lisp-fill-paragraph
       ;; are buffer-local, but we avoid changing them so that they can be set
       ;; to make `forward-paragraph' and friends do something the user wants.
       ;;
-      ;; `paragraph-start': The `(' in the character alternative and the
+      ;; `paragraph-start': The `(' in the bracket expression and the
       ;; left-singlequote plus `(' sequence after the \\| alternative prevent
       ;; sexps and backquoted sexps that follow a docstring from being filled
       ;; with the docstring.  This setting has the consequence of inhibiting
diff --git a/lisp/textmodes/picture.el b/lisp/textmodes/picture.el
index 9aa9b72c513..f98c3963b6f 100644
--- a/lisp/textmodes/picture.el
+++ b/lisp/textmodes/picture.el
@@ -383,7 +383,7 @@ picture-tab-chars
 The syntax for this variable is like the syntax used inside of `[...]'
 in a regular expression--but without the `[' and the `]'.
 It is NOT a regular expression, and should follow the usual
-rules for the contents of a character alternative.
+rules for the contents of a bracket expression.
 It defines a set of \"interesting characters\" to look for when setting
 \(or searching for) tab stops, initially \"!-~\" (all printing characters).
 For example, suppose that you are editing a table which is formatted thus:
-- 
2.39.2



* bug#64128: regexp parser zero-width assertion bugs
  2023-06-19 12:54               ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-06-19 18:34                 ` Mattias Engdegård
  2023-06-19 19:21                   ` Paul Eggert
  0 siblings, 1 reply; 18+ messages in thread
From: Mattias Engdegård @ 2023-06-19 18:34 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Eli Zaretskii, Paul Eggert, 64128

[-- Attachment #1: Type: text/plain, Size: 717 bytes --]

19 juni 2023 kl. 14.54 skrev Stefan Monnier <monnier@iro.umontreal.ca>:
> 
> I wish there was a way to emit warnings about oddball constructs
> (starting with the "* is literal when encountered at the beginning of
> a regexp").

I agree, but I'm more of a static analysis man. (And relint does complain about all these cases as long as the regexp is detected as such, so there probably aren't many of them left in the Emacs tree.)

Here is a reduced patch that only fixes the really silly behaviour reported earlier, by making sure that `laststart` is reset correctly for all group A assertions. This should be uncontroversial.
Maybe we should change group B assertions so that they work in the same way.


[-- Attachment #2: regexp-zero-width-assertion-noquack.diff --]
[-- Type: application/octet-stream, Size: 2669 bytes --]

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index fea34df991b..f2da1a2d0db 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -1716,7 +1716,9 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
   /* Address of start of the most recently finished expression.
      This tells, e.g., postfix * where to find the start of its
-     operand.  Reset at the beginning of groups and alternatives.  */
+     operand.  Reset at the beginning of groups and alternatives,
+     and after zero-width assertions which should not be the target
+     of any postfix repetition operators.  */
   unsigned char *laststart = 0;
 
   /* Address of beginning of regexp, or inside of last group.  */
@@ -1847,12 +1849,14 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 	case '^':
 	  if (! (p == pattern + 1 || at_begline_loc_p (pattern, p)))
 	    goto normal_char;
+	  laststart = 0;
 	  BUF_PUSH (begline);
 	  break;
 
 	case '$':
 	  if (! (p == pend || at_endline_loc_p (p, pend)))
 	    goto normal_char;
+	  laststart = 0;
 	  BUF_PUSH (endline);
 	  break;
 
@@ -1892,7 +1896,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 	    /* Star, etc. applied to an empty pattern is equivalent
 	       to an empty pattern.  */
-	    if (!laststart || laststart == b)
+	    if (laststart == b)
 	      break;
 
 	    /* Now we know whether or not zero matches is allowed
@@ -2544,18 +2548,22 @@ regex_compile (re_char *pattern, ptrdiff_t size,
               break;
 
 	    case 'b':
+	      laststart = 0;
 	      BUF_PUSH (wordbound);
 	      break;
 
 	    case 'B':
+	      laststart = 0;
 	      BUF_PUSH (notwordbound);
 	      break;
 
 	    case '`':
+	      laststart = 0;
 	      BUF_PUSH (begbuf);
 	      break;
 
 	    case '\'':
+	      laststart = 0;
 	      BUF_PUSH (endbuf);
 	      break;
 
diff --git a/test/src/regex-emacs-tests.el b/test/src/regex-emacs-tests.el
index 52d43775b8e..48a487ffe15 100644
--- a/test/src/regex-emacs-tests.el
+++ b/test/src/regex-emacs-tests.el
@@ -883,4 +883,14 @@ regexp-tests-backtrack-optimization
     (should (looking-at "x*\\(=\\|:\\)*"))
     (should (looking-at "x*=*?"))))
 
+(ert-deftest regexp-tests-zero-width-assertion-repetition ()
+  ;; Check compatibility behaviour with repetition operators after
+  ;; certain zero-width assertions (bug#64128).
+  (should (equal (string-match "^*a" "*a") 0))
+  (should (equal (string-match "\\`*a" "*a") 0))
+  (should (equal (string-match "q\\b*!" "q*!") 0))
+  (should (equal (string-match "q\\b*!" "!") nil))
+  (should (equal (string-match "/\\B*z" "/*z") 0))
+  (should (equal (string-match "/\\B*z" "z") nil)))
+
 ;;; regex-emacs-tests.el ends here

[-- Attachment #3: Type: text/plain, Size: 3 bytes --]






* bug#64128: regexp parser zero-width assertion bugs
  2023-06-19 18:34                 ` Mattias Engdegård
@ 2023-06-19 19:21                   ` Paul Eggert
  2023-06-19 19:52                     ` Mattias Engdegård
  0 siblings, 1 reply; 18+ messages in thread
From: Paul Eggert @ 2023-06-19 19:21 UTC (permalink / raw)
  To: Mattias Engdegård, Stefan Monnier; +Cc: Eli Zaretskii, 64128

On 2023-06-19 11:34, Mattias Engdegård wrote:
> Here is a reduced patch that only fixes the really silly behaviour reported earlier, by making sure that `laststart` is reset correctly for all group A assertions. This should be uncontroversial.
> Maybe we should change group B assertions so that they work in the same way.

> -     operand.  Reset at the beginning of groups and alternatives.  */
> +     operand.  Reset at the beginning of groups and alternatives,
> +     and after zero-width assertions which should not be the target
> +     of any postfix repetition operators.  */

If I understand things correctly, this would cause "\b*c" to be treated 
like "\b\*c". If so, it's headed in the wrong direction.

It's long been documented that the only reason "*" is ordinary at the 
start of a regular expression or subexpression is "historical 
compatibility", and it's also long been documented that you shouldn't 
take advantage of this and you should backslash-escape the "*" anyway. 
In contrast, for constructs like \b* there is not a historical 
compatibility reason, so there's not a good argument for treating "*" as 
an ordinary character after "\b".

Instead, \b should not be a special case before "*", and \b* should be 
equivalent to \(\b\)* and should match only the empty string. Similarly 
for the other zero-width backslash escapes. This is what I would expect 
from these constructs from the longstanding documentation.

If we instead added a rule to say that a construct that can only match 
the empty string causes a following "*" to be ordinary, then \b* and \(\b\)* 
would both be equivalent to \*. Although consistent, this would be 
confusing: it would compound the historical-compatibility mistake. Let's 
keep things simple instead.

Also, whatever change we make to the behavior should be documented in 
the manual and in etc/NEWS.


* bug#64128: regexp parser zero-width assertion bugs
  2023-06-19 19:21                   ` Paul Eggert
@ 2023-06-19 19:52                     ` Mattias Engdegård
  2023-06-19 20:08                       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-06-19 20:40                       ` Paul Eggert
  0 siblings, 2 replies; 18+ messages in thread
From: Mattias Engdegård @ 2023-06-19 19:52 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eli Zaretskii, Stefan Monnier, 64128

19 juni 2023 kl. 21.21 skrev Paul Eggert <eggert@cs.ucla.edu>:

> If I understand things correctly, this would cause "\b*c" to be treated like "\b\*c".

Actually it already works that way. What the patch does is prevent AB\b*C from being treated as \(?:AB\b\)*C; it is treated as AB\b\*C instead, which I think we can all agree is less wrong.

You can check the test cases in the patch:

  (should (equal (string-match "q\\b*!" "q*!") 0))
  (should (equal (string-match "q\\b*!" "!") nil))

which in current Emacs produce 2 and 0 respectively.
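
For readers following the parser logic, here is a toy C model of the `laststart` bookkeeping that the reduced patch touches. It is only a sketch under simplifying assumptions, not the code in src/regex-emacs.c: the name `star_operand_start` and its flag are hypothetical, only ordinary characters and the assertions \b, \B, \` and \' are modelled, and any other backslash sequence is treated as a single ordinary element.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model, NOT the code in src/regex-emacs.c.
   Scan PAT up to its first unescaped '*' and return the index where
   the operand of that '*' starts, -1 if the '*' has no operand (so it
   would be taken literally), or -2 if PAT has no unescaped '*'.
   If RESET_AFTER_ASSERTION is nonzero, the operand start is cleared
   after \b, \B, \` and \' (the idea of the reduced patch); otherwise
   it keeps whatever value it had before the assertion.  */
int
star_operand_start (const char *pat, int reset_after_assertion)
{
  int laststart = -1;   /* start of the most recent operand, -1 if none */
  int in_run = 0;       /* nonzero while extending a run of literals */

  assert (pat != NULL);

  for (size_t i = 0; pat[i] != '\0'; )
    {
      char c = pat[i];

      if (c == '*')
        return laststart;

      if (c == '\\' && pat[i + 1] != '\0')
        {
          if (strchr ("bB`'", pat[i + 1]) != NULL)
            {
              /* Zero-width assertion: optionally forget the operand.  */
              if (reset_after_assertion)
                laststart = -1;
            }
          else
            /* Assumption: any other escape is one ordinary element.  */
            laststart = (int) i;
          in_run = 0;
          i += 2;
          continue;
        }

      /* Ordinary character: extend the current literal run, unless
         there is none or a '*' follows, in which case this character
         starts a new operand of its own.  */
      if (!in_run || pat[i + 1] == '*')
        laststart = (int) i;
      in_run = 1;
      i++;
    }

  return -2;
}
```

In this model, star_operand_start ("xy\\b*", 0) returns 0, meaning the star would cover everything from the `x` onwards, while star_operand_start ("xy\\b*", 1) returns -1, meaning the star has no operand and becomes literal.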

> It's long been documented that the only reason "*" is ordinary at the start of a regular expression or subexpression is "historical compatibility", and it's also long been documented that you shouldn't take advantage of this and you should backslash-escape the "*" anyway. In contrast, for constructs like \b* there is not a historical compatibility reason, so there's not a good argument for treating "*" as an ordinary character after "\b".

Sure, we can turn \b and \B into group B assertions, but the patch was more conservative in nature.
We also have \` to consider -- I think we have to preserve \`* meaning \`\* for compatibility, historical or not, because it's something we keep sighting in the wild.

> Instead, \b should not be a special case before "*", and \b* should be equivalent to \(\b\)* and should match only the empty string. Similarly for the other zero-width backslash escapes. This is what I would expect from these constructs from the longstanding documentation.
> 
> If we instead added a rule to say that a construct that can only match the empty string causes a following "*" to be ordinary, then \b* and \(\b\)* would both be equivalent to \*. Although consistent, this would be confusing: it would compound the historical-compatibility mistake. Let's keep things simple instead.

Yes, I definitely would be confused by such semantics.

> Also, whatever change we make to the behavior should be documented in the manual and in etc/NEWS.

Will be happy to oblige, although in this case it really just was a bug fix.

What I really would like to see is the regexp parser somehow separated from the NFA bytecode generator, which would make both clearer. The parser could then be re-used for other purposes such as a different back-end (DFA construction) or a built-in xr-like converter.


* bug#64128: regexp parser zero-width assertion bugs
  2023-06-19 19:52                     ` Mattias Engdegård
@ 2023-06-19 20:08                       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2023-06-20 11:36                         ` Mattias Engdegård
  2023-06-19 20:40                       ` Paul Eggert
  1 sibling, 1 reply; 18+ messages in thread
From: Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2023-06-19 20:08 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Eli Zaretskii, Paul Eggert, 64128

>> If I understand things correctly, this would cause "\b*c" to be treated like "\b\*c".
> Actually it already works that way. What the patch does, is preventing
> AB\b*C from being treated as \(?:AB\b\)*C but as AB\b\*C instead, which
> I think we can all agree is less wrong.

Hmm... maybe it's less wrong, but I'd rather make it behave like
AB\(\b\)*C, which is, I'd argue, even less wrong.

Or maybe make it signal an error: I can't imagine that the current
behavior is used by very much code at all, seeing how it's so
seriously non-intuitive.


        Stefan







* bug#64128: regexp parser zero-width assertion bugs
  2023-06-19 19:52                     ` Mattias Engdegård
  2023-06-19 20:08                       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-06-19 20:40                       ` Paul Eggert
  1 sibling, 0 replies; 18+ messages in thread
From: Paul Eggert @ 2023-06-19 20:40 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Eli Zaretskii, Stefan Monnier, 64128

[-- Attachment #1: Type: text/plain, Size: 1000 bytes --]

On 2023-06-19 12:52, Mattias Engdegård wrote:

> Sure, we can turn \b and \B into group B assertions, but the patch was more conservative in nature.

OK, but we still need to fix this, as \b and \B should not be a special 
case for following "*".

> I think we have to preserve \`* meaning \`\* for compatibility, historical or not, because it's something we keep sighting in the wild.

That makes some sense, in that \` is like ^, and ^ is already a special 
case (this is true even in POSIX BREs).

In other words, how about if we change the groups from your list:

Group A: ^ $ \` \' \b \B
Group B: \< \> \_< \_> \=

to this:

Group A: ^ \`
Group B: $ \' \b \B \< \> \_< \_> \=

where "*" is ordinary after Group A, and special after Group B and there 
is no other squirrelly behavior. And similarly for the other repetition 
operators.
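
The proposed grouping can be written down as a small lookup, sketched here in C. This is purely illustrative and not Emacs code: the function name `operator_ordinary_after` is hypothetical, and the sketch ignores the positional rules that decide when ^ and $ are assertions at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper illustrating the proposed grouping; NOT Emacs code.
   ASSERTION is a zero-width assertion as written in a regexp, e.g. "^"
   or "\\b".  Return 1 if a following repetition operator would be
   ordinary (Group A), 0 if it would be special (Group B), and -1 if
   ASSERTION is not one of the assertions listed in the proposal.  */
int
operator_ordinary_after (const char *assertion)
{
  static const char *const group_a[] = { "^", "\\`" };
  static const char *const group_b[] =
    { "$", "\\'", "\\b", "\\B", "\\<", "\\>", "\\_<", "\\_>", "\\=" };

  assert (assertion != NULL);

  for (size_t i = 0; i < sizeof group_a / sizeof group_a[0]; i++)
    if (strcmp (assertion, group_a[i]) == 0)
      return 1;

  for (size_t i = 0; i < sizeof group_b / sizeof group_b[0]; i++)
    if (strcmp (assertion, group_b[i]) == 0)
      return 0;

  return -1;
}
```

Under this table only ^ and \` leave a following "*", "+" or "?" ordinary; every other listed assertion becomes the operand of the operator.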

Attached is a proposed doc change for this, which I have not installed. 
Of course the code and etc/NEWS would need changing too.

[-- Attachment #2: 0001-Document-proposed-regex-fix-bug-64128.patch --]
[-- Type: text/x-patch, Size: 1609 bytes --]

From 18f6e0c85a7313d221da868e6bf55af32828112b Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Mon, 19 Jun 2023 13:35:48 -0700
Subject: [PATCH] Document proposed regex fix (bug#64128)

* doc/lispref/searching.texi (Regexp Special):
Say that repetition operators are not special after \`,
and that they work as expected after other backslash escapes.
---
 doc/lispref/searching.texi | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 28230cea64..7c9893054d 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -546,15 +546,11 @@ Regexp Special
 
 For historical compatibility, a repetition operator is treated as ordinary
 if it appears at the start of a regular expression
-or after @samp{^}, @samp{\(}, @samp{\(?:} or @samp{\|}.
+or after @samp{^}, @samp{\`}, @samp{\(}, @samp{\(?:} or @samp{\|}.
 For example, @samp{*foo} is treated as @samp{\*foo}, and
 @samp{two\|^\@{2\@}} is treated as @samp{two\|^@{2@}}.
 It is poor practice to depend on this behavior; use proper backslash
 escaping anyway, regardless of where the repetition operator appears.
-Also, a repetition operator should not immediately follow a backslash escape
-that matches only empty strings, as Emacs has bugs in this area.
-For example, it is unwise to use @samp{\b*}, which can be omitted
-without changing the documented meaning of the regular expression.
 
 As a @samp{\} is not special inside a bracket expression, it can
 never remove the special meaning of @samp{-}, @samp{^} or @samp{]}.
-- 
2.39.2



* bug#64128: regexp parser zero-width assertion bugs
  2023-06-19 20:08                       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2023-06-20 11:36                         ` Mattias Engdegård
  2023-06-21  6:08                           ` Paul Eggert
  0 siblings, 1 reply; 18+ messages in thread
From: Mattias Engdegård @ 2023-06-20 11:36 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Eli Zaretskii, Paul Eggert, 64128

[-- Attachment #1: Type: text/plain, Size: 1509 bytes --]

19 juni 2023 kl. 22.08 skrev Stefan Monnier <monnier@iro.umontreal.ca>:

> Hmm... maybe it's less wrong, but I'd rather make it behave like
> AB\(\b\)*C, which is, I'd argue, even less wrong.

I agree, and you are probably right that it's safe to do that.

> Or maybe make it signal an error: I can't imagine that the current
> behavior is used by very much code at all, seeing how it's so
> seriously non-intuitive.

That might be even better if we can get away with it.

19 juni 2023 kl. 22.40 skrev Paul Eggert <eggert@cs.ucla.edu>:

> In other words, how about if we change the groups from your list:
> 
> Group A: ^ $ \` \' \b \B
> Group B: \< \> \_< \_> \=
> 
> to this:
> 
> Group A: ^ \`
> Group B: $ \' \b \B \< \> \_< \_> \=
> 
> where "*" is ordinary after Group A, and special after Group B and there is no other squirrelly behavior. And similarly for the other repetition operators.

Sounds fine, with the option to go full error on group B if we agree that that's even better.
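
To make the proposed regrouping concrete, here is a minimal sketch in C of the rule as a lookup table. The names op_treatment and op_after_assertion are hypothetical and do not exist in regex-emacs.c; the real parser encodes the same decision through `laststart` rather than a table. The sketch also assumes that ^ and $ appear in a position where they act as assertions at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* How a repetition operator (*, +, ?) is treated right after a
   zero-width assertion, under the regrouping proposed above.  */
enum op_treatment
{
  OP_LITERAL,   /* Group A: the operator character matches itself.  */
  OP_SPECIAL,   /* Group B: the operator repeats the assertion alone.  */
  OP_UNKNOWN    /* Not one of the assertions listed in this thread.  */
};

/* Hypothetical helper, not part of regex-emacs.c: classify ASSERTION,
   written in regexp syntax (so "\\b" in C source stands for \b).  */
enum op_treatment
op_after_assertion (const char *assertion)
{
  static const char *const group_a[] = { "^", "\\`" };
  static const char *const group_b[] = {
    "$", "\\'", "\\b", "\\B", "\\<", "\\>", "\\_<", "\\_>", "\\="
  };

  assert (assertion != NULL);

  for (size_t i = 0; i < sizeof group_a / sizeof group_a[0]; i++)
    if (strcmp (assertion, group_a[i]) == 0)
      return OP_LITERAL;

  for (size_t i = 0; i < sizeof group_b / sizeof group_b[0]; i++)
    if (strcmp (assertion, group_b[i]) == 0)
      return OP_SPECIAL;

  return OP_UNKNOWN;
}
```

With this table, op_after_assertion ("\\`") yields OP_LITERAL and op_after_assertion ("\\b") yields OP_SPECIAL. Turning group B into an error later would only change what callers do with OP_SPECIAL.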

> Attached is a proposed doc change for this, which I have not installed.

Thank you, it has been incorporated in the attached patch which follows your suggestions above.

Your previous regexp doc updates are most appreciated. I still think the whole chapter needs a reform from the sheer weight of organic growth over the years. In particular, the division between "regexp special" and "regexp backslash" is purely syntactical, not semantic, and groups things in the wrong way.



[-- Attachment #2: 0001-Straighten-regexp-postfix-operator-after-zero-width-.patch --]
[-- Type: application/octet-stream, Size: 8621 bytes --]

From ef54f07ca78b2eef5181deb4f28deab100be2a75 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 20 Jun 2023 12:12:50 +0200
Subject: [PATCH] Straighten regexp postfix operator after zero-width assertion
 parse

The zero-width assertions \` \' \b \B were parsed in a sloppy way so
that a following postfix repetition operator could yield surprising
results.  For instance, "\\b*" would act as "\\b\\*", and "xy\\b*"
would act as "\\(?:xy\\b\\)*".

Except for \` and ^, any following postfix operator now applies to the
zero-width assertion itself only, which is predictable and consistent with
other assertions, although useless in practice.
For historical compatibility, an operator character following \` and ^
always becomes a literal. (Bug#64128)

* src/regex-emacs.c (regex_compile):
Set `laststart` appropriately for each zero-width assertion instead
of leaving it with whatever value it had before.
* test/src/regex-emacs-tests.el
(regexp-tests-zero-width-assertion-repetition): New test.
* doc/lispref/searching.texi (Regexp Special):
Say that repetition operators are not special after \`,
and that they work as expected after other backslash escapes.
* etc/NEWS: Announce.
---
 doc/lispref/searching.texi    |  6 +---
 etc/NEWS                      |  8 +++++
 src/regex-emacs.c             | 15 +++++++--
 test/src/regex-emacs-tests.el | 63 +++++++++++++++++++++++++++++++++++
 4 files changed, 85 insertions(+), 7 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 28230cea643..7c9893054d9 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -546,15 +546,11 @@ Regexp Special
 
 For historical compatibility, a repetition operator is treated as ordinary
 if it appears at the start of a regular expression
-or after @samp{^}, @samp{\(}, @samp{\(?:} or @samp{\|}.
+or after @samp{^}, @samp{\`}, @samp{\(}, @samp{\(?:} or @samp{\|}.
 For example, @samp{*foo} is treated as @samp{\*foo}, and
 @samp{two\|^\@{2\@}} is treated as @samp{two\|^@{2@}}.
 It is poor practice to depend on this behavior; use proper backslash
 escaping anyway, regardless of where the repetition operator appears.
-Also, a repetition operator should not immediately follow a backslash escape
-that matches only empty strings, as Emacs has bugs in this area.
-For example, it is unwise to use @samp{\b*}, which can be omitted
-without changing the documented meaning of the regular expression.
 
 As a @samp{\} is not special inside a bracket expression, it can
 never remove the special meaning of @samp{-}, @samp{^} or @samp{]}.
diff --git a/etc/NEWS b/etc/NEWS
index 2170323e74a..faf1f73b143 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -470,6 +470,14 @@ symbol, and either that symbol is ':eval' and the second element of
 the list evaluates to 'nil' or the symbol's value as a variable is
 'nil' or void.
 
++++
+** Regexp zero-width assertions followed by operators are better defined.
+Previously, regexps such as "xy\\B*" would have ill-defined behaviour.
+Now any operator following a zero-width assertion applies to that
+assertion only (which is useless).  For historical compatibility, an
+operator character following '^' or '\`' becomes literal, but we
+advise against relying on this.
+
 \f
 * Lisp Changes in Emacs 30.1
 
diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index fea34df991b..b02554791ce 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -1716,7 +1716,8 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
   /* Address of start of the most recently finished expression.
      This tells, e.g., postfix * where to find the start of its
-     operand.  Reset at the beginning of groups and alternatives.  */
+     operand.  Reset at the beginning of groups and alternatives,
+     and after ^ and \` for dusty-deck compatibility.  */
   unsigned char *laststart = 0;
 
   /* Address of beginning of regexp, or inside of last group.  */
@@ -1847,12 +1848,16 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 	case '^':
 	  if (! (p == pattern + 1 || at_begline_loc_p (pattern, p)))
 	    goto normal_char;
+	  /* Special case for compatibility: postfix ops after ^ become
+	     literals.  */
+	  laststart = 0;
 	  BUF_PUSH (begline);
 	  break;
 
 	case '$':
 	  if (! (p == pend || at_endline_loc_p (p, pend)))
 	    goto normal_char;
+	  laststart = b;
 	  BUF_PUSH (endline);
 	  break;
 
@@ -1892,7 +1897,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 	    /* Star, etc. applied to an empty pattern is equivalent
 	       to an empty pattern.  */
-	    if (!laststart || laststart == b)
+	    if (laststart == b)
 	      break;
 
 	    /* Now we know whether or not zero matches is allowed
@@ -2544,18 +2549,24 @@ regex_compile (re_char *pattern, ptrdiff_t size,
               break;
 
 	    case 'b':
+	      laststart = b;
 	      BUF_PUSH (wordbound);
 	      break;
 
 	    case 'B':
+	      laststart = b;
 	      BUF_PUSH (notwordbound);
 	      break;
 
 	    case '`':
+	      /* Special case for compatibility: postfix ops after \` become
+		 literals, as for ^ (see above).  */
+	      laststart = 0;
 	      BUF_PUSH (begbuf);
 	      break;
 
 	    case '\'':
+	      laststart = b;
 	      BUF_PUSH (endbuf);
 	      break;
 
diff --git a/test/src/regex-emacs-tests.el b/test/src/regex-emacs-tests.el
index 52d43775b8e..e739e2b28a6 100644
--- a/test/src/regex-emacs-tests.el
+++ b/test/src/regex-emacs-tests.el
@@ -883,4 +883,67 @@ regexp-tests-backtrack-optimization
     (should (looking-at "x*\\(=\\|:\\)*"))
     (should (looking-at "x*=*?"))))
 
+(ert-deftest regexp-tests-zero-width-assertion-repetition ()
+  ;; Check compatibility behaviour with repetition operators after
+  ;; certain zero-width assertions (bug#64128).
+
+  ;; Postfix operators after ^ and \` become literals, for historical
+  ;; compatibility.  Only the first character of a lazy operator (like *?)
+  ;; becomes a literal.
+  (should (equal (string-match "^*a" "x\n*a") 2))
+  (should (equal (string-match "^*?a" "x\n*a") 2))
+  (should (equal (string-match "^*?a" "x\na") 2))
+  (should (equal (string-match "^*?a" "x\n**a") nil))
+
+  (should (equal (string-match "\\`*a" "*a") 0))
+  (should (equal (string-match "\\`*?a" "*a") 0))
+  (should (equal (string-match "\\`*?a" "a") 0))
+  (should (equal (string-match "\\`*?a" "**a") nil))
+
+  ;; Other zero-width assertions are treated as normal elements, so postfix
+  ;; operators apply to them alone (which is pointless but valid).
+  (should (equal (string-match "\\b*!" "*!") 1))
+  (should (equal (string-match "!\\b+;" "!;") nil))
+  (should (equal (string-match "!\\b+a" "!a") 0))
+
+  (should (equal (string-match "\\B*!" "*!") 1))
+  (should (equal (string-match "!\\B+;" "!;") 0))
+  (should (equal (string-match "!\\B+a" "!a") nil))
+
+  (should (equal (string-match "\\<*b" "*b") 1))
+  (should (equal (string-match "a\\<*b" "ab") 0))
+  (should (equal (string-match ";\\<*b" ";b") 0))
+  (should (equal (string-match "a\\<+b" "ab") nil))
+  (should (equal (string-match ";\\<+b" ";b") 0))
+
+  (should (equal (string-match "\\>*;" "*;") 1))
+  (should (equal (string-match "a\\>*b" "ab") 0))
+  (should (equal (string-match "a\\>*;" "a;") 0))
+  (should (equal (string-match "a\\>+b" "ab") nil))
+  (should (equal (string-match "a\\>+;" "a;") 0))
+
+  (should (equal (string-match "a\\'" "ab") nil))
+  (should (equal (string-match "b\\'" "ab") 1))
+  (should (equal (string-match "a\\'*b" "ab") 0))
+  (should (equal (string-match "a\\'+" "ab") nil))
+  (should (equal (string-match "b\\'+" "ab") 1))
+  (should (equal (string-match "\\'+" "+") 1))
+
+  (should (equal (string-match "\\_<*b" "*b") 1))
+  (should (equal (string-match "a\\_<*b" "ab") 0))
+  (should (equal (string-match " \\_<*b" " b") 0))
+  (should (equal (string-match "a\\_<+b" "ab") nil))
+  (should (equal (string-match " \\_<+b" " b") 0))
+
+  (should (equal (string-match "\\_>*;" "*;") 1))
+  (should (equal (string-match "a\\_>*b" "ab") 0))
+  (should (equal (string-match "a\\_>* " "a ") 0))
+  (should (equal (string-match "a\\_>+b" "ab") nil))
+  (should (equal (string-match "a\\_>+ " "a ") 0))
+
+  (should (equal (string-match "\\=*b" "*b") 1))
+  (should (equal (string-match "a\\=*b" "a*b") nil))
+  (should (equal (string-match "a\\=*b" "ab") 0))
+  )
+
 ;;; regex-emacs-tests.el ends here
-- 
2.32.0 (Apple Git-132)



* bug#64128: regexp parser zero-width assertion bugs
  2023-06-20 11:36                         ` Mattias Engdegård
@ 2023-06-21  6:08                           ` Paul Eggert
  2023-06-21 15:57                             ` Mattias Engdegård
  0 siblings, 1 reply; 18+ messages in thread
From: Paul Eggert @ 2023-06-21  6:08 UTC (permalink / raw)
  To: Mattias Engdegård, Stefan Monnier; +Cc: Eli Zaretskii, 64128

On 2023-06-20 04:36, Mattias Engdegård wrote:

> Sounds fine, with the option to go full error on group B if we agree that that's even better.

That would be fine too. I'd even prefer it. In the meantime your patch 
looks good.


> I still think the whole chapter needs a reform from the sheer weight of organic growth over the years. In particular, the division between "regexp special" and "regexp backslash" is purely syntactical, not semantic, and groups things in the wrong way.

Agreed.







* bug#64128: regexp parser zero-width assertion bugs
  2023-06-21  6:08                           ` Paul Eggert
@ 2023-06-21 15:57                             ` Mattias Engdegård
  0 siblings, 0 replies; 18+ messages in thread
From: Mattias Engdegård @ 2023-06-21 15:57 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eli Zaretskii, Stefan Monnier, 64128

21 juni 2023 kl. 08.08 skrev Paul Eggert <eggert@cs.ucla.edu>:

>> Sounds fine, with the option to go full error on group B if we agree that that's even better.
> 
> That would be fine too. I'd even prefer it. In the meantime your patch looks good.

Good, it's now in master. Let's think about whether an error can be motivated, and how.
We usually don't prevent the user from doing silly things, except when there is a strong reason to believe that it might be a serious mistake.







Thread overview: 18+ messages
2023-06-17 12:20 bug#64128: regexp parser zero-width assertion bugs Mattias Engdegård
2023-06-17 18:44 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-06-17 20:07   ` Mattias Engdegård
2023-06-17 22:18     ` Paul Eggert
2023-06-18  4:55       ` Eli Zaretskii
2023-06-18 20:26         ` Mattias Engdegård
2023-06-19  3:04           ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-06-19  8:44             ` Mattias Engdegård
2023-06-19 12:54               ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-06-19 18:34                 ` Mattias Engdegård
2023-06-19 19:21                   ` Paul Eggert
2023-06-19 19:52                     ` Mattias Engdegård
2023-06-19 20:08                       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-06-20 11:36                         ` Mattias Engdegård
2023-06-21  6:08                           ` Paul Eggert
2023-06-21 15:57                             ` Mattias Engdegård
2023-06-19 20:40                       ` Paul Eggert
2023-06-19 18:14           ` Paul Eggert
