* bug#64128: regexp parser zero-width assertion bugs
@ 2023-06-17 12:20 Mattias Engdegård
  2023-06-17 18:44 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
From: Mattias Engdegård @ 2023-06-17 12:20 UTC
  To: 64128; +Cc: Paul Eggert, Stefan Monnier

[-- Attachment #1: Type: text/plain, Size: 1385 bytes --]

In Emacs regexps, some but not all zero-width assertions have the special property that they are not treated as an element by an immediately following ?, * or +. For example,

  \b*

matches a literal asterisk at a word boundary -- the `*` becomes literal because it is treated as if there were nothing for it to act upon. Even stranger:

  xy\b*

is parsed as (* "xy" word-boundary) in rx syntax, which is remarkable: the repetition operator spans several elements even though no grouping brackets are given. Demo:

(and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
     (match-data))
=> (0 18)

Zero-width assertions that have the property:
^ (bol), $ (eol), \` (bos), \' (eos), \b (word-boundary), \B (not-word-boundary)

Zero-width assertions that do not have the property (and are treated as any other element):
\< (bow), \> (eow), \_< (symbol-start), \_> (symbol-end), \= (point)

Such regexp patterns should be very rare in practice and are presumably always mistakes, but it would be nice if they behaved in a way that makes some kind of sense.

A modest improvement would be to make operators become literal after any zero-width assertion, so that

  \<*

becomes (: word-start "*") instead of (* word-start), and

  xy\b*

becomes (: "xy" word-boundary "*") instead of (* "xy" word-boundary).
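
The difference can be illustrated with a small stand-alone model of the `laststart` bookkeeping. This is only a sketch, not the code in regex-emacs.c: it emits rx-like text instead of bytecode, knows only `\b`, `\<` and `*`, and all names in it (struct model, compile_model, and so on) are made up for illustration. The value -1 stands in for a null `laststart`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy model, NOT the real regex_compile: it emits rx-like text instead of
   bytecode.  `laststart' is an offset into that text; -1 plays the role
   of the null pointer ("no operand available").  */
struct model {
  char out[256];
  size_t len;
  long laststart;
  int in_literal;		/* an opening quote has been emitted */
};

static void
emit (struct model *m, const char *s)
{
  size_t n = strlen (s);
  assert (m->len + n < sizeof m->out);
  memcpy (m->out + m->len, s, n + 1);
  m->len += n;
}

static void
close_literal (struct model *m)
{
  if (m->in_literal)
    emit (m, "\" ");
  m->in_literal = 0;
}

/* Ordinary character.  A new literal starts if none is open or if a
   repetition operator follows, so that "xy*" repeats only "y".  */
static void
literal (struct model *m, char c, int op_follows)
{
  char s[2] = { c, '\0' };
  if (!m->in_literal || op_follows)
    {
      close_literal (m);
      m->laststart = (long) m->len;
      emit (m, "\"");
      m->in_literal = 1;
    }
  emit (m, s);
}

/* Zero-width assertion.  Unpatched, some assertions leave laststart
   alone (SETS_OPERAND == 0) and others make themselves the operand
   (SETS_OPERAND == 1); patched, all of them clear it.  */
static void
assertion (struct model *m, const char *name, int sets_operand, int patched)
{
  close_literal (m);
  if (patched)
    m->laststart = -1;
  else if (sets_operand)
    m->laststart = (long) m->len;
  emit (m, name);
  emit (m, " ");
}

/* Postfix `*' with an operand: wrap everything from laststart onwards.  */
static void
star (struct model *m)
{
  close_literal (m);
  size_t start = (size_t) m->laststart;
  assert (m->len + 3 < sizeof m->out);
  memmove (m->out + start + 3, m->out + start, m->len - start + 1);
  memcpy (m->out + start, "(* ", 3);
  m->len += 3;
  m->out[m->len - 1] = ')';	/* overwrite the trailing separator */
  emit (m, " ");
}

/* Translate PAT into rx-like text in RESULT.  PATCHED selects the
   proposed behaviour; 0 selects the behaviour described in the report.  */
void
compile_model (const char *pat, int patched, char *result, size_t size)
{
  struct model m = { .laststart = -1 };
  for (const char *p = pat; *p; p++)
    {
      if (p[0] == '\\' && p[1] == 'b')
	{ assertion (&m, "word-boundary", 0, patched); p++; }
      else if (p[0] == '\\' && p[1] == '<')
	{ assertion (&m, "word-start", 1, patched); p++; }
      else if (*p == '*' && m.laststart >= 0)
	star (&m);
      else			/* includes `*' with nothing to act on */
	literal (&m, *p, p[1] == '*');
    }
  close_literal (&m);
  if (m.len > 0)
    m.out[m.len - 1] = '\0';	/* drop the trailing separator */
  snprintf (result, size, "%s", m.out);
}
```

In this model, compile_model ("xy\\b*", 0, buf, sizeof buf) leaves (* "xy" word-boundary) in buf, because `\b` never touches the operand recorded for "xy". With the second argument set to 1 the result is "xy" word-boundary "*", and `\<*` changes from (* word-start) to word-start "*".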

Suggested patch attached.


[-- Attachment #2: regexp-zero-width-assertion-bug.diff --]
[-- Type: application/octet-stream, Size: 2857 bytes --]

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index e3237cd425a..120a727cf74 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -1716,7 +1716,9 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
   /* Address of start of the most recently finished expression.
      This tells, e.g., postfix * where to find the start of its
-     operand.  Reset at the beginning of groups and alternatives.  */
+     operand.  Reset at the beginning of groups and alternatives,
+     and after any zero-width assertion (which should not be the target
+     of any postfix repetition operators).  */
   unsigned char *laststart = 0;
 
   /* Address of beginning of regexp, or inside of last group.  */
@@ -1847,12 +1849,14 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 	case '^':
 	  if (! (p == pattern + 1 || at_begline_loc_p (pattern, p)))
 	    goto normal_char;
+	  laststart = 0;
 	  BUF_PUSH (begline);
 	  break;
 
 	case '$':
 	  if (! (p == pend || at_endline_loc_p (p, pend)))
 	    goto normal_char;
+	  laststart = 0;
 	  BUF_PUSH (endline);
 	  break;
 
@@ -1892,7 +1896,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 	    /* Star, etc. applied to an empty pattern is equivalent
 	       to an empty pattern.  */
-	    if (!laststart || laststart == b)
+	    if (laststart == b)
 	      break;
 
 	    /* Now we know whether or not zero matches is allowed
@@ -2482,7 +2486,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 	       goto normal_char;
 
 	    case '=':
-	      laststart = b;
+	      laststart = 0;
 	      BUF_PUSH (at_dot);
 	      break;
 
@@ -2523,17 +2527,17 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 
 	    case '<':
-	      laststart = b;
+	      laststart = 0;
 	      BUF_PUSH (wordbeg);
 	      break;
 
 	    case '>':
-	      laststart = b;
+	      laststart = 0;
 	      BUF_PUSH (wordend);
 	      break;
 
 	    case '_':
-              laststart = b;
+              laststart = 0;
               PATFETCH (c);
               if (c == '<')
                 BUF_PUSH (symbeg);
@@ -2544,18 +2548,22 @@ regex_compile (re_char *pattern, ptrdiff_t size,
               break;
 
 	    case 'b':
+	      laststart = 0;
 	      BUF_PUSH (wordbound);
 	      break;
 
 	    case 'B':
+	      laststart = 0;
 	      BUF_PUSH (notwordbound);
 	      break;
 
 	    case '`':
+	      laststart = 0;
 	      BUF_PUSH (begbuf);
 	      break;
 
 	    case '\'':
+	      laststart = 0;
 	      BUF_PUSH (endbuf);
 	      break;
 
@@ -2597,7 +2605,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 	      /* If followed by a repetition operator.  */
 	      || (p != pend
-		  && (*p == '*' || *p == '+' || *p == '?' || *p == '^'))
+		  && (*p == '*' || *p == '+' || *p == '?'))
 	      || (p + 1 < pend && p[0] == '\\' && p[1] == '{'))
 	    {
 	      /* Start building a new exactn.  */

Thread overview: 18+ messages
2023-06-17 12:20 bug#64128: regexp parser zero-width assertion bugs Mattias Engdegård
2023-06-17 18:44 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-06-17 20:07   ` Mattias Engdegård
2023-06-17 22:18     ` Paul Eggert
2023-06-18  4:55       ` Eli Zaretskii
2023-06-18 20:26         ` Mattias Engdegård
2023-06-19  3:04           ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-06-19  8:44             ` Mattias Engdegård
2023-06-19 12:54               ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-06-19 18:34                 ` Mattias Engdegård
2023-06-19 19:21                   ` Paul Eggert
2023-06-19 19:52                     ` Mattias Engdegård
2023-06-19 20:08                       ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2023-06-20 11:36                         ` Mattias Engdegård
2023-06-21  6:08                           ` Paul Eggert
2023-06-21 15:57                             ` Mattias Engdegård
2023-06-19 20:40                       ` Paul Eggert
2023-06-19 18:14           ` Paul Eggert
