From: "Mattias Engdegård" <mattias.engdegard@gmail.com>
To: 64128@debbugs.gnu.org
Cc: Paul Eggert <eggert@cs.ucla.edu>,
	Stefan Monnier <monnier@iro.umontreal.ca>
Subject: bug#64128: regexp parser zero-width assertion bugs
Date: Sat, 17 Jun 2023 14:20:27 +0200
Message-ID: <E8949338-37DF-41DF-A295-46510F03515C@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1385 bytes --]

In Emacs regexps, some but not all zero-width assertions have a special property: they are not treated as an operand by an immediately following ?, * or +. For example,

  \b*

matches a literal asterisk at a word boundary -- the `*` becomes literal because it is treated as if there were nothing for it to act upon. Even stranger:

  xy\b*

is parsed as (* "xy" word-boundary) in rx syntax, which is remarkable: the repetition operator spans several elements even though the pattern contains no explicit group. Demo:

(and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
     (match-data))
=> (0 18)
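
To see where the 18 comes from, here is a small C sketch that greedily matches (* UNIT word-boundary) at the start of a string by hand. It is only an illustration: the function names are made up, and it assumes that ASCII letters and digits are the only word constituents, whereas Emacs consults the syntax table.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumption: only ASCII letters and digits are word constituents.
   Emacs decides this from the current syntax table instead.  */
int
is_word_char (char c)
{
  return isalnum ((unsigned char) c) != 0;
}

/* Nonzero if there is a word boundary just before TEXT[POS].
   Both ends of the text count as boundaries, as \b does in Emacs.  */
int
at_word_boundary (const char *text, size_t pos)
{
  size_t len = strlen (text);
  if (pos == 0 || pos >= len)
    return 1;
  return is_word_char (text[pos - 1]) != is_word_char (text[pos]);
}

/* Greedy match of (* UNIT word-boundary) anchored at the start of
   TEXT.  Returns the end offset of the match (0 if no repetition).  */
size_t
match_unit_boundary_star (const char *unit, const char *text)
{
  size_t unit_len = strlen (unit);
  size_t pos = 0;

  if (unit_len == 0)
    return 0;
  /* Each iteration consumes one copy of UNIT, provided a word
     boundary follows it.  */
  while (strncmp (text + pos, unit, unit_len) == 0
         && at_word_boundary (text, pos + unit_len))
    pos += unit_len;
  return pos;
}
```

With "quack," as the unit and the demo string as the text, three copies of "quack," are consumed, each followed by a boundary between "," and "q", and the fourth attempt fails on "quaaaack!", so the match ends at offset 18.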

Zero-width assertions that have the property:
^ (bol), $ (eol), \` (bos), \' (eos), \b (word-boundary), \B (not-word-boundary)

Zero-width assertions that do not have the property (and are treated as any other element):
\< (bow), \> (eow), \_< (symbol-start), \_> (symbol-end), \= (point)

Such patterns should be very rare in practice and are presumably always a mistake, but it would be nice if they behaved in a way that makes some kind of sense.

A modest improvement would be to make operators become literal after any zero-width assertion, so that

  \<*

becomes (: word-start "*") instead of (* word-start), and

  xy\b*

becomes (: "xy" word-boundary "*") instead of (* "xy" word-boundary).
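
The difference between the current and the suggested behaviour can be sketched with a toy model in C of the parser's `laststart` bookkeeping. This is not the code in regex-emacs.c: the name toy_star_effect, the result struct, and the use of pattern offsets (with -1 standing in for a null `laststart`) are invented for illustration, and only literal characters, \b and \< are modelled.

```c
#include <stddef.h>

enum star_kind { NO_STAR, STAR_LITERAL, STAR_REPEATS };

struct star_result
{
  enum star_kind kind;
  ptrdiff_t operand_start;  /* pattern offset where the repeated operand
                               begins; -1 unless kind is STAR_REPEATS */
};

/* Scan PATTERN up to its first '*' and report how that '*' is treated.
   PROPOSED == 0 models the current behaviour described above,
   PROPOSED != 0 the suggested one.  Hypothetical helper; anything
   other than literals, \b and \< is not modelled.  */
struct star_result
toy_star_effect (const char *pattern, int proposed)
{
  ptrdiff_t laststart = -1;  /* start of most recently finished expression */
  int pending_exact = 0;     /* nonzero while a literal run can still grow */
  struct star_result r = { NO_STAR, -1 };

  for (ptrdiff_t i = 0; pattern[i] != '\0'; i++)
    {
      char c = pattern[i];
      if (c == '*')
        {
          /* With no operand recorded, '*' is taken literally.  */
          r.kind = laststart < 0 ? STAR_LITERAL : STAR_REPEATS;
          r.operand_start = laststart;
          return r;
        }
      else if (c == '\\' && (pattern[i + 1] == 'b' || pattern[i + 1] == '<'))
        {
          if (proposed)
            laststart = -1;      /* suggested: no assertion is an operand */
          else if (pattern[i + 1] == '<')
            laststart = i;       /* current: \< is an ordinary element */
          /* current \b: laststart is left untouched */
          pending_exact = 0;     /* an assertion ends the literal run */
          i++;                   /* skip the character after the backslash */
        }
      else
        {
          /* A literal joins the pending run unless a '*' follows, in
             which case it starts a run of its own so that the operator
             applies to this character only.  */
          if (!pending_exact || pattern[i + 1] == '*')
            laststart = i;
          pending_exact = 1;
        }
    }
  return r;
}
```

In this model, toy_star_effect ("xy\\b*", 0) reports a repetition starting at offset 0, that is, over "xy" and the word boundary together, while toy_star_effect ("xy\\b*", 1) and toy_star_effect ("\\<*", 1) both report a literal asterisk. An ordinary pattern such as "xy*" repeats only the "y" in either mode.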

Suggested patch attached.


[-- Attachment #2: regexp-zero-width-assertion-bug.diff --]
[-- Type: application/octet-stream, Size: 2857 bytes --]

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index e3237cd425a..120a727cf74 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -1716,7 +1716,9 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
   /* Address of start of the most recently finished expression.
      This tells, e.g., postfix * where to find the start of its
-     operand.  Reset at the beginning of groups and alternatives.  */
+     operand.  Reset at the beginning of groups and alternatives,
+     and after any zero-width assertion (which should not be the target
+     of any postfix repetition operators).  */
   unsigned char *laststart = 0;
 
   /* Address of beginning of regexp, or inside of last group.  */
@@ -1847,12 +1849,14 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 	case '^':
 	  if (! (p == pattern + 1 || at_begline_loc_p (pattern, p)))
 	    goto normal_char;
+	  laststart = 0;
 	  BUF_PUSH (begline);
 	  break;
 
 	case '$':
 	  if (! (p == pend || at_endline_loc_p (p, pend)))
 	    goto normal_char;
+	  laststart = 0;
 	  BUF_PUSH (endline);
 	  break;
 
@@ -1892,7 +1896,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 	    /* Star, etc. applied to an empty pattern is equivalent
 	       to an empty pattern.  */
-	    if (!laststart || laststart == b)
+	    if (laststart == b)
 	      break;
 
 	    /* Now we know whether or not zero matches is allowed
@@ -2482,7 +2486,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 	       goto normal_char;
 
 	    case '=':
-	      laststart = b;
+	      laststart = 0;
 	      BUF_PUSH (at_dot);
 	      break;
 
@@ -2523,17 +2527,17 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 
 	    case '<':
-	      laststart = b;
+	      laststart = 0;
 	      BUF_PUSH (wordbeg);
 	      break;
 
 	    case '>':
-	      laststart = b;
+	      laststart = 0;
 	      BUF_PUSH (wordend);
 	      break;
 
 	    case '_':
-              laststart = b;
+              laststart = 0;
               PATFETCH (c);
               if (c == '<')
                 BUF_PUSH (symbeg);
@@ -2544,18 +2548,22 @@ regex_compile (re_char *pattern, ptrdiff_t size,
               break;
 
 	    case 'b':
+	      laststart = 0;
 	      BUF_PUSH (wordbound);
 	      break;
 
 	    case 'B':
+	      laststart = 0;
 	      BUF_PUSH (notwordbound);
 	      break;
 
 	    case '`':
+	      laststart = 0;
 	      BUF_PUSH (begbuf);
 	      break;
 
 	    case '\'':
+	      laststart = 0;
 	      BUF_PUSH (endbuf);
 	      break;
 
@@ -2597,7 +2605,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
 
 	      /* If followed by a repetition operator.  */
 	      || (p != pend
-		  && (*p == '*' || *p == '+' || *p == '?' || *p == '^'))
+		  && (*p == '*' || *p == '+' || *p == '?'))
 	      || (p + 1 < pend && p[0] == '\\' && p[1] == '{'))
 	    {
 	      /* Start building a new exactn.  */
