From: "Mattias Engdegård" <mattias.engdegard@gmail.com>
To: 64128@debbugs.gnu.org
Cc: Paul Eggert <eggert@cs.ucla.edu>,
Stefan Monnier <monnier@iro.umontreal.ca>
Subject: bug#64128: regexp parser zero-width assertion bugs
Date: Sat, 17 Jun 2023 14:20:27 +0200
Message-ID: <E8949338-37DF-41DF-A295-46510F03515C@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1385 bytes --]
In Emacs regexps, some but not all zero-width assertions have the special property that they are not treated as an operand by an immediately following ?, * or +. For example,
\b*
matches a literal asterisk at a word boundary -- the `*` becomes literal because it is treated as if there were nothing for it to act upon. Even stranger:
xy\b*
is parsed as, in rx syntax, (* "xy" word-boundary), which is remarkable: the repetition operator encompasses several elements even though no grouping brackets are given. Demo:
(and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
(match-data))
=> (0 18)
Zero-width assertions that have the property:
^ (bol), $ (eol), \` (bos), \' (eos), \b (word-boundary), \B (not-word-boundary)
Zero-width assertions that do not have the property (and are treated as any other element):
\< (bow), \> (eow), \_< (symbol-start), \_> (symbol-end), \= (point)
Such regexp patterns should be very rare in practice, and any occurrence is almost certainly a mistake, but it would be nice if they behaved in a way that makes some kind of sense.
A modest improvement would be to make operators become literal after any zero-width assertion, so that
\<*
becomes (: word-start "*") instead of (* word-start), and
xy\b*
becomes (: "xy" word-boundary "*") instead of (* "xy" word-boundary).
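The difference between the current and the proposed behaviour can be illustrated with a small toy model. This is only a sketch, not the code in regex-emacs.c: it assumes a drastically simplified parser that understands literals, `*`, `\b` and `\<` only, keeps its "buffer" as a list of rx-style items, and uses an index (with -1 standing in for a null pointer) where the real compiler uses the `laststart` pointer. The names `struct toy`, `toy_parse`, `push`, `literal`, `join` and `star` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum { MAX_ITEMS = 16, ITEM_LEN = 128 };

/* Toy stand-in for the compile buffer: a list of rx-style items.  */
struct toy
{
  char item[MAX_ITEMS][ITEM_LEN];
  int n;          /* Number of items; plays the role of the buffer end.  */
  int laststart;  /* Index where a postfix operand starts; -1 means none.  */
  bool run_open;  /* True if the last item is a literal run.  */
};

/* Append a finished item such as an assertion.  */
static void
push (struct toy *t, const char *text)
{
  assert (t->n < MAX_ITEMS);
  int len = snprintf (t->item[t->n++], ITEM_LEN, "%s", text);
  assert (0 <= len && len < ITEM_LEN);
  t->run_open = false;
}

/* Add literal C.  A new run starts if none is open or if `*` follows,
   so that the operator only applies to the last character.  */
static void
literal (struct toy *t, char c, char next)
{
  if (!t->run_open || next == '*')
    {
      t->laststart = t->n;
      push (t, "\"\"");
      t->run_open = true;
    }
  char *s = t->item[t->n - 1];
  size_t len = strlen (s);
  assert (len + 2 <= ITEM_LEN);
  s[len - 1] = c;   /* Overwrite the closing quote, then re-close.  */
  s[len] = '"';
  s[len + 1] = '\0';
}

/* Write items FROM..n-1 to OUT, separated by spaces.  */
static void
join (const struct toy *t, int from, char *out, size_t outlen)
{
  out[0] = '\0';
  for (int i = from; i < t->n; i++)
    {
      size_t used = strlen (out);
      int len = snprintf (out + used, outlen - used, "%s%s",
                          i > from ? " " : "", t->item[i]);
      assert (0 <= len && (size_t) len < outlen - used);
    }
}

/* Apply `*` to everything from laststart to the end of the buffer.  */
static void
star (struct toy *t)
{
  char body[ITEM_LEN];
  join (t, t->laststart, body, sizeof body);
  t->n = t->laststart;
  int len = snprintf (t->item[t->n++], ITEM_LEN, "(* %s)", body);
  assert (0 <= len && len < ITEM_LEN);
  t->run_open = false;
}

/* Parse RE and write an rx-like form to OUT.  PATCHED selects the
   proposed rule: every assertion clears laststart.  */
void
toy_parse (const char *re, bool patched, char *out, size_t outlen)
{
  struct toy t = { .n = 0, .laststart = -1, .run_open = false };
  for (const char *p = re; *p; p++)
    {
      if (p[0] == '\\' && (p[1] == 'b' || p[1] == '<'))
        {
          p++;
          if (patched)
            t.laststart = -1;   /* Proposed: never an operand.  */
          else if (*p == '<')
            t.laststart = t.n;  /* Current: \< is its own operand.  */
          /* Current: \b leaves laststart untouched.  */
          push (&t, *p == 'b' ? "word-boundary" : "word-start");
        }
      else if (*p == '*' && t.laststart >= 0)
        star (&t);
      else
        literal (&t, *p, p[1]);  /* Includes `*` with no operand.  */
    }
  char body[ITEM_LEN];
  join (&t, 0, body, sizeof body);
  int len = (t.n == 1
             ? snprintf (out, outlen, "%s", body)
             : snprintf (out, outlen, "(: %s)", body));
  assert (0 <= len && (size_t) len < outlen);
}
```

In this model, toy_parse ("xy\\b*", false, ...) yields (* "xy" word-boundary) because `\b` leaves the operand start pointing at "xy", while toy_parse ("xy\\b*", true, ...) yields (: "xy" word-boundary "*"). Likewise "\\<*" gives (* word-start) under the current rule and (: word-start "*") under the proposed one.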
Suggested patch attached.
[-- Attachment #2: regexp-zero-width-assertion-bug.diff --]
[-- Type: application/octet-stream, Size: 2857 bytes --]
diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index e3237cd425a..120a727cf74 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -1716,7 +1716,9 @@ regex_compile (re_char *pattern, ptrdiff_t size,
/* Address of start of the most recently finished expression.
This tells, e.g., postfix * where to find the start of its
- operand. Reset at the beginning of groups and alternatives. */
+ operand. Reset at the beginning of groups and alternatives,
+ and after any zero-width assertion (which should not be the target
+ of any postfix repetition operators). */
unsigned char *laststart = 0;
/* Address of beginning of regexp, or inside of last group. */
@@ -1847,12 +1849,14 @@ regex_compile (re_char *pattern, ptrdiff_t size,
case '^':
if (! (p == pattern + 1 || at_begline_loc_p (pattern, p)))
goto normal_char;
+ laststart = 0;
BUF_PUSH (begline);
break;
case '$':
if (! (p == pend || at_endline_loc_p (p, pend)))
goto normal_char;
+ laststart = 0;
BUF_PUSH (endline);
break;
@@ -1892,7 +1896,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
/* Star, etc. applied to an empty pattern is equivalent
to an empty pattern. */
- if (!laststart || laststart == b)
+ if (laststart == b)
break;
/* Now we know whether or not zero matches is allowed
@@ -2482,7 +2486,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
goto normal_char;
case '=':
- laststart = b;
+ laststart = 0;
BUF_PUSH (at_dot);
break;
@@ -2523,17 +2527,17 @@ regex_compile (re_char *pattern, ptrdiff_t size,
case '<':
- laststart = b;
+ laststart = 0;
BUF_PUSH (wordbeg);
break;
case '>':
- laststart = b;
+ laststart = 0;
BUF_PUSH (wordend);
break;
case '_':
- laststart = b;
+ laststart = 0;
PATFETCH (c);
if (c == '<')
BUF_PUSH (symbeg);
@@ -2544,18 +2548,22 @@ regex_compile (re_char *pattern, ptrdiff_t size,
break;
case 'b':
+ laststart = 0;
BUF_PUSH (wordbound);
break;
case 'B':
+ laststart = 0;
BUF_PUSH (notwordbound);
break;
case '`':
+ laststart = 0;
BUF_PUSH (begbuf);
break;
case '\'':
+ laststart = 0;
BUF_PUSH (endbuf);
break;
@@ -2597,7 +2605,7 @@ regex_compile (re_char *pattern, ptrdiff_t size,
/* If followed by a repetition operator. */
|| (p != pend
- && (*p == '*' || *p == '+' || *p == '?' || *p == '^'))
+ && (*p == '*' || *p == '+' || *p == '?'))
|| (p + 1 < pend && p[0] == '\\' && p[1] == '{'))
{
/* Start building a new exactn. */