From: Paul Eggert <eggert@cs.ucla.edu>
To: "Mattias Engdegård" <mattias.engdegard@gmail.com>,
	"Stefan Monnier" <monnier@iro.umontreal.ca>
Cc: 64128@debbugs.gnu.org
Subject: bug#64128: regexp parser zero-width assertion bugs
Date: Sat, 17 Jun 2023 15:18:00 -0700
Message-ID: <a4a870b6-b637-7e61-0b18-7fa01b970a4f@cs.ucla.edu>
In-Reply-To: <4A303177-384E-4FEF-98F2-FAB89A12ACC9@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1023 bytes --]

On 2023-06-17 13:07, Mattias Engdegård wrote:
> 17 juni 2023 kl. 20.44 skrev Stefan Monnier <monnier@iro.umontreal.ca>:
> 
>> I think the behavior that makes most sense is to signal an error when
>> compiling the regexp.
> 
> Clearly, but some behaviour needs to be preserved for compatibility.
> Regexps like "^*" aren't uncommon. Can it be generalised in a useful way?
> 

doc/lispref/searching.texi says that "*" is treated as an ordinary 
character if it is in a context where its special meaning makes no 
sense, giving "*foo" as an example. If we break with this tradition by 
making "\b*" an error instead of being equivalent to "\b\*", we should 
update that part of the manual.

One possible way forward is to update doc/lispref/searching.texi to 
specify what we want. Then we can modify the code to match the updated 
documentation.

In my experience, modifying the doc is often the hard part, so I took a 
crack at that in the draft proposed patch, which I have not installed.
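
To make the rule in the draft easier to discuss, here is a rough Python
sketch that classifies each "*" in an Emacs-style regexp the way the
proposed wording reads. It is only a model of the draft text, not of how
the regexp compiler works: the names classify_stars, tokens, ZERO_WIDTH
and GROUP_OR_ALT are made up for illustration, and the sketch assumes no
character alternatives ([...]) and looks at "*" only.

```python
# Illustrative sketch only: a hypothetical helper, not Emacs's regexp parser.
# It ignores character alternatives ([...]) and the other repetition operators.

# Backslash constructs that match only the empty string.
ZERO_WIDTH = {"\\`", "\\'", "\\=", "\\b", "\\B", "\\<", "\\>", "\\_<", "\\_>"}
# Constructs after which there is no preceding expression to repeat.
GROUP_OR_ALT = {"\\(", "\\(?:", "\\|"}


def tokens(regexp):
    """Split an Emacs-style regexp into single chars and backslash constructs."""
    out, i = [], 0
    while i < len(regexp):
        n = 1
        if regexp[i] == "\\" and i + 1 < len(regexp):
            n = 2
            if regexp.startswith("\\_", i) and regexp[i + 2:i + 3] in ("<", ">"):
                n = 3  # symbol boundaries \_< and \_>
            elif regexp.startswith("\\(?:", i):
                n = 4  # shy group opener
        out.append(regexp[i:i + n])
        i += n
    return out


def classify_stars(regexp):
    """Return 'literal', 'invalid' or 'operator' for each '*' in regexp."""
    result = []
    prev = None            # previous token; None means start of regexp
    caret_special = False  # was the previous token a '^' acting as an anchor?
    for tok in tokens(regexp):
        if tok == "*":
            if prev is None or prev in GROUP_OR_ALT or caret_special:
                result.append("literal")    # draft: treated as \*
            elif prev in ZERO_WIDTH:
                result.append("invalid")    # draft: an error
            else:
                result.append("operator")   # ordinary repetition
        caret_special = tok == "^" and (prev is None or prev in GROUP_OR_ALT)
        prev = tok
    return result


print(classify_stars("*foo"))    # ['literal']
print(classify_stars("^*"))      # ['literal']
print(classify_stars("\\b*"))    # ['invalid']
print(classify_stars("fo*"))     # ['operator']
```

Under this reading "^*" keeps working as a literal star, which covers the
compatibility case Mattias mentioned, while "\b*" and "\<*" are rejected.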

Comments?

[-- Attachment #2: 0001-Document-that-b-etc-are-now-invalid-regexps.patch --]
[-- Type: text/x-patch, Size: 3904 bytes --]

From e4fc369a624d85027d39a424a507507da00f26aa Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Sat, 17 Jun 2023 15:05:42 -0700
Subject: [PROPOSED] Document that \b* etc are now invalid regexps

---
 doc/lispref/searching.texi | 24 ++++++++++++++++--------
 etc/NEWS                   |  6 ++++++
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index b8d9094b28..fd4dfcbd71 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -332,6 +332,10 @@ Regexp Special
 expression.  Thus, @samp{fo*} has a repeating @samp{o}, not a repeating
 @samp{fo}.  It matches @samp{f}, @samp{fo}, @samp{foo}, and so on.
 
+@samp{*} cannot immediately follow a backslash escape that matches
+only empty strings, as this is too likely to be a typo.  For example,
+@samp{\<*} is invalid.
+
 @cindex backtracking and regular expressions
 The matcher processes a @samp{*} construct by matching, immediately, as
 many repetitions as can be found.  Then it continues with the rest of
@@ -505,9 +509,10 @@ Regexp Special
 When matching a string instead of a buffer, @samp{^} matches at the
 beginning of the string or after a newline character.
 
-For historical compatibility reasons, @samp{^} can be used only at the
+For historical compatibility, @samp{^} is special only at the
 beginning of the regular expression, or after @samp{\(}, @samp{\(?:}
-or @samp{\|}.
+or @samp{\|}.  In other contexts it is an ordinary character, except
+for its special meaning at the start of a character alternative.
 
 @item @samp{$}
 @cindex @samp{$} in regexp
@@ -519,8 +524,9 @@ Regexp Special
 When matching a string instead of a buffer, @samp{$} matches at the end
 of the string or before a newline character.
 
-For historical compatibility reasons, @samp{$} can be used only at the
+For historical compatibility, @samp{$} is special only at the
 end of the regular expression, or before @samp{\)} or @samp{\|}.
+In other contexts it is an ordinary character.
 
 @item @samp{\}
 @cindex @samp{\} in regexp
@@ -541,11 +547,13 @@ Regexp Special
 @end table
 
 @strong{Please note:} For historical compatibility, special characters
-are treated as ordinary ones if they are in contexts where their special
-meanings make no sense.  For example, @samp{*foo} treats @samp{*} as
-ordinary since there is no preceding expression on which the @samp{*}
-can act.  It is poor practice to depend on this behavior; quote the
-special character anyway, regardless of where it appears.
+are treated as ordinary ones if they would otherwise start repetition
+operators either at the start of a regular expression, or after
+@samp{^}, @samp{\(}, @samp{\(?:} or @samp{\|}.  For example,
+@samp{*foo} is treated as @samp{\*foo}, and @samp{two\|^\@{2\@}} is
+treated as @samp{two\|^@{2@}}.  It is poor practice to depend on this
+behavior; use proper backslash escaping anyway, regardless of where
+the special character appears.
 
 As a @samp{\} is not special inside a character alternative, it can
 never remove the special meaning of @samp{-}, @samp{^} or @samp{]}.
diff --git a/etc/NEWS b/etc/NEWS
index 61e6e16166..0c4889f9a6 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -436,6 +436,12 @@ Previously, '\x' without at least one hex digit denoted character code
 zero (NUL) but as this was neither intended nor documented or even
 known by anyone, it is now treated as an error by the Lisp reader.
 
+===
+** In regular expressions, zero-width backslash escapes can no longer
+be followed by repetition operators.  For example, '\b*' is no longer
+a valid regular expression.  Previously the behavior was erratic for
+these constructs, and they were typically typos anyway.
+
 ---
 ** Connection-local variables are applied in buffers visiting a remote file.
 This overrides possible directory-local or file-local variables with
-- 
2.39.2

