unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
From: "Mattias Engdegård" <mattiase@acm.org>
To: Paul Eggert <eggert@cs.ucla.edu>,
	Phil Sainty <psainty@orcon.net.nz>, Eli Zaretskii <eliz@gnu.org>
Cc: 37659@debbugs.gnu.org
Subject: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Tue, 11 Feb 2020 13:57:27 +0100	[thread overview]
Message-ID: <2F3A70C9-969B-4E86-998E-BF3CC990B769@acm.org> (raw)
In-Reply-To: <aadc4bf6-555d-7639-5713-96e96ee0ef9f@cs.ucla.edu>

[-- Attachment #1: Type: text/plain, Size: 1706 bytes --]

22 okt. 2019 kl. 19.33 skrev Paul Eggert <eggert@cs.ucla.edu>:

> Moreover, if greed is the longstanding tradition for regexp-opt, shouldn't plain "or" be greedy, to be consistent with other operators?

Having second thoughts, I've come to believe that Paul may have been right after all. We might just as well let plain 'or' (alias '|') match as much as possible when it is able to do so. In particular, we should guarantee that this will happen when all arguments are strings, as used to be the case.

Initially I thought it was a bug that (or "a" "ab") was optimised into "ab?" on the grounds that this made the behaviour unpredictable: when matching the string "abc", (or "a" "ab") matched "ab", whereas (or "a" "ab" space) would match "a". However, the current 'fixed' code isn't necessarily more useful.

Since the change was introduced in Emacs 27 which has not yet been released, I suggest the attached patch for emacs-27. It reverts the use of regexp-opt with KEEP-ORDER = t. What do you think? It would solve the problem without introducing new constructs, and without running the risk of introducing subtle errors in existing rx expressions.

(In fact, if we do not do this in Emacs 27, we'd have to add a NEWS entry to warn users about the change.)

A further improvement would be to ensure that nested all-string 'or' forms would have the same property, and that expansion of user-defined forms would be transparent. In other words, that

 (rx-let ((x (or "abc" "de")))
   (rx (or "a" x (or "ab" "def"))))

would be equivalent to

 (rx "abc" "ab" "a" "def" "de")

I'll prepare a patch for this QoI improvement, but the attached patch should be required no matter what.


[-- Attachment #2: 0001-rx-Use-longest-match-for-all-string-or-forms-bug-376.patch --]
[-- Type: application/octet-stream, Size: 2209 bytes --]

From b05c5ace634986f3c32310a4f62bd619e6ac5db9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 11 Feb 2020 13:23:10 +0100
Subject: [PATCH] rx: Use longest match for all-string 'or' forms (bug#37659)

Revert to the Emacs 26 semantics that always gave the longest match
for rx 'or' forms with only string arguments.  This guarantee was
never well documented, but it is useful and people likely have come to
rely on it.  For example, prior to this change,

 (rx (or ">" ">="))

matched ">" even if the text contained ">=".

* lisp/emacs-lisp/rx.el (rx--translate-or): Don't tell regexp-opt to
preserve the matching order.
* doc/lispref/searching.texi (Rx Constructs): Document the
longest-match guarantee for all-string 'or' forms.
---
 doc/lispref/searching.texi | 5 ++++-
 lisp/emacs-lisp/rx.el      | 2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 3d7ea93286..5f4509a8b4 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1080,7 +1080,10 @@ Rx Constructs
 @cindex @code{or} in rx
 @itemx @code{(| @var{rx}@dots{})}
 @cindex @code{|} in rx
-Match exactly one of the @var{rx}s, trying from left to right.
+Match exactly one of the @var{rx}s.
+If all arguments are string literals, the longest possible match
+will always be used.  Otherwise, either the longest match or the
+first (in left-to-right order) will be used.
 Without arguments, the expression will not match anything at all.@*
 Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
 
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 03af053c91..b4cab5715d 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -290,7 +290,7 @@ rx--translate-or
    ((null (cdr body))              ; Single item.
     (rx--translate (car body)))
    ((rx--every #'stringp body)     ; All strings.
-    (cons (list (regexp-opt body nil t))
+    (cons (list (regexp-opt body nil))
           t))
    ((rx--every #'rx--charset-p body)  ; All charsets.
     (rx--translate-union nil body))
-- 
2.21.1 (Apple Git-122.3)


  parent reply	other threads:[~2020-02-11 12:57 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-08  9:36 bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
2019-10-09  8:59 ` Mattias Engdegård
2019-10-11 23:07 ` bug#37659: Mattias Engdegård <mattiase <at> acm.org> Paul Eggert
2019-10-12 10:47   ` Mattias Engdegård
2019-10-13 16:52     ` Paul Eggert
2019-10-13 19:48       ` Mattias Engdegård
2019-10-22 15:14       ` bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
2019-10-22 15:27         ` Robert Pluim
2019-10-22 17:33         ` Paul Eggert
2019-10-23  9:15           ` Mattias Engdegård
2019-10-23 23:14             ` Paul Eggert
2019-10-24  1:56               ` Drew Adams
2019-10-24  9:09                 ` Mattias Engdegård
2019-10-24 14:24                   ` Drew Adams
2019-10-24  9:17                 ` Phil Sainty
2019-10-24 14:32                   ` Drew Adams
2019-10-24  8:58               ` Mattias Engdegård
2019-10-27 11:53                 ` Mattias Engdegård
2020-02-11 12:57           ` Mattias Engdegård [this message]
2020-02-11 15:43             ` Eli Zaretskii
2020-02-11 19:17               ` Mattias Engdegård
2020-02-12  0:52                 ` Paul Eggert
2020-02-12 11:22                   ` Mattias Engdegård
2020-02-13 18:38                     ` Mattias Engdegård
2020-02-13 18:50                       ` Paul Eggert
2020-02-13 19:16                         ` Mattias Engdegård
2020-02-13 19:30                           ` Eli Zaretskii
2020-02-13 22:23                             ` Mattias Engdegård
2020-02-14  7:45                               ` Eli Zaretskii
2020-02-14 16:15                                 ` Paul Eggert
2020-02-14 20:49                                   ` Mattias Engdegård
2020-03-01 10:09                                   ` Mattias Engdegård

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2F3A70C9-969B-4E86-998E-BF3CC990B769@acm.org \
    --to=mattiase@acm.org \
    --cc=37659@debbugs.gnu.org \
    --cc=eggert@cs.ucla.edu \
    --cc=eliz@gnu.org \
    --cc=psainty@orcon.net.nz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).