unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
* bug#34641: rx: (or ...) order unpredictable
@ 2019-02-24 18:40 Mattias Engdegård
  2019-02-24 19:06 ` Eli Zaretskii
  2019-02-25  2:37 ` Noam Postavsky
  0 siblings, 2 replies; 17+ messages in thread
From: Mattias Engdegård @ 2019-02-24 18:40 UTC (permalink / raw)
  To: 34641

The rx (or ...) construct sometimes reorders its subexpressions, which makes its semantics unpredictable. For example,

(rx (or "ab" "a") (or "a" "ab"))
=>
"\\(?:ab?\\)\\(?:ab?\\)"

The user reasonably expects (or e1 e2) to translate to E1\|E2, where ei translates to Ei, or a semantic equivalent. Not having this control makes rx useless or dangerous for many purposes.

The reason for the reordering is the use of regex-opt behind the scenes. Whether rx is the place to do this kind of optimisation is a matter of opinion; mine is that it belongs in the regexp engine, together with other, more aggressive optimisations (DFA, native-code generation, etc) could be performed as well.

We could determine whether any string is a prefix of another. If not, regexp-opt should be safe to call. Alternatively, this check could be done in regexp-opt (activated by a flag). That would be my preferred short-term solution.

(Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything. Fix it, document it, or turn it into an error?)






^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-02-24 18:40 bug#34641: rx: (or ...) order unpredictable Mattias Engdegård
@ 2019-02-24 19:06 ` Eli Zaretskii
  2019-02-24 21:18   ` Mattias Engdegård
  2019-02-25  2:37 ` Noam Postavsky
  1 sibling, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2019-02-24 19:06 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 34641

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sun, 24 Feb 2019 19:40:33 +0100
> 
> We could determine whether any string is a prefix of another. If not, regexp-opt should be safe to call. Alternatively, this check could be done in regexp-opt (activated by a flag). That would be my preferred short-term solution.

Your preferred solution is fine with me, FWIW.

> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything. Fix it, document it, or turn it into an error?)

Fix it, I think.





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-02-24 19:06 ` Eli Zaretskii
@ 2019-02-24 21:18   ` Mattias Engdegård
  2019-02-24 22:44     ` Basil L. Contovounesios
  2019-03-02 12:33     ` Eli Zaretskii
  0 siblings, 2 replies; 17+ messages in thread
From: Mattias Engdegård @ 2019-02-24 21:18 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 34641

[-- Attachment #1: Type: text/plain, Size: 545 bytes --]

24 feb. 2019 kl. 20.06 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> Your preferred solution is fine with me, FWIW.

Thank you; patch attached.

>> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything. Fix it, document it, or turn it into an error?)
> 
> Fix it, I think.

I'll prepare another patch. Is there a preferred or particularly clever never-matching regexp? If not, would \(?:$\)A do?

[-- Attachment #2: 0001-rx-fix-or-ordering-by-adding-argument-to-regexp-opt.patch --]
[-- Type: application/octet-stream, Size: 7605 bytes --]

From 0ba8d5e51714519d818c519581e699ca82047e66 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sun, 24 Feb 2019 22:12:52 +0100
Subject: [PATCH] rx: fix `or' ordering by adding argument to regexp-opt

The rx `or' form may reorder its arguments in an unpredictable way,
contrary to user expectation, since it sometimes uses `regexp-opt'.
Add a NOREORDER option to `regexp-opt' for preventing it from
producing a reordered regexp (Bug#34641).

* doc/lispref/searching.texi (Regular Expression Functions):
* etc/NEWS (Lisp Changes in Emacs 27.1):
Describe the new regexp-opt NOREORDER argument.
* lisp/emacs-lisp/regexp-opt.el (regexp-opt): Add NOREORDER.
Make no attempt at regexp improvement if the set of strings contains
a prefix of another string.
(regexp-opt--contains-prefix): New.
* lisp/emacs-lisp/rx.el (rx-or): Call regexp-opt with NOREORDER.
* test/lisp/emacs-lisp/rx-tests.el: Test rx `or' form match order.
---
 doc/lispref/searching.texi       | 12 ++++++++---
 etc/NEWS                         |  7 ++++++
 lisp/emacs-lisp/regexp-opt.el    | 37 ++++++++++++++++++++++++++++----
 lisp/emacs-lisp/rx.el            |  2 +-
 test/lisp/emacs-lisp/rx-tests.el | 13 +++++++++++
 5 files changed, 63 insertions(+), 8 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index cfbd2449b1..73a7304a3b 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -950,7 +950,7 @@ whitespace:
 @end defun
 
 @cindex optimize regexp
-@defun regexp-opt strings &optional paren
+@defun regexp-opt strings &optional paren noreorder
 This function returns an efficient regular expression that will match
 any of the strings in the list @var{strings}.  This is useful when you
 need to make matching or searching as fast as possible---for example,
@@ -985,8 +985,14 @@ if it is necessary to ensure that a postfix operator appended to
 it will apply to the whole expression.
 @end table
 
-The resulting regexp of @code{regexp-opt} is equivalent to but usually
-more efficient than that of a simplified version:
+The optional argument @var{noreorder}, if @code{nil}, allows the
+returned regexp to match the strings in any order.  If non-@code{nil},
+the regexp is equivalent to a chain of alternatives (by the @samp{\|}
+operator) of the strings in the order given.
+
+Up to reordering, the resulting regexp of @code{regexp-opt} is
+equivalent to but usually more efficient than that of a simplified
+version:
 
 @example
 (defun simplified-regexp-opt (strings &optional paren)
diff --git a/etc/NEWS b/etc/NEWS
index 67e376d9b3..de9a8defbd 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1614,6 +1614,13 @@ given frame supports resizing.
 This is currently supported on GNUish hosts and on modern versions of
 MS-Windows.
 
++++
+** The function 'regexp-opt' accepts an additional optional argument.
+By default, the regexp returned by 'regexp-opt' may match the strings
+in any order.  If the new third argument is non-nil, the match is
+guaranteed to be performed in the order given, as if the strings were
+made into a regexp by joining them with '\|'.
+
 \f
 * Changes in Emacs 27.1 on Non-Free Operating Systems
 
diff --git a/lisp/emacs-lisp/regexp-opt.el b/lisp/emacs-lisp/regexp-opt.el
index 152dca2309..33a5b770a0 100644
--- a/lisp/emacs-lisp/regexp-opt.el
+++ b/lisp/emacs-lisp/regexp-opt.el
@@ -84,7 +84,7 @@
 ;;; Code:
 
 ;;;###autoload
-(defun regexp-opt (strings &optional paren)
+(defun regexp-opt (strings &optional paren noreorder)
   "Return a regexp to match a string in the list STRINGS.
 Each string should be unique in STRINGS and should not contain
 any regexps, quoted or not.  Optional PAREN specifies how the
@@ -111,8 +111,13 @@ nil
     necessary to ensure that a postfix operator appended to it will
     apply to the whole expression.
 
-The resulting regexp is equivalent to but usually more efficient
-than that of a simplified version:
+The optional argument NOREORDER, if nil, allows the returned
+regexp to match the strings in any order.  If non-nil, the regexp
+is equivalent to a chain of alternatives (by the `\\|' operator)
+of the strings in the order given.
+
+Up to reordering, the resulting regexp is equivalent to but
+usually more efficient than that of a simplified version:
 
  (defun simplified-regexp-opt (strings &optional paren)
    (let ((parens
@@ -133,7 +138,15 @@ than that of a simplified version:
 	   (open (cond ((stringp paren) paren) (paren "\\(")))
 	   (sorted-strings (delete-dups
 			    (sort (copy-sequence strings) 'string-lessp)))
-	   (re (regexp-opt-group sorted-strings (or open t) (not open))))
+	   (re
+            ;; If NOREORDER is non-nil and the list contains a prefix
+            ;; of another string, we give up all attempts at optimisation.
+            ;; There is plenty of room for improvement (Bug#34641).
+            (if (and noreorder (regexp-opt--contains-prefix sorted-strings))
+                (concat (or open "\\(?:")
+                        (mapconcat #'regexp-quote strings "\\|")
+                        "\\)")
+              (regexp-opt-group sorted-strings (or open t) (not open)))))
       (cond ((eq paren 'words)
 	     (concat "\\<" re "\\>"))
 	    ((eq paren 'symbols)
@@ -313,6 +326,22 @@ CHARS should be a list of characters."
           (concat "[" dash caret "]"))
       (concat "[" bracket charset caret dash "]"))))
 
+
+(defun regexp-opt--contains-prefix (strings)
+  "Whether a list of strings contains a proper prefix of one of its elements.
+STRINGS must be sorted and free from duplicates."
+  (let ((s strings))
+    ;; In a lexicographically sorted list, a string always immediately
+    ;; succeeds one of its prefixes.
+    (while (and (cdr s)
+                (not (string-equal
+                      (car s)
+                      (substring (cadr s) 0 (min (length (car s))
+                                                 (length (cadr s)))))))
+      (setq s (cdr s)))
+    (cdr s)))
+
+
 (provide 'regexp-opt)
 
 ;;; regexp-opt.el ends here
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 715cd608c4..ca756efb49 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -393,7 +393,7 @@ FORM is of the form `(and FORM1 ...)'."
   (rx-group-if
    (if (memq nil (mapcar 'stringp (cdr form)))
        (mapconcat (lambda (x) (rx-form x '|)) (cdr form) "\\|")
-     (regexp-opt (cdr form)))
+     (regexp-opt (cdr form) nil t))
    (and (memq rx-parent '(: * t)) rx-parent)))
 
 
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index e14feda347..fa3d9b0d5e 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -92,5 +92,18 @@
                          (*? "e") (+? "f") (\?? "g") (?? "h"))))
                  "a*b+c?d?e*?f+?g??h??")))
 
+(ert-deftest rx-or ()
+  ;; Test or-pattern reordering (Bug#34641).
+  (let ((s "abc"))
+    (should (equal (and (string-match (rx (or "abc" "ab" "a")) s)
+                        (match-string 0 s))
+                   "abc"))
+    (should (equal (and (string-match (rx (or "ab" "abc" "a")) s)
+                        (match-string 0 s))
+                   "ab"))
+    (should (equal (and (string-match (rx (or "a" "ab" "abc")) s)
+                        (match-string 0 s))
+                   "a"))))
+
 (provide 'rx-tests)
 ;; rx-tests.el ends here.
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-02-24 21:18   ` Mattias Engdegård
@ 2019-02-24 22:44     ` Basil L. Contovounesios
  2019-02-25 14:26       ` Mattias Engdegård
  2019-03-02 12:33     ` Eli Zaretskii
  1 sibling, 1 reply; 17+ messages in thread
From: Basil L. Contovounesios @ 2019-02-24 22:44 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 34641

Mattias Engdegård <mattiase@acm.org> writes:

> Is there a preferred or particularly clever never-matching regexp?
> If not, would \(?:$\)A do?

FWIW, CC Mode has used "a\\`" since the following discussion:

https://lists.gnu.org/archive/html/emacs-devel/2018-03/msg00876.html

Stefan also suggested to make a variable out of this, but I don't think
anything came of that:

https://lists.gnu.org/archive/html/emacs-devel/2018-04/msg00047.html

-- 
Basil





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-02-24 18:40 bug#34641: rx: (or ...) order unpredictable Mattias Engdegård
  2019-02-24 19:06 ` Eli Zaretskii
@ 2019-02-25  2:37 ` Noam Postavsky
  2019-02-25  9:56   ` Mattias Engdegård
  1 sibling, 1 reply; 17+ messages in thread
From: Noam Postavsky @ 2019-02-25  2:37 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 34641

On Sun, 24 Feb 2019 at 13:41, Mattias Engdegård <mattiase@acm.org> wrote:
>
> The rx (or ...) construct sometimes reorders its subexpressions, which makes its semantics unpredictable. For example,
>
> (rx (or "ab" "a") (or "a" "ab"))
> =>
> "\\(?:ab?\\)\\(?:ab?\\)"
>
> The user reasonably expects (or e1 e2) to translate to E1\|E2, where ei translates to Ei, or a semantic equivalent.

I don't see the problem, isn't "ab?" semantically equivalent to
"ab\\|a" (and "a\\|ab")?

> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything.

This sounds familiar, though I can't locate a report for it.





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-02-25  2:37 ` Noam Postavsky
@ 2019-02-25  9:56   ` Mattias Engdegård
  2019-02-25 14:43     ` Noam Postavsky
  0 siblings, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2019-02-25  9:56 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: 34641

25 feb. 2019 kl. 03.37 skrev Noam Postavsky <npostavs@gmail.com>:
> 
> I don't see the problem, isn't "ab?" semantically equivalent to
> "ab\\|a" (and "a\\|ab")?

Good question! When the match is anchored at the end, they are indeed equivalent. They also are equivalent for Posix regexps, which prefer the longest match. But in Emacs, the first (leftmost) matching alternative is used.

Suppose we are matching against the string "abc". Then
ab\|a matches "ab"
a\|ab matches "a"
ab?   matches "ab"
ab??  matches "a" (non-greedy operator)

(I remember writing, young and foolish, [0-9]+\|0[xX][0-9a-fA-F]+ to match a number in decimal or hex, and was surprised that all hex numbers were zero.)

>> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything.
> 
> This sounds familiar, though I can't locate a report for it.

If you do remember, please tell us about it.
The `or' operator in SRE can be used with an empty argument list, and will then not match anything. It is a useful limit case for machine-generated regexps.






^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-02-24 22:44     ` Basil L. Contovounesios
@ 2019-02-25 14:26       ` Mattias Engdegård
  0 siblings, 0 replies; 17+ messages in thread
From: Mattias Engdegård @ 2019-02-25 14:26 UTC (permalink / raw)
  To: Basil L. Contovounesios; +Cc: 34641

[-- Attachment #1: Type: text/plain, Size: 224 bytes --]

24 feb. 2019 kl. 23.44 skrev Basil L. Contovounesios <contovob@tcd.ie>:
> 
> FWIW, CC Mode has used "a\\`" since the following discussion:

Thank you, I'll use that then.
Here is a patch (to be applied after the other one).

[-- Attachment #2: 0001-Correct-regexp-opt-return-value-for-empty-string-lis.patch --]
[-- Type: application/octet-stream, Size: 3955 bytes --]

From 28f34a04513254c5bb4507ec6daa510e7ba166da Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Mon, 25 Feb 2019 15:22:02 +0100
Subject: [PATCH] Correct regexp-opt return value for empty string list

When regexp-opt is called with an empty list of strings, return a regexp
that doesn't match anything instead of the empty string (Bug#34641).

* doc/lispref/searching.texi (Regular Expression Functions):
* etc/NEWS:
Document the new behaviour.
* lisp/emacs-lisp/regexp-opt.el (regexp-opt):
Return a never-match regexp for empty inputs.
---
 doc/lispref/searching.texi    |  3 +++
 etc/NEWS                      |  6 ++++++
 lisp/emacs-lisp/regexp-opt.el | 23 +++++++++++++++--------
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 73a7304a3b..0b944a2711 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -960,6 +960,9 @@ possible.  A hand-tuned regular expression can sometimes be slightly
 more efficient, but is almost never worth the effort.}.
 @c E.g., see https://debbugs.gnu.org/2816
 
+If @var{strings} is empty, the return value is a regexp that never
+matches anything.
+
 The optional argument @var{paren} can be any of the following:
 
 @table @asis
diff --git a/etc/NEWS b/etc/NEWS
index 5f7616429b..6506a1c6b5 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1624,6 +1624,12 @@ in any order.  If the new third argument is non-nil, the match is
 guaranteed to be performed in the order given, as if the strings were
 made into a regexp by joining them with '\|'.
 
++++
+** The function 'regexp-opt', when given an empty list of strings, now
+returns a regexp that never matches anything, which is an identity for
+this operation.  Previously, the empty string was returned in this
+case.
+
 \f
 * Changes in Emacs 27.1 on Non-Free Operating Systems
 
diff --git a/lisp/emacs-lisp/regexp-opt.el b/lisp/emacs-lisp/regexp-opt.el
index 33a5b770a0..107b453637 100644
--- a/lisp/emacs-lisp/regexp-opt.el
+++ b/lisp/emacs-lisp/regexp-opt.el
@@ -90,6 +90,9 @@ Each string should be unique in STRINGS and should not contain
 any regexps, quoted or not.  Optional PAREN specifies how the
 returned regexp is surrounded by grouping constructs.
 
+If STRINGS is empty, the return value is a regexp that never
+matches anything.
+
 The optional argument PAREN can be any of the following:
 
 a string
@@ -139,14 +142,18 @@ usually more efficient than that of a simplified version:
 	   (sorted-strings (delete-dups
 			    (sort (copy-sequence strings) 'string-lessp)))
 	   (re
-            ;; If NOREORDER is non-nil and the list contains a prefix
-            ;; of another string, we give up all attempts at optimisation.
-            ;; There is plenty of room for improvement (Bug#34641).
-            (if (and noreorder (regexp-opt--contains-prefix sorted-strings))
-                (concat (or open "\\(?:")
-                        (mapconcat #'regexp-quote strings "\\|")
-                        "\\)")
-              (regexp-opt-group sorted-strings (or open t) (not open)))))
+            (cond
+             ;; No strings: return a\` which cannot match anything.
+             ((null strings)
+              (concat (or open "\\(?:") "a\\`\\)"))
+             ;; If we cannot reorder, give up all attempts at
+             ;; optimisation.  There is room for improvement (Bug#34641).
+             ((and noreorder (regexp-opt--contains-prefix sorted-strings))
+              (concat (or open "\\(?:")
+                      (mapconcat #'regexp-quote strings "\\|")
+                      "\\)"))
+             (t
+              (regexp-opt-group sorted-strings (or open t) (not open))))))
       (cond ((eq paren 'words)
 	     (concat "\\<" re "\\>"))
 	    ((eq paren 'symbols)
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-02-25  9:56   ` Mattias Engdegård
@ 2019-02-25 14:43     ` Noam Postavsky
  2019-02-25 14:48       ` Mattias Engdegård
  0 siblings, 1 reply; 17+ messages in thread
From: Noam Postavsky @ 2019-02-25 14:43 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 34641

On Mon, 25 Feb 2019 at 04:56, Mattias Engdegård <mattiase@acm.org> wrote:

> Good question! When the match is anchored at the end, they are indeed equivalent. They also are equivalent for Posix regexps, which prefer the longest match. But in Emacs, the first (leftmost) matching alternative is used.
>
> Suppose we are matching against the string "abc". Then
> ab\|a matches "ab"
> a\|ab matches "a"

Oh, huh. So it does. I guess I've never used regexp in a situation
where this subtle corner case would come up.

> >> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything.
> >
> > This sounds familiar, though I can't locate a report for it.
>
> If you do remember, please tell us about it.
> The `or' operator in SRE can be used with an empty argument list, and will then not match anything. It is a useful limit case for machine-generated regexps.

Right, found it this time, it's Bug#20307.





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-02-25 14:43     ` Noam Postavsky
@ 2019-02-25 14:48       ` Mattias Engdegård
  0 siblings, 0 replies; 17+ messages in thread
From: Mattias Engdegård @ 2019-02-25 14:48 UTC (permalink / raw)
  To: Noam Postavsky; +Cc: 34641

25 feb. 2019 kl. 15.43 skrev Noam Postavsky <npostavs@gmail.com>:
> 
> Right, found it this time, it's Bug#20307.

Excellent! I'll move this part to that bug then, and update the patch to use that bug number.






^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-02-24 21:18   ` Mattias Engdegård
  2019-02-24 22:44     ` Basil L. Contovounesios
@ 2019-03-02 12:33     ` Eli Zaretskii
  2019-03-02 14:05       ` Mattias Engdegård
  2019-03-02 23:48       ` Phil Sainty
  1 sibling, 2 replies; 17+ messages in thread
From: Eli Zaretskii @ 2019-03-02 12:33 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 34641

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sun, 24 Feb 2019 22:18:28 +0100
> Cc: 34641@debbugs.gnu.org
> 
> > Your preferred solution is fine with me, FWIW.
> 
> Thank you; patch attached.

Thanks, LGTM with minor comments below.

> +The optional argument @var{noreorder}, if @code{nil}, allows the
> +returned regexp to match the strings in any order.  If non-@code{nil},
> +the regexp is equivalent to a chain of alternatives (by the @samp{\|}
> +operator) of the strings in the order given.

I find the text in NEWS much more clear regarding what happens when
the new arg is non-nil.  I think what is missing is a more explicit
description you have in NEWS:

  If the new third argument is non-nil, the match is
  guaranteed to be performed in the order given, as if the strings were
  made into a regexp by joining them with '\|'.

So I suggest to mention explicitly the "match is guaranteed to be
performed in the order given" part.

> +The optional argument NOREORDER, if nil, allows the returned

We usually say "if nil or omitted" for optional arguments.

> +(defun regexp-opt--contains-prefix (strings)
> +  "Whether a list of strings contains a proper prefix of one of its elements.
> +STRINGS must be sorted and free from duplicates."

It is usually a good idea to refer to arguments explicitly in the
first sentence of a doc string.  In this case, I think just up-casing
STRINGS there should be enough.





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-03-02 12:33     ` Eli Zaretskii
@ 2019-03-02 14:05       ` Mattias Engdegård
  2019-03-02 14:08         ` Mattias Engdegård
  2019-03-02 23:48       ` Phil Sainty
  1 sibling, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2019-03-02 14:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 34641

2 mars 2019 kl. 13.33 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> I find the text in NEWS much more clear regarding what happens when
> the new arg is non-nil.  I think what is missing is a more explicit
> description you have in NEWS:
> 
>  If the new third argument is non-nil, the match is
>  guaranteed to be performed in the order given, as if the strings were
>  made into a regexp by joining them with '\|'.
> 
> So I suggest to mention explicitly the "match is guaranteed to be
> performed in the order given" part.

Right; rephrased the doc string and searching.texi.

>> +The optional argument NOREORDER, if nil, allows the returned
> 
> We usually say "if nil or omitted" for optional arguments.

Understood; used in both places.

>> +(defun regexp-opt--contains-prefix (strings)
>> +  "Whether a list of strings contains a proper prefix of one of its elements.
>> +STRINGS must be sorted and free from duplicates."
> 
> It is usually a good idea to refer to arguments explicitly in the
> first sentence of a doc string.  In this case, I think just up-casing
> STRINGS there should be enough.

Rephrased.

Thanks for the review; revised patch attached.






^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-03-02 14:05       ` Mattias Engdegård
@ 2019-03-02 14:08         ` Mattias Engdegård
  2019-03-02 14:23           ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2019-03-02 14:08 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 34641

[-- Attachment #1: Type: text/plain, Size: 146 bytes --]

2 mars 2019 kl. 15.05 skrev Mattias Engdegård <mattiase@acm.org>:
> 
> Thanks for the review; revised patch attached.

Sorry, here it is.


[-- Attachment #2: 0001-rx-fix-or-ordering-by-adding-argument-to-regexp-opt.patch --]
[-- Type: application/octet-stream, Size: 7705 bytes --]

From 78db02e28b451efd6bb9582f5e3e1c38ca47d2f8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sun, 24 Feb 2019 22:12:52 +0100
Subject: [PATCH] rx: fix `or' ordering by adding argument to regexp-opt

The rx `or' form may reorder its arguments in an unpredictable way,
contrary to user expectation, since it sometimes uses `regexp-opt'.
Add a NOREORDER option to `regexp-opt' for preventing it from
producing a reordered regexp (Bug#34641).

* doc/lispref/searching.texi (Regular Expression Functions):
* etc/NEWS (Lisp Changes in Emacs 27.1):
Describe the new regexp-opt NOREORDER argument.
* lisp/emacs-lisp/regexp-opt.el (regexp-opt): Add NOREORDER.
Make no attempt at regexp improvement if the set of strings contains
a prefix of another string.
(regexp-opt--contains-prefix): New.
* lisp/emacs-lisp/rx.el (rx-or): Call regexp-opt with NOREORDER.
* test/lisp/emacs-lisp/rx-tests.el: Test rx `or' form match order.
---
 doc/lispref/searching.texi       | 13 ++++++++---
 etc/NEWS                         |  7 ++++++
 lisp/emacs-lisp/regexp-opt.el    | 38 ++++++++++++++++++++++++++++----
 lisp/emacs-lisp/rx.el            |  2 +-
 test/lisp/emacs-lisp/rx-tests.el | 13 +++++++++++
 5 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index cfbd2449b1..fb7f48474d 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -950,7 +950,7 @@ whitespace:
 @end defun
 
 @cindex optimize regexp
-@defun regexp-opt strings &optional paren
+@defun regexp-opt strings &optional paren noreorder
 This function returns an efficient regular expression that will match
 any of the strings in the list @var{strings}.  This is useful when you
 need to make matching or searching as fast as possible---for example,
@@ -985,8 +985,15 @@ if it is necessary to ensure that a postfix operator appended to
 it will apply to the whole expression.
 @end table
 
-The resulting regexp of @code{regexp-opt} is equivalent to but usually
-more efficient than that of a simplified version:
+The optional argument @var{noreorder}, if @code{nil} or omitted,
+allows the returned regexp to match the strings in any order.  If
+non-@code{nil}, the match is guaranteed to be performed in the order
+given, as if the strings were made into a regexp by joining them with
+the @samp{\|} operator.
+
+Up to reordering, the resulting regexp of @code{regexp-opt} is
+equivalent to but usually more efficient than that of a simplified
+version:
 
 @example
 (defun simplified-regexp-opt (strings &optional paren)
diff --git a/etc/NEWS b/etc/NEWS
index 29ed7ab481..7c95988ff5 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1642,6 +1642,13 @@ MS-Windows.
 ** New module environment function 'process_input' to process user
 input while module code is running.
 
++++
+** The function 'regexp-opt' accepts an additional optional argument.
+By default, the regexp returned by 'regexp-opt' may match the strings
+in any order.  If the new third argument is non-nil, the match is
+guaranteed to be performed in the order given, as if the strings were
+made into a regexp by joining them with '\|'.
+
 \f
 * Changes in Emacs 27.1 on Non-Free Operating Systems
 
diff --git a/lisp/emacs-lisp/regexp-opt.el b/lisp/emacs-lisp/regexp-opt.el
index 63786c1508..d0c5f2d3fc 100644
--- a/lisp/emacs-lisp/regexp-opt.el
+++ b/lisp/emacs-lisp/regexp-opt.el
@@ -84,7 +84,7 @@
 ;;; Code:
 
 ;;;###autoload
-(defun regexp-opt (strings &optional paren)
+(defun regexp-opt (strings &optional paren noreorder)
   "Return a regexp to match a string in the list STRINGS.
 Each string should be unique in STRINGS and should not contain
 any regexps, quoted or not.  Optional PAREN specifies how the
@@ -111,8 +111,14 @@ nil
     necessary to ensure that a postfix operator appended to it will
     apply to the whole expression.
 
-The resulting regexp is equivalent to but usually more efficient
-than that of a simplified version:
+The optional argument NOREORDER, if nil or omitted, allows the
+returned regexp to match the strings in any order.  If non-nil,
+the match is guaranteed to be performed in the order given, as if
+the strings were made into a regexp by joining them with the
+`\\|' operator.
+
+Up to reordering, the resulting regexp is equivalent to but
+usually more efficient than that of a simplified version:
 
  (defun simplified-regexp-opt (strings &optional paren)
    (let ((parens
@@ -133,7 +139,15 @@ than that of a simplified version:
 	   (open (cond ((stringp paren) paren) (paren "\\(")))
 	   (sorted-strings (delete-dups
 			    (sort (copy-sequence strings) 'string-lessp)))
-	   (re (regexp-opt-group sorted-strings (or open t) (not open))))
+	   (re
+            ;; If NOREORDER is non-nil and the list contains a prefix
+            ;; of another string, we give up all attempts at optimisation.
+            ;; There is plenty of room for improvement (Bug#34641).
+            (if (and noreorder (regexp-opt--contains-prefix sorted-strings))
+                (concat (or open "\\(?:")
+                        (mapconcat #'regexp-quote strings "\\|")
+                        "\\)")
+              (regexp-opt-group sorted-strings (or open t) (not open)))))
       (cond ((eq paren 'words)
 	     (concat "\\<" re "\\>"))
 	    ((eq paren 'symbols)
@@ -313,6 +327,22 @@ CHARS should be a list of characters."
           (concat "[" dash caret "]"))
       (concat "[" bracket charset caret dash "]"))))
 
+
+(defun regexp-opt--contains-prefix (strings)
+  "Whether STRINGS contains a proper prefix of one of its other elements.
+STRINGS must be a list of sorted strings without duplicates."
+  (let ((s strings))
+    ;; In a lexicographically sorted list, a string always immediately
+    ;; succeeds one of its prefixes.
+    (while (and (cdr s)
+                (not (string-equal
+                      (car s)
+                      (substring (cadr s) 0 (min (length (car s))
+                                                 (length (cadr s)))))))
+      (setq s (cdr s)))
+    (cdr s)))
+
+
 (provide 'regexp-opt)
 
 ;;; regexp-opt.el ends here
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 715cd608c4..ca756efb49 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -393,7 +393,7 @@ FORM is of the form `(and FORM1 ...)'."
   (rx-group-if
    (if (memq nil (mapcar 'stringp (cdr form)))
        (mapconcat (lambda (x) (rx-form x '|)) (cdr form) "\\|")
-     (regexp-opt (cdr form)))
+     (regexp-opt (cdr form) nil t))
    (and (memq rx-parent '(: * t)) rx-parent)))
 
 
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index e14feda347..fa3d9b0d5e 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -92,5 +92,18 @@
                          (*? "e") (+? "f") (\?? "g") (?? "h"))))
                  "a*b+c?d?e*?f+?g??h??")))
 
+(ert-deftest rx-or ()
+  ;; Test or-pattern reordering (Bug#34641).
+  (let ((s "abc"))
+    (should (equal (and (string-match (rx (or "abc" "ab" "a")) s)
+                        (match-string 0 s))
+                   "abc"))
+    (should (equal (and (string-match (rx (or "ab" "abc" "a")) s)
+                        (match-string 0 s))
+                   "ab"))
+    (should (equal (and (string-match (rx (or "a" "ab" "abc")) s)
+                        (match-string 0 s))
+                   "a"))))
+
 (provide 'rx-tests)
 ;; rx-tests.el ends here.
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-03-02 14:08         ` Mattias Engdegård
@ 2019-03-02 14:23           ` Eli Zaretskii
  2019-03-02 14:37             ` Mattias Engdegård
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2019-03-02 14:23 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 34641

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 2 Mar 2019 15:08:26 +0100
> Cc: 34641@debbugs.gnu.org
> 
> > Thanks for the review; revised patch attached.
> 
> Sorry, here it is.

LGTM, thanks.





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-03-02 14:23           ` Eli Zaretskii
@ 2019-03-02 14:37             ` Mattias Engdegård
  0 siblings, 0 replies; 17+ messages in thread
From: Mattias Engdegård @ 2019-03-02 14:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 34641-done

2 mars 2019 kl. 15.23 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> LGTM, thanks.

Thank you, pushed.






^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-03-02 12:33     ` Eli Zaretskii
  2019-03-02 14:05       ` Mattias Engdegård
@ 2019-03-02 23:48       ` Phil Sainty
  2019-03-03  8:54         ` Mattias Engdegård
  1 sibling, 1 reply; 17+ messages in thread
From: Phil Sainty @ 2019-03-02 23:48 UTC (permalink / raw)
  To: Eli Zaretskii, Mattias Engdegård; +Cc: 34641

>> +The optional argument NOREORDER, if nil, allows the returned

Could we change that name to have a positive sense?

Boolean arguments with a negative sense/meaning are invariably
more awkward to read than ones with a positive meaning, IMO.

NOREORDER set to nil means "No NOREORDER" (aka "Reorder").  Those
double-negatives should be avoided whenever it's simple do do so,
as they make things harder for anyone reading the documentation.

We could use KEEP-ORDER or RETAIN-ORDER or SAME-ORDER or anything
along those lines, and then a 'true' value matches the positive
sense of the name, which is much nicer.


-Phil





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-03-02 23:48       ` Phil Sainty
@ 2019-03-03  8:54         ` Mattias Engdegård
  2019-03-07  9:00           ` Phil Sainty
  0 siblings, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2019-03-03  8:54 UTC (permalink / raw)
  To: Phil Sainty; +Cc: 34641

3 mars 2019 kl. 00.48 skrev Phil Sainty <psainty@orcon.net.nz>:
> 
>>> +The optional argument NOREORDER, if nil, allows the returned
> 
> Could we change that name to have a positive sense?
> 
> Boolean arguments with a negative sense/meaning are invariably
> more awkward to read than ones with a positive meaning, IMO.
> 
> NOREORDER set to nil means "No NOREORDER" (aka "Reorder").  Those
> double-negatives should be avoided whenever it's simple do do so,
> as they make things harder for anyone reading the documentation.
> 
> We could use KEEP-ORDER or RETAIN-ORDER or SAME-ORDER or anything
> along those lines, and then a 'true' value matches the positive
> sense of the name, which is much nicer.

You are right, and I wasn't happy with the negative name either.
Would KEEP-ORDER do?






^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#34641: rx: (or ...) order unpredictable
  2019-03-03  8:54         ` Mattias Engdegård
@ 2019-03-07  9:00           ` Phil Sainty
  0 siblings, 0 replies; 17+ messages in thread
From: Phil Sainty @ 2019-03-07  9:00 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 34641

On 3/03/19 9:54 PM, Mattias Engdegård wrote:
> 3 mars 2019 kl. 00.48 skrev Phil Sainty <psainty@orcon.net.nz>:
>>>> +The optional argument NOREORDER, if nil, allows the returned
>> Could we change that name to have a positive sense?
> 
> You are right, and I wasn't happy with the negative name either.
> Would KEEP-ORDER do?

I think that's good, and no one has objected, so I think you could
go ahead with that change.

cheers,
-Phil





^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2019-03-07  9:00 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-24 18:40 bug#34641: rx: (or ...) order unpredictable Mattias Engdegård
2019-02-24 19:06 ` Eli Zaretskii
2019-02-24 21:18   ` Mattias Engdegård
2019-02-24 22:44     ` Basil L. Contovounesios
2019-02-25 14:26       ` Mattias Engdegård
2019-03-02 12:33     ` Eli Zaretskii
2019-03-02 14:05       ` Mattias Engdegård
2019-03-02 14:08         ` Mattias Engdegård
2019-03-02 14:23           ` Eli Zaretskii
2019-03-02 14:37             ` Mattias Engdegård
2019-03-02 23:48       ` Phil Sainty
2019-03-03  8:54         ` Mattias Engdegård
2019-03-07  9:00           ` Phil Sainty
2019-02-25  2:37 ` Noam Postavsky
2019-02-25  9:56   ` Mattias Engdegård
2019-02-25 14:43     ` Noam Postavsky
2019-02-25 14:48       ` Mattias Engdegård

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).