all messages for Emacs-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: "Mattias Engdegård" <mattiase@acm.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 34641@debbugs.gnu.org
Subject: bug#34641: rx: (or ...) order unpredictable
Date: Sat, 2 Mar 2019 15:08:26 +0100	[thread overview]
Message-ID: <66BC7C81-523F-42D0-87B8-B27835F8EC4E@acm.org> (raw)
In-Reply-To: <0EAAABCF-8AA5-473A-A1E7-4B18040697D8@acm.org>

[-- Attachment #1: Type: text/plain, Size: 146 bytes --]

2 mars 2019 kl. 15.05 skrev Mattias Engdegård <mattiase@acm.org>:
> 
> Thanks for the review; revised patch attached.

Sorry, here it is.


[-- Attachment #2: 0001-rx-fix-or-ordering-by-adding-argument-to-regexp-opt.patch --]
[-- Type: application/octet-stream, Size: 7705 bytes --]

From 78db02e28b451efd6bb9582f5e3e1c38ca47d2f8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sun, 24 Feb 2019 22:12:52 +0100
Subject: [PATCH] rx: fix `or' ordering by adding argument to regexp-opt

The rx `or' form may reorder its arguments in an unpredictable way,
contrary to user expectation, since it sometimes uses `regexp-opt'.
Add a NOREORDER option to `regexp-opt' for preventing it from
producing a reordered regexp (Bug#34641).

* doc/lispref/searching.texi (Regular Expression Functions):
* etc/NEWS (Lisp Changes in Emacs 27.1):
Describe the new regexp-opt NOREORDER argument.
* lisp/emacs-lisp/regexp-opt.el (regexp-opt): Add NOREORDER.
Make no attempt at regexp improvement if the set of strings contains
a prefix of another string.
(regexp-opt--contains-prefix): New.
* lisp/emacs-lisp/rx.el (rx-or): Call regexp-opt with NOREORDER.
* test/lisp/emacs-lisp/rx-tests.el: Test rx `or' form match order.
---
 doc/lispref/searching.texi       | 13 ++++++++---
 etc/NEWS                         |  7 ++++++
 lisp/emacs-lisp/regexp-opt.el    | 38 ++++++++++++++++++++++++++++----
 lisp/emacs-lisp/rx.el            |  2 +-
 test/lisp/emacs-lisp/rx-tests.el | 13 +++++++++++
 5 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index cfbd2449b1..fb7f48474d 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -950,7 +950,7 @@ whitespace:
 @end defun
 
 @cindex optimize regexp
-@defun regexp-opt strings &optional paren
+@defun regexp-opt strings &optional paren noreorder
 This function returns an efficient regular expression that will match
 any of the strings in the list @var{strings}.  This is useful when you
 need to make matching or searching as fast as possible---for example,
@@ -985,8 +985,15 @@ if it is necessary to ensure that a postfix operator appended to
 it will apply to the whole expression.
 @end table
 
-The resulting regexp of @code{regexp-opt} is equivalent to but usually
-more efficient than that of a simplified version:
+The optional argument @var{noreorder}, if @code{nil} or omitted,
+allows the returned regexp to match the strings in any order.  If
+non-@code{nil}, the match is guaranteed to be performed in the order
+given, as if the strings were made into a regexp by joining them with
+the @samp{\|} operator.
+
+Up to reordering, the resulting regexp of @code{regexp-opt} is
+equivalent to but usually more efficient than that of a simplified
+version:
 
 @example
 (defun simplified-regexp-opt (strings &optional paren)
diff --git a/etc/NEWS b/etc/NEWS
index 29ed7ab481..7c95988ff5 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1642,6 +1642,13 @@ MS-Windows.
 ** New module environment function 'process_input' to process user
 input while module code is running.
 
++++
+** The function 'regexp-opt' accepts an additional optional argument.
+By default, the regexp returned by 'regexp-opt' may match the strings
+in any order.  If the new third argument is non-nil, the match is
+guaranteed to be performed in the order given, as if the strings were
+made into a regexp by joining them with '\|'.
+
 \f
 * Changes in Emacs 27.1 on Non-Free Operating Systems
 
diff --git a/lisp/emacs-lisp/regexp-opt.el b/lisp/emacs-lisp/regexp-opt.el
index 63786c1508..d0c5f2d3fc 100644
--- a/lisp/emacs-lisp/regexp-opt.el
+++ b/lisp/emacs-lisp/regexp-opt.el
@@ -84,7 +84,7 @@
 ;;; Code:
 
 ;;;###autoload
-(defun regexp-opt (strings &optional paren)
+(defun regexp-opt (strings &optional paren noreorder)
   "Return a regexp to match a string in the list STRINGS.
 Each string should be unique in STRINGS and should not contain
 any regexps, quoted or not.  Optional PAREN specifies how the
@@ -111,8 +111,14 @@ nil
     necessary to ensure that a postfix operator appended to it will
     apply to the whole expression.
 
-The resulting regexp is equivalent to but usually more efficient
-than that of a simplified version:
+The optional argument NOREORDER, if nil or omitted, allows the
+returned regexp to match the strings in any order.  If non-nil,
+the match is guaranteed to be performed in the order given, as if
+the strings were made into a regexp by joining them with the
+`\\|' operator.
+
+Up to reordering, the resulting regexp is equivalent to but
+usually more efficient than that of a simplified version:
 
  (defun simplified-regexp-opt (strings &optional paren)
    (let ((parens
@@ -133,7 +139,15 @@ than that of a simplified version:
 	   (open (cond ((stringp paren) paren) (paren "\\(")))
 	   (sorted-strings (delete-dups
 			    (sort (copy-sequence strings) 'string-lessp)))
-	   (re (regexp-opt-group sorted-strings (or open t) (not open))))
+	   (re
+            ;; If NOREORDER is non-nil and the list contains a prefix
+            ;; of another string, we give up all attempts at optimisation.
+            ;; There is plenty of room for improvement (Bug#34641).
+            (if (and noreorder (regexp-opt--contains-prefix sorted-strings))
+                (concat (or open "\\(?:")
+                        (mapconcat #'regexp-quote strings "\\|")
+                        "\\)")
+              (regexp-opt-group sorted-strings (or open t) (not open)))))
       (cond ((eq paren 'words)
 	     (concat "\\<" re "\\>"))
 	    ((eq paren 'symbols)
@@ -313,6 +327,22 @@ CHARS should be a list of characters."
           (concat "[" dash caret "]"))
       (concat "[" bracket charset caret dash "]"))))
 
+
+(defun regexp-opt--contains-prefix (strings)
+  "Whether STRINGS contains a proper prefix of one of its other elements.
+STRINGS must be a list of sorted strings without duplicates."
+  (let ((s strings))
+    ;; In a lexicographically sorted list, a string always immediately
+    ;; succeeds one of its prefixes.
+    (while (and (cdr s)
+                (not (string-equal
+                      (car s)
+                      (substring (cadr s) 0 (min (length (car s))
+                                                 (length (cadr s)))))))
+      (setq s (cdr s)))
+    (cdr s)))
+
+
 (provide 'regexp-opt)
 
 ;;; regexp-opt.el ends here
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 715cd608c4..ca756efb49 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -393,7 +393,7 @@ FORM is of the form `(and FORM1 ...)'."
   (rx-group-if
    (if (memq nil (mapcar 'stringp (cdr form)))
        (mapconcat (lambda (x) (rx-form x '|)) (cdr form) "\\|")
-     (regexp-opt (cdr form)))
+     (regexp-opt (cdr form) nil t))
    (and (memq rx-parent '(: * t)) rx-parent)))
 
 
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index e14feda347..fa3d9b0d5e 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -92,5 +92,18 @@
                          (*? "e") (+? "f") (\?? "g") (?? "h"))))
                  "a*b+c?d?e*?f+?g??h??")))
 
+(ert-deftest rx-or ()
+  ;; Test or-pattern reordering (Bug#34641).
+  (let ((s "abc"))
+    (should (equal (and (string-match (rx (or "abc" "ab" "a")) s)
+                        (match-string 0 s))
+                   "abc"))
+    (should (equal (and (string-match (rx (or "ab" "abc" "a")) s)
+                        (match-string 0 s))
+                   "ab"))
+    (should (equal (and (string-match (rx (or "a" "ab" "abc")) s)
+                        (match-string 0 s))
+                   "a"))))
+
 (provide 'rx-tests)
 ;; rx-tests.el ends here.
-- 
2.17.2 (Apple Git-113)


  reply	other threads:[~2019-03-02 14:08 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-24 18:40 bug#34641: rx: (or ...) order unpredictable Mattias Engdegård
2019-02-24 19:06 ` Eli Zaretskii
2019-02-24 21:18   ` Mattias Engdegård
2019-02-24 22:44     ` Basil L. Contovounesios
2019-02-25 14:26       ` Mattias Engdegård
2019-03-02 12:33     ` Eli Zaretskii
2019-03-02 14:05       ` Mattias Engdegård
2019-03-02 14:08         ` Mattias Engdegård [this message]
2019-03-02 14:23           ` Eli Zaretskii
2019-03-02 14:37             ` Mattias Engdegård
2019-03-02 23:48       ` Phil Sainty
2019-03-03  8:54         ` Mattias Engdegård
2019-03-07  9:00           ` Phil Sainty
2019-02-25  2:37 ` Noam Postavsky
2019-02-25  9:56   ` Mattias Engdegård
2019-02-25 14:43     ` Noam Postavsky
2019-02-25 14:48       ` Mattias Engdegård

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=66BC7C81-523F-42D0-87B8-B27835F8EC4E@acm.org \
    --to=mattiase@acm.org \
    --cc=34641@debbugs.gnu.org \
    --cc=eliz@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/emacs.git
	https://git.savannah.gnu.org/cgit/emacs/org-mode.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.