* bug#34641: rx: (or ...) order unpredictable @ 2019-02-24 18:40 Mattias Engdegård 2019-02-24 19:06 ` Eli Zaretskii 2019-02-25 2:37 ` Noam Postavsky 0 siblings, 2 replies; 17+ messages in thread From: Mattias Engdegård @ 2019-02-24 18:40 UTC (permalink / raw) To: 34641 The rx (or ...) construct sometimes reorders its subexpressions, which makes its semantics unpredictable. For example, (rx (or "ab" "a") (or "a" "ab")) => "\\(?:ab?\\)\\(?:ab?\\)" The user reasonably expects (or e1 e2) to translate to E1\|E2, where ei translates to Ei, or a semantic equivalent. Not having this control makes rx useless or dangerous for many purposes. The reason for the reordering is the use of regex-opt behind the scenes. Whether rx is the place to do this kind of optimisation is a matter of opinion; mine is that it belongs in the regexp engine, together with other, more aggressive optimisations (DFA, native-code generation, etc) could be performed as well. We could determine whether any string is a prefix of another. If not, regexp-opt should be safe to call. Alternatively, this check could be done in regexp-opt (activated by a flag). That would be my preferred short-term solution. (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything. Fix it, document it, or turn it into an error?) ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-02-24 18:40 bug#34641: rx: (or ...) order unpredictable Mattias Engdegård @ 2019-02-24 19:06 ` Eli Zaretskii 2019-02-24 21:18 ` Mattias Engdegård 2019-02-25 2:37 ` Noam Postavsky 1 sibling, 1 reply; 17+ messages in thread From: Eli Zaretskii @ 2019-02-24 19:06 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 34641 > From: Mattias Engdegård <mattiase@acm.org> > Date: Sun, 24 Feb 2019 19:40:33 +0100 > > We could determine whether any string is a prefix of another. If not, regexp-opt should be safe to call. Alternatively, this check could be done in regexp-opt (activated by a flag). That would be my preferred short-term solution. Your preferred solution is fine with me, FWIW. > (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything. Fix it, document it, or turn it into an error?) Fix it, I think. ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-02-24 19:06 ` Eli Zaretskii @ 2019-02-24 21:18 ` Mattias Engdegård 2019-02-24 22:44 ` Basil L. Contovounesios 2019-03-02 12:33 ` Eli Zaretskii 0 siblings, 2 replies; 17+ messages in thread From: Mattias Engdegård @ 2019-02-24 21:18 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 34641 [-- Attachment #1: Type: text/plain, Size: 545 bytes --] 24 feb. 2019 kl. 20.06 skrev Eli Zaretskii <eliz@gnu.org>: > > Your preferred solution is fine with me, FWIW. Thank you; patch attached. >> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything. Fix it, document it, or turn it into an error?) > > Fix it, I think. I'll prepare another patch. Is there a preferred or particularly clever never-matching regexp? If not, would \(?:$\)A do? [-- Attachment #2: 0001-rx-fix-or-ordering-by-adding-argument-to-regexp-opt.patch --] [-- Type: application/octet-stream, Size: 7605 bytes --] From 0ba8d5e51714519d818c519581e699ca82047e66 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org> Date: Sun, 24 Feb 2019 22:12:52 +0100 Subject: [PATCH] rx: fix `or' ordering by adding argument to regexp-opt The rx `or' form may reorder its arguments in an unpredictable way, contrary to user expectation, since it sometimes uses `regexp-opt'. Add a NOREORDER option to `regexp-opt' for preventing it from producing a reordered regexp (Bug#34641). * doc/lispref/searching.texi (Regular Expression Functions): * etc/NEWS (Lisp Changes in Emacs 27.1): Describe the new regexp-opt NOREORDER argument. * lisp/emacs-lisp/regexp-opt.el (regexp-opt): Add NOREORDER. Make no attempt at regexp improvement if the set of strings contains a prefix of another string. (regexp-opt--contains-prefix): New. * lisp/emacs-lisp/rx.el (rx-or): Call regexp-opt with NOREORDER. * test/lisp/emacs-lisp/rx-tests.el: Test rx `or' form match order. --- doc/lispref/searching.texi | 12 ++++++++--- etc/NEWS | 7 ++++++ lisp/emacs-lisp/regexp-opt.el | 37 ++++++++++++++++++++++++++++---- lisp/emacs-lisp/rx.el | 2 +- test/lisp/emacs-lisp/rx-tests.el | 13 +++++++++++ 5 files changed, 63 insertions(+), 8 deletions(-) diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi index cfbd2449b1..73a7304a3b 100644 --- a/doc/lispref/searching.texi +++ b/doc/lispref/searching.texi @@ -950,7 +950,7 @@ whitespace: @end defun @cindex optimize regexp -@defun regexp-opt strings &optional paren +@defun regexp-opt strings &optional paren noreorder This function returns an efficient regular expression that will match any of the strings in the list @var{strings}. This is useful when you need to make matching or searching as fast as possible---for example, @@ -985,8 +985,14 @@ if it is necessary to ensure that a postfix operator appended to it will apply to the whole expression. @end table -The resulting regexp of @code{regexp-opt} is equivalent to but usually -more efficient than that of a simplified version: +The optional argument @var{noreorder}, if @code{nil}, allows the +returned regexp to match the strings in any order. If non-@code{nil}, +the regexp is equivalent to a chain of alternatives (by the @samp{\|} +operator) of the strings in the order given. + +Up to reordering, the resulting regexp of @code{regexp-opt} is +equivalent to but usually more efficient than that of a simplified +version: @example (defun simplified-regexp-opt (strings &optional paren) diff --git a/etc/NEWS b/etc/NEWS index 67e376d9b3..de9a8defbd 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -1614,6 +1614,13 @@ given frame supports resizing. This is currently supported on GNUish hosts and on modern versions of MS-Windows. ++++ +** The function 'regexp-opt' accepts an additional optional argument. +By default, the regexp returned by 'regexp-opt' may match the strings +in any order. If the new third argument is non-nil, the match is +guaranteed to be performed in the order given, as if the strings were +made into a regexp by joining them with '\|'. + \f * Changes in Emacs 27.1 on Non-Free Operating Systems diff --git a/lisp/emacs-lisp/regexp-opt.el b/lisp/emacs-lisp/regexp-opt.el index 152dca2309..33a5b770a0 100644 --- a/lisp/emacs-lisp/regexp-opt.el +++ b/lisp/emacs-lisp/regexp-opt.el @@ -84,7 +84,7 @@ ;;; Code: ;;;###autoload -(defun regexp-opt (strings &optional paren) +(defun regexp-opt (strings &optional paren noreorder) "Return a regexp to match a string in the list STRINGS. Each string should be unique in STRINGS and should not contain any regexps, quoted or not. Optional PAREN specifies how the @@ -111,8 +111,13 @@ nil necessary to ensure that a postfix operator appended to it will apply to the whole expression. -The resulting regexp is equivalent to but usually more efficient -than that of a simplified version: +The optional argument NOREORDER, if nil, allows the returned +regexp to match the strings in any order. If non-nil, the regexp +is equivalent to a chain of alternatives (by the `\\|' operator) +of the strings in the order given. + +Up to reordering, the resulting regexp is equivalent to but +usually more efficient than that of a simplified version: (defun simplified-regexp-opt (strings &optional paren) (let ((parens @@ -133,7 +138,15 @@ than that of a simplified version: (open (cond ((stringp paren) paren) (paren "\\("))) (sorted-strings (delete-dups (sort (copy-sequence strings) 'string-lessp))) - (re (regexp-opt-group sorted-strings (or open t) (not open)))) + (re + ;; If NOREORDER is non-nil and the list contains a prefix + ;; of another string, we give up all attempts at optimisation. + ;; There is plenty of room for improvement (Bug#34641). + (if (and noreorder (regexp-opt--contains-prefix sorted-strings)) + (concat (or open "\\(?:") + (mapconcat #'regexp-quote strings "\\|") + "\\)") + (regexp-opt-group sorted-strings (or open t) (not open))))) (cond ((eq paren 'words) (concat "\\<" re "\\>")) ((eq paren 'symbols) @@ -313,6 +326,22 @@ CHARS should be a list of characters." (concat "[" dash caret "]")) (concat "[" bracket charset caret dash "]")))) + +(defun regexp-opt--contains-prefix (strings) + "Whether a list of strings contains a proper prefix of one of its elements. +STRINGS must be sorted and free from duplicates." + (let ((s strings)) + ;; In a lexicographically sorted list, a string always immediately + ;; succeeds one of its prefixes. + (while (and (cdr s) + (not (string-equal + (car s) + (substring (cadr s) 0 (min (length (car s)) + (length (cadr s))))))) + (setq s (cdr s))) + (cdr s))) + + (provide 'regexp-opt) ;;; regexp-opt.el ends here diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el index 715cd608c4..ca756efb49 100644 --- a/lisp/emacs-lisp/rx.el +++ b/lisp/emacs-lisp/rx.el @@ -393,7 +393,7 @@ FORM is of the form `(and FORM1 ...)'." (rx-group-if (if (memq nil (mapcar 'stringp (cdr form))) (mapconcat (lambda (x) (rx-form x '|)) (cdr form) "\\|") - (regexp-opt (cdr form))) + (regexp-opt (cdr form) nil t)) (and (memq rx-parent '(: * t)) rx-parent))) diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el index e14feda347..fa3d9b0d5e 100644 --- a/test/lisp/emacs-lisp/rx-tests.el +++ b/test/lisp/emacs-lisp/rx-tests.el @@ -92,5 +92,18 @@ (*? "e") (+? "f") (\?? "g") (?? "h")))) "a*b+c?d?e*?f+?g??h??"))) +(ert-deftest rx-or () + ;; Test or-pattern reordering (Bug#34641). + (let ((s "abc")) + (should (equal (and (string-match (rx (or "abc" "ab" "a")) s) + (match-string 0 s)) + "abc")) + (should (equal (and (string-match (rx (or "ab" "abc" "a")) s) + (match-string 0 s)) + "ab")) + (should (equal (and (string-match (rx (or "a" "ab" "abc")) s) + (match-string 0 s)) + "a")))) + (provide 'rx-tests) ;; rx-tests.el ends here. -- 2.17.2 (Apple Git-113) ^ permalink raw reply related [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-02-24 21:18 ` Mattias Engdegård @ 2019-02-24 22:44 ` Basil L. Contovounesios 2019-02-25 14:26 ` Mattias Engdegård 2019-03-02 12:33 ` Eli Zaretskii 1 sibling, 1 reply; 17+ messages in thread From: Basil L. Contovounesios @ 2019-02-24 22:44 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 34641 Mattias Engdegård <mattiase@acm.org> writes: > Is there a preferred or particularly clever never-matching regexp? > If not, would \(?:$\)A do? FWIW, CC Mode has used "a\\`" since the following discussion: https://lists.gnu.org/archive/html/emacs-devel/2018-03/msg00876.html Stefan also suggested to make a variable out of this, but I don't think anything came of that: https://lists.gnu.org/archive/html/emacs-devel/2018-04/msg00047.html -- Basil ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-02-24 22:44 ` Basil L. Contovounesios @ 2019-02-25 14:26 ` Mattias Engdegård 0 siblings, 0 replies; 17+ messages in thread From: Mattias Engdegård @ 2019-02-25 14:26 UTC (permalink / raw) To: Basil L. Contovounesios; +Cc: 34641 [-- Attachment #1: Type: text/plain, Size: 224 bytes --] 24 feb. 2019 kl. 23.44 skrev Basil L. Contovounesios <contovob@tcd.ie>: > > FWIW, CC Mode has used "a\\`" since the following discussion: Thank you, I'll use that then. Here is a patch (to be applied after the other one). [-- Attachment #2: 0001-Correct-regexp-opt-return-value-for-empty-string-lis.patch --] [-- Type: application/octet-stream, Size: 3955 bytes --] From 28f34a04513254c5bb4507ec6daa510e7ba166da Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org> Date: Mon, 25 Feb 2019 15:22:02 +0100 Subject: [PATCH] Correct regexp-opt return value for empty string list When regexp-opt is called with an empty list of strings, return a regexp that doesn't match anything instead of the empty string (Bug#34641). * doc/lispref/searching.texi (Regular Expression Functions): * etc/NEWS: Document the new behaviour. * lisp/emacs-lisp/regexp-opt.el (regexp-opt): Return a never-match regexp for empty inputs. --- doc/lispref/searching.texi | 3 +++ etc/NEWS | 6 ++++++ lisp/emacs-lisp/regexp-opt.el | 23 +++++++++++++++-------- 3 files changed, 24 insertions(+), 8 deletions(-) diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi index 73a7304a3b..0b944a2711 100644 --- a/doc/lispref/searching.texi +++ b/doc/lispref/searching.texi @@ -960,6 +960,9 @@ possible. A hand-tuned regular expression can sometimes be slightly more efficient, but is almost never worth the effort.}. @c E.g., see https://debbugs.gnu.org/2816 +If @var{strings} is empty, the return value is a regexp that never +matches anything. + The optional argument @var{paren} can be any of the following: @table @asis diff --git a/etc/NEWS b/etc/NEWS index 5f7616429b..6506a1c6b5 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -1624,6 +1624,12 @@ in any order. If the new third argument is non-nil, the match is guaranteed to be performed in the order given, as if the strings were made into a regexp by joining them with '\|'. ++++ +** The function 'regexp-opt', when given an empty list of strings, now +returns a regexp that never matches anything, which is an identity for +this operation. Previously, the empty string was returned in this +case. + \f * Changes in Emacs 27.1 on Non-Free Operating Systems diff --git a/lisp/emacs-lisp/regexp-opt.el b/lisp/emacs-lisp/regexp-opt.el index 33a5b770a0..107b453637 100644 --- a/lisp/emacs-lisp/regexp-opt.el +++ b/lisp/emacs-lisp/regexp-opt.el @@ -90,6 +90,9 @@ Each string should be unique in STRINGS and should not contain any regexps, quoted or not. Optional PAREN specifies how the returned regexp is surrounded by grouping constructs. +If STRINGS is empty, the return value is a regexp that never +matches anything. + The optional argument PAREN can be any of the following: a string @@ -139,14 +142,18 @@ usually more efficient than that of a simplified version: (sorted-strings (delete-dups (sort (copy-sequence strings) 'string-lessp))) (re - ;; If NOREORDER is non-nil and the list contains a prefix - ;; of another string, we give up all attempts at optimisation. - ;; There is plenty of room for improvement (Bug#34641). - (if (and noreorder (regexp-opt--contains-prefix sorted-strings)) - (concat (or open "\\(?:") - (mapconcat #'regexp-quote strings "\\|") - "\\)") - (regexp-opt-group sorted-strings (or open t) (not open))))) + (cond + ;; No strings: return a\` which cannot match anything. + ((null strings) + (concat (or open "\\(?:") "a\\`\\)")) + ;; If we cannot reorder, give up all attempts at + ;; optimisation. There is room for improvement (Bug#34641). + ((and noreorder (regexp-opt--contains-prefix sorted-strings)) + (concat (or open "\\(?:") + (mapconcat #'regexp-quote strings "\\|") + "\\)")) + (t + (regexp-opt-group sorted-strings (or open t) (not open)))))) (cond ((eq paren 'words) (concat "\\<" re "\\>")) ((eq paren 'symbols) -- 2.17.2 (Apple Git-113) ^ permalink raw reply related [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-02-24 21:18 ` Mattias Engdegård 2019-02-24 22:44 ` Basil L. Contovounesios @ 2019-03-02 12:33 ` Eli Zaretskii 2019-03-02 14:05 ` Mattias Engdegård 2019-03-02 23:48 ` Phil Sainty 1 sibling, 2 replies; 17+ messages in thread From: Eli Zaretskii @ 2019-03-02 12:33 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 34641 > From: Mattias Engdegård <mattiase@acm.org> > Date: Sun, 24 Feb 2019 22:18:28 +0100 > Cc: 34641@debbugs.gnu.org > > > Your preferred solution is fine with me, FWIW. > > Thank you; patch attached. Thanks, LGTM with minor comments below. > +The optional argument @var{noreorder}, if @code{nil}, allows the > +returned regexp to match the strings in any order. If non-@code{nil}, > +the regexp is equivalent to a chain of alternatives (by the @samp{\|} > +operator) of the strings in the order given. I find the text in NEWS much more clear regarding what happens when the new arg is non-nil. I think what is missing is a more explicit description you have in NEWS: If the new third argument is non-nil, the match is guaranteed to be performed in the order given, as if the strings were made into a regexp by joining them with '\|'. So I suggest to mention explicitly the "match is guaranteed to be performed in the order given" part. > +The optional argument NOREORDER, if nil, allows the returned We usually say "if nil or omitted" for optional arguments. > +(defun regexp-opt--contains-prefix (strings) > + "Whether a list of strings contains a proper prefix of one of its elements. > +STRINGS must be sorted and free from duplicates." It is usually a good idea to refer to arguments explicitly in the first sentence of a doc string. In this case, I think just up-casing STRINGS there should be enough. ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-03-02 12:33 ` Eli Zaretskii @ 2019-03-02 14:05 ` Mattias Engdegård 2019-03-02 14:08 ` Mattias Engdegård 2019-03-02 23:48 ` Phil Sainty 1 sibling, 1 reply; 17+ messages in thread From: Mattias Engdegård @ 2019-03-02 14:05 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 34641 2 mars 2019 kl. 13.33 skrev Eli Zaretskii <eliz@gnu.org>: > > I find the text in NEWS much more clear regarding what happens when > the new arg is non-nil. I think what is missing is a more explicit > description you have in NEWS: > > If the new third argument is non-nil, the match is > guaranteed to be performed in the order given, as if the strings were > made into a regexp by joining them with '\|'. > > So I suggest to mention explicitly the "match is guaranteed to be > performed in the order given" part. Right; rephrased the doc string and searching.texi. >> +The optional argument NOREORDER, if nil, allows the returned > > We usually say "if nil or omitted" for optional arguments. Understood; used in both places. >> +(defun regexp-opt--contains-prefix (strings) >> + "Whether a list of strings contains a proper prefix of one of its elements. >> +STRINGS must be sorted and free from duplicates." > > It is usually a good idea to refer to arguments explicitly in the > first sentence of a doc string. In this case, I think just up-casing > STRINGS there should be enough. Rephrased. Thanks for the review; revised patch attached. ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-03-02 14:05 ` Mattias Engdegård @ 2019-03-02 14:08 ` Mattias Engdegård 2019-03-02 14:23 ` Eli Zaretskii 0 siblings, 1 reply; 17+ messages in thread From: Mattias Engdegård @ 2019-03-02 14:08 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 34641 [-- Attachment #1: Type: text/plain, Size: 146 bytes --] 2 mars 2019 kl. 15.05 skrev Mattias Engdegård <mattiase@acm.org>: > > Thanks for the review; revised patch attached. Sorry, here it is. [-- Attachment #2: 0001-rx-fix-or-ordering-by-adding-argument-to-regexp-opt.patch --] [-- Type: application/octet-stream, Size: 7705 bytes --] From 78db02e28b451efd6bb9582f5e3e1c38ca47d2f8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org> Date: Sun, 24 Feb 2019 22:12:52 +0100 Subject: [PATCH] rx: fix `or' ordering by adding argument to regexp-opt The rx `or' form may reorder its arguments in an unpredictable way, contrary to user expectation, since it sometimes uses `regexp-opt'. Add a NOREORDER option to `regexp-opt' for preventing it from producing a reordered regexp (Bug#34641). * doc/lispref/searching.texi (Regular Expression Functions): * etc/NEWS (Lisp Changes in Emacs 27.1): Describe the new regexp-opt NOREORDER argument. * lisp/emacs-lisp/regexp-opt.el (regexp-opt): Add NOREORDER. Make no attempt at regexp improvement if the set of strings contains a prefix of another string. (regexp-opt--contains-prefix): New. * lisp/emacs-lisp/rx.el (rx-or): Call regexp-opt with NOREORDER. * test/lisp/emacs-lisp/rx-tests.el: Test rx `or' form match order. --- doc/lispref/searching.texi | 13 ++++++++--- etc/NEWS | 7 ++++++ lisp/emacs-lisp/regexp-opt.el | 38 ++++++++++++++++++++++++++++---- lisp/emacs-lisp/rx.el | 2 +- test/lisp/emacs-lisp/rx-tests.el | 13 +++++++++++ 5 files changed, 65 insertions(+), 8 deletions(-) diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi index cfbd2449b1..fb7f48474d 100644 --- a/doc/lispref/searching.texi +++ b/doc/lispref/searching.texi @@ -950,7 +950,7 @@ whitespace: @end defun @cindex optimize regexp -@defun regexp-opt strings &optional paren +@defun regexp-opt strings &optional paren noreorder This function returns an efficient regular expression that will match any of the strings in the list @var{strings}. This is useful when you need to make matching or searching as fast as possible---for example, @@ -985,8 +985,15 @@ if it is necessary to ensure that a postfix operator appended to it will apply to the whole expression. @end table -The resulting regexp of @code{regexp-opt} is equivalent to but usually -more efficient than that of a simplified version: +The optional argument @var{noreorder}, if @code{nil} or omitted, +allows the returned regexp to match the strings in any order. If +non-@code{nil}, the match is guaranteed to be performed in the order +given, as if the strings were made into a regexp by joining them with +the @samp{\|} operator. + +Up to reordering, the resulting regexp of @code{regexp-opt} is +equivalent to but usually more efficient than that of a simplified +version: @example (defun simplified-regexp-opt (strings &optional paren) diff --git a/etc/NEWS b/etc/NEWS index 29ed7ab481..7c95988ff5 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -1642,6 +1642,13 @@ MS-Windows. ** New module environment function 'process_input' to process user input while module code is running. ++++ +** The function 'regexp-opt' accepts an additional optional argument. +By default, the regexp returned by 'regexp-opt' may match the strings +in any order. If the new third argument is non-nil, the match is +guaranteed to be performed in the order given, as if the strings were +made into a regexp by joining them with '\|'. + \f * Changes in Emacs 27.1 on Non-Free Operating Systems diff --git a/lisp/emacs-lisp/regexp-opt.el b/lisp/emacs-lisp/regexp-opt.el index 63786c1508..d0c5f2d3fc 100644 --- a/lisp/emacs-lisp/regexp-opt.el +++ b/lisp/emacs-lisp/regexp-opt.el @@ -84,7 +84,7 @@ ;;; Code: ;;;###autoload -(defun regexp-opt (strings &optional paren) +(defun regexp-opt (strings &optional paren noreorder) "Return a regexp to match a string in the list STRINGS. Each string should be unique in STRINGS and should not contain any regexps, quoted or not. Optional PAREN specifies how the @@ -111,8 +111,14 @@ nil necessary to ensure that a postfix operator appended to it will apply to the whole expression. -The resulting regexp is equivalent to but usually more efficient -than that of a simplified version: +The optional argument NOREORDER, if nil or omitted, allows the +returned regexp to match the strings in any order. If non-nil, +the match is guaranteed to be performed in the order given, as if +the strings were made into a regexp by joining them with the +`\\|' operator. + +Up to reordering, the resulting regexp is equivalent to but +usually more efficient than that of a simplified version: (defun simplified-regexp-opt (strings &optional paren) (let ((parens @@ -133,7 +139,15 @@ than that of a simplified version: (open (cond ((stringp paren) paren) (paren "\\("))) (sorted-strings (delete-dups (sort (copy-sequence strings) 'string-lessp))) - (re (regexp-opt-group sorted-strings (or open t) (not open)))) + (re + ;; If NOREORDER is non-nil and the list contains a prefix + ;; of another string, we give up all attempts at optimisation. + ;; There is plenty of room for improvement (Bug#34641). + (if (and noreorder (regexp-opt--contains-prefix sorted-strings)) + (concat (or open "\\(?:") + (mapconcat #'regexp-quote strings "\\|") + "\\)") + (regexp-opt-group sorted-strings (or open t) (not open))))) (cond ((eq paren 'words) (concat "\\<" re "\\>")) ((eq paren 'symbols) @@ -313,6 +327,22 @@ CHARS should be a list of characters." (concat "[" dash caret "]")) (concat "[" bracket charset caret dash "]")))) + +(defun regexp-opt--contains-prefix (strings) + "Whether STRINGS contains a proper prefix of one of its other elements. +STRINGS must be a list of sorted strings without duplicates." + (let ((s strings)) + ;; In a lexicographically sorted list, a string always immediately + ;; succeeds one of its prefixes. + (while (and (cdr s) + (not (string-equal + (car s) + (substring (cadr s) 0 (min (length (car s)) + (length (cadr s))))))) + (setq s (cdr s))) + (cdr s))) + + (provide 'regexp-opt) ;;; regexp-opt.el ends here diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el index 715cd608c4..ca756efb49 100644 --- a/lisp/emacs-lisp/rx.el +++ b/lisp/emacs-lisp/rx.el @@ -393,7 +393,7 @@ FORM is of the form `(and FORM1 ...)'." (rx-group-if (if (memq nil (mapcar 'stringp (cdr form))) (mapconcat (lambda (x) (rx-form x '|)) (cdr form) "\\|") - (regexp-opt (cdr form))) + (regexp-opt (cdr form) nil t)) (and (memq rx-parent '(: * t)) rx-parent))) diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el index e14feda347..fa3d9b0d5e 100644 --- a/test/lisp/emacs-lisp/rx-tests.el +++ b/test/lisp/emacs-lisp/rx-tests.el @@ -92,5 +92,18 @@ (*? "e") (+? "f") (\?? "g") (?? "h")))) "a*b+c?d?e*?f+?g??h??"))) +(ert-deftest rx-or () + ;; Test or-pattern reordering (Bug#34641). + (let ((s "abc")) + (should (equal (and (string-match (rx (or "abc" "ab" "a")) s) + (match-string 0 s)) + "abc")) + (should (equal (and (string-match (rx (or "ab" "abc" "a")) s) + (match-string 0 s)) + "ab")) + (should (equal (and (string-match (rx (or "a" "ab" "abc")) s) + (match-string 0 s)) + "a")))) + (provide 'rx-tests) ;; rx-tests.el ends here. -- 2.17.2 (Apple Git-113) ^ permalink raw reply related [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-03-02 14:08 ` Mattias Engdegård @ 2019-03-02 14:23 ` Eli Zaretskii 2019-03-02 14:37 ` Mattias Engdegård 0 siblings, 1 reply; 17+ messages in thread From: Eli Zaretskii @ 2019-03-02 14:23 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 34641 > From: Mattias Engdegård <mattiase@acm.org> > Date: Sat, 2 Mar 2019 15:08:26 +0100 > Cc: 34641@debbugs.gnu.org > > > Thanks for the review; revised patch attached. > > Sorry, here it is. LGTM, thanks. ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-03-02 14:23 ` Eli Zaretskii @ 2019-03-02 14:37 ` Mattias Engdegård 0 siblings, 0 replies; 17+ messages in thread From: Mattias Engdegård @ 2019-03-02 14:37 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 34641-done 2 mars 2019 kl. 15.23 skrev Eli Zaretskii <eliz@gnu.org>: > > LGTM, thanks. Thank you, pushed. ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-03-02 12:33 ` Eli Zaretskii 2019-03-02 14:05 ` Mattias Engdegård @ 2019-03-02 23:48 ` Phil Sainty 2019-03-03 8:54 ` Mattias Engdegård 1 sibling, 1 reply; 17+ messages in thread From: Phil Sainty @ 2019-03-02 23:48 UTC (permalink / raw) To: Eli Zaretskii, Mattias Engdegård; +Cc: 34641 >> +The optional argument NOREORDER, if nil, allows the returned Could we change that name to have a positive sense? Boolean arguments with a negative sense/meaning are invariably more awkward to read than ones with a positive meaning, IMO. NOREORDER set to nil means "No NOREORDER" (aka "Reorder"). Those double-negatives should be avoided whenever it's simple do do so, as they make things harder for anyone reading the documentation. We could use KEEP-ORDER or RETAIN-ORDER or SAME-ORDER or anything along those lines, and then a 'true' value matches the positive sense of the name, which is much nicer. -Phil ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-03-02 23:48 ` Phil Sainty @ 2019-03-03 8:54 ` Mattias Engdegård 2019-03-07 9:00 ` Phil Sainty 0 siblings, 1 reply; 17+ messages in thread From: Mattias Engdegård @ 2019-03-03 8:54 UTC (permalink / raw) To: Phil Sainty; +Cc: 34641 3 mars 2019 kl. 00.48 skrev Phil Sainty <psainty@orcon.net.nz>: > >>> +The optional argument NOREORDER, if nil, allows the returned > > Could we change that name to have a positive sense? > > Boolean arguments with a negative sense/meaning are invariably > more awkward to read than ones with a positive meaning, IMO. > > NOREORDER set to nil means "No NOREORDER" (aka "Reorder"). Those > double-negatives should be avoided whenever it's simple do do so, > as they make things harder for anyone reading the documentation. > > We could use KEEP-ORDER or RETAIN-ORDER or SAME-ORDER or anything > along those lines, and then a 'true' value matches the positive > sense of the name, which is much nicer. You are right, and I wasn't happy with the negative name either. Would KEEP-ORDER do? ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-03-03 8:54 ` Mattias Engdegård @ 2019-03-07 9:00 ` Phil Sainty 0 siblings, 0 replies; 17+ messages in thread From: Phil Sainty @ 2019-03-07 9:00 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 34641 On 3/03/19 9:54 PM, Mattias Engdegård wrote: > 3 mars 2019 kl. 00.48 skrev Phil Sainty <psainty@orcon.net.nz>: >>>> +The optional argument NOREORDER, if nil, allows the returned >> Could we change that name to have a positive sense? > > You are right, and I wasn't happy with the negative name either. > Would KEEP-ORDER do? I think that's good, and no one has objected, so I think you could go ahead with that change. cheers, -Phil ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-02-24 18:40 bug#34641: rx: (or ...) order unpredictable Mattias Engdegård 2019-02-24 19:06 ` Eli Zaretskii @ 2019-02-25 2:37 ` Noam Postavsky 2019-02-25 9:56 ` Mattias Engdegård 1 sibling, 1 reply; 17+ messages in thread From: Noam Postavsky @ 2019-02-25 2:37 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 34641 On Sun, 24 Feb 2019 at 13:41, Mattias Engdegård <mattiase@acm.org> wrote: > > The rx (or ...) construct sometimes reorders its subexpressions, which makes its semantics unpredictable. For example, > > (rx (or "ab" "a") (or "a" "ab")) > => > "\\(?:ab?\\)\\(?:ab?\\)" > > The user reasonably expects (or e1 e2) to translate to E1\|E2, where ei translates to Ei, or a semantic equivalent. I don't see the problem, isn't "ab?" semantically equivalent to "ab\\|a" (and "a\\|ab")? > (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything. This sounds familiar, though I can't locate a report for it. ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-02-25 2:37 ` Noam Postavsky @ 2019-02-25 9:56 ` Mattias Engdegård 2019-02-25 14:43 ` Noam Postavsky 0 siblings, 1 reply; 17+ messages in thread From: Mattias Engdegård @ 2019-02-25 9:56 UTC (permalink / raw) To: Noam Postavsky; +Cc: 34641 25 feb. 2019 kl. 03.37 skrev Noam Postavsky <npostavs@gmail.com>: > > I don't see the problem, isn't "ab?" semantically equivalent to > "ab\\|a" (and "a\\|ab")? Good question! When the match is anchored at the end, they are indeed equivalent. They also are equivalent for Posix regexps, which prefer the longest match. But in Emacs, the first (leftmost) matching alternative is used. Suppose we are matching against the string "abc". Then ab\|a matches "ab" a\|ab matches "a" ab? matches "ab" ab?? matches "a" (non-greedy operator) (I remember writing, young and foolish, [0-9]+\|0[xX][0-9a-fA-F]+ to match a number in decimal or hex, and was surprised that all hex numbers were zero.) >> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything. > > This sounds familiar, though I can't locate a report for it. If you do remember, please tell us about it. The `or' operator in SRE can be used with an empty argument list, and will then not match anything. It is a useful limit case for machine-generated regexps. ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-02-25 9:56 ` Mattias Engdegård @ 2019-02-25 14:43 ` Noam Postavsky 2019-02-25 14:48 ` Mattias Engdegård 0 siblings, 1 reply; 17+ messages in thread From: Noam Postavsky @ 2019-02-25 14:43 UTC (permalink / raw) To: Mattias Engdegård; +Cc: 34641 On Mon, 25 Feb 2019 at 04:56, Mattias Engdegård <mattiase@acm.org> wrote: > Good question! When the match is anchored at the end, they are indeed equivalent. They also are equivalent for Posix regexps, which prefer the longest match. But in Emacs, the first (leftmost) matching alternative is used. > > Suppose we are matching against the string "abc". Then > ab\|a matches "ab" > a\|ab matches "a" Oh, huh. So it does. I guess I've never used regexp in a situation where this subtle corner case would come up. > >> (Speaking of regexp-opt, it has another bug that does not affect rx: it returns the empty string if given an empty list of strings. The correct return value is a regexp that never matches anything. > > > > This sounds familiar, though I can't locate a report for it. > > If you do remember, please tell us about it. > The `or' operator in SRE can be used with an empty argument list, and will then not match anything. It is a useful limit case for machine-generated regexps. Right, found it this time, it's Bug#20307. ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#34641: rx: (or ...) order unpredictable 2019-02-25 14:43 ` Noam Postavsky @ 2019-02-25 14:48 ` Mattias Engdegård 0 siblings, 0 replies; 17+ messages in thread From: Mattias Engdegård @ 2019-02-25 14:48 UTC (permalink / raw) To: Noam Postavsky; +Cc: 34641 25 feb. 2019 kl. 15.43 skrev Noam Postavsky <npostavs@gmail.com>: > > Right, found it this time, it's Bug#20307. Excellent! I'll move this part to that bug then, and update the patch to use that bug number. ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2019-03-07 9:00 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2019-02-24 18:40 bug#34641: rx: (or ...) order unpredictable Mattias Engdegård 2019-02-24 19:06 ` Eli Zaretskii 2019-02-24 21:18 ` Mattias Engdegård 2019-02-24 22:44 ` Basil L. Contovounesios 2019-02-25 14:26 ` Mattias Engdegård 2019-03-02 12:33 ` Eli Zaretskii 2019-03-02 14:05 ` Mattias Engdegård 2019-03-02 14:08 ` Mattias Engdegård 2019-03-02 14:23 ` Eli Zaretskii 2019-03-02 14:37 ` Mattias Engdegård 2019-03-02 23:48 ` Phil Sainty 2019-03-03 8:54 ` Mattias Engdegård 2019-03-07 9:00 ` Phil Sainty 2019-02-25 2:37 ` Noam Postavsky 2019-02-25 9:56 ` Mattias Engdegård 2019-02-25 14:43 ` Noam Postavsky 2019-02-25 14:48 ` Mattias Engdegård
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).