bug#37659: rx additions: anychar, unmatchable, unordered-or

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#37659: rx additions: anychar, unmatchable, unordered-or
@ 2019-10-08  9:36 Mattias Engdegård
  2019-10-09  8:59 ` Mattias Engdegård
  2019-10-11 23:07 ` bug#37659: Mattias Engdegård <mattiase <at> acm.org> Paul Eggert
  0 siblings, 2 replies; 32+ messages in thread
From: Mattias Engdegård @ 2019-10-08  9:36 UTC (permalink / raw)
  To: 37659

[-- Attachment #1: Type: text/plain, Size: 969 bytes --]

Three minor rx additions follow:

* Add `anychar' as an alias for `anything': the latter suggests an expression that can match any string, while in reality it only matches a single character. The documentation now uses `anychar' as the preferred name. (`any-char' would also be possible, but is longer.)

* Add `unmatchable' for a never-match regexp. This follows the previously introduced variable `regexp-unmatchable'.

* Add `unordered-or' as a variant of `or' without the left-to-right match order guarantee. It allows unconditional regexp-opt optimisations, and is particularly useful for matching sets of keywords. With rx-let and rx-define, it also has the potential for better compositionality, allowing expressions to be put together from smaller parts.

Abstractly: while `or' is associative, `unordered-or' is also commutative.

The name `unordered-or' is descriptive but phonetically (and lexically) somewhat weak. Strong alternatives welcome.


[-- Attachment #2: 0001-Add-anychar-as-alias-to-anything-in-rx.patch --]
[-- Type: application/octet-stream, Size: 4257 bytes --]

From 0b79693cdace549a8d6edda58cea20232d82faec Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Mon, 7 Oct 2019 18:07:16 +0200
Subject: [PATCH 1/3] Add `anychar' as alias to `anything' in rx

* lisp/emacs-lisp/rx.el (rx--translate-symbol, rx--builtin-symbols, rx):
* test/lisp/emacs-lisp/rx-tests.el (rx-atoms):
* doc/lispref/searching.texi (Rx Constructs):
* etc/NEWS:
Add `anychar', an alias for `anything'.  Since `anychar' is more
descriptive (and slightly shorter), treat it as the preferred name.
---
 doc/lispref/searching.texi       | 3 ++-
 etc/NEWS                         | 4 ++++
 lisp/emacs-lisp/rx.el            | 7 +++----
 test/lisp/emacs-lisp/rx-tests.el | 4 ++--
 4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index a4b6533412..2274bab002 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1220,7 +1220,8 @@ Rx Constructs
 Match any character except a newline.@*
 Corresponding string regexp: @samp{.} (dot)
 
-@item @code{anything}
+@item @code{anychar}, @code{anything}
+@cindex @code{anychar} in rx
 @cindex @code{anything} in rx
 Match any character.@*
 Corresponding string regexp: @samp{.\|\n} (for example)
diff --git a/etc/NEWS b/etc/NEWS
index 906dc912d6..0fea5164a7 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1795,6 +1795,10 @@ at run time, instead of a constant string.
 *** New rx extension mechanism: 'rx-define', 'rx-let', 'rx-let-eval'.
 These macros add new forms to the rx notation.
 
++++
+*** 'anychar' is now an alias for 'anything'
+Both match any single character; 'anychar' is more descriptive.
+
 ** Frames
 
 +++
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 45fec796cc..6c0b206930 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -126,7 +126,6 @@ rx--lookup-def
       (get name 'rx-definition)))
 
 ;; TODO: Additions to consider:
-;; - A better name for `anything', like `any-char' or `anychar'.
 ;; - A name for (or), maybe `unmatchable'.
 ;; - A construct like `or' but without the match order guarantee,
 ;;   maybe `unordered-or'.  Useful for composition or generation of
@@ -138,7 +137,7 @@ rx--translate-symbol
     ;; Use `list' instead of a quoted list to wrap the strings here,
     ;; since the return value may be mutated.
     ((or 'nonl 'not-newline 'any) (cons (list ".") t))
-    ('anything                    (rx--translate-form '(or nonl "\n")))
+    ((or 'anychar 'anything)      (rx--translate-form '(or nonl "\n")))
     ((or 'bol 'line-start)        (cons (list "^") 'lseq))
     ((or 'eol 'line-end)          (cons (list "$") 'rseq))
     ((or 'bos 'string-start 'bot 'buffer-start) (cons (list "\\`") t))
@@ -913,7 +912,7 @@ rx--builtin-forms
   "List of built-in rx function-like symbols.")
 
 (defconst rx--builtin-symbols
-  (append '(nonl not-newline any anything
+  (append '(nonl not-newline any anychar anything
             bol eol line-start line-end
             bos eos string-start string-end
             bow eow word-start word-end
@@ -1016,7 +1015,7 @@ rx
                 can be (any ...), (syntax ...), (category ...),
                 or a character class.
 not-newline     Match any character except a newline.  Alias: nonl.
-anything        Match any character.
+anychar         Match any character.  Alias: anything.
 
 CHARCLASS       Match a character from a character class.  One of:
  alpha, alphabetic, letter   Alphabetic characters (defined by Unicode).
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index 76dcf41942..d4524e5a25 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -184,8 +184,8 @@ rx-repeat
                  "ab")))
 
 (ert-deftest rx-atoms ()
-  (should (equal (rx anything)
-                 ".\\|\n"))
+  (should (equal (rx anychar anything)
+                 "\\(?:.\\|\n\\)\\(?:.\\|\n\\)"))
   (should (equal (rx line-start not-newline nonl any line-end)
                  "^...$"))
   (should (equal (rx bol string-start string-end buffer-start buffer-end
-- 
2.21.0 (Apple Git-122)


[-- Attachment #3: 0002-Add-unmatchable-as-alias-for-or-in-rx.patch --]
[-- Type: application/octet-stream, Size: 4453 bytes --]

From 6ad34bf9aa86aee6539851366c5267fa8a72d929 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Mon, 7 Oct 2019 18:28:18 +0200
Subject: [PATCH 2/3] Add `unmatchable' as alias for (or) in rx

* lisp/emacs-lisp/rx.el (rx--translate-symbol, rx--builtin-symbols, rx):
* test/lisp/emacs-lisp/rx-tests.el (rx-atoms):
* doc/lispref/searching.texi (Rx Constructs):
* etc/NEWS:
Add `unmatchable', more descriptive than (or), and corresponding to
the variable `regexp-unmatchable'.
---
 doc/lispref/searching.texi       | 6 ++++++
 etc/NEWS                         | 1 +
 lisp/emacs-lisp/rx.el            | 5 +++--
 test/lisp/emacs-lisp/rx-tests.el | 2 ++
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 2274bab002..a6c6bf2d4a 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1083,6 +1083,11 @@ Rx Constructs
 Match exactly one of the @var{rx}s, trying from left to right.
 Without arguments, the expression will not match anything at all.@*
 Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
+
+@item @code{unmatchable}
+@cindex @code{unmatchable} in rx
+Refuse any match.  Equivalent to @code{(or)}.
+@xref{regexp-unmatchable}.
 @end table
 
 @subsubheading Repetition
@@ -1806,6 +1811,7 @@ Regexp Functions
 
 @c Internal functions: regexp-opt-group
 
+@anchor{regexp-unmatchable}
 @defvar regexp-unmatchable
 This variable contains a regexp that is guaranteed not to match any
 string at all.  It is particularly useful as default value for
diff --git a/etc/NEWS b/etc/NEWS
index 0fea5164a7..96c26c6623 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1785,6 +1785,7 @@ the 128...255 range, as expected.
 matches the empty string, each being an identity for the operation.
 This also works for their aliases: '|' for 'or'; ':', 'and' and
 'sequence' for 'seq'.
+The symbol 'unmatchable' can be used as an alternative to (or).
 
 ---
 *** 'regexp' and new 'literal' accept arbitrary lisp as arguments.
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 6c0b206930..cf02df239f 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -126,7 +126,6 @@ rx--lookup-def
       (get name 'rx-definition)))
 
 ;; TODO: Additions to consider:
-;; - A name for (or), maybe `unmatchable'.
 ;; - A construct like `or' but without the match order guarantee,
 ;;   maybe `unordered-or'.  Useful for composition or generation of
 ;;   alternatives; permits more effective use of regexp-opt.
@@ -138,6 +137,7 @@ rx--translate-symbol
     ;; since the return value may be mutated.
     ((or 'nonl 'not-newline 'any) (cons (list ".") t))
     ((or 'anychar 'anything)      (rx--translate-form '(or nonl "\n")))
+    ('unmatchable                 (rx--empty))
     ((or 'bol 'line-start)        (cons (list "^") 'lseq))
     ((or 'eol 'line-end)          (cons (list "$") 'rseq))
     ((or 'bos 'string-start 'bot 'buffer-start) (cons (list "\\`") t))
@@ -912,7 +912,7 @@ rx--builtin-forms
   "List of built-in rx function-like symbols.")
 
 (defconst rx--builtin-symbols
-  (append '(nonl not-newline any anychar anything
+  (append '(nonl not-newline any anychar anything unmatchable
             bol eol line-start line-end
             bos eos string-start string-end
             bow eow word-start word-end
@@ -1016,6 +1016,7 @@ rx
                 or a character class.
 not-newline     Match any character except a newline.  Alias: nonl.
 anychar         Match any character.  Alias: anything.
+unmatchable     Never match anything at all.
 
 CHARCLASS       Match a character from a character class.  One of:
  alpha, alphabetic, letter   Alphabetic characters (defined by Unicode).
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index d4524e5a25..903b191c98 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -186,6 +186,8 @@ rx-repeat
 (ert-deftest rx-atoms ()
   (should (equal (rx anychar anything)
                  "\\(?:.\\|\n\\)\\(?:.\\|\n\\)"))
+  (should (equal (rx unmatchable)
+                 "\\`a\\`"))
   (should (equal (rx line-start not-newline nonl any line-end)
                  "^...$"))
   (should (equal (rx bol string-start string-end buffer-start buffer-end
-- 
2.21.0 (Apple Git-122)


[-- Attachment #4: 0003-Add-rx-unordered-or-construct.patch --]
[-- Type: application/octet-stream, Size: 5505 bytes --]

From 55bbcccf26e95bcb69a27431d72a802b7e457a75 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Mon, 7 Oct 2019 19:39:08 +0200
Subject: [PATCH 3/3] Add rx `unordered-or' construct

* lisp/emacs-lisp/rx.el (rx--translate-or, rx--translate-form)
(rx--builting-forms, rx):
* test/lisp/emacs-lisp/rx-tests.el (rx-unordered-or):
* doc/lispref/searching.texi (Rx Constructs):
* etc/NEWS:
Add `unordered-or', like `or' but with unconstrained matching order.
---
 doc/lispref/searching.texi       |  8 ++++++++
 etc/NEWS                         |  5 +++++
 lisp/emacs-lisp/rx.el            | 16 +++++++---------
 test/lisp/emacs-lisp/rx-tests.el |  8 ++++++++
 4 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index a6c6bf2d4a..faeedc5978 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1084,6 +1084,13 @@ Rx Constructs
 Without arguments, the expression will not match anything at all.@*
 Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
 
+@item @code{(unordered-or @var{rx}@dots{})}
+@cindex @code{unordered-or} in rx
+Like @code{or}, but with unspecified matching order.
+This may be more efficient when the order doesn't matter,
+in particular if all subforms are string literals.
+@xref{regexp-opt}.
+
 @item @code{unmatchable}
 @cindex @code{unmatchable} in rx
 Refuse any match.  Equivalent to @code{(or)}.
@@ -1728,6 +1735,7 @@ Regexp Functions
 any special characters.
 @end defun
 
+@anchor{regexp-opt}
 @cindex optimize regexp
 @defun regexp-opt strings &optional paren keep-order
 This function returns an efficient regular expression that will match
diff --git a/etc/NEWS b/etc/NEWS
index 96c26c6623..cf3ef8183b 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1800,6 +1800,11 @@ These macros add new forms to the rx notation.
 *** 'anychar' is now an alias for 'anything'
 Both match any single character; 'anychar' is more descriptive.
 
++++
+*** New 'unordered-or' rx construct
+It works like 'or', but with unspecified matching order.  It may be
+faster in some cases, especially when the clauses are string literals.
+
 ** Frames
 
 +++
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index cf02df239f..0b14144698 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -125,11 +125,6 @@ rx--lookup-def
   (or (cdr (assq name rx--local-definitions))
       (get name 'rx-definition)))
 
-;; TODO: Additions to consider:
-;; - A construct like `or' but without the match order guarantee,
-;;   maybe `unordered-or'.  Useful for composition or generation of
-;;   alternatives; permits more effective use of regexp-opt.
-
 (defun rx--translate-symbol (sym)
   "Translate an rx symbol.  Return (REGEXP . PRECEDENCE)."
   (pcase sym
@@ -230,8 +225,9 @@ rx--every
     (setq list (cdr list)))
   (null list))
 
-(defun rx--translate-or (body)
+(defun rx--translate-or (body unordered)
   "Translate an or-pattern of one of more rx items.
+If UNORDERED, then matching order is unspecified.
 Return (REGEXP . PRECEDENCE)."
   ;; FIXME: Possible improvements:
   ;;
@@ -268,7 +264,7 @@ rx--translate-or
    ((null (cdr body))              ; Single item.
     (rx--translate (car body)))
    ((rx--every #'stringp body)     ; All strings.
-    (cons (list (regexp-opt body nil t))
+    (cons (list (regexp-opt body nil (not unordered)))
           t))
    (t
     (cons (append (car (rx--translate (car body)))
@@ -835,7 +831,8 @@ rx--translate-form
   (let ((body (cdr form)))
     (pcase (car form)
       ((or 'seq : 'and 'sequence) (rx--translate-seq body))
-      ((or 'or '|)              (rx--translate-or body))
+      ((or 'or '|)              (rx--translate-or body nil))
+      ((or 'unordered-or)       (rx--translate-or body t))
       ((or 'any 'in 'char)      (rx--translate-any nil body))
       ('not-char                (rx--translate-any t body))
       ('not                     (rx--translate-not nil body))
@@ -899,7 +896,7 @@ rx--translate-form
                (error "Unknown rx form `%s'" op)))))))))
 
 (defconst rx--builtin-forms
-  '(seq sequence : and or | any in char not-char not
+  '(seq sequence : and or | unordered-or any in char not-char not
     repeat = >= **
     zero-or-more 0+ *
     one-or-more 1+ +
@@ -990,6 +987,7 @@ rx
 
 (seq RX...)    Match the RXs in sequence.  Alias: :, sequence, and.
 (or RX...)     Match one of the RXs.  Alias: |.
+(unordered-or RX...) Match one of the RXs, in unspecified order.
 
 (zero-or-more RX...) Match RXs zero or more times.  Alias: 0+.
 (one-or-more RX...)  Match RXs one or more times.  Alias: 1+.
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index 903b191c98..bced74569f 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -49,6 +49,14 @@ rx-or
   (should (equal (rx (|))
                  "\\`a\\`")))
 
+(ert-deftest rx-unordered-or ()
+  (should (equal (rx (unordered-or "ab" nonl "cd"))
+                 "ab\\|.\\|cd"))
+  (should (equal (rx (unordered-or "ab" "abc" "a"))
+                 "\\(?:a\\(?:bc?\\)?\\)"))
+  (should (equal (rx (unordered-or))
+                 "\\`a\\`")))
+
 (ert-deftest rx-char-any ()
   "Test character alternatives with `]' and `-' (Bug#25123)."
   (should (equal
-- 
2.21.0 (Apple Git-122)


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-08  9:36 bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
@ 2019-10-09  8:59 ` Mattias Engdegård
  2019-10-11 23:07 ` bug#37659: Mattias Engdegård <mattiase <at> acm.org> Paul Eggert
  1 sibling, 0 replies; 32+ messages in thread
From: Mattias Engdegård @ 2019-10-09  8:59 UTC (permalink / raw)
  To: 37659

[-- Attachment #1: Type: text/plain, Size: 193 bytes --]

Also consider changing the rendition of anychar/anything from ".\\|\n" to "[^z-a]", which is faster and does not allocate stack space. Previously, (* anything) wouldn't match large strings.


[-- Attachment #2: 0004-Use-z-a-for-matching-any-character-anychar-anything-.patch --]
[-- Type: application/octet-stream, Size: 2502 bytes --]

From c72633774b375eaadd6117eb0b26fb9792fed1bd Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Wed, 9 Oct 2019 10:22:10 +0200
Subject: [PATCH 4/4] Use [^z-a] for matching any character (anychar/anything)
 in rx
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* lisp/emacs-lisp/rx.el (rx--translate-symbol):
* test/lisp/emacs-lisp/rx-tests.el (rx-any, rx-atoms):
Use [^z-a] instead of ".\\|\n" for anychar.

The new expression is faster (about 2×) and does not allocate regexp
stack space.  For example, (0+ anychar) now matches strings of any
size.
---
 lisp/emacs-lisp/rx.el            | 2 +-
 test/lisp/emacs-lisp/rx-tests.el | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 0b14144698..2f58033ffd 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -131,7 +131,7 @@ rx--translate-symbol
     ;; Use `list' instead of a quoted list to wrap the strings here,
     ;; since the return value may be mutated.
     ((or 'nonl 'not-newline 'any) (cons (list ".") t))
-    ((or 'anychar 'anything)      (rx--translate-form '(or nonl "\n")))
+    ((or 'anychar 'anything)      (cons (list "[^z-a]") t))
     ('unmatchable                 (rx--empty))
     ((or 'bol 'line-start)        (cons (list "^") 'lseq))
     ((or 'eol 'line-end)          (cons (list "$") 'rseq))
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index bced74569f..a098784a85 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -134,9 +134,9 @@ rx-any
   (should (equal (rx (not (any "!a" "0-8" digit nonascii)))
                  "[^!0-8a[:digit:][:nonascii:]]"))
   (should (equal (rx (any) (not (any)))
-                 "\\`a\\`\\(?:.\\|\n\\)"))
+                 "\\`a\\`[^z-a]"))
   (should (equal (rx (any "") (not (any "")))
-                 "\\`a\\`\\(?:.\\|\n\\)")))
+                 "\\`a\\`[^z-a]")))
 
 (ert-deftest rx-pcase ()
   (should (equal (pcase "a 1 2 3 1 1 b"
@@ -193,7 +193,7 @@ rx-repeat
 
 (ert-deftest rx-atoms ()
   (should (equal (rx anychar anything)
-                 "\\(?:.\\|\n\\)\\(?:.\\|\n\\)"))
+                 "[^z-a][^z-a]"))
   (should (equal (rx unmatchable)
                  "\\`a\\`"))
   (should (equal (rx line-start not-newline nonl any line-end)
-- 
2.21.0 (Apple Git-122)


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* bug#37659: Mattias Engdegård <mattiase <at> acm.org>
  2019-10-08  9:36 bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
  2019-10-09  8:59 ` Mattias Engdegård
@ 2019-10-11 23:07 ` Paul Eggert
  2019-10-12 10:47   ` Mattias Engdegård
  1 sibling, 1 reply; 32+ messages in thread
From: Paul Eggert @ 2019-10-11 23:07 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 37659

Thanks for the proposed patch. Two thoughts:

1. Instead of the symbol 'unordered-or' (which is remarkably hard to 
read), I suggest using the ASCII letter 'V'. This ASCIIfies the Unicode 
symbol U+2228 LOGICAL OR (∨). If you prefer, you could make the Unicode 
symbol an alias for 'V', or use lower-case ASCII 'v', or whatever. The 
point is that '(unordered-or A B)' is too hard to read with all those 
'or's in there.

2. Re this patch:

> -    ((or 'anychar 'anything)      (rx--translate-form '(or nonl "\n")))
> +    ((or 'anychar 'anything)      (cons (list "[^z-a]") t))

Is there a reason this uses (cons (list "[^z-a]") t) rather than 
'(("[^z-a]") . t) ? I realize neighboring code does something similar, 
but it's not clear to me why it's important to construct new objects 
here instead of using literals.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: Mattias Engdegård <mattiase <at> acm.org>
  2019-10-11 23:07 ` bug#37659: Mattias Engdegård <mattiase <at> acm.org> Paul Eggert
@ 2019-10-12 10:47   ` Mattias Engdegård
  2019-10-13 16:52     ` Paul Eggert
  0 siblings, 1 reply; 32+ messages in thread
From: Mattias Engdegård @ 2019-10-12 10:47 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 37659

12 okt. 2019 kl. 01.07 skrev Paul Eggert <eggert@cs.ucla.edu>:
> 
> 1. Instead of the symbol 'unordered-or' (which is remarkably hard to read), I suggest using the ASCII letter 'V'. This ASCIIfies the Unicode symbol U+2228 LOGICAL OR (∨). If you prefer, you could make the Unicode symbol an alias for 'V', or use lower-case ASCII 'v', or whatever. The point is that '(unordered-or A B)' is too hard to read with all those 'or's in there.

Definitely agree on the imperfections of 'unordered-or', and while I'd be the first to welcome more use of Unicode symbols, I'm not sure V (or v, or ∨) are very descriptive --- even if an alert reader intuits the rebus of 'V' (perhaps via \vee in TeX), there is no hint of the difference from 'or' or '|'.

Other suggestions:

'or*' --- follows the Lisp tradition of appending a star to get a variant and informs the reader that it's like 'or' but with a twist. The downside is that it might suggest a Kleene closure somehow.

'either', 'one-of', 'choose', 'pick-one', 'alternative', 'alt' --- very readable although the relationship to 'or' isn't quite clear. Perhaps they suggest a looser sense of ordering?

'unseq-or' --- a bit more readable and phonetically sharper than 'unordered-or', but it suggest a relation to 'seq'.

'nonstrict-or' --- abuses the familiar programming notion of strictness?

'or-ooo' --- will mostly make sense to the comp-arch crowd.

> Is there a reason this uses (cons (list "[^z-a]") t) rather than '(("[^z-a]") . t) ? I realize neighboring code does something similar, but it's not clear to me why it's important to construct new objects here instead of using literals.

Yes, there is a comment right above explaining that the returned value may be mutated (at least one use of mapcan). I tried doing it the other way, but neither was clearly better than the other (in performance or style), so I've let it stand for now. Nothing I feel strongly about either way.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: Mattias Engdegård <mattiase <at> acm.org>
  2019-10-12 10:47   ` Mattias Engdegård
@ 2019-10-13 16:52     ` Paul Eggert
  2019-10-13 19:48       ` Mattias Engdegård
  2019-10-22 15:14       ` bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
  0 siblings, 2 replies; 32+ messages in thread
From: Paul Eggert @ 2019-10-13 16:52 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 37659

On 10/12/19 3:47 AM, Mattias Engdegård wrote:

> there is no hint of the difference from 'or' or '|'.

That goal is secondary and can be dispensed with. It is OK to use a symbol that 
one must remember or look up. 'unordered-or' is simply too ungainly.

Of the names you suggest, 'alt' is the the best. But here's another idea: use 
'|' for unordered or, and 'or' for ORdered OR. Strictly speaking this would be 
an incompatible change, but I doubt whether many users will notice or care.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: Mattias Engdegård <mattiase <at> acm.org>
  2019-10-13 16:52     ` Paul Eggert
@ 2019-10-13 19:48       ` Mattias Engdegård
  2019-10-22 15:14       ` bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
  1 sibling, 0 replies; 32+ messages in thread
From: Mattias Engdegård @ 2019-10-13 19:48 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 37659

13 okt. 2019 kl. 18.52 skrev Paul Eggert <eggert@cs.ucla.edu>:
> 
> Of the names you suggest, 'alt' is the the best. But here's another idea: use '|' for unordered or, and 'or' for ORdered OR. Strictly speaking this would be an incompatible change, but I doubt whether many users will notice or care.

That's an interesting notion, but am wary about the incompatibility; I'll have to think about it. Until then, let's consider 'alt' the default name. More (informed) opinions on the subject sought!

I take it the other changes -- 'anychar', [^z-a], and 'unmatchable' --- are less debatable; I'll push them in a day or two if nobody objects.

Thanks for taking an interest in naming, but the way; it isn't bikeshedding and deserves care.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-13 16:52     ` Paul Eggert
  2019-10-13 19:48       ` Mattias Engdegård
@ 2019-10-22 15:14       ` Mattias Engdegård
  2019-10-22 15:27         ` Robert Pluim
  2019-10-22 17:33         ` Paul Eggert
  1 sibling, 2 replies; 32+ messages in thread
From: Mattias Engdegård @ 2019-10-22 15:14 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 37659

'regexp-opt' always generates a regexp preferring long matches. This is undocumented, but useful enough that I would be surprised if this property wasn't exploited (perhaps unknowingly) by callers. It's quite natural: given a set of strings, surely the caller want them all to be candidates for a match, even if there is no following anchoring pattern.

Thus, instead of 'unordered-or', define the operator in terms of long matches: 'or-max' (working name) would work like 'or' but guarantee a longest match, and only permit strings and 'or-max' forms as arguments. Thus, the rx user gets all the benefits from 'regexp-opt' in a composable way, without a need to sort the strings or otherwise prepare them.

(The old 'or' behaviour always used 'regexp-opt' when possible, which was very fragile: (or "a" "ab") would match "ab", but (or "a" "ab" digit) would just match "a". 'or-max' is robust, without surprises.)

Of course, we should also guarantee the maximum-matching property of regexp-opt. This is just a matter of documentation (and test); it does not restrict optimisations as far as I can tell.

Again, I'm open to suggestions about a better name than 'or-max'.

The other patches (anychar, unmatchable, and [^z-a]) have been pushed to master.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-22 15:14       ` bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
@ 2019-10-22 15:27         ` Robert Pluim
  2019-10-22 17:33         ` Paul Eggert
  1 sibling, 0 replies; 32+ messages in thread
From: Robert Pluim @ 2019-10-22 15:27 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Paul Eggert, 37659

>>>>> On Tue, 22 Oct 2019 17:14:08 +0200, Mattias Engdegård <mattiase@acm.org> said:

    Mattias> 'regexp-opt' always generates a regexp preferring long matches. This
    Mattias> is undocumented, but useful enough that I would be surprised if this
    Mattias> property wasn't exploited (perhaps unknowingly) by callers. It's quite
    Mattias> natural: given a set of strings, surely the caller want them all to be
    Mattias> candidates for a match, even if there is no following anchoring
    Mattias> pattern.

    Mattias> Thus, instead of 'unordered-or', define the operator in terms of long
    Mattias> matches: 'or-max' (working name) would work like 'or' but guarantee a
    Mattias> longest match, and only permit strings and 'or-max' forms as
    Mattias> arguments. Thus, the rx user gets all the benefits from 'regexp-opt'
    Mattias> in a composable way, without a need to sort the strings or otherwise
    Mattias> prepare them.

    Mattias> (The old 'or' behaviour always used 'regexp-opt' when possible, which
    Mattias> was very fragile: (or "a" "ab") would match "ab", but (or "a" "ab"
    Mattias> digit) would just match "a". 'or-max' is robust, without surprises.)

    Mattias> Of course, we should also guarantee the maximum-matching property of
    Mattias> regexp-opt. This is just a matter of documentation (and test); it does
    Mattias> not restrict optimisations as far as I can tell.

    Mattias> Again, I'm open to suggestions about a better name than 'or-max'.

or-greedy?





^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-22 15:14       ` bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
  2019-10-22 15:27         ` Robert Pluim
@ 2019-10-22 17:33         ` Paul Eggert
  2019-10-23  9:15           ` Mattias Engdegård
  2020-02-11 12:57           ` Mattias Engdegård
  1 sibling, 2 replies; 32+ messages in thread
From: Paul Eggert @ 2019-10-22 17:33 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 37659

On 10/22/19 8:14 AM, Mattias Engdegård wrote:
> 'regexp-opt' always generates a regexp preferring long matches. This is undocumented, but useful enough that I would be surprised if this property wasn't exploited (perhaps unknowingly) by callers. It's quite natural: given a set of strings, surely the caller want them all to be candidates for a match, even if there is no following anchoring pattern.

Yes, the longstanding tradition is that regular expressions are greedy.

> Thus, instead of 'unordered-or', define the operator in terms of long matches: 'or-max' (working name) would work like 'or' but guarantee a longest match, and only permit strings and 'or-max' forms as arguments.

That's an odd restriction. I'm not sure it's a good idea to add an 
operator with such a restriction. That is, I know why the restriction is 
there (it's because of limitations in the Emacs regexp matcher), but 
it's not clear that users should have to know and understand these details.

Moreover, if greed is the longstanding tradition for regexp-opt, 
shouldn't plain "or" be greedy, to be consistent with other operators? 
That is true for POSIX regular expressions involving "|". For example, 
the shell command:

echo abbc |
awk '{n=split($0, a, /b|bb/); for (i=1;i<=n;i++) print a[i]}'

outputs the two lines "a" and "c" (not the three lines "a", "", and "c") 
because the "b|bb" matches greedily.

If it's too much trouble to make plain "or" greedy, I suggest just 
documenting it as possibly being greedy and possibly not (that is, 
document it as being unordered, even if it happens to be ordered now). 
This will give us more opportunity for optimization later.

More generally, surely it would be better to improve the underlying 
Emacs regular expression matcher to have a greedy "or", or a stingy 
"or", or whatever.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-22 17:33         ` Paul Eggert
@ 2019-10-23  9:15           ` Mattias Engdegård
  2019-10-23 23:14             ` Paul Eggert
  2020-02-11 12:57           ` Mattias Engdegård
  1 sibling, 1 reply; 32+ messages in thread
From: Mattias Engdegård @ 2019-10-23  9:15 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 37659

22 okt. 2019 kl. 19.33 skrev Paul Eggert <eggert@cs.ucla.edu>:

>> Thus, instead of 'unordered-or', define the operator in terms of long matches: 'or-max' (working name) would work like 'or' but guarantee a longest match, and only permit strings and 'or-max' forms as arguments.
> 
> That's an odd restriction. I'm not sure it's a good idea to add an operator with such a restriction. That is, I know why the restriction is there (it's because of limitations in the Emacs regexp matcher), but it's not clear that users should have to know and understand these details.

The restriction is simple and easy to document. It is not necessary to know the underlying reason for it in order to use the construct effectively.

> Moreover, if greed is the longstanding tradition for regexp-opt, shouldn't plain "or" be greedy, to be consistent with other operators?

Yes, I very much favour switching to a DFA engine; is there another way? Even then a backtracking engine would be needed for backrefs and other messy cases. However, that's a completely different amount of work. (Meanwhile, we have 'posix-string-match' etc for those who want greed at any cost.)

The problem that I'm trying to solve here is: how do we make it easy to match one of multiple strings --- keywords, say --- in rx? Currently, the answer is something like (regexp (regexp-opt my-keywords)), which doesn't integrate well with rx user definitions. In addition, the output of one regexp-opt cannot be used as input to another.

'or-max' would allow a user to say

(rx-define veggies (or-max "carrot" "tomato" "cucumber"))
(rx-define meats (or-max "beef" "chicken" "pork"))
... (rx (or-max veggies meats)) ...

and get a regexp that is guaranteed to be greedy, well-optimised as if all strings were passed to 'regexp-opt' at once, and robust: a small change won't change the behaviour radically, and the user won't have to game or second-guess the engine in order to produce the desired result.

If, in the future, 'or' becomes greedy, then 'or-max' will just be a synonym.

> If it's too much trouble to make plain "or" greedy, I suggest just documenting it as possibly being greedy and possibly not (that is, document it as being unordered, even if it happens to be ordered now). This will give us more opportunity for optimization later.

That would make rx strictly less useful than string regexps. That is why 'unordered-or' was a mistake: the unpredictability made it useless in many cases, and everyone would just have used regexp-opt (or skipped rx altogether).

It is desirable to have the semantics for 'or' in rx and \| in string regexps; otherwise, translating and understanding become unnecessarily difficult.

We could say that 'or' and \| either match greedily or in left-to-right order. However, I'm not sure this solves any problem right now.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-23  9:15           ` Mattias Engdegård
@ 2019-10-23 23:14             ` Paul Eggert
  2019-10-24  1:56               ` Drew Adams
  2019-10-24  8:58               ` Mattias Engdegård
  0 siblings, 2 replies; 32+ messages in thread
From: Paul Eggert @ 2019-10-23 23:14 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 37659

On 10/23/19 2:15 AM, Mattias Engdegård wrote:

> how do we make it easy to match one of multiple strings --- keywords, say --- in rx?

If that's the real problem, perhaps the name should be "or-tokens" or 
something like that, to help remind the reader of the limitations of the 
proposed operator: it's meant only for greedy tokenization and it isn't 
suited for regular expressions in general. A problem with the name 
"or-max" is that it implies a more-general functionality than the 
implementation really has.

What happens if you apply or-tokens to arguments that aren't strings or 
other or-tokens? Does rx diagnose this? I hope it does.

> We could say that 'or' and \| either match greedily or in left-to-right order. However, I'm not sure this solves any problem right now.

I was thinking of something more-compatible: we could say that \| is 
left-to-right (for users who need compatibility with regexp "|"), and 
that 'or' is not necessarily left-to-right (to make room for future 
extensions that make 'or' greedy, or more efficient, or both).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-23 23:14             ` Paul Eggert
@ 2019-10-24  1:56               ` Drew Adams
  2019-10-24  9:09                 ` Mattias Engdegård
  2019-10-24  9:17                 ` Phil Sainty
  2019-10-24  8:58               ` Mattias Engdegård
  1 sibling, 2 replies; 32+ messages in thread
From: Drew Adams @ 2019-10-24  1:56 UTC (permalink / raw)
  To: Paul Eggert, Mattias Engdegård; +Cc: 37659

Without wanting to butt in here, and knowing
nothing about what rx offers...

Is there an identifiable subset of rx features
(operators, functions, thingies, or whatever
its composable pieces are called) that map
(even if not one-to-one) to regexp syntax
components?

If so, are the things in that subset identified
as such in the doc?

Just wondering.  Wondering, in part, whether
use of rx might indirectly help someone learn
about Emacs regexp syntax and behavior.

Not saying that's important.  Just thought it
might be of some use, or at least interesting.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-23 23:14             ` Paul Eggert
  2019-10-24  1:56               ` Drew Adams
@ 2019-10-24  8:58               ` Mattias Engdegård
  2019-10-27 11:53                 ` Mattias Engdegård
  1 sibling, 1 reply; 32+ messages in thread
From: Mattias Engdegård @ 2019-10-24  8:58 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 37659

[-- Attachment #1: Type: text/plain, Size: 2289 bytes --]

24 okt. 2019 kl. 01.14 skrev Paul Eggert <eggert@cs.ucla.edu>:
> 
>> how do we make it easy to match one of multiple strings --- keywords, say --- in rx?
> 
> If that's the real problem, perhaps the name should be "or-tokens" or something like that, to help remind the reader of the limitations of the proposed operator: it's meant only for greedy tokenization and it isn't suited for regular expressions in general. A problem with the name "or-max" is that it implies a more-general functionality than the implementation really has.

'or-strings' then perhaps, since there is nothing really restricting it to 'tokens' (which is a bit hazardous terminology given that regexps are commonly used for tokenising). In particular, there is no delimiting; (or-max "IN" "OUT") will match the first part of "INSPECT", which may be unexpected of something ostensibly matching tokens.

On the other hand, 'or-strings' sort of precludes a future relaxation of the argument restriction.

> What happens if you apply or-tokens to arguments that aren't strings or other or-tokens? Does rx diagnose this? I hope it does.

Yes, of course. Working patch attached (it still uses the name 'or-max').

'or-max' isn't a vital addition; it just seemed to fill a gap, after experience with traditional regexp usage. It clearly shouldn't be added it on a whim. I wanted to get it in place for 27.1, but such a version rush has rarely resulted in good design.

> I was thinking of something more-compatible: we could say that \| is left-to-right (for users who need compatibility with regexp "|"), and that 'or' is not necessarily left-to-right (to make room for future extensions that make 'or' greedy, or more efficient, or both).

Sorry, by '\|' I meant the string regexp operator; I take it you propose separate semantics for the rx '|' and 'or' operators? Maybe we should worry about that if we ever get near the point of replacing the engine. There are other concerns, such as how capture groups are set (even if two branches match equally long texts).

I honestly don't think much would break if '\|' (in string regexps) became greedy overnight, but there is plenty of room to confuse the user if we introduce subtle distinctions between what has hitherto been perceived as synonyms.


[-- Attachment #2: 0003-Add-the-rx-or-max-operator.patch --]
[-- Type: application/octet-stream, Size: 4638 bytes --]

From b6e1900e64803dc28e915f38febd9a9389acb697 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Wed, 23 Oct 2019 12:47:53 +0200
Subject: [PATCH 3/3] Add the rx 'or-max' operator

* doc/lispref/searching.texi (Rx Constructs):
* lisp/emacs-lisp/rx.el (rx--or-max-strings, rx--translate-or-max)
(rx--translate-form, rx--builtin-forms, rx):
* test/lisp/emacs-lisp/rx-tests.el (rx-or-max-def):
Add 'or-max'.
---
 doc/lispref/searching.texi       |  7 +++++++
 lisp/emacs-lisp/rx.el            | 22 +++++++++++++++++++++-
 test/lisp/emacs-lisp/rx-tests.el | 13 +++++++++++++
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 5178575a3b..3feaebc16d 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1084,6 +1084,13 @@ Rx Constructs
 Without arguments, the expression will not match anything at all.@*
 Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
 
+@item @code{(or-max @var{rx}@dots{})}
+@cindex @code{or-max} in rx
+Like @code{or}, but always favours the longest possible match.  The
+@var{rx}s must be strings or @code{or-max} forms.  The resulting
+expression may be more efficient in matching than the corresponding
+@code{or} form.
+
 @item @code{unmatchable}
 @cindex @code{unmatchable} in rx
 Refuse any match.  Equivalent to @code{(or)}.
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index d7677f1444..9afa17c617 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -293,6 +293,24 @@ rx--translate-or
                           (cdr body)))
           nil))))
 
+(defun rx--or-max-strings (args)
+  "List of string arguments in an 'or-max' construct."
+  (mapcan (lambda (item)
+            (cond
+             ;; FIXME: Allow single characters as well?
+             ((stringp item) (list item))
+             ((and (consp item) (eq (car item) 'or-max))
+              (rx--or-max-strings (cdr item)))
+             ((let ((expanded (rx--expand-def item)))
+                (and expanded
+                     (rx--or-max-strings (list expanded)))))
+             (t (error "Illegal `or-max' argument: %S" item))))
+          args))
+
+(defun rx--translate-or-max (body)
+  "Translate (or-max BODY...).  Return (REGEXP . PRECEDENCE)."
+  (cons (list (regexp-opt (rx--or-max-strings body))) t))
+
 (defun rx--string-to-intervals (str)
   "Decode STR as intervals: A-Z becomes (?A . ?Z), and the single
 character X becomes (?X . ?X).  Return the intervals in a list."
@@ -854,6 +872,7 @@ rx--translate-form
     (pcase (car form)
       ((or 'seq : 'and 'sequence) (rx--translate-seq body))
       ((or 'or '|)              (rx--translate-or body))
+      ('or-max                  (rx--translate-or-max body))
       ((or 'any 'in 'char)      (rx--translate-any nil body))
       ('not-char                (rx--translate-any t body))
       ('not                     (rx--translate-not nil body))
@@ -915,7 +934,7 @@ rx--translate-form
         (t (error "Unknown rx form `%s'" op)))))))
 
 (defconst rx--builtin-forms
-  '(seq sequence : and or | any in char not-char not
+  '(seq sequence : and or | or-max any in char not-char not
     repeat = >= **
     zero-or-more 0+ *
     one-or-more 1+ +
@@ -1006,6 +1025,7 @@ rx
 
 (seq RX...)    Match the RXs in sequence.  Alias: :, sequence, and.
 (or RX...)     Match one of the RXs.  Alias: |.
+(or-max RX...) Match one of the RXs (strings and or-max only), longest match.
 
 (zero-or-more RX...) Match RXs zero or more times.  Alias: 0+.
 (one-or-more RX...)  Match RXs one or more times.  Alias: 1+.
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index ef2541d83a..f60932e670 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -49,6 +49,19 @@ rx-or
   (should (equal (rx (|))
                  "\\`a\\`")))
 
+(ert-deftest rx-or-max ()
+  (should (equal (rx (or-max "ab" "abc"))
+                 "\\(?:abc?\\)"))
+  (should (equal (rx (or-max (or-max "a" "xy") (or-max "ab" "abcd")))
+                 "\\(?:a\\(?:b\\(?:cd\\)?\\)?\\|xy\\)")))
+
+(ert-deftest rx-or-max-def ()
+  (rx-let ((a (or-max "a" "xy"))
+           (b a)
+           (c (or-max "ab" "abcd")))
+    (should (equal (rx (or-max c b))
+                   "\\(?:a\\(?:b\\(?:cd\\)?\\)?\\|xy\\)"))))
+
 (ert-deftest rx-char-any ()
   "Test character alternatives with `]' and `-' (Bug#25123)."
   (should (equal
-- 
2.21.0 (Apple Git-122)


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-24  1:56               ` Drew Adams
@ 2019-10-24  9:09                 ` Mattias Engdegård
  2019-10-24 14:24                   ` Drew Adams
  2019-10-24  9:17                 ` Phil Sainty
  1 sibling, 1 reply; 32+ messages in thread
From: Mattias Engdegård @ 2019-10-24  9:09 UTC (permalink / raw)
  To: Drew Adams; +Cc: Paul Eggert, 37659

24 okt. 2019 kl. 03.56 skrev Drew Adams <drew.adams@oracle.com>:
> 
> Is there an identifiable subset of rx features
> (operators, functions, thingies, or whatever
> its composable pieces are called) that map
> (even if not one-to-one) to regexp syntax
> components?

Almost all of rx maps one-to-one to the string regexp syntax. The rx docs mention the corresponding string regexps for most forms.

In addition, rx is self-explaining in the sense that a user curious about what a particular expression means needs only to evaluate (rx SOMETHING) to get the translation. (There are external packages for going in the other direction.)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-24  1:56               ` Drew Adams
  2019-10-24  9:09                 ` Mattias Engdegård
@ 2019-10-24  9:17                 ` Phil Sainty
  2019-10-24 14:32                   ` Drew Adams
  1 sibling, 1 reply; 32+ messages in thread
From: Phil Sainty @ 2019-10-24  9:17 UTC (permalink / raw)
  To: Drew Adams; +Cc: Mattias Engdegård, Paul Eggert, 37659

On 24/10/19 2:56 PM, Drew Adams wrote:
> Is there an identifiable subset of rx features ... that map
> (even if not one-to-one) to regexp syntax components?

C-h f rx

(syntax SYNTAX)  Match a character with syntax SYNTAX, being one of:
  whitespace, punctuation, word, symbol, open-parenthesis,
  close-parenthesis, expression-prefix, string-quote,
  paired-delimiter, escape, character-quote, comment-start,
  comment-end, string-delimiter, comment-delimiter






^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-24  9:09                 ` Mattias Engdegård
@ 2019-10-24 14:24                   ` Drew Adams
  0 siblings, 0 replies; 32+ messages in thread
From: Drew Adams @ 2019-10-24 14:24 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Paul Eggert, 37659

> > Is there an identifiable subset of rx features
> > (operators, functions, thingies, or whatever
> > its composable pieces are called) that map
> > (even if not one-to-one) to regexp syntax
> > components?
> 
> Almost all of rx maps one-to-one to the string regexp syntax. The rx
> docs mention the corresponding string regexps for most forms.
> 
> In addition, rx is self-explaining in the sense that a user curious
> about what a particular expression means needs only to evaluate (rx
> SOMETHING) to get the translation. (There are external packages for
> going in the other direction.)

Yes, I knew the last part - rx returns a regexp
(which you can examine).

My suggestion was really about documenting the
correspondence between individual rx components
and regexp components.  If that's already done,
great.  Thx.







^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-24  9:17                 ` Phil Sainty
@ 2019-10-24 14:32                   ` Drew Adams
  0 siblings, 0 replies; 32+ messages in thread
From: Drew Adams @ 2019-10-24 14:32 UTC (permalink / raw)
  To: Phil Sainty; +Cc: Mattias Engdegård, Paul Eggert, 37659

> > Is there an identifiable subset of rx features ... that map
> > (even if not one-to-one) to regexp syntax components?
> 
> C-h f rx
> 
> (syntax SYNTAX)  Match a character with syntax SYNTAX, being one of:
>   whitespace, punctuation, word, symbol, open-parenthesis,
>   close-parenthesis, expression-prefix, string-quote,
>   paired-delimiter, escape, character-quote, comment-start,
>   comment-end, string-delimiter, comment-delimiter

Yes, that's fine for char syntax classes.  It's
good that their correspondences are listed.

But for, say, `line-start' (aka `bol') there is a
verbal description but no mention of the Elisp
regexp syntax that corresponds:

  ‘line-start’, ‘bol’
     matches the empty string, but only at the
     beginning of a line in the text being matched

That's the kind of thing I was suggesting.  It would
be helpful, I think, to mention (somewhere) that the
regexp syntax for this is "^".

Same thing for the other constructs (`string-start'
and all the rest).

I'm using Emacs 26.3.  I didn't find anything beyond
the doc string - nothing in the Emacs or Elisp manual
(which is OK).





^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-24  8:58               ` Mattias Engdegård
@ 2019-10-27 11:53                 ` Mattias Engdegård
  0 siblings, 0 replies; 32+ messages in thread
From: Mattias Engdegård @ 2019-10-27 11:53 UTC (permalink / raw)
  To: Paul Eggert; +Cc: 37659

[-- Attachment #1: Type: text/plain, Size: 858 bytes --]

An observation is that 'or-max' cannot currently be defined by the user, because there is no way to expand rx forms explicitly. One way to fill that hole is to add the function

  (rx-expand-definitions RX-FORM)

which would expand RX-FORM until it no longer is a user-defined form.
This would permit or-max to be defined as

(rx-define or-max (&rest forms)
  (eval `(regexp ,(regexp-opt (or-max-strings (list forms))))))

(defun or-max-strings (args)
  (mapcan (lambda (item)
            (pcase item
              ((pred stringp) (list item))
              (`(or-max . ,rest) (or-max-strings rest))
              (_ (error "Illegal `or-max' argument: %S" item))))
          (mapcar #'rx-expand-definitions args)))

Of course, if the 'or-max' operator is generally useful, it would probably still make sense to define it as a primitive.


[-- Attachment #2: 0001-Add-rx-expand-definitions.patch --]
[-- Type: application/octet-stream, Size: 2520 bytes --]

From 52df813cac0ecb38f62d6740082ee6741ca0e167 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Sun, 27 Oct 2019 12:50:15 +0100
Subject: [PATCH] Add `rx-expand-definitions'

* lisp/emacs-lisp/rx.el (rx-expand-definitions): New function.
* test/lisp/emacs-lisp/rx-tests.el (rx-expand-definitions): Test.

Add `rx-expand-definitions', allowing explicit expansion of rx forms
inside (eval ...). (Bug#37659)
---
 lisp/emacs-lisp/rx.el            |  8 ++++++++
 test/lisp/emacs-lisp/rx-tests.el | 17 +++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 52a35ffa2a..660b336efd 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -143,6 +143,14 @@ rx--expand-def
                      op (cdr form) (nth 0 def) (nth 1 def))
                   (error "Not an `rx' form definition: %s" op)))))))
 
+(defun rx-expand-definitions (form)
+  "Expand FORM until it is no longer a user-defined rx construct.
+Then return the result."
+  (let ((expanded (rx--expand-def form)))
+    (if expanded
+        (rx-expand-definitions expanded)
+      form)))
+
 ;; TODO: Additions to consider:
 ;; - A construct like `or' but without the match order guarantee,
 ;;   maybe `unordered-or'.  Useful for composition or generation of
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index 4ecc805aea..369dab83f6 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -419,6 +419,23 @@ rx-def-in-not
     (should (equal (rx (not (d ?m)) (not (e symbol)))
                    "[^amz]\\S_"))))
 
+(ert-deftest rx-expand-definitions ()
+  (rx-let ((a b)
+           (b (* nonl)))
+    (should (equal (rx (eval (rx-expand-definitions 'space)))
+                   "[[:space:]]"))
+    (should (equal (rx (eval (rx-expand-definitions 'a)))
+                   ".*")))
+  (rx-let-eval '((f (x) (seq ?a x ?b))
+                 (g (y) (f (+ y))))
+    (should (equal (rx-to-string '(eval (rx-expand-definitions 'digit)) t)
+                   "[[:digit:]]"))
+    (should (equal (rx-to-string '(eval (rx-expand-definitions '(g ?c))) t)
+                   "ac+b")))
+  (rx-define rx--g "z")
+  (should (equal (rx (eval (rx-expand-definitions 'rx--g)))
+                 "z")))
+
 (ert-deftest rx-constituents ()
   (let ((rx-constituents
          (append '((beta . gamma)
-- 
2.21.0 (Apple Git-122)


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2019-10-22 17:33         ` Paul Eggert
  2019-10-23  9:15           ` Mattias Engdegård
@ 2020-02-11 12:57           ` Mattias Engdegård
  2020-02-11 15:43             ` Eli Zaretskii
  1 sibling, 1 reply; 32+ messages in thread
From: Mattias Engdegård @ 2020-02-11 12:57 UTC (permalink / raw)
  To: Paul Eggert, Phil Sainty, Eli Zaretskii; +Cc: 37659

[-- Attachment #1: Type: text/plain, Size: 1706 bytes --]

22 okt. 2019 kl. 19.33 skrev Paul Eggert <eggert@cs.ucla.edu>:

> Moreover, if greed is the longstanding tradition for regexp-opt, shouldn't plain "or" be greedy, to be consistent with other operators?

Having second thoughts, I've come to believe that Paul may have been right after all. We might just as well let plain 'or' (alias '|') match as much as possible when it is able to do so. In particular, we should guarantee that this will happen when all arguments are strings, as used to be the case.

Initially I thought it was a bug that (or "a" "ab") was optimised into "ab?" on the grounds that this made the behaviour unpredictable: when matching the string "abc", (or "a" "ab") matched "ab", whereas (or "a" "ab" space) would match "a". However, the current 'fixed' code isn't necessarily more useful.

Since the change was introduced in Emacs 27 which has not yet been released, I suggest the attached patch for emacs-27. It reverts the use of regexp-opt with KEEP-ORDER = t. What do you think? It would solve the problem without introducing new constructs, and without running the risk of introducing subtle errors in existing rx expressions.

(In fact, if we do not do this in Emacs 27, we'd have to add a NEWS entry to warn users about the change.)

A further improvement would be to ensure that nested all-string 'or' forms would have the same property, and that expansion of user-defined forms would be transparent. In other words, that

 (rx-let ((x (or "abc" "de")))
   (rx (or "a" x (or "ab" "def"))))

would be equivalent to

 (rx "abc" "ab" "a" "def" "de")

I'll prepare a patch for this QoI improvement, but the attached patch should be required no matter what.

[-- Attachment #2: 0001-rx-Use-longest-match-for-all-string-or-forms-bug-376.patch --]
[-- Type: application/octet-stream, Size: 2209 bytes --]

From b05c5ace634986f3c32310a4f62bd619e6ac5db9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 11 Feb 2020 13:23:10 +0100
Subject: [PATCH] rx: Use longest match for all-string 'or' forms (bug#37659)

Revert to the Emacs 26 semantics that always gave the longest match
for rx 'or' forms with only string arguments.  This guarantee was
never well documented, but it is useful and people likely have come to
rely on it.  For example, prior to this change,

 (rx (or ">" ">="))

matched ">" even if the text contained ">=".

* lisp/emacs-lisp/rx.el (rx--translate-or): Don't tell regexp-opt to
preserve the matching order.
* doc/lispref/searching.texi (Rx Constructs): Document the
longest-match guarantee for all-string 'or' forms.
---
 doc/lispref/searching.texi | 5 ++++-
 lisp/emacs-lisp/rx.el      | 2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 3d7ea93286..5f4509a8b4 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1080,7 +1080,10 @@ Rx Constructs
 @cindex @code{or} in rx
 @itemx @code{(| @var{rx}@dots{})}
 @cindex @code{|} in rx
-Match exactly one of the @var{rx}s, trying from left to right.
+Match exactly one of the @var{rx}s.
+If all arguments are string literals, the longest possible match
+will always be used.  Otherwise, either the longest match or the
+first (in left-to-right order) will be used.
 Without arguments, the expression will not match anything at all.@*
 Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.

diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 03af053c91..b4cab5715d 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -290,7 +290,7 @@ rx--translate-or
    ((null (cdr body))              ; Single item.
     (rx--translate (car body)))
    ((rx--every #'stringp body)     ; All strings.
-    (cons (list (regexp-opt body nil t))
+    (cons (list (regexp-opt body nil))
           t))
    ((rx--every #'rx--charset-p body)  ; All charsets.
     (rx--translate-union nil body))
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-11 12:57           ` Mattias Engdegård
@ 2020-02-11 15:43             ` Eli Zaretskii
  2020-02-11 19:17               ` Mattias Engdegård
  0 siblings, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2020-02-11 15:43 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: psainty, eggert, 37659

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Tue, 11 Feb 2020 13:57:27 +0100
> Cc: 37659@debbugs.gnu.org
> 
> Since the change was introduced in Emacs 27 which has not yet been released, I suggest the attached patch for emacs-27. It reverts the use of regexp-opt with KEEP-ORDER = t. What do you think? It would solve the problem without introducing new constructs, and without running the risk of introducing subtle errors in existing rx expressions.

Can't say I'm happy with these last-minute experiments, but okay.

Thanks.





^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-11 15:43             ` Eli Zaretskii
@ 2020-02-11 19:17               ` Mattias Engdegård
  2020-02-12  0:52                 ` Paul Eggert
  0 siblings, 1 reply; 32+ messages in thread
From: Mattias Engdegård @ 2020-02-11 19:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, Paul Eggert, 37659

[-- Attachment #1: Type: text/plain, Size: 321 bytes --]

11 feb. 2020 kl. 16.43 skrev Eli Zaretskii <eliz@gnu.org>:

> Can't say I'm happy with these last-minute experiments, but okay.

Thanks, and I think it's actually a lesser experiment than status quo ante.
I'm allowing for more comments before pushing it; meanwhile, here is the follow-up patch mentioned earlier.


[-- Attachment #2: 0001-rx-Improve-or-compositionality.patch --]
[-- Type: application/octet-stream, Size: 7758 bytes --]

From dc4aadfdacc2466420833281c5374279eee03aca Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 11 Feb 2020 20:04:42 +0100
Subject: [PATCH] rx: Improve 'or' compositionality

Perform 'regexp-opt' on nested 'or' forms, and after expansion of
user-defined and 'eval' forms.  Characters are now turned into
strings for wider 'regexp-opt' scope.

* doc/lispref/searching.texi (Rx Constructs): Document.
* lisp/emacs-lisp/rx.el (rx--normalise-or-arg)
rx--all-string-or-args): New.
(rx--translate-or): Normalise arguments first, and check for strings
in subforms.
(rx--expand-eval): Extracted from rx--translate-eval.
(rx--translate-eval): Call rx--expand-eval.
---
 doc/lispref/searching.texi       |  5 ++-
 lisp/emacs-lisp/rx.el            | 76 ++++++++++++++++++++------------
 test/lisp/emacs-lisp/rx-tests.el | 13 +++++-
 3 files changed, 63 insertions(+), 31 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 5f4509a8b4..526ee77d40 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1081,8 +1081,9 @@ Rx Constructs
 @itemx @code{(| @var{rx}@dots{})}
 @cindex @code{|} in rx
 Match exactly one of the @var{rx}s.
-If all arguments are string literals, the longest possible match
-will always be used.  Otherwise, either the longest match or the
+If all arguments are strings, characters, or @code{or} forms
+so constrained, the longest possible match will always be used.
+Otherwise, either the longest match or the
 first (in left-to-right order) will be used.
 Without arguments, the expression will not match anything at all.@*
 Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index b4cab5715d..7a7d09f10b 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -254,22 +254,39 @@ rx--foldl
     (setq l (cdr l)))
   x)
 
+(defun rx--normalise-or-arg (form)
+  "Normalise the `or' argument FORM.
+Characters become strings, user-definitions and `eval' forms are expanded,
+and `or' forms are normalised recursively."
+  (cond ((characterp form)
+         (char-to-string form))
+        ((and (consp form) (memq (car form) '(or |)))
+         (cons (car form) (mapcar #'rx--normalise-or-arg (cdr form))))
+        ((and (consp form) (eq (car form) 'eval))
+         (rx--normalise-or-arg (rx--expand-eval (cdr form))))
+        (t
+         (let ((expanded (rx--expand-def form)))
+           (if expanded
+               (rx--normalise-or-arg expanded)
+             form)))))
+
+(defun rx--all-string-or-args (body)
+  "If BODY only consists of strings or such `or' forms, return all the strings.
+Otherwise throw `rx--nonstring'."
+  (mapcan (lambda (form)
+            (cond ((stringp form) (list form))
+                  ((and (consp form) (memq (car form) '(or |)))
+                   (rx--all-string-or-args (cdr form)))
+                  (t (throw 'rx--nonstring nil))))
+          body))
+
 (defun rx--translate-or (body)
   "Translate an or-pattern of zero or more rx items.
 Return (REGEXP . PRECEDENCE)."
   ;; FIXME: Possible improvements:
   ;;
-  ;; - Turn single characters to strings: (or ?a ?b) -> (or "a" "b"),
-  ;;   so that they can be candidates for regexp-opt.
-  ;;
-  ;; - Translate compile-time strings (`eval' forms), again for regexp-opt.
-  ;;
   ;; - Flatten sub-patterns first: (or (or A B) (or C D)) -> (or A B C D)
-  ;;   in order to improve effectiveness of regexp-opt.
-  ;;   This would also help composability.
-  ;;
-  ;; - Use associativity to run regexp-opt on contiguous subsets of arguments
-  ;;   if not all of them are strings.  Example:
+  ;;   Then call regexp-opt on runs of string arguments. Example:
   ;;   (or (+ digit) "CHARLIE" "CHAN" (+ blank))
   ;;   -> (or (+ digit) (or "CHARLIE" "CHAN") (+ blank))
   ;;
@@ -279,27 +296,26 @@ rx--translate-or
   ;;   so that (or "@" "%" digit (any "A-Z" space) (syntax word))
   ;;        -> (any "@" "%" digit "A-Z" space word)
   ;;        -> "[A-Z@%[:digit:][:space:][:word:]]"
-  ;;
-  ;; Problem: If a subpattern is carefully written to be
-  ;; optimizable by regexp-opt, how do we prevent the transforms
-  ;; above from destroying that property?
-  ;; Example: (or "a" (or "abc" "abd" "abe"))
   (cond
    ((null body)                    ; No items: a never-matching regexp.
     (rx--empty))
    ((null (cdr body))              ; Single item.
     (rx--translate (car body)))
-   ((rx--every #'stringp body)     ; All strings.
-    (cons (list (regexp-opt body nil))
-          t))
-   ((rx--every #'rx--charset-p body)  ; All charsets.
-    (rx--translate-union nil body))
    (t
-    (cons (append (car (rx--translate (car body)))
-                  (mapcan (lambda (item)
-                            (cons "\\|" (car (rx--translate item))))
-                          (cdr body)))
-          nil))))
+    (let* ((args (mapcar #'rx--normalise-or-arg body))
+           (all-strings (catch 'rx--nonstring (rx--all-string-or-args args))))
+      (cond
+       (all-strings                       ; Only strings.
+        (cons (list (regexp-opt all-strings nil))
+              t))
+       ((rx--every #'rx--charset-p args)  ; All charsets.
+        (rx--translate-union nil args))
+       (t
+        (cons (append (car (rx--translate (car args)))
+                      (mapcan (lambda (item)
+                                (cons "\\|" (car (rx--translate item))))
+                              (cdr args)))
+              nil)))))))
 
 (defun rx--charset-p (form)
   "Whether FORM looks like a charset, only consisting of character intervals
@@ -836,11 +852,15 @@ rx--translate-literal
            (cons (list (list 'regexp-quote arg)) 'seq))
           (t (error "rx `literal' form with non-string argument")))))
 
-(defun rx--translate-eval (body)
-  "Translate the `eval' form.  Return (REGEXP . PRECEDENCE)."
+(defun rx--expand-eval (body)
+  "Expand `eval' arguments.  Return a new rx form."
   (unless (and body (null (cdr body)))
     (error "rx `eval' form takes exactly one argument"))
-  (rx--translate (eval (car body))))
+  (eval (car body)))
+
+(defun rx--translate-eval (body)
+  "Translate the `eval' form.  Return (REGEXP . PRECEDENCE)."
+  (rx--translate (rx--expand-eval body)))
 
 (defvar rx--regexp-atomic-regexp nil)
 
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index a6c172adfe..4db6eb8c9c 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -42,13 +42,24 @@ rx-seq
 (ert-deftest rx-or ()
   (should (equal (rx (or "ab" (| "c" nonl) "de"))
                  "ab\\|c\\|.\\|de"))
-  (should (equal (rx (or "ab" "abc" "a"))
+  (should (equal (rx (or "ab" "abc" ?a))
                  "\\(?:a\\(?:bc?\\)?\\)"))
+  (should (equal (rx (or "ab" (| (or "abcd" "abcde")) (or "a" "abc")))
+                 "\\(?:a\\(?:b\\(?:c\\(?:de?\\)?\\)?\\)?\\)"))
+  (should (equal (rx (or "a" (eval (string ?a ?b))))
+                 "\\(?:ab?\\)"))
   (should (equal (rx (| nonl "a") (| "b" blank))
                  "\\(?:.\\|a\\)\\(?:b\\|[[:blank:]]\\)"))
   (should (equal (rx (|))
                  "\\`a\\`")))
 
+(ert-deftest rx-def-in-or ()
+  (rx-let ((a b)
+           (b (or "abc" c))
+           (c ?a))
+    (should (equal (rx (or a (| "ab" "abcde") "abcd"))
+                   "\\(?:a\\(?:b\\(?:c\\(?:de?\\)?\\)?\\)?\\)"))))
+
 (ert-deftest rx-char-any ()
   "Test character alternatives with `]' and `-' (Bug#25123)."
   (should (equal
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-11 19:17               ` Mattias Engdegård
@ 2020-02-12  0:52                 ` Paul Eggert
  2020-02-12 11:22                   ` Mattias Engdegård
  0 siblings, 1 reply; 32+ messages in thread
From: Paul Eggert @ 2020-02-12  0:52 UTC (permalink / raw)
  To: Mattias Engdegård, Eli Zaretskii; +Cc: psainty, 37659

On 2/11/20 11:17 AM, Mattias Engdegård wrote:
>> Can't say I'm happy with these last-minute experiments, but okay.
> Thanks, and I think it's actually a lesser experiment than status quo ante.

I agree with both of you.

I assume the followon patch "rx: Improve 'or' compositionality" is for 
the master branch. I didn't look at it carefully but the basic idea is 
sound.





^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-12  0:52                 ` Paul Eggert
@ 2020-02-12 11:22                   ` Mattias Engdegård
  2020-02-13 18:38                     ` Mattias Engdegård
  0 siblings, 1 reply; 32+ messages in thread
From: Mattias Engdegård @ 2020-02-12 11:22 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Phil Sainty, 37659

12 feb. 2020 kl. 01.52 skrev Paul Eggert <eggert@cs.ucla.edu>:

> I assume the followon patch "rx: Improve 'or' compositionality" is for the master branch.

Right -- while it would make sense for emacs-27, as user-defined forms were introduced in that version, it is not strictly necessary and could well be done on master.






^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-12 11:22                   ` Mattias Engdegård
@ 2020-02-13 18:38                     ` Mattias Engdegård
  2020-02-13 18:50                       ` Paul Eggert
  0 siblings, 1 reply; 32+ messages in thread
From: Mattias Engdegård @ 2020-02-13 18:38 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Phil Sainty, 37659

Now the regexp-opt KEEP-ORDER argument no longer serves any purpose; it was added for use by rx, and has no obvious use elsewhere. It could safely be removed to save some weight, unless you prefer it be kept as scar tissue.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-13 18:38                     ` Mattias Engdegård
@ 2020-02-13 18:50                       ` Paul Eggert
  2020-02-13 19:16                         ` Mattias Engdegård
  0 siblings, 1 reply; 32+ messages in thread
From: Paul Eggert @ 2020-02-13 18:50 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: Phil Sainty, 37659

On 2/13/20 10:38 AM, Mattias Engdegård wrote:
> Now the regexp-opt KEEP-ORDER argument no longer serves any purpose; it was added for use by rx, and has no obvious use elsewhere. It could safely be removed to save some weight, unless you prefer it be kept as scar tissue.

Simplest would be to remove it in Emacs 27, and merge that change into 
master.

If that's too drastic, we can mark it as deprecated in Emacs 27 (though 
it is odd for a release to add a feature that is immediately 
deprecated), and remove it in the master branch.





^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-13 18:50                       ` Paul Eggert
@ 2020-02-13 19:16                         ` Mattias Engdegård
  2020-02-13 19:30                           ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Mattias Engdegård @ 2020-02-13 19:16 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Phil Sainty, 37659

[-- Attachment #1: Type: text/plain, Size: 291 bytes --]

13 feb. 2020 kl. 19.50 skrev Paul Eggert <eggert@cs.ucla.edu>:

> Simplest would be to remove it in Emacs 27, and merge that change into master.

Right, introducing it already deprecated doesn't make sense at all. I'll do it in emacs-27 if Eli doesn't mind too much (patch attached).


[-- Attachment #2: 0001-Remove-the-optional-KEEP-ORDER-argument-to-regexp-op.patch --]
[-- Type: application/octet-stream, Size: 8920 bytes --]

From 79ecef631e667650418b187143c4059d7e6eebe6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Thu, 13 Feb 2020 20:06:48 +0100
Subject: [PATCH] Remove the optional KEEP-ORDER argument to regexp-opt

This argument was added for the 'or' clause in rx, but it turned out
to be a bad idea (bug#37659), and there seems to be little other use
for it.

* lisp/emacs-lisp/regexp-opt.el (regexp-opt): Remove KEEP-ORDER.
* doc/lispref/searching.texi (Regexp Functions):
* etc/NEWS: Remove it from the documentation.
* test/lisp/emacs-lisp/regexp-opt-tests.el (regexp-opt-test--match-all)
(regexp-opt-test--check-perm, regexp-opt-test--explain-perm)
(regexp-opt-keep-order, regexp-opt-longest-match): Simplify test.
---
 doc/lispref/searching.texi               |  9 ++---
 etc/NEWS                                 |  7 ----
 lisp/emacs-lisp/regexp-opt.el            | 43 +++++------------------
 test/lisp/emacs-lisp/regexp-opt-tests.el | 44 ++++--------------------
 4 files changed, 19 insertions(+), 84 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 5f4509a8b4..1f6db0643e 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1745,7 +1745,7 @@ Regexp Functions
 @end defun
 
 @cindex optimize regexp
-@defun regexp-opt strings &optional paren keep-order
+@defun regexp-opt strings &optional paren
 This function returns an efficient regular expression that will match
 any of the strings in the list @var{strings}.  This is useful when you
 need to make matching or searching as fast as possible---for example,
@@ -1783,11 +1783,8 @@ Regexp Functions
 it will apply to the whole expression.
 @end table
 
-The optional argument @var{keep-order}, if non-@code{nil}, forces the
-match to be performed in the order given, as if the strings were made
-into a regexp by joining them with the @samp{\|} operator.  If nil or
-omitted, the returned regexp will always match the longest string
-possible.
+The returned regexp is ordered in such a way that it will always match
+the longest string possible.
 
 Up to reordering, the resulting regexp of @code{regexp-opt} is
 equivalent to but usually more efficient than that of a simplified
diff --git a/etc/NEWS b/etc/NEWS
index 312869fe57..380ac71260 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -3526,13 +3526,6 @@ the process.  That way 'make-process' can start remote processes.
 This is currently supported on GNUish hosts and on modern versions of
 MS-Windows.
 
-+++
-** The function 'regexp-opt' accepts an additional optional argument.
-By default, the regexp returned by 'regexp-opt' may match the strings
-in any order.  If the new third argument is non-nil, the match is
-guaranteed to be performed in the order given, as if the strings were
-made into a regexp by joining them with '\|'.
-
 +++
 ** The function 'regexp-opt', when given an empty list of strings, now
 returns a regexp that never matches anything, which is an identity for
diff --git a/lisp/emacs-lisp/regexp-opt.el b/lisp/emacs-lisp/regexp-opt.el
index 2cce4e6353..35a5fda184 100644
--- a/lisp/emacs-lisp/regexp-opt.el
+++ b/lisp/emacs-lisp/regexp-opt.el
@@ -84,7 +84,7 @@
 ;;; Code:
 
 ;;;###autoload
-(defun regexp-opt (strings &optional paren keep-order)
+(defun regexp-opt (strings &optional paren)
   "Return a regexp to match a string in the list STRINGS.
 Each member of STRINGS is treated as a fixed string, not as a regexp.
 Optional PAREN specifies how the returned regexp is surrounded by
@@ -114,11 +114,8 @@ regexp-opt
     necessary to ensure that a postfix operator appended to it will
     apply to the whole expression.
 
-The optional argument KEEP-ORDER, if non-nil, forces the match to
-be performed in the order given, as if the strings were made into
-a regexp by joining them with the `\\|' operator.  If nil or
-omitted, the returned regexp is will always match the longest
-string possible.
+The returned regexp is ordered in such a way that it will always
+match the longest string possible.
 
 Up to reordering, the resulting regexp is equivalent to but
 usually more efficient than that of a simplified version:
@@ -140,34 +137,12 @@ regexp-opt
 	   (completion-ignore-case nil)
 	   (completion-regexp-list nil)
 	   (open (cond ((stringp paren) paren) (paren "\\(")))
-	   (re
-            (cond
-             ;; No strings: return an unmatchable regexp.
-             ((null strings)
-              (concat (or open "\\(?:") regexp-unmatchable "\\)"))
-
-             ;; The algorithm will generate a pattern that matches
-             ;; longer strings in the list before shorter.  If the
-             ;; list order matters, then no string must come after a
-             ;; proper prefix of that string.  To check this, verify
-             ;; that a straight or-pattern matches each string
-             ;; entirely.
-             ((and keep-order
-                   (let* ((case-fold-search nil)
-                          (alts (mapconcat #'regexp-quote strings "\\|")))
-                     (and (let ((s strings))
-                            (while (and s
-                                        (string-match alts (car s))
-                                        (= (match-end 0) (length (car s))))
-                              (setq s (cdr s)))
-                            ;; If we exited early, we found evidence that
-                            ;; regexp-opt-group cannot be used.
-                            s)
-                          (concat (or open "\\(?:") alts "\\)")))))
-             (t
-              (regexp-opt-group
-               (delete-dups (sort (copy-sequence strings) 'string-lessp))
-               (or open t) (not open))))))
+	   (re (if strings
+                   (regexp-opt-group
+                    (delete-dups (sort (copy-sequence strings) 'string-lessp))
+                    (or open t) (not open))
+                 ;; No strings: return an unmatchable regexp.
+                 (concat (or open "\\(?:") regexp-unmatchable "\\)"))))
       (cond ((eq paren 'words)
 	     (concat "\\<" re "\\>"))
 	    ((eq paren 'symbols)
diff --git a/test/lisp/emacs-lisp/regexp-opt-tests.el b/test/lisp/emacs-lisp/regexp-opt-tests.el
index 9b4567c72c..0179ac4f1f 100644
--- a/test/lisp/emacs-lisp/regexp-opt-tests.el
+++ b/test/lisp/emacs-lisp/regexp-opt-tests.el
@@ -47,43 +47,13 @@ regexp-opt-test--permutations
   (mapcar (lambda (i) (regexp-opt-test--permutation i list))
           (number-sequence 0 (1- (regexp-opt-test--factorial (length list))))))
 
-(defun regexp-opt-test--match-all (words re)
-  (mapcar (lambda (w) (and (string-match re w)
-                           (match-string 0 w)))
-          words))
-
-(defun regexp-opt-test--check-perm (perm)
-  (let* ((ref-re (mapconcat #'regexp-quote perm "\\|"))
-         (opt-re (regexp-opt perm nil t))
-         (ref (regexp-opt-test--match-all perm ref-re))
-         (opt (regexp-opt-test--match-all perm opt-re)))
-    (equal opt ref)))
-
-(defun regexp-opt-test--explain-perm (perm)
-  (let* ((ref-re (mapconcat #'regexp-quote perm "\\|"))
-         (opt-re (regexp-opt perm nil t))
-         (ref (regexp-opt-test--match-all perm ref-re))
-         (opt (regexp-opt-test--match-all perm opt-re)))
-    (concat "\n"
-            (format "Naïve regexp:     %s\n" ref-re)
-            (format "Optimized regexp: %s\n" opt-re)
-            (format "Got:      %s\n" opt)
-            (format "Expected: %s\n" ref))))
-
-(put 'regexp-opt-test--check-perm 'ert-explainer 'regexp-opt-test--explain-perm)
-
-(ert-deftest regexp-opt-keep-order ()
-  "Check that KEEP-ORDER works."
-  (dolist (perm (regexp-opt-test--permutations '("abc" "bca" "cab")))
-    (should (regexp-opt-test--check-perm perm)))
-  (dolist (perm (regexp-opt-test--permutations '("abc" "ab" "bca" "bc")))
-    (should (regexp-opt-test--check-perm perm)))
-  (dolist (perm (regexp-opt-test--permutations '("abxy" "cdxy")))
-    (should (regexp-opt-test--check-perm perm)))
-  (dolist (perm (regexp-opt-test--permutations '("afgx" "bfgx" "afgy" "bfgy")))
-    (should (regexp-opt-test--check-perm perm)))
-  (dolist (perm (regexp-opt-test--permutations '("a" "ab" "ac" "abc")))
-    (should (regexp-opt-test--check-perm perm))))
+(ert-deftest regexp-opt-longest-match ()
+  "Check that the regexp always matches as much as possible."
+  (let ((s "abcd"))
+    (dolist (perm (regexp-opt-test--permutations '("a" "ab" "ac" "abc")))
+      (should (equal (and (string-match (regexp-opt perm) s)
+                          (match-string 0 s))
+                     "abc")))))
 
 (ert-deftest regexp-opt-charset ()
   (should (equal (regexp-opt-charset '(?a ?b ?a)) "[ab]"))
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-13 19:16                         ` Mattias Engdegård
@ 2020-02-13 19:30                           ` Eli Zaretskii
  2020-02-13 22:23                             ` Mattias Engdegård
  0 siblings, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2020-02-13 19:30 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: psainty, eggert, 37659

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 13 Feb 2020 20:16:28 +0100
> Cc: Eli Zaretskii <eliz@gnu.org>, Phil Sainty <psainty@orcon.net.nz>,
>         37659@debbugs.gnu.org
> 
> > Simplest would be to remove it in Emacs 27, and merge that change into master.
> 
> Right, introducing it already deprecated doesn't make sense at all. I'll do it in emacs-27 if Eli doesn't mind too much (patch attached).

Fine with me, thanks.





^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-13 19:30                           ` Eli Zaretskii
@ 2020-02-13 22:23                             ` Mattias Engdegård
  2020-02-14  7:45                               ` Eli Zaretskii
  0 siblings, 1 reply; 32+ messages in thread
From: Mattias Engdegård @ 2020-02-13 22:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: psainty, eggert, 37659

13 feb. 2020 kl. 20.30 skrev Eli Zaretskii <eliz@gnu.org>:

> Fine with me, thanks.

Thank you, pushed.

Now only the 'compositionality' patch remains. As noted it could be done on master, but since it is motivated by user-defined forms which were introduced in Emacs 27, I'd rather like it to be done in that branch.

It simply seems a bit incomplete otherwise. Users read about the longest-match guarantee of 'or', and of user-definitions, and look for a way of combining the two. Perhaps they try

(rx-let ((arith-op (or "+" "-" "*" "/"))
         (assign-op (or "=" "+=" "-=" "*=" "/="))
         (op (or arith-op assign-op)))
...)

which doesn't quite work for matching "+=", say.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-13 22:23                             ` Mattias Engdegård
@ 2020-02-14  7:45                               ` Eli Zaretskii
  2020-02-14 16:15                                 ` Paul Eggert
  0 siblings, 1 reply; 32+ messages in thread
From: Eli Zaretskii @ 2020-02-14  7:45 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: psainty, eggert, 37659

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Thu, 13 Feb 2020 23:23:01 +0100
> Cc: eggert@cs.ucla.edu, psainty@orcon.net.nz, 37659@debbugs.gnu.org
> 
> Now only the 'compositionality' patch remains. As noted it could be done on master, but since it is motivated by user-defined forms which were introduced in Emacs 27, I'd rather like it to be done in that branch.
> 
> It simply seems a bit incomplete otherwise.

The important question is: is "incomplete" really so bad here.  We
have a few features that need to be completed in future releases, so
this one isn't alone.

I don't have a strong opinion either way.





^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-14  7:45                               ` Eli Zaretskii
@ 2020-02-14 16:15                                 ` Paul Eggert
  2020-02-14 20:49                                   ` Mattias Engdegård
  2020-03-01 10:09                                   ` Mattias Engdegård
  0 siblings, 2 replies; 32+ messages in thread
From: Paul Eggert @ 2020-02-14 16:15 UTC (permalink / raw)
  To: Eli Zaretskii, Mattias Engdegård; +Cc: psainty, 37659

On 2/13/20 11:45 PM, Eli Zaretskii wrote:
> I don't have a strong opinion either way.
How about this idea: if the compositionality patch affects only forms introduced 
in Emacs 27, then it's OK to put it into Emacs 27. If it affects forms that have 
been around since Emacs 26, it might be safer to put the patch into the master 
branch. (I haven't looked at the patch in detail.)





^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-14 16:15                                 ` Paul Eggert
@ 2020-02-14 20:49                                   ` Mattias Engdegård
  2020-03-01 10:09                                   ` Mattias Engdegård
  1 sibling, 0 replies; 32+ messages in thread
From: Mattias Engdegård @ 2020-02-14 20:49 UTC (permalink / raw)
  To: Paul Eggert; +Cc: psainty, 37659

14 feb. 2020 kl. 17.15 skrev Paul Eggert <eggert@cs.ucla.edu>:

> How about this idea: if the compositionality patch affects only forms introduced in Emacs 27, then it's OK to put it into Emacs 27. If it affects forms that have been around since Emacs 26, it might be safer to put the patch into the master branch.

It is a reasonable criterion. In theory the patch might affect existing code, since it enables maximal matching for nested 'or' trees of string literals. The expression

(rx (or (or "a" "abc") (or "ab" "abcd")))

will currently not match the whole string "abcd", but with the patch, it would, as if flattened. The reasoning is that this is more useful, composable, and always what the user wants.

Whether this makes a difference for actually existing code is anyone's guess.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* bug#37659: rx additions: anychar, unmatchable, unordered-or
  2020-02-14 16:15                                 ` Paul Eggert
  2020-02-14 20:49                                   ` Mattias Engdegård
@ 2020-03-01 10:09                                   ` Mattias Engdegård
  1 sibling, 0 replies; 32+ messages in thread
From: Mattias Engdegård @ 2020-03-01 10:09 UTC (permalink / raw)
  To: Paul Eggert; +Cc: psainty, 37659

I've belatedly made up my mind about this and now firmly believe that the compositioning xr patch belongs in emacs-27 and have therefore pushed it to that branch. Sorry about the long deliberation.






^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2020-03-01 10:09 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-08  9:36 bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
2019-10-09  8:59 ` Mattias Engdegård
2019-10-11 23:07 ` bug#37659: Mattias Engdegård <mattiase <at> acm.org> Paul Eggert
2019-10-12 10:47   ` Mattias Engdegård
2019-10-13 16:52     ` Paul Eggert
2019-10-13 19:48       ` Mattias Engdegård
2019-10-22 15:14       ` bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
2019-10-22 15:27         ` Robert Pluim
2019-10-22 17:33         ` Paul Eggert
2019-10-23  9:15           ` Mattias Engdegård
2019-10-23 23:14             ` Paul Eggert
2019-10-24  1:56               ` Drew Adams
2019-10-24  9:09                 ` Mattias Engdegård
2019-10-24 14:24                   ` Drew Adams
2019-10-24  9:17                 ` Phil Sainty
2019-10-24 14:32                   ` Drew Adams
2019-10-24  8:58               ` Mattias Engdegård
2019-10-27 11:53                 ` Mattias Engdegård
2020-02-11 12:57           ` Mattias Engdegård
2020-02-11 15:43             ` Eli Zaretskii
2020-02-11 19:17               ` Mattias Engdegård
2020-02-12  0:52                 ` Paul Eggert
2020-02-12 11:22                   ` Mattias Engdegård
2020-02-13 18:38                     ` Mattias Engdegård
2020-02-13 18:50                       ` Paul Eggert
2020-02-13 19:16                         ` Mattias Engdegård
2020-02-13 19:30                           ` Eli Zaretskii
2020-02-13 22:23                             ` Mattias Engdegård
2020-02-14  7:45                               ` Eli Zaretskii
2020-02-14 16:15                                 ` Paul Eggert
2020-02-14 20:49                                   ` Mattias Engdegård
2020-03-01 10:09                                   ` Mattias Engdegård

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).