unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#37659: rx additions: anychar, unmatchable, unordered-or
From: Mattias Engdegård @ 2019-10-08  9:36 UTC
  To: 37659

[-- Attachment #1: Type: text/plain, Size: 969 bytes --]

Three minor rx additions follow:

* Add `anychar' as an alias for `anything': the latter suggests an expression that can match any string, while in reality it only matches a single character. The documentation now uses `anychar' as the preferred name. (`any-char' would also be possible, but is longer.)

* Add `unmatchable' for a never-match regexp. This follows the previously introduced variable `regexp-unmatchable'.

* Add `unordered-or' as a variant of `or' without the left-to-right match order guarantee. It allows unconditional regexp-opt optimisations, and is particularly useful for matching sets of keywords. With rx-let and rx-define, it also has the potential for better compositionality, allowing expressions to be put together from smaller parts.
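The first two items can be tried out with names that already exist without the patches. The following is a minimal sketch, assuming an Emacs that provides `regexp-unmatchable'; it uses `anything' where the patched rx would also accept `anychar', and the variable where the patched rx would accept `unmatchable'.

```elisp
;; Sketch only: `anything' and `regexp-unmatchable' exist without the
;; patches; `anychar' and `unmatchable' are the names proposed above.
(require 'rx)

;; `nonl' refuses a newline, whereas `anything' accepts exactly one
;; character of any kind, newline included.
(string-match-p (rx bos nonl eos) "\n")      ; nil
(string-match-p (rx bos anything eos) "\n")  ; 0

;; It never matches more than one character, hence the proposed name.
(string-match-p (rx bos anything eos) "ab")  ; nil

;; A never-matching regexp fails on every input, the empty string too.
(string-match-p regexp-unmatchable "")       ; nil
(string-match-p regexp-unmatchable "abc")    ; nil
```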

Abstractly: while `or' is associative, `unordered-or' is also commutative.
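To make the ordering point concrete, here is a small sketch that does not need the patches. The names `my-keywords', `my-ordered-regexp' and `my-match' are hypothetical helpers invented for this illustration. The plain alternation stands in for what `or' guarantees, and `regexp-opt' without KEEP-ORDER stands in for what `unordered-or' would be allowed to produce.

```elisp
;; Hypothetical keyword set, borrowed from the test case in patch 3.
(defvar my-keywords '("ab" "abc" "a"))

(defun my-ordered-regexp (strings)
  "Join STRINGS into alternatives tried left to right, as `or' guarantees."
  (mapconcat #'regexp-quote strings "\\|"))

(defun my-match (regexp string)
  "Return the text in STRING matched by REGEXP, or nil."
  (and (string-match regexp string)
       (match-string 0 string)))

;; Ordered alternation: the first alternative that succeeds wins, so
;; the result depends on the order of the arguments.
(my-match (my-ordered-regexp my-keywords) "abc")            ; "ab"
(my-match (my-ordered-regexp (reverse my-keywords)) "abc")  ; "a"

;; Without an order guarantee, regexp-opt is free to factor out common
;; prefixes; it then matches the longest keyword whatever the order.
(my-match (regexp-opt my-keywords) "abc")                   ; "abc"
(my-match (regexp-opt (reverse my-keywords)) "abc")         ; "abc"
```

The last two calls give the same result for either argument order, which is the sense in which the unordered variant is commutative.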

The name `unordered-or' is descriptive but phonetically (and lexically) somewhat weak. Stronger alternatives are welcome.


[-- Attachment #2: 0001-Add-anychar-as-alias-to-anything-in-rx.patch --]
[-- Type: application/octet-stream, Size: 4257 bytes --]

From 0b79693cdace549a8d6edda58cea20232d82faec Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Mon, 7 Oct 2019 18:07:16 +0200
Subject: [PATCH 1/3] Add `anychar' as alias to `anything' in rx

* lisp/emacs-lisp/rx.el (rx--translate-symbol, rx--builtin-symbols, rx):
* test/lisp/emacs-lisp/rx-tests.el (rx-atoms):
* doc/lispref/searching.texi (Rx Constructs):
* etc/NEWS:
Add `anychar', an alias for `anything'.  Since `anychar' is more
descriptive (and slightly shorter), treat it as the preferred name.
---
 doc/lispref/searching.texi       | 3 ++-
 etc/NEWS                         | 4 ++++
 lisp/emacs-lisp/rx.el            | 7 +++----
 test/lisp/emacs-lisp/rx-tests.el | 4 ++--
 4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index a4b6533412..2274bab002 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1220,7 +1220,8 @@ Rx Constructs
 Match any character except a newline.@*
 Corresponding string regexp: @samp{.} (dot)
 
-@item @code{anything}
+@item @code{anychar}, @code{anything}
+@cindex @code{anychar} in rx
 @cindex @code{anything} in rx
 Match any character.@*
 Corresponding string regexp: @samp{.\|\n} (for example)
diff --git a/etc/NEWS b/etc/NEWS
index 906dc912d6..0fea5164a7 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1795,6 +1795,10 @@ at run time, instead of a constant string.
 *** New rx extension mechanism: 'rx-define', 'rx-let', 'rx-let-eval'.
 These macros add new forms to the rx notation.
 
++++
+*** 'anychar' is now an alias for 'anything'
+Both match any single character; 'anychar' is more descriptive.
+
 ** Frames
 
 +++
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 45fec796cc..6c0b206930 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -126,7 +126,6 @@ rx--lookup-def
       (get name 'rx-definition)))
 
 ;; TODO: Additions to consider:
-;; - A better name for `anything', like `any-char' or `anychar'.
 ;; - A name for (or), maybe `unmatchable'.
 ;; - A construct like `or' but without the match order guarantee,
 ;;   maybe `unordered-or'.  Useful for composition or generation of
@@ -138,7 +137,7 @@ rx--translate-symbol
     ;; Use `list' instead of a quoted list to wrap the strings here,
     ;; since the return value may be mutated.
     ((or 'nonl 'not-newline 'any) (cons (list ".") t))
-    ('anything                    (rx--translate-form '(or nonl "\n")))
+    ((or 'anychar 'anything)      (rx--translate-form '(or nonl "\n")))
     ((or 'bol 'line-start)        (cons (list "^") 'lseq))
     ((or 'eol 'line-end)          (cons (list "$") 'rseq))
     ((or 'bos 'string-start 'bot 'buffer-start) (cons (list "\\`") t))
@@ -913,7 +912,7 @@ rx--builtin-forms
   "List of built-in rx function-like symbols.")
 
 (defconst rx--builtin-symbols
-  (append '(nonl not-newline any anything
+  (append '(nonl not-newline any anychar anything
             bol eol line-start line-end
             bos eos string-start string-end
             bow eow word-start word-end
@@ -1016,7 +1015,7 @@ rx
                 can be (any ...), (syntax ...), (category ...),
                 or a character class.
 not-newline     Match any character except a newline.  Alias: nonl.
-anything        Match any character.
+anychar         Match any character.  Alias: anything.
 
 CHARCLASS       Match a character from a character class.  One of:
  alpha, alphabetic, letter   Alphabetic characters (defined by Unicode).
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index 76dcf41942..d4524e5a25 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -184,8 +184,8 @@ rx-repeat
                  "ab")))
 
 (ert-deftest rx-atoms ()
-  (should (equal (rx anything)
-                 ".\\|\n"))
+  (should (equal (rx anychar anything)
+                 "\\(?:.\\|\n\\)\\(?:.\\|\n\\)"))
   (should (equal (rx line-start not-newline nonl any line-end)
                  "^...$"))
   (should (equal (rx bol string-start string-end buffer-start buffer-end
-- 
2.21.0 (Apple Git-122)


[-- Attachment #3: 0002-Add-unmatchable-as-alias-for-or-in-rx.patch --]
[-- Type: application/octet-stream, Size: 4453 bytes --]

From 6ad34bf9aa86aee6539851366c5267fa8a72d929 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Mon, 7 Oct 2019 18:28:18 +0200
Subject: [PATCH 2/3] Add `unmatchable' as alias for (or) in rx

* lisp/emacs-lisp/rx.el (rx--translate-symbol, rx--builtin-symbols, rx):
* test/lisp/emacs-lisp/rx-tests.el (rx-atoms):
* doc/lispref/searching.texi (Rx Constructs):
* etc/NEWS:
Add `unmatchable', more descriptive than (or), and corresponding to
the variable `regexp-unmatchable'.
---
 doc/lispref/searching.texi       | 6 ++++++
 etc/NEWS                         | 1 +
 lisp/emacs-lisp/rx.el            | 5 +++--
 test/lisp/emacs-lisp/rx-tests.el | 2 ++
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 2274bab002..a6c6bf2d4a 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1083,6 +1083,11 @@ Rx Constructs
 Match exactly one of the @var{rx}s, trying from left to right.
 Without arguments, the expression will not match anything at all.@*
 Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
+
+@item @code{unmatchable}
+@cindex @code{unmatchable} in rx
+Refuse any match.  Equivalent to @code{(or)}.
+@xref{regexp-unmatchable}.
 @end table
 
 @subsubheading Repetition
@@ -1806,6 +1811,7 @@ Regexp Functions
 
 @c Internal functions: regexp-opt-group
 
+@anchor{regexp-unmatchable}
 @defvar regexp-unmatchable
 This variable contains a regexp that is guaranteed not to match any
 string at all.  It is particularly useful as default value for
diff --git a/etc/NEWS b/etc/NEWS
index 0fea5164a7..96c26c6623 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1785,6 +1785,7 @@ the 128...255 range, as expected.
 matches the empty string, each being an identity for the operation.
 This also works for their aliases: '|' for 'or'; ':', 'and' and
 'sequence' for 'seq'.
+The symbol 'unmatchable' can be used as an alternative to (or).
 
 ---
 *** 'regexp' and new 'literal' accept arbitrary lisp as arguments.
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 6c0b206930..cf02df239f 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -126,7 +126,6 @@ rx--lookup-def
       (get name 'rx-definition)))
 
 ;; TODO: Additions to consider:
-;; - A name for (or), maybe `unmatchable'.
 ;; - A construct like `or' but without the match order guarantee,
 ;;   maybe `unordered-or'.  Useful for composition or generation of
 ;;   alternatives; permits more effective use of regexp-opt.
@@ -138,6 +137,7 @@ rx--translate-symbol
     ;; since the return value may be mutated.
     ((or 'nonl 'not-newline 'any) (cons (list ".") t))
     ((or 'anychar 'anything)      (rx--translate-form '(or nonl "\n")))
+    ('unmatchable                 (rx--empty))
     ((or 'bol 'line-start)        (cons (list "^") 'lseq))
     ((or 'eol 'line-end)          (cons (list "$") 'rseq))
     ((or 'bos 'string-start 'bot 'buffer-start) (cons (list "\\`") t))
@@ -912,7 +912,7 @@ rx--builtin-forms
   "List of built-in rx function-like symbols.")
 
 (defconst rx--builtin-symbols
-  (append '(nonl not-newline any anychar anything
+  (append '(nonl not-newline any anychar anything unmatchable
             bol eol line-start line-end
             bos eos string-start string-end
             bow eow word-start word-end
@@ -1016,6 +1016,7 @@ rx
                 or a character class.
 not-newline     Match any character except a newline.  Alias: nonl.
 anychar         Match any character.  Alias: anything.
+unmatchable     Never match anything at all.
 
 CHARCLASS       Match a character from a character class.  One of:
  alpha, alphabetic, letter   Alphabetic characters (defined by Unicode).
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index d4524e5a25..903b191c98 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -186,6 +186,8 @@ rx-repeat
 (ert-deftest rx-atoms ()
   (should (equal (rx anychar anything)
                  "\\(?:.\\|\n\\)\\(?:.\\|\n\\)"))
+  (should (equal (rx unmatchable)
+                 "\\`a\\`"))
   (should (equal (rx line-start not-newline nonl any line-end)
                  "^...$"))
   (should (equal (rx bol string-start string-end buffer-start buffer-end
-- 
2.21.0 (Apple Git-122)


[-- Attachment #4: 0003-Add-rx-unordered-or-construct.patch --]
[-- Type: application/octet-stream, Size: 5505 bytes --]

From 55bbcccf26e95bcb69a27431d72a802b7e457a75 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Mon, 7 Oct 2019 19:39:08 +0200
Subject: [PATCH 3/3] Add rx `unordered-or' construct

* lisp/emacs-lisp/rx.el (rx--translate-or, rx--translate-form)
(rx--builtin-forms, rx):
* test/lisp/emacs-lisp/rx-tests.el (rx-unordered-or):
* doc/lispref/searching.texi (Rx Constructs):
* etc/NEWS:
Add `unordered-or', like `or' but with unconstrained matching order.
---
 doc/lispref/searching.texi       |  8 ++++++++
 etc/NEWS                         |  5 +++++
 lisp/emacs-lisp/rx.el            | 16 +++++++---------
 test/lisp/emacs-lisp/rx-tests.el |  8 ++++++++
 4 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index a6c6bf2d4a..faeedc5978 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1084,6 +1084,13 @@ Rx Constructs
 Without arguments, the expression will not match anything at all.@*
 Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
 
+@item @code{(unordered-or @var{rx}@dots{})}
+@cindex @code{unordered-or} in rx
+Like @code{or}, but with unspecified matching order.
+This may be more efficient when the order doesn't matter,
+in particular if all subforms are string literals.
+@xref{regexp-opt}.
+
 @item @code{unmatchable}
 @cindex @code{unmatchable} in rx
 Refuse any match.  Equivalent to @code{(or)}.
@@ -1728,6 +1735,7 @@ Regexp Functions
 any special characters.
 @end defun
 
+@anchor{regexp-opt}
 @cindex optimize regexp
 @defun regexp-opt strings &optional paren keep-order
 This function returns an efficient regular expression that will match
diff --git a/etc/NEWS b/etc/NEWS
index 96c26c6623..cf3ef8183b 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1800,6 +1800,11 @@ These macros add new forms to the rx notation.
 *** 'anychar' is now an alias for 'anything'
 Both match any single character; 'anychar' is more descriptive.
 
++++
+*** New 'unordered-or' rx construct
+It works like 'or', but with unspecified matching order.  It may be
+faster in some cases, especially when the clauses are string literals.
+
 ** Frames
 
 +++
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index cf02df239f..0b14144698 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -125,11 +125,6 @@ rx--lookup-def
   (or (cdr (assq name rx--local-definitions))
       (get name 'rx-definition)))
 
-;; TODO: Additions to consider:
-;; - A construct like `or' but without the match order guarantee,
-;;   maybe `unordered-or'.  Useful for composition or generation of
-;;   alternatives; permits more effective use of regexp-opt.
-
 (defun rx--translate-symbol (sym)
   "Translate an rx symbol.  Return (REGEXP . PRECEDENCE)."
   (pcase sym
@@ -230,8 +225,9 @@ rx--every
     (setq list (cdr list)))
   (null list))
 
-(defun rx--translate-or (body)
+(defun rx--translate-or (body unordered)
   "Translate an or-pattern of one of more rx items.
+If UNORDERED, then matching order is unspecified.
 Return (REGEXP . PRECEDENCE)."
   ;; FIXME: Possible improvements:
   ;;
@@ -268,7 +264,7 @@ rx--translate-or
    ((null (cdr body))              ; Single item.
     (rx--translate (car body)))
    ((rx--every #'stringp body)     ; All strings.
-    (cons (list (regexp-opt body nil t))
+    (cons (list (regexp-opt body nil (not unordered)))
           t))
    (t
     (cons (append (car (rx--translate (car body)))
@@ -835,7 +831,8 @@ rx--translate-form
   (let ((body (cdr form)))
     (pcase (car form)
       ((or 'seq : 'and 'sequence) (rx--translate-seq body))
-      ((or 'or '|)              (rx--translate-or body))
+      ((or 'or '|)              (rx--translate-or body nil))
+      ('unordered-or            (rx--translate-or body t))
       ((or 'any 'in 'char)      (rx--translate-any nil body))
       ('not-char                (rx--translate-any t body))
       ('not                     (rx--translate-not nil body))
@@ -899,7 +896,7 @@ rx--translate-form
                (error "Unknown rx form `%s'" op)))))))))
 
 (defconst rx--builtin-forms
-  '(seq sequence : and or | any in char not-char not
+  '(seq sequence : and or | unordered-or any in char not-char not
     repeat = >= **
     zero-or-more 0+ *
     one-or-more 1+ +
@@ -990,6 +987,7 @@ rx
 
 (seq RX...)    Match the RXs in sequence.  Alias: :, sequence, and.
 (or RX...)     Match one of the RXs.  Alias: |.
+(unordered-or RX...) Match one of the RXs, in unspecified order.
 
 (zero-or-more RX...) Match RXs zero or more times.  Alias: 0+.
 (one-or-more RX...)  Match RXs one or more times.  Alias: 1+.
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index 903b191c98..bced74569f 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -49,6 +49,14 @@ rx-or
   (should (equal (rx (|))
                  "\\`a\\`")))
 
+(ert-deftest rx-unordered-or ()
+  (should (equal (rx (unordered-or "ab" nonl "cd"))
+                 "ab\\|.\\|cd"))
+  (should (equal (rx (unordered-or "ab" "abc" "a"))
+                 "\\(?:a\\(?:bc?\\)?\\)"))
+  (should (equal (rx (unordered-or))
+                 "\\`a\\`")))
+
 (ert-deftest rx-char-any ()
   "Test character alternatives with `]' and `-' (Bug#25123)."
   (should (equal
-- 
2.21.0 (Apple Git-122)


Thread overview: 32+ messages
2019-10-08  9:36 bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
2019-10-09  8:59 ` Mattias Engdegård
2019-10-11 23:07 ` bug#37659: Mattias Engdegård <mattiase <at> acm.org> Paul Eggert
2019-10-12 10:47   ` Mattias Engdegård
2019-10-13 16:52     ` Paul Eggert
2019-10-13 19:48       ` Mattias Engdegård
2019-10-22 15:14       ` bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
2019-10-22 15:27         ` Robert Pluim
2019-10-22 17:33         ` Paul Eggert
2019-10-23  9:15           ` Mattias Engdegård
2019-10-23 23:14             ` Paul Eggert
2019-10-24  1:56               ` Drew Adams
2019-10-24  9:09                 ` Mattias Engdegård
2019-10-24 14:24                   ` Drew Adams
2019-10-24  9:17                 ` Phil Sainty
2019-10-24 14:32                   ` Drew Adams
2019-10-24  8:58               ` Mattias Engdegård
2019-10-27 11:53                 ` Mattias Engdegård
2020-02-11 12:57           ` Mattias Engdegård
2020-02-11 15:43             ` Eli Zaretskii
2020-02-11 19:17               ` Mattias Engdegård
2020-02-12  0:52                 ` Paul Eggert
2020-02-12 11:22                   ` Mattias Engdegård
2020-02-13 18:38                     ` Mattias Engdegård
2020-02-13 18:50                       ` Paul Eggert
2020-02-13 19:16                         ` Mattias Engdegård
2020-02-13 19:30                           ` Eli Zaretskii
2020-02-13 22:23                             ` Mattias Engdegård
2020-02-14  7:45                               ` Eli Zaretskii
2020-02-14 16:15                                 ` Paul Eggert
2020-02-14 20:49                                   ` Mattias Engdegård
2020-03-01 10:09                                   ` Mattias Engdegård
