unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
From: "Mattias Engdegård" <mattiase@acm.org>
To: Eli Zaretskii <eliz@gnu.org>
Cc: psainty@orcon.net.nz, Paul Eggert <eggert@cs.ucla.edu>,
	37659@debbugs.gnu.org
Subject: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Tue, 11 Feb 2020 20:17:50 +0100	[thread overview]
Message-ID: <E3B5D0A3-E7EF-4238-93D6-4CB0C55476A6@acm.org> (raw)
In-Reply-To: <838sl9ruxe.fsf@gnu.org>

[-- Attachment #1: Type: text/plain, Size: 321 bytes --]

11 feb. 2020 kl. 16.43 skrev Eli Zaretskii <eliz@gnu.org>:

> Can't say I'm happy with these last-minute experiments, but okay.

Thanks, and I think it's actually a lesser experiment than status quo ante.
I'm allowing for more comments before pushing it; meanwhile, here is the follow-up patch mentioned earlier.


[-- Attachment #2: 0001-rx-Improve-or-compositionality.patch --]
[-- Type: application/octet-stream, Size: 7758 bytes --]

From dc4aadfdacc2466420833281c5374279eee03aca Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 11 Feb 2020 20:04:42 +0100
Subject: [PATCH] rx: Improve 'or' compositionality

Perform 'regexp-opt' on nested 'or' forms, and after expansion of
user-defined and 'eval' forms.  Characters are now turned into
strings for wider 'regexp-opt' scope.

* doc/lispref/searching.texi (Rx Constructs): Document.
* lisp/emacs-lisp/rx.el (rx--normalise-or-arg)
rx--all-string-or-args): New.
(rx--translate-or): Normalise arguments first, and check for strings
in subforms.
(rx--expand-eval): Extracted from rx--translate-eval.
(rx--translate-eval): Call rx--expand-eval.
---
 doc/lispref/searching.texi       |  5 ++-
 lisp/emacs-lisp/rx.el            | 76 ++++++++++++++++++++------------
 test/lisp/emacs-lisp/rx-tests.el | 13 +++++-
 3 files changed, 63 insertions(+), 31 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 5f4509a8b4..526ee77d40 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1081,8 +1081,9 @@ Rx Constructs
 @itemx @code{(| @var{rx}@dots{})}
 @cindex @code{|} in rx
 Match exactly one of the @var{rx}s.
-If all arguments are string literals, the longest possible match
-will always be used.  Otherwise, either the longest match or the
+If all arguments are strings, characters, or @code{or} forms
+so constrained, the longest possible match will always be used.
+Otherwise, either the longest match or the
 first (in left-to-right order) will be used.
 Without arguments, the expression will not match anything at all.@*
 Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index b4cab5715d..7a7d09f10b 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -254,22 +254,39 @@ rx--foldl
     (setq l (cdr l)))
   x)
 
+(defun rx--normalise-or-arg (form)
+  "Normalise the `or' argument FORM.
+Characters become strings, user-definitions and `eval' forms are expanded,
+and `or' forms are normalised recursively."
+  (cond ((characterp form)
+         (char-to-string form))
+        ((and (consp form) (memq (car form) '(or |)))
+         (cons (car form) (mapcar #'rx--normalise-or-arg (cdr form))))
+        ((and (consp form) (eq (car form) 'eval))
+         (rx--normalise-or-arg (rx--expand-eval (cdr form))))
+        (t
+         (let ((expanded (rx--expand-def form)))
+           (if expanded
+               (rx--normalise-or-arg expanded)
+             form)))))
+
+(defun rx--all-string-or-args (body)
+  "If BODY only consists of strings or such `or' forms, return all the strings.
+Otherwise throw `rx--nonstring'."
+  (mapcan (lambda (form)
+            (cond ((stringp form) (list form))
+                  ((and (consp form) (memq (car form) '(or |)))
+                   (rx--all-string-or-args (cdr form)))
+                  (t (throw 'rx--nonstring nil))))
+          body))
+
 (defun rx--translate-or (body)
   "Translate an or-pattern of zero or more rx items.
 Return (REGEXP . PRECEDENCE)."
   ;; FIXME: Possible improvements:
   ;;
-  ;; - Turn single characters to strings: (or ?a ?b) -> (or "a" "b"),
-  ;;   so that they can be candidates for regexp-opt.
-  ;;
-  ;; - Translate compile-time strings (`eval' forms), again for regexp-opt.
-  ;;
   ;; - Flatten sub-patterns first: (or (or A B) (or C D)) -> (or A B C D)
-  ;;   in order to improve effectiveness of regexp-opt.
-  ;;   This would also help composability.
-  ;;
-  ;; - Use associativity to run regexp-opt on contiguous subsets of arguments
-  ;;   if not all of them are strings.  Example:
+  ;;   Then call regexp-opt on runs of string arguments. Example:
   ;;   (or (+ digit) "CHARLIE" "CHAN" (+ blank))
   ;;   -> (or (+ digit) (or "CHARLIE" "CHAN") (+ blank))
   ;;
@@ -279,27 +296,26 @@ rx--translate-or
   ;;   so that (or "@" "%" digit (any "A-Z" space) (syntax word))
   ;;        -> (any "@" "%" digit "A-Z" space word)
   ;;        -> "[A-Z@%[:digit:][:space:][:word:]]"
-  ;;
-  ;; Problem: If a subpattern is carefully written to be
-  ;; optimizable by regexp-opt, how do we prevent the transforms
-  ;; above from destroying that property?
-  ;; Example: (or "a" (or "abc" "abd" "abe"))
   (cond
    ((null body)                    ; No items: a never-matching regexp.
     (rx--empty))
    ((null (cdr body))              ; Single item.
     (rx--translate (car body)))
-   ((rx--every #'stringp body)     ; All strings.
-    (cons (list (regexp-opt body nil))
-          t))
-   ((rx--every #'rx--charset-p body)  ; All charsets.
-    (rx--translate-union nil body))
    (t
-    (cons (append (car (rx--translate (car body)))
-                  (mapcan (lambda (item)
-                            (cons "\\|" (car (rx--translate item))))
-                          (cdr body)))
-          nil))))
+    (let* ((args (mapcar #'rx--normalise-or-arg body))
+           (all-strings (catch 'rx--nonstring (rx--all-string-or-args args))))
+      (cond
+       (all-strings                       ; Only strings.
+        (cons (list (regexp-opt all-strings nil))
+              t))
+       ((rx--every #'rx--charset-p args)  ; All charsets.
+        (rx--translate-union nil args))
+       (t
+        (cons (append (car (rx--translate (car args)))
+                      (mapcan (lambda (item)
+                                (cons "\\|" (car (rx--translate item))))
+                              (cdr args)))
+              nil)))))))
 
 (defun rx--charset-p (form)
   "Whether FORM looks like a charset, only consisting of character intervals
@@ -836,11 +852,15 @@ rx--translate-literal
            (cons (list (list 'regexp-quote arg)) 'seq))
           (t (error "rx `literal' form with non-string argument")))))
 
-(defun rx--translate-eval (body)
-  "Translate the `eval' form.  Return (REGEXP . PRECEDENCE)."
+(defun rx--expand-eval (body)
+  "Expand `eval' arguments.  Return a new rx form."
   (unless (and body (null (cdr body)))
     (error "rx `eval' form takes exactly one argument"))
-  (rx--translate (eval (car body))))
+  (eval (car body)))
+
+(defun rx--translate-eval (body)
+  "Translate the `eval' form.  Return (REGEXP . PRECEDENCE)."
+  (rx--translate (rx--expand-eval body)))
 
 (defvar rx--regexp-atomic-regexp nil)
 
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index a6c172adfe..4db6eb8c9c 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -42,13 +42,24 @@ rx-seq
 (ert-deftest rx-or ()
   (should (equal (rx (or "ab" (| "c" nonl) "de"))
                  "ab\\|c\\|.\\|de"))
-  (should (equal (rx (or "ab" "abc" "a"))
+  (should (equal (rx (or "ab" "abc" ?a))
                  "\\(?:a\\(?:bc?\\)?\\)"))
+  (should (equal (rx (or "ab" (| (or "abcd" "abcde")) (or "a" "abc")))
+                 "\\(?:a\\(?:b\\(?:c\\(?:de?\\)?\\)?\\)?\\)"))
+  (should (equal (rx (or "a" (eval (string ?a ?b))))
+                 "\\(?:ab?\\)"))
   (should (equal (rx (| nonl "a") (| "b" blank))
                  "\\(?:.\\|a\\)\\(?:b\\|[[:blank:]]\\)"))
   (should (equal (rx (|))
                  "\\`a\\`")))
 
+(ert-deftest rx-def-in-or ()
+  (rx-let ((a b)
+           (b (or "abc" c))
+           (c ?a))
+    (should (equal (rx (or a (| "ab" "abcde") "abcd"))
+                   "\\(?:a\\(?:b\\(?:c\\(?:de?\\)?\\)?\\)?\\)"))))
+
 (ert-deftest rx-char-any ()
   "Test character alternatives with `]' and `-' (Bug#25123)."
   (should (equal
-- 
2.21.1 (Apple Git-122.3)


  reply	other threads:[~2020-02-11 19:17 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-08  9:36 bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
2019-10-09  8:59 ` Mattias Engdegård
2019-10-11 23:07 ` bug#37659: Mattias Engdegård <mattiase <at> acm.org> Paul Eggert
2019-10-12 10:47   ` Mattias Engdegård
2019-10-13 16:52     ` Paul Eggert
2019-10-13 19:48       ` Mattias Engdegård
2019-10-22 15:14       ` bug#37659: rx additions: anychar, unmatchable, unordered-or Mattias Engdegård
2019-10-22 15:27         ` Robert Pluim
2019-10-22 17:33         ` Paul Eggert
2019-10-23  9:15           ` Mattias Engdegård
2019-10-23 23:14             ` Paul Eggert
2019-10-24  1:56               ` Drew Adams
2019-10-24  9:09                 ` Mattias Engdegård
2019-10-24 14:24                   ` Drew Adams
2019-10-24  9:17                 ` Phil Sainty
2019-10-24 14:32                   ` Drew Adams
2019-10-24  8:58               ` Mattias Engdegård
2019-10-27 11:53                 ` Mattias Engdegård
2020-02-11 12:57           ` Mattias Engdegård
2020-02-11 15:43             ` Eli Zaretskii
2020-02-11 19:17               ` Mattias Engdegård [this message]
2020-02-12  0:52                 ` Paul Eggert
2020-02-12 11:22                   ` Mattias Engdegård
2020-02-13 18:38                     ` Mattias Engdegård
2020-02-13 18:50                       ` Paul Eggert
2020-02-13 19:16                         ` Mattias Engdegård
2020-02-13 19:30                           ` Eli Zaretskii
2020-02-13 22:23                             ` Mattias Engdegård
2020-02-14  7:45                               ` Eli Zaretskii
2020-02-14 16:15                                 ` Paul Eggert
2020-02-14 20:49                                   ` Mattias Engdegård
2020-03-01 10:09                                   ` Mattias Engdegård

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=E3B5D0A3-E7EF-4238-93D6-4CB0C55476A6@acm.org \
    --to=mattiase@acm.org \
    --cc=37659@debbugs.gnu.org \
    --cc=eggert@cs.ucla.edu \
    --cc=eliz@gnu.org \
    --cc=psainty@orcon.net.nz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).