unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
* bug#37849: composable character alternatives in rx
@ 2019-10-21 10:24 Mattias Engdegård
  2019-10-27  9:17 ` Mattias Engdegård
  2019-12-06 21:58 ` Mattias Engdegård
  0 siblings, 2 replies; 7+ messages in thread
From: Mattias Engdegård @ 2019-10-21 10:24 UTC (permalink / raw)
  To: 37849

[-- Attachment #1: Type: text/plain, Size: 1762 bytes --]

Now that rx is user-extendible, some holes are showing. Example (from python.el):

      (simple-operator      . ,(rx (any ?+ ?- ?/ ?& ?^ ?~ ?| ?* ?< ?> ?= ?%)))
      ;; FIXME: rx should support (not simple-operator).
      (not-simple-operator  . ,(rx
                                (not
                                 (any ?+ ?- ?/ ?& ?^ ?~ ?| ?* ?< ?> ?= ?%))))

(This code uses the old rx-constituents mechanism, but the point applies equally to new-style definitions.)
More generally, there is currently no way to:

(1) Get the complement of a defined (any ...) form
(2) Get the union of two defined (any ...) forms
(3) Get the intersection of two defined (not (any ...)) forms

(1), which the example above was about, could be solved by expanding definitions inside 'not'. This is a step away from the principle that user-defined things are only allowed where general rx forms are, but perhaps tolerable. Proposed patch attached.

(2) can be solved by expanding definitions inside 'any', and allowing 'any' inside 'any' (flattening). Not sure I like this.

An alternative is to ensure that (or (any X) (any Y)) -> (any X Y), but then we either need to allow 'or' inside 'not', or add an intersection operator:

  (intersect (not (any X)) (not (any Y)) -> (not (any X Y))

We could also make 'not' variadic, turning it into complement-of-union:

  (not (any A) (any B)) -> (not (any A B))

Olin Shivers's SRE has a complete and closed set of operations on character sets (https://scsh.net/docu/post/sre.html). That would be principled and perhaps useful, but difficult to do fully in rx because not all such expressions can be rendered into Emacs regexps. Nothing prevents us from making a partial implementation, however.


[-- Attachment #2: 0001-Expand-rx-definitions-inside-not.patch --]
[-- Type: application/octet-stream, Size: 9231 bytes --]

From a2f7d4fbe0b1d37c233e0beffc4b2b8fd4df3013 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 18 Oct 2019 16:03:20 +0200
Subject: [PATCH] Expand rx definitions inside (not ...)

* lisp/emacs-lisp/rx.el (rx--lookup-def, rx--expand-def)
(rx--translate-symbol, rx--translate-any, rx--translate-form):
* test/lisp/emacs-lisp/rx-tests.el (rx-not, rx-def-in-not):
* doc/lispref/searching.texi (Rx Constructs, Extending Rx):
Allow user-defined rx constructs to be expanded inside (not ...)
forms, for better composability.
---
 doc/lispref/searching.texi       |   4 +-
 lisp/emacs-lisp/rx.el            | 100 ++++++++++++++++++-------------
 test/lisp/emacs-lisp/rx-tests.el |  17 +++++-
 3 files changed, 77 insertions(+), 44 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 5178575a3b..74b15cfc7f 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1214,7 +1214,7 @@ Rx Constructs
 @item @code{(not @var{charspec})}
 @cindex @code{not} in rx
 Match a character not included in @var{charspec}.  @var{charspec} can
-be an @code{any}, @code{syntax} or @code{category} form, or a
+be an @code{any}, @code{not}, @code{syntax} or @code{category} form, or a
 character class.@*
 Corresponding string regexp: @samp{[^@dots{}]}, @samp{\S@var{code}},
 @samp{\C@var{code}}
@@ -1581,7 +1581,7 @@ Extending Rx
 User-defined forms are allowed wherever arbitrary @code{rx}
 expressions are expected; for example, in the body of a
 @code{zero-or-one} form, but not inside @code{any} or @code{category}
-forms.
+forms.  They are also allowed inside @code{not} forms.
 @end itemize
 
 @defmac rx-define name [arglist] rx-form
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 006a393921..8d8db5f3c4 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -122,9 +122,27 @@ rx--local-definitions
                     as the rx form DEF (which can contain members of ARGS).")
 
 (defsubst rx--lookup-def (name)
+  "Current definition of NAME: (DEF) or (ARGS DEF), or nil if none."
   (or (cdr (assq name rx--local-definitions))
       (get name 'rx-definition)))
 
+(defun rx--expand-def (form)
+  "FORM expanded (once) if a user-defined construct; otherwise nil."
+  (cond ((symbolp form)
+         (let ((def (rx--lookup-def form)))
+           (and def
+                (if (cdr def)
+                    (error "Not an `rx' symbol definition: %s" form)
+                  (car def)))))
+        ((consp form)
+         (let* ((op (car form))
+                (def (rx--lookup-def op)))
+           (and def
+                (if (cdr def)
+                    (rx--expand-template
+                     op (cdr form) (nth 0 def) (nth 1 def))
+                  (error "Not an `rx' form definition: %s" op)))))))
+
 ;; TODO: Additions to consider:
 ;; - A construct like `or' but without the match order guarantee,
 ;;   maybe `unordered-or'.  Useful for composition or generation of
@@ -155,11 +173,8 @@ rx--translate-symbol
       ((let ((class (cdr (assq sym rx--char-classes))))
          (and class (cons (list (concat "[[:" (symbol-name class) ":]]")) t))))
 
-      ((let ((definition (rx--lookup-def sym)))
-         (and definition
-              (if (cdr definition)
-                  (error "Not an `rx' symbol definition: %s" sym)
-                (rx--translate (nth 0 definition))))))
+      ((let ((expanded (rx--expand-def sym)))
+         (and expanded (rx--translate expanded))))
 
       ;; For compatibility with old rx.
       ((let ((entry (assq sym rx-constituents)))
@@ -445,21 +460,26 @@ rx--translate-not
     (error "rx `not' form takes exactly one argument"))
   (let ((arg (car body)))
     (cond
-     ((consp arg)
-      (pcase (car arg)
-        ((or 'any 'in 'char) (rx--translate-any      (not negated) (cdr arg)))
-        ('syntax             (rx--translate-syntax   (not negated) (cdr arg)))
-        ('category           (rx--translate-category (not negated) (cdr arg)))
-        ('not                (rx--translate-not      (not negated) (cdr arg)))
-        (_ (error "Illegal argument to rx `not': %S" arg))))
+     ((and (consp arg)
+           (pcase (car arg)
+             ((or 'any 'in 'char)
+              (rx--translate-any      (not negated) (cdr arg)))
+             ('syntax
+              (rx--translate-syntax   (not negated) (cdr arg)))
+             ('category
+              (rx--translate-category (not negated) (cdr arg)))
+             ('not
+              (rx--translate-not      (not negated) (cdr arg))))))
+     ((let ((class (cdr (assq arg rx--char-classes))))
+        (and class
+             (rx--translate-any (not negated) (list class)))))
      ((eq arg 'word-boundary)
       (rx--translate-symbol
        (if negated 'word-boundary 'not-word-boundary)))
-     (t
-      (let ((class (cdr (assq arg rx--char-classes))))
-        (if class
-            (rx--translate-any (not negated) (list class))
-          (error "Illegal argument to rx `not': %s" arg)))))))
+     ((let ((expanded (rx--expand-def arg)))
+        (and expanded
+             (rx--translate-not negated (list expanded)))))
+     (t (error "Illegal argument to rx `not': %S" arg)))))
 
 (defun rx--atomic-regexp (item)
   "ITEM is (REGEXP . PRECEDENCE); return a regexp of precedence t."
@@ -873,30 +893,28 @@ rx--translate-form
       ((or 'regexp 'regex)      (rx--translate-regexp body))
 
       (op
-       (unless (symbolp op)
-         (error "Bad rx operator `%S'" op))
-       (let ((definition (rx--lookup-def op)))
-         (if definition
-             (if (cdr definition)
-                 (rx--translate
-                  (rx--expand-template
-                   op body (nth 0 definition) (nth 1 definition)))
-               (error "Not an `rx' form definition: %s" op))
-
-           ;; For compatibility with old rx.
-           (let ((entry (assq op rx-constituents)))
-             (if (progn
-                   (while (and entry (not (consp (cdr entry))))
-                     (setq entry
-                           (if (symbolp (cdr entry))
-                               ;; Alias for another entry.
-                               (assq (cdr entry) rx-constituents)
-                             ;; Wrong type, try further down the list.
-                             (assq (car entry)
-                                   (cdr (memq entry rx-constituents))))))
-                   entry)
-                 (rx--translate-compat-form (cdr entry) form)
-               (error "Unknown rx form `%s'" op)))))))))
+       (cond
+        ((not (symbolp op)) (error "Bad rx operator `%S'" op))
+
+        ((let ((expanded (rx--expand-def form)))
+           (and expanded
+                (rx--translate expanded))))
+
+        ;; For compatibility with old rx.
+        ((let ((entry (assq op rx-constituents)))
+           (and (progn
+                  (while (and entry (not (consp (cdr entry))))
+                    (setq entry
+                          (if (symbolp (cdr entry))
+                              ;; Alias for another entry.
+                              (assq (cdr entry) rx-constituents)
+                            ;; Wrong type, try further down the list.
+                            (assq (car entry)
+                                  (cdr (memq entry rx-constituents))))))
+                  entry)
+                (rx--translate-compat-form (cdr entry) form))))
+
+        (t (error "Unknown rx form `%s'" op)))))))
 
 (defconst rx--builtin-forms
   '(seq sequence : and or | any in char not-char not
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index ef2541d83a..4ecc805aea 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -268,7 +268,9 @@ rx-not
   (should (equal (rx (not (syntax punctuation)) (not (syntax escape)))
                  "\\S.\\S\\"))
   (should (equal (rx (not (category tone-mark)) (not (category lao)))
-                 "\\C4\\Co")))
+                 "\\C4\\Co"))
+  (should (equal (rx (not (not ascii)) (not (not (not (any "a-z")))))
+                 "[[:ascii:]][^a-z]")))
 
 (ert-deftest rx-group ()
   (should (equal (rx (group nonl) (submatch "x")
@@ -404,6 +406,19 @@ rx-redefine-builtin
   (should-error (rx-let-eval '((not-char () "x")) nil))
   (should-error (rx-let-eval '((not-char "x")) nil)))
 
+(ert-deftest rx-def-in-not ()
+  "Test definition expansion inside (not ...)."
+  (rx-let ((a alpha)
+           (b (not hex))
+           (c (not (category base)))
+           (d (x) (any ?a x ?z))
+           (e (x) (syntax x))
+           (f (not b)))
+    (should (equal (rx (not a) (not b) (not c) (not f))
+                   "[^[:alpha:]][[:xdigit:]]\\c.[^[:xdigit:]]"))
+    (should (equal (rx (not (d ?m)) (not (e symbol)))
+                   "[^amz]\\S_"))))
+
 (ert-deftest rx-constituents ()
   (let ((rx-constituents
          (append '((beta . gamma)
-- 
2.21.0 (Apple Git-122)


^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-12-13 12:35 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-21 10:24 bug#37849: composable character alternatives in rx Mattias Engdegård
2019-10-27  9:17 ` Mattias Engdegård
2019-12-06 21:58 ` Mattias Engdegård
2019-12-09 11:04   ` Mattias Engdegård
2019-12-09 13:36     ` Eli Zaretskii
2019-12-10 21:39       ` Mattias Engdegård
2019-12-13 12:35         ` Mattias Engdegård

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).