unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
* bug#37849: composable character alternatives in rx
@ 2019-10-21 10:24 Mattias Engdegård
  2019-10-27  9:17 ` Mattias Engdegård
  2019-12-06 21:58 ` Mattias Engdegård
  0 siblings, 2 replies; 7+ messages in thread
From: Mattias Engdegård @ 2019-10-21 10:24 UTC (permalink / raw)
  To: 37849

[-- Attachment #1: Type: text/plain, Size: 1762 bytes --]

Now that rx is user-extendible, some holes are showing. Example (from python.el):

      (simple-operator      . ,(rx (any ?+ ?- ?/ ?& ?^ ?~ ?| ?* ?< ?> ?= ?%)))
      ;; FIXME: rx should support (not simple-operator).
      (not-simple-operator  . ,(rx
                                (not
                                 (any ?+ ?- ?/ ?& ?^ ?~ ?| ?* ?< ?> ?= ?%))))

(This code uses the old rx-constituents mechanism, but the point applies equally to new-style definitions.)
More generally, there is currently no way to:

(1) Get the complement of a defined (any ...) form
(2) Get the union of two defined (any ...) forms
(3) Get the intersection of two defined (not (any ...)) forms

(1), which the example above was about, could be solved by expanding definitions inside 'not'. This is a step away from the principle that user-defined things are only allowed where general rx forms are, but perhaps tolerable. Proposed patch attached.

(2) can be solved by expanding definitions inside 'any', and allowing 'any' inside 'any' (flattening). Not sure I like this.

An alternative is to ensure that (or (any X) (any Y)) -> (any X Y), but then we either need to allow 'or' inside 'not', or add an intersection operator:

  (intersect (not (any X)) (not (any Y)) -> (not (any X Y))

We could also make 'not' variadic, turning it into complement-of-union:

  (not (any A) (any B)) -> (not (any A B))

Olin Shivers's SRE has a complete and closed set of operations on character sets (https://scsh.net/docu/post/sre.html). That would be principled and perhaps useful, but difficult to do fully in rx because not all such expressions can be rendered into Emacs regexps. Nothing prevents us from making a partial implementation, however.


[-- Attachment #2: 0001-Expand-rx-definitions-inside-not.patch --]
[-- Type: application/octet-stream, Size: 9231 bytes --]

From a2f7d4fbe0b1d37c233e0beffc4b2b8fd4df3013 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 18 Oct 2019 16:03:20 +0200
Subject: [PATCH] Expand rx definitions inside (not ...)

* lisp/emacs-lisp/rx.el (rx--lookup-def, rx--expand-def)
(rx--translate-symbol, rx--translate-any, rx--translate-form):
* test/lisp/emacs-lisp/rx-tests.el (rx-not, rx-def-in-not):
* doc/lispref/searching.texi (Rx Constructs, Extending Rx):
Allow user-defined rx constructs to be expanded inside (not ...)
forms, for better composability.
---
 doc/lispref/searching.texi       |   4 +-
 lisp/emacs-lisp/rx.el            | 100 ++++++++++++++++++-------------
 test/lisp/emacs-lisp/rx-tests.el |  17 +++++-
 3 files changed, 77 insertions(+), 44 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 5178575a3b..74b15cfc7f 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1214,7 +1214,7 @@ Rx Constructs
 @item @code{(not @var{charspec})}
 @cindex @code{not} in rx
 Match a character not included in @var{charspec}.  @var{charspec} can
-be an @code{any}, @code{syntax} or @code{category} form, or a
+be an @code{any}, @code{not}, @code{syntax} or @code{category} form, or a
 character class.@*
 Corresponding string regexp: @samp{[^@dots{}]}, @samp{\S@var{code}},
 @samp{\C@var{code}}
@@ -1581,7 +1581,7 @@ Extending Rx
 User-defined forms are allowed wherever arbitrary @code{rx}
 expressions are expected; for example, in the body of a
 @code{zero-or-one} form, but not inside @code{any} or @code{category}
-forms.
+forms.  They are also allowed inside @code{not} forms.
 @end itemize
 
 @defmac rx-define name [arglist] rx-form
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 006a393921..8d8db5f3c4 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -122,9 +122,27 @@ rx--local-definitions
                     as the rx form DEF (which can contain members of ARGS).")
 
 (defsubst rx--lookup-def (name)
+  "Current definition of NAME: (DEF) or (ARGS DEF), or nil if none."
   (or (cdr (assq name rx--local-definitions))
       (get name 'rx-definition)))
 
+(defun rx--expand-def (form)
+  "FORM expanded (once) if a user-defined construct; otherwise nil."
+  (cond ((symbolp form)
+         (let ((def (rx--lookup-def form)))
+           (and def
+                (if (cdr def)
+                    (error "Not an `rx' symbol definition: %s" form)
+                  (car def)))))
+        ((consp form)
+         (let* ((op (car form))
+                (def (rx--lookup-def op)))
+           (and def
+                (if (cdr def)
+                    (rx--expand-template
+                     op (cdr form) (nth 0 def) (nth 1 def))
+                  (error "Not an `rx' form definition: %s" op)))))))
+
 ;; TODO: Additions to consider:
 ;; - A construct like `or' but without the match order guarantee,
 ;;   maybe `unordered-or'.  Useful for composition or generation of
@@ -155,11 +173,8 @@ rx--translate-symbol
       ((let ((class (cdr (assq sym rx--char-classes))))
          (and class (cons (list (concat "[[:" (symbol-name class) ":]]")) t))))
 
-      ((let ((definition (rx--lookup-def sym)))
-         (and definition
-              (if (cdr definition)
-                  (error "Not an `rx' symbol definition: %s" sym)
-                (rx--translate (nth 0 definition))))))
+      ((let ((expanded (rx--expand-def sym)))
+         (and expanded (rx--translate expanded))))
 
       ;; For compatibility with old rx.
       ((let ((entry (assq sym rx-constituents)))
@@ -445,21 +460,26 @@ rx--translate-not
     (error "rx `not' form takes exactly one argument"))
   (let ((arg (car body)))
     (cond
-     ((consp arg)
-      (pcase (car arg)
-        ((or 'any 'in 'char) (rx--translate-any      (not negated) (cdr arg)))
-        ('syntax             (rx--translate-syntax   (not negated) (cdr arg)))
-        ('category           (rx--translate-category (not negated) (cdr arg)))
-        ('not                (rx--translate-not      (not negated) (cdr arg)))
-        (_ (error "Illegal argument to rx `not': %S" arg))))
+     ((and (consp arg)
+           (pcase (car arg)
+             ((or 'any 'in 'char)
+              (rx--translate-any      (not negated) (cdr arg)))
+             ('syntax
+              (rx--translate-syntax   (not negated) (cdr arg)))
+             ('category
+              (rx--translate-category (not negated) (cdr arg)))
+             ('not
+              (rx--translate-not      (not negated) (cdr arg))))))
+     ((let ((class (cdr (assq arg rx--char-classes))))
+        (and class
+             (rx--translate-any (not negated) (list class)))))
      ((eq arg 'word-boundary)
       (rx--translate-symbol
        (if negated 'word-boundary 'not-word-boundary)))
-     (t
-      (let ((class (cdr (assq arg rx--char-classes))))
-        (if class
-            (rx--translate-any (not negated) (list class))
-          (error "Illegal argument to rx `not': %s" arg)))))))
+     ((let ((expanded (rx--expand-def arg)))
+        (and expanded
+             (rx--translate-not negated (list expanded)))))
+     (t (error "Illegal argument to rx `not': %S" arg)))))
 
 (defun rx--atomic-regexp (item)
   "ITEM is (REGEXP . PRECEDENCE); return a regexp of precedence t."
@@ -873,30 +893,28 @@ rx--translate-form
       ((or 'regexp 'regex)      (rx--translate-regexp body))
 
       (op
-       (unless (symbolp op)
-         (error "Bad rx operator `%S'" op))
-       (let ((definition (rx--lookup-def op)))
-         (if definition
-             (if (cdr definition)
-                 (rx--translate
-                  (rx--expand-template
-                   op body (nth 0 definition) (nth 1 definition)))
-               (error "Not an `rx' form definition: %s" op))
-
-           ;; For compatibility with old rx.
-           (let ((entry (assq op rx-constituents)))
-             (if (progn
-                   (while (and entry (not (consp (cdr entry))))
-                     (setq entry
-                           (if (symbolp (cdr entry))
-                               ;; Alias for another entry.
-                               (assq (cdr entry) rx-constituents)
-                             ;; Wrong type, try further down the list.
-                             (assq (car entry)
-                                   (cdr (memq entry rx-constituents))))))
-                   entry)
-                 (rx--translate-compat-form (cdr entry) form)
-               (error "Unknown rx form `%s'" op)))))))))
+       (cond
+        ((not (symbolp op)) (error "Bad rx operator `%S'" op))
+
+        ((let ((expanded (rx--expand-def form)))
+           (and expanded
+                (rx--translate expanded))))
+
+        ;; For compatibility with old rx.
+        ((let ((entry (assq op rx-constituents)))
+           (and (progn
+                  (while (and entry (not (consp (cdr entry))))
+                    (setq entry
+                          (if (symbolp (cdr entry))
+                              ;; Alias for another entry.
+                              (assq (cdr entry) rx-constituents)
+                            ;; Wrong type, try further down the list.
+                            (assq (car entry)
+                                  (cdr (memq entry rx-constituents))))))
+                  entry)
+                (rx--translate-compat-form (cdr entry) form))))
+
+        (t (error "Unknown rx form `%s'" op)))))))
 
 (defconst rx--builtin-forms
   '(seq sequence : and or | any in char not-char not
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index ef2541d83a..4ecc805aea 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -268,7 +268,9 @@ rx-not
   (should (equal (rx (not (syntax punctuation)) (not (syntax escape)))
                  "\\S.\\S\\"))
   (should (equal (rx (not (category tone-mark)) (not (category lao)))
-                 "\\C4\\Co")))
+                 "\\C4\\Co"))
+  (should (equal (rx (not (not ascii)) (not (not (not (any "a-z")))))
+                 "[[:ascii:]][^a-z]")))
 
 (ert-deftest rx-group ()
   (should (equal (rx (group nonl) (submatch "x")
@@ -404,6 +406,19 @@ rx-redefine-builtin
   (should-error (rx-let-eval '((not-char () "x")) nil))
   (should-error (rx-let-eval '((not-char "x")) nil)))
 
+(ert-deftest rx-def-in-not ()
+  "Test definition expansion inside (not ...)."
+  (rx-let ((a alpha)
+           (b (not hex))
+           (c (not (category base)))
+           (d (x) (any ?a x ?z))
+           (e (x) (syntax x))
+           (f (not b)))
+    (should (equal (rx (not a) (not b) (not c) (not f))
+                   "[^[:alpha:]][[:xdigit:]]\\c.[^[:xdigit:]]"))
+    (should (equal (rx (not (d ?m)) (not (e symbol)))
+                   "[^amz]\\S_"))))
+
 (ert-deftest rx-constituents ()
   (let ((rx-constituents
          (append '((beta . gamma)
-- 
2.21.0 (Apple Git-122)


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* bug#37849: composable character alternatives in rx
  2019-10-21 10:24 bug#37849: composable character alternatives in rx Mattias Engdegård
@ 2019-10-27  9:17 ` Mattias Engdegård
  2019-12-06 21:58 ` Mattias Engdegård
  1 sibling, 0 replies; 7+ messages in thread
From: Mattias Engdegård @ 2019-10-27  9:17 UTC (permalink / raw)
  To: 37849

Expansion inside (not ...) should be uncontroversial; now pushed (cbd439e785).

Character set operators (union, intersection, difference) would be useful. Consider:

(rx-define ident-chars (any "a-zA-Z0-9"))
(rx-define operator-chars (any ?+ ?- ?* ?/ ?< ?> ?=))

There is then currently no way to form the set of characters that excludes both the above sets.







^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#37849: composable character alternatives in rx
  2019-10-21 10:24 bug#37849: composable character alternatives in rx Mattias Engdegård
  2019-10-27  9:17 ` Mattias Engdegård
@ 2019-12-06 21:58 ` Mattias Engdegård
  2019-12-09 11:04   ` Mattias Engdegård
  1 sibling, 1 reply; 7+ messages in thread
From: Mattias Engdegård @ 2019-12-06 21:58 UTC (permalink / raw)
  To: 37849

[-- Attachment #1: Type: text/plain, Size: 1174 bytes --]

This patch adds `union' and `intersection' to rx. They both take zero or more charsets as arguments. A charset is either an `any' form that does not contain character classes, a `union' or `intersection' form, or a `not' form with charset argument.

Example:

(rx (union (any "a-f") (any "b-m")))
=> "[a-m]"

(rx (intersection (any "a-f") (any "b-m")))
=> "[b-f]"

The character class limitation stems from the inability to complement or intersect classes in general. It would be possible to partially lift this restriction for `union'; it is clear that

(rx (union (any "ab" space) (any "bc" space digit)))
=> "[abc[:space:][:digit:]]"

but it makes the facility harder to explain to the user in a way that makes sense. Still, it could be a future extension.

A `difference' operator was not included but could be added; it is trivially defined in rx as

(rx-define difference (a b)
  (intersection a (not b)))

The names `union' and `intersection' are verbose, but should be rare enough that it's better with something descriptive.
SRE, from where the concept was taken, uses `|' and `&' respectively, and `~' for complement, `-' for difference.


[-- Attachment #2: 0001-Add-union-and-intersection-to-rx-bug-37849.patch --]
[-- Type: application/octet-stream, Size: 20657 bytes --]

From 3d3bc1529ae90d3cfd5605055060f14696d815c2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 6 Dec 2019 22:23:57 +0100
Subject: [PATCH] Add `union' and `intersection' to rx (bug#37849)

These character set operations, together with `not' for set
complement, improve the compositionality of rx, and reduce duplication
in complicated cases.  Named character classes are not permitted in
set operations.

* lisp/emacs-lisp/rx.el (rx--translate-any): Split into multiple
functions.
(rx--foldl, rx--parse-any, rx--generate-alt, rx--intervals-to-alt)
(rx--complement-intervals, rx--intersect-intervals)
(rx--union-intervals, rx--charset-intervals, rx--charset-union)
(rx--charset-all, rx--charset-intersection, rx--translate-union)
(rx--translate-intersection): New.
(rx--translate-not, rx--translate-form, rx--builtin-forms):
Add `union' and `intersection'.
* test/lisp/emacs-lisp/rx-tests.el (rx-union ,rx-def-in-union)
(rx-intersection, rx-def-in-intersection): New tests.
* doc/lispref/searching.texi (Rx Constructs):
* etc/NEWS:
Document `union' and `intersection'.
---
 doc/lispref/searching.texi       |  14 +-
 etc/NEWS                         |   7 +-
 lisp/emacs-lisp/rx.el            | 302 +++++++++++++++++++++----------
 test/lisp/emacs-lisp/rx-tests.el |  57 ++++++
 4 files changed, 284 insertions(+), 96 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 74b15cfc7f..19888e7cfa 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1214,11 +1214,21 @@ Rx Constructs
 @item @code{(not @var{charspec})}
 @cindex @code{not} in rx
 Match a character not included in @var{charspec}.  @var{charspec} can
-be an @code{any}, @code{not}, @code{syntax} or @code{category} form, or a
-character class.@*
+be an @code{any}, @code{not}, @code{union}, @code{intersection},
+@code{syntax} or @code{category} form, or a character class.@*
 Corresponding string regexp: @samp{[^@dots{}]}, @samp{\S@var{code}},
 @samp{\C@var{code}}
 
+@item @code{(union @var{charset}@dots{})}
+@itemx @code{(intersection @var{charset}@dots{})}
+@cindex @code{union} in rx
+@cindex @code{intersection} in rx
+Match a character that matches the union or intersection,
+respectively, of the @var{charset}s.  Each @var{charset} can be an
+@code{any} form without character classes, or a @code{union},
+@code{intersection} or @code{not} form whose arguments are also
+@var{charset}s.
+
 @item @code{not-newline}, @code{nonl}
 @cindex @code{not-newline} in rx
 @cindex @code{nonl} in rx
diff --git a/etc/NEWS b/etc/NEWS
index 28bcb720cd..fb95a6e704 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -2100,9 +2100,14 @@ at run time, instead of a constant string.
 These macros add new forms to the rx notation.
 
 +++
-*** 'anychar' is now an alias for 'anything'
+*** 'anychar' is now an alias for 'anything'.
 Both match any single character; 'anychar' is more descriptive.
 
++++
+*** New 'union' and 'intersection' forms for character sets.
+These permit composing character-matching expressions from simpler
+parts.
+
 ** Frames
 
 +++
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index 0dc6e19866..b17f44f1a7 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -246,6 +246,14 @@ rx--every
     (setq list (cdr list)))
   (null list))
 
+(defun rx--foldl (f x l)
+  "(F (F (F X L0) L1) L2) ...
+Left-fold the list L, starting with X, by the binary function F."
+  (while l
+    (setq x (funcall f x (car l)))
+    (setq l (cdr l)))
+  x)
+
 (defun rx--translate-or (body)
   "Translate an or-pattern of zero or more rx items.
 Return (REGEXP . PRECEDENCE)."
@@ -343,22 +351,11 @@ rx--condense-intervals
         (setq tail d)))
     intervals))
 
-;; FIXME: Consider expanding definitions inside (any ...) and (not ...),
-;; and perhaps allow (any ...) inside (any ...).
-;; It would be benefit composability (build a character alternative by pieces)
-;; and be handy for obtaining the complement of a defined set of
-;; characters.  (See, for example, python.el:421, `not-simple-operator'.)
-;; (Expansion in other non-rx positions is probably not a good idea:
-;; syntax, category, backref, and the integer parameters of group-n,
-;; =, >=, **, repeat)
-;; Similar effect could be attained by ensuring that
-;; (or (any X) (any Y)) -> (any X Y), and find a way to compose negative
-;; sets.  `and' is taken, but we could add
-;; (intersection (not (any X)) (not (any Y))) -> (not (any X Y)).
-
-(defun rx--translate-any (negated body)
-  "Translate an (any ...) construct.  Return (REGEXP . PRECEDENCE).
-If NEGATED, negate the sense."
+(defun rx--parse-any (body)
+  "Parse arguments of an (any ...) construct.
+Return (INTERVALS . CLASSES), where INTERVALS is a sorted list of
+disjoint intervals (each a cons of chars), and CLASSES
+a list of named character classes in the order they occur in BODY."
   (let ((classes nil)
         (strings nil)
         (conses nil))
@@ -380,81 +377,109 @@ rx--translate-any
                          (or (memq class classes)
                              (progn (push class classes) t))))))
             (t (error "Invalid rx `any' argument: %s" arg))))
-    (let ((items
-           ;; Translate strings and conses into nonoverlapping intervals,
-           ;; and add classes as symbols at the end.
-           (append
-            (rx--condense-intervals
-             (sort (append conses
-                           (mapcan #'rx--string-to-intervals strings))
-                   #'car-less-than-car))
-            (reverse classes))))
-
-      ;; Move lone ] and range ]-x to the start.
-      (let ((rbrac-l (assq ?\] items)))
-        (when rbrac-l
-          (setq items (cons rbrac-l (delq rbrac-l items)))))
-
-      ;; Split x-] and move the lone ] to the start.
-      (let ((rbrac-r (rassq ?\] items)))
-        (when (and rbrac-r (not (eq (car rbrac-r) ?\])))
-          (setcdr rbrac-r ?\\)
-          (setq items (cons '(?\] . ?\]) items))))
-
-      ;; Split ,-- (which would end up as ,- otherwise).
-      (let ((dash-r (rassq ?- items)))
-        (when (eq (car dash-r) ?,)
-          (setcdr dash-r ?,)
-          (setq items (nconc items '((?- . ?-))))))
-
-      ;; Remove - (lone or at start of interval)
-      (let ((dash-l (assq ?- items)))
-        (when dash-l
-          (if (eq (cdr dash-l) ?-)
-              (setq items (delq dash-l items))   ; Remove lone -
-            (setcar dash-l ?.))                  ; Reduce --x to .-x
-          (setq items (nconc items '((?- . ?-))))))
-
-      ;; Deal with leading ^ and range ^-x.
-      (when (and (consp (car items))
-                 (eq (caar items) ?^)
-                 (cdr items))
-        ;; Move ^ and ^-x to second place.
-        (setq items (cons (cadr items)
-                          (cons (car items) (cddr items)))))
+    (cons (rx--condense-intervals
+           (sort (append conses
+                         (mapcan #'rx--string-to-intervals strings))
+                 #'car-less-than-car))
+          (reverse classes))))
+
+(defun rx--generate-alt (negated intervals classes)
+  "Generate a character alternative.  Return (REGEXP . PRECEDENCE).
+If NEGATED is non-nil, negate the result; INTERVALS is a sorted
+list of disjoint intervals and CLASSES a list of named character
+classes."
+  (let ((items (append intervals classes)))
+    ;; Move lone ] and range ]-x to the start.
+    (let ((rbrac-l (assq ?\] items)))
+      (when rbrac-l
+        (setq items (cons rbrac-l (delq rbrac-l items)))))
+
+    ;; Split x-] and move the lone ] to the start.
+    (let ((rbrac-r (rassq ?\] items)))
+      (when (and rbrac-r (not (eq (car rbrac-r) ?\])))
+        (setcdr rbrac-r ?\\)
+        (setq items (cons '(?\] . ?\]) items))))
+
+    ;; Split ,-- (which would end up as ,- otherwise).
+    (let ((dash-r (rassq ?- items)))
+      (when (eq (car dash-r) ?,)
+        (setcdr dash-r ?,)
+        (setq items (nconc items '((?- . ?-))))))
+
+    ;; Remove - (lone or at start of interval)
+    (let ((dash-l (assq ?- items)))
+      (when dash-l
+        (if (eq (cdr dash-l) ?-)
+            (setq items (delq dash-l items))   ; Remove lone -
+          (setcar dash-l ?.))                  ; Reduce --x to .-x
+        (setq items (nconc items '((?- . ?-))))))
+
+    ;; Deal with leading ^ and range ^-x.
+    (when (and (consp (car items))
+               (eq (caar items) ?^)
+               (cdr items))
+      ;; Move ^ and ^-x to second place.
+      (setq items (cons (cadr items)
+                        (cons (car items) (cddr items)))))
 
-      (cond
-       ;; Empty set: if negated, any char, otherwise match-nothing.
-       ((null items)
-        (if negated
-            (rx--translate-symbol 'anything)
-          (rx--empty)))
-       ;; Single non-negated character.
-       ((and (null (cdr items))
-             (consp (car items))
-             (eq (caar items) (cdar items))
-             (not negated))
-        (cons (list (regexp-quote (char-to-string (caar items))))
-              t))
-       ;; At least one character or class, possibly negated.
-       (t
-        (cons
-         (list
-          (concat
-           "["
-           (and negated "^")
-           (mapconcat (lambda (item)
-                        (cond ((symbolp item)
-                               (format "[:%s:]" item))
-                              ((eq (car item) (cdr item))
-                               (char-to-string (car item)))
-                              ((eq (1+ (car item)) (cdr item))
-                               (string (car item) (cdr item)))
-                              (t
-                               (string (car item) ?- (cdr item)))))
-                      items nil)
-           "]"))
-         t))))))
+    (cond
+     ;; Empty set: if negated, any char, otherwise match-nothing.
+     ((null items)
+      (if negated
+          (rx--translate-symbol 'anything)
+        (rx--empty)))
+     ;; Single non-negated character.
+     ((and (null (cdr items))
+           (consp (car items))
+           (eq (caar items) (cdar items))
+           (not negated))
+      (cons (list (regexp-quote (char-to-string (caar items))))
+            t))
+     ;; At least one character or class, possibly negated.
+     (t
+      (cons
+       (list
+        (concat
+         "["
+         (and negated "^")
+         (mapconcat (lambda (item)
+                      (cond ((symbolp item)
+                             (format "[:%s:]" item))
+                            ((eq (car item) (cdr item))
+                             (char-to-string (car item)))
+                            ((eq (1+ (car item)) (cdr item))
+                             (string (car item) (cdr item)))
+                            (t
+                             (string (car item) ?- (cdr item)))))
+                    items nil)
+         "]"))
+       t)))))
+
+(defun rx--translate-any (negated body)
+  "Translate an (any ...) construct.  Return (REGEXP . PRECEDENCE).
+If NEGATED, negate the sense."
+  (let ((parsed (rx--parse-any body)))
+    (rx--generate-alt negated (car parsed) (cdr parsed))))
+
+(defun rx--intervals-to-alt (negated intervals)
+  "Generate a character alternative from an interval set.
+Return (REGEXP . PRECEDENCE).
+INTERVALS is a sorted list of disjoint intervals.
+If NEGATED, negate the sense."
+  ;; Detect whether the interval set is better described in
+  ;; complemented form.  This is not just a matter of aesthetics: any
+  ;; range from ASCII to raw bytes will automatically exclude the
+  ;; entire non-ASCII Unicode range by the regexp engine.
+  (if (rx--every (lambda (iv) (not (<= (car iv) #x3ffeff (cdr iv))))
+                 intervals)
+      (rx--generate-alt negated intervals nil)
+    (rx--generate-alt
+     (not negated) (rx--complement-intervals intervals) nil)))
+
+;; FIXME: Consider turning `not' into a variadic operator, following SRE:
+;; (not A B) = (not (union A B)) = (intersection (not A) (not B)), and
+;; (not) = anychar.
+;; Maybe allow singleton characters as arguments.
 
 (defun rx--translate-not (negated body)
   "Translate a (not ...) construct.  Return (REGEXP . PRECEDENCE).
@@ -472,10 +497,14 @@ rx--translate-not
              ('category
               (rx--translate-category (not negated) (cdr arg)))
              ('not
-              (rx--translate-not      (not negated) (cdr arg))))))
+              (rx--translate-not      (not negated) (cdr arg)))
+             ('union
+              (rx--translate-union    (not negated) (cdr arg)))
+             ('intersection
+              (rx--translate-intersection (not negated) (cdr arg))))))
      ((let ((class (cdr (assq arg rx--char-classes))))
         (and class
-             (rx--translate-any (not negated) (list class)))))
+             (rx--generate-alt (not negated) nil (list class)))))
      ((eq arg 'word-boundary)
       (rx--translate-symbol
        (if negated 'word-boundary 'not-word-boundary)))
@@ -484,6 +513,91 @@ rx--translate-not
              (rx--translate-not negated (list expanded)))))
      (t (error "Illegal argument to rx `not': %S" arg)))))
 
+(defun rx--complement-intervals (intervals)
+  "Complement of the interval list INTERVALS."
+  (let ((compl nil)
+        (c 0))
+    (dolist (iv intervals)
+      (when (< c (car iv))
+        (push (cons c (1- (car iv))) compl))
+      (setq c (1+ (cdr iv))))
+    (when (< c (max-char))
+      (push (cons c (max-char)) compl))
+    (nreverse compl)))
+
+(defun rx--intersect-intervals (ivs-a ivs-b)
+  "Intersection of the interval lists IVS-A and IVS-B."
+  (let ((isect nil))
+    (while (and ivs-a ivs-b)
+      (let ((a (car ivs-a))
+            (b (car ivs-b)))
+        (cond
+         ((< (cdr a) (car b)) (setq ivs-a (cdr ivs-a)))
+         ((> (car a) (cdr b)) (setq ivs-b (cdr ivs-b)))
+         (t
+          (push (cons (max (car a) (car b))
+                      (min (cdr a) (cdr b)))
+                isect)
+          (setq ivs-a (cdr ivs-a))
+          (setq ivs-b (cdr ivs-b))
+          (cond ((< (cdr a) (cdr b))
+                 (push (cons (1+ (cdr a)) (cdr b))
+                       ivs-b))
+                ((> (cdr a) (cdr b))
+                 (push (cons (1+ (cdr b)) (cdr a))
+                       ivs-a)))))))
+    (nreverse isect)))
+
+(defun rx--union-intervals (ivs-a ivs-b)
+  "Union of the interval lists IVS-A and IVS-B."
+  (rx--complement-intervals
+   (rx--intersect-intervals
+    (rx--complement-intervals ivs-a)
+    (rx--complement-intervals ivs-b))))
+
+(defun rx--charset-intervals (charset)
+  "Return a sorted list of non-adjacent disjoint intervals from CHARSET.
+CHARSET is any expression allowed in a character set expression:
+either `any' (no classes permitted), or `not', `union' or `intersection'
+forms whose arguments are charsets."
+  (pcase charset
+    (`(,(or 'any 'in 'char) . ,body)
+     (let ((parsed (rx--parse-any body)))
+       (when (cdr parsed)
+         (error
+          "Character class not permitted in set operations: %S"
+          (cadr parsed)))
+       (car parsed)))
+    (`(not ,x) (rx--complement-intervals (rx--charset-intervals x)))
+    (`(union . ,xs) (rx--charset-union xs))
+    (`(intersection . ,xs) (rx--charset-intersection xs))
+    (_ (let ((expanded (rx--expand-def charset)))
+         (if expanded
+             (rx--charset-intervals expanded)
+           (error "Bad character set: %S" charset))))))
+
+(defun rx--charset-union (charsets)
+  "Union of CHARSETS, as a set of intervals."
+  (rx--foldl #'rx--union-intervals nil
+             (mapcar #'rx--charset-intervals charsets)))
+
+(defconst rx--charset-all (list (cons 0 (max-char))))
+
+(defun rx--charset-intersection (charsets)
+  "Intersection of CHARSETS, as a set of intervals."
+  (rx--foldl #'rx--intersect-intervals rx--charset-all
+             (mapcar #'rx--charset-intervals charsets)))
+
+(defun rx--translate-union (negated body)
+  "Translate a (union ...) construct.  Return (REGEXP . PRECEDENCE).
+If NEGATED, negate the sense."
+  (rx--intervals-to-alt negated (rx--charset-union body)))
+
+(defun rx--translate-intersection (negated body)
+  "Translate an (intersection ...) construct.  Return (REGEXP . PRECEDENCE).
+If NEGATED, negate the sense."
+  (rx--intervals-to-alt negated (rx--charset-intersection body)))
+
 (defun rx--atomic-regexp (item)
   "ITEM is (REGEXP . PRECEDENCE); return a regexp of precedence t."
   (if (eq (cdr item) t)
@@ -862,6 +976,8 @@ rx--translate-form
       ((or 'any 'in 'char)      (rx--translate-any nil body))
       ('not-char                (rx--translate-any t body))
       ('not                     (rx--translate-not nil body))
+      ('union                   (rx--translate-union nil body))
+      ('intersection            (rx--translate-intersection nil body))
 
       ('repeat                  (rx--translate-repeat body))
       ('=                       (rx--translate-= body))
@@ -920,7 +1036,7 @@ rx--translate-form
         (t (error "Unknown rx form `%s'" op)))))))
 
 (defconst rx--builtin-forms
-  '(seq sequence : and or | any in char not-char not
+  '(seq sequence : and or | any in char not-char not union intersection
     repeat = >= **
     zero-or-more 0+ *
     one-or-more 1+ +
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index 26e39f8c8e..fdf8db61df 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -274,6 +274,63 @@ rx-not
   (should (equal (rx (not (not ascii)) (not (not (not (any "a-z")))))
                  "[[:ascii:]][^a-z]")))
 
+(ert-deftest rx-union ()
+  (should (equal (rx (union))
+                 "\\`a\\`"))
+  (should (equal (rx (union (any "ba")))
+                 "[ab]"))
+  (should (equal (rx (union (any "a-f") (any "c-k" ?y) (any ?r "x-z")))
+                 "[a-krx-z]"))
+  (should (equal (rx (union (not (any "a-m")) (not (any "f-p"))))
+                 "[^f-m]"))
+  (should (equal (rx (union (any "e-m") (not (any "a-z"))))
+                 "[^a-dn-z]"))
+  (should (equal (rx (union (not (any "g-r")) (not (any "t"))))
+                 "[^z-a]"))
+  (should (equal (rx (not (union (not (any "g-r")) (not (any "t")))))
+                 "\\`a\\`"))
+  (should (equal (rx (union (union (any "a-f") (any "u-z"))
+                            (any "g-r")))
+                 "[a-ru-z]"))
+  (should (equal (rx (union (intersection (any "c-z") (any "a-g"))
+                            (not (any "a-k"))))
+                 "[^abh-k]")))
+
+(ert-deftest rx-def-in-union ()
+  (rx-let ((a (any "badc"))
+           (b (union a (any "def"))))
+    (should (equal(rx (union b (any "q")))
+                  "[a-fq]"))))
+
+(ert-deftest rx-intersection ()
+  (should (equal (rx (intersection))
+                 "[^z-a]"))
+  (should (equal (rx (intersection (any "ba")))
+                 "[ab]"))
+  (should (equal (rx (intersection (any "a-j" "u-z") (any "c-k" ?y)
+                                   (any "a-i" "x-z")))
+                 "[c-iy]"))
+  (should (equal (rx (intersection (not (any "a-m")) (not (any "f-p"))))
+                 "[^a-p]"))
+  (should (equal (rx (intersection (any "a-z") (not (any "g-q"))))
+                 "[a-fr-z]"))
+  (should (equal (rx (intersection (any "a-d") (any "e")))
+                 "\\`a\\`"))
+  (should (equal (rx (not (intersection (any "a-d") (any "e"))))
+                 "[^z-a]"))
+  (should (equal (rx (intersection (any "d-u")
+                                   (intersection (any "e-z") (any "a-m"))))
+                 "[e-m]"))
+  (should (equal (rx (intersection (union (any "a-f") (any "f-t"))
+                                   (any "e-w")))
+                 "[e-t]")))
+
+(ert-deftest rx-def-in-intersection ()
+  (rx-let ((a (any "a-g"))
+           (b (intersection a (any "d-j"))))
+    (should (equal(rx (intersection b (any "e-k")))
+                  "[e-g]"))))
+
 (ert-deftest rx-group ()
   (should (equal (rx (group nonl) (submatch "x")
                      (group-n 3 "y") (submatch-n 13 "z") (backref 1))
-- 
2.21.0 (Apple Git-122.2)


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* bug#37849: composable character alternatives in rx
  2019-12-06 21:58 ` Mattias Engdegård
@ 2019-12-09 11:04   ` Mattias Engdegård
  2019-12-09 13:36     ` Eli Zaretskii
  0 siblings, 1 reply; 7+ messages in thread
From: Mattias Engdegård @ 2019-12-09 11:04 UTC (permalink / raw)
  To: 37849

Eli, as a matter of protocol: assuming the union/intersection patch meets no opposition, can it be pushed to master? It is self-contained and should not affect anything outside rx.






^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#37849: composable character alternatives in rx
  2019-12-09 11:04   ` Mattias Engdegård
@ 2019-12-09 13:36     ` Eli Zaretskii
  2019-12-10 21:39       ` Mattias Engdegård
  0 siblings, 1 reply; 7+ messages in thread
From: Eli Zaretskii @ 2019-12-09 13:36 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: 37849

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Mon, 9 Dec 2019 12:04:40 +0100
> Cc: Eli Zaretskii <eliz@gnu.org>
> 
> Eli, as a matter of protocol: assuming the union/intersection patch meets no opposition, can it be pushed to master? It is self-contained and should not affect anything outside rx.

It's a new feature, so yes, assuming that you've verified it passes
all the tests and cannot possibly interfere with any existing code.

Thanks.





^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#37849: composable character alternatives in rx
  2019-12-09 13:36     ` Eli Zaretskii
@ 2019-12-10 21:39       ` Mattias Engdegård
  2019-12-13 12:35         ` Mattias Engdegård
  0 siblings, 1 reply; 7+ messages in thread
From: Mattias Engdegård @ 2019-12-10 21:39 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 37849

Thank you, now pushed.






^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#37849: composable character alternatives in rx
  2019-12-10 21:39       ` Mattias Engdegård
@ 2019-12-13 12:35         ` Mattias Engdegård
  0 siblings, 0 replies; 7+ messages in thread
From: Mattias Engdegård @ 2019-12-13 12:35 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 37849-done

As suggested by Stefan Monnier, 'union' was replaced with plain 'or' for character sets as well.

A minor usability improvement has been pushed to master as well: characters and single-char strings no longer have to be wrapped in (any...), so (not (any ?a)) can now be written (not ?a).






^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-12-13 12:35 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-21 10:24 bug#37849: composable character alternatives in rx Mattias Engdegård
2019-10-27  9:17 ` Mattias Engdegård
2019-12-06 21:58 ` Mattias Engdegård
2019-12-09 11:04   ` Mattias Engdegård
2019-12-09 13:36     ` Eli Zaretskii
2019-12-10 21:39       ` Mattias Engdegård
2019-12-13 12:35         ` Mattias Engdegård

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).