* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) @ 2019-06-15 23:43 Noam Postavsky 2019-06-16 0:03 ` Michael Heerdegen ` (3 more replies) 0 siblings, 4 replies; 28+ messages in thread From: Noam Postavsky @ 2019-06-15 23:43 UTC (permalink / raw) To: 36237; +Cc: stefan monnier, kévin le gouguec [-- Attachment #1: Type: text/plain, Size: 1577 bytes --] X-Debbugs-CC: Kévin Le Gouguec <kevin.legouguec@gmail.com>, Stefan Monnier <monnier@iro.umontreal.ca> Severity: wishlist Tags: patch Currently, if you want to construct a regexp which includes a runtime values using rx, there are two options: - Use the (eval FORM) subform. But if using using the rx macro, FORM is evaluated at macroexpansion time, which is awkward. If using rx-to-string, then FORM can't access the lexical environment, which is also awkward. - Build a list at runtime and pass to rx-to-string. This requires the whole rx translation infrastructure at runtime, which is sad. The patch below allows the rx macro to generate a concat expression instead of just a plain string. So the example from https://debbugs.gnu.org/35564#53 would become (let ((start (max 0 (1- pos))) (char (string (aref command pos)))) ; need string for `regexp-quote'. (and (string-match (rx (or (seq (or bos blank) (group-n 1 (regexp-quote char)) (or eos blank)) (seq ?` (group-n 1 (regexp-quote char)) ?`))) command start) (= pos (match-beginning 1)))) The rx call in the above macroexpands into: (concat "\\(?:\\`\\|[[:blank:]]\\)" "\\(?" "1" ":" (regexp-quote char) "\\)" "\\(?:\\'\\|[[:blank:]]\\)" "\\|" "`" "\\(?" "1" ":" (regexp-quote char) "\\)" "`") Which will be optimal once we apply the patch from #14769 "optimize `concat's literals". [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: patch --] [-- Type: text/x-diff, Size: 17507 bytes --] From 6b6c6d8997d02236a4e53ccbe1f6a4b362d9b86c Mon Sep 17 00:00:00 2001 From: Noam Postavsky <npostavs@gmail.com> Date: Fri, 14 Jun 2019 08:43:17 -0400 Subject: [PATCH] Support (rx (and (regexp EXPR) (regexp-quote EXPR))) * lisp/emacs-lisp/rx.el (rx-regexp): Allow non-string forms. (rx-constituents): Add regexp-quote constituent, which is like a plain STRING form, but allows arbitrary lisp expressions. (rx-regexp-quote): New function. (rx-compile-to-lisp): New variable. (rx-subforms): New helper function for handling subforms, including non-constant case. (rx-group-if, rx-and, rx-or, rx-=, rx->=, rx-repeat, rx-submatch) (rx-submatch-n, rx-kleene, rx-atomic-p): Use it to handle non-constant subforms. (rx): Document new form, wrap non-constant forms with concat call. * test/lisp/emacs-lisp/rx-tests.el (rx-tests--match): New macro. (rx-nonstring-expr, rx-nonstring-expr-non-greedy): New tests. * etc/NEWS: Announce changes. --- etc/NEWS | 6 ++ lisp/emacs-lisp/rx.el | 189 +++++++++++++++++++++++++-------------- test/lisp/emacs-lisp/rx-tests.el | 41 +++++++++ 3 files changed, 171 insertions(+), 65 deletions(-) diff --git a/etc/NEWS b/etc/NEWS index 723f0a0fb0..bce755a211 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -1380,12 +1380,18 @@ when given in a string. Previously, '(any "\x80-\xff")' would match characters U+0080...U+00FF. Now the expression matches raw bytes in the 128...255 range, as expected. +--- *** The rx 'or' and 'seq' forms no longer require any arguments. (or) produces a regexp that never matches anything, while (seq) matches the empty string, each being an identity for the operation. This also works for their aliases: '|' for 'or'; ':', 'and' and 'sequence' for 'seq'. +--- +*** 'regexp' and new 'regexp-quote' accept arbirtray lisp as arguments. +In this case, 'rx' will generate code which produces a regexp string +at runtime, instead of a constant string. + ** Frames +++ diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el index 8ef78fd69e..0b7765322b 100644 --- a/lisp/emacs-lisp/rx.el +++ b/lisp/emacs-lisp/rx.el @@ -47,9 +47,11 @@ ;; Rx translates a sexp notation for regular expressions into the ;; usual string notation. The translation can be done at compile-time -;; by using the `rx' macro. It can be done at run-time by calling -;; function `rx-to-string'. See the documentation of `rx' for a -;; complete description of the sexp notation. +;; by using the `rx' macro. The `regexp' and `regexp-quote' accept +;; non-constant expressions, in which case `rx' will translate to a +;; `concat' expression. Translation can be done fully at run-time by +;; calling function `rx-to-string'. See the documentation of `rx' for +;; a complete description of the sexp notation. ;; ;; Some examples of string regexps and their sexp counterparts: ;; @@ -78,8 +80,8 @@ ;; (+ (? ?\n)) (any " \t")) ;; ;; (concat "^\\(?:" something-else "\\)") -;; (rx (and line-start (eval something-else))), statically or -;; (rx-to-string '(and line-start ,something-else)), dynamically. +;; (rx (and line-start (regexp something-else))), statically or +;; (rx-to-string `(and line-start ,something-else)), dynamically. ;; ;; (regexp-opt '(STRING1 STRING2 ...)) ;; (rx (or STRING1 STRING2 ...)), or in other words, `or' automatically @@ -176,6 +178,7 @@ (defvar rx-constituents ;Not `const' because some modes extend it. (not-syntax . (rx-not-syntax 1 1)) ; sregex (category . (rx-category 1 1 rx-check-category)) (eval . (rx-eval 1 1)) + (regexp-quote . (rx-regexp-quote 1 1 stringp)) (regexp . (rx-regexp 1 1 stringp)) (regex . regexp) ; sregex (digit . "[[:digit:]]") @@ -302,6 +305,10 @@ (defvar rx-greedy-flag t "Non-nil means produce greedy regular expressions for `zero-or-one', `zero-or-more', and `one-or-more'. Dynamically bound.") +(defvar rx-compile-to-lisp nil + "Nil means return a regexp as a string. +Non-nil means we may return a lisp form which produces a +string (used for `rx' macro).") (defun rx-info (op head) "Return parsing/code generation info for OP. @@ -344,7 +351,7 @@ (defun rx-check (form) (> nargs max-args)) (error "rx form `%s' accepts at most %d args" (car form) max-args)) - (when (not (null type-pred)) + (when type-pred (dolist (sub-form (cdr form)) (unless (funcall type-pred sub-form) (error "rx form `%s' requires args satisfying `%s'" @@ -360,19 +367,21 @@ (defun rx-group-if (regexp group) ;; for concatenation ((eq group ':) (if (rx-atomic-p - (if (string-match - "\\(?:[?*+]\\??\\|\\\\{[0-9]*,?[0-9]*\\\\}\\)\\'" regexp) - (substring regexp 0 (match-beginning 0)) - regexp)) - (setq group nil))) + (if (and (stringp regexp) + (string-match + "\\(?:[?*+]\\??\\|\\\\{[0-9]*,?[0-9]*\\\\}\\)\\'" regexp)) + (substring regexp 0 (match-beginning 0)) + regexp)) + (setq group nil))) ;; for OR ((eq group '|) (setq group nil)) ;; do anyway ((eq group t)) ((rx-atomic-p regexp t) (setq group nil))) - (if group - (concat "\\(?:" regexp "\\)") - regexp)) + (cond ((and group (stringp regexp)) + (concat "\\(?:" regexp "\\)")) + (group `("\\(?:" ,@regexp "\\)")) + (t regexp))) (defvar rx-parent) @@ -384,7 +393,7 @@ (defun rx-and (form) FORM is of the form `(and FORM1 ...)'." (rx-check form) (rx-group-if - (mapconcat (lambda (x) (rx-form x ':)) (cdr form) nil) + (rx-subforms (cdr form) ':) (and (memq rx-parent '(* t)) rx-parent))) @@ -396,7 +405,7 @@ (defun rx-or (form) ((null (cdr form)) regexp-unmatchable) ((cl-every #'stringp (cdr form)) (regexp-opt (cdr form) nil t)) - (t (mapconcat (lambda (x) (rx-form x '|)) (cdr form) "\\|"))) + (t (rx-subforms (cdr form) '| "\\|"))) (and (memq rx-parent '(: * t)) rx-parent))) @@ -669,7 +678,10 @@ (defun rx-= (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `=' requires positive integer first arg")) - (format "%s\\{%d\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d\\}" (nth 1 form)))))) (defun rx->= (form) @@ -679,7 +691,10 @@ (defun rx->= (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `>=' requires positive integer first arg")) - (format "%s\\{%d,\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d,\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d,\\}" (nth 1 form)))))) (defun rx-** (form) @@ -700,7 +715,10 @@ (defun rx-repeat (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `repeat' requires positive integer first arg")) - (format "%s\\{%d\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d\\}" (nth 1 form)))))) ((or (not (integerp (nth 2 form))) (< (nth 2 form) 0) (not (integerp (nth 1 form))) @@ -708,30 +726,26 @@ (defun rx-repeat (form) (< (nth 2 form) (nth 1 form))) (error "rx `repeat' range error")) (t - (format "%s\\{%d,%d\\}" (rx-form (nth 3 form) '*) - (nth 1 form) (nth 2 form))))) + (let ((subform (rx-form (nth 3 form) '*))) + (if (stringp subform) + (format "%s\\{%d,%d\\}" subform (nth 1 form) (nth 2 form)) + `(,@subform ,(format "\\{%d,%d\\}" (nth 1 form) (nth 2 form)))))))) (defun rx-submatch (form) "Parse and produce code from FORM, which is `(submatch ...)'." - (concat "\\(" - (if (= 2 (length form)) - ;; Only one sub-form. - (rx-form (cadr form)) - ;; Several sub-forms implicitly concatenated. - (mapconcat (lambda (re) (rx-form re ':)) (cdr form) nil)) - "\\)")) + (let ((subforms (rx-subforms (cdr form) ':))) + (if (stringp subforms) + (concat "\\(" subforms "\\)") + `("\\(" ,@subforms "\\)")))) (defun rx-submatch-n (form) "Parse and produce code from FORM, which is `(submatch-n N ...)'." - (let ((n (nth 1 form))) - (concat "\\(?" (number-to-string n) ":" - (if (= 3 (length form)) - ;; Only one sub-form. - (rx-form (nth 2 form)) - ;; Several sub-forms implicitly concatenated. - (mapconcat (lambda (re) (rx-form re ':)) (cddr form) nil)) - "\\)"))) + (let ((n (nth 1 form)) + (subforms (rx-subforms (cddr form) ':))) + (if (stringp subforms) + (concat "\\(?" (number-to-string n) ":" subforms "\\)") + `("\\(?" ,(number-to-string n) ":" ,@subforms "\\)")))) (defun rx-backref (form) "Parse and produce code from FORM, which is `(backref N)'." @@ -759,9 +773,12 @@ (defun rx-kleene (form) (t "?"))) (op (cond ((memq (car form) '(* *? 0+ zero-or-more)) "*") ((memq (car form) '(+ +? 1+ one-or-more)) "+") - (t "?")))) + (t "?"))) + (subform (rx-form (cadr form) '*))) (rx-group-if - (concat (rx-form (cadr form) '*) op suffix) + (if (stringp subform) + (concat subform op suffix) + `(,@subform ,(concat op suffix))) (and (memq rx-parent '(t *)) rx-parent)))) @@ -789,15 +806,18 @@ (defun rx-atomic-p (r &optional lax) be detected without much effort. A guarantee of no false negatives would require a theoretic specification of the set of all atomic regexps." - (let ((l (length r))) - (cond - ((<= l 1)) - ((= l 2) (= (aref r 0) ?\\)) - ((= l 3) (string-match "\\`\\(?:\\\\[cCsS_]\\|\\[[^^]\\]\\)" r)) - ((null lax) + (if (and rx-compile-to-lisp + (not (stringp r))) + nil ;; Runtime value, we must assume non-atomic. + (let ((l (length r))) (cond - ((string-match "\\`\\[\\^?]?\\(?:\\[:[a-z]+:]\\|[^]]\\)*]\\'" r)) - ((string-match "\\`\\\\(\\(?:[^\\]\\|\\\\[^)]\\)*\\\\)\\'" r))))))) + ((<= l 1)) + ((= l 2) (= (aref r 0) ?\\)) + ((= l 3) (string-match "\\`\\(?:\\\\[cCsS_]\\|\\[[^^]\\]\\)" r)) + ((null lax) + (cond + ((string-match "\\`\\[\\^?]?\\(?:\\[:[a-z]+:]\\|[^]]\\)*]\\'" r)) + ((string-match "\\`\\\\(\\(?:[^\\]\\|\\\\[^)]\\)*\\\\)\\'" r)))))))) (defun rx-syntax (form) @@ -853,9 +873,23 @@ (defun rx-greedy (form) (defun rx-regexp (form) "Parse and produce code from FORM, which is `(regexp STRING)'." - (rx-check form) - (rx-group-if (cadr form) rx-parent)) - + (cond ((stringp form) + (rx-group-if (cadr form) rx-parent)) + (rx-compile-to-lisp + ;; Always group non string forms, since we can't be sure they + ;; are atomic. + (rx-group-if (cdr form) t)) + (t (rx-check form)))) + +(defun rx-regexp-quote (form) + "Parse and produce code from FORM, which is `(regexp-quote STRING-EXP)'." + (cond ((stringp form) + ;; This is allowed(?), but makes little sense, you could just + ;; use STRING directly. + (rx-group-if (regexp-quote (cadr form)) rx-parent)) + (rx-compile-to-lisp + (rx-group-if (list form) rx-parent)) + (t (rx-check form)))) (defun rx-form (form &optional parent) "Parse and produce code for regular expression FORM. @@ -886,6 +920,27 @@ (defun rx-form (form &optional parent) (t (error "rx syntax error at `%s'" form))))) +(defun rx-subforms (subforms &optional parent regexp-op) + (let ((listify (lambda (x) + (if (listp x) (copy-sequence x) + (list x)))) + (subregexps (cond ((cdr subforms) + (mapcar (lambda (x) (rx-form x parent)) subforms)) + (subforms + ;; Single form, no need for grouping. + (list (rx-form (car subforms)))) + ;; Zero forms. + (t "")))) + (cond ((or (not rx-compile-to-lisp) + (cl-every #'stringp subregexps)) + (mapconcat #'identity subregexps regexp-op)) + (regexp-op + (nconc (funcall listify (car subregexps)) + (cl-mapcan (lambda (x) + (cons regexp-op (funcall listify x))) + (cdr subregexps)))) + (t (cl-mapcan listify subregexps))))) + ;;;###autoload (defun rx-to-string (form &optional no-group) @@ -901,8 +956,12 @@ (defmacro rx (&rest regexps) REGEXPS is a non-empty sequence of forms of the sort listed below. Note that `rx' is a Lisp macro; when used in a Lisp program being -compiled, the translation is performed by the compiler. -See `rx-to-string' for how to do such a translation at run-time. +compiled, the translation is performed by the compiler. The +`regexp-quote' and `regexp' accept forms that will evaluate to +strings, in addition to constant strings. If REGEXPS include +such forms, then the result is an expression which returns a +regexp string, rather than a regexp string directly. See +`rx-to-string' for performing translation completely at run-time. The following are valid subforms of regular expressions in sexp notation. @@ -910,6 +969,10 @@ (defmacro rx (&rest regexps) STRING matches string STRING literally. +`(regexp-quote STRING)' + matches STRING literally, where STRING is any lisp + expression that evaluates to a string. + CHAR matches character CHAR literally. @@ -1208,12 +1271,16 @@ (defmacro rx (&rest regexps) `(regexp REGEXP)' include REGEXP in string notation in the result." - (cond ((null regexps) - (error "No regexp")) - ((cdr regexps) - (rx-to-string `(and ,@regexps) t)) - (t - (rx-to-string (car regexps) t)))) + (let* ((rx-compile-to-lisp t) + (re (cond ((null regexps) + (error "No regexp")) + ((cdr regexps) + (rx-to-string `(and ,@regexps) t)) + (t + (rx-to-string (car regexps) t))))) + (if (stringp re) + re + `(concat ,@re)))) (pcase-defmacro rx (&rest regexps) @@ -1275,14 +1342,6 @@ (pcase-defmacro rx (&rest regexps) for var in vars collect `(app (match-string ,i) ,var))))) \f -;; ;; sregex.el replacement - -;; ;;;###autoload (provide 'sregex) -;; ;;;###autoload (autoload 'sregex "rx") -;; (defalias 'sregex 'rx-to-string) -;; ;;;###autoload (autoload 'sregexq "rx" nil nil 'macro) -;; (defalias 'sregexq 'rx) -\f (provide 'rx) ;;; rx.el ends here diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el index 6f392d616d..d457f6919d 100644 --- a/test/lisp/emacs-lisp/rx-tests.el +++ b/test/lisp/emacs-lisp/rx-tests.el @@ -115,5 +115,46 @@ (ert-deftest rx-seq () ;; Test zero-argument `seq'. (should (equal (rx (seq)) ""))) +(defmacro rx-tests--match (regexp string &optional match) + (macroexp-let2 nil strexp string + `(ert-info ((format "Matching %S to %S" ',regexp ,strexp)) + (should (string-match ,regexp ,strexp)) + ,@(when match + `((should (equal (match-string 0 ,strexp) ,match))))))) + +(ert-deftest rx-nonstring-expr () + (let ((bee "b") + (vowel "[aeiou]")) + (rx-tests--match (rx "a" (regexp-quote bee) "c") "abc") + (rx-tests--match (rx "a" (regexp bee) "c") "abc") + (rx-tests--match (rx "a" (or (regexp bee) "xy") "c") "abc") + (rx-tests--match (rx "a" (or "xy" (regexp bee)) "c") "abc") + (should-not (string-match (rx (or (regexp bee) "xy")) "")) + (rx-tests--match (rx "a" (= 3 (regexp bee)) "c") "abbbc") + (rx-tests--match (rx "x" (= 3 (regexp vowel)) "z") "xeoez") + (should-not (string-match (rx "x" (= 3 (regexp vowel)) "z") "xe[]z")) + (rx-tests--match (rx "x" (= 3 (regexp-quote vowel)) "z") + "x[aeiou][aeiou][aeiou]z") + (rx-tests--match (rx "x" (repeat 1 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (repeat 1 2 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (repeat 1 2 (regexp vowel)) "z") "xauz") + (rx-tests--match (rx "x" (>= 1 (regexp vowel)) "z") "xaiiz") + (rx-tests--match (rx "x" (** 1 2 (regexp vowel)) "z") "xaiz") + (rx-tests--match (rx "x" (group (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (group-n 1 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (? (regexp vowel)) "z") "xz"))) + +(ert-deftest rx-nonstring-expr-non-greedy () + "`rx's greediness can't affect runtime regexp parts." + (let ((ad-min "[ad]*?") + (ad-max "[ad]*") + (ad "[ad]")) + (rx-tests--match (rx "c" (regexp ad-min) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (regexp ad-max) "a") "cdaaada" "cdaaada") + (rx-tests--match (rx "c" (minimal-match (regexp ad-max)) "a") "cdaaada" "cdaaada") + (rx-tests--match (rx "c" (maximal-match (regexp ad-min)) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (minimal-match (0+ (regexp ad))) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (maximal-match (0+ (regexp ad))) "a") "cdaaada" "cdaaada"))) + (provide 'rx-tests) ;; rx-tests.el ends here. -- 2.11.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-15 23:43 bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) Noam Postavsky @ 2019-06-16 0:03 ` Michael Heerdegen 2019-06-16 0:28 ` Noam Postavsky 2019-06-16 10:03 ` Mattias Engdegård ` (2 subsequent siblings) 3 siblings, 1 reply; 28+ messages in thread From: Michael Heerdegen @ 2019-06-16 0:03 UTC (permalink / raw) To: Noam Postavsky; +Cc: 36237, stefan monnier, kévin le gouguec Noam Postavsky <npostavs@gmail.com> writes: > +--- > +*** 'regexp' and new 'regexp-quote' accept arbirtray lisp as arguments. ^^^^^^^^^ (typo) Cool. I only wondered whether it wouldn't have been better if `eval' would have been upgraded - but that wouldn't be able to regexp-quote when I wanted it, right? Does it make sense to keep `eval'? There are so many rx-form types, would be good if we could make it obsolete (in some sense). Michael. ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-16 0:03 ` Michael Heerdegen @ 2019-06-16 0:28 ` Noam Postavsky 0 siblings, 0 replies; 28+ messages in thread From: Noam Postavsky @ 2019-06-16 0:28 UTC (permalink / raw) To: Michael Heerdegen; +Cc: 36237, stefan monnier, kévin le gouguec Michael Heerdegen <michael_heerdegen@web.de> writes: >> +*** 'regexp' and new 'regexp-quote' accept arbirtray lisp as arguments. > ^^^^^^^^^ (typo) Oops. > I only wondered whether it wouldn't have been better if `eval' would > have been upgraded (rx (eval EXPR)) vs (rx (regexp-quote EXPR)) have different semantics with respect to scoping and evaluation time, so I think that would be a problem with respect to backwards compatibility. > Does it make sense to keep `eval'? There are so many rx-form types, > would be good if we could make it obsolete (in some sense). Yeah, I think it would make sense to obsolete `eval'. Also `and': it sounds like it should be the logical-and counterpart to `or', but it's just a synonym for `seq' (which is a much clearer name). I used `and' in the bug title just for punning reasons :) ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-15 23:43 bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) Noam Postavsky 2019-06-16 0:03 ` Michael Heerdegen @ 2019-06-16 10:03 ` Mattias Engdegård 2019-06-16 11:34 ` Stefan Monnier 2019-06-26 12:23 ` Andy Moreton 2019-06-26 13:08 ` Andy Moreton 3 siblings, 1 reply; 28+ messages in thread From: Mattias Engdegård @ 2019-06-16 10:03 UTC (permalink / raw) To: Noam Postavsky, 36237; +Cc: Michael Heerdegen, Stefan Monnier, kevin.legouguec Thank you! Definitely agree with the need for something like this, although I haven't read the patch in detail yet. Some notes: - A more suitable name is needed; `regexp-quote' makes no sense at all in the context of rx. What about `literal'? For simplicity, I'm proceeding with that name below. - The docs probably need to be explicit about the differences between `literal' and `eval' w.r.t. evaluation time and environment. - What is now the correct way of including a compile-time regexp expression, such as a defconst? (regexp (eval-when-compile EXPR))? Still a mouthful, but perhaps outside the scope of this bug. - Thanks for mentioning bug#14769; I'll give it a go. - I like rx a lot and use it extensively, but the implementation... could be improved (as you no doubt noticed). And so I have: I'm sitting on a full rewrite, code-named `ry'. It's shorter, much cleaner, and about twice as fast (usually more). The only thing still missing is compatibility with the old crusty `rx-constituents' extension mechanism. The plan was to replace rx with ry entirely when complete. I'll see what it would take to add `literal'. ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-16 10:03 ` Mattias Engdegård @ 2019-06-16 11:34 ` Stefan Monnier 2019-06-16 12:25 ` Noam Postavsky 0 siblings, 1 reply; 28+ messages in thread From: Stefan Monnier @ 2019-06-16 11:34 UTC (permalink / raw) To: Mattias Engdegård Cc: Michael Heerdegen, 36237, Noam Postavsky, kevin.legouguec First, thanks Noam, this is a very welcome improvement. [ BTW, please use "--" for rx-compile-to-lisp since I believe it's internal. ] What do regexp and regexp-quote do in rx-to-string? > - A more suitable name is needed; `regexp-quote' makes no sense at all in > the context of rx. What about `literal'? For simplicity, I'm proceeding with > that name below. I'll let others figure that one out. > - What is now the correct way of including a compile-time regexp expression, > such as a defconst? (regexp (eval-when-compile EXPR))? Still a mouthful, but > perhaps outside the scope of this bug. FWIW, I have the impression that in most cases where this could be useful, a better solution would be to provide something like `rx-defmacro` and/or `rx-macrolet`. Stefan ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-16 11:34 ` Stefan Monnier @ 2019-06-16 12:25 ` Noam Postavsky 2019-06-16 12:35 ` Stefan Monnier 2019-06-18 19:45 ` Mattias Engdegård 0 siblings, 2 replies; 28+ messages in thread From: Noam Postavsky @ 2019-06-16 12:25 UTC (permalink / raw) To: Stefan Monnier Cc: Michael Heerdegen, Mattias Engdegård, 36237, kevin.legouguec > [ BTW, please use "--" for rx-compile-to-lisp since I believe it's > internal. ] I'm not sure that it is, e.g., python-rx might bind it too (if it weren't for the fact that python.el needs to maintain backwards compatibility with older Emacs). > What do regexp and regexp-quote do in rx-to-string? regexp does exactly what it did before, i.e., it accepts only a constant string. Likewise regexp-quote accepts only a constant string, which makes it pointless to use in rx-to-string (just use a plain STRING directly), but I didn't disallow it. Hmm, I think I had meant to update rx-to-string's docstring, but forgot about it. >> - A more suitable name is needed; `regexp-quote' makes no sense at all in >> the context of rx. What about `literal'? For simplicity, I'm proceeding with >> that name below. > > I'll let others figure that one out. Probably `literal' makes sense. I originally used regexp-quote, just because I was thinking of it as a short form of (rx (regexp (regexp-quote EXPR))) >> - What is now the correct way of including a compile-time regexp expression, >> such as a defconst? (regexp (eval-when-compile EXPR))? Still a mouthful, but >> perhaps outside the scope of this bug. Oh, hmm. That might be a reason to keep using `eval'. > FWIW, I have the impression that in most cases where this could be > useful, a better solution would be to provide something like > `rx-defmacro` and/or `rx-macrolet`. I guess that could replace the "old crusty" rx-constituents thing too. >> I have: I'm sitting on a full rewrite, code-named `ry'. It's shorter, >> much cleaner, and about twice as fast (usually more). The only thing >> still missing is compatibility with the old crusty `rx-constituents' >> extension mechanism. >> >> The plan was to replace rx with ry entirely when complete. How far away is this? Would it make sense to delay this bug until "ry" comes in? ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-16 12:25 ` Noam Postavsky @ 2019-06-16 12:35 ` Stefan Monnier 2019-06-16 19:50 ` Noam Postavsky 2019-06-18 19:45 ` Mattias Engdegård 1 sibling, 1 reply; 28+ messages in thread From: Stefan Monnier @ 2019-06-16 12:35 UTC (permalink / raw) To: Noam Postavsky Cc: Michael Heerdegen, Mattias Engdegård, 36237, kevin.legouguec > I'm not sure that it is, e.g., python-rx might bind it too (if it > weren't for the fact that python.el needs to maintain backwards > compatibility with older Emacs). I think that would still be "python-rx using internals of rx". And that could be fixed if we can rewrite it with rx-macrolet, right? > regexp does exactly what it did before, i.e., it accepts only a constant > string. Likewise regexp-quote accepts only a constant string, which > makes it pointless to use in rx-to-string (just use a plain STRING > directly), but I didn't disallow it. Good, thanks. >>> - What is now the correct way of including a compile-time regexp expression, >>> such as a defconst? (regexp (eval-when-compile EXPR))? Still a mouthful, but >>> perhaps outside the scope of this bug. > Oh, hmm. That might be a reason to keep using `eval'. It could make people reluctant to change, yes, but that still wouldn't be a valid reason in my book. >> FWIW, I have the impression that in most cases where this could be >> useful, a better solution would be to provide something like >> `rx-defmacro` and/or `rx-macrolet`. > I guess that could replace the "old crusty" rx-constituents thing too. That's the idea, yes. Stefan ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-16 12:35 ` Stefan Monnier @ 2019-06-16 19:50 ` Noam Postavsky 2019-06-16 20:04 ` Stefan Monnier 2019-06-16 20:25 ` Drew Adams 0 siblings, 2 replies; 28+ messages in thread From: Noam Postavsky @ 2019-06-16 19:50 UTC (permalink / raw) To: Stefan Monnier Cc: Michael Heerdegen, Mattias Engdegård, 36237, kevin.legouguec Stefan Monnier <monnier@iro.umontreal.ca> writes: >> I'm not sure that it is, e.g., python-rx might bind it too (if it >> weren't for the fact that python.el needs to maintain backwards >> compatibility with older Emacs). > > I think that would still be "python-rx using internals of rx". > And that could be fixed if we can rewrite it with rx-macrolet, right? Yeah, I can buy that. Mattias, how far is this "ry" thing from being finished? Would it make sense to land it before adding this new feature? ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-16 19:50 ` Noam Postavsky @ 2019-06-16 20:04 ` Stefan Monnier 2019-06-16 20:25 ` Drew Adams 1 sibling, 0 replies; 28+ messages in thread From: Stefan Monnier @ 2019-06-16 20:04 UTC (permalink / raw) To: Noam Postavsky Cc: Michael Heerdegen, Mattias Engdegård, 36237, kevin.legouguec >>> I'm not sure that it is, e.g., python-rx might bind it too (if it >>> weren't for the fact that python.el needs to maintain backwards >>> compatibility with older Emacs). >> I think that would still be "python-rx using internals of rx". >> And that could be fixed if we can rewrite it with rx-macrolet, right? > Yeah, I can buy that. In this case I think it's OK to use "--" to signal that it's an internal name, and then use this internal name in python-rx. Stefan ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-16 19:50 ` Noam Postavsky 2019-06-16 20:04 ` Stefan Monnier @ 2019-06-16 20:25 ` Drew Adams 2019-06-16 20:34 ` Mattias Engdegård 2019-06-17 20:57 ` Juri Linkov 1 sibling, 2 replies; 28+ messages in thread From: Drew Adams @ 2019-06-16 20:25 UTC (permalink / raw) To: Noam Postavsky, Stefan Monnier Cc: Michael Heerdegen, Mattias Engdegård, 36237, kevin.legouguec > how far is this "ry" thing from being finished? Please excuse this interruption from someone not really following this thread. Naive question: Are you really thinking about using `ry' as the user-visible name - e.g. as a replacement for, or alternative to, `rx'? (Until Noam posed that question I was thinking it was tongue-in-cheek.) I kinda hope not (but again, I'm ignorant of the context). If this is meant to stand for a regexp thingie, and you want an alternative to `rx', please try for something that suggests "regexp" or "regular expression" - `rx2', `rex', `rgx', `rg' ... - something suggestive. (We've had 45+ years of `C' as the successor to `B'. Surely we can do better. ;-)) ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-16 20:25 ` Drew Adams @ 2019-06-16 20:34 ` Mattias Engdegård 2019-06-16 21:09 ` Drew Adams 2019-06-17 20:57 ` Juri Linkov 1 sibling, 1 reply; 28+ messages in thread From: Mattias Engdegård @ 2019-06-16 20:34 UTC (permalink / raw) To: Drew Adams Cc: Michael Heerdegen, 36237, Noam Postavsky, Stefan Monnier, kevin.legouguec 16 juni 2019 kl. 22.25 skrev Drew Adams <drew.adams@oracle.com>: > >> how far is this "ry" thing from being finished? Not far, I hope. A few days, maybe, depending on available time. > Are you really thinking about using `ry' as the > user-visible name - e.g. as a replacement for, > or alternative to, `rx'? No, it's just a working title, to avoid clashes with the current rx. It will be renamed to `rx' when done. ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-16 20:34 ` Mattias Engdegård @ 2019-06-16 21:09 ` Drew Adams 0 siblings, 0 replies; 28+ messages in thread From: Drew Adams @ 2019-06-16 21:09 UTC (permalink / raw) To: Mattias Engdegård Cc: Michael Heerdegen, 36237, Noam Postavsky, Stefan Monnier, kevin.legouguec > > Are you really thinking about using `ry' as the > > user-visible name - e.g. as a replacement for, > > or alternative to, `rx'? > > No, it's just a working title, to avoid clashes with the current rx. It > will be renamed to `rx' when done. Great; thx. Sorry for the noise. ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-16 20:25 ` Drew Adams 2019-06-16 20:34 ` Mattias Engdegård @ 2019-06-17 20:57 ` Juri Linkov 1 sibling, 0 replies; 28+ messages in thread From: Juri Linkov @ 2019-06-17 20:57 UTC (permalink / raw) To: Drew Adams Cc: Michael Heerdegen, Mattias Engdegård, Noam Postavsky, Stefan Monnier, kevin.legouguec, 36237 > I kinda hope not (but again, I'm ignorant of the > context). If this is meant to stand for a regexp > thingie, and you want an alternative to `rx', > please try for something that suggests "regexp" > or "regular expression" - `rx2', `rex', `rgx', > `rg' ... - something suggestive. > > (We've had 45+ years of `C' as the successor to > `B'. Surely we can do better. ;-)) rx++ ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-16 12:25 ` Noam Postavsky 2019-06-16 12:35 ` Stefan Monnier @ 2019-06-18 19:45 ` Mattias Engdegård 2019-06-19 1:34 ` Noam Postavsky 1 sibling, 1 reply; 28+ messages in thread From: Mattias Engdegård @ 2019-06-18 19:45 UTC (permalink / raw) To: Noam Postavsky; +Cc: Michael Heerdegen, 36237, Stefan Monnier, kevin.legouguec 16 juni 2019 kl. 14.25 skrev Noam Postavsky <npostavs@gmail.com>: >>> The plan was to replace rx with ry entirely when complete. > > How far away is this? Would it make sense to delay this bug until "ry" > comes in? The function-complete ry has been put at https://gitlab.com/mattiase/ry for the time being. It should now be entirely compatible, including support for `rx-constituents'. Your proposed `literal' was also added, which was instructive; I needed to know how it would fit in. Bootstrapping Emacs with it (ry renamed to rx, of course) works, including python-mode which uses `rx-constituents' for extension. It also has fewer of the known bugs of the current rx, and a much more extensive test suite. There are two things that perhaps need to be resolved before replacing the current rx: 1. Is there externally developed elisp code that makes use of internal rx functions and variables, perhaps for extension purposes, and do we need to worry about breaking it? The code is old enough not to delineate internal and public symbols clearly (no double-dash), which means that people may have gone and used all sort of things. 2. What would a good extension mechanism look like, and should one be put into place right away, so that we can point to a decent replacement for the internal toys that users relied upon? Noam, unless the consensus is that ry is unequivocally as good or better than rx, you could just as well apply your patch (suitably fixed up). Even if later replaced, there is nothing fundamentally wrong with the design; let's not hold it hostage. ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-18 19:45 ` Mattias Engdegård @ 2019-06-19 1:34 ` Noam Postavsky 2019-06-19 15:42 ` Mattias Engdegård 0 siblings, 1 reply; 28+ messages in thread From: Noam Postavsky @ 2019-06-19 1:34 UTC (permalink / raw) To: Mattias Engdegård Cc: Michael Heerdegen, 36237, Stefan Monnier, kevin.legouguec [-- Attachment #1: Type: text/plain, Size: 801 bytes --] Mattias Engdegård <mattiase@acm.org> writes: > The function-complete ry has been put at > https://gitlab.com/mattiase/ry for the time being. It should now be > entirely compatible, including support for `rx-constituents'. Your > proposed `literal' was also added, which was instructive; I needed to > know how it would fit in. Cool, I'll take a look. > Noam, unless the consensus is that ry is unequivocally as good or > better than rx, you could just as well apply your patch (suitably > fixed up). Even if later replaced, there is nothing fundamentally > wrong with the design; let's not hold it hostage. Sure. Here's the patch with regexp-quote change to literal, and rx--compile-to-lisp renamed. I'll wait a bit more and push this weekend if there are no more comments. [-- Attachment #2: patch --] [-- Type: text/plain, Size: 17855 bytes --] From 3302374b4b484e64d234084661cbf710807bfbe1 Mon Sep 17 00:00:00 2001 From: Noam Postavsky <npostavs@gmail.com> Date: Fri, 14 Jun 2019 08:43:17 -0400 Subject: [PATCH] Support (rx (and (regexp EXPR) (literal EXPR))) (Bug#36237) * lisp/emacs-lisp/rx.el (rx-regexp): Allow non-string forms. (rx-constituents): Add literal constituent, which is like a plain STRING form, but allows arbitrary lisp expressions. (rx-literal): New function. (rx-compile-to-lisp): New variable. (rx-subforms): New helper function for handling subforms, including non-constant case. (rx-group-if, rx-and, rx-or, rx-=, rx->=, rx-repeat, rx-submatch) (rx-submatch-n, rx-kleene, rx-atomic-p): Use it to handle non-constant subforms. (rx): Document new form, wrap non-constant forms with concat call. * test/lisp/emacs-lisp/rx-tests.el (rx-tests--match): New macro. (rx-nonstring-expr, rx-nonstring-expr-non-greedy): New tests. * etc/NEWS: Announce changes. --- etc/NEWS | 6 ++ lisp/emacs-lisp/rx.el | 188 ++++++++++++++++++++++++++------------- test/lisp/emacs-lisp/rx-tests.el | 41 +++++++++ 3 files changed, 172 insertions(+), 63 deletions(-) diff --git a/etc/NEWS b/etc/NEWS index 723f0a0fb0..42958bca36 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -1380,12 +1380,18 @@ when given in a string. Previously, '(any "\x80-\xff")' would match characters U+0080...U+00FF. Now the expression matches raw bytes in the 128...255 range, as expected. +--- *** The rx 'or' and 'seq' forms no longer require any arguments. (or) produces a regexp that never matches anything, while (seq) matches the empty string, each being an identity for the operation. This also works for their aliases: '|' for 'or'; ':', 'and' and 'sequence' for 'seq'. +--- +*** 'regexp' and new 'literal' accept arbitrary lisp as arguments. +In this case, 'rx' will generate code which produces a regexp string +at runtime, instead of a constant string. + ** Frames +++ diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el index 8ef78fd69e..c925cc4415 100644 --- a/lisp/emacs-lisp/rx.el +++ b/lisp/emacs-lisp/rx.el @@ -47,9 +47,11 @@ ;; Rx translates a sexp notation for regular expressions into the ;; usual string notation. The translation can be done at compile-time -;; by using the `rx' macro. It can be done at run-time by calling -;; function `rx-to-string'. See the documentation of `rx' for a -;; complete description of the sexp notation. +;; by using the `rx' macro. The `regexp' and `literal' forms accept +;; non-constant expressions, in which case `rx' will translate to a +;; `concat' expression. Translation can be done fully at run-time by +;; calling function `rx-to-string'. See the documentation of `rx' for +;; a complete description of the sexp notation. ;; ;; Some examples of string regexps and their sexp counterparts: ;; @@ -78,8 +80,8 @@ ;; (+ (? ?\n)) (any " \t")) ;; ;; (concat "^\\(?:" something-else "\\)") -;; (rx (and line-start (eval something-else))), statically or -;; (rx-to-string '(and line-start ,something-else)), dynamically. +;; (rx (and line-start (regexp something-else))), statically or +;; (rx-to-string `(and line-start ,something-else)), dynamically. ;; ;; (regexp-opt '(STRING1 STRING2 ...)) ;; (rx (or STRING1 STRING2 ...)), or in other words, `or' automatically @@ -176,6 +178,7 @@ (defvar rx-constituents ;Not `const' because some modes extend it. (not-syntax . (rx-not-syntax 1 1)) ; sregex (category . (rx-category 1 1 rx-check-category)) (eval . (rx-eval 1 1)) + (literal . (rx-literal 1 1 stringp)) (regexp . (rx-regexp 1 1 stringp)) (regex . regexp) ; sregex (digit . "[[:digit:]]") @@ -302,6 +305,10 @@ (defvar rx-greedy-flag t "Non-nil means produce greedy regular expressions for `zero-or-one', `zero-or-more', and `one-or-more'. Dynamically bound.") +(defvar rx--compile-to-lisp nil + "Nil means return a regexp as a string. +Non-nil means we may return a lisp form which produces a +string (used for `rx' macro).") (defun rx-info (op head) "Return parsing/code generation info for OP. @@ -344,7 +351,7 @@ (defun rx-check (form) (> nargs max-args)) (error "rx form `%s' accepts at most %d args" (car form) max-args)) - (when (not (null type-pred)) + (when type-pred (dolist (sub-form (cdr form)) (unless (funcall type-pred sub-form) (error "rx form `%s' requires args satisfying `%s'" @@ -360,8 +367,9 @@ (defun rx-group-if (regexp group) ;; for concatenation ((eq group ':) (if (rx-atomic-p - (if (string-match - "\\(?:[?*+]\\??\\|\\\\{[0-9]*,?[0-9]*\\\\}\\)\\'" regexp) + (if (and (stringp regexp) + (string-match + "\\(?:[?*+]\\??\\|\\\\{[0-9]*,?[0-9]*\\\\}\\)\\'" regexp)) (substring regexp 0 (match-beginning 0)) regexp)) (setq group nil))) @@ -370,9 +378,10 @@ (defun rx-group-if (regexp group) ;; do anyway ((eq group t)) ((rx-atomic-p regexp t) (setq group nil))) - (if group - (concat "\\(?:" regexp "\\)") - regexp)) + (cond ((and group (stringp regexp)) + (concat "\\(?:" regexp "\\)")) + (group `("\\(?:" ,@regexp "\\)")) + (t regexp))) (defvar rx-parent) @@ -384,7 +393,7 @@ (defun rx-and (form) FORM is of the form `(and FORM1 ...)'." (rx-check form) (rx-group-if - (mapconcat (lambda (x) (rx-form x ':)) (cdr form) nil) + (rx-subforms (cdr form) ':) (and (memq rx-parent '(* t)) rx-parent))) @@ -396,7 +405,7 @@ (defun rx-or (form) ((null (cdr form)) regexp-unmatchable) ((cl-every #'stringp (cdr form)) (regexp-opt (cdr form) nil t)) - (t (mapconcat (lambda (x) (rx-form x '|)) (cdr form) "\\|"))) + (t (rx-subforms (cdr form) '| "\\|"))) (and (memq rx-parent '(: * t)) rx-parent))) @@ -669,7 +678,10 @@ (defun rx-= (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `=' requires positive integer first arg")) - (format "%s\\{%d\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d\\}" (nth 1 form)))))) (defun rx->= (form) @@ -679,7 +691,10 @@ (defun rx->= (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `>=' requires positive integer first arg")) - (format "%s\\{%d,\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d,\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d,\\}" (nth 1 form)))))) (defun rx-** (form) @@ -700,7 +715,10 @@ (defun rx-repeat (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `repeat' requires positive integer first arg")) - (format "%s\\{%d\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d\\}" (nth 1 form)))))) ((or (not (integerp (nth 2 form))) (< (nth 2 form) 0) (not (integerp (nth 1 form))) @@ -708,30 +726,26 @@ (defun rx-repeat (form) (< (nth 2 form) (nth 1 form))) (error "rx `repeat' range error")) (t - (format "%s\\{%d,%d\\}" (rx-form (nth 3 form) '*) - (nth 1 form) (nth 2 form))))) + (let ((subform (rx-form (nth 3 form) '*))) + (if (stringp subform) + (format "%s\\{%d,%d\\}" subform (nth 1 form) (nth 2 form)) + `(,@subform ,(format "\\{%d,%d\\}" (nth 1 form) (nth 2 form)))))))) (defun rx-submatch (form) "Parse and produce code from FORM, which is `(submatch ...)'." - (concat "\\(" - (if (= 2 (length form)) - ;; Only one sub-form. - (rx-form (cadr form)) - ;; Several sub-forms implicitly concatenated. - (mapconcat (lambda (re) (rx-form re ':)) (cdr form) nil)) - "\\)")) + (let ((subforms (rx-subforms (cdr form) ':))) + (if (stringp subforms) + (concat "\\(" subforms "\\)") + `("\\(" ,@subforms "\\)")))) (defun rx-submatch-n (form) "Parse and produce code from FORM, which is `(submatch-n N ...)'." - (let ((n (nth 1 form))) - (concat "\\(?" (number-to-string n) ":" - (if (= 3 (length form)) - ;; Only one sub-form. - (rx-form (nth 2 form)) - ;; Several sub-forms implicitly concatenated. - (mapconcat (lambda (re) (rx-form re ':)) (cddr form) nil)) - "\\)"))) + (let ((n (nth 1 form)) + (subforms (rx-subforms (cddr form) ':))) + (if (stringp subforms) + (concat "\\(?" (number-to-string n) ":" subforms "\\)") + `("\\(?" ,(number-to-string n) ":" ,@subforms "\\)")))) (defun rx-backref (form) "Parse and produce code from FORM, which is `(backref N)'." @@ -759,9 +773,12 @@ (defun rx-kleene (form) (t "?"))) (op (cond ((memq (car form) '(* *? 0+ zero-or-more)) "*") ((memq (car form) '(+ +? 1+ one-or-more)) "+") - (t "?")))) + (t "?"))) + (subform (rx-form (cadr form) '*))) (rx-group-if - (concat (rx-form (cadr form) '*) op suffix) + (if (stringp subform) + (concat subform op suffix) + `(,@subform ,(concat op suffix))) (and (memq rx-parent '(t *)) rx-parent)))) @@ -789,15 +806,18 @@ (defun rx-atomic-p (r &optional lax) be detected without much effort. A guarantee of no false negatives would require a theoretic specification of the set of all atomic regexps." - (let ((l (length r))) - (cond - ((<= l 1)) - ((= l 2) (= (aref r 0) ?\\)) - ((= l 3) (string-match "\\`\\(?:\\\\[cCsS_]\\|\\[[^^]\\]\\)" r)) - ((null lax) + (if (and rx--compile-to-lisp + (not (stringp r))) + nil ;; Runtime value, we must assume non-atomic. + (let ((l (length r))) (cond - ((string-match "\\`\\[\\^?]?\\(?:\\[:[a-z]+:]\\|[^]]\\)*]\\'" r)) - ((string-match "\\`\\\\(\\(?:[^\\]\\|\\\\[^)]\\)*\\\\)\\'" r))))))) + ((<= l 1)) + ((= l 2) (= (aref r 0) ?\\)) + ((= l 3) (string-match "\\`\\(?:\\\\[cCsS_]\\|\\[[^^]\\]\\)" r)) + ((null lax) + (cond + ((string-match "\\`\\[\\^?]?\\(?:\\[:[a-z]+:]\\|[^]]\\)*]\\'" r)) + ((string-match "\\`\\\\(\\(?:[^\\]\\|\\\\[^)]\\)*\\\\)\\'" r)))))))) (defun rx-syntax (form) @@ -853,9 +873,23 @@ (defun rx-greedy (form) (defun rx-regexp (form) "Parse and produce code from FORM, which is `(regexp STRING)'." - (rx-check form) - (rx-group-if (cadr form) rx-parent)) - + (cond ((stringp form) + (rx-group-if (cadr form) rx-parent)) + (rx--compile-to-lisp + ;; Always group non string forms, since we can't be sure they + ;; are atomic. + (rx-group-if (cdr form) t)) + (t (rx-check form)))) + +(defun rx-literal (form) + "Parse and produce code from FORM, which is `(literal STRING-EXP)'." + (cond ((stringp form) + ;; This is allowed(?), but makes little sense, you could just + ;; use STRING directly. + (rx-group-if (regexp-quote (cadr form)) rx-parent)) + (rx--compile-to-lisp + (rx-group-if `((regexp-quote ,(cadr form))) rx-parent)) + (t (rx-check form)))) (defun rx-form (form &optional parent) "Parse and produce code for regular expression FORM. @@ -886,12 +920,36 @@ (defun rx-form (form &optional parent) (t (error "rx syntax error at `%s'" form))))) +(defun rx-subforms (subforms &optional parent regexp-op) + (let ((listify (lambda (x) + (if (listp x) (copy-sequence x) + (list x)))) + (subregexps (cond ((cdr subforms) + (mapcar (lambda (x) (rx-form x parent)) subforms)) + (subforms + ;; Single form, no need for grouping. + (list (rx-form (car subforms)))) + ;; Zero forms. + (t "")))) + (cond ((or (not rx--compile-to-lisp) + (cl-every #'stringp subregexps)) + (mapconcat #'identity subregexps regexp-op)) + (regexp-op + (nconc (funcall listify (car subregexps)) + (cl-mapcan (lambda (x) + (cons regexp-op (funcall listify x))) + (cdr subregexps)))) + (t (cl-mapcan listify subregexps))))) + ;;;###autoload (defun rx-to-string (form &optional no-group) "Parse and produce code for regular expression FORM. FORM is a regular expression in sexp form. -NO-GROUP non-nil means don't put shy groups around the result." +NO-GROUP non-nil means don't put shy groups around the result. +Note that unlike for the `rx' macro, subforms `literal' and +`regexp' will not accept non-string arguments (so (literal +STRING) becomes just a more verbose version of STRING)." (rx-group-if (rx-form form) (null no-group))) @@ -901,8 +959,12 @@ (defmacro rx (&rest regexps) REGEXPS is a non-empty sequence of forms of the sort listed below. Note that `rx' is a Lisp macro; when used in a Lisp program being -compiled, the translation is performed by the compiler. -See `rx-to-string' for how to do such a translation at run-time. +compiled, the translation is performed by the compiler. The +`literal' and `regexp' forms accept subforms that will evaluate +to strings, in addition to constant strings. If REGEXPS include +such forms, then the result is an expression which returns a +regexp string, rather than a regexp string directly. See +`rx-to-string' for performing translation completely at run-time. The following are valid subforms of regular expressions in sexp notation. @@ -910,6 +972,10 @@ (defmacro rx (&rest regexps) STRING matches string STRING literally. +`(literal STRING)' + matches STRING literally, where STRING is any lisp + expression that evaluates to a string. + CHAR matches character CHAR literally. @@ -1208,12 +1274,16 @@ (defmacro rx (&rest regexps) `(regexp REGEXP)' include REGEXP in string notation in the result." - (cond ((null regexps) - (error "No regexp")) - ((cdr regexps) - (rx-to-string `(and ,@regexps) t)) - (t - (rx-to-string (car regexps) t)))) + (let* ((rx--compile-to-lisp t) + (re (cond ((null regexps) + (error "No regexp")) + ((cdr regexps) + (rx-to-string `(and ,@regexps) t)) + (t + (rx-to-string (car regexps) t))))) + (if (stringp re) + re + `(concat ,@re)))) (pcase-defmacro rx (&rest regexps) @@ -1275,14 +1345,6 @@ (pcase-defmacro rx (&rest regexps) for var in vars collect `(app (match-string ,i) ,var))))) \f -;; ;; sregex.el replacement - -;; ;;;###autoload (provide 'sregex) -;; ;;;###autoload (autoload 'sregex "rx") -;; (defalias 'sregex 'rx-to-string) -;; ;;;###autoload (autoload 'sregexq "rx" nil nil 'macro) -;; (defalias 'sregexq 'rx) -\f (provide 'rx) ;;; rx.el ends here diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el index 6f392d616d..bab71b522b 100644 --- a/test/lisp/emacs-lisp/rx-tests.el +++ b/test/lisp/emacs-lisp/rx-tests.el @@ -115,5 +115,46 @@ (ert-deftest rx-seq () ;; Test zero-argument `seq'. (should (equal (rx (seq)) ""))) +(defmacro rx-tests--match (regexp string &optional match) + (macroexp-let2 nil strexp string + `(ert-info ((format "Matching %S to %S" ',regexp ,strexp)) + (should (string-match ,regexp ,strexp)) + ,@(when match + `((should (equal (match-string 0 ,strexp) ,match))))))) + +(ert-deftest rx-nonstring-expr () + (let ((bee "b") + (vowel "[aeiou]")) + (rx-tests--match (rx "a" (literal bee) "c") "abc") + (rx-tests--match (rx "a" (regexp bee) "c") "abc") + (rx-tests--match (rx "a" (or (regexp bee) "xy") "c") "abc") + (rx-tests--match (rx "a" (or "xy" (regexp bee)) "c") "abc") + (should-not (string-match (rx (or (regexp bee) "xy")) "")) + (rx-tests--match (rx "a" (= 3 (regexp bee)) "c") "abbbc") + (rx-tests--match (rx "x" (= 3 (regexp vowel)) "z") "xeoez") + (should-not (string-match (rx "x" (= 3 (regexp vowel)) "z") "xe[]z")) + (rx-tests--match (rx "x" (= 3 (literal vowel)) "z") + "x[aeiou][aeiou][aeiou]z") + (rx-tests--match (rx "x" (repeat 1 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (repeat 1 2 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (repeat 1 2 (regexp vowel)) "z") "xauz") + (rx-tests--match (rx "x" (>= 1 (regexp vowel)) "z") "xaiiz") + (rx-tests--match (rx "x" (** 1 2 (regexp vowel)) "z") "xaiz") + (rx-tests--match (rx "x" (group (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (group-n 1 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (? (regexp vowel)) "z") "xz"))) + +(ert-deftest rx-nonstring-expr-non-greedy () + "`rx's greediness can't affect runtime regexp parts." + (let ((ad-min "[ad]*?") + (ad-max "[ad]*") + (ad "[ad]")) + (rx-tests--match (rx "c" (regexp ad-min) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (regexp ad-max) "a") "cdaaada" "cdaaada") + (rx-tests--match (rx "c" (minimal-match (regexp ad-max)) "a") "cdaaada" "cdaaada") + (rx-tests--match (rx "c" (maximal-match (regexp ad-min)) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (minimal-match (0+ (regexp ad))) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (maximal-match (0+ (regexp ad))) "a") "cdaaada" "cdaaada"))) + (provide 'rx-tests) ;; rx-tests.el ends here. -- 2.11.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-19 1:34 ` Noam Postavsky @ 2019-06-19 15:42 ` Mattias Engdegård 2019-06-20 0:29 ` Noam Postavsky 0 siblings, 1 reply; 28+ messages in thread From: Mattias Engdegård @ 2019-06-19 15:42 UTC (permalink / raw) To: Noam Postavsky; +Cc: Michael Heerdegen, 36237, Stefan Monnier, kevin.legouguec 19 juni 2019 kl. 03.34 skrev Noam Postavsky <npostavs@gmail.com>: > > Sure. Here's the patch with regexp-quote change to literal, and > rx--compile-to-lisp renamed. I'll wait a bit more and push this weekend > if there are no more comments. Thank you. Some comments: -;; (rx (and line-start (eval something-else))), statically or -;; (rx-to-string '(and line-start ,something-else)), dynamically. +;; (rx (and line-start (regexp something-else))), statically or +;; (rx-to-string `(and line-start ,something-else)), dynamically. With your patch, the rx-to-string example should no longer be recommended, but eval is still of interest for compile-time substitution. What about: ;; (rx (and line-start (eval something-else))), statically or ;; (rx (and line-start (regexp something-else))), dynamically. (extra points if you change `and' into `seq') + (cond ((stringp form) + (rx-group-if (cadr form) rx-parent)) + (rx--compile-to-lisp + ;; Always group non string forms, since we can't be sure they "non-string forms" +(defun rx-literal (form) + "Parse and produce code from FORM, which is `(literal STRING-EXP)'." + (cond ((stringp form) + ;; This is allowed(?), but makes little sense, you could just + ;; use STRING directly. Yes, I did the same in ry. Maybe `literal' should be disallowed entirely in rx-to-string, since it's more likely to be a misunderstanding on the user's part. +(defun rx-subforms (subforms &optional parent regexp-op) REGEXP-OP is perhaps better named SEPARATOR? + (cl-mapcan (lambda (x) + (cons regexp-op (funcall listify x))) + (cdr subregexps)))) + (t (cl-mapcan listify subregexps))))) Any reason for using cl-mapcan instead of straight mapcan? Not that it matters much. +NO-GROUP non-nil means don't put shy groups around the result. +Note that unlike for the `rx' macro, subforms `literal' and +`regexp' will not accept non-string arguments (so (literal +STRING) becomes just a more verbose version of STRING)." Try not breaking the line inside (literal STRING). ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-19 15:42 ` Mattias Engdegård @ 2019-06-20 0:29 ` Noam Postavsky 2019-06-20 10:26 ` Mattias Engdegård 0 siblings, 1 reply; 28+ messages in thread From: Noam Postavsky @ 2019-06-20 0:29 UTC (permalink / raw) To: Mattias Engdegård Cc: Michael Heerdegen, 36237, Stefan Monnier, kevin.legouguec [-- Attachment #1: Type: text/plain, Size: 2257 bytes --] Mattias Engdegård <mattiase@acm.org> writes: > +;; (rx (and line-start (regexp something-else))), statically or > +;; (rx-to-string `(and line-start ,something-else)), dynamically. > > With your patch, the rx-to-string example should no longer be > recommended, but eval is still of interest for compile-time > substitution. What about: > > ;; (rx (and line-start (eval something-else))), statically or > ;; (rx (and line-start (regexp something-else))), dynamically. Not sure that we really want to get into the subtleties of static eval in the intro examples. I'm thinking we just drop the rx-to-string example, without replacement. > + ;; Always group non string forms, since we can't be sure they > > "non-string forms" Right. > +(defun rx-literal (form) > + "Parse and produce code from FORM, which is `(literal STRING-EXP)'." > + (cond ((stringp form) > + ;; This is allowed(?), but makes little sense, you could just > + ;; use STRING directly. > > Yes, I did the same in ry. Maybe `literal' should be disallowed > entirely in rx-to-string, since it's more likely to be a > misunderstanding on the user's part. I think disallowing it could potentially be annoying during development, e.g., building incrementally in re-builder. > +(defun rx-subforms (subforms &optional parent regexp-op) > > REGEXP-OP is perhaps better named SEPARATOR? Yeah, especially since it's just the one "\\|" operator. > + (cl-mapcan (lambda (x) > + (cons regexp-op (funcall listify x))) > + (cdr subregexps)))) > + (t (cl-mapcan listify subregexps))))) > > Any reason for using cl-mapcan instead of straight mapcan? > Not that it matters much. I, um, didn't realize mapcan was builtin (when I saw mapcan elsewhere I just assumed it was from cl.el). In my defence, it's new since 26.1 :p (In addition to the above two points, I've renamed this function to rx--subforms, and re-arranged the code a bit) > +`regexp' will not accept non-string arguments (so (literal > +STRING) becomes just a more verbose version of STRING)." > > Try not breaking the line inside (literal STRING). Right. [-- Attachment #2: patch --] [-- Type: text/plain, Size: 17816 bytes --] From 6351f9d8ac0ff5643b849f2c8e3eb44ea1641fc5 Mon Sep 17 00:00:00 2001 From: Noam Postavsky <npostavs@gmail.com> Date: Fri, 14 Jun 2019 08:43:17 -0400 Subject: [PATCH] Support (rx (and (regexp EXPR) (literal EXPR))) (Bug#36237) * lisp/emacs-lisp/rx.el (rx-regexp): Allow non-string forms. (rx-constituents): Add literal constituent, which is like a plain STRING form, but allows arbitrary lisp expressions. (rx-literal): New function. (rx-compile-to-lisp): New variable. (rx--subforms): New helper function for handling subforms, including non-constant case. (rx-group-if, rx-and, rx-or, rx-=, rx->=, rx-repeat, rx-submatch) (rx-submatch-n, rx-kleene, rx-atomic-p): Use it to handle non-constant subforms. (rx): Document new form, wrap non-constant forms with concat call. * test/lisp/emacs-lisp/rx-tests.el (rx-tests--match): New macro. (rx-nonstring-expr, rx-nonstring-expr-non-greedy): New tests. * etc/NEWS: Announce changes. --- etc/NEWS | 6 ++ lisp/emacs-lisp/rx.el | 189 ++++++++++++++++++++++++++------------- test/lisp/emacs-lisp/rx-tests.el | 41 +++++++++ 3 files changed, 173 insertions(+), 63 deletions(-) diff --git a/etc/NEWS b/etc/NEWS index 723f0a0fb0..42958bca36 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -1380,12 +1380,18 @@ when given in a string. Previously, '(any "\x80-\xff")' would match characters U+0080...U+00FF. Now the expression matches raw bytes in the 128...255 range, as expected. +--- *** The rx 'or' and 'seq' forms no longer require any arguments. (or) produces a regexp that never matches anything, while (seq) matches the empty string, each being an identity for the operation. This also works for their aliases: '|' for 'or'; ':', 'and' and 'sequence' for 'seq'. +--- +*** 'regexp' and new 'literal' accept arbitrary lisp as arguments. +In this case, 'rx' will generate code which produces a regexp string +at runtime, instead of a constant string. + ** Frames +++ diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el index 8ef78fd69e..12e33aaded 100644 --- a/lisp/emacs-lisp/rx.el +++ b/lisp/emacs-lisp/rx.el @@ -47,9 +47,11 @@ ;; Rx translates a sexp notation for regular expressions into the ;; usual string notation. The translation can be done at compile-time -;; by using the `rx' macro. It can be done at run-time by calling -;; function `rx-to-string'. See the documentation of `rx' for a -;; complete description of the sexp notation. +;; by using the `rx' macro. The `regexp' and `literal' forms accept +;; non-constant expressions, in which case `rx' will translate to a +;; `concat' expression. Translation can be done fully at run-time by +;; calling function `rx-to-string'. See the documentation of `rx' for +;; a complete description of the sexp notation. ;; ;; Some examples of string regexps and their sexp counterparts: ;; @@ -78,8 +80,7 @@ ;; (+ (? ?\n)) (any " \t")) ;; ;; (concat "^\\(?:" something-else "\\)") -;; (rx (and line-start (eval something-else))), statically or -;; (rx-to-string '(and line-start ,something-else)), dynamically. +;; (rx (seq line-start (regexp something-else))) ;; ;; (regexp-opt '(STRING1 STRING2 ...)) ;; (rx (or STRING1 STRING2 ...)), or in other words, `or' automatically @@ -176,6 +177,7 @@ (defvar rx-constituents ;Not `const' because some modes extend it. (not-syntax . (rx-not-syntax 1 1)) ; sregex (category . (rx-category 1 1 rx-check-category)) (eval . (rx-eval 1 1)) + (literal . (rx-literal 1 1 stringp)) (regexp . (rx-regexp 1 1 stringp)) (regex . regexp) ; sregex (digit . "[[:digit:]]") @@ -302,6 +304,10 @@ (defvar rx-greedy-flag t "Non-nil means produce greedy regular expressions for `zero-or-one', `zero-or-more', and `one-or-more'. Dynamically bound.") +(defvar rx--compile-to-lisp nil + "Nil means return a regexp as a string. +Non-nil means we may return a lisp form which produces a +string (used for `rx' macro).") (defun rx-info (op head) "Return parsing/code generation info for OP. @@ -344,7 +350,7 @@ (defun rx-check (form) (> nargs max-args)) (error "rx form `%s' accepts at most %d args" (car form) max-args)) - (when (not (null type-pred)) + (when type-pred (dolist (sub-form (cdr form)) (unless (funcall type-pred sub-form) (error "rx form `%s' requires args satisfying `%s'" @@ -360,8 +366,9 @@ (defun rx-group-if (regexp group) ;; for concatenation ((eq group ':) (if (rx-atomic-p - (if (string-match - "\\(?:[?*+]\\??\\|\\\\{[0-9]*,?[0-9]*\\\\}\\)\\'" regexp) + (if (and (stringp regexp) + (string-match + "\\(?:[?*+]\\??\\|\\\\{[0-9]*,?[0-9]*\\\\}\\)\\'" regexp)) (substring regexp 0 (match-beginning 0)) regexp)) (setq group nil))) @@ -370,9 +377,10 @@ (defun rx-group-if (regexp group) ;; do anyway ((eq group t)) ((rx-atomic-p regexp t) (setq group nil))) - (if group - (concat "\\(?:" regexp "\\)") - regexp)) + (cond ((and group (stringp regexp)) + (concat "\\(?:" regexp "\\)")) + (group `("\\(?:" ,@regexp "\\)")) + (t regexp))) (defvar rx-parent) @@ -384,7 +392,7 @@ (defun rx-and (form) FORM is of the form `(and FORM1 ...)'." (rx-check form) (rx-group-if - (mapconcat (lambda (x) (rx-form x ':)) (cdr form) nil) + (rx--subforms (cdr form) ':) (and (memq rx-parent '(* t)) rx-parent))) @@ -396,7 +404,7 @@ (defun rx-or (form) ((null (cdr form)) regexp-unmatchable) ((cl-every #'stringp (cdr form)) (regexp-opt (cdr form) nil t)) - (t (mapconcat (lambda (x) (rx-form x '|)) (cdr form) "\\|"))) + (t (rx--subforms (cdr form) '| "\\|"))) (and (memq rx-parent '(: * t)) rx-parent))) @@ -669,7 +677,10 @@ (defun rx-= (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `=' requires positive integer first arg")) - (format "%s\\{%d\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d\\}" (nth 1 form)))))) (defun rx->= (form) @@ -679,7 +690,10 @@ (defun rx->= (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `>=' requires positive integer first arg")) - (format "%s\\{%d,\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d,\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d,\\}" (nth 1 form)))))) (defun rx-** (form) @@ -700,7 +714,10 @@ (defun rx-repeat (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `repeat' requires positive integer first arg")) - (format "%s\\{%d\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d\\}" (nth 1 form)))))) ((or (not (integerp (nth 2 form))) (< (nth 2 form) 0) (not (integerp (nth 1 form))) @@ -708,30 +725,26 @@ (defun rx-repeat (form) (< (nth 2 form) (nth 1 form))) (error "rx `repeat' range error")) (t - (format "%s\\{%d,%d\\}" (rx-form (nth 3 form) '*) - (nth 1 form) (nth 2 form))))) + (let ((subform (rx-form (nth 3 form) '*))) + (if (stringp subform) + (format "%s\\{%d,%d\\}" subform (nth 1 form) (nth 2 form)) + `(,@subform ,(format "\\{%d,%d\\}" (nth 1 form) (nth 2 form)))))))) (defun rx-submatch (form) "Parse and produce code from FORM, which is `(submatch ...)'." - (concat "\\(" - (if (= 2 (length form)) - ;; Only one sub-form. - (rx-form (cadr form)) - ;; Several sub-forms implicitly concatenated. - (mapconcat (lambda (re) (rx-form re ':)) (cdr form) nil)) - "\\)")) + (let ((subforms (rx--subforms (cdr form) ':))) + (if (stringp subforms) + (concat "\\(" subforms "\\)") + `("\\(" ,@subforms "\\)")))) (defun rx-submatch-n (form) "Parse and produce code from FORM, which is `(submatch-n N ...)'." - (let ((n (nth 1 form))) - (concat "\\(?" (number-to-string n) ":" - (if (= 3 (length form)) - ;; Only one sub-form. - (rx-form (nth 2 form)) - ;; Several sub-forms implicitly concatenated. - (mapconcat (lambda (re) (rx-form re ':)) (cddr form) nil)) - "\\)"))) + (let ((n (nth 1 form)) + (subforms (rx--subforms (cddr form) ':))) + (if (stringp subforms) + (concat "\\(?" (number-to-string n) ":" subforms "\\)") + `("\\(?" ,(number-to-string n) ":" ,@subforms "\\)")))) (defun rx-backref (form) "Parse and produce code from FORM, which is `(backref N)'." @@ -759,9 +772,12 @@ (defun rx-kleene (form) (t "?"))) (op (cond ((memq (car form) '(* *? 0+ zero-or-more)) "*") ((memq (car form) '(+ +? 1+ one-or-more)) "+") - (t "?")))) + (t "?"))) + (subform (rx-form (cadr form) '*))) (rx-group-if - (concat (rx-form (cadr form) '*) op suffix) + (if (stringp subform) + (concat subform op suffix) + `(,@subform ,(concat op suffix))) (and (memq rx-parent '(t *)) rx-parent)))) @@ -789,15 +805,18 @@ (defun rx-atomic-p (r &optional lax) be detected without much effort. A guarantee of no false negatives would require a theoretic specification of the set of all atomic regexps." - (let ((l (length r))) - (cond - ((<= l 1)) - ((= l 2) (= (aref r 0) ?\\)) - ((= l 3) (string-match "\\`\\(?:\\\\[cCsS_]\\|\\[[^^]\\]\\)" r)) - ((null lax) + (if (and rx--compile-to-lisp + (not (stringp r))) + nil ;; Runtime value, we must assume non-atomic. + (let ((l (length r))) (cond - ((string-match "\\`\\[\\^?]?\\(?:\\[:[a-z]+:]\\|[^]]\\)*]\\'" r)) - ((string-match "\\`\\\\(\\(?:[^\\]\\|\\\\[^)]\\)*\\\\)\\'" r))))))) + ((<= l 1)) + ((= l 2) (= (aref r 0) ?\\)) + ((= l 3) (string-match "\\`\\(?:\\\\[cCsS_]\\|\\[[^^]\\]\\)" r)) + ((null lax) + (cond + ((string-match "\\`\\[\\^?]?\\(?:\\[:[a-z]+:]\\|[^]]\\)*]\\'" r)) + ((string-match "\\`\\\\(\\(?:[^\\]\\|\\\\[^)]\\)*\\\\)\\'" r)))))))) (defun rx-syntax (form) @@ -853,9 +872,23 @@ (defun rx-greedy (form) (defun rx-regexp (form) "Parse and produce code from FORM, which is `(regexp STRING)'." - (rx-check form) - (rx-group-if (cadr form) rx-parent)) - + (cond ((stringp form) + (rx-group-if (cadr form) rx-parent)) + (rx--compile-to-lisp + ;; Always group non-string forms, since we can't be sure they + ;; are atomic. + (rx-group-if (cdr form) t)) + (t (rx-check form)))) + +(defun rx-literal (form) + "Parse and produce code from FORM, which is `(literal STRING-EXP)'." + (cond ((stringp form) + ;; This is allowed, but makes little sense, you could just + ;; use STRING directly. + (rx-group-if (regexp-quote (cadr form)) rx-parent)) + (rx--compile-to-lisp + (rx-group-if `((regexp-quote ,(cadr form))) rx-parent)) + (t (rx-check form)))) (defun rx-form (form &optional parent) "Parse and produce code for regular expression FORM. @@ -886,12 +919,38 @@ (defun rx-form (form &optional parent) (t (error "rx syntax error at `%s'" form))))) +(defun rx--subforms (subforms &optional parent separator) + "Produce code for regular expressions SUBFORMS. +SUBFORMS is a list of regular expression sexps. +PARENT controls grouping, as in `rx-form'. +Insert SEPARATOR between the code from each of SUBFORMS." + (if (null (cdr subforms)) + ;; Zero or one forms, no need for grouping. + (and subforms (rx-form (car subforms))) + (let ((listify (lambda (x) + (if (listp x) (copy-sequence x) + (list x))))) + (setq subforms (mapcar (lambda (x) (rx-form x parent)) subforms)) + (cond ((or (not rx--compile-to-lisp) + (cl-every #'stringp subforms)) + (mapconcat #'identity subforms separator)) + (separator + (nconc (funcall listify (car subforms)) + (mapcan (lambda (x) + (cons separator (funcall listify x))) + (cdr subforms)))) + (t (mapcan listify subforms)))))) + ;;;###autoload (defun rx-to-string (form &optional no-group) "Parse and produce code for regular expression FORM. FORM is a regular expression in sexp form. -NO-GROUP non-nil means don't put shy groups around the result." +NO-GROUP non-nil means don't put shy groups around the result. + +In contrast to the `rx' macro, subforms `literal' and `regexp' +will not accept non-string arguments, i.e., (literal STRING) +becomes just a more verbose version of STRING." (rx-group-if (rx-form form) (null no-group))) @@ -901,8 +960,12 @@ (defmacro rx (&rest regexps) REGEXPS is a non-empty sequence of forms of the sort listed below. Note that `rx' is a Lisp macro; when used in a Lisp program being -compiled, the translation is performed by the compiler. -See `rx-to-string' for how to do such a translation at run-time. +compiled, the translation is performed by the compiler. The +`literal' and `regexp' forms accept subforms that will evaluate +to strings, in addition to constant strings. If REGEXPS include +such forms, then the result is an expression which returns a +regexp string, rather than a regexp string directly. See +`rx-to-string' for performing translation completely at run-time. The following are valid subforms of regular expressions in sexp notation. @@ -910,6 +973,10 @@ (defmacro rx (&rest regexps) STRING matches string STRING literally. +`(literal STRING)' + matches STRING literally, where STRING is any lisp + expression that evaluates to a string. + CHAR matches character CHAR literally. @@ -1208,12 +1275,16 @@ (defmacro rx (&rest regexps) `(regexp REGEXP)' include REGEXP in string notation in the result." - (cond ((null regexps) - (error "No regexp")) - ((cdr regexps) - (rx-to-string `(and ,@regexps) t)) - (t - (rx-to-string (car regexps) t)))) + (let* ((rx--compile-to-lisp t) + (re (cond ((null regexps) + (error "No regexp")) + ((cdr regexps) + (rx-to-string `(and ,@regexps) t)) + (t + (rx-to-string (car regexps) t))))) + (if (stringp re) + re + `(concat ,@re)))) (pcase-defmacro rx (&rest regexps) @@ -1275,14 +1346,6 @@ (pcase-defmacro rx (&rest regexps) for var in vars collect `(app (match-string ,i) ,var))))) \f -;; ;; sregex.el replacement - -;; ;;;###autoload (provide 'sregex) -;; ;;;###autoload (autoload 'sregex "rx") -;; (defalias 'sregex 'rx-to-string) -;; ;;;###autoload (autoload 'sregexq "rx" nil nil 'macro) -;; (defalias 'sregexq 'rx) -\f (provide 'rx) ;;; rx.el ends here diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el index 6f392d616d..bab71b522b 100644 --- a/test/lisp/emacs-lisp/rx-tests.el +++ b/test/lisp/emacs-lisp/rx-tests.el @@ -115,5 +115,46 @@ (ert-deftest rx-seq () ;; Test zero-argument `seq'. (should (equal (rx (seq)) ""))) +(defmacro rx-tests--match (regexp string &optional match) + (macroexp-let2 nil strexp string + `(ert-info ((format "Matching %S to %S" ',regexp ,strexp)) + (should (string-match ,regexp ,strexp)) + ,@(when match + `((should (equal (match-string 0 ,strexp) ,match))))))) + +(ert-deftest rx-nonstring-expr () + (let ((bee "b") + (vowel "[aeiou]")) + (rx-tests--match (rx "a" (literal bee) "c") "abc") + (rx-tests--match (rx "a" (regexp bee) "c") "abc") + (rx-tests--match (rx "a" (or (regexp bee) "xy") "c") "abc") + (rx-tests--match (rx "a" (or "xy" (regexp bee)) "c") "abc") + (should-not (string-match (rx (or (regexp bee) "xy")) "")) + (rx-tests--match (rx "a" (= 3 (regexp bee)) "c") "abbbc") + (rx-tests--match (rx "x" (= 3 (regexp vowel)) "z") "xeoez") + (should-not (string-match (rx "x" (= 3 (regexp vowel)) "z") "xe[]z")) + (rx-tests--match (rx "x" (= 3 (literal vowel)) "z") + "x[aeiou][aeiou][aeiou]z") + (rx-tests--match (rx "x" (repeat 1 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (repeat 1 2 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (repeat 1 2 (regexp vowel)) "z") "xauz") + (rx-tests--match (rx "x" (>= 1 (regexp vowel)) "z") "xaiiz") + (rx-tests--match (rx "x" (** 1 2 (regexp vowel)) "z") "xaiz") + (rx-tests--match (rx "x" (group (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (group-n 1 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (? (regexp vowel)) "z") "xz"))) + +(ert-deftest rx-nonstring-expr-non-greedy () + "`rx's greediness can't affect runtime regexp parts." + (let ((ad-min "[ad]*?") + (ad-max "[ad]*") + (ad "[ad]")) + (rx-tests--match (rx "c" (regexp ad-min) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (regexp ad-max) "a") "cdaaada" "cdaaada") + (rx-tests--match (rx "c" (minimal-match (regexp ad-max)) "a") "cdaaada" "cdaaada") + (rx-tests--match (rx "c" (maximal-match (regexp ad-min)) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (minimal-match (0+ (regexp ad))) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (maximal-match (0+ (regexp ad))) "a") "cdaaada" "cdaaada"))) + (provide 'rx-tests) ;; rx-tests.el ends here. -- 2.11.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-20 0:29 ` Noam Postavsky @ 2019-06-20 10:26 ` Mattias Engdegård 2019-06-22 22:05 ` Noam Postavsky 0 siblings, 1 reply; 28+ messages in thread From: Mattias Engdegård @ 2019-06-20 10:26 UTC (permalink / raw) To: Noam Postavsky; +Cc: Michael Heerdegen, 36237, Stefan Monnier, kevin.legouguec 20 juni 2019 kl. 02.29 skrev Noam Postavsky <npostavs@gmail.com>: > > <0001-Support-rx-and-regexp-EXPR-literal-EXPR-Bug-36237.patch> Things I didn't notice earlier: -;; (rx (and line-start (eval something-else))), statically or -;; (rx-to-string '(and line-start ,something-else)), dynamically. +;; (rx (seq line-start (regexp something-else))) You can actually drop the `seq' form entirely, since it's implicit in `rx'. It was only needed for `rx-to-string' which is now gone. +`(literal STRING)' + matches STRING literally, where STRING is any lisp + expression that evaluates to a string. It's better to name the metavariable EXPR, STRING-EXPR or LISP-EXPR to make it clear that it's an arbitrary lisp expression, especially since STRING is used for a constant string just above. The same goes for `regexp', since it can now be a lisp expression; this should be mentioned in the describing paragraph, using a similar phrasing. The `literal' item should probably be moved next to `regexp', since they are closely related. The paragraph on `eval' uses FORM, which is too generic; presumably it should become EXPR too (but not STRING-EXPR, since it can be any rx value). ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-20 10:26 ` Mattias Engdegård @ 2019-06-22 22:05 ` Noam Postavsky 2019-06-23 11:09 ` Mattias Engdegård 2019-06-23 14:45 ` Drew Adams 0 siblings, 2 replies; 28+ messages in thread From: Noam Postavsky @ 2019-06-22 22:05 UTC (permalink / raw) To: Mattias Engdegård Cc: Michael Heerdegen, 36237, Stefan Monnier, kevin.legouguec [-- Attachment #1: Type: text/plain, Size: 1555 bytes --] Mattias Engdegård <mattiase@acm.org> writes: > -;; (rx (and line-start (eval something-else))), statically or > -;; (rx-to-string '(and line-start ,something-else)), dynamically. > +;; (rx (seq line-start (regexp something-else))) > > You can actually drop the `seq' form entirely, since it's implicit in `rx'. > It was only needed for `rx-to-string' which is now gone. Yeah, that applies to most of the examples actually. Updated (and I found a couple of mistakes in them). > +`(literal STRING)' > + matches STRING literally, where STRING is any lisp > + expression that evaluates to a string. > > It's better to name the metavariable EXPR, STRING-EXPR or LISP-EXPR to > make it clear that it's an arbitrary lisp expression, especially since > STRING is used for a constant string just above. Sure. > The same goes for `regexp', since it can now be a lisp expression; > this should be mentioned in the describing paragraph, using a similar > phrasing. The `literal' item should probably be moved next to > `regexp', since they are closely related. Yeah, I wasn't entirely sure whether `literal' should be considered more related to `regexp' or STRING. I guess since I've added a mention of `literal' and `regexp' in the paragraphs above it makes sense to put them at the end together. > The paragraph on `eval' uses FORM, which is too generic No, it's not generic, see (info "(elisp) Intro Eval"): A Lisp object that is intended for evaluation is called a "form" or "expression"(1). [-- Attachment #2: patch --] [-- Type: text/plain, Size: 19932 bytes --] From 3346081acbac014ac3ecef468a46c19e60e9dcc0 Mon Sep 17 00:00:00 2001 From: Noam Postavsky <npostavs@gmail.com> Date: Fri, 14 Jun 2019 08:43:17 -0400 Subject: [PATCH] Support (rx (and (regexp EXPR) (literal EXPR))) (Bug#36237) * lisp/emacs-lisp/rx.el (rx-regexp): Allow non-string forms. (rx-constituents): Add literal constituent, which is like a plain STRING form, but allows arbitrary lisp expressions. (rx-literal): New function. (rx-compile-to-lisp): New variable. (rx--subforms): New helper function for handling subforms, including non-constant case. (rx-group-if, rx-and, rx-or, rx-=, rx->=, rx-repeat, rx-submatch) (rx-submatch-n, rx-kleene, rx-atomic-p): Use it to handle non-constant subforms. (rx): Document new form, wrap non-constant forms with concat call. * test/lisp/emacs-lisp/rx-tests.el (rx-tests--match): New macro. (rx-nonstring-expr, rx-nonstring-expr-non-greedy): New tests. * etc/NEWS: Announce changes. squash! Support (rx (and (regexp EXPR) (literal EXPR))) (Bug#36237) --- etc/NEWS | 6 + lisp/emacs-lisp/rx.el | 240 +++++++++++++++++++++++++-------------- test/lisp/emacs-lisp/rx-tests.el | 41 +++++++ 3 files changed, 200 insertions(+), 87 deletions(-) diff --git a/etc/NEWS b/etc/NEWS index 0cfac248a3..dc034a55af 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -1400,12 +1400,18 @@ when given in a string. Previously, '(any "\x80-\xff")' would match characters U+0080...U+00FF. Now the expression matches raw bytes in the 128...255 range, as expected. +--- *** The rx 'or' and 'seq' forms no longer require any arguments. (or) produces a regexp that never matches anything, while (seq) matches the empty string, each being an identity for the operation. This also works for their aliases: '|' for 'or'; ':', 'and' and 'sequence' for 'seq'. +--- +*** 'regexp' and new 'literal' accept arbitrary lisp as arguments. +In this case, 'rx' will generate code which produces a regexp string +at runtime, instead of a constant string. + ** Frames +++ diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el index 8ef78fd69e..c59eb40f08 100644 --- a/lisp/emacs-lisp/rx.el +++ b/lisp/emacs-lisp/rx.el @@ -47,57 +47,58 @@ ;; Rx translates a sexp notation for regular expressions into the ;; usual string notation. The translation can be done at compile-time -;; by using the `rx' macro. It can be done at run-time by calling -;; function `rx-to-string'. See the documentation of `rx' for a -;; complete description of the sexp notation. +;; by using the `rx' macro. The `regexp' and `literal' forms accept +;; non-constant expressions, in which case `rx' will translate to a +;; `concat' expression. Translation can be done fully at run-time by +;; calling function `rx-to-string'. See the documentation of `rx' for +;; a complete description of the sexp notation. ;; ;; Some examples of string regexps and their sexp counterparts: ;; ;; "^[a-z]*" -;; (rx (and line-start (0+ (in "a-z")))) +;; (rx line-start (0+ (in "a-z"))) ;; ;; "\n[^ \t]" -;; (rx (and "\n" (not (any " \t")))) +;; (rx ?\n (not (in " \t"))) ;; ;; "\\*\\*\\* EOOH \\*\\*\\*\n" ;; (rx "*** EOOH ***\n") ;; ;; "\\<\\(catch\\|finally\\)\\>[^_]" -;; (rx (and word-start (submatch (or "catch" "finally")) word-end -;; (not (any ?_)))) +;; (rx word-start (submatch (or "catch" "finally")) word-end +;; (not (in ?_))) ;; ;; "[ \t\n]*:\\([^:]+\\|$\\)" -;; (rx (and (zero-or-more (in " \t\n")) ":" -;; (submatch (or line-end (one-or-more (not (any ?:))))))) +;; (rx (* (in " \t\n")) ":" +;; (submatch (or line-end (+ (not (in ?:)))))) ;; -;; "^content-transfer-encoding:\\(\n?[\t ]\\)*quoted-printable\\(\n?[\t ]\\)*" -;; (rx (and line-start -;; "content-transfer-encoding:" -;; (+ (? ?\n)) (any " \t") -;; "quoted-printable" -;; (+ (? ?\n)) (any " \t")) +;; "^content-transfer-encoding:\\(?:\n?[\t ]\\)*quoted-printable\\(?:\n?[\t ]\\)*" +;; (rx line-start +;; "content-transfer-encoding:" +;; (* (? ?\n) (in " \t")) +;; "quoted-printable" +;; (* (? ?\n) (in " \t"))) ;; ;; (concat "^\\(?:" something-else "\\)") -;; (rx (and line-start (eval something-else))), statically or -;; (rx-to-string '(and line-start ,something-else)), dynamically. +;; (rx line-start (regexp something-else)) ;; ;; (regexp-opt '(STRING1 STRING2 ...)) ;; (rx (or STRING1 STRING2 ...)), or in other words, `or' automatically ;; calls `regexp-opt' as needed. ;; ;; "^;;\\s-*\n\\|^\n" -;; (rx (or (and line-start ";;" (0+ space) ?\n) -;; (and line-start ?\n))) +;; (rx (or (seq line-start ";;" (0+ space) ?\n) +;; (seq line-start ?\n))) ;; ;; "\\$[I]d: [^ ]+ \\([^ ]+\\) " -;; (rx (and "$Id: " -;; (1+ (not (in " "))) -;; " " -;; (submatch (1+ (not (in " ")))) -;; " ")) +;; (rx "$Id: " +;; (1+ (not (in " "))) +;; " " +;; (submatch (1+ (not (in " ")))) +;; " ") ;; ;; "\\\\\\\\\\[\\w+" -;; (rx (and ?\\ ?\\ ?\[ (1+ word))) +;; (rx "\\\\[" (1+ word)) ;; ;; etc. @@ -176,6 +177,7 @@ (defvar rx-constituents ;Not `const' because some modes extend it. (not-syntax . (rx-not-syntax 1 1)) ; sregex (category . (rx-category 1 1 rx-check-category)) (eval . (rx-eval 1 1)) + (literal . (rx-literal 1 1 stringp)) (regexp . (rx-regexp 1 1 stringp)) (regex . regexp) ; sregex (digit . "[[:digit:]]") @@ -302,6 +304,10 @@ (defvar rx-greedy-flag t "Non-nil means produce greedy regular expressions for `zero-or-one', `zero-or-more', and `one-or-more'. Dynamically bound.") +(defvar rx--compile-to-lisp nil + "Nil means return a regexp as a string. +Non-nil means we may return a lisp form which produces a +string (used for `rx' macro).") (defun rx-info (op head) "Return parsing/code generation info for OP. @@ -344,7 +350,7 @@ (defun rx-check (form) (> nargs max-args)) (error "rx form `%s' accepts at most %d args" (car form) max-args)) - (when (not (null type-pred)) + (when type-pred (dolist (sub-form (cdr form)) (unless (funcall type-pred sub-form) (error "rx form `%s' requires args satisfying `%s'" @@ -360,8 +366,9 @@ (defun rx-group-if (regexp group) ;; for concatenation ((eq group ':) (if (rx-atomic-p - (if (string-match - "\\(?:[?*+]\\??\\|\\\\{[0-9]*,?[0-9]*\\\\}\\)\\'" regexp) + (if (and (stringp regexp) + (string-match + "\\(?:[?*+]\\??\\|\\\\{[0-9]*,?[0-9]*\\\\}\\)\\'" regexp)) (substring regexp 0 (match-beginning 0)) regexp)) (setq group nil))) @@ -370,9 +377,10 @@ (defun rx-group-if (regexp group) ;; do anyway ((eq group t)) ((rx-atomic-p regexp t) (setq group nil))) - (if group - (concat "\\(?:" regexp "\\)") - regexp)) + (cond ((and group (stringp regexp)) + (concat "\\(?:" regexp "\\)")) + (group `("\\(?:" ,@regexp "\\)")) + (t regexp))) (defvar rx-parent) @@ -384,7 +392,7 @@ (defun rx-and (form) FORM is of the form `(and FORM1 ...)'." (rx-check form) (rx-group-if - (mapconcat (lambda (x) (rx-form x ':)) (cdr form) nil) + (rx--subforms (cdr form) ':) (and (memq rx-parent '(* t)) rx-parent))) @@ -396,7 +404,7 @@ (defun rx-or (form) ((null (cdr form)) regexp-unmatchable) ((cl-every #'stringp (cdr form)) (regexp-opt (cdr form) nil t)) - (t (mapconcat (lambda (x) (rx-form x '|)) (cdr form) "\\|"))) + (t (rx--subforms (cdr form) '| "\\|"))) (and (memq rx-parent '(: * t)) rx-parent))) @@ -669,7 +677,10 @@ (defun rx-= (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `=' requires positive integer first arg")) - (format "%s\\{%d\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d\\}" (nth 1 form)))))) (defun rx->= (form) @@ -679,7 +690,10 @@ (defun rx->= (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `>=' requires positive integer first arg")) - (format "%s\\{%d,\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d,\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d,\\}" (nth 1 form)))))) (defun rx-** (form) @@ -700,7 +714,10 @@ (defun rx-repeat (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `repeat' requires positive integer first arg")) - (format "%s\\{%d\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d\\}" (nth 1 form)))))) ((or (not (integerp (nth 2 form))) (< (nth 2 form) 0) (not (integerp (nth 1 form))) @@ -708,30 +725,26 @@ (defun rx-repeat (form) (< (nth 2 form) (nth 1 form))) (error "rx `repeat' range error")) (t - (format "%s\\{%d,%d\\}" (rx-form (nth 3 form) '*) - (nth 1 form) (nth 2 form))))) + (let ((subform (rx-form (nth 3 form) '*))) + (if (stringp subform) + (format "%s\\{%d,%d\\}" subform (nth 1 form) (nth 2 form)) + `(,@subform ,(format "\\{%d,%d\\}" (nth 1 form) (nth 2 form)))))))) (defun rx-submatch (form) "Parse and produce code from FORM, which is `(submatch ...)'." - (concat "\\(" - (if (= 2 (length form)) - ;; Only one sub-form. - (rx-form (cadr form)) - ;; Several sub-forms implicitly concatenated. - (mapconcat (lambda (re) (rx-form re ':)) (cdr form) nil)) - "\\)")) + (let ((subforms (rx--subforms (cdr form) ':))) + (if (stringp subforms) + (concat "\\(" subforms "\\)") + `("\\(" ,@subforms "\\)")))) (defun rx-submatch-n (form) "Parse and produce code from FORM, which is `(submatch-n N ...)'." - (let ((n (nth 1 form))) - (concat "\\(?" (number-to-string n) ":" - (if (= 3 (length form)) - ;; Only one sub-form. - (rx-form (nth 2 form)) - ;; Several sub-forms implicitly concatenated. - (mapconcat (lambda (re) (rx-form re ':)) (cddr form) nil)) - "\\)"))) + (let ((n (nth 1 form)) + (subforms (rx--subforms (cddr form) ':))) + (if (stringp subforms) + (concat "\\(?" (number-to-string n) ":" subforms "\\)") + `("\\(?" ,(number-to-string n) ":" ,@subforms "\\)")))) (defun rx-backref (form) "Parse and produce code from FORM, which is `(backref N)'." @@ -759,9 +772,12 @@ (defun rx-kleene (form) (t "?"))) (op (cond ((memq (car form) '(* *? 0+ zero-or-more)) "*") ((memq (car form) '(+ +? 1+ one-or-more)) "+") - (t "?")))) + (t "?"))) + (subform (rx-form (cadr form) '*))) (rx-group-if - (concat (rx-form (cadr form) '*) op suffix) + (if (stringp subform) + (concat subform op suffix) + `(,@subform ,(concat op suffix))) (and (memq rx-parent '(t *)) rx-parent)))) @@ -789,15 +805,18 @@ (defun rx-atomic-p (r &optional lax) be detected without much effort. A guarantee of no false negatives would require a theoretic specification of the set of all atomic regexps." - (let ((l (length r))) - (cond - ((<= l 1)) - ((= l 2) (= (aref r 0) ?\\)) - ((= l 3) (string-match "\\`\\(?:\\\\[cCsS_]\\|\\[[^^]\\]\\)" r)) - ((null lax) + (if (and rx--compile-to-lisp + (not (stringp r))) + nil ;; Runtime value, we must assume non-atomic. + (let ((l (length r))) (cond - ((string-match "\\`\\[\\^?]?\\(?:\\[:[a-z]+:]\\|[^]]\\)*]\\'" r)) - ((string-match "\\`\\\\(\\(?:[^\\]\\|\\\\[^)]\\)*\\\\)\\'" r))))))) + ((<= l 1)) + ((= l 2) (= (aref r 0) ?\\)) + ((= l 3) (string-match "\\`\\(?:\\\\[cCsS_]\\|\\[[^^]\\]\\)" r)) + ((null lax) + (cond + ((string-match "\\`\\[\\^?]?\\(?:\\[:[a-z]+:]\\|[^]]\\)*]\\'" r)) + ((string-match "\\`\\\\(\\(?:[^\\]\\|\\\\[^)]\\)*\\\\)\\'" r)))))))) (defun rx-syntax (form) @@ -853,9 +872,23 @@ (defun rx-greedy (form) (defun rx-regexp (form) "Parse and produce code from FORM, which is `(regexp STRING)'." - (rx-check form) - (rx-group-if (cadr form) rx-parent)) - + (cond ((stringp form) + (rx-group-if (cadr form) rx-parent)) + (rx--compile-to-lisp + ;; Always group non-string forms, since we can't be sure they + ;; are atomic. + (rx-group-if (cdr form) t)) + (t (rx-check form)))) + +(defun rx-literal (form) + "Parse and produce code from FORM, which is `(literal STRING-EXP)'." + (cond ((stringp form) + ;; This is allowed, but makes little sense, you could just + ;; use STRING directly. + (rx-group-if (regexp-quote (cadr form)) rx-parent)) + (rx--compile-to-lisp + (rx-group-if `((regexp-quote ,(cadr form))) rx-parent)) + (t (rx-check form)))) (defun rx-form (form &optional parent) "Parse and produce code for regular expression FORM. @@ -886,12 +919,38 @@ (defun rx-form (form &optional parent) (t (error "rx syntax error at `%s'" form))))) +(defun rx--subforms (subforms &optional parent separator) + "Produce code for regular expressions SUBFORMS. +SUBFORMS is a list of regular expression sexps. +PARENT controls grouping, as in `rx-form'. +Insert SEPARATOR between the code from each of SUBFORMS." + (if (null (cdr subforms)) + ;; Zero or one forms, no need for grouping. + (and subforms (rx-form (car subforms))) + (let ((listify (lambda (x) + (if (listp x) (copy-sequence x) + (list x))))) + (setq subforms (mapcar (lambda (x) (rx-form x parent)) subforms)) + (cond ((or (not rx--compile-to-lisp) + (cl-every #'stringp subforms)) + (mapconcat #'identity subforms separator)) + (separator + (nconc (funcall listify (car subforms)) + (mapcan (lambda (x) + (cons separator (funcall listify x))) + (cdr subforms)))) + (t (mapcan listify subforms)))))) + ;;;###autoload (defun rx-to-string (form &optional no-group) "Parse and produce code for regular expression FORM. FORM is a regular expression in sexp form. -NO-GROUP non-nil means don't put shy groups around the result." +NO-GROUP non-nil means don't put shy groups around the result. + +In contrast to the `rx' macro, subforms `literal' and `regexp' +will not accept non-string arguments, i.e., (literal STRING) +becomes just a more verbose version of STRING." (rx-group-if (rx-form form) (null no-group))) @@ -901,8 +960,12 @@ (defmacro rx (&rest regexps) REGEXPS is a non-empty sequence of forms of the sort listed below. Note that `rx' is a Lisp macro; when used in a Lisp program being -compiled, the translation is performed by the compiler. -See `rx-to-string' for how to do such a translation at run-time. +compiled, the translation is performed by the compiler. The +`literal' and `regexp' forms accept subforms that will evaluate +to strings, in addition to constant strings. If REGEXPS include +such forms, then the result is an expression which returns a +regexp string, rather than a regexp string directly. See +`rx-to-string' for performing translation completely at run-time. The following are valid subforms of regular expressions in sexp notation. @@ -1202,18 +1265,29 @@ (defmacro rx (&rest regexps) `(backref N)' matches what was matched previously by submatch N. +`(literal STRING-EXPR)' + matches STRING-EXPR literally, where STRING-EXPR is any lisp + expression that evaluates to a string. + +`(regexp REGEXP-EXPR)' + include REGEXP-EXPR in string notation in the result, where + REGEXP-EXPR is any lisp expression that evaluates a string + containing a valid regexp. + `(eval FORM)' evaluate FORM and insert result. If result is a string, - `regexp-quote' it. - -`(regexp REGEXP)' - include REGEXP in string notation in the result." - (cond ((null regexps) - (error "No regexp")) - ((cdr regexps) - (rx-to-string `(and ,@regexps) t)) - (t - (rx-to-string (car regexps) t)))) + `regexp-quote' it. Note that FORM is evaluated during + macroexpansion." + (let* ((rx--compile-to-lisp t) + (re (cond ((null regexps) + (error "No regexp")) + ((cdr regexps) + (rx-to-string `(and ,@regexps) t)) + (t + (rx-to-string (car regexps) t))))) + (if (stringp re) + re + `(concat ,@re)))) (pcase-defmacro rx (&rest regexps) @@ -1275,14 +1349,6 @@ (pcase-defmacro rx (&rest regexps) for var in vars collect `(app (match-string ,i) ,var))))) \f -;; ;; sregex.el replacement - -;; ;;;###autoload (provide 'sregex) -;; ;;;###autoload (autoload 'sregex "rx") -;; (defalias 'sregex 'rx-to-string) -;; ;;;###autoload (autoload 'sregexq "rx" nil nil 'macro) -;; (defalias 'sregexq 'rx) -\f (provide 'rx) ;;; rx.el ends here diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el index 6f392d616d..bab71b522b 100644 --- a/test/lisp/emacs-lisp/rx-tests.el +++ b/test/lisp/emacs-lisp/rx-tests.el @@ -115,5 +115,46 @@ (ert-deftest rx-seq () ;; Test zero-argument `seq'. (should (equal (rx (seq)) ""))) +(defmacro rx-tests--match (regexp string &optional match) + (macroexp-let2 nil strexp string + `(ert-info ((format "Matching %S to %S" ',regexp ,strexp)) + (should (string-match ,regexp ,strexp)) + ,@(when match + `((should (equal (match-string 0 ,strexp) ,match))))))) + +(ert-deftest rx-nonstring-expr () + (let ((bee "b") + (vowel "[aeiou]")) + (rx-tests--match (rx "a" (literal bee) "c") "abc") + (rx-tests--match (rx "a" (regexp bee) "c") "abc") + (rx-tests--match (rx "a" (or (regexp bee) "xy") "c") "abc") + (rx-tests--match (rx "a" (or "xy" (regexp bee)) "c") "abc") + (should-not (string-match (rx (or (regexp bee) "xy")) "")) + (rx-tests--match (rx "a" (= 3 (regexp bee)) "c") "abbbc") + (rx-tests--match (rx "x" (= 3 (regexp vowel)) "z") "xeoez") + (should-not (string-match (rx "x" (= 3 (regexp vowel)) "z") "xe[]z")) + (rx-tests--match (rx "x" (= 3 (literal vowel)) "z") + "x[aeiou][aeiou][aeiou]z") + (rx-tests--match (rx "x" (repeat 1 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (repeat 1 2 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (repeat 1 2 (regexp vowel)) "z") "xauz") + (rx-tests--match (rx "x" (>= 1 (regexp vowel)) "z") "xaiiz") + (rx-tests--match (rx "x" (** 1 2 (regexp vowel)) "z") "xaiz") + (rx-tests--match (rx "x" (group (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (group-n 1 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (? (regexp vowel)) "z") "xz"))) + +(ert-deftest rx-nonstring-expr-non-greedy () + "`rx's greediness can't affect runtime regexp parts." + (let ((ad-min "[ad]*?") + (ad-max "[ad]*") + (ad "[ad]")) + (rx-tests--match (rx "c" (regexp ad-min) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (regexp ad-max) "a") "cdaaada" "cdaaada") + (rx-tests--match (rx "c" (minimal-match (regexp ad-max)) "a") "cdaaada" "cdaaada") + (rx-tests--match (rx "c" (maximal-match (regexp ad-min)) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (minimal-match (0+ (regexp ad))) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (maximal-match (0+ (regexp ad))) "a") "cdaaada" "cdaaada"))) + (provide 'rx-tests) ;; rx-tests.el ends here. -- 2.11.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-22 22:05 ` Noam Postavsky @ 2019-06-23 11:09 ` Mattias Engdegård 2019-06-23 15:46 ` Noam Postavsky 2019-06-23 14:45 ` Drew Adams 1 sibling, 1 reply; 28+ messages in thread From: Mattias Engdegård @ 2019-06-23 11:09 UTC (permalink / raw) To: Noam Postavsky; +Cc: Michael Heerdegen, 36237, Stefan Monnier, kevin.legouguec 23 juni 2019 kl. 00.05 skrev Noam Postavsky <npostavs@gmail.com>: > > Yeah, that applies to most of the examples actually. Updated (and I > found a couple of mistakes in them). Very good, thank you! I double-checked them with xr and only found one error (see below). >> The paragraph on `eval' uses FORM, which is too generic > > No, it's not generic, see (info "(elisp) Intro Eval"): > > A Lisp object that is intended for evaluation is called a "form" or > "expression"(1). You are entirely correct, of course; what I meant is that the docs frequently use "form" for the `rx' whatchamacallits even though they aren't Lisp expressions. The terminology is a mess; use whatever you find understandable. >squash! Support (rx (and (regexp EXPR) (literal EXPR))) (Bug#36237) Remnants of rebase editing? ;; "[ \t\n]*:\\([^:]+\\|$\\)" -;; (rx (and (zero-or-more (in " \t\n")) ":" -;; (submatch (or line-end (one-or-more (not (any ?:))))))) +;; (rx (* (in " \t\n")) ":" +;; (submatch (or line-end (+ (not (in ?:)))))) The correct translation of the `or'-pattern is (or (+ (not (any ":"))) eol) since the order of the branches matters. Maybe it's the regexp string that should be the other way around; hard to tell without any context. ;; "^;;\\s-*\n\\|^\n" -;; (rx (or (and line-start ";;" (0+ space) ?\n) -;; (and line-start ?\n))) +;; (rx (or (seq line-start ";;" (0+ space) ?\n) +;; (seq line-start ?\n))) This should be correct. The regexp compiler translates `[[:space:]]` and `\s-` to different bytecodes, but as far as I can tell they end up having identical semantics in the end. Same goes for `[[:word:]]' vs `\sw' (alias `\w'), and so on. +`(literal STRING-EXPR)' + matches STRING-EXPR literally, where STRING-EXPR is any lisp + expression that evaluates to a string. + +`(regexp REGEXP-EXPR)' + include REGEXP-EXPR in string notation in the result, where + REGEXP-EXPR is any lisp expression that evaluates a string + containing a valid regexp. Missed "to" after "evaluate"? I'm happy with the patch after the obvious fixes. ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-23 11:09 ` Mattias Engdegård @ 2019-06-23 15:46 ` Noam Postavsky 2019-06-24 3:50 ` Stefan Monnier 2019-06-26 2:07 ` Noam Postavsky 0 siblings, 2 replies; 28+ messages in thread From: Noam Postavsky @ 2019-06-23 15:46 UTC (permalink / raw) To: Mattias Engdegård Cc: Michael Heerdegen, 36237, Stefan Monnier, kevin.legouguec [-- Attachment #1: Type: text/plain, Size: 1712 bytes --] Mattias Engdegård <mattiase@acm.org> writes: > > what I meant is that the docs frequently use "form" for the `rx' > whatchamacallits even though they aren't Lisp expressions. The > terminology is a mess; use whatever you find understandable. Well, the rest of the rx docstring uses SEXP for the `rx' whatchamacallits, so I think leaving it as (eval FORM) should be fine. And hopefully we'll be able to deprecate eval soon enough so it won't matter too much. >>squash! Support (rx (and (regexp EXPR) (literal EXPR))) (Bug#36237) > > Remnants of rebase editing? Oops, yes. I wish git would comment out the "squash!..." line automatically. > since the order of the branches matters. Maybe it's the regexp string > that should be the other way around; hard to tell without any context. Yeah, I switched the regexp string instead, on the grounds that otherwise "$" would almost never match (except at end of buffer) since [^:] already matches \n. > +`(regexp REGEXP-EXPR)' > + include REGEXP-EXPR in string notation in the result, where > + REGEXP-EXPR is any lisp expression that evaluates a string > + containing a valid regexp. > > Missed "to" after "evaluate"? Oops. > I'm happy with the patch after the obvious fixes. I'll wait a few more days in case something else comes up. Drew Adams <drew.adams@oracle.com> writes: > Minor: > > "at run-time" -> "at run time" > > Emacs docs seem to use "runtime" as > adjective and "run time" as noun, which > is fairly conventional. Sometimes, > outside Emacs, "runtime" is used for both. The elisp manual has a couple of "run-time" as well, but more cases of "run time" so I went with that. [-- Attachment #2: patch --] [-- Type: text/plain, Size: 19898 bytes --] From f0680d83907ada3d2c094625ee3446dfe696251a Mon Sep 17 00:00:00 2001 From: Noam Postavsky <npostavs@gmail.com> Date: Fri, 14 Jun 2019 08:43:17 -0400 Subject: [PATCH] Support (rx (and (regexp EXPR) (literal EXPR))) (Bug#36237) * lisp/emacs-lisp/rx.el (rx-regexp): Allow non-string forms. (rx-constituents): Add literal constituent, which is like a plain STRING form, but allows arbitrary lisp expressions. (rx-literal): New function. (rx-compile-to-lisp): New variable. (rx--subforms): New helper function for handling subforms, including non-constant case. (rx-group-if, rx-and, rx-or, rx-=, rx->=, rx-repeat, rx-submatch) (rx-submatch-n, rx-kleene, rx-atomic-p): Use it to handle non-constant subforms. (rx): Document new form, wrap non-constant forms with concat call. * test/lisp/emacs-lisp/rx-tests.el (rx-tests--match): New macro. (rx-nonstring-expr, rx-nonstring-expr-non-greedy): New tests. * etc/NEWS: Announce changes. --- etc/NEWS | 6 + lisp/emacs-lisp/rx.el | 242 +++++++++++++++++++++++++-------------- test/lisp/emacs-lisp/rx-tests.el | 41 +++++++ 3 files changed, 201 insertions(+), 88 deletions(-) diff --git a/etc/NEWS b/etc/NEWS index 0cfac248a3..4c31f48da2 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -1400,12 +1400,18 @@ when given in a string. Previously, '(any "\x80-\xff")' would match characters U+0080...U+00FF. Now the expression matches raw bytes in the 128...255 range, as expected. +--- *** The rx 'or' and 'seq' forms no longer require any arguments. (or) produces a regexp that never matches anything, while (seq) matches the empty string, each being an identity for the operation. This also works for their aliases: '|' for 'or'; ':', 'and' and 'sequence' for 'seq'. +--- +*** 'regexp' and new 'literal' accept arbitrary lisp as arguments. +In this case, 'rx' will generate code which produces a regexp string +at run time, instead of a constant string. + ** Frames +++ diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el index 8ef78fd69e..5951caf051 100644 --- a/lisp/emacs-lisp/rx.el +++ b/lisp/emacs-lisp/rx.el @@ -47,57 +47,58 @@ ;; Rx translates a sexp notation for regular expressions into the ;; usual string notation. The translation can be done at compile-time -;; by using the `rx' macro. It can be done at run-time by calling -;; function `rx-to-string'. See the documentation of `rx' for a -;; complete description of the sexp notation. +;; by using the `rx' macro. The `regexp' and `literal' forms accept +;; non-constant expressions, in which case `rx' will translate to a +;; `concat' expression. Translation can be done fully at run time by +;; calling function `rx-to-string'. See the documentation of `rx' for +;; a complete description of the sexp notation. ;; ;; Some examples of string regexps and their sexp counterparts: ;; ;; "^[a-z]*" -;; (rx (and line-start (0+ (in "a-z")))) +;; (rx line-start (0+ (in "a-z"))) ;; ;; "\n[^ \t]" -;; (rx (and "\n" (not (any " \t")))) +;; (rx ?\n (not (in " \t"))) ;; ;; "\\*\\*\\* EOOH \\*\\*\\*\n" ;; (rx "*** EOOH ***\n") ;; ;; "\\<\\(catch\\|finally\\)\\>[^_]" -;; (rx (and word-start (submatch (or "catch" "finally")) word-end -;; (not (any ?_)))) +;; (rx word-start (submatch (or "catch" "finally")) word-end +;; (not (in ?_))) ;; -;; "[ \t\n]*:\\([^:]+\\|$\\)" -;; (rx (and (zero-or-more (in " \t\n")) ":" -;; (submatch (or line-end (one-or-more (not (any ?:))))))) +;; "[ \t\n]*:\\($\\|[^:]+\\)" +;; (rx (* (in " \t\n")) ":" +;; (submatch (or line-end (+ (not (in ?:)))))) ;; -;; "^content-transfer-encoding:\\(\n?[\t ]\\)*quoted-printable\\(\n?[\t ]\\)*" -;; (rx (and line-start -;; "content-transfer-encoding:" -;; (+ (? ?\n)) (any " \t") -;; "quoted-printable" -;; (+ (? ?\n)) (any " \t")) +;; "^content-transfer-encoding:\\(?:\n?[\t ]\\)*quoted-printable\\(?:\n?[\t ]\\)*" +;; (rx line-start +;; "content-transfer-encoding:" +;; (* (? ?\n) (in " \t")) +;; "quoted-printable" +;; (* (? ?\n) (in " \t"))) ;; ;; (concat "^\\(?:" something-else "\\)") -;; (rx (and line-start (eval something-else))), statically or -;; (rx-to-string '(and line-start ,something-else)), dynamically. +;; (rx line-start (regexp something-else)) ;; ;; (regexp-opt '(STRING1 STRING2 ...)) ;; (rx (or STRING1 STRING2 ...)), or in other words, `or' automatically ;; calls `regexp-opt' as needed. ;; ;; "^;;\\s-*\n\\|^\n" -;; (rx (or (and line-start ";;" (0+ space) ?\n) -;; (and line-start ?\n))) +;; (rx (or (seq line-start ";;" (0+ space) ?\n) +;; (seq line-start ?\n))) ;; ;; "\\$[I]d: [^ ]+ \\([^ ]+\\) " -;; (rx (and "$Id: " -;; (1+ (not (in " "))) -;; " " -;; (submatch (1+ (not (in " ")))) -;; " ")) +;; (rx "$Id: " +;; (1+ (not (in " "))) +;; " " +;; (submatch (1+ (not (in " ")))) +;; " ") ;; ;; "\\\\\\\\\\[\\w+" -;; (rx (and ?\\ ?\\ ?\[ (1+ word))) +;; (rx "\\\\[" (1+ word)) ;; ;; etc. @@ -176,6 +177,7 @@ (defvar rx-constituents ;Not `const' because some modes extend it. (not-syntax . (rx-not-syntax 1 1)) ; sregex (category . (rx-category 1 1 rx-check-category)) (eval . (rx-eval 1 1)) + (literal . (rx-literal 1 1 stringp)) (regexp . (rx-regexp 1 1 stringp)) (regex . regexp) ; sregex (digit . "[[:digit:]]") @@ -302,6 +304,10 @@ (defvar rx-greedy-flag t "Non-nil means produce greedy regular expressions for `zero-or-one', `zero-or-more', and `one-or-more'. Dynamically bound.") +(defvar rx--compile-to-lisp nil + "Nil means return a regexp as a string. +Non-nil means we may return a lisp form which produces a +string (used for `rx' macro).") (defun rx-info (op head) "Return parsing/code generation info for OP. @@ -344,7 +350,7 @@ (defun rx-check (form) (> nargs max-args)) (error "rx form `%s' accepts at most %d args" (car form) max-args)) - (when (not (null type-pred)) + (when type-pred (dolist (sub-form (cdr form)) (unless (funcall type-pred sub-form) (error "rx form `%s' requires args satisfying `%s'" @@ -360,8 +366,9 @@ (defun rx-group-if (regexp group) ;; for concatenation ((eq group ':) (if (rx-atomic-p - (if (string-match - "\\(?:[?*+]\\??\\|\\\\{[0-9]*,?[0-9]*\\\\}\\)\\'" regexp) + (if (and (stringp regexp) + (string-match + "\\(?:[?*+]\\??\\|\\\\{[0-9]*,?[0-9]*\\\\}\\)\\'" regexp)) (substring regexp 0 (match-beginning 0)) regexp)) (setq group nil))) @@ -370,9 +377,10 @@ (defun rx-group-if (regexp group) ;; do anyway ((eq group t)) ((rx-atomic-p regexp t) (setq group nil))) - (if group - (concat "\\(?:" regexp "\\)") - regexp)) + (cond ((and group (stringp regexp)) + (concat "\\(?:" regexp "\\)")) + (group `("\\(?:" ,@regexp "\\)")) + (t regexp))) (defvar rx-parent) @@ -384,7 +392,7 @@ (defun rx-and (form) FORM is of the form `(and FORM1 ...)'." (rx-check form) (rx-group-if - (mapconcat (lambda (x) (rx-form x ':)) (cdr form) nil) + (rx--subforms (cdr form) ':) (and (memq rx-parent '(* t)) rx-parent))) @@ -396,7 +404,7 @@ (defun rx-or (form) ((null (cdr form)) regexp-unmatchable) ((cl-every #'stringp (cdr form)) (regexp-opt (cdr form) nil t)) - (t (mapconcat (lambda (x) (rx-form x '|)) (cdr form) "\\|"))) + (t (rx--subforms (cdr form) '| "\\|"))) (and (memq rx-parent '(: * t)) rx-parent))) @@ -669,7 +677,10 @@ (defun rx-= (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `=' requires positive integer first arg")) - (format "%s\\{%d\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d\\}" (nth 1 form)))))) (defun rx->= (form) @@ -679,7 +690,10 @@ (defun rx->= (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `>=' requires positive integer first arg")) - (format "%s\\{%d,\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d,\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d,\\}" (nth 1 form)))))) (defun rx-** (form) @@ -700,7 +714,10 @@ (defun rx-repeat (form) (unless (and (integerp (nth 1 form)) (> (nth 1 form) 0)) (error "rx `repeat' requires positive integer first arg")) - (format "%s\\{%d\\}" (rx-form (nth 2 form) '*) (nth 1 form))) + (let ((subform (rx-form (nth 2 form) '*))) + (if (stringp subform) + (format "%s\\{%d\\}" subform (nth 1 form)) + `(,@subform ,(format "\\{%d\\}" (nth 1 form)))))) ((or (not (integerp (nth 2 form))) (< (nth 2 form) 0) (not (integerp (nth 1 form))) @@ -708,30 +725,26 @@ (defun rx-repeat (form) (< (nth 2 form) (nth 1 form))) (error "rx `repeat' range error")) (t - (format "%s\\{%d,%d\\}" (rx-form (nth 3 form) '*) - (nth 1 form) (nth 2 form))))) + (let ((subform (rx-form (nth 3 form) '*))) + (if (stringp subform) + (format "%s\\{%d,%d\\}" subform (nth 1 form) (nth 2 form)) + `(,@subform ,(format "\\{%d,%d\\}" (nth 1 form) (nth 2 form)))))))) (defun rx-submatch (form) "Parse and produce code from FORM, which is `(submatch ...)'." - (concat "\\(" - (if (= 2 (length form)) - ;; Only one sub-form. - (rx-form (cadr form)) - ;; Several sub-forms implicitly concatenated. - (mapconcat (lambda (re) (rx-form re ':)) (cdr form) nil)) - "\\)")) + (let ((subforms (rx--subforms (cdr form) ':))) + (if (stringp subforms) + (concat "\\(" subforms "\\)") + `("\\(" ,@subforms "\\)")))) (defun rx-submatch-n (form) "Parse and produce code from FORM, which is `(submatch-n N ...)'." - (let ((n (nth 1 form))) - (concat "\\(?" (number-to-string n) ":" - (if (= 3 (length form)) - ;; Only one sub-form. - (rx-form (nth 2 form)) - ;; Several sub-forms implicitly concatenated. - (mapconcat (lambda (re) (rx-form re ':)) (cddr form) nil)) - "\\)"))) + (let ((n (nth 1 form)) + (subforms (rx--subforms (cddr form) ':))) + (if (stringp subforms) + (concat "\\(?" (number-to-string n) ":" subforms "\\)") + `("\\(?" ,(number-to-string n) ":" ,@subforms "\\)")))) (defun rx-backref (form) "Parse and produce code from FORM, which is `(backref N)'." @@ -759,9 +772,12 @@ (defun rx-kleene (form) (t "?"))) (op (cond ((memq (car form) '(* *? 0+ zero-or-more)) "*") ((memq (car form) '(+ +? 1+ one-or-more)) "+") - (t "?")))) + (t "?"))) + (subform (rx-form (cadr form) '*))) (rx-group-if - (concat (rx-form (cadr form) '*) op suffix) + (if (stringp subform) + (concat subform op suffix) + `(,@subform ,(concat op suffix))) (and (memq rx-parent '(t *)) rx-parent)))) @@ -789,15 +805,18 @@ (defun rx-atomic-p (r &optional lax) be detected without much effort. A guarantee of no false negatives would require a theoretic specification of the set of all atomic regexps." - (let ((l (length r))) - (cond - ((<= l 1)) - ((= l 2) (= (aref r 0) ?\\)) - ((= l 3) (string-match "\\`\\(?:\\\\[cCsS_]\\|\\[[^^]\\]\\)" r)) - ((null lax) + (if (and rx--compile-to-lisp + (not (stringp r))) + nil ;; Runtime value, we must assume non-atomic. + (let ((l (length r))) (cond - ((string-match "\\`\\[\\^?]?\\(?:\\[:[a-z]+:]\\|[^]]\\)*]\\'" r)) - ((string-match "\\`\\\\(\\(?:[^\\]\\|\\\\[^)]\\)*\\\\)\\'" r))))))) + ((<= l 1)) + ((= l 2) (= (aref r 0) ?\\)) + ((= l 3) (string-match "\\`\\(?:\\\\[cCsS_]\\|\\[[^^]\\]\\)" r)) + ((null lax) + (cond + ((string-match "\\`\\[\\^?]?\\(?:\\[:[a-z]+:]\\|[^]]\\)*]\\'" r)) + ((string-match "\\`\\\\(\\(?:[^\\]\\|\\\\[^)]\\)*\\\\)\\'" r)))))))) (defun rx-syntax (form) @@ -853,9 +872,23 @@ (defun rx-greedy (form) (defun rx-regexp (form) "Parse and produce code from FORM, which is `(regexp STRING)'." - (rx-check form) - (rx-group-if (cadr form) rx-parent)) - + (cond ((stringp form) + (rx-group-if (cadr form) rx-parent)) + (rx--compile-to-lisp + ;; Always group non-string forms, since we can't be sure they + ;; are atomic. + (rx-group-if (cdr form) t)) + (t (rx-check form)))) + +(defun rx-literal (form) + "Parse and produce code from FORM, which is `(literal STRING-EXP)'." + (cond ((stringp form) + ;; This is allowed, but makes little sense, you could just + ;; use STRING directly. + (rx-group-if (regexp-quote (cadr form)) rx-parent)) + (rx--compile-to-lisp + (rx-group-if `((regexp-quote ,(cadr form))) rx-parent)) + (t (rx-check form)))) (defun rx-form (form &optional parent) "Parse and produce code for regular expression FORM. @@ -886,12 +919,38 @@ (defun rx-form (form &optional parent) (t (error "rx syntax error at `%s'" form))))) +(defun rx--subforms (subforms &optional parent separator) + "Produce code for regular expressions SUBFORMS. +SUBFORMS is a list of regular expression sexps. +PARENT controls grouping, as in `rx-form'. +Insert SEPARATOR between the code from each of SUBFORMS." + (if (null (cdr subforms)) + ;; Zero or one forms, no need for grouping. + (and subforms (rx-form (car subforms))) + (let ((listify (lambda (x) + (if (listp x) (copy-sequence x) + (list x))))) + (setq subforms (mapcar (lambda (x) (rx-form x parent)) subforms)) + (cond ((or (not rx--compile-to-lisp) + (cl-every #'stringp subforms)) + (mapconcat #'identity subforms separator)) + (separator + (nconc (funcall listify (car subforms)) + (mapcan (lambda (x) + (cons separator (funcall listify x))) + (cdr subforms)))) + (t (mapcan listify subforms)))))) + ;;;###autoload (defun rx-to-string (form &optional no-group) "Parse and produce code for regular expression FORM. FORM is a regular expression in sexp form. -NO-GROUP non-nil means don't put shy groups around the result." +NO-GROUP non-nil means don't put shy groups around the result. + +In contrast to the `rx' macro, subforms `literal' and `regexp' +will not accept non-string arguments, i.e., (literal STRING) +becomes just a more verbose version of STRING." (rx-group-if (rx-form form) (null no-group))) @@ -901,8 +960,12 @@ (defmacro rx (&rest regexps) REGEXPS is a non-empty sequence of forms of the sort listed below. Note that `rx' is a Lisp macro; when used in a Lisp program being -compiled, the translation is performed by the compiler. -See `rx-to-string' for how to do such a translation at run-time. +compiled, the translation is performed by the compiler. The +`literal' and `regexp' forms accept subforms that will evaluate +to strings, in addition to constant strings. If REGEXPS include +such forms, then the result is an expression which returns a +regexp string, rather than a regexp string directly. See +`rx-to-string' for performing translation completely at run time. The following are valid subforms of regular expressions in sexp notation. @@ -1202,18 +1265,29 @@ (defmacro rx (&rest regexps) `(backref N)' matches what was matched previously by submatch N. +`(literal STRING-EXPR)' + matches STRING-EXPR literally, where STRING-EXPR is any lisp + expression that evaluates to a string. + +`(regexp REGEXP-EXPR)' + include REGEXP-EXPR in string notation in the result, where + REGEXP-EXPR is any lisp expression that evaluates to a + string containing a valid regexp. + `(eval FORM)' evaluate FORM and insert result. If result is a string, - `regexp-quote' it. - -`(regexp REGEXP)' - include REGEXP in string notation in the result." - (cond ((null regexps) - (error "No regexp")) - ((cdr regexps) - (rx-to-string `(and ,@regexps) t)) - (t - (rx-to-string (car regexps) t)))) + `regexp-quote' it. Note that FORM is evaluated during + macroexpansion." + (let* ((rx--compile-to-lisp t) + (re (cond ((null regexps) + (error "No regexp")) + ((cdr regexps) + (rx-to-string `(and ,@regexps) t)) + (t + (rx-to-string (car regexps) t))))) + (if (stringp re) + re + `(concat ,@re)))) (pcase-defmacro rx (&rest regexps) @@ -1275,14 +1349,6 @@ (pcase-defmacro rx (&rest regexps) for var in vars collect `(app (match-string ,i) ,var))))) \f -;; ;; sregex.el replacement - -;; ;;;###autoload (provide 'sregex) -;; ;;;###autoload (autoload 'sregex "rx") -;; (defalias 'sregex 'rx-to-string) -;; ;;;###autoload (autoload 'sregexq "rx" nil nil 'macro) -;; (defalias 'sregexq 'rx) -\f (provide 'rx) ;;; rx.el ends here diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el index 6f392d616d..bab71b522b 100644 --- a/test/lisp/emacs-lisp/rx-tests.el +++ b/test/lisp/emacs-lisp/rx-tests.el @@ -115,5 +115,46 @@ (ert-deftest rx-seq () ;; Test zero-argument `seq'. (should (equal (rx (seq)) ""))) +(defmacro rx-tests--match (regexp string &optional match) + (macroexp-let2 nil strexp string + `(ert-info ((format "Matching %S to %S" ',regexp ,strexp)) + (should (string-match ,regexp ,strexp)) + ,@(when match + `((should (equal (match-string 0 ,strexp) ,match))))))) + +(ert-deftest rx-nonstring-expr () + (let ((bee "b") + (vowel "[aeiou]")) + (rx-tests--match (rx "a" (literal bee) "c") "abc") + (rx-tests--match (rx "a" (regexp bee) "c") "abc") + (rx-tests--match (rx "a" (or (regexp bee) "xy") "c") "abc") + (rx-tests--match (rx "a" (or "xy" (regexp bee)) "c") "abc") + (should-not (string-match (rx (or (regexp bee) "xy")) "")) + (rx-tests--match (rx "a" (= 3 (regexp bee)) "c") "abbbc") + (rx-tests--match (rx "x" (= 3 (regexp vowel)) "z") "xeoez") + (should-not (string-match (rx "x" (= 3 (regexp vowel)) "z") "xe[]z")) + (rx-tests--match (rx "x" (= 3 (literal vowel)) "z") + "x[aeiou][aeiou][aeiou]z") + (rx-tests--match (rx "x" (repeat 1 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (repeat 1 2 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (repeat 1 2 (regexp vowel)) "z") "xauz") + (rx-tests--match (rx "x" (>= 1 (regexp vowel)) "z") "xaiiz") + (rx-tests--match (rx "x" (** 1 2 (regexp vowel)) "z") "xaiz") + (rx-tests--match (rx "x" (group (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (group-n 1 (regexp vowel)) "z") "xaz") + (rx-tests--match (rx "x" (? (regexp vowel)) "z") "xz"))) + +(ert-deftest rx-nonstring-expr-non-greedy () + "`rx's greediness can't affect runtime regexp parts." + (let ((ad-min "[ad]*?") + (ad-max "[ad]*") + (ad "[ad]")) + (rx-tests--match (rx "c" (regexp ad-min) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (regexp ad-max) "a") "cdaaada" "cdaaada") + (rx-tests--match (rx "c" (minimal-match (regexp ad-max)) "a") "cdaaada" "cdaaada") + (rx-tests--match (rx "c" (maximal-match (regexp ad-min)) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (minimal-match (0+ (regexp ad))) "a") "cdaaada" "cda") + (rx-tests--match (rx "c" (maximal-match (0+ (regexp ad))) "a") "cdaaada" "cdaaada"))) + (provide 'rx-tests) ;; rx-tests.el ends here. -- 2.11.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-23 15:46 ` Noam Postavsky @ 2019-06-24 3:50 ` Stefan Monnier 2019-06-24 10:52 ` Mattias Engdegård 2019-06-26 2:07 ` Noam Postavsky 1 sibling, 1 reply; 28+ messages in thread From: Stefan Monnier @ 2019-06-24 3:50 UTC (permalink / raw) To: Noam Postavsky Cc: Michael Heerdegen, Mattias Engdegård, 36237, kevin.legouguec > Well, the rest of the rx docstring uses SEXP for the `rx' > whatchamacallits, FWIW, I think `RX` would make sense for them. Stefan ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-24 3:50 ` Stefan Monnier @ 2019-06-24 10:52 ` Mattias Engdegård 0 siblings, 0 replies; 28+ messages in thread From: Mattias Engdegård @ 2019-06-24 10:52 UTC (permalink / raw) To: Stefan Monnier; +Cc: Michael Heerdegen, 36237, Noam Postavsky, kevin.legouguec 24 juni 2019 kl. 05.50 skrev Stefan Monnier <monnier@iro.umontreal.ca>: > >> Well, the rest of the rx docstring uses SEXP for the `rx' >> whatchamacallits, > > FWIW, I think `RX` would make sense for them. Agreed; ry uses RX in its doc strings. ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-23 15:46 ` Noam Postavsky 2019-06-24 3:50 ` Stefan Monnier @ 2019-06-26 2:07 ` Noam Postavsky 1 sibling, 0 replies; 28+ messages in thread From: Noam Postavsky @ 2019-06-26 2:07 UTC (permalink / raw) To: Mattias Engdegård Cc: Michael Heerdegen, 36237, Stefan Monnier, kevin.legouguec tags 36237 fixed close 36237 27.1 quit > I'll wait a few more days in case something else comes up. Pushed to master. b59ffd2290 2019-06-25T22:00:03-04:00 "Support (rx (and (regexp EXPR) (literal EXPR))) (Bug#36237)" https://git.savannah.gnu.org/cgit/emacs.git/commit/?id=b59ffd2290ff744ca4e7cc2748ba6b66fb2f99f1 ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-22 22:05 ` Noam Postavsky 2019-06-23 11:09 ` Mattias Engdegård @ 2019-06-23 14:45 ` Drew Adams 1 sibling, 0 replies; 28+ messages in thread From: Drew Adams @ 2019-06-23 14:45 UTC (permalink / raw) To: Noam Postavsky, Mattias Engdegård Cc: Michael Heerdegen, 36237, Stefan Monnier, kevin.legouguec Minor: "at run-time" -> "at run time" Emacs docs seem to use "runtime" as adjective and "run time" as noun, which is fairly conventional. Sometimes, outside Emacs, "runtime" is used for both. ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-15 23:43 bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) Noam Postavsky 2019-06-16 0:03 ` Michael Heerdegen 2019-06-16 10:03 ` Mattias Engdegård @ 2019-06-26 12:23 ` Andy Moreton 2019-06-26 12:56 ` Noam Postavsky 2019-06-26 13:08 ` Andy Moreton 3 siblings, 1 reply; 28+ messages in thread From: Andy Moreton @ 2019-06-26 12:23 UTC (permalink / raw) To: 36237 On Tue 25 Jun 2019, Noam Postavsky wrote: > tags 36237 fixed > close 36237 27.1 > quit > >> I'll wait a few more days in case something else comes up. > > Pushed to master. > > b59ffd2290 2019-06-25T22:00:03-04:00 "Support (rx (and (regexp EXPR) (literal EXPR))) (Bug#36237)" > https://git.savannah.gnu.org/cgit/emacs.git/commit/?id=b59ffd2290ff744ca4e7cc2748ba6b66fb2f99f1 With "emacs -Q" from emacs-26: (setq foo (rx (group (or "abc" "def")))) => "\\(\\(?:abc\\|def\\)\\)" (setq bar (rx-to-string `(regexp ,foo))) => "\\(?:\\(\\(?:abc\\|def\\)\\)\\)" (setq bar (rx (regexp foo))) => error "rx form ‘regexp’ requires args satisfying ‘stringp’" With "emacs -Q from master: (setq foo (rx (group (or "abc" "def")))) => "\\(\\(?:abc\\|def\\)\\)" (setq bar (rx-to-string `(regexp ,foo))) => ("\\(?:" "\\)") ; unexpected result (setq bar (rx (regexp foo))) => "\\(?:\\(\\(?:abc\\|def\\)\\)\\)" Please fix this regression. AndyM ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-26 12:23 ` Andy Moreton @ 2019-06-26 12:56 ` Noam Postavsky 0 siblings, 0 replies; 28+ messages in thread From: Noam Postavsky @ 2019-06-26 12:56 UTC (permalink / raw) To: Andy Moreton; +Cc: 36237 Andy Moreton <andrewjmoreton@gmail.com> writes: > (setq bar (rx-to-string `(regexp ,foo))) > => ("\\(?:" "\\)") ; unexpected result Oops, should be fixed now. 9233865b70 2019-06-26T08:50:27-04:00 "Fix (rx-to-string (and (literal STR) (regexp STR)) regression" https://git.savannah.gnu.org/cgit/emacs.git/commit/?id=9233865b7005831e63755eb84ae7da060f878a55 ^ permalink raw reply [flat|nested] 28+ messages in thread
* bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) 2019-06-15 23:43 bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) Noam Postavsky ` (2 preceding siblings ...) 2019-06-26 12:23 ` Andy Moreton @ 2019-06-26 13:08 ` Andy Moreton 3 siblings, 0 replies; 28+ messages in thread From: Andy Moreton @ 2019-06-26 13:08 UTC (permalink / raw) To: 36237 On Wed 26 Jun 2019, Noam Postavsky wrote: > Andy Moreton <andrewjmoreton@gmail.com> writes: > >> (setq bar (rx-to-string `(regexp ,foo))) >> => ("\\(?:" "\\)") ; unexpected result > > Oops, should be fixed now. > > 9233865b70 2019-06-26T08:50:27-04:00 "Fix (rx-to-string (and (literal STR) (regexp STR)) regression" > https://git.savannah.gnu.org/cgit/emacs.git/commit/?id=9233865b7005831e63755eb84ae7da060f878a55 Thanks Noam, that's all back to normal again. AndyM ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2019-06-26 13:08 UTC | newest] Thread overview: 28+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2019-06-15 23:43 bug#36237: Support (rx (and (regexp EXPR) (regexp-quote EXPR))) Noam Postavsky 2019-06-16 0:03 ` Michael Heerdegen 2019-06-16 0:28 ` Noam Postavsky 2019-06-16 10:03 ` Mattias Engdegård 2019-06-16 11:34 ` Stefan Monnier 2019-06-16 12:25 ` Noam Postavsky 2019-06-16 12:35 ` Stefan Monnier 2019-06-16 19:50 ` Noam Postavsky 2019-06-16 20:04 ` Stefan Monnier 2019-06-16 20:25 ` Drew Adams 2019-06-16 20:34 ` Mattias Engdegård 2019-06-16 21:09 ` Drew Adams 2019-06-17 20:57 ` Juri Linkov 2019-06-18 19:45 ` Mattias Engdegård 2019-06-19 1:34 ` Noam Postavsky 2019-06-19 15:42 ` Mattias Engdegård 2019-06-20 0:29 ` Noam Postavsky 2019-06-20 10:26 ` Mattias Engdegård 2019-06-22 22:05 ` Noam Postavsky 2019-06-23 11:09 ` Mattias Engdegård 2019-06-23 15:46 ` Noam Postavsky 2019-06-24 3:50 ` Stefan Monnier 2019-06-24 10:52 ` Mattias Engdegård 2019-06-26 2:07 ` Noam Postavsky 2019-06-23 14:45 ` Drew Adams 2019-06-26 12:23 ` Andy Moreton 2019-06-26 12:56 ` Noam Postavsky 2019-06-26 13:08 ` Andy Moreton
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).