unofficial mirror of bug-gnu-emacs@gnu.org 
From: "Mattias Engdegård" <mattiase@acm.org>
To: Paul Eggert <eggert@cs.ucla.edu>
Cc: 37659@debbugs.gnu.org
Subject: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Thu, 24 Oct 2019 10:58:43 +0200
Message-ID: <6B3E322E-6058-4D8B-A73C-07847411AE1D@acm.org>
In-Reply-To: <9016eb3d-7d58-5950-862a-13db4c7ff32b@cs.ucla.edu>

[-- Attachment #1: Type: text/plain, Size: 2289 bytes --]

On 24 Oct 2019, at 01:14, Paul Eggert <eggert@cs.ucla.edu> wrote:
> 
>> how do we make it easy to match one of multiple strings --- keywords, say --- in rx?
> 
> If that's the real problem, perhaps the name should be "or-tokens" or something like that, to help remind the reader of the limitations of the proposed operator: it's meant only for greedy tokenization and it isn't suited for regular expressions in general. A problem with the name "or-max" is that it implies a more-general functionality than the implementation really has.

Perhaps 'or-strings', then, since nothing really restricts it to 'tokens' (a somewhat hazardous term, given that regexps are commonly used for tokenising). In particular, there is no delimiting: (or-max "IN" "OUT") will match the first part of "INSPECT", which may be unexpected from something ostensibly matching tokens.
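
To make the delimiting point concrete, here is a small sketch. It uses regexp-opt directly, since that is what the attached patch translates 'or-max' to; the helper my-first-match is hypothetical and only there for illustration.

```elisp
;; Hypothetical helper, not part of rx or of the attached patch.
(defun my-first-match (regexp string)
  "Return the text of the first match of REGEXP in STRING, or nil."
  (let ((case-fold-search nil))
    (when (string-match regexp string)
      (match-string 0 string))))

;; No delimiting: the alternation happily matches a prefix of a longer word.
(my-first-match (regexp-opt '("IN" "OUT")) "INSPECT")
;; ⇒ "IN"

;; Asking regexp-opt for word delimiters rejects the partial match.
(my-first-match (regexp-opt '("IN" "OUT") 'words) "INSPECT")
;; ⇒ nil

(my-first-match (regexp-opt '("IN" "OUT") 'words) "IN OUT")
;; ⇒ "IN"
```

Anyone wanting token-like behaviour would thus still have to add the boundaries themselves; the operator only chooses among the strings.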

On the other hand, 'or-strings' sort of precludes a future relaxation of the argument restriction.

> What happens if you apply or-tokens to arguments that aren't strings or other or-tokens? Does rx diagnose this? I hope it does.

Yes, of course. Working patch attached (it still uses the name 'or-max').

'or-max' isn't a vital addition; it just seemed to fill a gap, after experience with traditional regexp usage. It clearly shouldn't be added on a whim. I wanted to get it in place for 27.1, but rushing to make a release has rarely resulted in good design.

> I was thinking of something more-compatible: we could say that \| is left-to-right (for users who need compatibility with regexp "|"), and that 'or' is not necessarily left-to-right (to make room for future extensions that make 'or' greedy, or more efficient, or both).

Sorry, by '\|' I meant the string regexp operator; I take it you propose separate semantics for the rx '|' and 'or' operators? Maybe we should worry about that if we ever get near the point of replacing the engine. There are other concerns, such as how capture groups are set (even if two branches match equally long texts).

I honestly don't think much would break if '\|' (in string regexps) became greedy overnight, but there is plenty of room to confuse the user if we introduce subtle distinctions between what have hitherto been perceived as synonyms.
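
For reference, the difference under discussion can be seen with plain string regexps. This is only a sketch: my-first-match is a hypothetical helper, and regexp-opt stands in for the proposed 'or-max' because the attached patch translates to it.

```elisp
;; Hypothetical helper, not part of rx or of the attached patch.
(defun my-first-match (regexp string)
  "Return the text of the first match of REGEXP in STRING, or nil."
  (let ((case-fold-search nil))
    (when (string-match regexp string)
      (match-string 0 string))))

;; '\|' tries its branches left to right and takes the first that succeeds.
(my-first-match "ab\\|abc" "abc")
;; ⇒ "ab"

;; Swapping the branches changes the result.
(my-first-match "abc\\|ab" "abc")
;; ⇒ "abc"

;; regexp-opt produces a regexp that matches the longest of the strings,
;; whatever order they were given in.
(my-first-match (regexp-opt '("ab" "abc")) "abc")
;; ⇒ "abc"
```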


[-- Attachment #2: 0003-Add-the-rx-or-max-operator.patch --]
[-- Type: application/octet-stream, Size: 4638 bytes --]

From b6e1900e64803dc28e915f38febd9a9389acb697 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Wed, 23 Oct 2019 12:47:53 +0200
Subject: [PATCH 3/3] Add the rx 'or-max' operator

* doc/lispref/searching.texi (Rx Constructs):
* lisp/emacs-lisp/rx.el (rx--or-max-strings, rx--translate-or-max)
(rx--translate-form, rx--builtin-forms, rx):
* test/lisp/emacs-lisp/rx-tests.el (rx-or-max-def):
Add 'or-max'.
---
 doc/lispref/searching.texi       |  7 +++++++
 lisp/emacs-lisp/rx.el            | 22 +++++++++++++++++++++-
 test/lisp/emacs-lisp/rx-tests.el | 13 +++++++++++++
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 5178575a3b..3feaebc16d 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -1084,6 +1084,13 @@ Rx Constructs
 Without arguments, the expression will not match anything at all.@*
 Corresponding string regexp: @samp{@var{A}\|@var{B}\|@dots{}}.
 
+@item @code{(or-max @var{rx}@dots{})}
+@cindex @code{or-max} in rx
+Like @code{or}, but always favours the longest possible match.  The
+@var{rx}s must be strings or @code{or-max} forms.  The resulting
+expression may be more efficient in matching than the corresponding
+@code{or} form.
+
 @item @code{unmatchable}
 @cindex @code{unmatchable} in rx
 Refuse any match.  Equivalent to @code{(or)}.
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index d7677f1444..9afa17c617 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -293,6 +293,24 @@ rx--translate-or
                           (cdr body)))
           nil))))
 
+(defun rx--or-max-strings (args)
+  "List of string arguments in an 'or-max' construct."
+  (mapcan (lambda (item)
+            (cond
+             ;; FIXME: Allow single characters as well?
+             ((stringp item) (list item))
+             ((and (consp item) (eq (car item) 'or-max))
+              (rx--or-max-strings (cdr item)))
+             ((let ((expanded (rx--expand-def item)))
+                (and expanded
+                     (rx--or-max-strings (list expanded)))))
+             (t (error "Illegal `or-max' argument: %S" item))))
+          args))
+
+(defun rx--translate-or-max (body)
+  "Translate (or-max BODY...).  Return (REGEXP . PRECEDENCE)."
+  (cons (list (regexp-opt (rx--or-max-strings body))) t))
+
 (defun rx--string-to-intervals (str)
   "Decode STR as intervals: A-Z becomes (?A . ?Z), and the single
 character X becomes (?X . ?X).  Return the intervals in a list."
@@ -854,6 +872,7 @@ rx--translate-form
     (pcase (car form)
       ((or 'seq : 'and 'sequence) (rx--translate-seq body))
       ((or 'or '|)              (rx--translate-or body))
+      ('or-max                  (rx--translate-or-max body))
       ((or 'any 'in 'char)      (rx--translate-any nil body))
       ('not-char                (rx--translate-any t body))
       ('not                     (rx--translate-not nil body))
@@ -915,7 +934,7 @@ rx--translate-form
         (t (error "Unknown rx form `%s'" op)))))))
 
 (defconst rx--builtin-forms
-  '(seq sequence : and or | any in char not-char not
+  '(seq sequence : and or | or-max any in char not-char not
     repeat = >= **
     zero-or-more 0+ *
     one-or-more 1+ +
@@ -1006,6 +1025,7 @@ rx
 
 (seq RX...)    Match the RXs in sequence.  Alias: :, sequence, and.
 (or RX...)     Match one of the RXs.  Alias: |.
+(or-max RX...) Match one of the RXs (strings and or-max only), longest match.
 
 (zero-or-more RX...) Match RXs zero or more times.  Alias: 0+.
 (one-or-more RX...)  Match RXs one or more times.  Alias: 1+.
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index ef2541d83a..f60932e670 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -49,6 +49,19 @@ rx-or
   (should (equal (rx (|))
                  "\\`a\\`")))
 
+(ert-deftest rx-or-max ()
+  (should (equal (rx (or-max "ab" "abc"))
+                 "\\(?:abc?\\)"))
+  (should (equal (rx (or-max (or-max "a" "xy") (or-max "ab" "abcd")))
+                 "\\(?:a\\(?:b\\(?:cd\\)?\\)?\\|xy\\)")))
+
+(ert-deftest rx-or-max-def ()
+  (rx-let ((a (or-max "a" "xy"))
+           (b a)
+           (c (or-max "ab" "abcd")))
+    (should (equal (rx (or-max c b))
+                   "\\(?:a\\(?:b\\(?:cd\\)?\\)?\\|xy\\)"))))
+
 (ert-deftest rx-char-any ()
   "Test character alternatives with `]' and `-' (Bug#25123)."
   (should (equal
-- 
2.21.0 (Apple Git-122)
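
For readers who want to try the idea without applying the patch, the following is a standalone sketch of the same flatten-then-regexp-opt approach. The names my-or-max-strings and my-or-max-regexp are hypothetical, and unlike the patch this version does not expand rx-let/rx-define definitions, because it avoids rx internals altogether.

```elisp
;; Standalone sketch; both function names are hypothetical.
(defun my-or-max-strings (args)
  "Return the strings in ARGS, flattening nested (or-max ...) forms."
  (mapcan (lambda (item)
            (cond
             ((stringp item) (list item))
             ((and (consp item) (eq (car item) 'or-max))
              (my-or-max-strings (cdr item)))
             (t (error "Invalid `or-max' argument: %S" item))))
          args))

(defun my-or-max-regexp (form)
  "Return a longest-match regexp for FORM, a list (or-max ARG...)."
  (regexp-opt (my-or-max-strings (cdr form))))

(my-or-max-strings '((or-max "a" "xy") (or-max "ab" "abcd")))
;; ⇒ ("a" "xy" "ab" "abcd")

(let ((re (my-or-max-regexp '(or-max (or-max "a" "xy") (or-max "ab" "abcd")))))
  (and (string-match re "abcd")
       (match-string 0 "abcd")))
;; ⇒ "abcd"
```

Anything other than a string or a nested or-max form signals an error, mirroring the argument check discussed above.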

