unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed
* Forbid reverse ranges in rx
@ 2019-03-12 14:02 Mattias Engdegård
  2019-03-12 14:27 ` Clément Pit-Claudel
  2019-03-12 16:12 ` Eli Zaretskii
  0 siblings, 2 replies; 8+ messages in thread
From: Mattias Engdegård @ 2019-03-12 14:02 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 385 bytes --]

Given the trouble that reversed character ranges like a-Z cause in
standard regexps, there is no reason they should be allowed in rx.

Right now, (rx (any "0-9z-a")) becomes "[0-9]" and (rx (any (?9 . ?0)))
becomes "[9-0]". Such expressions are always a mistake; we do users a
disservice by allowing them.

The attached patch adds error checks, tests, documentation and a NEWS
entry.


[-- Attachment #2: 0001-Disallow-reversed-char-ranges-in-rx.patch --]
[-- Type: text/x-patch, Size: 3670 bytes --]

From 76fdafcdc84d7fd4d7f66420e1284387dc1537c0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 12 Mar 2019 14:39:47 +0100
Subject: [PATCH] Disallow reversed char ranges in `rx'

(any "a-Z0-9") generated "[0-9]", and (any (?9 . ?0)) generated "[9-0]".
Reversed ranges are either mistakes or abuse.  Neither should be allowed.

etc/NEWS: Explain the change.
lisp/emacs-lisp/rx.el (rx): Document.
(rx-check-any-string, rx-check-any): Add error checks for reversed ranges.
test/lisp/emacs-lisp/rx-tests.el (rx-char-any-range-bad): New test.
---
 etc/NEWS                         |  7 +++++++
 lisp/emacs-lisp/rx.el            | 11 +++++++++--
 test/lisp/emacs-lisp/rx-tests.el |  4 ++++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/etc/NEWS b/etc/NEWS
index 410c1821ae..eb497aba83 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1177,6 +1177,13 @@ when given in a string.  Previously, '(any "\x80-\xff")' would match
 characters U+0080...U+00FF.  Now the expression matches raw bytes in
 the 128...255 range, as expected.
 
++++
+*** Reversed character ranges are no longer permitted.
+Previously, ranges where the starting character is greater than the
+ending character were silently omitted.
+For example, '(rx (any "@z-a" (?9 . ?0)))' would match '@' only.
+Now, such rx expressions generate an error.
+
 ** Frames
 
 +++
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index f6deb45d44..0533321423 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -482,7 +482,10 @@ rx-check-any-string
              (let ((start (funcall decode-char (aref str i)))
                    (end   (funcall decode-char (aref str (+ i 2)))))
                (cond ((< start end) (push (cons start end) ret))
-                     ((= start end) (push start ret)))
+                     ((= start end) (push start ret))
+                     (t
+                      (error "Reverse range `%c-%c' in Rx `any' not permitted"
+                             start end)))
                (setq i (+ i 3))))
             (t
              ;; Single character.
@@ -503,7 +506,10 @@ rx-check-any
 	       (null (string-match "\\`\\[\\[:[-a-z]+:\\]\\]\\'" translation)))
 	   (error "Invalid char class `%s' in Rx `any'" arg))
        (list (substring translation 1 -1)))) ; strip outer brackets
-    ((and (integerp (car-safe arg)) (integerp (cdr-safe arg)))
+    ((and (characterp (car-safe arg)) (characterp (cdr-safe arg)))
+     (unless (<= (car arg) (cdr arg))
+       (error "Reverse range `%c-%c' in Rx `any' not permitted"
+              (car arg) (cdr arg)))
      (list arg))
     ((stringp arg) (rx-check-any-string arg))
     ((error
@@ -916,6 +922,7 @@ rx
      matches any character in SET ....  SET may be a character or string.
      Ranges of characters can be specified as `A-Z' in strings.
      Ranges may also be specified as conses like `(?A . ?Z)'.
+     Reverse ranges such as `Z-A' or `(?Z . ?A)' are not permitted.
 
      SET may also be the name of a character class: `digit',
      `control', `hex-digit', `blank', `graph', `print', `alnum',
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index fa3d9b0d5e..7410012b1f 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -40,6 +40,10 @@
   (should (equal (rx (any "\a-\n"))
                  "[\a-\n]")))
 
+(ert-deftest rx-char-any-range-bad ()
+  (should-error (rx (any "0-9a-Z")))
+  (should-error (rx (any (?0 . ?9) (?a . ?Z)))))
+
 (ert-deftest rx-char-any-raw-byte ()
   "Test raw bytes in character alternatives."
   ;; Separate raw characters.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: Forbid reverse ranges in rx
  2019-03-12 14:02 Forbid reverse ranges in rx Mattias Engdegård
@ 2019-03-12 14:27 ` Clément Pit-Claudel
  2019-03-12 15:26   ` Mattias Engdegård
  2019-03-12 16:12 ` Eli Zaretskii
  1 sibling, 1 reply; 8+ messages in thread
From: Clément Pit-Claudel @ 2019-03-12 14:27 UTC (permalink / raw)
  To: emacs-devel

This is a neat patch.  Thanks for all your efforts.

I worry that maybe we should make this a warning, rather than an error, since there are lots of poorly supported but decently working Emacs modes out there.  This change will make some of these (subtly broken) modes unusable (we've seen that e.g. the change to comint caused some breakage, so I think we should be especially careful). I'd love to see all these errors detected by xr as warnings in the byte-compiler.

Also, if we make this an error in rx, should we also make it an error in regular regexps?

Clément.

On 12/03/2019 10.02, Mattias Engdegård wrote:
> +                     (t
> +                      (error "Reverse range `%c-%c' in Rx `any' not permitted"
> +                             start end)))

Consider rephrasing this to explain why it's not permitted (maybe "… no permitted, as it does not match anything").

>                 (setq i (+ i 3))))
>              (t
>               ;; Single character.
> @@ -503,7 +506,10 @@ rx-check-any
>  	       (null (string-match "\\`\\[\\[:[-a-z]+:\\]\\]\\'" translation)))
>  	   (error "Invalid char class `%s' in Rx `any'" arg))
>         (list (substring translation 1 -1)))) ; strip outer brackets
> -    ((and (integerp (car-safe arg)) (integerp (cdr-safe arg)))
> +    ((and (characterp (car-safe arg)) (characterp (cdr-safe arg)))
> +     (unless (<= (car arg) (cdr arg))
> +       (error "Reverse range `%c-%c' in Rx `any' not permitted"
> +              (car arg) (cdr arg)))

Same here?



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Forbid reverse ranges in rx
  2019-03-12 14:27 ` Clément Pit-Claudel
@ 2019-03-12 15:26   ` Mattias Engdegård
  2019-03-12 16:56     ` Stefan Monnier
  0 siblings, 1 reply; 8+ messages in thread
From: Mattias Engdegård @ 2019-03-12 15:26 UTC (permalink / raw)
  To: Clément Pit-Claudel, emacs-devel

tis 2019-03-12 klockan 10:27 -0400 skrev Clément Pit-Claudel:
> I worry that maybe we should make this a warning, rather than an
> error, since there are lots of poorly supported but decently working
> Emacs modes out there.  This change will make some of these (subtly
> broken) modes unusable (we've seen that e.g. the change to comint
> caused some breakage, so I think we should be especially careful). 

The risk of breaking anything should be substantially lower for rx,
since reverse ranges are less likely to occur by accident than in
string regexps, and because rx is just much less frequently used.

> I'd love to see all these errors detected by xr as warnings in the
> byte-compiler.

I don't want to make a separate linter for rx; it should protect its
own abstractions. That's another reason for the change: we don't want
to end up with lots of ill-defined syntax and semantics that cannot be
changed for legacy reasons.

For that matter, the latest version of trawl does evaluate most rx
forms that are used as regexps, so rx errors there stand a good chance
of being detected.

> Also, if we make this an error in rx, should we also make it an error
> in regular regexps?

I would like to, but as you say there is lots of code that work well
enough despite little mistakes. There is also wilful abuse: auctex uses
[^z-a] as "any character".

> > +                      (error "Reverse range `%c-%c' in Rx `any'
> > not permitted"
> > +                             start end)))
> 
> Consider rephrasing this to explain why it's not permitted (maybe "…
> no permitted, as it does not match anything").

Do we really need to mention the legacy semantics, which were
accidental and arbitrary? "Z-A" is disallowed because it doesn't make
sense, not because of what happened before we thought of checking.





^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Forbid reverse ranges in rx
  2019-03-12 14:02 Forbid reverse ranges in rx Mattias Engdegård
  2019-03-12 14:27 ` Clément Pit-Claudel
@ 2019-03-12 16:12 ` Eli Zaretskii
  1 sibling, 0 replies; 8+ messages in thread
From: Eli Zaretskii @ 2019-03-12 16:12 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: emacs-devel

> Feedback-ID: mattiase@acm.or
> From: Mattias Engdegård <mattiase@acm.org>
> Date: Tue, 12 Mar 2019 15:02:13 +0100
> 
> The attached patch adds error checks, tests, documentation and a NEWS
> entry.

Thanks.  The NEWS entry should go into the "Incompatible changes"
section, though.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Forbid reverse ranges in rx
  2019-03-12 15:26   ` Mattias Engdegård
@ 2019-03-12 16:56     ` Stefan Monnier
  2019-03-12 21:01       ` Mattias Engdegård
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Monnier @ 2019-03-12 16:56 UTC (permalink / raw)
  To: emacs-devel

>> > +                      (error "Reverse range `%c-%c' in Rx `any'
>> > not permitted"
>> > +                             start end)))
>> 
>> Consider rephrasing this to explain why it's not permitted (maybe "…
>> no permitted, as it does not match anything").
>
> Do we really need to mention the legacy semantics, which were
> accidental and arbitrary? "Z-A" is disallowed because it doesn't make
> sense, not because of what happened before we thought of checking.

Indeed, I don't think we need to explain in detail, but I think it's
likely that the naive user will simply not understand "reverse
range".

Maybe instead of saying it's not permitted, we should say "range %c-%c
is reversed"?


        Stefan




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Forbid reverse ranges in rx
  2019-03-12 16:56     ` Stefan Monnier
@ 2019-03-12 21:01       ` Mattias Engdegård
  2019-03-19 15:02         ` Mattias Engdegård
  0 siblings, 1 reply; 8+ messages in thread
From: Mattias Engdegård @ 2019-03-12 21:01 UTC (permalink / raw)
  To: Stefan Monnier, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 407 bytes --]

tis 2019-03-12 klockan 12:56 -0400 skrev Stefan Monnier:

> Indeed, I don't think we need to explain in detail, but I think it's
> likely that the naive user will simply not understand "reverse
> range".
> 
> Maybe instead of saying it's not permitted, we should say "range %c-
> %c
> is reversed"?

Right. I've followed your suggestion, and per Eli's advice moved the
NEWS item to 'Incompatible Changes'.


[-- Attachment #2: 0001-Disallow-reversed-char-ranges-in-rx.patch --]
[-- Type: text/x-patch, Size: 3646 bytes --]

From 8f85f3b5ce1ea9241e857c6b25860bb07fe3de11 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Tue, 12 Mar 2019 14:39:47 +0100
Subject: [PATCH] Disallow reversed char ranges in `rx'

(any "a-Z0-9") generated "[0-9]", and (any (?9 . ?0)) generated "[9-0]".
Reversed ranges are either mistakes or abuse.  Neither should be allowed.

etc/NEWS: Explain the change.
lisp/emacs-lisp/rx.el (rx): Document.
(rx-check-any-string, rx-check-any): Add error checks for reversed ranges.
test/lisp/emacs-lisp/rx-tests.el (rx-char-any-range-bad): New test.
---
 etc/NEWS                         |  7 +++++++
 lisp/emacs-lisp/rx.el            | 11 +++++++++--
 test/lisp/emacs-lisp/rx-tests.el |  4 ++++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/etc/NEWS b/etc/NEWS
index 410c1821ae..071854cf86 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1318,6 +1318,13 @@ they are now allocated like any other pseudovector.  As a result, the
 'misc' component, and the 'misc-objects-consed' variable has been
 removed.
 
++++
+** Reversed character ranges are no longer permitted in rx.
+Previously, ranges where the starting character is greater than the
+ending character were silently omitted.
+For example, '(rx (any "@z-a" (?9 . ?0)))' would match '@' only.
+Now, such rx expressions generate an error.
+
 \f
 * Lisp Changes in Emacs 27.1
 
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index f6deb45d44..fdd24317c6 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -482,7 +482,10 @@ rx-check-any-string
              (let ((start (funcall decode-char (aref str i)))
                    (end   (funcall decode-char (aref str (+ i 2)))))
                (cond ((< start end) (push (cons start end) ret))
-                     ((= start end) (push start ret)))
+                     ((= start end) (push start ret))
+                     (t
+                      (error "Rx character range `%c-%c' is reversed"
+                             start end)))
                (setq i (+ i 3))))
             (t
              ;; Single character.
@@ -503,7 +506,10 @@ rx-check-any
 	       (null (string-match "\\`\\[\\[:[-a-z]+:\\]\\]\\'" translation)))
 	   (error "Invalid char class `%s' in Rx `any'" arg))
        (list (substring translation 1 -1)))) ; strip outer brackets
-    ((and (integerp (car-safe arg)) (integerp (cdr-safe arg)))
+    ((and (characterp (car-safe arg)) (characterp (cdr-safe arg)))
+     (unless (<= (car arg) (cdr arg))
+       (error "Rx character range `%c-%c' is reversed"
+              (car arg) (cdr arg)))
      (list arg))
     ((stringp arg) (rx-check-any-string arg))
     ((error
@@ -916,6 +922,7 @@ rx
      matches any character in SET ....  SET may be a character or string.
      Ranges of characters can be specified as `A-Z' in strings.
      Ranges may also be specified as conses like `(?A . ?Z)'.
+     Reversed ranges like `Z-A' and `(?Z . ?A)' are not permitted.
 
      SET may also be the name of a character class: `digit',
      `control', `hex-digit', `blank', `graph', `print', `alnum',
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index fa3d9b0d5e..7410012b1f 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -40,6 +40,10 @@
   (should (equal (rx (any "\a-\n"))
                  "[\a-\n]")))
 
+(ert-deftest rx-char-any-range-bad ()
+  (should-error (rx (any "0-9a-Z")))
+  (should-error (rx (any (?0 . ?9) (?a . ?Z)))))
+
 (ert-deftest rx-char-any-raw-byte ()
   "Test raw bytes in character alternatives."
   ;; Separate raw characters.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: Forbid reverse ranges in rx
  2019-03-12 21:01       ` Mattias Engdegård
@ 2019-03-19 15:02         ` Mattias Engdegård
  2019-03-19 16:17           ` Stefan Monnier
  0 siblings, 1 reply; 8+ messages in thread
From: Mattias Engdegård @ 2019-03-19 15:02 UTC (permalink / raw)
  To: Stefan Monnier, emacs-devel

12 mars 2019 kl. 22.01 skrev Mattias Engdegård <mattiase@acm.org>:
> 
> tis 2019-03-12 klockan 12:56 -0400 skrev Stefan Monnier:
> 
>> Indeed, I don't think we need to explain in detail, but I think it's
>> likely that the naive user will simply not understand "reverse
>> range".
>> 
>> Maybe instead of saying it's not permitted, we should say "range %c-
>> %c
>> is reversed"?
> 
> Right. I've followed your suggestion, and per Eli's advice moved the
> NEWS item to 'Incompatible Changes'.

Since there have been no more comments on the patch, can it be applied?




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Forbid reverse ranges in rx
  2019-03-19 15:02         ` Mattias Engdegård
@ 2019-03-19 16:17           ` Stefan Monnier
  0 siblings, 0 replies; 8+ messages in thread
From: Stefan Monnier @ 2019-03-19 16:17 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: emacs-devel

> Since there have been no more comments on the patch, can it be applied?

Yes, please,


        Stefan



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-03-19 16:17 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-03-12 14:02 Forbid reverse ranges in rx Mattias Engdegård
2019-03-12 14:27 ` Clément Pit-Claudel
2019-03-12 15:26   ` Mattias Engdegård
2019-03-12 16:56     ` Stefan Monnier
2019-03-12 21:01       ` Mattias Engdegård
2019-03-19 15:02         ` Mattias Engdegård
2019-03-19 16:17           ` Stefan Monnier
2019-03-12 16:12 ` Eli Zaretskii

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).