all messages for Emacs-related lists mirrored at yhetil.org
* bug#34492: rx: ASCII-raw byte ranges comprise all of Unicode
@ 2019-02-15 18:23 Mattias Engdegård
       [not found] ` <handler.34492.B.15502550523602.ack@debbugs.gnu.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Mattias Engdegård @ 2019-02-15 18:23 UTC
  To: 34492

`rx' incorrectly considers character ranges between ASCII and raw bytes to cover all codes in-between, which includes all non-ASCII Unicode chars.
This causes (any "\000-\377" ?Å) to be simplified to (any "\000-\377"), which is not at all the same thing: [\000-\377] really means [\000-\177\200-\377] -- the transformation is normally made by the Emacs regexp engine. The two ranges are not contiguous on the character code level.

It's a sleeper bug that was awakened by my fixing bug#33205, so I'm to blame for not checking this.
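
To make the gap between the two ranges concrete, here is a small sketch of the character-code layout involved. The `demo-' names are hypothetical and exist only for illustration; `unibyte-char-to-multibyte' is the standard function that maps a byte to its multibyte character code.

```elisp
;; Illustration only: the `demo-' names are not part of rx.el.
(defconst demo-ascii-max #x7f
  "Highest ASCII character code.")
(defconst demo-raw-min (unibyte-char-to-multibyte #x80)
  "Character code of raw byte 128 (#x3fff80).")
(defconst demo-raw-max (unibyte-char-to-multibyte #xff)
  "Character code of raw byte 255 (#x3fffff).")

(defun demo-in-gap-p (char)
  "Return non-nil if CHAR lies strictly between ASCII and the raw bytes."
  (< demo-ascii-max char demo-raw-min))

(demo-in-gap-p ?Å)            ; => t: U+00C5 sits in the gap
(demo-in-gap-p ?a)            ; => nil: ASCII
(demo-in-gap-p demo-raw-max)  ; => nil: raw byte
```

Any non-ASCII Unicode character falls in that gap, so a single range from an ASCII code to a raw-byte code would swallow all of them.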







* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
       [not found] ` <handler.34492.B.15502550523602.ack@debbugs.gnu.org>
@ 2019-02-15 18:29   ` Mattias Engdegård
  2019-02-16  7:20     ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Mattias Engdegård @ 2019-02-15 18:29 UTC
  To: 34492

[-- Attachment #1: Type: text/plain, Size: 8 bytes --]

Patch.


[-- Attachment #2: 0001-Prevent-over-eager-rx-character-range-condensation.patch --]
[-- Type: application/octet-stream, Size: 2596 bytes --]

From 39a593336d00c3418f52fbe205b4dc284e8b65ce Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 15 Feb 2019 19:27:48 +0100
Subject: [PATCH] Prevent over-eager rx character range condensation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

`rx' incorrectly considers character ranges between ASCII and raw bytes to
cover all codes in-between, which includes all non-ASCII Unicode chars.
This causes (any "\000-\377" ?Å) to be simplified to (any "\000-\377"),
which is not at all the same thing: [\000-\377] really means
[\000-\177\200-\377] (Bug#34492).

* lisp/emacs-lisp/rx.el (rx-any-condense-range): Split ranges going
from ASCII to raw bytes.
* test/lisp/emacs-lisp/rx-tests.el (rx-char-any-raw-byte): Add test case.
---
 lisp/emacs-lisp/rx.el            | 7 +++++++
 test/lisp/emacs-lisp/rx-tests.el | 6 +++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index b2299030a1..715cd608c4 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -429,6 +429,13 @@ Only both edges of each range is checked."
     ;; set L list of all ranges
     (mapc (lambda (e) (cond ((stringp e) (push e str))
 			    ((numberp e) (push (cons e e) l))
+                            ;; Ranges between ASCII and raw bytes are split,
+                            ;; to prevent accidental inclusion of Unicode
+                            ;; characters later on.
+                            ((and (<= (car e) #x7f)
+                                  (>= (cdr e) #x3fff80))
+                             (push (cons (car e) #x7f) l)
+                             (push (cons #x3fff80 (cdr e)) l))
 			    (t (push e l))))
 	  args)
     ;; condense overlapped ranges in L
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index f15e1016f7..e14feda347 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -53,7 +53,11 @@
   ;; Range of raw characters, multibyte.
   (should (equal (string-match-p (rx (any "Å\211\326-\377\177"))
                                  "XY\355\177\327")
-                 2)))
+                 2))
+  ;; Split range; \177-\377ÿ should not be optimised to \177-\377.
+  (should (equal (string-match-p (rx (any "\177-\377" ?ÿ))
+                                 "ÿA\310B")
+                 0)))
 
 (ert-deftest rx-pcase ()
   (should (equal (pcase "a 1 2 3 1 1 b"
-- 
2.17.2 (Apple Git-113)
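
The range split that the patch adds to `rx-any-condense-range' can be tried in isolation. The following is a minimal sketch, not the code in rx.el: `demo-split-ascii-raw-range' is a hypothetical helper that applies the same test to a single (FROM . TO) cons.

```elisp
(defun demo-split-ascii-raw-range (range)
  "Split RANGE, a cons (FROM . TO), if it runs from ASCII into the raw bytes.
Return a list of one or two ranges.  Hypothetical helper for illustration."
  (let ((from (car range))
        (to (cdr range)))
    (if (and (<= from #x7f) (>= to #x3fff80))
        ;; Keep the ASCII part and the raw-byte part, drop everything between.
        (list (cons from #x7f) (cons #x3fff80 to))
      (list range))))

(demo-split-ascii-raw-range '(0 . #x3fffff))
;; => ((0 . 127) (4194176 . 4194303))
(demo-split-ascii-raw-range '(?a . ?z))
;; => ((97 . 122))
```

With the range split this way, a character such as ?Å no longer falls inside any range, so it is not dropped as redundant when the ranges are condensed.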



* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
  2019-02-15 18:29   ` bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode) Mattias Engdegård
@ 2019-02-16  7:20     ` Eli Zaretskii
  2019-02-16  8:08       ` Mattias Engdegård
  0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2019-02-16  7:20 UTC
  To: Mattias Engdegård; +Cc: 34492

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 15 Feb 2019 19:29:28 +0100
> 
> Patch.

Thanks, this LGTM, but I think this should be in NEWS.  It's arguably
a bug, but only arguably, and it changes user-visible behavior.






* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
  2019-02-16  7:20     ` Eli Zaretskii
@ 2019-02-16  8:08       ` Mattias Engdegård
  2019-02-16 10:14         ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Mattias Engdegård @ 2019-02-16  8:08 UTC
  To: Eli Zaretskii; +Cc: 34492

On 16 Feb 2019, at 08:20, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> Thanks, this LGTM, but I think this should be in NEWS.  It's arguably
> a bug, but only arguably, and it changes user-visible behavior.

I'll be happy to write a NEWS item, but for what? The change of bug #33205, or this change, which is not visible unless the other change is already applied (and it hasn't made it into a release yet)?

If you mean the #33205 fix, it might result in something like the following:

** `rx' now handles raw bytes in character alternatives correctly when
given in a string.  Previously, `(any "\x80-\xff")' would match characters
U+0080...U+00FF.  Now the expression matches raw bytes in the 128...255 range,
as expected.

Is that what you had in mind? If so, in what subsection would it go?

* Changes in Specialized Modes and Packages
* Incompatible Lisp Changes
* Lisp Changes
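
Why a string such as "\x80-\xff" denotes raw bytes rather than U+0080...U+00FF can be seen from how Emacs reads the literal. This is a small sketch using only standard functions; the `demo-range-string' name is hypothetical, and nothing here is meant to describe how rx.el processes the string internally.

```elisp
;; A string literal whose only non-ASCII content is hex or octal
;; escapes below 256 is read as a unibyte string.
(defconst demo-range-string "\x80-\xff")

(multibyte-string-p demo-range-string)            ; => nil
;; Converting it to multibyte keeps the bytes as raw-byte characters
;; instead of turning them into U+0080 and U+00FF.
(append (string-to-multibyte demo-range-string) nil)
;; => (4194176 45 4194303), i.e. (#x3fff80 ?- #x3fffff)
```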







* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
  2019-02-16  8:08       ` Mattias Engdegård
@ 2019-02-16 10:14         ` Eli Zaretskii
  2019-02-16 11:05           ` Mattias Engdegård
  0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2019-02-16 10:14 UTC
  To: Mattias Engdegård; +Cc: 34492

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 16 Feb 2019 09:08:11 +0100
> Cc: 34492@debbugs.gnu.org
> 
> On 16 Feb 2019, at 08:20, Eli Zaretskii <eliz@gnu.org> wrote:
> > 
> > Thanks, this LGTM, but I think this should be in NEWS.  It's arguably
> > a bug, but only arguably, and it changes user-visible behavior.
> 
> I'll be happy to write a NEWS item, but for what? The change of bug #33205, or this change, which is not visible unless the other change is already applied (and it hasn't made it into a release yet)?

I mean both.

> If you mean the #33205 fix, it might result in something like the following:
> 
> ** `rx' now handles raw bytes in character alternatives correctly when
> given in a string.  Previously, `(any "\x80-\xff")' would match characters
> U+0080...U+00FF.  Now the expression matches raw bytes in the 128...255 range,
> as expected.
> 
> Is that what you had in mind?

Yes.

> If so, in what subsection would it go?

Either make a new section for rx under "Changes in Specialized Modes
and Packages", or put it under "Incompatible Lisp Changes".

Thanks.






* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
  2019-02-16 10:14         ` Eli Zaretskii
@ 2019-02-16 11:05           ` Mattias Engdegård
  2019-02-16 11:40             ` Eli Zaretskii
  0 siblings, 1 reply; 8+ messages in thread
From: Mattias Engdegård @ 2019-02-16 11:05 UTC
  To: Eli Zaretskii; +Cc: 34492

[-- Attachment #1: Type: text/plain, Size: 341 bytes --]

On 16 Feb 2019, at 11:14, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> Either make a new section for rx under "Changes in Specialized Modes
> and Packages", or put it under "Incompatible Lisp Changes".

I picked the former; thanks for reviewing.
Since it's my first change to NEWS, I'm attaching the modified patch here for a final look.

[-- Attachment #2: 0001-Prevent-over-eager-rx-character-range-condensation.patch --]
[-- Type: application/octet-stream, Size: 3296 bytes --]

From b3e549114ab705d3efd866adea6a0cce76febb49 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 15 Feb 2019 19:27:48 +0100
Subject: [PATCH] Prevent over-eager rx character range condensation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

`rx' incorrectly considers character ranges between ASCII and raw bytes to
cover all codes in-between, which includes all non-ASCII Unicode chars.
This causes (any "\000-\377" ?Å) to be simplified to (any "\000-\377"),
which is not at all the same thing: [\000-\377] really means
[\000-\177\200-\377] (Bug#34492).

* lisp/emacs-lisp/rx.el (rx-any-condense-range): Split ranges going
from ASCII to raw bytes.
* test/lisp/emacs-lisp/rx-tests.el (rx-char-any-raw-byte): Add test case.
* etc/NEWS: Mention the overall change (Bug#33205).
---
 etc/NEWS                         | 8 ++++++++
 lisp/emacs-lisp/rx.el            | 7 +++++++
 test/lisp/emacs-lisp/rx-tests.el | 6 +++++-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/etc/NEWS b/etc/NEWS
index 70a50c02c4..987e661044 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1101,6 +1101,14 @@ subexpression.
 When there is no menu for a mode, display the mode name after the
 indicator instead of just the indicator (which is sometimes cryptic).
 
+** rx
+
+---
+*** rx now handles raw bytes in character alternatives correctly,
+when given in a string.  Previously, `(any "\x80-\xff")' would match
+characters U+0080...U+00FF.  Now the expression matches raw bytes in
+the 128...255 range, as expected.
+
 \f
 * New Modes and Packages in Emacs 27.1
 
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index b2299030a1..715cd608c4 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -429,6 +429,13 @@ Only both edges of each range is checked."
     ;; set L list of all ranges
     (mapc (lambda (e) (cond ((stringp e) (push e str))
 			    ((numberp e) (push (cons e e) l))
+                            ;; Ranges between ASCII and raw bytes are split,
+                            ;; to prevent accidental inclusion of Unicode
+                            ;; characters later on.
+                            ((and (<= (car e) #x7f)
+                                  (>= (cdr e) #x3fff80))
+                             (push (cons (car e) #x7f) l)
+                             (push (cons #x3fff80 (cdr e)) l))
 			    (t (push e l))))
 	  args)
     ;; condense overlapped ranges in L
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index f15e1016f7..e14feda347 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -53,7 +53,11 @@
   ;; Range of raw characters, multibyte.
   (should (equal (string-match-p (rx (any "Å\211\326-\377\177"))
                                  "XY\355\177\327")
-                 2)))
+                 2))
+  ;; Split range; \177-\377ÿ should not be optimised to \177-\377.
+  (should (equal (string-match-p (rx (any "\177-\377" ?ÿ))
+                                 "ÿA\310B")
+                 0)))
 
 (ert-deftest rx-pcase ()
   (should (equal (pcase "a 1 2 3 1 1 b"
-- 
2.17.2 (Apple Git-113)



* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
  2019-02-16 11:05           ` Mattias Engdegård
@ 2019-02-16 11:40             ` Eli Zaretskii
  2019-02-16 11:46               ` Mattias Engdegård
  0 siblings, 1 reply; 8+ messages in thread
From: Eli Zaretskii @ 2019-02-16 11:40 UTC
  To: Mattias Engdegård; +Cc: 34492

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 16 Feb 2019 12:05:09 +0100
> Cc: 34492@debbugs.gnu.org
> 
> +** rx
> +
> +---
> +*** rx now handles raw bytes in character alternatives correctly,
> +when given in a string.  Previously, `(any "\x80-\xff")' would match
> +characters U+0080...U+00FF.  Now the expression matches raw bytes in
> +the 128...255 range, as expected.

This is OK, but we use quoting 'like this' in NEWS.

Thanks.






* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
  2019-02-16 11:40             ` Eli Zaretskii
@ 2019-02-16 11:46               ` Mattias Engdegård
  0 siblings, 0 replies; 8+ messages in thread
From: Mattias Engdegård @ 2019-02-16 11:46 UTC
  To: Eli Zaretskii; +Cc: 34492-done

On 16 Feb 2019, at 12:40, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> This is OK, but we use quoting 'like this' in NEWS.

Thank you, pushed with that modification.







end of thread

Thread overview: 8+ messages
2019-02-15 18:23 bug#34492: rx: ASCII-raw byte ranges comprise all of Unicode Mattias Engdegård
     [not found] ` <handler.34492.B.15502550523602.ack@debbugs.gnu.org>
2019-02-15 18:29   ` bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode) Mattias Engdegård
2019-02-16  7:20     ` Eli Zaretskii
2019-02-16  8:08       ` Mattias Engdegård
2019-02-16 10:14         ` Eli Zaretskii
2019-02-16 11:05           ` Mattias Engdegård
2019-02-16 11:40             ` Eli Zaretskii
2019-02-16 11:46               ` Mattias Engdegård
