* bug#34492: rx: ASCII-raw byte ranges comprise all of Unicode
From: Mattias Engdegård @ 2019-02-15 18:23 UTC
To: 34492

`rx' incorrectly considers character ranges between ASCII and raw bytes
to cover all codes in-between, which includes all non-ASCII Unicode chars.
This causes (any "\000-\377" ?Å) to be simplified to (any "\000-\377"),
which is not at all the same thing: [\000-\377] really means
[\000-\177\200-\377] -- the transformation is normally made by the Emacs
regexp engine. The two ranges are not contiguous on the character code level.

It's a sleeper bug that was awakened by my fixing bug#33205, so I'm to
blame for not checking this.
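The gap described in the report can be checked directly in Emacs Lisp. The following is a minimal sketch, not the code in rx.el: `my-split-ascii-raw-range' is a hypothetical helper invented here for illustration, and it assumes a range is a cons (FROM . TO) of character codes, as in the patch that follows in this thread.

```elisp
;; Raw bytes 128..255 are represented internally by character codes
;; #x3fff80..#x3fffff, which lie above every Unicode code point.
(unibyte-char-to-multibyte #x80)        ; raw byte \200 as a character code
(unibyte-char-to-multibyte #xff)        ; raw byte \377 as a character code

;; Hypothetical helper (not part of rx.el): split a range that starts
;; in ASCII and ends among the raw bytes into its two real parts.
(defun my-split-ascii-raw-range (range)
  "Split RANGE, a cons (FROM . TO), if it runs from ASCII into raw bytes.
Return a list of one or two ranges."
  (let ((from (car range))
        (to (cdr range)))
    (if (and (<= from #x7f) (>= to #x3fff80))
        ;; The codes between #x7f and #x3fff80 are Unicode characters
        ;; that the original range never meant to include.
        (list (cons from #x7f) (cons #x3fff80 to))
      (list range))))

(my-split-ascii-raw-range '(0 . #x3fffff))
;; => ((0 . 127) (4194176 . 4194303))
```

Splitting before any merging step means a neighbouring character such as ?Å (code #xc5) no longer falls inside a single oversized range, so it cannot be dropped as redundant.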
* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
From: Mattias Engdegård @ 2019-02-15 18:29 UTC
To: 34492

[-- Attachment #1: Type: text/plain, Size: 8 bytes --]

Patch.

[-- Attachment #2: 0001-Prevent-over-eager-rx-character-range-condensation.patch --]
[-- Type: application/octet-stream, Size: 2596 bytes --]

From 39a593336d00c3418f52fbe205b4dc284e8b65ce Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 15 Feb 2019 19:27:48 +0100
Subject: [PATCH] Prevent over-eager rx character range condensation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

`rx' incorrectly considers character ranges between ASCII and raw bytes
to cover all codes in-between, which includes all non-ASCII Unicode
chars. This causes (any "\000-\377" ?Å) to be simplified to
(any "\000-\377"), which is not at all the same thing: [\000-\377]
really means [\000-\177\200-\377] (Bug#34492).

* lisp/emacs-lisp/rx.el (rx-any-condense-range): Split ranges going
from ASCII to raw bytes.
* test/lisp/emacs-lisp/rx-tests.el (rx-char-any-raw-byte): Add test case.
---
 lisp/emacs-lisp/rx.el            | 7 +++++++
 test/lisp/emacs-lisp/rx-tests.el | 6 +++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index b2299030a1..715cd608c4 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -429,6 +429,13 @@ Only both edges of each range is checked."
     ;; set L list of all ranges
     (mapc (lambda (e) (cond ((stringp e) (push e str))
                             ((numberp e) (push (cons e e) l))
+                            ;; Ranges between ASCII and raw bytes are split,
+                            ;; to prevent accidental inclusion of Unicode
+                            ;; characters later on.
+                            ((and (<= (car e) #x7f)
+                                  (>= (cdr e) #x3fff80))
+                             (push (cons (car e) #x7f) l)
+                             (push (cons #x3fff80 (cdr e)) l))
                             (t (push e l))))
           args)
     ;; condense overlapped ranges in L
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index f15e1016f7..e14feda347 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -53,7 +53,11 @@
   ;; Range of raw characters, multibyte.
   (should (equal (string-match-p (rx (any "Å\211\326-\377\177"))
                                  "XY\355\177\327")
-                 2)))
+                 2))
+  ;; Split range; \177-\377ÿ should not be optimised to \177-\377.
+  (should (equal (string-match-p (rx (any "\177-\377" ?ÿ))
+                                 "ÿA\310B")
+                 0)))
 
 (ert-deftest rx-pcase ()
   (should (equal (pcase "a 1 2 3 1 1 b"
-- 
2.17.2 (Apple Git-113)
* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
From: Eli Zaretskii @ 2019-02-16 7:20 UTC
To: Mattias Engdegård; +Cc: 34492

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 15 Feb 2019 19:29:28 +0100
>
> Patch.

Thanks, this LGTM, but I think this should be in NEWS. It's arguably
a bug, but only arguably, and it changes user-visible behavior.
* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
From: Mattias Engdegård @ 2019-02-16 8:08 UTC
To: Eli Zaretskii; +Cc: 34492

On 16 Feb 2019, at 08:20, Eli Zaretskii <eliz@gnu.org> wrote:
>
> Thanks, this LGTM, but I think this should be in NEWS. It's arguably
> a bug, but only arguably, and it changes user-visible behavior.

I'll be happy to write a NEWS item, but for what? The change of
bug #33205, or this change, which is not visible unless the other change
is already applied (and it hasn't made it into a release yet)?

If you mean the #33205 fix, it might result in something like the following:

** `rx' now handles raw bytes in character alternatives correctly when
given in a string. Previously, `(any "\x80-\xff")' would match characters
U+0080...U+00FF. Now the expression matches raw bytes in the 128...255 range,
as expected.

Is that what you had in mind? If so, in what subsection would it go?

* Changes in Specialized Modes and Packages
* Incompatible Lisp Changes
* Lisp Changes
* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
From: Eli Zaretskii @ 2019-02-16 10:14 UTC
To: Mattias Engdegård; +Cc: 34492

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 16 Feb 2019 09:08:11 +0100
> Cc: 34492@debbugs.gnu.org
>
> On 16 Feb 2019, at 08:20, Eli Zaretskii <eliz@gnu.org> wrote:
> >
> > Thanks, this LGTM, but I think this should be in NEWS. It's arguably
> > a bug, but only arguably, and it changes user-visible behavior.
>
> I'll be happy to write a NEWS item, but for what? The change of
> bug #33205, or this change, which is not visible unless the other change
> is already applied (and it hasn't made it into a release yet)?

I mean both.

> If you mean the #33205 fix, it might result in something like the following:
>
> ** `rx' now handles raw bytes in character alternatives correctly when
> given in a string. Previously, `(any "\x80-\xff")' would match characters
> U+0080...U+00FF. Now the expression matches raw bytes in the 128...255 range,
> as expected.
>
> Is that what you had in mind?

Yes.

> If so, in what subsection would it go?

Either make a new section for rx under "Changes in Specialized Modes
and Packages", or put it under "Incompatible Lisp Changes".

Thanks.
* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
From: Mattias Engdegård @ 2019-02-16 11:05 UTC
To: Eli Zaretskii; +Cc: 34492

[-- Attachment #1: Type: text/plain, Size: 341 bytes --]

On 16 Feb 2019, at 11:14, Eli Zaretskii <eliz@gnu.org> wrote:
>
> Either make a new section for rx under "Changes in Specialized Modes
> and Packages", or put it under "Incompatible Lisp Changes".

I picked the former --- thanks for reviewing.
Since it's my first change to NEWS, I'm attaching the modified patch here
for a final look.

[-- Attachment #2: 0001-Prevent-over-eager-rx-character-range-condensation.patch --]
[-- Type: application/octet-stream, Size: 3296 bytes --]

From b3e549114ab705d3efd866adea6a0cce76febb49 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 15 Feb 2019 19:27:48 +0100
Subject: [PATCH] Prevent over-eager rx character range condensation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

`rx' incorrectly considers character ranges between ASCII and raw bytes
to cover all codes in-between, which includes all non-ASCII Unicode
chars. This causes (any "\000-\377" ?Å) to be simplified to
(any "\000-\377"), which is not at all the same thing: [\000-\377]
really means [\000-\177\200-\377] (Bug#34492).

* lisp/emacs-lisp/rx.el (rx-any-condense-range): Split ranges going
from ASCII to raw bytes.
* test/lisp/emacs-lisp/rx-tests.el (rx-char-any-raw-byte): Add test case.
* etc/NEWS: Mention the overall change (Bug#33205).
---
 etc/NEWS                         | 8 ++++++++
 lisp/emacs-lisp/rx.el            | 7 +++++++
 test/lisp/emacs-lisp/rx-tests.el | 6 +++++-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/etc/NEWS b/etc/NEWS
index 70a50c02c4..987e661044 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1101,6 +1101,14 @@ subexpression.
 When there is no menu for a mode, display the mode name after the
 indicator instead of just the indicator (which is sometimes cryptic).
 
+** rx
+
+---
+*** rx now handles raw bytes in character alternatives correctly,
+when given in a string. Previously, `(any "\x80-\xff")' would match
+characters U+0080...U+00FF. Now the expression matches raw bytes in
+the 128...255 range, as expected.
+
 \f
 * New Modes and Packages in Emacs 27.1
 
diff --git a/lisp/emacs-lisp/rx.el b/lisp/emacs-lisp/rx.el
index b2299030a1..715cd608c4 100644
--- a/lisp/emacs-lisp/rx.el
+++ b/lisp/emacs-lisp/rx.el
@@ -429,6 +429,13 @@ Only both edges of each range is checked."
     ;; set L list of all ranges
     (mapc (lambda (e) (cond ((stringp e) (push e str))
                             ((numberp e) (push (cons e e) l))
+                            ;; Ranges between ASCII and raw bytes are split,
+                            ;; to prevent accidental inclusion of Unicode
+                            ;; characters later on.
+                            ((and (<= (car e) #x7f)
+                                  (>= (cdr e) #x3fff80))
+                             (push (cons (car e) #x7f) l)
+                             (push (cons #x3fff80 (cdr e)) l))
                             (t (push e l))))
           args)
     ;; condense overlapped ranges in L
diff --git a/test/lisp/emacs-lisp/rx-tests.el b/test/lisp/emacs-lisp/rx-tests.el
index f15e1016f7..e14feda347 100644
--- a/test/lisp/emacs-lisp/rx-tests.el
+++ b/test/lisp/emacs-lisp/rx-tests.el
@@ -53,7 +53,11 @@
   ;; Range of raw characters, multibyte.
   (should (equal (string-match-p (rx (any "Å\211\326-\377\177"))
                                  "XY\355\177\327")
-                 2)))
+                 2))
+  ;; Split range; \177-\377ÿ should not be optimised to \177-\377.
+  (should (equal (string-match-p (rx (any "\177-\377" ?ÿ))
+                                 "ÿA\310B")
+                 0)))
 
 (ert-deftest rx-pcase ()
   (should (equal (pcase "a 1 2 3 1 1 b"
-- 
2.17.2 (Apple Git-113)
* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
From: Eli Zaretskii @ 2019-02-16 11:40 UTC
To: Mattias Engdegård; +Cc: 34492

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 16 Feb 2019 12:05:09 +0100
> Cc: 34492@debbugs.gnu.org
>
> +** rx
> +
> +---
> +*** rx now handles raw bytes in character alternatives correctly,
> +when given in a string. Previously, `(any "\x80-\xff")' would match
> +characters U+0080...U+00FF. Now the expression matches raw bytes in
> +the 128...255 range, as expected.

This is OK, but we use quoting 'like this' in NEWS.

Thanks.
* bug#34492: Acknowledgement (rx: ASCII-raw byte ranges comprise all of Unicode)
From: Mattias Engdegård @ 2019-02-16 11:46 UTC
To: Eli Zaretskii; +Cc: 34492-done

On 16 Feb 2019, at 12:40, Eli Zaretskii <eliz@gnu.org> wrote:
>
> This is OK, but we use quoting 'like this' in NEWS.

Thank you, pushed with that modification.