* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps @ 2009-06-26 9:56 YAMAMOTO Mitsuharu 2009-06-26 13:43 ` Eli Zaretskii 2019-06-28 12:41 ` bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] Mattias Engdegård 0 siblings, 2 replies; 17+ messages in thread From: YAMAMOTO Mitsuharu @ 2009-06-26 9:56 UTC (permalink / raw) To: emacs-pretest-bug The following results look inconsistent: (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80")) => 0 (string-match (string-to-multibyte "\x80") "\x80") => nil (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80")) => nil (string-match (string-to-multibyte "[\x80]") "\x80") => 0 YAMAMOTO Mitsuharu mituharu@math.s.chiba-u.ac.jp In GNU Emacs 23.1.50.1 (sparc-sun-solaris2.8, X toolkit, Xaw3d scroll bars) of 2009-06-26 on church Windowing system distributor `The X.Org Foundation', version 11.0.10402000 configured using `configure 'LDFLAGS=-L/usr/local/lib -R/usr/local/lib' 'CPPFLAGS=-I/usr/local/lib'' Important settings: value of $LC_ALL: nil value of $LC_COLLATE: nil value of $LC_CTYPE: nil value of $LC_MESSAGES: nil value of $LC_MONETARY: nil value of $LC_NUMERIC: nil value of $LC_TIME: nil value of $LANG: ja value of $XMODIFIERS: nil locale-coding-system: japanese-iso-8bit-unix default-enable-multibyte-characters: t Major mode: Fundamental Minor modes in effect: tooltip-mode: t tool-bar-mode: t mouse-wheel-mode: t menu-bar-mode: t file-name-shadow-mode: t global-font-lock-mode: t blink-cursor-mode: t global-auto-composition-mode: t auto-encryption-mode: t auto-compression-mode: t line-number-mode: t transient-mark-mode: t ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps 2009-06-26 9:56 bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps YAMAMOTO Mitsuharu @ 2009-06-26 13:43 ` Eli Zaretskii 2009-06-27 1:30 ` YAMAMOTO Mitsuharu 2019-06-28 12:41 ` bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] Mattias Engdegård 1 sibling, 1 reply; 17+ messages in thread From: Eli Zaretskii @ 2009-06-26 13:43 UTC (permalink / raw) To: YAMAMOTO Mitsuharu, 3687 > Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST) > From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> > Cc: > > The following results look inconsistent: > > (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80")) > => 0 > (string-match (string-to-multibyte "\x80") "\x80") > => nil > > (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80")) > => nil > (string-match (string-to-multibyte "[\x80]") "\x80") > => 0 Please tell why you think they are inconsistent. More importantly, please show real-life examples of code or situations where this gets in your way. This area is full of subtleties and gotchas, and in general the current code does what it does because it needs to cater to many different practical situations. There could still be bugs, of course. ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps 2009-06-26 13:43 ` Eli Zaretskii @ 2009-06-27 1:30 ` YAMAMOTO Mitsuharu 2009-06-27 9:36 ` Eli Zaretskii 0 siblings, 1 reply; 17+ messages in thread From: YAMAMOTO Mitsuharu @ 2009-06-27 1:30 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 3687 >>>>> On Fri, 26 Jun 2009 16:43:25 +0300, Eli Zaretskii <eliz@gnu.org> said: >> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST) >> From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> >> Cc: >> >> The following results look inconsistent: >> >> (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80")) >> => 0 >> (string-match (string-to-multibyte "\x80") "\x80") >> => nil >> >> (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80")) >> => nil >> (string-match (string-to-multibyte "[\x80]") "\x80") >> => 0 > Please tell why you think they are inconsistent. I thought there's no room for argument about their inconsistency with respect to the specification of "[...]" in regexps. > More importantly, please show real-life examples of code or > situations where this gets in your way. If you decode some data containing invalid (undecodable) byte sequences using a coding system such as utf-8, then such sequences are embedded in the decoded result as eight-bit characters in multibyte form. You can detect particular such sequences by searching for a "character alternative" regexp (or its multibyte form) in the decoded result if it works. 
Further examples that look inconsistent: (string-match (string-to-multibyte "[\x80\x81]") (string-to-multibyte "\x80")) => nil (string-match (string-to-multibyte "[\x80-\xbf]") (string-to-multibyte "\x80")) => nil (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\x80")) => 0 (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\xbf")) => 0 (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\xc0")) => nil > This area is full of subtleties and gotchas, and in general the > current code does what it does because it needs to cater to many > different practical situations. > There could still be bugs, of course. Yeah. I found another suspected bug in this area: (string-match "[[:unibyte:]]" "\x80") => nil (string-match "[[:unibyte:]]" (string-to-multibyte "\x80")) => nil YAMAMOTO Mitsuharu mituharu@math.s.chiba-u.ac.jp ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps 2009-06-27 1:30 ` YAMAMOTO Mitsuharu @ 2009-06-27 9:36 ` Eli Zaretskii 2009-06-29 3:02 ` YAMAMOTO Mitsuharu 0 siblings, 1 reply; 17+ messages in thread From: Eli Zaretskii @ 2009-06-27 9:36 UTC (permalink / raw) To: YAMAMOTO Mitsuharu; +Cc: 3687 > Date: Sat, 27 Jun 2009 10:30:10 +0900 > From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> > Cc: 3687@emacsbugs.donarmstrong.com > > >>>>> On Fri, 26 Jun 2009 16:43:25 +0300, Eli Zaretskii <eliz@gnu.org> said: > > >> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST) > >> From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> > >> Cc: > >> > >> The following results look inconsistent: > >> > >> (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80")) > >> => 0 > >> (string-match (string-to-multibyte "\x80") "\x80") > >> => nil > >> > >> (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80")) > >> => nil > >> (string-match (string-to-multibyte "[\x80]") "\x80") > >> => 0 > > > Please tell why you think they are inconsistent. > > I thought there's no room for argument about their inconsistency with > respect to the specification of "[...]" in regexps. Well, obviously there is such a room. Please consider explaining why you think there's inconsistency. ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps 2009-06-27 9:36 ` Eli Zaretskii @ 2009-06-29 3:02 ` YAMAMOTO Mitsuharu 2009-06-29 8:47 ` Stefan Monnier 0 siblings, 1 reply; 17+ messages in thread From: YAMAMOTO Mitsuharu @ 2009-06-29 3:02 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 3687 >>>>> On Sat, 27 Jun 2009 12:36:03 +0300, Eli Zaretskii <eliz@gnu.org> said: >> Date: Sat, 27 Jun 2009 10:30:10 +0900 >> From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> >> Cc: 3687@emacsbugs.donarmstrong.com >> >> >>>>> On Fri, 26 Jun 2009 16:43:25 +0300, Eli Zaretskii <eliz@gnu.org> said: >> >> >> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST) >> >> From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> >> >> Cc: >> >> >> >> The following results look inconsistent: >> >> >> >> (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80")) >> >> => 0 >> >> (string-match (string-to-multibyte "\x80") "\x80") >> >> => nil >> >> >> >> (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80")) >> >> => nil >> >> (string-match (string-to-multibyte "[\x80]") "\x80") >> >> => 0 >> >> > Please tell why you think they are inconsistent. >> >> I thought there's no room for argument about their inconsistency with >> respect to the specification of "[...]" in regexps. > Well, obviously there is such a room. Please consider explaining why > you think there's inconsistency. It seemed to be too obvious to explain and I hesitated to do that. Anyway, I assume "C" and "[C]" work equivalently as regexps if the character C has no special meaning in either context. YAMAMOTO Mitsuharu mituharu@math.s.chiba-u.ac.jp ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps 2009-06-29 3:02 ` YAMAMOTO Mitsuharu @ 2009-06-29 8:47 ` Stefan Monnier 2009-07-24 1:08 ` YAMAMOTO Mitsuharu 0 siblings, 1 reply; 17+ messages in thread From: Stefan Monnier @ 2009-06-29 8:47 UTC (permalink / raw) To: YAMAMOTO Mitsuharu; +Cc: 3687 > It seemed to be too obvious to explain and I hesitated to do that. > Anyway, I assume "C" and "[C]" work equivalently as regexps if the > character C has no special meaning in either context. Yes, it's pretty obvious, thank you. I haven't had time to look deeper, but that part of the code is pretty nasty because it tries to be clever about the fact that values between 128-256 can be either latin-1 chars and eight-bit-bytes and it tries to be lenient about confusion between the two. The behavior you see is clearly a bug. Stefan ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps 2009-06-29 8:47 ` Stefan Monnier @ 2009-07-24 1:08 ` YAMAMOTO Mitsuharu 0 siblings, 0 replies; 17+ messages in thread From: YAMAMOTO Mitsuharu @ 2009-07-24 1:08 UTC (permalink / raw) To: Stefan Monnier; +Cc: 3687 >>>>> On Mon, 29 Jun 2009 10:47:30 +0200, Stefan Monnier <monnier@iro.umontreal.ca> said: >> It seemed to be too obvious to explain and I hesitated to do that. >> Anyway, I assume "C" and "[C]" work equivalently as regexps if the >> character C has no special meaning in either context. > Yes, it's pretty obvious, thank you. I haven't had time to look > deeper, but that part of the code is pretty nasty because it tries > to be clever about the fact that values between 128-256 can be > either latin-1 chars and eight-bit-bytes and it tries to be lenient > about confusion between the two. Are there any written specifications explaining how the leniency is supposed to work? As for documentations, the description below in the elisp info (Special Characters in Regular Expressions) probably needs to be updated. The beginning and end of a range of multibyte characters must be in the same character set (*note Character Sets::). Thus, `"[\x8e0-\x97c]"' is invalid because character 0x8e0 (`a' with grave accent) is in the Emacs character set for Latin-1 but the character 0x97c (`u' with diaeresis) is in the Emacs character set for Latin-2. (We use Lisp string syntax to write that example, and a few others in the next few paragraphs, in order to include hex escape sequences in them.) If a range starts with a unibyte character C and ends with a multibyte character C2, the range is divided into two parts: one is `C..?\377', the other is `C1..C2', where C1 is the first character of the charset to which C2 belongs. You cannot always match all non-ASCII characters with the regular expression `"[\200-\377]"'. 
This works when searching a unibyte buffer or string (*note Text Representations::), but not in a multibyte buffer or string, because many non-ASCII characters have codes above octal 0377. However, the regular expression `"[^\000-\177]"' does match all non-ASCII characters (see below regarding `^'), in both multibyte and unibyte representations, because only the ASCII characters are excluded. YAMAMOTO Mitsuharu mituharu@math.s.chiba-u.ac.jp ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] 2009-06-26 9:56 bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps YAMAMOTO Mitsuharu 2009-06-26 13:43 ` Eli Zaretskii @ 2019-06-28 12:41 ` Mattias Engdegård 2019-06-28 13:03 ` Eli Zaretskii 1 sibling, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2019-06-28 12:41 UTC (permalink / raw)
To: mituharu, Stefan Monnier, Eli Zaretskii; +Cc: 3687

[-- Attachment #1: Type: text/plain, Size: 559 bytes --]

Let's assume the following semantics as desirable:

1. All characters and raw bytes (up to regexp syntax) match themselves no matter whether they are given as literals or in character alternatives.
2. All raw bytes C match themselves and nothing else no matter whether the pattern or target string/buffer are unibyte or multibyte.
3. Ranges from ASCII to raw bytes work as expected and do not contain Unicode characters above U+007F.
4. Ranges from non-ASCII Unicode characters to raw bytes make no sense and are treated as empty.

Here is a patch.

[-- Attachment #2: 0001-Correct-regexp-matching-of-raw-bytes.patch --]
[-- Type: application/octet-stream, Size: 9348 bytes --]

From 6683077bf5d9509abbae050e1aa4c3dddae1bba9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 28 Jun 2019 10:20:55 +0200
Subject: [PATCH] Correct regexp matching of raw bytes

Make regexp matching of raw bytes work in all combinations of unibyte
and multibyte patterns and targets, as exact strings and in character
alternatives (bug#3687).

* src/regex-emacs.c (analyze_first): Include raw byte in fastmap when
pattern is a multibyte exact string.  Include leading byte in fastmap
for raw bytes in character alternatives.
(re_match_2_internal): Decrement the byte count by the number of bytes
in the pattern character, not 1.
* test/src/regex-emacs-tests.el (regexp-unibyte-unibyte)
(regexp-multibyte-unibyte, regexp-unibyte-multibyte)
(regexp-multibyte-multibyte): New tests.
--- src/regex-emacs.c | 24 +++++-- test/src/regex-emacs-tests.el | 120 ++++++++++++++++++++++++++++++++++ 2 files changed, 140 insertions(+), 4 deletions(-) diff --git a/src/regex-emacs.c b/src/regex-emacs.c index c353a78fb4..5887eaa30c 100644 --- a/src/regex-emacs.c +++ b/src/regex-emacs.c @@ -2794,6 +2794,7 @@ static int analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte) { int j, k; + int nbits; bool not; /* If all elements for base leading-codes in fastmap is set, this @@ -2854,7 +2855,14 @@ analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte) each byte is a character. Thus, this works in both cases. */ fastmap[p[1]] = 1; - if (! multibyte) + if (multibyte) + { + /* Cover the case of matching a raw char in a + multibyte regexp against unibyte. */ + if (CHAR_BYTE8_HEAD_P (p[1])) + fastmap[CHAR_TO_BYTE8 (STRING_CHAR (p + 1))] = 1; + } + else { /* For the case of matching this unibyte regex against multibyte, we must set a leading code of @@ -2886,11 +2894,18 @@ analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte) case charset: if (!fastmap) break; not = (re_opcode_t) *(p - 1) == charset_not; - for (j = CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH - 1, p++; - j >= 0; j--) + nbits = CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH; + p++; + for (j = 0; j < nbits; j++) if (!!(p[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH))) ^ not) fastmap[j] = 1; + /* To match raw bytes (in the 80..ff range) against multibyte + strings, add their leading bytes to the fastmap. */ + for (j = 0x80; j < nbits; j++) + if (!!(p[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH))) ^ not) + fastmap[CHAR_LEADING_CODE (BYTE8_TO_CHAR (j))] = 1; + if (/* Any leading code can possibly start a character which doesn't match the specified set of characters. 
*/ not @@ -4251,8 +4266,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, } p += pat_charlen; d++; + mcnt -= pat_charlen; } - while (--mcnt); + while (mcnt > 0); break; diff --git a/test/src/regex-emacs-tests.el b/test/src/regex-emacs-tests.el index 0ae50c94d4..50ed3e870a 100644 --- a/test/src/regex-emacs-tests.el +++ b/test/src/regex-emacs-tests.el @@ -683,4 +683,124 @@ regex-tests-TESTS (should-not (string-match "\\`x\\{65535\\}" (make-string 65534 ?x))) (should-error (string-match "\\`x\\{65536\\}" "X") :type 'invalid-regexp)) +(ert-deftest regexp-unibyte-unibyte () + "Test matching a unibyte regexp against a unibyte string." + ;; Sanity check + (should-not (multibyte-string-p "ab")) + (should-not (multibyte-string-p "\xff")) + ;; ASCII + (should (string-match "a[b]" "ab")) + ;; Raw + (should (string-match "\xf1" "\xf1")) + (should-not (string-match "\xf1" "\xc1\xb1")) + ;; Raw, char alt + (should (string-match "[\xf1]" "\xf1")) + (should-not (string-match "[\xf1]" "\xc1\xb1")) + ;; Raw range + (should (string-match "[\x82-\xd3]" "\xbb")) + (should-not (string-match "[\x82-\xd3]" "a")) + (should-not (string-match "[\x82-\xd3]" "\x81")) + (should-not (string-match "[\x82-\xd3]" "\xd4")) + ;; ASCII-raw range + (should (string-match "[f-\xd3]" "q")) + (should (string-match "[f-\xd3]" "\xbb")) + (should-not (string-match "[f-\xd3]" "e")) + (should-not (string-match "[f-\xd3]" "\xd4"))) + +(ert-deftest regexp-multibyte-multibyte () + "Test matching a multibyte regexp against a multibyte string." 
+ ;; Sanity check + (should (multibyte-string-p "åü")) + ;; ASCII + (should (string-match (string-to-multibyte "a[b]") + (string-to-multibyte "ab"))) + ;; Unicode + (should (string-match "å[ü]z" "åüz")) + (should-not (string-match "ü" (string-to-multibyte "\xc3\xbc"))) + ;; Raw + (should (string-match (string-to-multibyte "\xf1") + (string-to-multibyte "\xf1"))) + (should-not (string-match (string-to-multibyte "\xf1") + (string-to-multibyte "\xc1\xb1"))) + (should-not (string-match (string-to-multibyte "\xc1\xb1") + (string-to-multibyte "\xf1"))) + ;; Raw, char alt + (should (string-match (string-to-multibyte "[\xf1]") + (string-to-multibyte "\xf1"))) + ;; Raw range + (should (string-match (string-to-multibyte "[\x82-\xd3]") + (string-to-multibyte "\xbb"))) + (should-not (string-match (string-to-multibyte "[\x82-\xd3]") "a")) + (should-not (string-match (string-to-multibyte "[\x82-\xd3]") "Å")) + (should-not (string-match (string-to-multibyte "[\x82-\xd3]") "ü")) + (should-not (string-match (string-to-multibyte "[\x82-\xd3]") "\x81")) + (should-not (string-match (string-to-multibyte "[\x82-\xd3]") "\xd4")) + ;; ASCII-raw range: should exclude U+0100..U+10FFFF + (should (string-match (string-to-multibyte "[f-\xd3]") + (string-to-multibyte "q"))) + (should (string-match (string-to-multibyte "[f-\xd3]") + (string-to-multibyte "\xbb"))) + (should-not (string-match (string-to-multibyte "[f-\xd3]") "e")) + (should-not (string-match (string-to-multibyte "[f-\xd3]") "Å")) + (should-not (string-match (string-to-multibyte "[f-\xd3]") "ü")) + (should-not (string-match (string-to-multibyte "[f-\xd3]") "\xd4")) + ;; Unicode-raw range: should be empty + (should-not (string-match "[å-\xd3]" "å")) + (should-not (string-match "[å-\xd3]" (string-to-multibyte "\xd3"))) + (should-not (string-match "[å-\xd3]" (string-to-multibyte "\xbb"))) + (should-not (string-match "[å-\xd3]" "ü")) + ;; No equivalence between raw bytes and latin-1 + (should-not (string-match "å" (string-to-multibyte 
"\xe5"))) + (should-not (string-match "[å]" (string-to-multibyte "\xe5"))) + (should-not (string-match "\xe5" "å")) + (should-not (string-match "[\xe5]" "å"))) + +(ert-deftest regexp-unibyte-multibyte () + "Test matching a unibyte regexp against a multibyte string." + ;; ASCII + (should (string-match "a[b]" (string-to-multibyte "ab"))) + ;; Unicode + (should (string-match "a.[^b]c" (string-to-multibyte "aåüc"))) + ;; Raw + (should (string-match "\xf1" (string-to-multibyte "\xf1"))) + (should-not (string-match "\xc1\xb1" (string-to-multibyte "\xf1"))) + ;; Raw, char alt + (should (string-match "[\xf1]" (string-to-multibyte "\xf1"))) + (should-not (string-match "[\xc1][\xb1]" (string-to-multibyte "\xf1"))) + ;; ASCII-raw range: should exclude U+0100..U+10FFFF + (should (string-match "[f-\xd3]" (string-to-multibyte "q"))) + (should (string-match "[f-\xd3]" (string-to-multibyte "\xbb"))) + (should-not (string-match "[f-\xd3]" "e")) + (should-not (string-match "[f-\xd3]" "Å")) + (should-not (string-match "[f-\xd3]" "ü")) + (should-not (string-match "[f-\xd3]" "\xd4")) + ;; No equivalence between raw bytes and latin-1 + (should-not (string-match "\xe5" "å")) + (should-not (string-match "[\xe5]" "å"))) + +(ert-deftest regexp-multibyte-unibyte () + "Test matching a multibyte regexp against a unibyte string." 
+ ;; ASCII + (should (string-match (string-to-multibyte "a[b]") "ab")) + ;; Unicode + (should (string-match "a[^ü]c" "abc")) + (should-not (string-match "ü" "\xc3\xbc")) + ;; Raw + (should (string-match (string-to-multibyte "\xf1") "\xf1")) + (should-not (string-match (string-to-multibyte "\xf1") "\xc1\xb1")) + ;; Raw, char alt + (should (string-match (string-to-multibyte "[\xf1]") "\xf1")) + (should-not (string-match (string-to-multibyte "[\xf1]") "\xc1\xb1")) + ;; ASCII-raw range: should exclude U+0100..U+10FFFF + (should (string-match (string-to-multibyte "[f-\xd3]") "q")) + (should (string-match (string-to-multibyte "[f-\xd3]") "\xbb")) + (should-not (string-match (string-to-multibyte "[f-\xd3]") "e")) + (should-not (string-match (string-to-multibyte "[f-\xd3]") "\xd4")) + ;; Unicode-raw range: should be empty + (should-not (string-match "[å-\xd3]" "\xd3")) + (should-not (string-match "[å-\xd3]" "\xbb")) + ;; No equivalence between raw bytes and latin-1 + (should-not (string-match "å" "\xe5")) + (should-not (string-match "[å]" "\xe5"))) + ;;; regex-emacs-tests.el ends here -- 2.20.1 (Apple Git-117) ^ permalink raw reply related [flat|nested] 17+ messages in thread
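The character-alternative part of the analyze_first change can be sketched in isolation. This is only a minimal model under stated assumptions, not the code from regex-emacs.c: fill_fastmap, bit_selected and raw_byte_lead are hypothetical names, the fastmap is assumed to have 256 entries, and raw_byte_lead stands in for CHAR_LEADING_CODE (BYTE8_TO_CHAR (j)) on the assumption that raw bytes 0x80..0xBF lead with 0xC0 and 0xC0..0xFF lead with 0xC1.

```c
#include <assert.h>
#include <stdbool.h>

enum { BYTEWIDTH = 8 };

/* Hypothetical stand-in for CHAR_LEADING_CODE (BYTE8_TO_CHAR (j)).
   Assumption: raw bytes 0x80..0xBF lead with 0xC0, 0xC0..0xFF with 0xC1.  */
unsigned char
raw_byte_lead (int byte)
{
  return (unsigned char) (0xC0 | ((byte >> 6) & 1));
}

/* Return true if bit J of BITMAP is set, inverted when NEGATE is true
   (as for a [^...] alternative).  */
bool
bit_selected (const unsigned char *bitmap, int j, bool negate)
{
  bool set = (bitmap[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH))) != 0;
  return set != negate;
}

/* Mark in FASTMAP every byte that can start a match of the set
   described by the first NBITS bits of BITMAP.  */
void
fill_fastmap (const unsigned char *bitmap, int nbits, bool negate,
              char fastmap[256])
{
  assert (nbits >= 0 && nbits <= 256);

  /* Each selected value can start a match in unibyte text.  */
  for (int j = 0; j < nbits; j++)
    if (bit_selected (bitmap, j, negate))
      fastmap[j] = 1;

  /* A selected raw byte can also start a match in multibyte text,
     where its two-byte form begins with the lead byte.  */
  for (int j = 0x80; j < nbits; j++)
    if (bit_selected (bitmap, j, negate))
      fastmap[raw_byte_lead (j)] = 1;
}
```

With a bitmap that selects only 0xF1 (the alternative [\xf1]), this model marks both 0xF1 and the assumed lead byte 0xC1 as possible first bytes, which is what lets the search start in either representation.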
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] 2019-06-28 12:41 ` bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] Mattias Engdegård @ 2019-06-28 13:03 ` Eli Zaretskii 2019-06-28 14:05 ` Mattias Engdegård 0 siblings, 1 reply; 17+ messages in thread From: Eli Zaretskii @ 2019-06-28 13:03 UTC (permalink / raw) To: Mattias Engdegård; +Cc: monnier, 3687 > From: Mattias Engdegård <mattiase@acm.org> > Date: Fri, 28 Jun 2019 14:41:51 +0200 > Cc: 3687@debbugs.gnu.org > > Let's assume the following semantics as desirable: > > 1. All characters and raw bytes (up to regexp syntax) match themselves no matter whether they are given as literals or in character alternatives. > 2. All raw bytes C match themselves and nothing else no matter whether the pattern or target string/buffer are unibyte or multibyte. > 3. Ranges from ASCII to raw bytes work as expected and do not contain Unicode characters above U+007F. > 4. Ranges from non-ASCII Unicode characters to raw bytes make no sense and are treated as empty. > > Here is a patch. Thanks. However, I don't want to look at the patch before we discuss and agree on the principles. So please consider expanding your principles to answer the following questions: 1. What do you mean by "raw bytes"? Is #xab a raw byte or a Unicode point U+00AB? IOW, how do we distinguish, in a regexp, between a raw byte and a character whose Unicode codepoint is that byte's value? And how does one go about concocting a regexp that matches raw bytes in a unibyte or multibyte buffer or string? 2. What is meant by "ranges from ASCII to raw bytes"? Which characters are included in such ranges? 3. If ranges from non-ASCII characters to raw bytes make no sense, how would one go about specifying a range that includes all the characters and raw bytes supported by Emacs? When we discuss these issues, let's please be on the same page regarding the handling of raw bytes in current Emacs. Specifically: . 
Raw bytes are internally treated as "characters" whose Unicode codepoints are in the range [#x3fff00..#x3fffff]. . The internal representation of raw bytes in buffers and strings uses 2-byte sequences that begin with #xc0 or #xc1. . Emacs jumps through hoops to never expose the above internals to the external world. Thus, any encoding of a string with raw bytes will convert them to their single-byte representation, where they are indistinguishable from the characters which have the same codepoints, and many operations other than encoding also silently perform these conversions. ^ permalink raw reply [flat|nested] 17+ messages in thread
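The representation described in the three points above can be modelled in a few lines of C. This is a sketch, not the macros from the Emacs sources: all function names are hypothetical, and the exact bit layout of the two-byte form is an assumption, chosen to agree with the "\xf1" and "\xc1\xb1" pair that the patch's tests treat as distinct.

```c
#include <assert.h>
#include <stdbool.h>

/* Map a raw byte (0x80..0xFF) to its internal "character" code.  */
int
raw_byte_to_char (unsigned char byte)
{
  assert (byte >= 0x80);
  return 0x3FFF00 + byte;
}

/* True if C lies in the part of the code space used for raw bytes.  */
bool
char_is_raw_byte (int c)
{
  return c >= 0x3FFF80 && c <= 0x3FFFFF;
}

/* Write the assumed two-byte internal form of raw byte BYTE to OUT.
   The lead byte is 0xC0 for 0x80..0xBF and 0xC1 for 0xC0..0xFF.  */
void
raw_byte_to_multibyte (unsigned char byte, unsigned char out[2])
{
  assert (byte >= 0x80);
  out[0] = (unsigned char) (0xC0 | ((byte >> 6) & 1));
  out[1] = (unsigned char) (0x80 | (byte & 0x3F));
}

/* Inverse of raw_byte_to_multibyte.  */
unsigned char
multibyte_to_raw_byte (const unsigned char in[2])
{
  assert (in[0] == 0xC0 || in[0] == 0xC1);
  return (unsigned char) (0x80 | ((in[0] & 1) << 6) | (in[1] & 0x3F));
}
```

Under this model the raw byte 0xF1 becomes the pair 0xC1 0xB1 in a multibyte string, while U+00F1 is an ordinary character with a different code, so the two never compare equal.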
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] 2019-06-28 13:03 ` Eli Zaretskii @ 2019-06-28 14:05 ` Mattias Engdegård 2019-06-28 14:40 ` Eli Zaretskii 2019-06-28 14:56 ` Eli Zaretskii 0 siblings, 2 replies; 17+ messages in thread From: Mattias Engdegård @ 2019-06-28 14:05 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, 3687 On 28 June 2019, at 15:03, Eli Zaretskii <eliz@gnu.org> wrote: > > However, I don't want to look at the patch before we discuss and agree > on the principles. A most sensible approach. > 1. What do you mean by "raw bytes"? Is #xab a raw byte or a Unicode > point U+00AB? IOW, how do we distinguish, in a regexp, between a > raw byte and a character whose Unicode codepoint is that byte's > value? And how does one go about concocting a regexp that matches > raw bytes in a unibyte or multibyte buffer or string? Sorry, I should have been clearer. The terminology in the manual is a bit muddled; in this case I mean the characters (or whatever you prefer calling them) obtained with hex or octal escapes in the range 128-255, such as "\xff" or "\377", regardless of the string's type (unibyte or multibyte). Unicode characters in the range 128-255 can be generated using the \u00HH or \U000000HH notations, or by just including them literally. They are distinct from raw bytes. To match raw bytes, just write them. They are not special in regexp syntax and need no escaping. > 2. What is meant by "ranges from ASCII to raw bytes"? Which > characters are included in such ranges? Ranges such as [A-\xb1] or [\000-\377], where the first endpoint is an ASCII character and the last endpoint is a raw byte as defined above. These should include all characters from the first endpoint up to and including ASCII 127, and all raw bytes from 128 to the last endpoint. This makes intuitive sense for unibyte strings where such an interval is contiguous in the underlying representation; extending them to multibyte is obvious. 
In fact, the existing regexp engine already works this way; I didn't need to change that at all. > 3. If ranges from non-ASCII characters to raw bytes make no sense, > how would one go about specifying a range that includes all the > characters and raw bytes supported by Emacs? "[\x00-\U0010ffff\x80-\xff]" "[^z-a]" (rx anything) etc. > . Raw bytes are internally treated as "characters" whose Unicode > codepoints are in the range [#x3fff00..#x3fffff]. > . The internal representation of raw bytes in buffers and strings > uses 2-byte sequences that begin with #xc0 or #xc1. > . Emacs jumps through hoops to never expose the above internals to > th external world. Thus, any encoding of a string with raw bytes > will convert them to their single-byte representation, where they > are indistinguishable from the characters which have the same > codepoints, and many operations other than encoding also > silently perform these conversions. This is also my understanding. The patch does not expose the internal representation of raw bytes. ^ permalink raw reply [flat|nested] 17+ messages in thread
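The ASCII-to-raw-byte range semantics described in this message can be written down as a small membership test. This is a sketch of the stated rule only, not of how the regexp engine implements it; the range_item type and in_ascii_to_raw_range are hypothetical, and the model assumes each element is tagged as either a raw byte or an ordinary character.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one element of the text being matched:
   either a raw byte (code 0x80..0xFF) or an ordinary character
   identified by its Unicode code point.  */
typedef struct
{
  bool is_raw_byte;
  int code;
} range_item;

/* Membership in a range such as [f-\xd3]: ASCII characters from
   FIRST_ASCII up to 127, plus raw bytes from 128 up to LAST_RAW.
   Non-ASCII Unicode characters are never members.  */
bool
in_ascii_to_raw_range (int first_ascii, int last_raw, range_item x)
{
  assert (first_ascii >= 0 && first_ascii <= 0x7F);
  assert (last_raw >= 0x80 && last_raw <= 0xFF);

  if (x.is_raw_byte)
    return x.code >= 0x80 && x.code <= last_raw;
  return x.code >= first_ascii && x.code <= 0x7F;
}
```

For the range [f-\xd3] used in the patch's tests, this rule accepts "q" and the raw byte \xbb, and rejects "e", the raw byte \xd4, and the Unicode characters Å (U+00C5) and ü (U+00FC).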
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] 2019-06-28 14:05 ` Mattias Engdegård @ 2019-06-28 14:40 ` Eli Zaretskii 2019-06-28 15:00 ` Mattias Engdegård 2019-06-28 14:56 ` Eli Zaretskii 1 sibling, 1 reply; 17+ messages in thread From: Eli Zaretskii @ 2019-06-28 14:40 UTC (permalink / raw) To: Mattias Engdegård; +Cc: monnier, 3687 > From: Mattias Engdegård <mattiase@acm.org> > Date: Fri, 28 Jun 2019 16:05:07 +0200 > Cc: mituharu@math.s.chiba-u.ac.jp, monnier@iro.umontreal.ca, > 3687@debbugs.gnu.org > > > 1. What do you mean by "raw bytes"? Is #xab a raw byte or a Unicode > > point U+00AB? IOW, how do we distinguish, in a regexp, between a > > raw byte and a character whose Unicode codepoint is that byte's > > value? And how does one go about concocting a regexp that matches > > raw bytes in a unibyte or multibyte buffer or string? > > Sorry, I should have been more clear. The terminology in the manual is a bit muddled; in this case I mean the characters (or whatever you prefer calling them) obtained with hex or octal escapes in the range 128-255, such as "\xff" or "\377", regardless of the string's type (unibyte or multibyte). > > Unicode characters in the range 128-255 can be generated using the \u00HH or \U000000HH notations, or by just including them literally. They are distinct from raw bytes. > > To match raw bytes, just write them. They are not special in regexp syntax and need no escaping. So this means \240 is no longer the same as NBSP and \300 is no longer the same as À? But \176 is still the same as ~? Doesn't this open a clear path for another bug report about inconsistencies in regexps? Also, which ways do you propose for specifying raw bytes? Only hex escapes? octal escapes as well? something else? > > 2. What is meant by "ranges from ASCII to raw bytes"? Which > > characters are included in such ranges? 
> > Ranges such as [A-\xb1] or [\000-\377], where the first endpoint is an ASCII character and the last endpoint is a raw byte as defined above. These should include all characters from the first endpoint up to and including ASCII 127, and all raw bytes from 128 to the last endpoint. This makes intuitive sense for unibyte strings where such an interval is contiguous in the underlying representation; extending them to multibyte is obvious. So you are saying that we will consider the raw bytes as if they followed ASCII characters in the lexicographical order? But non-ASCII characters whose codepoints start at 0x80? where are they in this order? > > 3. If ranges from non-ASCII characters to raw bytes make no sense, > > how would one go about specifying a range that includes all the > > characters and raw bytes supported by Emacs? > > "[\x00-\U0010ffff\x80-\xff]" This looks confusing, because to a naïve reader the first part already includes the second one. My point is that I'm afraid this proposal will replace one set of inconsistencies by another. I think the only way to avoid inconsistencies is to consider the likes of \177 mean different things depending on whether the text being matched is unibyte or multibyte. In particular, raw bytes in multibyte regexps should (if they are needed) be spelled out as #x3fff00, \17777400, etc. This, of course, has the disadvantage that one needs to know which text is being matched before one concocts the regexp. ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] 2019-06-28 14:40 ` Eli Zaretskii @ 2019-06-28 15:00 ` Mattias Engdegård 2019-06-28 16:20 ` Eli Zaretskii 0 siblings, 1 reply; 17+ messages in thread From: Mattias Engdegård @ 2019-06-28 15:00 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, 3687 On 28 June 2019, at 16:40, Eli Zaretskii <eliz@gnu.org> wrote: > > So this means \240 is no longer the same as NBSP and \300 is no longer > the same as À? But \176 is still the same as ~? This has been the case for quite a while; the patch does not change any of this. > So you are saying that we will consider the raw bytes as if they > followed ASCII characters in the lexicographical order? But non-ASCII > characters whose codepoints start at 0x80? where are they in this > order? Again, this is existing semantics and the patch does not change any of it. It sounds like you misunderstand the patch, which means that I have been bad at explaining it. It just fixes a few edge cases related to raw bytes in regexp matching. It does not attempt to change existing semantics, other than where they are clearly buggy, such as "\x9f" and "[\x9f]" not being equivalent regexps. ^ permalink raw reply [flat|nested] 17+ messages in thread
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
From: Eli Zaretskii @ 2019-06-28 16:20 UTC
  To: Mattias Engdegård; +Cc: monnier, 3687

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 28 Jun 2019 17:00:33 +0200
> Cc: mituharu@math.s.chiba-u.ac.jp, monnier@iro.umontreal.ca,
> 3687@debbugs.gnu.org
>
> On 28 June 2019 at 16:40, Eli Zaretskii <eliz@gnu.org> wrote:
> >
> > So this means \240 is no longer the same as NBSP and \300 is no longer
> > the same as À? But \176 is still the same as ~?
>
> This has been the case for quite a while; the patch does not change any of this.
>
> > So you are saying that we will consider the raw bytes as if they
> > followed ASCII characters in the lexicographical order? But non-ASCII
> > characters whose codepoints start at 0x80? where are they in this
> > order?
>
> Again, this is existing semantics and the patch does not change any of it.
>
> It sounds like you misunderstand the patch, which means that I have been bad at explaining it. It just fixes a few edge cases related to raw bytes in regexp matching. It does not attempt to change existing semantics, other than where they are clearly buggy, such as "\x9f" and "[\x9f]" not being equivalent regexps.

Maybe I did misunderstand: if the patch changes nothing fundamental,
then why did you need to precede it with "principles"?

But since you already pushed the change, I guess there's no reason to
discuss this, and I regret I replied.
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
From: Mattias Engdegård @ 2019-06-28 16:47 UTC
  To: Eli Zaretskii; +Cc: 3687-done, Stefan Monnier

On 28 June 2019 at 18:20, Eli Zaretskii <eliz@gnu.org> wrote:
>
> Maybe I did misunderstand: if the patch changes nothing fundamental,
> then why did you need to precede it with "principles"?

There was some discussion about these matters previously in the bug, so I thought I should state up-front what I based my interpretations of correct behaviour upon. In hindsight, this was a mistake -- sorry again.
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
From: Eli Zaretskii @ 2019-06-28 14:56 UTC
  To: Mattias Engdegård; +Cc: monnier, 3687

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 28 Jun 2019 16:05:07 +0200
> Cc: mituharu@math.s.chiba-u.ac.jp, monnier@iro.umontreal.ca,
> 3687@debbugs.gnu.org
>
> > 1. What do you mean by "raw bytes"? Is #xab a raw byte or a Unicode
> > point U+00AB? IOW, how do we distinguish, in a regexp, between a
> > raw byte and a character whose Unicode codepoint is that byte's
> > value? And how does one go about concocting a regexp that matches
> > raw bytes in a unibyte or multibyte buffer or string?
>
> Sorry, I should have been more clear. The terminology in the manual is a bit muddled; in this case I mean the characters (or whatever you prefer calling them) obtained with hex or octal escapes in the range 128-255, such as "\xff" or "\377", regardless of the string's type (unibyte or multibyte).
>
> Unicode characters in the range 128-255 can be generated using the \u00HH or \U000000HH notations, or by just including them literally. They are distinct from raw bytes.
>
> To match raw bytes, just write them. They are not special in regexp syntax and need no escaping.

And one more question about this part: if hex and octal escapes are
reserved for raw bytes, then what is \123456 and its ilk, i.e. octal
escapes whose values are above 255 decimal? Are they errors to be
signaled about?
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
From: Stefan Monnier @ 2019-06-28 15:18 UTC
  To: Eli Zaretskii; +Cc: Mattias Engdegård, 3687

>> To match raw bytes, just write them. They are not special in regexp syntax and need no escaping.

Right: the proposal has nothing to do with how raw-bytes are added into
strings or buffers.

> And one more question about this part: if hex and octal escapes are
> reserved for raw bytes, then what is \123456 and its ilk, i.e. octal
> escapes whose values are above 255 decimal? Are they errors to be
> signaled about?

This question is about how to write raw bytes in a string's printed
representation, which is orthogonal: we already have escape sequences
for that and Mattias doesn't propose any changes in this respect.

Stefan
* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
From: Mattias Engdegård @ 2019-06-28 15:34 UTC
  To: Stefan Monnier; +Cc: 3687

On 28 June 2019 at 17:18, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>
> This question is about how to write raw bytes in a string's printed
> representation, which is orthogonal: we already have escape sequences
> for that and Mattias doesn't propose any changes in this respect.

Thank you. It's just a bug fix; pushed as such. Sorry about the confusion.
Thread overview: 17+ messages (end of thread, newest 2019-06-28 16:47 UTC)
2009-06-26  9:56 bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps YAMAMOTO Mitsuharu
2009-06-26 13:43 ` Eli Zaretskii
2009-06-27  1:30 ` YAMAMOTO Mitsuharu
2009-06-27  9:36 ` Eli Zaretskii
2009-06-29  3:02 ` YAMAMOTO Mitsuharu
2009-06-29  8:47 ` Stefan Monnier
2009-07-24  1:08 ` YAMAMOTO Mitsuharu
2019-06-28 12:41 ` bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] Mattias Engdegård
2019-06-28 13:03 ` Eli Zaretskii
2019-06-28 14:05 ` Mattias Engdegård
2019-06-28 14:40 ` Eli Zaretskii
2019-06-28 15:00 ` Mattias Engdegård
2019-06-28 16:20 ` Eli Zaretskii
2019-06-28 16:47 ` Mattias Engdegård
2019-06-28 14:56 ` Eli Zaretskii
2019-06-28 15:18 ` Stefan Monnier
2019-06-28 15:34 ` Mattias Engdegård