bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
@ 2009-06-26  9:56 YAMAMOTO Mitsuharu
  2009-06-26 13:43 ` Eli Zaretskii
  2019-06-28 12:41 ` bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] Mattias Engdegård
  0 siblings, 2 replies; 17+ messages in thread
From: YAMAMOTO Mitsuharu @ 2009-06-26  9:56 UTC (permalink / raw)
  To: emacs-pretest-bug

The following results look inconsistent:

  (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80"))
  => 0
  (string-match (string-to-multibyte "\x80") "\x80")
  => nil

  (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80"))
  => nil
  (string-match (string-to-multibyte "[\x80]") "\x80")
  => 0

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp

In GNU Emacs 23.1.50.1 (sparc-sun-solaris2.8, X toolkit, Xaw3d scroll bars)
 of 2009-06-26 on church
Windowing system distributor `The X.Org Foundation', version 11.0.10402000
configured using `configure  'LDFLAGS=-L/usr/local/lib -R/usr/local/lib' 'CPPFLAGS=-I/usr/local/lib''

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: ja
  value of $XMODIFIERS: nil
  locale-coding-system: japanese-iso-8bit-unix
  default-enable-multibyte-characters: t

Major mode: Fundamental

Minor modes in effect:
  tooltip-mode: t
  tool-bar-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  blink-cursor-mode: t
  global-auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t






* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
  2009-06-26  9:56 bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps YAMAMOTO Mitsuharu
@ 2009-06-26 13:43 ` Eli Zaretskii
  2009-06-27  1:30   ` YAMAMOTO Mitsuharu
  2019-06-28 12:41 ` bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] Mattias Engdegård
  1 sibling, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2009-06-26 13:43 UTC (permalink / raw)
  To: YAMAMOTO Mitsuharu, 3687

> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST)
> From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp>
> Cc: 
> 
> The following results look inconsistent:
> 
>   (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80"))
>   => 0
>   (string-match (string-to-multibyte "\x80") "\x80")
>   => nil
> 
>   (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80"))
>   => nil
>   (string-match (string-to-multibyte "[\x80]") "\x80")
>   => 0

Please tell why you think they are inconsistent.  More importantly,
please show real-life examples of code or situations where this gets
in your way.  This area is full of subtleties and gotchas, and in
general the current code does what it does because it needs to cater
to many different practical situations.

There could still be bugs, of course.






* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
  2009-06-26 13:43 ` Eli Zaretskii
@ 2009-06-27  1:30   ` YAMAMOTO Mitsuharu
  2009-06-27  9:36     ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: YAMAMOTO Mitsuharu @ 2009-06-27  1:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 3687

>>>>> On Fri, 26 Jun 2009 16:43:25 +0300, Eli Zaretskii <eliz@gnu.org> said:

>> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST)
>> From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp>
>> Cc: 
>> 
>> The following results look inconsistent:
>> 
>> (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80"))
>> => 0
>> (string-match (string-to-multibyte "\x80") "\x80")
>> => nil
>> 
>> (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80"))
>> => nil
>> (string-match (string-to-multibyte "[\x80]") "\x80")
>> => 0

> Please tell why you think they are inconsistent.

I thought there's no room for argument about their inconsistency with
respect to the specification of "[...]" in regexps.

> More importantly, please show real-life examples of code or
> situations where this gets in your way.

If you decode some data containing invalid (undecodable) byte
sequences using a coding system such as utf-8, then such sequences are
embedded in the decoded result as eight-bit characters in multibyte
form.  You can detect particular such sequences by searching for a
"character alternative" regexp (or its multibyte form) in the decoded
result, provided that it works.

Further examples that look inconsistent:

  (string-match (string-to-multibyte "[\x80\x81]") (string-to-multibyte "\x80"))
  => nil
  (string-match (string-to-multibyte "[\x80-\xbf]") (string-to-multibyte "\x80"))
  => nil
  (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\x80"))
  => 0
  (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\xbf"))
  => 0
  (string-match (string-to-multibyte "[\x80-\xc0]") (string-to-multibyte "\xc0"))
  => nil

> This area is full of subtleties and gotchas, and in general the
> current code does what it does because it needs to cater to many
> different practical situations.

> There could still be bugs, of course.

Yeah.  I found another suspected bug in this area:

  (string-match "[[:unibyte:]]" "\x80")
  => nil
  (string-match "[[:unibyte:]]" (string-to-multibyte "\x80"))
  => nil

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp






* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
  2009-06-27  1:30   ` YAMAMOTO Mitsuharu
@ 2009-06-27  9:36     ` Eli Zaretskii
  2009-06-29  3:02       ` YAMAMOTO Mitsuharu
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2009-06-27  9:36 UTC (permalink / raw)
  To: YAMAMOTO Mitsuharu; +Cc: 3687

> Date: Sat, 27 Jun 2009 10:30:10 +0900
> From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp>
> Cc: 3687@emacsbugs.donarmstrong.com
> 
> >>>>> On Fri, 26 Jun 2009 16:43:25 +0300, Eli Zaretskii <eliz@gnu.org> said:
> 
> >> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST)
> >> From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp>
> >> Cc: 
> >> 
> >> The following results look inconsistent:
> >> 
> >> (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80"))
> >> => 0
> >> (string-match (string-to-multibyte "\x80") "\x80")
> >> => nil
> >> 
> >> (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80"))
> >> => nil
> >> (string-match (string-to-multibyte "[\x80]") "\x80")
> >> => 0
> 
> > Please tell why you think they are inconsistent.
> 
> I thought there's no room for argument about their inconsistency with
> respect to the specification of "[...]" in regexps.

Well, obviously there is such room.  Please consider explaining why
you think there's inconsistency.






* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
  2009-06-27  9:36     ` Eli Zaretskii
@ 2009-06-29  3:02       ` YAMAMOTO Mitsuharu
  2009-06-29  8:47         ` Stefan Monnier
  0 siblings, 1 reply; 17+ messages in thread
From: YAMAMOTO Mitsuharu @ 2009-06-29  3:02 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 3687

>>>>> On Sat, 27 Jun 2009 12:36:03 +0300, Eli Zaretskii <eliz@gnu.org> said:

>> Date: Sat, 27 Jun 2009 10:30:10 +0900
>> From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp>
>> Cc: 3687@emacsbugs.donarmstrong.com
>> 
>> >>>>> On Fri, 26 Jun 2009 16:43:25 +0300, Eli Zaretskii <eliz@gnu.org> said:
>> 
>> >> Date: Fri, 26 Jun 2009 18:56:50 +0900 (JST)
>> >> From: YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp>
>> >> Cc: 
>> >> 
>> >> The following results look inconsistent:
>> >> 
>> >> (string-match (string-to-multibyte "\x80") (string-to-multibyte "\x80"))
>> >> => 0
>> >> (string-match (string-to-multibyte "\x80") "\x80")
>> >> => nil
>> >> 
>> >> (string-match (string-to-multibyte "[\x80]") (string-to-multibyte "\x80"))
>> >> => nil
>> >> (string-match (string-to-multibyte "[\x80]") "\x80")
>> >> => 0
>> 
>> > Please tell why you think they are inconsistent.
>> 
>> I thought there's no room for argument about their inconsistency with
>> respect to the specification of "[...]" in regexps.

> Well, obviously there is such room.  Please consider explaining why
> you think there's inconsistency.

It seemed to be too obvious to explain and I hesitated to do that.
Anyway, I assume "C" and "[C]" work equivalently as regexps if the
character C has no special meaning in either context.

				    YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp






* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
  2009-06-29  3:02       ` YAMAMOTO Mitsuharu
@ 2009-06-29  8:47         ` Stefan Monnier
  2009-07-24  1:08           ` YAMAMOTO Mitsuharu
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Monnier @ 2009-06-29  8:47 UTC (permalink / raw)
  To: YAMAMOTO Mitsuharu; +Cc: 3687

> It seemed to be too obvious to explain and I hesitated to do that.
> Anyway, I assume "C" and "[C]" work equivalently as regexps if the
> character C has no special meaning in either context.

Yes, it's pretty obvious, thank you.
I haven't had time to look deeper, but that part of the code is pretty
nasty because it tries to be clever about the fact that values in the
range 128-255 can be either latin-1 chars or eight-bit bytes, and it
tries to be lenient about confusion between the two.
The behavior you see is clearly a bug.

        Stefan


* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
  2009-06-29  8:47         ` Stefan Monnier
@ 2009-07-24  1:08           ` YAMAMOTO Mitsuharu
  0 siblings, 0 replies; 17+ messages in thread
From: YAMAMOTO Mitsuharu @ 2009-07-24  1:08 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 3687

>>>>> On Mon, 29 Jun 2009 10:47:30 +0200, Stefan Monnier <monnier@iro.umontreal.ca> said:

>> It seemed to be too obvious to explain and I hesitated to do that.
>> Anyway, I assume "C" and "[C]" work equivalently as regexps if the
>> character C has no special meaning in either context.

> Yes, it's pretty obvious, thank you.  I haven't had time to look
> deeper, but that part of the code is pretty nasty because it tries
> to be clever about the fact that values in the range 128-255 can be
> either latin-1 chars or eight-bit bytes, and it tries to be lenient
> about confusion between the two.

Are there any written specifications explaining how the leniency is
supposed to work?

As for documentation, the description below in the elisp info
(Special Characters in Regular Expressions) probably needs to be
updated.

     The beginning and end of a range of multibyte characters must be in
     the same character set (*note Character Sets::).  Thus,
     `"[\x8e0-\x97c]"' is invalid because character 0x8e0 (`a' with
     grave accent) is in the Emacs character set for Latin-1 but the
     character 0x97c (`u' with diaeresis) is in the Emacs character set
     for Latin-2.  (We use Lisp string syntax to write that example,
     and a few others in the next few paragraphs, in order to include
     hex escape sequences in them.)

     If a range starts with a unibyte character C and ends with a
     multibyte character C2, the range is divided into two parts: one
     is `C..?\377', the other is `C1..C2', where C1 is the first
     character of the charset to which C2 belongs.

     You cannot always match all non-ASCII characters with the regular
     expression `"[\200-\377]"'.  This works when searching a unibyte
     buffer or string (*note Text Representations::), but not in a
     multibyte buffer or string, because many non-ASCII characters have
     codes above octal 0377.  However, the regular expression
     `"[^\000-\177]"' does match all non-ASCII characters (see below
     regarding `^'), in both multibyte and unibyte representations,
     because only the ASCII characters are excluded.

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp






* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
  2009-06-26  9:56 bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps YAMAMOTO Mitsuharu
  2009-06-26 13:43 ` Eli Zaretskii
@ 2019-06-28 12:41 ` Mattias Engdegård
  2019-06-28 13:03   ` Eli Zaretskii
  1 sibling, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2019-06-28 12:41 UTC (permalink / raw)
  To: mituharu, Stefan Monnier, Eli Zaretskii; +Cc: 3687

[-- Attachment #1: Type: text/plain, Size: 559 bytes --]

Let's assume the following semantics as desirable:

1. All characters and raw bytes (up to regexp syntax) match themselves no matter whether they are given as literals or in character alternatives.
2. All raw bytes C match themselves and nothing else no matter whether the pattern or target string/buffer are unibyte or multibyte.
3. Ranges from ASCII to raw bytes work as expected and do not contain Unicode characters above U+007F.
4. Ranges from non-ASCII Unicode characters to raw bytes make no sense and are treated as empty.

Here is a patch.


[-- Attachment #2: 0001-Correct-regexp-matching-of-raw-bytes.patch --]
[-- Type: application/octet-stream, Size: 9348 bytes --]

From 6683077bf5d9509abbae050e1aa4c3dddae1bba9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?= <mattiase@acm.org>
Date: Fri, 28 Jun 2019 10:20:55 +0200
Subject: [PATCH] Correct regexp matching of raw bytes

Make regexp matching of raw bytes work in all combinations of unibyte
and multibyte patterns and targets, as exact strings and in character
alternatives (bug#3687).

* src/regex-emacs.c (analyze_first):
Include raw byte in fastmap when pattern is a multibyte exact string.
Include leading byte in fastmap for raw bytes in character alternatives.
(re_match_2_internal):
Decrement the byte count by the number of bytes in the pattern character,
not 1.
* test/src/regex-emacs-tests.el (regexp-unibyte-unibyte)
(regexp-multibyte-unibyte, regexp-unibyte-multibyte)
(regexp-multibyte-multibyte): New tests.
---
 src/regex-emacs.c             |  24 +++++--
 test/src/regex-emacs-tests.el | 120 ++++++++++++++++++++++++++++++++++
 2 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index c353a78fb4..5887eaa30c 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -2794,6 +2794,7 @@ static int
 analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte)
 {
   int j, k;
+  int nbits;
   bool not;
 
   /* If all elements for base leading-codes in fastmap is set, this
@@ -2854,7 +2855,14 @@ analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte)
 		 each byte is a character.  Thus, this works in both
 		 cases. */
 	      fastmap[p[1]] = 1;
-	      if (! multibyte)
+	      if (multibyte)
+		{
+		  /* Cover the case of matching a raw char in a
+		     multibyte regexp against unibyte.	*/
+		  if (CHAR_BYTE8_HEAD_P (p[1]))
+		    fastmap[CHAR_TO_BYTE8 (STRING_CHAR (p + 1))] = 1;
+		}
+	      else
 		{
 		  /* For the case of matching this unibyte regex
 		     against multibyte, we must set a leading code of
@@ -2886,11 +2894,18 @@ analyze_first (re_char *p, re_char *pend, char *fastmap, bool multibyte)
 	case charset:
 	  if (!fastmap) break;
 	  not = (re_opcode_t) *(p - 1) == charset_not;
-	  for (j = CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH - 1, p++;
-	       j >= 0; j--)
+	  nbits = CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH;
+	  p++;
+	  for (j = 0; j < nbits; j++)
 	    if (!!(p[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH))) ^ not)
 	      fastmap[j] = 1;
 
+	  /* To match raw bytes (in the 80..ff range) against multibyte
+	     strings, add their leading bytes to the fastmap.  */
+	  for (j = 0x80; j < nbits; j++)
+	    if (!!(p[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH))) ^ not)
+	      fastmap[CHAR_LEADING_CODE (BYTE8_TO_CHAR (j))] = 1;
+
 	  if (/* Any leading code can possibly start a character
 		 which doesn't match the specified set of characters.  */
 	      not
@@ -4251,8 +4266,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp,
 		  }
 		p += pat_charlen;
 		d++;
+		mcnt -= pat_charlen;
 	      }
-	    while (--mcnt);
+	    while (mcnt > 0);
 
 	  break;
 
diff --git a/test/src/regex-emacs-tests.el b/test/src/regex-emacs-tests.el
index 0ae50c94d4..50ed3e870a 100644
--- a/test/src/regex-emacs-tests.el
+++ b/test/src/regex-emacs-tests.el
@@ -683,4 +683,124 @@ regex-tests-TESTS
   (should-not (string-match "\\`x\\{65535\\}" (make-string 65534 ?x)))
   (should-error (string-match "\\`x\\{65536\\}" "X") :type 'invalid-regexp))
 
+(ert-deftest regexp-unibyte-unibyte ()
+  "Test matching a unibyte regexp against a unibyte string."
+  ;; Sanity check
+  (should-not (multibyte-string-p "ab"))
+  (should-not (multibyte-string-p "\xff"))
+  ;; ASCII
+  (should (string-match "a[b]" "ab"))
+  ;; Raw
+  (should (string-match "\xf1" "\xf1"))
+  (should-not (string-match "\xf1" "\xc1\xb1"))
+  ;; Raw, char alt
+  (should (string-match "[\xf1]" "\xf1"))
+  (should-not (string-match "[\xf1]" "\xc1\xb1"))
+  ;; Raw range
+  (should (string-match "[\x82-\xd3]" "\xbb"))
+  (should-not (string-match "[\x82-\xd3]" "a"))
+  (should-not (string-match "[\x82-\xd3]" "\x81"))
+  (should-not (string-match "[\x82-\xd3]" "\xd4"))
+  ;; ASCII-raw range
+  (should (string-match "[f-\xd3]" "q"))
+  (should (string-match "[f-\xd3]" "\xbb"))
+  (should-not (string-match "[f-\xd3]" "e"))
+  (should-not (string-match "[f-\xd3]" "\xd4")))
+
+(ert-deftest regexp-multibyte-multibyte ()
+  "Test matching a multibyte regexp against a multibyte string."
+  ;; Sanity check
+  (should (multibyte-string-p "åü"))
+  ;; ASCII
+  (should (string-match (string-to-multibyte "a[b]")
+                        (string-to-multibyte "ab")))
+  ;; Unicode
+  (should (string-match "å[ü]z" "åüz"))
+  (should-not (string-match "ü" (string-to-multibyte "\xc3\xbc")))
+  ;; Raw
+  (should (string-match (string-to-multibyte "\xf1")
+                        (string-to-multibyte "\xf1")))
+  (should-not (string-match (string-to-multibyte "\xf1")
+                            (string-to-multibyte "\xc1\xb1")))
+  (should-not (string-match (string-to-multibyte "\xc1\xb1")
+                            (string-to-multibyte "\xf1")))
+  ;; Raw, char alt
+  (should (string-match (string-to-multibyte "[\xf1]")
+                        (string-to-multibyte "\xf1")))
+  ;; Raw range
+  (should (string-match (string-to-multibyte "[\x82-\xd3]")
+                        (string-to-multibyte "\xbb")))
+  (should-not (string-match (string-to-multibyte "[\x82-\xd3]") "a"))
+  (should-not (string-match (string-to-multibyte "[\x82-\xd3]") "Å"))
+  (should-not (string-match (string-to-multibyte "[\x82-\xd3]") "ü"))
+  (should-not (string-match (string-to-multibyte "[\x82-\xd3]") "\x81"))
+  (should-not (string-match (string-to-multibyte "[\x82-\xd3]") "\xd4"))
+  ;; ASCII-raw range: should exclude U+0100..U+10FFFF
+  (should (string-match (string-to-multibyte "[f-\xd3]")
+                        (string-to-multibyte "q")))
+  (should (string-match (string-to-multibyte "[f-\xd3]")
+                        (string-to-multibyte "\xbb")))
+  (should-not (string-match (string-to-multibyte "[f-\xd3]") "e"))
+  (should-not (string-match (string-to-multibyte "[f-\xd3]") "Å"))
+  (should-not (string-match (string-to-multibyte "[f-\xd3]") "ü"))
+  (should-not (string-match (string-to-multibyte "[f-\xd3]") "\xd4"))
+  ;; Unicode-raw range: should be empty
+  (should-not (string-match "[å-\xd3]" "å"))
+  (should-not (string-match "[å-\xd3]" (string-to-multibyte "\xd3")))
+  (should-not (string-match "[å-\xd3]" (string-to-multibyte "\xbb")))
+  (should-not (string-match "[å-\xd3]" "ü"))
+  ;; No equivalence between raw bytes and latin-1
+  (should-not (string-match "å" (string-to-multibyte "\xe5")))
+  (should-not (string-match "[å]" (string-to-multibyte "\xe5")))
+  (should-not (string-match "\xe5" "å"))
+  (should-not (string-match "[\xe5]" "å")))
+
+(ert-deftest regexp-unibyte-multibyte ()
+  "Test matching a unibyte regexp against a multibyte string."
+  ;; ASCII
+  (should (string-match "a[b]" (string-to-multibyte "ab")))
+  ;; Unicode
+  (should (string-match "a.[^b]c" (string-to-multibyte "aåüc")))
+  ;; Raw
+  (should (string-match "\xf1" (string-to-multibyte "\xf1")))
+  (should-not (string-match "\xc1\xb1" (string-to-multibyte "\xf1")))
+  ;; Raw, char alt
+  (should (string-match "[\xf1]" (string-to-multibyte "\xf1")))
+  (should-not (string-match "[\xc1][\xb1]" (string-to-multibyte "\xf1")))
+  ;; ASCII-raw range: should exclude U+0100..U+10FFFF
+  (should (string-match "[f-\xd3]" (string-to-multibyte "q")))
+  (should (string-match "[f-\xd3]" (string-to-multibyte "\xbb")))
+  (should-not (string-match "[f-\xd3]" "e"))
+  (should-not (string-match "[f-\xd3]" "Å"))
+  (should-not (string-match "[f-\xd3]" "ü"))
+  (should-not (string-match "[f-\xd3]" "\xd4"))
+  ;; No equivalence between raw bytes and latin-1
+  (should-not (string-match "\xe5" "å"))
+  (should-not (string-match "[\xe5]" "å")))
+
+(ert-deftest regexp-multibyte-unibyte ()
+  "Test matching a multibyte regexp against a unibyte string."
+  ;; ASCII
+  (should (string-match (string-to-multibyte "a[b]") "ab"))
+  ;; Unicode
+  (should (string-match "a[^ü]c" "abc"))
+  (should-not (string-match "ü" "\xc3\xbc"))
+  ;; Raw
+  (should (string-match (string-to-multibyte "\xf1") "\xf1"))
+  (should-not (string-match (string-to-multibyte "\xf1") "\xc1\xb1"))
+  ;; Raw, char alt
+  (should (string-match (string-to-multibyte "[\xf1]") "\xf1"))
+  (should-not (string-match (string-to-multibyte "[\xf1]") "\xc1\xb1"))
+  ;; ASCII-raw range: should exclude U+0100..U+10FFFF
+  (should (string-match (string-to-multibyte "[f-\xd3]") "q"))
+  (should (string-match (string-to-multibyte "[f-\xd3]") "\xbb"))
+  (should-not (string-match (string-to-multibyte "[f-\xd3]") "e"))
+  (should-not (string-match (string-to-multibyte "[f-\xd3]") "\xd4"))
+  ;; Unicode-raw range: should be empty
+  (should-not (string-match "[å-\xd3]" "\xd3"))
+  (should-not (string-match "[å-\xd3]" "\xbb"))
+  ;; No equivalence between raw bytes and latin-1
+  (should-not (string-match "å" "\xe5"))
+  (should-not (string-match "[å]" "\xe5")))
+
 ;;; regex-emacs-tests.el ends here
-- 
2.20.1 (Apple Git-117)



* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
  2019-06-28 12:41 ` bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] Mattias Engdegård
@ 2019-06-28 13:03   ` Eli Zaretskii
  2019-06-28 14:05     ` Mattias Engdegård
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2019-06-28 13:03 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: monnier, 3687

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 28 Jun 2019 14:41:51 +0200
> Cc: 3687@debbugs.gnu.org
> 
> Let's assume the following semantics as desirable:
> 
> 1. All characters and raw bytes (up to regexp syntax) match themselves no matter whether they are given as literals or in character alternatives.
> 2. All raw bytes C match themselves and nothing else no matter whether the pattern or target string/buffer are unibyte or multibyte.
> 3. Ranges from ASCII to raw bytes work as expected and do not contain Unicode characters above U+007F.
> 4. Ranges from non-ASCII Unicode characters to raw bytes make no sense and are treated as empty.
> 
> Here is a patch.

Thanks.

However, I don't want to look at the patch before we discuss and agree
on the principles.  So please consider expanding your principles to
answer the following questions:

 1. What do you mean by "raw bytes"?  Is #xab a raw byte or a Unicode
    point U+00AB?  IOW, how do we distinguish, in a regexp, between a
    raw byte and a character whose Unicode codepoint is that byte's
    value?  And how does one go about concocting a regexp that matches
    raw bytes in a unibyte or multibyte buffer or string?

 2. What is meant by "ranges from ASCII to raw bytes"?  Which
    characters are included in such ranges?

 3. If ranges from non-ASCII characters to raw bytes make no sense,
    how would one go about specifying a range that includes all the
    characters and raw bytes supported by Emacs?

When we discuss these issues, let's please be on the same page
regarding the handling of raw bytes in current Emacs.  Specifically:

  . Raw bytes are internally treated as "characters" whose Unicode
    codepoints are in the range [#x3fff00..#x3fffff].
  . The internal representation of raw bytes in buffers and strings
    uses 2-byte sequences that begin with #xc0 or #xc1.
  . Emacs jumps through hoops to never expose the above internals to
    the external world.  Thus, any encoding of a string with raw bytes
    will convert them to their single-byte representation, where they
    are indistinguishable from the characters which have the same
    codepoints, and many operations other than encoding also
    silently perform these conversions.
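
The first two points above can be sketched in C as follows.  This is
only an illustration, not code from the Emacs sources: the helper names
are hypothetical, and the bit layout of the 2-byte form is an
assumption based on the description above (a lead byte of #xc0 or #xc1
carrying bit 6 of the raw byte, followed by a trailing byte carrying
the low six bits).

```c
#include <assert.h>

/* Assumed offset between a raw byte and its internal "character" code,
   matching the #x3fff00..#x3fffff range described above.  */
#define RAW_BYTE_OFFSET 0x3FFF00

/* Hypothetical helper: map a raw byte (0x80..0xFF) to the internal
   character code that stands for it.  */
static int
raw_byte_to_char (int byte)
{
  assert (byte >= 0x80 && byte <= 0xFF);
  return byte + RAW_BYTE_OFFSET;
}

/* Hypothetical helper: write the assumed 2-byte multibyte form of raw
   byte BYTE into OUT.  The lead byte is 0xC0 or 0xC1 depending on
   bit 6 of BYTE; the trailing byte holds the low six bits.  */
static void
raw_byte_to_multibyte (int byte, unsigned char out[2])
{
  assert (byte >= 0x80 && byte <= 0xFF);
  out[0] = 0xC0 | ((byte >> 6) & 1);
  out[1] = 0x80 | (byte & 0x3F);
}

/* Hypothetical helper: recover the raw byte from its 2-byte form.  */
static int
multibyte_to_raw_byte (const unsigned char in[2])
{
  assert (in[0] == 0xC0 || in[0] == 0xC1);
  return 0x80 | ((in[0] & 1) << 6) | (in[1] & 0x3F);
}
```

Under these assumptions, raw bytes 0x80..0xBF all share the lead byte
#xc0 and raw bytes 0xC0..0xFF all share #xc1, which is why a fastmap
keyed on the first byte of a multibyte sequence needs only those two
entries to cover every raw byte.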






* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
  2019-06-28 13:03   ` Eli Zaretskii
@ 2019-06-28 14:05     ` Mattias Engdegård
  2019-06-28 14:40       ` Eli Zaretskii
  2019-06-28 14:56       ` Eli Zaretskii
  0 siblings, 2 replies; 17+ messages in thread
From: Mattias Engdegård @ 2019-06-28 14:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, 3687

28 juni 2019 kl. 15.03 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> However, I don't want to look at the patch before we discuss and agree
> on the principles.

A most sensible approach.

> 1. What do you mean by "raw bytes"?  Is #xab a raw byte or a Unicode
>    point U+00AB?  IOW, how do we distinguish, in a regexp, between a
>    raw byte and a character whose Unicode codepoint is that byte's
>    value?  And how does one go about concocting a regexp that matches
>    raw bytes in a unibyte or multibyte buffer or string?

Sorry, I should have been more clear. The terminology in the manual is a bit muddled; in this case I mean the characters (or whatever you prefer calling them) obtained with hex or octal escapes in the range 128-255, such as "\xff" or "\377", regardless of the string's type (unibyte or multibyte).

Unicode characters in the range 128-255 can be generated using the \u00HH or \U000000HH notations, or by just including them literally. They are distinct from raw bytes.

To match raw bytes, just write them. They are not special in regexp syntax and need no escaping.

> 2. What is meant by "ranges from ASCII to raw bytes"?  Which
>    characters are included in such ranges?

Ranges such as [A-\xb1] or [\000-\377], where the first endpoint is an ASCII character and the last endpoint is a raw byte as defined above. These should include all characters from the first endpoint up to and including ASCII 127, and all raw bytes from 128 to the last endpoint. This makes intuitive sense for unibyte strings where such an interval is contiguous in the underlying representation; extending them to multibyte is obvious.

In fact, the existing regexp engine already works this way; I didn't need to change that at all.
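
A minimal sketch of the range semantics described here, together with
rule 4 from the proposal, might look as follows in C.  It is a model
for discussion only, not the regex-emacs.c implementation: the function
names are hypothetical, and characters are assumed to be represented by
integer codes with ASCII at 0x00..0x7F, Unicode up to 0x10FFFF, and raw
byte B at 0x3FFF00 + B.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed offset of raw bytes in the internal character code space.  */
#define RAW_BYTE_OFFSET 0x3FFF00

/* Hypothetical helper: internal code of raw byte BYTE (0x80..0xFF).  */
static int
raw_byte_to_char (int byte)
{
  assert (byte >= 0x80 && byte <= 0xFF);
  return byte + RAW_BYTE_OFFSET;
}

static bool
is_ascii (int c)
{
  return c >= 0 && c <= 0x7F;
}

static bool
is_raw_byte (int c)
{
  return c >= RAW_BYTE_OFFSET + 0x80 && c <= RAW_BYTE_OFFSET + 0xFF;
}

/* Hypothetical helper: does character C belong to the range FROM-TO
   under the proposed semantics?  */
static bool
range_contains (int from, int to, int c)
{
  if (is_raw_byte (to) && !is_raw_byte (from))
    {
      /* Rule 4: a range from a non-ASCII Unicode character to a raw
	 byte is treated as empty.  */
      if (!is_ascii (from))
	return false;
      /* Rule 3: FROM up to ASCII 127, then raw bytes 0x80 up to TO,
	 skipping every Unicode character in between.  */
      return (is_ascii (c) && c >= from) || (is_raw_byte (c) && c <= to);
    }
  /* Any other range is an ordinary interval of character codes.  */
  return from <= c && c <= to;
}
```

With this model, `range_contains ('f', raw_byte_to_char (0xD3), 'q')`
is true, the same call with U+00C5 as the last argument is false, and
any range starting at U+00E5 and ending at a raw byte contains nothing,
which mirrors the expectations in the tests attached to the patch.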

> 3. If ranges from non-ASCII characters to raw bytes make no sense,
>    how would one go about specifying a range that includes all the
>    characters and raw bytes supported by Emacs?

"[\x00-\U0010ffff\x80-\xff]"
"[^z-a]"
(rx anything)
etc.

>  . Raw bytes are internally treated as "characters" whose Unicode
>    codepoints are in the range [#x3fff00..#x3fffff].
>  . The internal representation of raw bytes in buffers and strings
>    uses 2-byte sequences that begin with #xc0 or #xc1.
>  . Emacs jumps through hoops to never expose the above internals to
>    the external world.  Thus, any encoding of a string with raw bytes
>    will convert them to their single-byte representation, where they
>    are indistinguishable from the characters which have the same
>    codepoints, and many operations other than encoding also
>    silently perform these conversions.

This is also my understanding. The patch does not expose the internal representation of raw bytes.


* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
  2019-06-28 14:05     ` Mattias Engdegård
@ 2019-06-28 14:40       ` Eli Zaretskii
  2019-06-28 15:00         ` Mattias Engdegård
  2019-06-28 14:56       ` Eli Zaretskii
  1 sibling, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2019-06-28 14:40 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: monnier, 3687

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 28 Jun 2019 16:05:07 +0200
> Cc: mituharu@math.s.chiba-u.ac.jp, monnier@iro.umontreal.ca,
>         3687@debbugs.gnu.org
> 
> > 1. What do you mean by "raw bytes"?  Is #xab a raw byte or a Unicode
> >    point U+00AB?  IOW, how do we distinguish, in a regexp, between a
> >    raw byte and a character whose Unicode codepoint is that byte's
> >    value?  And how does one go about concocting a regexp that matches
> >    raw bytes in a unibyte or multibyte buffer or string?
> 
> Sorry, I should have been more clear. The terminology in the manual is a bit muddled; in this case I mean the characters (or whatever you prefer calling them) obtained with hex or octal escapes in the range 128-255, such as "\xff" or "\377", regardless of the string's type (unibyte or multibyte).
> 
> Unicode characters in the range 128-255 can be generated using the \u00HH or \U000000HH notations, or by just including them literally. They are distinct from raw bytes.
> 
> To match raw bytes, just write them. They are not special in regexp syntax and need no escaping.

So this means \240 is no longer the same as NBSP and \300 is no longer
the same as À?  But \176 is still the same as ~?  Doesn't this open a
clear path for another bug report about inconsistencies in regexps?

Also, which ways do you propose for specifying raw bytes?  Only hex
escapes? octal escapes as well? something else?

> > 2. What is meant by "ranges from ASCII to raw bytes"?  Which
> >    characters are included in such ranges?
> 
> Ranges such as [A-\xb1] or [\000-\377], where the first endpoint is an ASCII character and the last endpoint is a raw byte as defined above. These should include all characters from the first endpoint up to and including ASCII 127, and all raw bytes from 128 to the last endpoint. This makes intuitive sense for unibyte strings where such an interval is contiguous in the underlying representation; extending them to multibyte is obvious.

So you are saying that we will consider the raw bytes as if they
followed ASCII characters in the lexicographical order?  But what
about non-ASCII characters, whose codepoints start at 0x80?  Where are
they in this order?

> > 3. If ranges from non-ASCII characters to raw bytes make no sense,
> >    how would one go about specifying a range that includes all the
> >    characters and raw bytes supported by Emacs?
> 
> "[\x00-\U0010ffff\x80-\xff]"

This looks confusing, because to a naïve reader the first part already
includes the second one.

My point is that I'm afraid this proposal will replace one set of
inconsistencies by another.

I think the only way to avoid inconsistencies is to have the likes
of \177 mean different things depending on whether the text being
matched is unibyte or multibyte.  In particular, raw bytes in
multibyte regexps should (if they are needed) be spelled out as
#x3fff00, \17777400, etc.  This, of course, has the disadvantage that
one needs to know which text is being matched before one concocts the
regexp.
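
For illustration only, the two representations of a raw byte discussed
above can be inspected directly.  This is a minimal sketch, not part of
the patch: `my/raw-byte-representations' is a hypothetical helper name,
and it relies only on the standard functions `unibyte-string' and
`string-to-multibyte'.  In a multibyte string, raw byte N (128..255) is
stored as the character code #x3fff00 + N.

```elisp
;; Hypothetical helper (not part of Emacs): show how one raw byte is
;; represented in a unibyte string and in a multibyte string.
(defun my/raw-byte-representations (byte)
  "Return (UNIBYTE-CODE MULTIBYTE-CODE) for raw BYTE in the range 128..255."
  (let* ((uni (unibyte-string byte))          ; one-byte unibyte string
         (multi (string-to-multibyte uni)))   ; same byte, multibyte form
    (list (aref uni 0) (aref multi 0))))

(my/raw-byte-representations #x80)
;; => (128 4194176), that is, (#x80 #x3fff80)
```

The second number is what a regexp would have to spell out if raw bytes
in multibyte regexps were written by their internal code.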

^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
  2019-06-28 14:05     ` Mattias Engdegård
  2019-06-28 14:40       ` Eli Zaretskii
@ 2019-06-28 14:56       ` Eli Zaretskii
  2019-06-28 15:18         ` Stefan Monnier
  1 sibling, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2019-06-28 14:56 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: monnier, 3687

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 28 Jun 2019 16:05:07 +0200
> Cc: mituharu@math.s.chiba-u.ac.jp, monnier@iro.umontreal.ca,
>         3687@debbugs.gnu.org
> 
> > 1. What do you mean by "raw bytes"?  Is #xab a raw byte or a Unicode
> >    point U+00AB?  IOW, how do we distinguish, in a regexp, between a
> >    raw byte and a character whose Unicode codepoint is that byte's
> >    value?  And how does one go about concocting a regexp that matches
> >    raw bytes in a unibyte or multibyte buffer or string?
> 
> Sorry, I should have been more clear. The terminology in the manual is a bit muddled; in this case I mean the characters (or whatever you prefer calling them) obtained with hex or octal escapes in the range 128-255, such as "\xff" or "\377", regardless of the string's type (unibyte or multibyte).
> 
> Unicode characters in the range 128-255 can be generated using the \u00HH or \U000000HH notations, or by just including them literally. They are distinct from raw bytes.
> 
> To match raw bytes, just write them. They are not special in regexp syntax and need no escaping.

And one more question about this part: if hex and octal escapes are
reserved for raw bytes, then what is \123456 and its ilk, i.e. octal
escapes whose values are above 255 decimal?  Are they errors that
should be signaled?





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
  2019-06-28 14:40       ` Eli Zaretskii
@ 2019-06-28 15:00         ` Mattias Engdegård
  2019-06-28 16:20           ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Mattias Engdegård @ 2019-06-28 15:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, 3687

On 28 June 2019, at 16:40, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> So this means \240 is no longer the same as NBSP and \300 is no longer
> the same as À?  But \176 is still the same as ~?

This has been the case for quite a while; the patch does not change any of this.

> So you are saying that we will consider the raw bytes as if they
> followed ASCII characters in the lexicographical order?  But what
> about non-ASCII characters, whose codepoints start at 0x80?  Where are
> they in this order?

Again, this is existing semantics and the patch does not change any of it.

It sounds like you misunderstand the patch, which means that I have been bad at explaining it. It just fixes a few edge cases related to raw bytes in regexp matching. It does not attempt to change existing semantics, other than where they are clearly buggy, such as "\x9f" and "[\x9f]" not being equivalent regexps.
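
As a sketch of the kind of edge case meant here, the following checks
whether a bare raw byte and the same byte inside a bracket expression
behave identically as regexps, for every unibyte/multibyte combination
of pattern and target.  `my/bare-and-bracketed-agree-p' is a
hypothetical name used for illustration only.  On a build that contains
the fix one would expect it to return t; the result on older builds
depends on the version and is not asserted here.

```elisp
;; Hypothetical check (not part of the patch): do the regexps "B" and
;; "[B]" give the same match result for raw byte B in all four
;; unibyte/multibyte combinations of pattern and target?
(defun my/bare-and-bracketed-agree-p (byte)
  "Return t if \"B\" and \"[B]\" match identically for raw BYTE."
  (let* ((uni (unibyte-string byte))
         (multi (string-to-multibyte uni))
         (results nil))
    (dolist (pattern (list uni multi))
      (dolist (target (list uni multi))
        ;; Compare the bare pattern with its bracketed form.
        (push (equal (string-match-p pattern target)
                     (string-match-p (concat "[" pattern "]") target))
              results)))
    (not (memq nil results))))

(my/bare-and-bracketed-agree-p #x9f)
```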

^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
  2019-06-28 14:56       ` Eli Zaretskii
@ 2019-06-28 15:18         ` Stefan Monnier
  2019-06-28 15:34           ` Mattias Engdegård
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Monnier @ 2019-06-28 15:18 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Mattias Engdegård, 3687

>> To match raw bytes, just write them. They are not special in regexp syntax and need no escaping.

Right: the proposal has nothing to do with how raw-bytes are added into
strings or buffers.

> And one more question about this part: if hex and octal escapes are
> reserved for raw bytes, then what is \123456 and its ilk, i.e. octal
> escapes whose values are above 255 decimal?  Are they errors to be
> signaled about?

This question is about how to write raw bytes in a string's printed
representation, which is orthogonal: we already have escape sequences
for that and Mattias doesn't propose any changes in this respect.
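
To make the existing escape-sequence behaviour concrete, here is a
small sketch; `my/escape-kinds' is a hypothetical name.  A string
literal whose only non-ASCII content is a hex or octal escape below 256
is read as a unibyte string holding a raw byte, whereas a \u escape
yields a multibyte string holding the Unicode character.

```elisp
;; Hypothetical demo (not part of Emacs): how the reader treats hex,
;; octal and Unicode escapes with the same numeric value.
(defun my/escape-kinds ()
  "Return multibyteness of three literals, plus the code of U+00FF."
  (list (multibyte-string-p "\xff")     ; hex escape: unibyte, raw byte
        (multibyte-string-p "\377")     ; octal escape: unibyte, raw byte
        (multibyte-string-p "\u00ff")   ; Unicode escape: multibyte
        (aref "\u00ff" 0)))             ; character code of U+00FF

(my/escape-kinds)
;; => (nil nil t 255)
```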


        Stefan






^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
  2019-06-28 15:18         ` Stefan Monnier
@ 2019-06-28 15:34           ` Mattias Engdegård
  0 siblings, 0 replies; 17+ messages in thread
From: Mattias Engdegård @ 2019-06-28 15:34 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: 3687

On 28 June 2019, at 17:18, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> 
> This question is about how to write raw bytes in a string's printed
> representation, which is orthogonal: we already have escape sequences
> for that and Mattias doesn't propose any changes in this respect.

Thank you. It's just a bug fix; pushed as such.
Sorry about the confusion.






^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
  2019-06-28 15:00         ` Mattias Engdegård
@ 2019-06-28 16:20           ` Eli Zaretskii
  2019-06-28 16:47             ` Mattias Engdegård
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2019-06-28 16:20 UTC (permalink / raw)
  To: Mattias Engdegård; +Cc: monnier, 3687

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 28 Jun 2019 17:00:33 +0200
> Cc: mituharu@math.s.chiba-u.ac.jp, monnier@iro.umontreal.ca,
>         3687@debbugs.gnu.org
> 
> On 28 June 2019, at 16:40, Eli Zaretskii <eliz@gnu.org> wrote:
> > 
> > So this means \240 is no longer the same as NBSP and \300 is no longer
> > the same as À?  But \176 is still the same as ~?
> 
> This has been the case for quite a while; the patch does not change any of this.
> 
> > So you are saying that we will consider the raw bytes as if they
> > followed ASCII characters in the lexicographical order?  But what
> > about non-ASCII characters, whose codepoints start at 0x80?  Where are
> > they in this order?
> 
> Again, this is existing semantics and the patch does not change any of it.
> 
> It sounds like you misunderstand the patch, which means that I have been bad at explaining it. It just fixes a few edge cases related to raw bytes in regexp matching. It does not attempt to change existing semantics, other than where they are clearly buggy, such as "\x9f" and "[\x9f]" not being equivalent regexps.

Maybe I did misunderstand: if the patch changes nothing fundamental,
then why did you need to precede it with "principles"?

But since you already pushed the change, I guess there's no reason to
discuss this, and I regret I replied.





^ permalink raw reply	[flat|nested] 17+ messages in thread

* bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
  2019-06-28 16:20           ` Eli Zaretskii
@ 2019-06-28 16:47             ` Mattias Engdegård
  0 siblings, 0 replies; 17+ messages in thread
From: Mattias Engdegård @ 2019-06-28 16:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 3687-done, Stefan Monnier

On 28 June 2019, at 18:20, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> Maybe I did misunderstand: if the patch changes nothing fundamental,
> then why did you need to precede it with "principles"?

There was some discussion about these matters previously in the bug, so I thought I should state up-front what I based my interpretations of correct behaviour upon. In hindsight, this was a mistake -- sorry again.






^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2019-06-28 16:47 UTC | newest]

Thread overview: 17+ messages
2009-06-26  9:56 bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps YAMAMOTO Mitsuharu
2009-06-26 13:43 ` Eli Zaretskii
2009-06-27  1:30   ` YAMAMOTO Mitsuharu
2009-06-27  9:36     ` Eli Zaretskii
2009-06-29  3:02       ` YAMAMOTO Mitsuharu
2009-06-29  8:47         ` Stefan Monnier
2009-07-24  1:08           ` YAMAMOTO Mitsuharu
2019-06-28 12:41 ` bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH] Mattias Engdegård
2019-06-28 13:03   ` Eli Zaretskii
2019-06-28 14:05     ` Mattias Engdegård
2019-06-28 14:40       ` Eli Zaretskii
2019-06-28 15:00         ` Mattias Engdegård
2019-06-28 16:20           ` Eli Zaretskii
2019-06-28 16:47             ` Mattias Engdegård
2019-06-28 14:56       ` Eli Zaretskii
2019-06-28 15:18         ` Stefan Monnier
2019-06-28 15:34           ` Mattias Engdegård
