From: Mattias Engdegård
Newsgroups: gmane.emacs.bugs
Subject: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
Date: Fri, 28 Jun 2019 16:05:07 +0200
Message-ID: <6138515E-3202-437D-8341-7A8856AD0AE9@acm.org>
References: <831rzdj1z9.fsf@gnu.org>
In-Reply-To: <831rzdj1z9.fsf@gnu.org>
To: Eli Zaretskii
Cc: monnier@iro.umontreal.ca, 3687@debbugs.gnu.org

On 28 June 2019 at 15:03, Eli Zaretskii wrote:
>
> However, I don't want to look at the patch before we discuss and agree
> on the principles.

A most sensible approach.

> 1. What do you mean by "raw bytes"?  Is #xab a raw byte or a Unicode
>    point U+00AB?  IOW, how do we distinguish, in a regexp, between a
>    raw byte and a character whose Unicode codepoint is that byte's
>    value?  And how does one go about concocting a regexp that matches
>    raw bytes in a unibyte or multibyte buffer or string?

Sorry, I should have been clearer. The terminology in the manual is a
bit muddled; in this case I mean the characters (or whatever you prefer
calling them) obtained with hex or octal escapes in the range 128-255,
such as "\xff" or "\377", regardless of the string's type (unibyte or
multibyte).

Unicode characters in the range 128-255 can be generated using the
\u00HH or \U000000HH notations, or by just including them literally.
They are distinct from raw bytes.

To match raw bytes, just write them. They are not special in regexp
syntax and need no escaping.

> 2. What is meant by "ranges from ASCII to raw bytes"?  Which
>    characters are included in such ranges?

Ranges such as [A-\xb1] or [\000-\377], where the first endpoint is an
ASCII character and the last endpoint is a raw byte as defined above.
These should include all characters from the first endpoint up to and
including ASCII 127, and all raw bytes from 128 to the last endpoint.
This makes intuitive sense for unibyte strings, where such an interval
is contiguous in the underlying representation; extending them to
multibyte is obvious.

In fact, the existing regexp engine already works this way; I didn't
need to change that at all.

> 3. If ranges from non-ASCII characters to raw bytes make no sense,
>    how would one go about specifying a range that includes all the
>    characters and raw bytes supported by Emacs?

  "[\x00-\U0010ffff\x80-\xff]"
  "[^z-a]"
  (rx anything)

etc.

>  . Raw bytes are internally treated as "characters" whose Unicode
>    codepoints are in the range [#x3fff00..#x3fffff].
>  . The internal representation of raw bytes in buffers and strings
>    uses 2-byte sequences that begin with #xc0 or #xc1.
>  . Emacs jumps through hoops to never expose the above internals to
>    the external world.  Thus, any encoding of a string with raw bytes
>    will convert them to their single-byte representation, where they
>    are indistinguishable from the characters which have the same
>    codepoints, and many operations other than encoding also
>    silently perform these conversions.

This is also my understanding. The patch does not expose the internal
representation of raw bytes.
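
To make the range semantics from point 2 concrete, here is a small
Python model. It is only a sketch, not Emacs code and not the patch:
characters are modelled as plain integers, and the helper names
(raw_byte, in_ascii_to_raw_range) are hypothetical. It assumes that raw
byte B is represented internally by the code point #x3fff00 + B, which
is consistent with the range quoted above but is an assumption of this
sketch.

```python
# Hypothetical model of ASCII-to-raw-byte ranges; not Emacs code.
# Assumption: raw byte b is represented by code point 0x3FFF00 + b.
RAW_BYTE_BASE = 0x3FFF00


def raw_byte(b: int) -> int:
    # Return the modelled code point that stands for raw byte b.
    if not 0x80 <= b <= 0xFF:
        raise ValueError("raw bytes are in the range 128-255")
    return RAW_BYTE_BASE + b


def in_ascii_to_raw_range(ch: int, first: int, last_byte: int) -> bool:
    # Membership test for a range such as [A-\xb1]:
    # `first` is an ASCII code point, `last_byte` a raw byte value.
    if not 0 <= first <= 0x7F:
        raise ValueError("first endpoint must be ASCII")
    # ASCII part: from the first endpoint up to and including 127.
    if first <= ch <= 0x7F:
        return True
    # Raw-byte part: from raw byte 128 up to the last endpoint.
    return raw_byte(0x80) <= ch <= raw_byte(last_byte)
```

In this model, [A-\xb1] contains "A", DEL (127) and the raw bytes 128
through #xb1, but not the Unicode character U+00AB, which is a
different code point from the raw byte #xab.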