From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
Date: Fri, 28 Jun 2019 16:03:54 +0300
Message-ID: <831rzdj1z9.fsf@gnu.org>
To: Mattias Engdegård
Cc: monnier@iro.umontreal.ca, 3687@debbugs.gnu.org
In-reply-to: (message from Mattias Engdegård on Fri, 28 Jun 2019 14:41:51 +0200)

> From: Mattias Engdegård
> Date: Fri, 28 Jun 2019 14:41:51 +0200
> Cc: 3687@debbugs.gnu.org
>
> Let's assume the following semantics as desirable:
>
> 1. All characters and raw bytes (up to regexp syntax) match
>    themselves no matter whether they are given as literals or in
>    character alternatives.
> 2. All raw bytes C match themselves and nothing else no matter
>    whether the pattern or target string/buffer are unibyte or
>    multibyte.
> 3. Ranges from ASCII to raw bytes work as expected and do not
>    contain Unicode characters above U+007F.
> 4. Ranges from non-ASCII Unicode characters to raw bytes make no
>    sense and are treated as empty.
>
> Here is a patch.

Thanks.  However, I don't want to look at the patch before we discuss
and agree on the principles.  So please consider expanding your
principles to answer the following questions:

  1. What do you mean by "raw bytes"?  Is #xab a raw byte or the
     Unicode codepoint U+00AB?  IOW, how do we distinguish, in a
     regexp, between a raw byte and a character whose Unicode
     codepoint is that byte's value?  And how does one go about
     concocting a regexp that matches raw bytes in a unibyte or
     multibyte buffer or string?

  2. What is meant by "ranges from ASCII to raw bytes"?  Which
     characters are included in such ranges?

  3. If ranges from non-ASCII characters to raw bytes make no sense,
     how would one go about specifying a range that includes all the
     characters and raw bytes supported by Emacs?

When we discuss these issues, let's please be on the same page
regarding the handling of raw bytes in current Emacs.  Specifically:

  . Raw bytes are internally treated as "characters" whose Unicode
    codepoints are in the range [#x3fff00..#x3fffff].

  . The internal representation of raw bytes in buffers and strings
    uses 2-byte sequences that begin with #xc0 or #xc1.

  . Emacs jumps through hoops to never expose the above internals to
    the external world.  Thus, any encoding of a string with raw bytes
    will convert them to their single-byte representation, where they
    are indistinguishable from the characters which have the same
    codepoints, and many operations other than encoding also silently
    perform these conversions.
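
As a rough illustration of the three points above, here is a small
model in Python (not Emacs code).  The function names are invented for
this sketch.  The 2-byte layout follows what I understand the
BYTE8_STRING macro in Emacs's character.h to do, and limiting raw
bytes to #x80..#xff is likewise an assumption of the sketch, so treat
both as unverified against any particular Emacs version.

```python
# Illustrative model only; names are hypothetical, not Emacs APIs.
RAW_BYTE_BASE = 0x3FFF00  # raw byte B is modelled as "character" BASE + B


def raw_byte_to_char(byte: int) -> int:
    """Map a raw byte (assumed 0x80..0xFF) to its internal character code."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are modelled as raw bytes")
    return RAW_BYTE_BASE + byte


def char_to_raw_byte(char: int) -> int:
    """Inverse mapping: internal raw-byte character code back to the byte."""
    if not RAW_BYTE_BASE + 0x80 <= char <= RAW_BYTE_BASE + 0xFF:
        raise ValueError("not a raw-byte character code")
    return char - RAW_BYTE_BASE


def internal_bytes(byte: int) -> bytes:
    """Assumed 2-byte internal form: lead byte 0xC0 or 0xC1, then a trail byte."""
    raw_byte_to_char(byte)  # reuse the range check
    lead = 0xC0 | ((byte >> 6) & 0x01)
    trail = 0x80 | (byte & 0x3F)
    return bytes([lead, trail])


def external_byte(char: int) -> bytes:
    """On 'encoding', a raw-byte character collapses to its single byte."""
    return bytes([char_to_raw_byte(char)])


raw = raw_byte_to_char(0xAB)      # internal code for raw byte #xab
unicode_char = "\u00ab"           # the character U+00AB

# Internally the two differ: different codes, different byte sequences.
distinct_inside = internal_bytes(0xAB) != unicode_char.encode("utf-8")

# Externally (here: Latin-1 for U+00AB) both end up as the same byte.
same_outside = external_byte(raw) == unicode_char.encode("latin-1")
```

In this model `distinct_inside` and `same_outside` are both true,
which is the ambiguity behind question 1: once text leaves the
internal representation, the byte #xab no longer says whether it came
from a raw byte or from U+00AB.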