From mboxrd@z Thu Jan 1 00:00:00 1970
From: YAMAMOTO Mitsuharu
Newsgroups: gmane.emacs.bugs
Subject: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
Date: Fri, 24 Jul 2009 10:08:11 +0900
Organization: Faculty of Science, Chiba University
References: <200906260956.n5Q9uo917123@church.math.s.chiba-u.ac.jp> <83my7vyute.fsf@gnu.org> <83iqiiyq64.fsf@gnu.org>
Reply-To: YAMAMOTO Mitsuharu , 3687@emacsbugs.donarmstrong.com
To: Stefan Monnier
Cc: 3687@emacsbugs.donarmstrong.com

>>>>> On Mon, 29 Jun 2009 10:47:30 +0200, Stefan Monnier said:

>> It seemed to be too obvious to explain and I hesitated to do that.
>> Anyway, I assume "C" and "[C]" work equivalently as regexps if the
>> character C has no special meaning in either context.

> Yes, it's pretty obvious, thank you.  I haven't had time to look
> deeper, but that part of the code is pretty nasty because it tries
> to be clever about the fact that values between 128-256 can be
> either latin-1 chars and eight-bit-bytes and it tries to be lenient
> about confusion between the two.

Is there a written specification explaining how the leniency is
supposed to work?

As for documentation, the description below in the elisp info
(Special Characters in Regular Expressions) probably needs to be
updated.

     The beginning and end of a range of multibyte characters must be in
     the same character set (*note Character Sets::).  Thus,
     `"[\x8e0-\x97c]"' is invalid because character 0x8e0 (`a' with
     grave accent) is in the Emacs character set for Latin-1 but the
     character 0x97c (`u' with diaeresis) is in the Emacs character set
     for Latin-2.  (We use Lisp string syntax to write that example,
     and a few others in the next few paragraphs, in order to include
     hex escape sequences in them.)

     If a range starts with a unibyte character C and ends with a
     multibyte character C2, the range is divided into two parts: one
     is `C..?\377', the other is `C1..C2', where C1 is the first
     character of the charset to which C2 belongs.

     You cannot always match all non-ASCII characters with the regular
     expression `"[\200-\377]"'.  This works when searching a unibyte
     buffer or string (*note Text Representations::), but not in a
     multibyte buffer or string, because many non-ASCII characters have
     codes above octal 0377.  However, the regular expression
     `"[^\000-\177]"' does match all non-ASCII characters (see below
     regarding `^'), in both multibyte and unibyte representations,
     because only the ASCII characters are excluded.

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp
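
The last paragraph of the quoted manual text can be tried out with a
small sketch like the one below.  The helper name
`my-non-ascii-match-positions' and the two sample strings are
assumptions made up for illustration; they are not part of Emacs or of
the bug report.  The sketch only reports where each regexp first
matches, and deliberately claims no result for `"[\200-\377]"' against
a multibyte string, since that case is what this report is about.

```elisp
;; Hypothetical helper (not part of Emacs): report where each of the two
;; character classes from the manual text first matches in TEXT, or nil
;; if it does not match.  No particular result is claimed here for the
;; "[\200-\377]" case when TEXT is multibyte.
(defun my-non-ascii-match-positions (text)
  "Return a plist of first-match positions in TEXT for both regexps."
  (list :multibyte (multibyte-string-p text)
        :octal-range (string-match "[\200-\377]" text)
        :negated-ascii (string-match "[^\000-\177]" text)))

;; A multibyte sample: ASCII text, then U+00E9, then U+3042.
(my-non-ascii-match-positions (concat "abc" (string #xe9 #x3042)))

;; A unibyte sample containing the raw byte \351.
(my-non-ascii-match-positions "abc\351")
```

Evaluating the two calls side by side (for example with `M-:' or in
`*scratch*') makes it easy to compare the unibyte and multibyte cases
across Emacs versions.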