From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Michal Nazarewicz Newsgroups: gmane.emacs.bugs Subject: bug#24020: [PATCH] Fix =?UTF-8?Q?=E2=80=98[[:word:]]*\u2620=E2=80=99?= failing to match =?UTF-8?Q?=E2=80=98foo\u2620=E2=80=99?= Date: Mon, 18 Jul 2016 16:04:44 +0200 Message-ID: <1468850684-17867-1-git-send-email-mina86@mina86.com> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Trace: ger.gmane.org 1468850811 2067 80.91.229.3 (18 Jul 2016 14:06:51 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Mon, 18 Jul 2016 14:06:51 +0000 (UTC) To: 24020@debbugs.gnu.org Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Mon Jul 18 16:06:45 2016 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1bP9Bz-0006Pw-T7 for geb-bug-gnu-emacs@m.gmane.org; Mon, 18 Jul 2016 16:06:44 +0200 Original-Received: from localhost ([::1]:47604 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bP9Bz-0002zL-3q for geb-bug-gnu-emacs@m.gmane.org; Mon, 18 Jul 2016 10:06:43 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:44024) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bP9BN-0002bP-JM for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 10:06:10 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bP9BK-0008Ht-EE for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 10:06:04 -0400 Original-Received: from debbugs.gnu.org ([208.118.235.43]:43515) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bP9BK-0008Hp-AK for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 10:06:02 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1bP9BK-0007KT-4m for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 10:06:02 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Michal Nazarewicz Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Mon, 18 Jul 2016 14:06:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 24020 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: patch X-Debbugs-Original-To: bug-gnu-emacs@gnu.org Original-Received: via spool by submit@debbugs.gnu.org id=B.146885073228133 (code B ref -1); Mon, 18 Jul 2016 14:06:01 +0000 Original-Received: (at submit) by debbugs.gnu.org; 18 Jul 2016 14:05:32 +0000 Original-Received: from localhost ([127.0.0.1]:55852 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bP9Al-0007JZ-MS for submit@debbugs.gnu.org; Mon, 18 Jul 2016 10:05:31 -0400 Original-Received: from eggs.gnu.org ([208.118.235.92]:52871) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bP9Ah-0007JI-1G for submit@debbugs.gnu.org; Mon, 18 Jul 2016 10:05:26 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bP9AZ-00086X-S8 for submit@debbugs.gnu.org; Mon, 18 Jul 2016 10:05:17 -0400 Original-Received: from lists.gnu.org ([2001:4830:134:3::11]:46365) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bP9AZ-000867-PD for submit@debbugs.gnu.org; Mon, 18 Jul 2016 10:05:15 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:43768) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bP9AT-0001tv-3v for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 10:05:15 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bP9AN-0007y8-SZ for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 10:05:08 -0400 Original-Received: from mail-wm0-x235.google.com ([2a00:1450:400c:c09::235]:37997) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bP9AN-0007xv-FK for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 10:05:03 -0400 Original-Received: by mail-wm0-x235.google.com with SMTP id o80so118557409wme.1 for ; Mon, 18 Jul 2016 07:05:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=sender:from:to:subject:date:message-id:mime-version :content-transfer-encoding; bh=gj8lQHRmO27VzkOaKA84h29qN91LttkfA7ZPPHSnzMQ=; b=FZHaLYAD1k1xL/0wz2gDRMSYnhbXV4uNFjOYq0wD0J0KjNvx8ANQQlCtV2sR8HrYN9 EQm8YYFmvpGmUh06CTsjmkyDDNOJKp8Wh33ppSewZWttWU43H0rVSLbPP05EevZ8rp2K qUYuQLM4DRBFe0oFF8IxB4K7fTlVbXh95EM7wqdTV+2ErK2L68sw8b9/Hso7OFeSwTmL r5pwzu8KwJPm7zLh9GNCOm8AWqesZy0snIrgyvsG1Vx37WUj5iIzx21n3Mz4XkPEJ+4V rPpdgymnApPT6We8dIOW6qvhKZPBNTDSaKrM94CuQfRBPkLUp1DJc3U3uovebnu7Shbc evCQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:from:to:subject:date:message-id :mime-version:content-transfer-encoding; bh=gj8lQHRmO27VzkOaKA84h29qN91LttkfA7ZPPHSnzMQ=; b=gJJm2EUQ96Mr/6F05JGpXzhJfoda/EKWXdxDpXmxS7c7bZ7RWXnKRbRzT5qmGTbv8r gHAkp1ANwgw9v5r0I8k1x+M1oBo0+GOOyKf1tD6DDScpdFkRI8TsU3rEnqMIAI6fNseP FDNK0bjEKvWLeVR0D0xaPdjQ9VFUnObG1bNXpazF7WBFmtCb9OLPzjdBXIfaJQ3Wa+2X 3wBn7L5sKDZ4oWVzswIdKwUx2HlF9U+ipFxRG2TDyzq4peUSIjofqqtX/Xk8N/kgKzhw T27chptcr3eNhhkd4gZA9BFcSDsG/GNGQqtOG8+zt4P/08VNOyzV1PT5b4AZ9/a1XGU5 jCCA== X-Gm-Message-State: ALyK8tJJbC3Pcrblo2cfC9GnaMuSEWw5aFLcEMWZBnLMPgmWHjVyPn1rL+MgMyuIjZdJN2fj X-Received: by 10.194.222.230 with SMTP id qp6mr1419132wjc.102.1468850701839; Mon, 18 Jul 2016 07:05:01 -0700 (PDT) Original-Received: from mpn.zrh.corp.google.com ([172.16.113.135]) by smtp.gmail.com with ESMTPSA id z5sm16880107wme.5.2016.07.18.07.04.59 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 18 Jul 2016 07:05:00 -0700 (PDT) Original-Received: by mpn.zrh.corp.google.com (Postfix, from userid 126942) id 2CA881E0270; Mon, 18 Jul 2016 16:04:59 +0200 (CEST) X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 208.118.235.43 X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Original-Sender: "bug-gnu-emacs" Xref: news.gmane.org gmane.emacs.bugs:121215 Archived-At: mutually_exclusive_p did not check for the claass bits of an charset opcode when comparing it with an exactn which resulted in situation where it thought a multibyte character could not match the character class. This assumption caused incorrect optimisation of the regular expression and eventually failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’. The issue affects all multibyte word characters as well as other character classes which may match multibyte characters. * src/regex.c (executing_charset): A new function for executing the charset and charset_not opcodes. It performs check on the character taking into consideration existing bitmap, rang table and class bits. It also advances the pointer in the regex bytecode past the parsed opcode. (CHARSET_LOOKUP_RANGE_TABLE_RAW, CHARSET_LOOKUP_RANGE_TABLE): Removed. Code now included in executing_charset. (mutually_exclusive_p, re_match_2_internal): Changed to take advantage of executing_charset function. * test/src/regex-tests.el: New file with tests for the character class matching. --- Unless there are objections I’ll push it within a week or so. src/regex.c | 209 +++++++++++++++++++++--------------------------- test/src/regex-tests.el | 75 +++++++++++++++++ 2 files changed, 168 insertions(+), 116 deletions(-) create mode 100644 test/src/regex-tests.el diff --git a/src/regex.c b/src/regex.c index f92bcb7..9f999a7 100644 --- a/src/regex.c +++ b/src/regex.c @@ -783,44 +783,6 @@ extract_number_and_incr (re_char **source) and end. */ #define CHARSET_RANGE_TABLE_END(range_table, count) \ ((range_table) + (count) * 2 * 3) - -/* Test if C is in RANGE_TABLE. A flag NOT is negated if C is in. - COUNT is number of ranges in RANGE_TABLE. */ -#define CHARSET_LOOKUP_RANGE_TABLE_RAW(not, c, range_table, count) \ - do \ - { \ - re_wchar_t range_start, range_end; \ - re_char *rtp; \ - re_char *range_table_end \ - = CHARSET_RANGE_TABLE_END ((range_table), (count)); \ - \ - for (rtp = (range_table); rtp < range_table_end; rtp += 2 * 3) \ - { \ - EXTRACT_CHARACTER (range_start, rtp); \ - EXTRACT_CHARACTER (range_end, rtp + 3); \ - \ - if (range_start <= (c) && (c) <= range_end) \ - { \ - (not) = !(not); \ - break; \ - } \ - } \ - } \ - while (0) - -/* Test if C is in range table of CHARSET. The flag NOT is negated if - C is listed in it. */ -#define CHARSET_LOOKUP_RANGE_TABLE(not, c, charset) \ - do \ - { \ - /* Number of ranges in range table. */ \ - int count; \ - re_char *range_table = CHARSET_RANGE_TABLE (charset); \ - \ - EXTRACT_NUMBER_AND_INCR (count, range_table); \ - CHARSET_LOOKUP_RANGE_TABLE_RAW ((not), (c), range_table, count); \ - } \ - while (0) /* If DEBUG is defined, Regex prints many voluminous messages about what it is doing (if the variable `debug' is nonzero). If linked with the @@ -4661,6 +4623,93 @@ skip_noops (const_re_char *p, const_re_char *pend) return p; } +/* Test if C matches charset op. *PP points to the charset or chraset_not + opcode. When the function finishes, *PP will be advanced past that opcode. + C is character to test (possibly after translations) and CORIG is original + character (i.e. without any translations). UNIBYTE denotes whether c is + unibyte or multibyte character. */ +static bool +execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte) +{ + re_char *p = *pp, *rtp = NULL; + bool not = (re_opcode_t) *p == charset_not; + + if (CHARSET_RANGE_TABLE_EXISTS_P (p)) + { + int count; + rtp = CHARSET_RANGE_TABLE (p); + EXTRACT_NUMBER_AND_INCR (count, rtp); + *pp = CHARSET_RANGE_TABLE_END ((rtp), (count)); + } + else + *pp += 2 + CHARSET_BITMAP_SIZE (p); + + if (unibyte && c < (1 << BYTEWIDTH)) + { /* Lookup bitmap. */ + /* Cast to `unsigned' instead of `unsigned char' in + case the bit list is a full 32 bytes long. */ + if (c < (unsigned) (CHARSET_BITMAP_SIZE (p) * BYTEWIDTH) + && p[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) + return !not; + } +#ifdef emacs + else if (rtp) + { + int class_bits = CHARSET_RANGE_TABLE_BITS (p); + re_wchar_t range_start, range_end; + + /* Sort tests by the most commonly used classes with some adjustment to which + tests are easiest to perform. Frequencies of character class names as of + 2016-07-15: + + $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + | + sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr + 213 [:alnum:] + 104 [:alpha:] + 62 [:space:] + 39 [:digit:] + 36 [:blank:] + 26 [:upper:] + 24 [:word:] + 21 [:lower:] + 10 [:punct:] + 10 [:ascii:] + 9 [:xdigit:] + 4 [:nonascii:] + 4 [:graph:] + 2 [:print:] + 2 [:cntrl:] + 1 [:ff:] + */ + + if ((class_bits & BIT_MULTIBYTE) || + (class_bits & BIT_ALNUM && ISALNUM (c)) || + (class_bits & BIT_ALPHA && ISALPHA (c)) || + (class_bits & BIT_SPACE && ISSPACE (c)) || + (class_bits & BIT_WORD && ISWORD (c)) || + ((class_bits & BIT_UPPER) && + (ISUPPER (c) || (corig != c && + c == downcase (corig) && ISLOWER (c)))) || + ((class_bits & BIT_LOWER) && + (ISLOWER (c) || (corig != c && + c == upcase (corig) && ISUPPER(c)))) || + (class_bits & BIT_PUNCT && ISPUNCT (c)) || + (class_bits & BIT_GRAPH && ISGRAPH (c)) || + (class_bits & BIT_PRINT && ISPRINT (c))) + return !not; + + for (p = *pp; rtp < p; rtp += 2 * 3) + { + EXTRACT_CHARACTER (range_start, rtp); + EXTRACT_CHARACTER (range_end, rtp + 3); + if (range_start <= c && c <= range_end) + return !not; + } + } +#endif /* emacs */ + return not; +} + /* Non-zero if "p1 matches something" implies "p2 fails". */ static int mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1, @@ -4718,22 +4767,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1, else if ((re_opcode_t) *p1 == charset || (re_opcode_t) *p1 == charset_not) { - int not = (re_opcode_t) *p1 == charset_not; - - /* Test if C is listed in charset (or charset_not) - at `p1'. */ - if (! multibyte || IS_REAL_ASCII (c)) - { - if (c < CHARSET_BITMAP_SIZE (p1) * BYTEWIDTH - && p1[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) - not = !not; - } - else if (CHARSET_RANGE_TABLE_EXISTS_P (p1)) - CHARSET_LOOKUP_RANGE_TABLE (not, c, p1); - - /* `not' is equal to 1 if c would match, which means - that we can't change to pop_failure_jump. */ - if (!not) + if (!execute_charset (&p1, c, c, !multibyte)) { DEBUG_PRINT (" No match => fast loop.\n"); return 1; @@ -5439,32 +5473,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1, case charset_not: { register unsigned int c, corig; - boolean not = (re_opcode_t) *(p - 1) == charset_not; int len; - /* Start of actual range_table, or end of bitmap if there is no - range table. */ - re_char *range_table UNINIT; - - /* Nonzero if there is a range table. */ - int range_table_exists; - - /* Number of ranges of range table. This is not included - in the initial byte-length of the command. */ - int count = 0; - /* Whether matching against a unibyte character. */ boolean unibyte_char = false; - DEBUG_PRINT ("EXECUTING charset%s.\n", not ? "_not" : ""); - - range_table_exists = CHARSET_RANGE_TABLE_EXISTS_P (&p[-1]); - - if (range_table_exists) - { - range_table = CHARSET_RANGE_TABLE (&p[-1]); /* Past the bitmap. */ - EXTRACT_NUMBER_AND_INCR (count, range_table); - } + DEBUG_PRINT ("EXECUTING charset%s.\n", + (re_opcode_t) *(p - 1) == charset_not ? "_not" : ""); PREFETCH (); corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte); @@ -5498,47 +5513,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1, unibyte_char = true; } - if (unibyte_char && c < (1 << BYTEWIDTH)) - { /* Lookup bitmap. */ - /* Cast to `unsigned' instead of `unsigned char' in - case the bit list is a full 32 bytes long. */ - if (c < (unsigned) (CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH) - && p[1 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) - not = !not; - } -#ifdef emacs - else if (range_table_exists) - { - int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]); - - if ( (class_bits & BIT_LOWER - && (ISLOWER (c) - || (corig != c - && c == upcase (corig) && ISUPPER(c)))) - | (class_bits & BIT_MULTIBYTE) - | (class_bits & BIT_PUNCT && ISPUNCT (c)) - | (class_bits & BIT_SPACE && ISSPACE (c)) - | (class_bits & BIT_UPPER - && (ISUPPER (c) - || (corig != c - && c == downcase (corig) && ISLOWER (c)))) - | (class_bits & BIT_WORD && ISWORD (c)) - | (class_bits & BIT_ALPHA && ISALPHA (c)) - | (class_bits & BIT_ALNUM && ISALNUM (c)) - | (class_bits & BIT_GRAPH && ISGRAPH (c)) - | (class_bits & BIT_PRINT && ISPRINT (c))) - not = !not; - else - CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count); - } -#endif /* emacs */ - - if (range_table_exists) - p = CHARSET_RANGE_TABLE_END (range_table, count); - else - p += CHARSET_BITMAP_SIZE (&p[-1]) + 1; - - if (!not) goto fail; + p -= 1; + if (!execute_charset (&p, c, corig, unibyte_char)) + goto fail; d += len; } diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el new file mode 100644 index 0000000..a2dd4f0 --- /dev/null +++ b/test/src/regex-tests.el @@ -0,0 +1,75 @@ +;;; buffer-tests.el --- tests for regex.c functions -*- lexical-binding: t -*- + +;; Copyright (C) 2015-2016 Free Software Foundation, Inc. + +;; This file is part of GNU Emacs. + +;; GNU Emacs is free software: you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; GNU Emacs is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with GNU Emacs. If not, see . + +;;; Code: + +(require 'ert) + +(ert-deftest regex-word-cc-fallback-test () + (dolist (class '("[[:word:]]" "\\sw")) + (dolist (repeat '("*" "+")) + (dolist (suffix '("" "b" "bar" "\u2620")) + (should (string-match (concat "^" class repeat suffix "$") + (concat "foo" suffix))))))) + +(defun regex--test-cc (name matching not-matching) + (should (string-match-p (concat "^[[:" name ":]]*$") matching)) + (should (string-match-p (concat "^[[:" name ":]]*?\u2622$") + (concat matching "\u2622"))) + (should (string-match-p (concat "^[^[:" name ":]]*$") not-matching)) + (should (string-match-p (concat "^[^[:" name ":]]*\u2622$") + (concat not-matching "\u2622"))) + (with-temp-buffer + (insert matching) + (let ((p (point))) + (insert not-matching) + (goto-char (point-min)) + (skip-chars-forward (concat "[:" name ":]")) + (should (equal (point) p)) + (skip-chars-forward (concat "^[:" name ":]")) + (should (equal (point) (point-max))) + (goto-char (point-min)) + (skip-chars-forward (concat "[:" name ":]\u2622")) + (should (or (equal (point) p) (equal (point) (1+ p))))))) + +(ert-deftest regex-character-classes () + (let (case-fold-search) + (regex--test-cc "alnum" "abcABC012łąka" "-, \t\n") + (regex--test-cc "alpha" "abcABCłąka" "-,012 \t\n") + (regex--test-cc "digit" "012" "abcABCłąka-, \t\n") + (regex--test-cc "xdigit" "0123aBc" "łąk-, \t\n") + (regex--test-cc "upper" "ABCŁĄKA" "abc012-, \t\n") + (regex--test-cc "lower" "abcłąka" "ABC012-, \t\n") + + (regex--test-cc "word" "abcABC012\u2620" "-, \t\n") + + (regex--test-cc "punct" ".,-" "abcABC012\u2620 \t\n") + (regex--test-cc "cntrl" "\1\2\t\n" ".,-abcABC012\u2620 ") + (regex--test-cc "graph" "abcłąka\u2620-," " \t\n\1") + (regex--test-cc "print" "abcłąka\u2620-, " "\t\n\1") + + (regex--test-cc "space" " \t\n\u2001" "abcABCł0123") + (regex--test-cc "blank" " \t" "\n\u2001") + + (regex--test-cc "ascii" "abcABC012 \t\n\1" "łą\u2620") + (regex--test-cc "nonascii" "łą\u2622" "abcABC012 \t\n\1") + (regex--test-cc "unibyte" "abcABC012 \t\n\1" "łą\u2622") + (regex--test-cc "multibyte" "łą\u2622" "abcABC012 \t\n\1"))) + +;;; buffer-tests.el ends here -- 2.8.0.rc3.226.g39d4020