From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Nazarewicz
Newsgroups: gmane.emacs.bugs
Subject: bug#24603: [RFC 18/18] Fix case-fold-search character class matching
Date: Tue, 4 Oct 2016 03:10:41 +0200
Message-ID: <1475543441-10493-18-git-send-email-mina86@mina86.com>
References: <1475543441-10493-1-git-send-email-mina86@mina86.com>
In-Reply-To: <1475543441-10493-1-git-send-email-mina86@mina86.com>
To: 24603@debbugs.gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020

The upper and lower character classes should match any cased characters
when case-fold-search is enabled. So ‘[[:upper:]]’ should match ‘a’ but
also ‘ł’, ‘ß’ and ‘ﬁ’. Fix character class tests to make that happen.

* src/character.h (CHAR_BIT_TITLE): New character bit for title case
characters (such as ǲ).
* src/character.c (category_char_bits): Characters in Lt category are
title case; update lookup table.
* src/regex.c (re_wctype_to_bit): When case-folding is enabled return
any-case bits pattern for RECC_LOWER and RECC_UPPER.
(regex_compile): Update re_wctype_to_bit calls (it has new argument).
(execute_charset): Simplify case-folding case since now it’s encoded in
the bits. corig argument is no longer necessary.
(mutually_exclusive_p, re_match_2_internal): Update execute_charset
(it no longer has corig argument).
* test/src/regex-tests.el (regex-tests--letter-character-classes): Fix
case-fold letter matching.
---
 src/character.c         |  2 +-
 src/character.h         |  5 +++--
 src/regex.c             | 53 ++++++++++++++++++++-----------------------------
 test/src/regex-tests.el | 16 +++++-----------
 4 files changed, 30 insertions(+), 46 deletions(-)

diff --git a/src/character.c b/src/character.c
index 63f89d3..cf42f30 100644
--- a/src/character.c
+++ b/src/character.c
@@ -979,7 +979,7 @@ const unsigned char category_char_bits[] = {
   [UNICODE_CATEGORY_UNKNOWN] = 0,
   [UNICODE_CATEGORY_Lu] = CHAR_BIT_ALPHA_ | CHAR_BIT_UPPER,
   [UNICODE_CATEGORY_Ll] = CHAR_BIT_ALPHA_ | CHAR_BIT_LOWER,
-  [UNICODE_CATEGORY_Lt] = CHAR_BIT_ALPHA_,
+  [UNICODE_CATEGORY_Lt] = CHAR_BIT_ALPHA_ | CHAR_BIT_TITLE,
   [UNICODE_CATEGORY_Lm] = CHAR_BIT_ALPHA_,
   [UNICODE_CATEGORY_Lo] = CHAR_BIT_ALPHA_,
   [UNICODE_CATEGORY_Mn] = CHAR_BIT_ALPHA_,
diff --git a/src/character.h b/src/character.h
index 6dc95ad..f2849e5 100644
--- a/src/character.h
+++ b/src/character.h
@@ -665,8 +665,9 @@ extern unicode_category_t char_unicode_category (int);
 #define CHAR_BIT_ALPHA (1 << 1)
 #define CHAR_BIT_UPPER (1 << 2)
 #define CHAR_BIT_LOWER (1 << 3)
-#define CHAR_BIT_GRAPH (1 << 4)
-#define CHAR_BIT_PRINT (1 << 5)
+#define CHAR_BIT_TITLE (1 << 4)
+#define CHAR_BIT_GRAPH (1 << 5)
+#define CHAR_BIT_PRINT (1 << 6)
 
 /* Map from Unicode general category to character classes the character is in.
  *
diff --git a/src/regex.c b/src/regex.c
index bfd04a1..aa8c6ef 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -1794,6 +1794,7 @@ struct range_table_work_area
 # define BIT_ALPHA CHAR_BIT_ALPHA
 # define BIT_UPPER CHAR_BIT_UPPER
 # define BIT_LOWER CHAR_BIT_LOWER
+# define BIT_TITLE CHAR_BIT_TITLE
 # define BIT_GRAPH CHAR_BIT_GRAPH
 # define BIT_PRINT CHAR_BIT_PRINT
 #else
@@ -1801,8 +1802,9 @@ struct range_table_work_area
 # define BIT_ALPHA (1 << 1)
 # define BIT_UPPER (1 << 2)
 # define BIT_LOWER (1 << 3)
-# define BIT_GRAPH (1 << 4)
-# define BIT_PRINT (1 << 5)
+# define BIT_TITLE (1 << 4)
+# define BIT_GRAPH (1 << 5)
+# define BIT_PRINT (1 << 6)
 #endif
 #define BIT_WORD (BIT_PRINT << 1)
 #define BIT_PUNCT (BIT_PRINT << 2)
@@ -2067,7 +2069,7 @@ re_iswctype (int ch, re_wctype_t cc)
 /* Return a bit-pattern to use in the range-table bits to match multibyte
    chars of class CC. */
 static int
-re_wctype_to_bit (re_wctype_t cc)
+re_wctype_to_bit (re_wctype_t cc, bool case_fold)
 {
   switch (cc)
     {
@@ -2076,8 +2078,10 @@ re_wctype_to_bit (re_wctype_t cc)
     case RECC_ALPHA: return BIT_ALPHA;
     case RECC_ALNUM: return BIT_ALNUM;
     case RECC_WORD: return BIT_WORD;
-    case RECC_LOWER: return BIT_LOWER;
-    case RECC_UPPER: return BIT_UPPER;
+    case RECC_LOWER:
+      return case_fold ? BIT_LOWER | BIT_UPPER | BIT_TITLE : BIT_LOWER;
+    case RECC_UPPER:
+      return case_fold ? BIT_LOWER | BIT_UPPER | BIT_TITLE : BIT_UPPER;
     case RECC_PUNCT: return BIT_PUNCT;
     case RECC_SPACE: return BIT_SPACE;
     case RECC_GRAPH: return BIT_GRAPH;
@@ -2886,7 +2890,8 @@ regex_compile (const_re_char *pattern, size_t size,
                       SET_LIST_BIT (c1);
                   }
                 SET_RANGE_TABLE_WORK_AREA_BIT
-                  (range_table_work, re_wctype_to_bit (cc));
+                  (range_table_work,
+                   re_wctype_to_bit (cc, RE_TRANSLATE_P (translate)));
 #endif /* emacs */
                 /* In most cases the matching rule for char classes only
                    uses the syntax table for multibyte chars, so that the
@@ -4633,11 +4638,10 @@ skip_noops (const_re_char *p, const_re_char *pend)
 
 /* Test if C matches charset op. *PP points to the charset or charset_not
    opcode. When the function finishes, *PP will be advanced past that opcode.
-   C is character to test (possibly after translations) and CORIG is original
-   character (i.e. without any translations). UNIBYTE denotes whether c is
-   unibyte or multibyte character. */
+   C is character to test. UNIBYTE denotes whether c is unibyte or multibyte
+   character. */
 static bool
-execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte)
+execute_charset (const_re_char **pp, unsigned c, bool unibyte)
 {
   re_char *p = *pp, *rtp = NULL;
   bool not = (re_opcode_t) *p == charset_not;
@@ -4675,24 +4679,9 @@ execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte)
          IS_REAL_ASCII (c), we can ignore that. */
 
       bits = class_bits & (BIT_ALNUM | BIT_ALPHA | BIT_UPPER | BIT_LOWER |
-                           BIT_GRAPH | BIT_PRINT);
-      if (bits)
-        {
-          int char_bits = category_char_bits[char_unicode_category (c)];
-          if (bits & char_bits)
-            return !not;
-
-          /* Handle case folding. */
-          if (corig != c)
-            {
-              if ((bits & BIT_UPPER) && (char_bits & BIT_LOWER) &&
-                  c == downcase (corig))
-                return !not;
-              if ((bits & BIT_LOWER) && (char_bits & BIT_UPPER) &&
-                  c == upcase (corig))
-                return !not;
-            }
-        }
+                           BIT_TITLE | BIT_GRAPH | BIT_PRINT);
+      if (bits && (category_char_bits[char_unicode_category (c)] & bits))
+        return !not;
 
       if (class_bits & (BIT_SPACE | BIT_WORD | BIT_PUNCT))
         {
@@ -4772,7 +4761,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1,
       else if ((re_opcode_t) *p1 == charset
                || (re_opcode_t) *p1 == charset_not)
         {
-          if (!execute_charset (&p1, c, c, !multibyte || IS_REAL_ASCII (c)))
+          if (!execute_charset (&p1, c, !multibyte || IS_REAL_ASCII (c)))
             {
               DEBUG_PRINT (" No match => fast loop.\n");
               return 1;
@@ -5482,7 +5471,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
         case charset:
         case charset_not:
           {
-            register unsigned int c, corig;
+            register unsigned int c;
             int len;
 
             /* Whether matching against a unibyte character. */
@@ -5492,7 +5481,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
                          (re_opcode_t) *(p - 1) == charset_not ? "_not" : "");
 
             PREFETCH ();
-            corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
+            c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
             if (target_multibyte)
               {
                 int c1;
@@ -5524,7 +5513,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
               }
 
             p -= 1;
-            if (!execute_charset (&p, c, corig, unibyte_char))
+            if (!execute_charset (&p, c, unibyte_char))
               goto fail;
 
             d += len;
diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
index 7617823..4da9ab3 100644
--- a/test/src/regex-tests.el
+++ b/test/src/regex-tests.el
@@ -127,17 +127,11 @@ regex--test-cc
            (?ẞ . "Lu | alnum alpha upper | case-fold: alnum alpha upper lower")
            (?Ǳ . "Lu | alnum alpha upper | case-fold: alnum alpha upper lower")
            (?a . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower")
-           ;; FIXME: Should match upper when case-fold case
-           ;; (?ł . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower")
-           ;; (?ß . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower")
-           ;; (?ﬁ . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower")
-           ;; (?ɕ . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower")
-           ;; (?ǳ . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower")
-           (?ł . "Ll | alnum alpha lower | case-fold: alnum alpha lower")
-           (?ß . "Ll | alnum alpha lower | case-fold: alnum alpha lower")
-           (?ﬁ . "Ll | alnum alpha lower | case-fold: alnum alpha lower")
-           (?ɕ . "Ll | alnum alpha lower | case-fold: alnum alpha lower")
-           (?ǳ . "Ll | alnum alpha lower | case-fold: alnum alpha lower")
+           (?ł . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower")
+           (?ß . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower")
+           (?ﬁ . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower")
+           (?ɕ . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower")
+           (?ǳ . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower")
            (?ǲ . "Lt | alnum alpha | case-fold: alnum alpha upper lower")
            (?ʰ . "Lm | alnum alpha | case-fold: alnum alpha")
            (?º . "Lo | alnum alpha | case-fold: alnum alpha")))))))
-- 
2.8.0.rc3.226.g39d4020