From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Michal Nazarewicz Newsgroups: gmane.emacs.bugs Subject: bug#24603: [RFC 17/18] Optimise character class matching in regexes Date: Tue, 4 Oct 2016 03:10:40 +0200 Message-ID: <1475543441-10493-17-git-send-email-mina86@mina86.com> References: <1475543441-10493-1-git-send-email-mina86@mina86.com> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Trace: blaine.gmane.org 1475543572 4566 195.159.176.226 (4 Oct 2016 01:12:52 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Tue, 4 Oct 2016 01:12:52 +0000 (UTC) To: 24603@debbugs.gnu.org Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Tue Oct 04 03:12:48 2016 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1brEHd-0007hT-0X for geb-bug-gnu-emacs@m.gmane.org; Tue, 04 Oct 2016 03:12:37 +0200 Original-Received: from localhost ([::1]:39716 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1brEHb-0006xo-ED for geb-bug-gnu-emacs@m.gmane.org; Mon, 03 Oct 2016 21:12:35 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:56628) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1brEHF-0006rQ-9G for bug-gnu-emacs@gnu.org; Mon, 03 Oct 2016 21:12:15 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1brEHD-0002bB-7H for bug-gnu-emacs@gnu.org; Mon, 03 Oct 2016 21:12:13 -0400 Original-Received: from debbugs.gnu.org ([208.118.235.43]:37374) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1brEHD-0002aj-2C for bug-gnu-emacs@gnu.org; Mon, 03 Oct 2016 21:12:11 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1brEHA-0006ke-DK for bug-gnu-emacs@gnu.org; Mon, 03 Oct 2016 21:12:08 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Michal Nazarewicz Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Tue, 04 Oct 2016 01:12:08 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24603 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: Original-Received: via spool by 24603-submit@debbugs.gnu.org id=B24603.147554347225737 (code B ref 24603); Tue, 04 Oct 2016 01:12:08 +0000 Original-Received: (at 24603) by debbugs.gnu.org; 4 Oct 2016 01:11:12 +0000 Original-Received: from localhost ([127.0.0.1]:43548 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1brEGG-0006gw-AQ for submit@debbugs.gnu.org; Mon, 03 Oct 2016 21:11:12 -0400 Original-Received: from mail-wm0-f54.google.com ([74.125.82.54]:36320) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1brEGB-0006du-2T for 24603@debbugs.gnu.org; Mon, 03 Oct 2016 21:11:08 -0400 Original-Received: by mail-wm0-f54.google.com with SMTP id k125so178185143wma.1 for <24603@debbugs.gnu.org>; Mon, 03 Oct 2016 18:11:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=sender:from:to:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=x4ylqt7gY2Z4tNMUKCwIhxxon2J3Ulpwp6gVzfDj9iY=; b=aWfDnB+sAeBEHCtfJXWTAM4cxC8DSU3966juMO/BekY9DARyOVpDIvmliTewxWnqNH mv8IXNbvOfRBYvPm7gYdEzi0XoMZeyygZGN1b3bOddk0IiMJnSKwAfaIhZbe6I9Yqbi8 qljlwqF9Cj/0n6clUd2/NfVrq9xCrfxPTFrpmhiSgI3DvF07KXfYlb1IUAiEfXA4vaR1 luNS9KvgVFaurnMX4tN8OaPBJvzhCWSKma/xgaQqmTDOcqTej/b+4T+8hR0F5pzahQkX 0uZLi9hDLqfNAIk3zWUnYBSJO8JS9VaNfZxuT3NZn60kyYSj0NFtFlY6ZSHDjfQVGj8a drcg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:from:to:subject:date:message-id :in-reply-to:references:mime-version:content-transfer-encoding; bh=x4ylqt7gY2Z4tNMUKCwIhxxon2J3Ulpwp6gVzfDj9iY=; b=gA+CLP0xdSx1f5QDYdsW/4mSTdRpTal2ARfrCTvG8aM5iyUS8G8xV1K3KAFNcXRcgF t2/icDGcUgm8KXIaOhWmMiRRoM+jHUSNxPQdTa8yi8u/30E/L/uMD6LINOTepOI940e4 fpm+ze03LKH0ICaDgZvdL3HazusJ38mO47mYAClSkasnXxvfgL9bDeeIhqcThMg3IZH5 xjCbsSn90vNngMco1c/mh/QzdCWlegcD3rvjVOZHx2A0b6snRQ0Yh9dC4mwK0kHMhB9A rF6CxZwWQxMdSIHmY6benTvLr0Wuu2dwUUDQOrsn3cAzhs6Pv7Gz1HWQXOq4EyR2d1pS xjRw== X-Gm-Message-State: AA6/9Rln/M5QUETJtOYqGbuDGbvz4wbm/OEcoiuI581nbRmq9kLNDihjsqge2eLriofqGASW X-Received: by 10.194.202.166 with SMTP id kj6mr585297wjc.72.1475543461227; Mon, 03 Oct 2016 18:11:01 -0700 (PDT) Original-Received: from mpn.zrh.corp.google.com ([2620:0:105f:301:e126:377e:c57c:59ab]) by smtp.gmail.com with ESMTPSA id r1sm713667wjc.43.2016.10.03.18.10.54 for <24603@debbugs.gnu.org> (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 03 Oct 2016 18:10:56 -0700 (PDT) Original-Received: by mpn.zrh.corp.google.com (Postfix, from userid 126942) id 4FFAA1E029F; Tue, 4 Oct 2016 03:10:49 +0200 (CEST) X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020 In-Reply-To: <1475543441-10493-1-git-send-email-mina86@mina86.com> X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 208.118.235.43 X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Original-Sender: "bug-gnu-emacs" Xref: news.gmane.org gmane.emacs.bugs:123991 Archived-At: Use lookup tables defined in src/character.h to bundle checks together if possible. For example, ‘[[:lower:][:digit:]]’ would perform an equivalence of ‘lowercasep(ch) || numericp(ch)’ check. Now, such checks are done all at once with at most one Unicode general category lookup. Similarly, do at most one syntax table lookup by unrolling macros testing character properties. * src/regex.c (execute_charset): Use category_char_bits and call SYNTAX at most once. * test/src/regex-tests.el (regex-tests--letter-character-classes): New test case for various character classes relating to letters etc. --- src/regex.c | 86 +++++++++++++++++++++++++++++++------------------ test/src/regex-tests.el | 45 ++++++++++++++++++++++++++ 2 files changed, 100 insertions(+), 31 deletions(-) diff --git a/src/regex.c b/src/regex.c index 02da1fb..bfd04a1 100644 --- a/src/regex.c +++ b/src/regex.c @@ -1789,16 +1789,26 @@ struct range_table_work_area /* Bits used to implement the multibyte-part of the various character classes such as [:alnum:] in a charset's range table. The code currently assumes that only the low 16 bits are used. */ -#define BIT_WORD 0x1 -#define BIT_LOWER 0x2 -#define BIT_PUNCT 0x4 -#define BIT_SPACE 0x8 -#define BIT_UPPER 0x10 -#define BIT_MULTIBYTE 0x20 -#define BIT_ALPHA 0x40 -#define BIT_ALNUM 0x80 -#define BIT_GRAPH 0x100 -#define BIT_PRINT 0x200 +#ifdef emacs +# define BIT_ALNUM CHAR_BIT_ALNUM +# define BIT_ALPHA CHAR_BIT_ALPHA +# define BIT_UPPER CHAR_BIT_UPPER +# define BIT_LOWER CHAR_BIT_LOWER +# define BIT_GRAPH CHAR_BIT_GRAPH +# define BIT_PRINT CHAR_BIT_PRINT +#else +# define BIT_ALNUM (1 << 0) +# define BIT_ALPHA (1 << 1) +# define BIT_UPPER (1 << 2) +# define BIT_LOWER (1 << 3) +# define BIT_GRAPH (1 << 4) +# define BIT_PRINT (1 << 5) +#endif +#define BIT_WORD (BIT_PRINT << 1) +#define BIT_PUNCT (BIT_PRINT << 2) +#define BIT_SPACE (BIT_PRINT << 3) +#define BIT_MULTIBYTE (BIT_PRINT << 4) + /* Set the bit for character C in a list. */ @@ -1988,9 +1998,6 @@ re_wctype_parse (const unsigned char **strp, unsigned limit) 2 [:print:] 2 [:cntrl:] 1 [:ff:] - - If you update this list, consider also updating chain of or’ed conditions - in execute_charset function. */ switch (it - beg) { @@ -4657,28 +4664,45 @@ execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte) else if (rtp) { int class_bits = CHARSET_RANGE_TABLE_BITS (p); + int bits; re_wchar_t range_start, range_end; - /* Sort tests by the most commonly used classes with some adjustment to which - tests are easiest to perform. Take a look at comment in re_wctype_parse - for table with frequencies of character class names. */ - - if ((class_bits & BIT_MULTIBYTE) || - (class_bits & BIT_ALNUM && ISALNUM (c)) || - (class_bits & BIT_ALPHA && ISALPHA (c)) || - (class_bits & BIT_SPACE && ISSPACE (c)) || - (class_bits & BIT_WORD && ISWORD (c)) || - ((class_bits & BIT_UPPER) && - (ISUPPER (c) || (corig != c && - c == downcase (corig) && ISLOWER (c)))) || - ((class_bits & BIT_LOWER) && - (ISLOWER (c) || (corig != c && - c == upcase (corig) && ISUPPER(c)))) || - (class_bits & BIT_PUNCT && ISPUNCT (c)) || - (class_bits & BIT_GRAPH && ISGRAPH (c)) || - (class_bits & BIT_PRINT && ISPRINT (c))) + if (class_bits & BIT_MULTIBYTE) return !not; + /* If we are at this point, the character is not an ASCII or single byte + character. This means that whenever ISFOO macros have special case for + IS_REAL_ASCII (c), we can ignore that. */ + + bits = class_bits & (BIT_ALNUM | BIT_ALPHA | BIT_UPPER | BIT_LOWER | + BIT_GRAPH | BIT_PRINT); + if (bits) + { + int char_bits = category_char_bits[char_unicode_category (c)]; + if (bits & char_bits) + return !not; + + /* Handle case folding. */ + if (corig != c) + { + if ((bits & BIT_UPPER) && (char_bits & BIT_LOWER) && + c == downcase (corig)) + return !not; + if ((bits & BIT_LOWER) && (char_bits & BIT_UPPER) && + c == upcase (corig)) + return !not; + } + } + + if (class_bits & (BIT_SPACE | BIT_WORD | BIT_PUNCT)) + { + enum syntaxcode s = SYNTAX (c); + if (((class_bits & BIT_SPACE) && s == Swhitespace) || + ((class_bits & BIT_WORD ) && s == Sword) || + ((class_bits & BIT_PUNCT) && s != Sword)) + return !not; + } + for (p = *pp; rtp < p; rtp += 2 * 3) { EXTRACT_CHARACTER (range_start, rtp); diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el index fc50344..7617823 100644 --- a/test/src/regex-tests.el +++ b/test/src/regex-tests.el @@ -98,6 +98,51 @@ regex--test-cc (eval `(ert-deftest ,name () ,doc ,(cons 'regex--test-cc test)) t))) +(ert-deftest regex-tests--letter-character-classes () + "Test character classes against various letters types." + (should-not + (cl-remove-if + 'not + (let ((check-ccs (lambda (ch fold) + (mapconcat + (lambda (str) str) + (let ((case-fold-search fold)) + (cl-remove-if-not + (lambda (cc) + (string-match-p (concat "[[:" cc ":]]") + (string ch))) + '("alnum" "alpha" "upper" "lower"))) + " ")))) + (mapcar + (lambda (entry) + (let ((ch (car entry)) (expected (cdr entry))) + (setq entry + (format "%s | %s | case-fold: %s" + (get-char-code-property ch 'general-category) + (funcall check-ccs ch nil) (funcall check-ccs ch t))) + (unless (string-equal expected entry) + (format "\n%c expected: %s\nU+%06X but got: %s" + ch expected ch entry)))) + '((?A . "Lu | alnum alpha upper | case-fold: alnum alpha upper lower") + (?ẞ . "Lu | alnum alpha upper | case-fold: alnum alpha upper lower") + (?DZ . "Lu | alnum alpha upper | case-fold: alnum alpha upper lower") + (?a . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower") + ;; FIXME: Should match upper when case-fold case + ;; (?ł . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower") + ;; (?ß . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower") + ;; (?fi . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower") + ;; (?ɕ . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower") + ;; (?dz . "Ll | alnum alpha lower | case-fold: alnum alpha upper lower") + (?ł . "Ll | alnum alpha lower | case-fold: alnum alpha lower") + (?ß . "Ll | alnum alpha lower | case-fold: alnum alpha lower") + (?fi . "Ll | alnum alpha lower | case-fold: alnum alpha lower") + (?ɕ . "Ll | alnum alpha lower | case-fold: alnum alpha lower") + (?dz . "Ll | alnum alpha lower | case-fold: alnum alpha lower") + (?Dz . "Lt | alnum alpha | case-fold: alnum alpha upper lower") + (?ʰ . "Lm | alnum alpha | case-fold: alnum alpha") + (?º . "Lo | alnum alpha | case-fold: alnum alpha"))))))) + + (defmacro regex-tests-generic-line (comment-char test-file whitelist &rest body) "Reads a line of the test file TEST-FILE, skipping comments (defined by COMMENT-CHAR), and evaluates the tests in -- 2.8.0.rc3.226.g39d4020