unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed
From: Michal Nazarewicz <mina86@mina86.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 24603@debbugs.gnu.org
Subject: bug#24603: [RFC 16/18] Refactor character class checking; optimise ASCII case
Date: Sun, 06 Nov 2016 20:26:11 +0100	[thread overview]
Message-ID: <xa1tk2cgqujw.fsf@mina86.com> (raw)
In-Reply-To: <837f9oo8q3.fsf@gnu.org>

On Tue, Oct 04 2016, Eli Zaretskii wrote:
> Thanks.  I think this change will require a benchmark to make sure we
> don't lose too much in terms of performance.

Benchmark and its results included below.

It’s a bit noisy and as all benchmarks of that kind it doesn’t really
measure the real usage, but I think it’s safe to say that things aren’t
getting worse.

---- >8 ------------------------------------------------------------------------

Class      [[:cc:]]   no-case    [^[:cc:]]  no-case  
---------  ---------  ---------  ---------  ---------

==== Add regex character class matching benchmark ====
alnum         59.870     60.148     63.548     64.048
alpha         60.355     60.137     63.333     62.684
digit         27.835     27.648      0.513      0.488
xdigit        27.160     27.320      0.969      0.883
upper         91.027     91.572     39.423     39.595
lower         60.591     61.307     60.332     59.730
word          36.201     36.046    108.118    109.396
punct        110.987    111.683     35.110     35.200
cntrl         27.005     26.756      1.212      1.176
graph         25.694     26.097     75.872     75.711
print         24.783     24.976     76.652     74.921
space        147.210    148.431      1.261      1.252
blank         27.602     27.722      0.373      0.189
ascii         23.243     23.302      4.550      4.486
nonascii       5.448      5.407     90.733     90.410
unibyte       22.986     23.342      4.559      4.655
multibyte      5.508      5.535     92.457     91.163
...all...      1.138      1.030     93.275     93.383

==== Refactor character class checking; optimise ASCII case ====
alnum         54.643     54.301     56.668     56.898
alpha         54.654     54.558     56.134     56.281
digit         26.103     26.044      0.495      0.443
xdigit        25.606     25.690      0.815      0.806
upper         83.269     83.306     36.704     36.487
lower         56.278     55.804     54.872     54.917
word          34.820     55.092     99.577    100.618
punct        103.410    103.465     31.673     31.590
cntrl         25.509     25.274      1.119      1.101
graph         23.593     23.673     69.335     69.481
print         23.003     23.123     69.962     70.132
space        132.224    132.458      1.143      1.120
blank         26.223     26.342      0.193      0.187
ascii         22.329     22.257      4.094      4.082
nonascii       4.910      4.897     84.633     84.515
unibyte       22.866     22.385      4.094      4.078
multibyte      4.913      4.886     95.385     85.341
...all...      0.942      0.936     88.979     88.744

==== Optimise character class matching in regexes ====
alnum         53.338     53.052     56.571     56.434
alpha         53.591     53.350     56.218     56.255
digit         26.266     26.502      0.438      0.438
xdigit        25.793     25.887      0.877      0.876
upper         82.539     82.700     31.994     32.200
lower         55.280     55.040     54.615     54.429
word          33.666     33.530    100.678    101.721
punct        101.714    101.715     31.766     31.620
cntrl         25.669     25.068      1.113      1.114
graph         27.848     28.067     81.669     81.619
print         27.128     28.297     82.326     82.306
space        131.847    132.242      1.124      1.128
blank         26.493     26.607      0.190      0.188
ascii         22.332     22.315      4.379      4.358
nonascii       5.169      5.159     84.872     85.488
unibyte       22.259     22.529      4.374      4.361
multibyte      5.193      5.181     86.421     86.568
...all...      0.945      0.939     92.903     93.209

==== Fix case-fold-search character class matching ====
alnum         53.553     53.527     56.918     56.886
alpha         53.657     53.758     56.541     57.107
digit         26.616     26.641      0.467      0.510
xdigit        27.255     26.271      0.894      0.923
upper         56.608     55.073     55.792     55.422
lower         55.419     55.330     55.486     55.018
word          35.537     35.434    103.414    103.516
punct        105.810    106.618     33.454     33.322
cntrl         25.875     26.020      1.274      1.271
graph         28.011     28.185     82.239     82.245
print         26.935     27.016     99.945     83.213
space        136.774    138.135      1.170      1.159
blank         26.984     26.976      0.192      0.204
ascii         22.365     22.661      4.652      4.652
nonascii       5.759      5.524     85.805     86.403
unibyte       22.568     22.375      4.995      4.909
multibyte      5.729      5.749     84.671     84.396
...all...      0.990      0.978     89.520     89.612

All times in ms; lower is better.

---- >8 ------------------------------------------------------------------------
From 23d8fe0b093730406b64e0e20207c2fb929f707f Mon Sep 17 00:00:00 2001
From: Michal Nazarewicz <mina86@mina86.com>
Date: Fri, 7 Oct 2016 02:44:30 +0200
Subject: [PATCH] Add regex character class matching benchmark

* test/src/regex-tests.el (regex-tests-benchmark-cc-match): New function
running character class matching benchmark.
---
 test/src/regex-tests.el | 59 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
index fc50344..d0aad97 100644
--- a/test/src/regex-tests.el
+++ b/test/src/regex-tests.el
@@ -98,6 +98,65 @@ regex--test-cc
     (eval `(ert-deftest ,name () ,doc ,(cons 'regex--test-cc test)) t)))
 
 
+(defun regex-tests-benchmark-cc-match ()
+  "Benchmark regex character class matching."
+  (interactive)
+  (let* ((prn (if (called-interactively-p)
+                  'insert
+                (lambda (&rest args) (mapc 'princ args))))
+         (strings
+          (nconc (list
+                  (apply 'string (number-sequence 32 126))
+                  (apply 'string (number-sequence 0 127))
+                  (apply 'unibyte-string (number-sequence 128 255))
+                  (concat (apply 'string (number-sequence 0 255))
+                          (apply 'unibyte-string (number-sequence 128 255)))
+                  (make-string 10000 #x3FFF80)
+                  (make-string 10000 #x3FFFFF))
+                 (mapcar (lambda (ch) (make-string 10000 ch))
+                         (number-sequence 0 256))))
+
+         (ccs '("alnum" "alpha" "digit" "xdigit" "upper" "lower"
+                "word" "punct" "cntrl" "graph" "print" "space" "blank"
+                "ascii" "nonascii" "unibyte" "multibyte"))
+
+         (benchmark-re
+          (lambda (re)
+            (dolist (cf '(nil t))
+              ;; Compile the regex so it ends up in cache.
+              (string-match re "")
+              (let ((res (benchmark-run 10
+                           (dolist (str strings) (string-match re str)))))
+                (funcall prn (format " %10.3f"
+                                     (* (- (nth 0 res) (nth 2 res)) 100))))))))
+
+    (when (called-interactively-p)
+      (switch-to-buffer (get-buffer-create "*Regex Benchmark*"))
+      (delete-region (point-min) (point-max)))
+
+    (funcall prn (format "%-9s  %-9s  %-9s  %-9s  %-9s\n"
+                         "Class" "[[:cc:]]" "no-case"
+                         "[^[:cc:]]" "no-case")
+             (make-string 9 ?-)
+             "  " (make-string 9 ?-) "  " (make-string 9 ?-)
+             "  " (make-string 9 ?-) "  " (make-string 9 ?-) "\n")
+
+    (dolist (cc ccs)
+      (funcall prn (format "%-9s" cc))
+      (dolist (re (list (format "[[:%s:]]" cc)
+                        (format "[^[:%s:]]" cc)))
+        (funcall benchmark-re re))
+      (funcall prn "\n"))
+
+    (funcall prn (format "%-9s" "...all..."))
+    (let ((all-ccs (mapconcat (lambda (cc) (format "[:%s:]" cc)) ccs "")))
+      (funcall benchmark-re (concat "[" all-ccs "]"))
+      (funcall benchmark-re (concat "[^" all-ccs "]")))
+
+    (funcall prn "\n" (make-string 53 ?-)
+             "\nAll times in ms; lower is better.\n")))
+
+
 (defmacro regex-tests-generic-line (comment-char test-file whitelist &rest body)
   "Reads a line of the test file TEST-FILE, skipping
 comments (defined by COMMENT-CHAR), and evaluates the tests in
-- 
2.8.0.rc3.226.g39d4020





  parent reply	other threads:[~2016-11-06 19:26 UTC|newest]

Thread overview: 89+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-10-04  1:05 bug#24603: [RFC 00/18] Improvement to casing Michal Nazarewicz
2016-10-04  1:10 ` bug#24603: [RFC 01/18] Add tests for casefiddle.c Michal Nazarewicz
2016-10-04  1:10   ` bug#24603: [RFC 02/18] Generate upcase and downcase tables from Unicode data Michal Nazarewicz
2016-10-04  7:27     ` Eli Zaretskii
2016-10-04 14:54       ` Michal Nazarewicz
2016-10-04 15:06         ` Eli Zaretskii
2016-10-04 16:57           ` Michal Nazarewicz
2016-10-04 17:27             ` Eli Zaretskii
2016-10-04 17:44               ` Eli Zaretskii
2016-10-06 20:29                 ` Michal Nazarewicz
2016-10-07  6:52                   ` Eli Zaretskii
2016-10-04  1:10   ` bug#24603: [RFC 03/18] Don’t assume character can be either upper- or lower-case when casing Michal Nazarewicz
2016-10-04  1:10   ` bug#24603: [RFC 04/18] Split casify_object into multiple functions Michal Nazarewicz
2016-10-04  1:10   ` bug#24603: [RFC 05/18] Introduce case_character function Michal Nazarewicz
2016-10-04  1:10   ` bug#24603: [RFC 06/18] Add support for title-casing letters Michal Nazarewicz
2016-10-04  1:10   ` bug#24603: [RFC 07/18] Split up casify_region function Michal Nazarewicz
2016-10-04  7:17     ` Eli Zaretskii
2016-10-18  2:27       ` Michal Nazarewicz
2016-10-04  1:10   ` bug#24603: [RFC 08/18] Support casing characters which map into multiple code points Michal Nazarewicz
2016-10-04  7:38     ` Eli Zaretskii
2016-10-06 21:40       ` Michal Nazarewicz
2016-10-07  7:46         ` Eli Zaretskii
2017-01-28 23:48           ` Michal Nazarewicz
2017-02-10  9:12             ` Eli Zaretskii
2016-10-04  1:10   ` bug#24603: [RFC 09/18] Implement special sigma casing rule Michal Nazarewicz
2016-10-04  7:22     ` Eli Zaretskii
2016-10-04  1:10   ` bug#24603: [RFC 10/18] Implement Turkic dotless and dotted i handling when casing strings Michal Nazarewicz
2016-10-04  7:12     ` Eli Zaretskii
2016-10-04  1:10   ` bug#24603: [RFC 11/18] Implement casing rules for Lithuanian Michal Nazarewicz
2016-10-04  1:10   ` bug#24603: [RFC 12/18] Implement rules for title-casing Dutch ij ‘letter’ Michal Nazarewicz
2016-10-04  1:10   ` bug#24603: [RFC 13/18] Add some tricky Unicode characters to regex test Michal Nazarewicz
2016-10-04  1:10   ` bug#24603: [RFC 14/18] Factor out character category lookup to separate function Michal Nazarewicz
2016-10-04  1:10   ` bug#24603: [RFC 15/18] Base lower- and upper-case tests on Unicode properties Michal Nazarewicz
2016-10-04  6:54     ` Eli Zaretskii
2016-10-04  1:10   ` bug#24603: [RFC 16/18] Refactor character class checking; optimise ASCII case Michal Nazarewicz
2016-10-04  7:48     ` Eli Zaretskii
2016-10-17 13:22       ` Michal Nazarewicz
2016-11-06 19:26       ` Michal Nazarewicz [this message]
2016-11-06 19:44         ` Eli Zaretskii
2016-12-20 14:32           ` Michal Nazarewicz
2016-12-20 16:39             ` Eli Zaretskii
2016-12-22 14:02               ` Michal Nazarewicz
2016-10-04  1:10   ` bug#24603: [RFC 17/18] Optimise character class matching in regexes Michal Nazarewicz
2016-10-04  1:10   ` bug#24603: [RFC 18/18] Fix case-fold-search character class matching Michal Nazarewicz
2016-10-17 22:03 ` bug#24603: [PATCH 0/3] Case table updates Michal Nazarewicz
2016-10-17 22:03   ` bug#24603: [PATCH 1/3] Add tests for casefiddle.c Michal Nazarewicz
2016-10-17 22:03   ` bug#24603: [PATCH 2/3] Generate upcase and downcase tables from Unicode data Michal Nazarewicz
2016-10-17 22:03   ` bug#24603: [PATCH 3/3] Don’t generate ‘X maps to X’ entries in case tables Michal Nazarewicz
2016-10-18  6:36   ` bug#24603: [PATCH 0/3] Case table updates Eli Zaretskii
2016-10-24 15:11     ` Michal Nazarewicz
2016-10-24 15:33       ` Eli Zaretskii
2017-03-09 21:51 ` bug#24603: [PATCHv5 00/11] Casing improvements Michal Nazarewicz
2017-03-09 21:51   ` bug#24603: [PATCHv5 01/11] Split casify_object into multiple functions Michal Nazarewicz
2017-03-10  9:00     ` Andreas Schwab
2017-03-09 21:51   ` bug#24603: [PATCHv5 02/11] Introduce case_character function Michal Nazarewicz
2017-03-09 21:51   ` bug#24603: [PATCHv5 03/11] Add support for title-casing letters (bug#24603) Michal Nazarewicz
2017-03-11  9:03     ` Eli Zaretskii
2017-03-09 21:51   ` bug#24603: [PATCHv5 04/11] Split up casify_region function (bug#24603) Michal Nazarewicz
2017-03-09 21:51   ` bug#24603: [PATCHv5 05/11] Support casing characters which map into multiple code points (bug#24603) Michal Nazarewicz
2017-03-11  9:14     ` Eli Zaretskii
2017-03-21  2:09       ` Michal Nazarewicz
2017-03-09 21:51   ` bug#24603: [PATCHv5 06/11] Implement special sigma casing rule (bug#24603) Michal Nazarewicz
2017-03-09 21:51   ` bug#24603: [PATCHv5 07/11] Introduce ‘buffer-language’ buffer-locar variable Michal Nazarewicz
2017-03-11  9:29     ` Eli Zaretskii
2017-03-09 21:51   ` bug#24603: [PATCHv5 08/11] Implement rules for title-casing Dutch ij ‘letter’ (bug#24603) Michal Nazarewicz
2017-03-11  9:40     ` Eli Zaretskii
2017-03-16 21:30       ` Michal Nazarewicz
2017-03-17 13:43         ` Eli Zaretskii
2017-03-09 21:51   ` bug#24603: [PATCHv5 09/11] Implement Turkic dotless and dotted i casing rules (bug#24603) Michal Nazarewicz
2017-03-09 21:51   ` bug#24603: [PATCHv5 10/11] Implement casing rules for Lithuanian (bug#24603) Michal Nazarewicz
2017-03-09 21:51   ` bug#24603: [PATCHv5 11/11] Implement Irish casing rules (bug#24603) Michal Nazarewicz
2017-03-11  9:44     ` Eli Zaretskii
2017-03-16 22:16       ` Michal Nazarewicz
2017-03-17  8:20         ` Eli Zaretskii
2017-03-11 10:00   ` bug#24603: [PATCHv5 00/11] Casing improvements Eli Zaretskii
2017-03-21  1:27   ` bug#24603: [PATCHv6 0/6] Casing improvements, language-independent part Michal Nazarewicz
2017-03-21  1:27     ` bug#24603: [PATCHv6 1/6] Split casify_object into multiple functions Michal Nazarewicz
2017-03-21  1:27     ` bug#24603: [PATCHv6 2/6] Introduce case_character function Michal Nazarewicz
2017-03-21  1:27     ` bug#24603: [PATCHv6 3/6] Add support for title-casing letters (bug#24603) Michal Nazarewicz
2017-03-21  1:27     ` bug#24603: [PATCHv6 4/6] Split up casify_region function (bug#24603) Michal Nazarewicz
2017-03-21  1:27     ` bug#24603: [PATCHv6 5/6] Support casing characters which map into multiple code points (bug#24603) Michal Nazarewicz
2017-03-22 16:06       ` Eli Zaretskii
2017-04-03  9:01         ` Michal Nazarewicz
2017-04-03 14:52           ` Eli Zaretskii
2019-06-25  0:09           ` Lars Ingebrigtsen
2019-06-25  0:29             ` Michał Nazarewicz
2020-08-11 13:46               ` Lars Ingebrigtsen
2021-05-10 11:51                 ` bug#24603: [RFC 00/18] Improvement to casing Lars Ingebrigtsen
2017-03-21  1:27     ` bug#24603: [PATCHv6 6/6] Implement special sigma casing rule (bug#24603) Michal Nazarewicz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/emacs/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xa1tk2cgqujw.fsf@mina86.com \
    --to=mina86@mina86.com \
    --cc=24603@debbugs.gnu.org \
    --cc=eliz@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).