From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Michal Nazarewicz <mina86@mina86.com>
Newsgroups: gmane.emacs.bugs
Subject: bug#24020: [PATCHv2] Fix
	=?UTF-8?Q?=E2=80=98[[:word:]]*\u2620=E2=80=99?= failing to
	match =?UTF-8?Q?=E2=80=98foo\u2620=E2=80=99?= (bug#24020)
Date: Tue, 19 Jul 2016 01:30:01 +0200
Message-ID: <1468884601-31164-1-git-send-email-mina86@mina86.com>
References: <83r3ar0z0u.fsf@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: ger.gmane.org 1468886621 9011 80.91.229.3 (19 Jul 2016 00:03:41 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Tue, 19 Jul 2016 00:03:41 +0000 (UTC)
To: 24020@debbugs.gnu.org
Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Tue Jul 19 02:03:32 2016
Return-path: <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>
Envelope-to: geb-bug-gnu-emacs@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>)
	id 1bPIVW-0001dc-4I
	for geb-bug-gnu-emacs@m.gmane.org; Tue, 19 Jul 2016 02:03:30 +0200
Original-Received: from localhost ([::1]:51116 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>)
	id 1bPIVV-0003Og-H4
	for geb-bug-gnu-emacs@m.gmane.org; Mon, 18 Jul 2016 20:03:29 -0400
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:44227)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1bPI09-0000NG-Pq
	for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 19:31:07 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1bPI06-0006BV-0l
	for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 19:31:05 -0400
Original-Received: from debbugs.gnu.org ([208.118.235.43]:43751)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1bPI05-0006BP-Sc
	for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 19:31:01 -0400
Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1bPI05-0005nD-Mb
	for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 19:31:01 -0400
X-Loop: help-debbugs@gnu.org
Resent-From: Michal Nazarewicz <mina86@mina86.com>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
Resent-CC: bug-gnu-emacs@gnu.org
Resent-Date: Mon, 18 Jul 2016 23:31:01 +0000
Resent-Message-ID: <handler.24020.B24020.146888461722213@debbugs.gnu.org>
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: followup 24020
X-GNU-PR-Package: emacs
X-GNU-PR-Keywords: patch
Original-Received: via spool by 24020-submit@debbugs.gnu.org id=B24020.146888461722213
	(code B ref 24020); Mon, 18 Jul 2016 23:31:01 +0000
Original-Received: (at 24020) by debbugs.gnu.org; 18 Jul 2016 23:30:17 +0000
Original-Received: from localhost ([127.0.0.1]:56088 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1bPHzM-0005mD-J4
	for submit@debbugs.gnu.org; Mon, 18 Jul 2016 19:30:17 -0400
Original-Received: from mail-wm0-f42.google.com ([74.125.82.42]:37595)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <mpn@google.com>) id 1bPHzK-0005lv-A2
	for 24020@debbugs.gnu.org; Mon, 18 Jul 2016 19:30:15 -0400
Original-Received: by mail-wm0-f42.google.com with SMTP id i5so5902951wmg.0
	for <24020@debbugs.gnu.org>; Mon, 18 Jul 2016 16:30:14 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113;
	h=sender:from:to:cc:subject:date:message-id:in-reply-to:references
	:mime-version:content-transfer-encoding;
	bh=3fe9volpqIA6+u6eqP1Ko2crUK5QGpvRO0Z3TJv3uD0=;
	b=QaXObqS3KuHuSO/IP7Jf3sfc/5KJkOz/Xw89euh8zsbqtBBe+SiNFWS1iGVArwJqEY
	kveScgIcV7JtRZUE8MKaeChHk4SJHn+cYQ/LXq5v/NzcJnFihbBUQXQkzjnOH60XLXPA
	55sd8Sv0QJwYh9nFjSP9Pm2RW8wK8Gf9Q8aCiEa7AumaWU1wh5X4i9OpO2PNtaKTqdMn
	GRfzvdzB+/yGEPouDJEwKp19BAmFCXWCtfgrTop62UCtUUaR2HeS5hxCAsrmOPEhdLIQ
	usXnP+9qyneCn92hRTq44GzrRXTI+939ndWvkERLcdcWFO14c/kQ1loB6ztJ/4gfkK4b
	zkYA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20130820;
	h=x-gm-message-state:sender:from:to:cc:subject:date:message-id
	:in-reply-to:references:mime-version:content-transfer-encoding;
	bh=3fe9volpqIA6+u6eqP1Ko2crUK5QGpvRO0Z3TJv3uD0=;
	b=c0qocr4+5AQwhwLpImDklza5WSXtpqpSFifVPX6rpnzrt+FGQ31HYRGxO7txKibS5d
	f+F9MxUn+HD4dFu4ZWTrl2w/VhNQdWaxaN32kz10D9i+YpMcPXFK8bS8dycpCiFuVz0i
	3QD6hh5gQ47b/bAYGO3UChX8XdVRxqEkliWEf78Oh7HQ1w9lqnLUNBxEUKqFZRf2e0Vq
	b+3NuTinkeHDHzYokaOVO0GuauJcp7X1cwfTxxIG81G44A7qxMc0n12yl9ltsHdLC4AH
	XFsCndlzoxYvw3bGLL4zEJX04khQMfSmJDSryHy1Q982GhFYAYAUYzQJZac2Rq3NDPw9
	1ZTA==
X-Gm-Message-State: ALyK8tJSDZrgYuuSosAT9JjQLI7xz4iUBeGVEWZuHG/+P0+FrehBgBqdt6ZThAhf9C5444o3
X-Received: by 10.194.77.193 with SMTP id u1mr3557437wjw.94.1468884608110;
	Mon, 18 Jul 2016 16:30:08 -0700 (PDT)
Original-Received: from mpn.zrh.corp.google.com ([172.16.113.135])
	by smtp.gmail.com with ESMTPSA id 17sm18996891wmf.6.2016.07.18.16.30.07
	(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
	Mon, 18 Jul 2016 16:30:07 -0700 (PDT)
Original-Received: by mpn.zrh.corp.google.com (Postfix, from userid 126942)
	id 74D2B1E028F; Tue, 19 Jul 2016 01:30:06 +0200 (CEST)
X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020
In-Reply-To: <83r3ar0z0u.fsf@gnu.org>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 208.118.235.43
X-BeenThere: bug-gnu-emacs@gnu.org
List-Id: "Bug reports for GNU Emacs,
	the Swiss army knife of text editors" <bug-gnu-emacs.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-gnu-emacs>,
	<mailto:bug-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/bug-gnu-emacs/>
List-Post: <mailto:bug-gnu-emacs@gnu.org>
List-Help: <mailto:bug-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-gnu-emacs>,
	<mailto:bug-gnu-emacs-request@gnu.org?subject=subscribe>
Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org
Original-Sender: "bug-gnu-emacs"
	<bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.bugs:121244
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/121244>

The regex engine tries to optimise (greedy) Kleene star by avoiding
backtracking when it can detect that portion of the expression after
the star cannot match if the repeated portion does match.

For example, take regular expression ‘[[:alpha:]]*1’ trying to match
a string ‘foo’.  Since the Kleene star is greedy, the engine will test
the shortest match for ‘[[:alpha:]]*’ which is ‘foo’.  At this point
though, the string being matched is empty while there’s still
a literal digit one in the pattern.

The engine will not, however, attempt to back-trace testing a shorter
match for the character class (namely ‘fo’ leaving ‘o’ in the string)
because it knows that whatever will be left in the string cannot match
literal digit one.

In the regexes of the form ‘[[:CC:]]*X’, the optimisation can be
applied if and only if the regex engine can prove that the character
class CC does not match character X (as is the case with alpha
character class not matching digit 1).

In the code, the proof is performed by mutually_exclusive_p function.

However, it did not check class bits of a charset opcode which
resulted in it assuming that character classes cannot match multibyte
characters.  For example, it would assume that [[:alpha:]] cannot
match ‘ż’ even though ‘ż’ is indeed an alphanumeric character matching
the alpha character class.

This assumption caused incorrect optimisation of the regular
expression and eventually failure of ‘[[:alpha:]]*żółw’ to match
‘żółw’.

This issue affects any character class witch matches multibyte
characters.

* src/regex.c (executing_charset): A new function for executing the
charset and charset_not opcodes.  It performs check on the character
taking into consideration existing bitmap, range table and class bits.
It also advances the pointer in the regex bytecode past the parsed
opcode.
(CHARSET_LOOKUP_RANGE_TABLE_RAW, CHARSET_LOOKUP_RANGE_TABLE): Removed.
Code now included in executing_charset.
(mutually_exclusive_p, re_match_2_internal): Changed to take advantage
of executing_charset function.

* test/src/regex-tests.el: New file with tests for the character class
matching.
---
 src/regex.c             | 209 +++++++++++++++++++++---------------------------
 test/src/regex-tests.el |  92 +++++++++++++++++++++
 2 files changed, 185 insertions(+), 116 deletions(-)
 create mode 100644 test/src/regex-tests.el

diff --git a/src/regex.c b/src/regex.c
index f92bcb7..297bf71 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -783,44 +783,6 @@ extract_number_and_incr (re_char **source)
    and end.  */
 #define CHARSET_RANGE_TABLE_END(range_table, count)	\
   ((range_table) + (count) * 2 * 3)
-
-/* Test if C is in RANGE_TABLE.  A flag NOT is negated if C is in.
-   COUNT is number of ranges in RANGE_TABLE.  */
-#define CHARSET_LOOKUP_RANGE_TABLE_RAW(not, c, range_table, count)	\
-  do									\
-    {									\
-      re_wchar_t range_start, range_end;				\
-      re_char *rtp;							\
-      re_char *range_table_end						\
-	= CHARSET_RANGE_TABLE_END ((range_table), (count));		\
-									\
-      for (rtp = (range_table); rtp < range_table_end; rtp += 2 * 3)	\
-	{								\
-	  EXTRACT_CHARACTER (range_start, rtp);				\
-	  EXTRACT_CHARACTER (range_end, rtp + 3);			\
-									\
-	  if (range_start <= (c) && (c) <= range_end)			\
-	    {								\
-	      (not) = !(not);						\
-	      break;							\
-	    }								\
-	}								\
-    }									\
-  while (0)
-
-/* Test if C is in range table of CHARSET.  The flag NOT is negated if
-   C is listed in it.  */
-#define CHARSET_LOOKUP_RANGE_TABLE(not, c, charset)			\
-  do									\
-    {									\
-      /* Number of ranges in range table. */				\
-      int count;							\
-      re_char *range_table = CHARSET_RANGE_TABLE (charset);		\
-      									\
-      EXTRACT_NUMBER_AND_INCR (count, range_table);			\
-      CHARSET_LOOKUP_RANGE_TABLE_RAW ((not), (c), range_table, count);	\
-    }									\
-  while (0)
 
 /* If DEBUG is defined, Regex prints many voluminous messages about what
    it is doing (if the variable `debug' is nonzero).  If linked with the
@@ -4661,6 +4623,93 @@ skip_noops (const_re_char *p, const_re_char *pend)
   return p;
 }
 
+/* Test if C matches charset op.  *PP points to the charset or chraset_not
+   opcode.  When the function finishes, *PP will be advanced past that opcode.
+   C is character to test (possibly after translations) and CORIG is original
+   character (i.e. without any translations).  UNIBYTE denotes whether c is
+   unibyte or multibyte character. */
+static bool
+execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte)
+{
+  re_char *p = *pp, *rtp = NULL;
+  bool not = (re_opcode_t) *p == charset_not;
+
+  if (CHARSET_RANGE_TABLE_EXISTS_P (p))
+    {
+      int count;
+      rtp = CHARSET_RANGE_TABLE (p);
+      EXTRACT_NUMBER_AND_INCR (count, rtp);
+      *pp = CHARSET_RANGE_TABLE_END ((rtp), (count));
+    }
+  else
+    *pp += 2 + CHARSET_BITMAP_SIZE (p);
+
+  if (unibyte && c < (1 << BYTEWIDTH))
+    {			/* Lookup bitmap.  */
+      /* Cast to `unsigned' instead of `unsigned char' in
+	 case the bit list is a full 32 bytes long.  */
+      if (c < (unsigned) (CHARSET_BITMAP_SIZE (p) * BYTEWIDTH)
+	  && p[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
+	return !not;
+    }
+#ifdef emacs
+  else if (rtp)
+    {
+      int class_bits = CHARSET_RANGE_TABLE_BITS (p);
+      re_wchar_t range_start, range_end;
+
+  /* Sort tests by the most commonly used classes with some adjustment to which
+     tests are easiest to perform.  Frequencies of character class names used in
+     Emacs sources as of 2016-07-15:
+
+     $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + |
+           sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr
+         213 [:alnum:]
+         104 [:alpha:]
+          62 [:space:]
+          39 [:digit:]
+          36 [:blank:]
+          26 [:upper:]
+          24 [:word:]
+          21 [:lower:]
+          10 [:punct:]
+          10 [:ascii:]
+           9 [:xdigit:]
+           4 [:nonascii:]
+           4 [:graph:]
+           2 [:print:]
+           2 [:cntrl:]
+           1 [:ff:]
+   */
+
+      if ((class_bits & BIT_MULTIBYTE) ||
+	  (class_bits & BIT_ALNUM && ISALNUM (c)) ||
+	  (class_bits & BIT_ALPHA && ISALPHA (c)) ||
+	  (class_bits & BIT_SPACE && ISSPACE (c)) ||
+	  (class_bits & BIT_WORD  && ISWORD  (c)) ||
+	  ((class_bits & BIT_UPPER) &&
+	   (ISUPPER (c) || (corig != c &&
+			    c == downcase (corig) && ISLOWER (c)))) ||
+	  ((class_bits & BIT_LOWER) &&
+	   (ISLOWER (c) || (corig != c &&
+			    c == upcase (corig) && ISUPPER(c)))) ||
+	  (class_bits & BIT_PUNCT && ISPUNCT (c)) ||
+	  (class_bits & BIT_GRAPH && ISGRAPH (c)) ||
+	  (class_bits & BIT_PRINT && ISPRINT (c)))
+	return !not;
+
+      for (p = *pp; rtp < p; rtp += 2 * 3)
+	{
+	  EXTRACT_CHARACTER (range_start, rtp);
+	  EXTRACT_CHARACTER (range_end, rtp + 3);
+	  if (range_start <= c && c <= range_end)
+	    return !not;
+	}
+    }
+#endif /* emacs */
+  return not;
+}
+
 /* Non-zero if "p1 matches something" implies "p2 fails".  */
 static int
 mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1,
@@ -4718,22 +4767,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1,
 	else if ((re_opcode_t) *p1 == charset
 		 || (re_opcode_t) *p1 == charset_not)
 	  {
-	    int not = (re_opcode_t) *p1 == charset_not;
-
-	    /* Test if C is listed in charset (or charset_not)
-	       at `p1'.  */
-	    if (! multibyte || IS_REAL_ASCII (c))
-	      {
-		if (c < CHARSET_BITMAP_SIZE (p1) * BYTEWIDTH
-		    && p1[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
-		  not = !not;
-	      }
-	    else if (CHARSET_RANGE_TABLE_EXISTS_P (p1))
-	      CHARSET_LOOKUP_RANGE_TABLE (not, c, p1);
-
-	    /* `not' is equal to 1 if c would match, which means
-	       that we can't change to pop_failure_jump.  */
-	    if (!not)
+	    if (!execute_charset (&p1, c, c, !multibyte))
 	      {
 		DEBUG_PRINT ("	 No match => fast loop.\n");
 		return 1;
@@ -5439,32 +5473,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	case charset_not:
 	  {
 	    register unsigned int c, corig;
-	    boolean not = (re_opcode_t) *(p - 1) == charset_not;
 	    int len;
 
-	    /* Start of actual range_table, or end of bitmap if there is no
-	       range table.  */
-	    re_char *range_table UNINIT;
-
-	    /* Nonzero if there is a range table.  */
-	    int range_table_exists;
-
-	    /* Number of ranges of range table.  This is not included
-	       in the initial byte-length of the command.  */
-	    int count = 0;
-
 	    /* Whether matching against a unibyte character.  */
 	    boolean unibyte_char = false;
 
-	    DEBUG_PRINT ("EXECUTING charset%s.\n", not ? "_not" : "");
-
-	    range_table_exists = CHARSET_RANGE_TABLE_EXISTS_P (&p[-1]);
-
-	    if (range_table_exists)
-	      {
-		range_table = CHARSET_RANGE_TABLE (&p[-1]); /* Past the bitmap.  */
-		EXTRACT_NUMBER_AND_INCR (count, range_table);
-	      }
+	    DEBUG_PRINT ("EXECUTING charset%s.\n",
+			 (re_opcode_t) *(p - 1) == charset_not ? "_not" : "");
 
 	    PREFETCH ();
 	    corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
@@ -5498,47 +5513,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 		  unibyte_char = true;
 	      }
 
-	    if (unibyte_char && c < (1 << BYTEWIDTH))
-	      {			/* Lookup bitmap.  */
-		/* Cast to `unsigned' instead of `unsigned char' in
-		   case the bit list is a full 32 bytes long.  */
-		if (c < (unsigned) (CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH)
-		    && p[1 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
-		  not = !not;
-	      }
-#ifdef emacs
-	    else if (range_table_exists)
-	      {
-		int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]);
-
-		if (  (class_bits & BIT_LOWER
-		       && (ISLOWER (c)
-			   || (corig != c
-			       && c == upcase (corig) && ISUPPER(c))))
-		    | (class_bits & BIT_MULTIBYTE)
-		    | (class_bits & BIT_PUNCT && ISPUNCT (c))
-		    | (class_bits & BIT_SPACE && ISSPACE (c))
-		    | (class_bits & BIT_UPPER
-		       && (ISUPPER (c)
-			   || (corig != c
-			       && c == downcase (corig) && ISLOWER (c))))
-		    | (class_bits & BIT_WORD  && ISWORD  (c))
-		    | (class_bits & BIT_ALPHA && ISALPHA (c))
-		    | (class_bits & BIT_ALNUM && ISALNUM (c))
-		    | (class_bits & BIT_GRAPH && ISGRAPH (c))
-		    | (class_bits & BIT_PRINT && ISPRINT (c)))
-		  not = !not;
-		else
-		  CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count);
-	      }
-#endif /* emacs */
-
-	    if (range_table_exists)
-	      p = CHARSET_RANGE_TABLE_END (range_table, count);
-	    else
-	      p += CHARSET_BITMAP_SIZE (&p[-1]) + 1;
-
-	    if (!not) goto fail;
+	    p -= 1;
+	    if (!execute_charset (&p, c, corig, unibyte_char))
+	      goto fail;
 
 	    d += len;
 	  }
diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
new file mode 100644
index 0000000..00165ab
--- /dev/null
+++ b/test/src/regex-tests.el
@@ -0,0 +1,92 @@
+;;; regex-tests.el --- tests for regex.c functions -*- lexical-binding: t -*-
+
+;; Copyright (C) 2015-2016 Free Software Foundation, Inc.
+
+;; This file is part of GNU Emacs.
+
+;; GNU Emacs is free software: you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; GNU Emacs is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.
+
+;;; Code:
+
+(require 'ert)
+
+(ert-deftest regex-word-cc-fallback-test ()
+  "Test that ‘[[:cc:]]*x’ matches ‘x’ (bug#24020).
+
+Test that a regex of the form \"[[:cc:]]*x\" where CC is
+a character class which matches a multibyte character X, matches
+string \"x\".
+
+For example, ‘[[:word:]]*\u2620’ regex (note: \u2620 is a word
+character) must match a string \"\u2420\"."
+  (dolist (class '("[[:word:]]" "\\sw"))
+    (dolist (repeat '("*" "+"))
+      (dolist (suffix '("" "b" "bar" "\u2620"))
+        (dolist (string '("" "foo"))
+          (when (not (and (string-equal repeat "+")
+                          (string-equal string "")))
+            (should (string-match (concat "^" class repeat suffix "$")
+                                  (concat string suffix)))))))))
+
+(defun regex--test-cc (name matching not-matching)
+  (should (string-match-p (concat "^[[:" name ":]]*$") matching))
+  (should (string-match-p (concat "^[[:" name ":]]*?\u2622$")
+                          (concat matching "\u2622")))
+  (should (string-match-p (concat "^[^[:" name ":]]*$") not-matching))
+  (should (string-match-p (concat "^[^[:" name ":]]*\u2622$")
+                          (concat not-matching "\u2622")))
+  (with-temp-buffer
+    (insert matching)
+    (let ((p (point)))
+      (insert not-matching)
+      (goto-char (point-min))
+      (skip-chars-forward (concat "[:" name ":]"))
+      (should (equal (point) p))
+      (skip-chars-forward (concat "^[:" name ":]"))
+      (should (equal (point) (point-max)))
+      (goto-char (point-min))
+      (skip-chars-forward (concat "[:" name ":]\u2622"))
+      (should (or (equal (point) p) (equal (point) (1+ p)))))))
+
+(ert-deftest regex-character-classes ()
+  "Perform sanity test of regexes using character classes.
+
+Go over all the supported character classes and test whether the
+classes and their inversions match what they are supposed to
+match.  The test is done using `string-match-p' as well as
+`skip-chars-forward'."
+  (let (case-fold-search)
+    (regex--test-cc "alnum" "abcABC012łąka" "-, \t\n")
+    (regex--test-cc "alpha" "abcABCłąka" "-,012 \t\n")
+    (regex--test-cc "digit" "012" "abcABCłąka-, \t\n")
+    (regex--test-cc "xdigit" "0123aBc" "łąk-, \t\n")
+    (regex--test-cc "upper" "ABCŁĄKA" "abc012-, \t\n")
+    (regex--test-cc "lower" "abcłąka" "ABC012-, \t\n")
+
+    (regex--test-cc "word" "abcABC012\u2620" "-, \t\n")
+
+    (regex--test-cc "punct" ".,-" "abcABC012\u2620 \t\n")
+    (regex--test-cc "cntrl" "\1\2\t\n" ".,-abcABC012\u2620 ")
+    (regex--test-cc "graph" "abcłąka\u2620-," " \t\n\1")
+    (regex--test-cc "print" "abcłąka\u2620-, " "\t\n\1")
+
+    (regex--test-cc "space" " \t\n\u2001" "abcABCł0123")
+    (regex--test-cc "blank" " \t" "\n\u2001")
+
+    (regex--test-cc "ascii" "abcABC012 \t\n\1" "łą\u2620")
+    (regex--test-cc "nonascii" "łą\u2622" "abcABC012 \t\n\1")
+    (regex--test-cc "unibyte" "abcABC012 \t\n\1" "łą\u2622")
+    (regex--test-cc "multibyte" "łą\u2622" "abcABC012 \t\n\1")))
+
+;;; regex-tests.el ends here
-- 
2.8.0.rc3.226.g39d4020