bug#24071: [PATCH] Refactor regex character class parsing in [:name:]

unofficial mirror of bug-gnu-emacs@gnu.org 
 help / color / mirror / code / Atom feed

* bug#24071: [PATCH] Refactor regex character class parsing in [:name:]
@ 2016-07-25 22:54 Michal Nazarewicz
  2016-07-26 14:46 ` Eli Zaretskii
  2016-08-02 16:06 ` Michal Nazarewicz
  0 siblings, 2 replies; 18+ messages in thread
From: Michal Nazarewicz @ 2016-07-25 22:54 UTC (permalink / raw)
  To: 24071

re_wctype function is used in three separate places and in all of
those places almost exact code extracting the name from [:name:]
surrounds it.  Furthermore, re_wctype requires a NUL-terminated
string, so the name of the character class is copied to a temporary
buffer.

The code duplication and unnecessary memory copying can be avoided by
pushing the responsibility of parsing the whole [:name:] sequence to
the function.

Furthermore, since now the function has access to the length of the
character class name (since it’s doing the parsing), it can take
advantage of that information in skipping some string comparisons and
using a constant-length memcmp instead of strcmp which needs to take
care of NUL bytes.

* src/regex.c (re_wctype): Delete function.  Replace it with:
(re_wctype_parse): New function which parses a whole [:name:] string
and returns a RECC_* constant or -1 if the string is not of [:name:]
format.
(regex_compile): Use re_wctype_parse.
* src/syntax.c (skip_chars): Use re_wctype_parse.
---
 src/regex.c  | 312 +++++++++++++++++++++++++++++------------------------------
 src/regex.h  |  14 +--
 src/syntax.c |  96 +++++-------------
 3 files changed, 183 insertions(+), 239 deletions(-)

 Unless there is some opposition to this patch, I’ll submit it in
 a week or so.

diff --git a/src/regex.c b/src/regex.c
index 297bf71..50e6862 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -1969,29 +1969,98 @@ struct range_table_work_area
 \f
 #if ! WIDE_CHAR_SUPPORT
 
-/* Map a string to the char class it names (if any).  */
+/* Parse a character class, i.e. string such as "[:name:]".  *strp
+   points to the string to be parsed and limit is length, in bytes, of
+   that string.
+
+   If *strp point to a string that begins with "[:name:]", where name is
+   a non-empty sequence of lower case letters, *strp will be advanced past the
+   closing square bracket and RECC_* constant which maps to the name will be
+   returned.  If name is not a valid character class name zero, or RECC_ERROR,
+   is returned.
+
+   Otherwise, if *strp doesn’t begin with "[:name:]", -1 is returned.
+
+   The function can be used on ASCII as well as multitude strings.
+ */
 re_wctype_t
-re_wctype (const_re_char *str)
+re_wctype_parse (const unsigned char **strp, unsigned limit)
 {
-  const char *string = (const char *) str;
-  if      (STREQ (string, "alnum"))	return RECC_ALNUM;
-  else if (STREQ (string, "alpha"))	return RECC_ALPHA;
-  else if (STREQ (string, "word"))	return RECC_WORD;
-  else if (STREQ (string, "ascii"))	return RECC_ASCII;
-  else if (STREQ (string, "nonascii"))	return RECC_NONASCII;
-  else if (STREQ (string, "graph"))	return RECC_GRAPH;
-  else if (STREQ (string, "lower"))	return RECC_LOWER;
-  else if (STREQ (string, "print"))	return RECC_PRINT;
-  else if (STREQ (string, "punct"))	return RECC_PUNCT;
-  else if (STREQ (string, "space"))	return RECC_SPACE;
-  else if (STREQ (string, "upper"))	return RECC_UPPER;
-  else if (STREQ (string, "unibyte"))	return RECC_UNIBYTE;
-  else if (STREQ (string, "multibyte"))	return RECC_MULTIBYTE;
-  else if (STREQ (string, "digit"))	return RECC_DIGIT;
-  else if (STREQ (string, "xdigit"))	return RECC_XDIGIT;
-  else if (STREQ (string, "cntrl"))	return RECC_CNTRL;
-  else if (STREQ (string, "blank"))	return RECC_BLANK;
-  else return 0;
+  const char *beg = (const char *)*strp, *end, *it;
+
+  if (limit < 5 || beg[0] != '[' || beg[1] != ':')
+    return -1;
+
+  end = beg + limit - 2;  /* ‘- 2’ for the closing ":]" */
+  beg += 2;  /* skip opening "[:" */
+  it = beg;
+  while (it != end && *it >= 'a' && *it <= 'z')
+    ++it;
+  if (it[0] != ':' || it[1] != ']')
+    return -1;
+
+  *strp = (const unsigned char *)(it + 2);
+
+  /* Sort tests in the length=five case by frequency the classes to minimise
+     number of times we fail the comparison.  The frequencies of character class
+     names used in Emacs sources as of 2016-07-15:
+
+     $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + |
+           sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr
+         213 [:alnum:]
+         104 [:alpha:]
+          62 [:space:]
+          39 [:digit:]
+          36 [:blank:]
+          26 [:upper:]
+          24 [:word:]
+          21 [:lower:]
+          10 [:punct:]
+          10 [:ascii:]
+           9 [:xdigit:]
+           4 [:nonascii:]
+           4 [:graph:]
+           2 [:print:]
+           2 [:cntrl:]
+           1 [:ff:]
+
+     If you update this list, consider also updating chain of or’ed conditions
+     in execute_charset function.
+   */
+
+  switch (it - beg) {
+  case 4:
+    if (!memcmp (beg, "word", 4))      return RECC_WORD;
+    break;
+  case 5:
+    if (!memcmp (beg, "alnum", 5))     return RECC_ALNUM;
+    if (!memcmp (beg, "alpha", 5))     return RECC_ALPHA;
+    if (!memcmp (beg, "space", 5))     return RECC_SPACE;
+    if (!memcmp (beg, "digit", 5))     return RECC_DIGIT;
+    if (!memcmp (beg, "blank", 5))     return RECC_BLANK;
+    if (!memcmp (beg, "upper", 5))     return RECC_UPPER;
+    if (!memcmp (beg, "lower", 5))     return RECC_LOWER;
+    if (!memcmp (beg, "punct", 5))     return RECC_PUNCT;
+    if (!memcmp (beg, "ascii", 5))     return RECC_ASCII;
+    if (!memcmp (beg, "graph", 5))     return RECC_GRAPH;
+    if (!memcmp (beg, "print", 5))     return RECC_PRINT;
+    if (!memcmp (beg, "cntrl", 5))     return RECC_CNTRL;
+    break;
+  case 6:
+    if (!memcmp (beg, "xdigit", 6))    return RECC_XDIGIT;
+    break;
+  case 7:
+    if (!memcmp (beg, "unibyte", 7))   return RECC_UNIBYTE;
+    break;
+  case 8:
+    if (!memcmp (beg, "nonascii", 8))  return RECC_NONASCII;
+    break;
+  case 9:
+    if (!memcmp (beg, "multibyte", 9)) return RECC_MULTIBYTE;
+    break;
+  }
+
+  return RECC_ERROR;
 }
 
 /* True if CH is in the char class CC.  */
@@ -2776,10 +2845,74 @@ regex_compile (const_re_char *pattern, size_t size, reg_syntax_t syntax,
 	      {
 		boolean escaped_char = false;
 		const unsigned char *p2 = p;
+		re_wctype_t cc;
 		re_wchar_t ch;
 
 		if (p == pend) FREE_STACK_RETURN (REG_EBRACK);
 
+		/* See if we're at the beginning of a possible character
+		   class.  */
+		if (syntax & RE_CHAR_CLASSES &&
+		    (cc = re_wctype_parse(&p, pend - p)) != -1)
+		  {
+		    if (cc == 0)
+		      FREE_STACK_RETURN (REG_ECTYPE);
+
+		    if (p == pend)
+		      FREE_STACK_RETURN (REG_EBRACK);
+
+#ifndef emacs
+		    for (ch = 0; ch < (1 << BYTEWIDTH); ++ch)
+		      if (re_iswctype (btowc (ch), cc))
+			{
+			  c = TRANSLATE (ch);
+			  if (c < (1 << BYTEWIDTH))
+			    SET_LIST_BIT (c);
+			}
+#else  /* emacs */
+		    /* Most character classes in a multibyte match just set
+		       a flag.  Exceptions are is_blank, is_digit, is_cntrl, and
+		       is_xdigit, since they can only match ASCII characters.
+		       We don't need to handle them for multibyte.  They are
+		       distinguished by a negative wctype.  */
+
+		    /* Setup the gl_state object to its buffer-defined value.
+		       This hardcodes the buffer-global syntax-table for ASCII
+		       chars, while the other chars will obey syntax-table
+		       properties.  It's not ideal, but it's the way it's been
+		       done until now.  */
+		    SETUP_BUFFER_SYNTAX_TABLE ();
+
+		    for (ch = 0; ch < 256; ++ch)
+		      {
+			c = RE_CHAR_TO_MULTIBYTE (ch);
+			if (! CHAR_BYTE8_P (c)
+			    && re_iswctype (c, cc))
+			  {
+			    SET_LIST_BIT (ch);
+			    c1 = TRANSLATE (c);
+			    if (c1 == c)
+			      continue;
+			    if (ASCII_CHAR_P (c1))
+			      SET_LIST_BIT (c1);
+			    else if ((c1 = RE_CHAR_TO_UNIBYTE (c1)) >= 0)
+			      SET_LIST_BIT (c1);
+			  }
+		      }
+		    SET_RANGE_TABLE_WORK_AREA_BIT
+		      (range_table_work, re_wctype_to_bit (cc));
+#endif	/* emacs */
+		    /* In most cases the matching rule for char classes only
+		       uses the syntax table for multibyte chars, so that the
+		       content of the syntax-table is not hardcoded in the
+		       range_table.  SPACE and WORD are the two exceptions.  */
+		    if ((1 << cc) & ((1 << RECC_SPACE) | (1 << RECC_WORD)))
+		      bufp->used_syntax = 1;
+
+		    /* Repeat the loop. */
+		    continue;
+		  }
+
 		/* Don't translate yet.  The range TRANSLATE(X..Y) cannot
 		   always be determined from TRANSLATE(X) and TRANSLATE(Y)
 		   So the translation is done later in a loop.  Example:
@@ -2803,119 +2936,6 @@ regex_compile (const_re_char *pattern, size_t size, reg_syntax_t syntax,
 		      break;
 		  }
 
-		/* See if we're at the beginning of a possible character
-		   class.  */
-
-		if (!escaped_char &&
-		    syntax & RE_CHAR_CLASSES && c == '[' && *p == ':')
-		  {
-		    /* Leave room for the null.  */
-		    unsigned char str[CHAR_CLASS_MAX_LENGTH + 1];
-		    const unsigned char *class_beg;
-
-		    PATFETCH (c);
-		    c1 = 0;
-		    class_beg = p;
-
-		    /* If pattern is `[[:'.  */
-		    if (p == pend) FREE_STACK_RETURN (REG_EBRACK);
-
-		    for (;;)
-		      {
-		        PATFETCH (c);
-		        if ((c == ':' && *p == ']') || p == pend)
-		          break;
-			if (c1 < CHAR_CLASS_MAX_LENGTH)
-			  str[c1++] = c;
-			else
-			  /* This is in any case an invalid class name.  */
-			  str[0] = '\0';
-		      }
-		    str[c1] = '\0';
-
-		    /* If isn't a word bracketed by `[:' and `:]':
-		       undo the ending character, the letters, and
-		       leave the leading `:' and `[' (but set bits for
-		       them).  */
-		    if (c == ':' && *p == ']')
-		      {
-			re_wctype_t cc = re_wctype (str);
-
-			if (cc == 0)
-			  FREE_STACK_RETURN (REG_ECTYPE);
-
-                        /* Throw away the ] at the end of the character
-                           class.  */
-                        PATFETCH (c);
-
-                        if (p == pend) FREE_STACK_RETURN (REG_EBRACK);
-
-#ifndef emacs
-			for (ch = 0; ch < (1 << BYTEWIDTH); ++ch)
-			  if (re_iswctype (btowc (ch), cc))
-			    {
-			      c = TRANSLATE (ch);
-			      if (c < (1 << BYTEWIDTH))
-				SET_LIST_BIT (c);
-			    }
-#else  /* emacs */
-			/* Most character classes in a multibyte match
-			   just set a flag.  Exceptions are is_blank,
-			   is_digit, is_cntrl, and is_xdigit, since
-			   they can only match ASCII characters.  We
-			   don't need to handle them for multibyte.
-			   They are distinguished by a negative wctype.  */
-
-			/* Setup the gl_state object to its buffer-defined
-			   value.  This hardcodes the buffer-global
-			   syntax-table for ASCII chars, while the other chars
-			   will obey syntax-table properties.  It's not ideal,
-			   but it's the way it's been done until now.  */
-			SETUP_BUFFER_SYNTAX_TABLE ();
-
-			for (ch = 0; ch < 256; ++ch)
-			  {
-			    c = RE_CHAR_TO_MULTIBYTE (ch);
-			    if (! CHAR_BYTE8_P (c)
-				&& re_iswctype (c, cc))
-			      {
-				SET_LIST_BIT (ch);
-				c1 = TRANSLATE (c);
-				if (c1 == c)
-				  continue;
-				if (ASCII_CHAR_P (c1))
-				  SET_LIST_BIT (c1);
-				else if ((c1 = RE_CHAR_TO_UNIBYTE (c1)) >= 0)
-				  SET_LIST_BIT (c1);
-			      }
-			  }
-			SET_RANGE_TABLE_WORK_AREA_BIT
-			  (range_table_work, re_wctype_to_bit (cc));
-#endif	/* emacs */
-			/* In most cases the matching rule for char classes
-			   only uses the syntax table for multibyte chars,
-			   so that the content of the syntax-table is not
-			   hardcoded in the range_table.  SPACE and WORD are
-			   the two exceptions.  */
-			if ((1 << cc) & ((1 << RECC_SPACE) | (1 << RECC_WORD)))
-			  bufp->used_syntax = 1;
-
-			/* Repeat the loop. */
-			continue;
-		      }
-		    else
-		      {
-			/* Go back to right after the "[:".  */
-			p = class_beg;
-			SET_LIST_BIT ('[');
-
-			/* Because the `:' may start the range, we
-			   can't simply set bit and repeat the loop.
-			   Instead, just set it to C and handle below.  */
-			c = ':';
-		      }
-		  }
-
 		if (p < pend && p[0] == '-' && p[1] != ']')
 		  {
 
@@ -4659,28 +4679,8 @@ execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte)
       re_wchar_t range_start, range_end;
 
   /* Sort tests by the most commonly used classes with some adjustment to which
-     tests are easiest to perform.  Frequencies of character class names used in
-     Emacs sources as of 2016-07-15:
-
-     $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + |
-           sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr
-         213 [:alnum:]
-         104 [:alpha:]
-          62 [:space:]
-          39 [:digit:]
-          36 [:blank:]
-          26 [:upper:]
-          24 [:word:]
-          21 [:lower:]
-          10 [:punct:]
-          10 [:ascii:]
-           9 [:xdigit:]
-           4 [:nonascii:]
-           4 [:graph:]
-           2 [:print:]
-           2 [:cntrl:]
-           1 [:ff:]
-   */
+     tests are easiest to perform.  Take a look at comment in re_wctype_parse
+     for table with frequencies of character class names. */
 
       if ((class_bits & BIT_MULTIBYTE) ||
 	  (class_bits & BIT_ALNUM && ISALNUM (c)) ||
diff --git a/src/regex.h b/src/regex.h
index 817167a..01b659a 100644
--- a/src/regex.h
+++ b/src/regex.h
@@ -585,25 +585,13 @@ extern void regfree (regex_t *__preg);
 /* Solaris 2.5 has a bug: <wchar.h> must be included before <wctype.h>.  */
 # include <wchar.h>
 # include <wctype.h>
-#endif
 
-#if WIDE_CHAR_SUPPORT
-/* The GNU C library provides support for user-defined character classes
-   and the functions from ISO C amendment 1.  */
-# ifdef CHARCLASS_NAME_MAX
-#  define CHAR_CLASS_MAX_LENGTH CHARCLASS_NAME_MAX
-# else
-/* This shouldn't happen but some implementation might still have this
-   problem.  Use a reasonable default value.  */
-#  define CHAR_CLASS_MAX_LENGTH 256
-# endif
 typedef wctype_t re_wctype_t;
 typedef wchar_t re_wchar_t;
 # define re_wctype wctype
 # define re_iswctype iswctype
 # define re_wctype_to_bit(cc) 0
 #else
-# define CHAR_CLASS_MAX_LENGTH  9 /* Namely, `multibyte'.  */
 # ifndef emacs
 #  define btowc(c) c
 # endif
@@ -621,7 +609,7 @@ typedef enum { RECC_ERROR = 0,
 } re_wctype_t;
 
 extern char re_iswctype (int ch,    re_wctype_t cc);
-extern re_wctype_t re_wctype (const unsigned char* str);
+extern re_wctype_t re_wctype_parse (const unsigned char **strp, unsigned limit);
 
 typedef int re_wchar_t;
 
diff --git a/src/syntax.c b/src/syntax.c
index f8d987b..667de40 100644
--- a/src/syntax.c
+++ b/src/syntax.c
@@ -1691,44 +1691,22 @@ skip_chars (bool forwardp, Lisp_Object string, Lisp_Object lim,
       /* At first setup fastmap.  */
       while (i_byte < size_byte)
 	{
-	  c = str[i_byte++];
-
-	  if (handle_iso_classes && c == '['
-	      && i_byte < size_byte
-	      && str[i_byte] == ':')
+	  if (handle_iso_classes)
 	    {
-	      const unsigned char *class_beg = str + i_byte + 1;
-	      const unsigned char *class_end = class_beg;
-	      const unsigned char *class_limit = str + size_byte - 2;
-	      /* Leave room for the null.  */
-	      unsigned char class_name[CHAR_CLASS_MAX_LENGTH + 1];
-	      re_wctype_t cc;
-
-	      if (class_limit - class_beg > CHAR_CLASS_MAX_LENGTH)
-		class_limit = class_beg + CHAR_CLASS_MAX_LENGTH;
-
-	      while (class_end < class_limit
-		     && *class_end >= 'a' && *class_end <= 'z')
-		class_end++;
-
-	      if (class_end == class_beg
-		  || *class_end != ':' || class_end[1] != ']')
-		goto not_a_class_name;
-
-	      memcpy (class_name, class_beg, class_end - class_beg);
-	      class_name[class_end - class_beg] = 0;
-
-	      cc = re_wctype (class_name);
+	      const unsigned char *ch = str + i_byte;
+	      re_wctype_t cc = re_wctype_parse (&ch, size_byte - i_byte);
 	      if (cc == 0)
 		error ("Invalid ISO C character class");
-
-	      iso_classes = Fcons (make_number (cc), iso_classes);
-
-	      i_byte = class_end + 2 - str;
-	      continue;
+	      if (cc != -1)
+		{
+		  iso_classes = Fcons (make_number (cc), iso_classes);
+		  i_byte = ch - str;
+		  continue;
+		}
 	    }
 
-	not_a_class_name:
+	  c = str[i_byte++];
+
 	  if (c == '\\')
 	    {
 	      if (i_byte == size_byte)
@@ -1808,54 +1786,32 @@ skip_chars (bool forwardp, Lisp_Object string, Lisp_Object lim,
       while (i_byte < size_byte)
 	{
 	  int leading_code = str[i_byte];
-	  c = STRING_CHAR_AND_LENGTH (str + i_byte, len);
-	  i_byte += len;
 
-	  if (handle_iso_classes && c == '['
-	      && i_byte < size_byte
-	      && STRING_CHAR (str + i_byte) == ':')
+	  if (handle_iso_classes)
 	    {
-	      const unsigned char *class_beg = str + i_byte + 1;
-	      const unsigned char *class_end = class_beg;
-	      const unsigned char *class_limit = str + size_byte - 2;
-	      /* Leave room for the null.	 */
-	      unsigned char class_name[CHAR_CLASS_MAX_LENGTH + 1];
-	      re_wctype_t cc;
-
-	      if (class_limit - class_beg > CHAR_CLASS_MAX_LENGTH)
-		class_limit = class_beg + CHAR_CLASS_MAX_LENGTH;
-
-	      while (class_end < class_limit
-		     && *class_end >= 'a' && *class_end <= 'z')
-		class_end++;
-
-	      if (class_end == class_beg
-		  || *class_end != ':' || class_end[1] != ']')
-		goto not_a_class_name_multibyte;
-
-	      memcpy (class_name, class_beg, class_end - class_beg);
-	      class_name[class_end - class_beg] = 0;
-
-	      cc = re_wctype (class_name);
+	      const unsigned char *ch = str + i_byte;
+	      re_wctype_t cc = re_wctype_parse (&ch, size_byte - i_byte);
 	      if (cc == 0)
 		error ("Invalid ISO C character class");
-
-	      iso_classes = Fcons (make_number (cc), iso_classes);
-
-	      i_byte = class_end + 2 - str;
-	      continue;
+	      if (cc != -1)
+		{
+		  iso_classes = Fcons (make_number (cc), iso_classes);
+		  i_byte = ch - str;
+		  continue;
+		}
 	    }
 
-	not_a_class_name_multibyte:
-	  if (c == '\\')
+	  if (leading_code== '\\')
 	    {
-	      if (i_byte == size_byte)
+	      if (++i_byte == size_byte)
 		break;
 
 	      leading_code = str[i_byte];
-	      c = STRING_CHAR_AND_LENGTH (str + i_byte, len);
-	      i_byte += len;
 	    }
+	  c = STRING_CHAR_AND_LENGTH (str + i_byte, len);
+	  i_byte += len;
+
+
 	  /* Treat `-' as range character only if another character
 	     follows.  */
 	  if (i_byte + 1 < size_byte
-- 
2.8.0.rc3.226.g39d4020






^ permalink raw reply related	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH] Refactor regex character class parsing in [:name:]
  2016-07-25 22:54 bug#24071: [PATCH] Refactor regex character class parsing in [:name:] Michal Nazarewicz
@ 2016-07-26 14:46 ` Eli Zaretskii
  2016-07-27 15:29   ` Michal Nazarewicz
                     ` (2 more replies)
  2016-08-02 16:06 ` Michal Nazarewicz
  1 sibling, 3 replies; 18+ messages in thread
From: Eli Zaretskii @ 2016-07-26 14:46 UTC (permalink / raw)
  To: Michal Nazarewicz, Dima Kogan; +Cc: 24071

> From: Michal Nazarewicz <mina86@mina86.com>
> Date: Tue, 26 Jul 2016 00:54:05 +0200
> 
> re_wctype function is used in three separate places and in all of
> those places almost exact code extracting the name from [:name:]
> surrounds it.  Furthermore, re_wctype requires a NUL-terminated
> string, so the name of the character class is copied to a temporary
> buffer.
> 
> The code duplication and unnecessary memory copying can be avoided by
> pushing the responsibility of parsing the whole [:name:] sequence to
> the function.
> 
> Furthermore, since now the function has access to the length of the
> character class name (since it’s doing the parsing), it can take
> advantage of that information in skipping some string comparisons and
> using a constant-length memcmp instead of strcmp which needs to take
> care of NUL bytes.

Thanks.

If we are going to make some serious refactoring in regex.c, I think
we should start with having a test suite for it.  The
dima_regex_embedded_modifiers branch, created by Dima Kogan (CC'ed) in
the Emacs repository includes a suite taken from glibc.  Dima, could
you perhaps merge the parts of the test suite that can already be used
to the master branch, so that we could use them to verify changes in
regex.c?





^ permalink raw reply	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH] Refactor regex character class parsing in [:name:]
  2016-07-26 14:46 ` Eli Zaretskii
@ 2016-07-27 15:29   ` Michal Nazarewicz
  2016-07-27 16:28     ` Eli Zaretskii
  2016-07-27 16:50   ` bug#24071: [PATCH 1/7] New regex tests imported from glibc 2.21 Michal Nazarewicz
  2016-07-29  5:31   ` bug#24071: [PATCH] " Dima Kogan
  2 siblings, 1 reply; 18+ messages in thread
From: Michal Nazarewicz @ 2016-07-27 15:29 UTC (permalink / raw)
  To: Eli Zaretskii, Dima Kogan; +Cc: 24071

On Tue, Jul 26 2016, Eli Zaretskii wrote:
>> From: Michal Nazarewicz <mina86@mina86.com>
>> Date: Tue, 26 Jul 2016 00:54:05 +0200
>> 
>> re_wctype function is used in three separate places and in all of
>> those places almost exact code extracting the name from [:name:]
>> surrounds it.  Furthermore, re_wctype requires a NUL-terminated
>> string, so the name of the character class is copied to a temporary
>> buffer.
>> 
>> The code duplication and unnecessary memory copying can be avoided by
>> pushing the responsibility of parsing the whole [:name:] sequence to
>> the function.
>> 
>> Furthermore, since now the function has access to the length of the
>> character class name (since it’s doing the parsing), it can take
>> advantage of that information in skipping some string comparisons and
>> using a constant-length memcmp instead of strcmp which needs to take
>> care of NUL bytes.
>
> Thanks.
>
> If we are going to make some serious refactoring in regex.c, I think
> we should start with having a test suite for it.

I agree.  Which is why I started test/src/regex-tests.el¹.  Since this
patch touches only character classes I limited the tests to character
classes.

¹ If fact, the bug I’ve fixed with the previous patch was discovered
precisely because I’ve written tests for this patch.

> The dima_regex_embedded_modifiers branch, created by Dima Kogan
> (CC'ed) in the Emacs repository includes a suite taken from glibc.
> Dima, could you perhaps merge the parts of the test suite that can
> already be used to the master branch, so that we could use them to
> verify changes in regex.c?

This looks relatively straightforward;  I can take care of it.  I’ll
send a link to the result soon.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»





^ permalink raw reply	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH] Refactor regex character class parsing in [:name:]
  2016-07-27 15:29   ` Michal Nazarewicz
@ 2016-07-27 16:28     ` Eli Zaretskii
  2016-07-27 18:30       ` Michal Nazarewicz
  0 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2016-07-27 16:28 UTC (permalink / raw)
  To: Michal Nazarewicz; +Cc: lists, 24071

> From: Michal Nazarewicz <mina86@mina86.com>
> Cc: 24071@debbugs.gnu.org
> Date: Wed, 27 Jul 2016 17:29:04 +0200
> 
> > If we are going to make some serious refactoring in regex.c, I think
> > we should start with having a test suite for it.
> 
> I agree.  Which is why I started test/src/regex-tests.el¹.  Since this
> patch touches only character classes I limited the tests to character
> classes.

I know.  What I wrote was not a complaint about the past, it was a
suggestion for the future.  A single localized change doesn't yet
justify importing a large test suite.  But this later patch looks like
a beginning of a series of refactoring (is it?), hopefully followed by
more features, so I thought we should have a firm ground first.

> > The dima_regex_embedded_modifiers branch, created by Dima Kogan
> > (CC'ed) in the Emacs repository includes a suite taken from glibc.
> > Dima, could you perhaps merge the parts of the test suite that can
> > already be used to the master branch, so that we could use them to
> > verify changes in regex.c?
> 
> This looks relatively straightforward;  I can take care of it.  I’ll
> send a link to the result soon.

Thanks.





^ permalink raw reply	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH 1/7] New regex tests imported from glibc 2.21
  2016-07-26 14:46 ` Eli Zaretskii
  2016-07-27 15:29   ` Michal Nazarewicz
@ 2016-07-27 16:50   ` Michal Nazarewicz
  2016-07-27 16:50     ` bug#24071: [PATCH 2/7] Added driver for the regex tests Michal Nazarewicz
                       ` (5 more replies)
  2016-07-29  5:31   ` bug#24071: [PATCH] " Dima Kogan
  2 siblings, 6 replies; 18+ messages in thread
From: Michal Nazarewicz @ 2016-07-27 16:50 UTC (permalink / raw)
  To: 24071; +Cc: Dima Kogan

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 63528 bytes --]

From: Dima Kogan <dima@secretsauce.net>

* test/src/regex-resources/BOOST.tests:
* test/src/regex-resources/PCRE.tests:
* test/src/regex-resources/PTESTS:
* test/src/regex-resources/TESTS:
New test data files

[mina86@mina86.com: Moved files from test/src/regex/* to test/src/*.]
---
 test/src/regex-resources/BOOST.tests |  829 ++++++++++++
 test/src/regex-resources/PCRE.tests  | 2386 ++++++++++++++++++++++++++++++++++
 test/src/regex-resources/PTESTS      |  341 +++++
 test/src/regex-resources/TESTS       |  167 +++
 4 files changed, 3723 insertions(+)
 create mode 100644 test/src/regex-resources/BOOST.tests
 create mode 100644 test/src/regex-resources/PCRE.tests
 create mode 100644 test/src/regex-resources/PTESTS
 create mode 100644 test/src/regex-resources/TESTS

diff --git a/test/src/regex-resources/BOOST.tests b/test/src/regex-resources/BOOST.tests
new file mode 100644
index 0000000..98fd3b6
--- /dev/null
+++ b/test/src/regex-resources/BOOST.tests
@@ -0,0 +1,829 @@
+; 
+; 
+; this file contains a script of tests to run through regress.exe
+;
+; comments start with a semicolon and proceed to the end of the line
+;
+; changes to regular expression compile flags start with a "-" as the first
+; non-whitespace character and consist of a list of the printable names
+; of the flags, for example "match_default"
+;
+; Other lines contain a test to perform using the current flag status
+; the first token contains the expression to compile, the second the string
+; to match it against. If the second string is "!" then the expression should
+; not compile, that is the first string is an invalid regular expression.
+; This is then followed by a list of integers that specify what should match,
+; each pair represents the starting and ending positions of a subexpression
+; starting with the zeroth subexpression (the whole match).
+; A value of -1 indicates that the subexpression should not take part in the
+; match at all, if the first value is -1 then no part of the expression should
+; match the string.
+;
+; Tests taken from BOOST testsuite and adapted to glibc regex.
+;
+; Boost Software License - Version 1.0 - August 17th, 2003
+;
+; Permission is hereby granted, free of charge, to any person or organization
+; obtaining a copy of the software and accompanying documentation covered by
+; this license (the "Software") to use, reproduce, display, distribute,
+; execute, and transmit the Software, and to prepare derivative works of the
+; Software, and to permit third-parties to whom the Software is furnished to
+; do so, all subject to the following:
+;
+; The copyright notices in the Software and this entire statement, including
+; the above license grant, this restriction and the following disclaimer,
+; must be included in all copies of the Software, in whole or in part, and
+; all derivative works of the Software, unless such copies or derivative
+; works are solely in the form of machine-executable object code generated by
+; a source language processor.
+;
+; THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+; IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+; FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
+; SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
+; FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
+; ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+; DEALINGS IN THE SOFTWARE.
+;
+
+- match_default normal REG_EXTENDED
+
+;
+; try some really simple literals:
+a a 0 1
+Z Z 0 1
+Z aaa -1 -1
+Z xxxxZZxxx 4 5
+
+; and some simple brackets:
+(a) zzzaazz 3 4 3 4
+() zzz 0 0 0 0
+() "" 0 0 0 0
+( !
+) ) 0 1
+(aa !
+aa) baa)b 1 4
+a b -1 -1
+\(\) () 0 2
+\(a\) (a) 0 3
+\() () 0 2
+(\) !
+p(a)rameter ABCparameterXYZ 3 12 4 5
+[pq](a)rameter ABCparameterXYZ 3 12 4 5
+
+; now try escaped brackets:
+- match_default bk_parens REG_BASIC
+\(a\) zzzaazz 3 4 3 4
+\(\) zzz 0 0 0 0
+\(\) "" 0 0 0 0
+\( !
+\) !
+\(aa !
+aa\) !
+() () 0 2
+(a) (a) 0 3
+(\) !
+\() !
+
+; now move on to "." wildcards
+- match_default normal REG_EXTENDED REG_STARTEND
+. a 0 1
+. \n 0 1
+. \r 0 1
+. \0 0 1
+
+;
+; now move on to the repetion ops,
+; starting with operator *
+- match_default normal REG_EXTENDED
+a* b 0 0
+ab* a 0 1
+ab* ab 0 2
+ab* sssabbbbbbsss 3 10
+ab*c* a 0 1
+ab*c* abbb 0 4
+ab*c* accc 0 4
+ab*c* abbcc 0 5
+*a !
+\<* !
+\>* !
+\n* \n\n 0 2
+\** ** 0 2
+\* * 0 1
+
+; now try operator +
+ab+ a -1 -1
+ab+ ab 0 2
+ab+ sssabbbbbbsss 3 10
+ab+c+ a -1 -1
+ab+c+ abbb -1 -1
+ab+c+ accc -1 -1
+ab+c+ abbcc 0 5
++a !
+\<+ !
+\>+ !
+\n+ \n\n 0 2
+\+ + 0 1
+\+ ++ 0 1
+\++ ++ 0 2
+
+; now try operator ?
+- match_default normal REG_EXTENDED
+a? b 0 0
+ab? a 0 1
+ab? ab 0 2
+ab? sssabbbbbbsss 3 5
+ab?c? a 0 1
+ab?c? abbb 0 2
+ab?c? accc 0 2
+ab?c? abcc 0 3
+?a !
+\<? !
+\>? !
+\n? \n\n 0 1
+\? ? 0 1
+\? ?? 0 1
+\?? ?? 0 1
+
+; now try operator {}
+- match_default normal REG_EXTENDED
+a{2} a -1 -1
+a{2} aa 0 2
+a{2} aaa 0 2
+a{2,} a -1 -1
+a{2,} aa 0 2
+a{2,} aaaaa 0 5
+a{2,4} a -1 -1
+a{2,4} aa 0 2
+a{2,4} aaa 0 3
+a{2,4} aaaa 0 4
+a{2,4} aaaaa 0 4
+a{} !
+a{2 !
+a} a} 0 2
+\{\} {} 0 2
+
+- match_default normal REG_BASIC
+a\{2\} a -1 -1
+a\{2\} aa 0 2
+a\{2\} aaa 0 2
+a\{2,\} a -1 -1
+a\{2,\} aa 0 2
+a\{2,\} aaaaa 0 5
+a\{2,4\} a -1 -1
+a\{2,4\} aa 0 2
+a\{2,4\} aaa 0 3
+a\{2,4\} aaaa 0 4
+a\{2,4\} aaaaa 0 4
+{} {} 0 2
+
+; now test the alternation operator |
+- match_default normal REG_EXTENDED
+a|b a 0 1
+a|b b 0 1
+a(b|c) ab 0 2 1 2
+a(b|c) ac 0 2 1 2
+a(b|c) ad -1 -1 -1 -1
+a\| a| 0 2
+
+; now test the set operator []
+- match_default normal REG_EXTENDED
+; try some literals first
+[abc] a 0 1
+[abc] b 0 1
+[abc] c 0 1
+[abc] d -1 -1
+[^bcd] a 0 1
+[^bcd] b -1 -1
+[^bcd] d -1 -1
+[^bcd] e 0 1
+a[b]c abc 0 3
+a[ab]c abc 0 3
+a[^ab]c adc 0 3
+a[]b]c a]c 0 3
+a[[b]c a[c 0 3
+a[-b]c a-c 0 3
+a[^]b]c adc 0 3
+a[^-b]c adc 0 3
+a[b-]c a-c 0 3
+a[b !
+a[] !
+
+; then some ranges
+[b-e] a -1 -1
+[b-e] b 0 1
+[b-e] e 0 1
+[b-e] f -1 -1
+[^b-e] a 0 1
+[^b-e] b -1 -1
+[^b-e] e -1 -1
+[^b-e] f 0 1
+a[1-3]c a2c 0 3
+a[3-1]c !
+a[1-3-5]c !
+a[1- !
+
+; and some classes
+a[[:alpha:]]c abc 0 3
+a[[:unknown:]]c !
+a[[: !
+a[[:alpha !
+a[[:alpha:] !
+a[[:alpha,:] !
+a[[:]:]]b !
+a[[:-:]]b !
+a[[:alph:]] !
+a[[:alphabet:]] !
+[[:alnum:]]+ -%@a0X_- 3 6
+[[:alpha:]]+ -%@aX_0- 3 5
+[[:blank:]]+ "a  \tb" 1 4
+[[:cntrl:]]+ a\n\tb 1 3
+[[:digit:]]+ a019b 1 4
+[[:graph:]]+ " a%b " 1 4
+[[:lower:]]+ AabC 1 3
+; This test fails with STLPort, disable for now as this is a corner case anyway...
+;[[:print:]]+ "\na b\n" 1 4
+[[:punct:]]+ " %-&\t" 1 4
+[[:space:]]+ "a \n\t\rb" 1 5
+[[:upper:]]+ aBCd 1 3
+[[:xdigit:]]+ p0f3Cx 1 5
+
+; now test flag settings:
+- escape_in_lists REG_NO_POSIX_TEST
+[\n] \n 0 1
+- REG_NO_POSIX_TEST
+
+; line anchors
+- match_default normal REG_EXTENDED
+^ab ab 0 2
+^ab xxabxx -1 -1
+ab$ ab 0 2
+ab$ abxx -1 -1
+- match_default match_not_bol match_not_eol normal REG_EXTENDED REG_NOTBOL REG_NOTEOL
+^ab ab -1 -1
+^ab xxabxx -1 -1
+ab$ ab -1 -1
+ab$ abxx -1 -1
+
+; back references
+- match_default normal REG_PERL
+a(b)\2c	!
+a(b\1)c	!
+a(b*)c\1d abbcbbd 0 7 1 3
+a(b*)c\1d abbcbd -1 -1
+a(b*)c\1d abbcbbbd -1 -1
+^(.)\1 abc -1 -1
+a([bc])\1d abcdabbd	4 8 5 6
+; strictly speaking this is at best ambiguous, at worst wrong, this is what most
+; re implimentations will match though.
+a(([bc])\2)*d abbccd 0 6 3 5 3 4
+
+a(([bc])\2)*d abbcbd -1 -1
+a((b)*\2)*d abbbd 0 5 1 4 2 3
+; perl only:
+(ab*)[ab]*\1 ababaaa 0 7 0 1
+(a)\1bcd aabcd 0 5 0 1
+(a)\1bc*d aabcd 0 5 0 1
+(a)\1bc*d aabd 0 4 0 1
+(a)\1bc*d aabcccd 0 7 0 1
+(a)\1bc*[ce]d aabcccd 0 7 0 1
+^(a)\1b(c)*cd$ aabcccd 0 7 0 1 4 5
+
+; posix only: 
+- match_default extended REG_EXTENDED
+(ab*)[ab]*\1 ababaaa 0 7 0 1
+
+;
+; word operators:
+\w a 0 1
+\w z 0 1
+\w A 0 1
+\w Z 0 1
+\w _ 0 1
+\w } -1 -1
+\w ` -1 -1
+\w [ -1 -1
+\w @ -1 -1
+; non-word:
+\W a -1 -1
+\W z -1 -1
+\W A -1 -1
+\W Z -1 -1
+\W _ -1 -1
+\W } 0 1
+\W ` 0 1
+\W [ 0 1
+\W @ 0 1
+; word start:
+\<abcd "  abcd" 2 6
+\<ab cab -1 -1
+\<ab "\nab" 1 3
+\<tag ::tag 2 5
+;word end:
+abc\> abc 0 3
+abc\> abcd -1 -1
+abc\> abc\n 0 3
+abc\> abc:: 0 3
+; word boundary:
+\babcd "  abcd" 2 6
+\bab cab -1 -1
+\bab "\nab" 1 3
+\btag ::tag 2 5
+abc\b abc 0 3
+abc\b abcd -1 -1
+abc\b abc\n 0 3
+abc\b abc:: 0 3
+; within word:
+\B ab 1 1
+a\Bb ab 0 2
+a\B ab 0 1
+a\B a -1 -1
+a\B "a " -1 -1
+
+;
+; buffer operators:
+\`abc abc 0 3
+\`abc \nabc -1 -1
+\`abc " abc" -1 -1
+abc\' abc 0 3
+abc\' abc\n -1 -1
+abc\' "abc " -1 -1
+
+;
+; now follows various complex expressions designed to try and bust the matcher:
+a(((b)))c abc 0 3 1 2 1 2 1 2
+a(b|(c))d abd 0 3 1 2 -1 -1
+a(b|(c))d acd 0 3 1 2 1 2
+a(b*|c)d abbd 0 4 1 3
+; just gotta have one DFA-buster, of course
+a[ab]{20} aaaaabaaaabaaaabaaaab 0 21
+; and an inline expansion in case somebody gets tricky
+a[ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab] aaaaabaaaabaaaabaaaab 0 21
+; and in case somebody just slips in an NFA...
+a[ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab](wee|week)(knights|night) aaaaabaaaabaaaabaaaabweeknights 0 31 21 24 24 31
+; one really big one
+1234567890123456789012345678901234567890123456789012345678901234567890 a1234567890123456789012345678901234567890123456789012345678901234567890b 1 71
+; fish for problems as brackets go past 8
+[ab][cd][ef][gh][ij][kl][mn] xacegikmoq 1 8
+[ab][cd][ef][gh][ij][kl][mn][op] xacegikmoq 1 9
+[ab][cd][ef][gh][ij][kl][mn][op][qr] xacegikmoqy 1 10
+[ab][cd][ef][gh][ij][kl][mn][op][q] xacegikmoqy 1 10
+; and as parenthesis go past 9:
+(a)(b)(c)(d)(e)(f)(g)(h) zabcdefghi 1 9 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9
+(a)(b)(c)(d)(e)(f)(g)(h)(i) zabcdefghij 1 10 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10
+(a)(b)(c)(d)(e)(f)(g)(h)(i)(j) zabcdefghijk 1 11 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11
+(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k) zabcdefghijkl 1 12 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12
+(a)d|(b)c abc 1 3 -1 -1 1 2
+_+((www)|(ftp)|(mailto)):_* "_wwwnocolon _mailto:" 12 20 13 19 -1 -1 -1 -1 13 19
+
+; subtleties of matching
+;a(b)?c\1d acd 0 3 -1 -1
+; POSIX is about the following test:
+a(b)?c\1d acd -1 -1 -1 -1
+a(b?c)+d accd 0 4 2 3
+(wee|week)(knights|night) weeknights 0 10 0 3 3 10
+.* abc 0 3
+a(b|(c))d abd 0 3 1 2 -1 -1
+a(b|(c))d acd 0 3 1 2 1 2
+a(b*|c|e)d abbd 0 4 1 3
+a(b*|c|e)d acd 0 3 1 2
+a(b*|c|e)d ad 0 2 1 1
+a(b?)c abc 0 3 1 2
+a(b?)c ac 0 2 1 1
+a(b+)c abc 0 3 1 2
+a(b+)c abbbc 0 5 1 4 
+a(b*)c ac 0 2 1 1 
+(a|ab)(bc([de]+)f|cde) abcdef 0 6 0 1 1 6 3 5
+a([bc]?)c abc 0 3 1 2
+a([bc]?)c ac 0 2 1 1 
+a([bc]+)c abc 0 3 1 2
+a([bc]+)c abcc 0 4 1 3
+a([bc]+)bc abcbc 0 5 1 3
+a(bb+|b)b abb 0 3 1 2
+a(bbb+|bb+|b)b abb 0 3 1 2
+a(bbb+|bb+|b)b abbb 0 4 1 3
+a(bbb+|bb+|b)bb abbb 0 4 1 2
+(.*).* abcdef 0 6 0 6
+(a*)* bc 0 0 0 0
+xyx*xz xyxxxxyxxxz 5 11
+
+; do we get the right subexpression when it is used more than once?
+a(b|c)*d ad 0 2 -1 -1
+a(b|c)*d abcd 0 4 2 3
+a(b|c)+d abd 0 3 1 2
+a(b|c)+d abcd 0 4 2 3
+a(b|c?)+d ad 0 2 1 1
+a(b|c){0,0}d ad 0 2 -1 -1
+a(b|c){0,1}d ad 0 2 -1 -1
+a(b|c){0,1}d abd 0 3 1 2
+a(b|c){0,2}d ad 0 2 -1 -1
+a(b|c){0,2}d abcd 0 4 2 3
+a(b|c){0,}d ad 0 2 -1 -1
+a(b|c){0,}d abcd 0 4 2 3
+a(b|c){1,1}d abd 0 3 1 2
+a(b|c){1,2}d abd 0 3 1 2
+a(b|c){1,2}d abcd 0 4 2 3
+a(b|c){1,}d abd 0 3 1 2
+a(b|c){1,}d abcd 0 4 2 3
+a(b|c){2,2}d acbd 0 4 2 3
+a(b|c){2,2}d abcd 0 4 2 3
+a(b|c){2,4}d abcd 0 4 2 3
+a(b|c){2,4}d abcbd 0 5 3 4
+a(b|c){2,4}d abcbcd 0 6 4 5
+a(b|c){2,}d abcd 0 4 2 3
+a(b|c){2,}d abcbd 0 5 3 4
+; perl only: these conflict with the POSIX test below
+;a(b|c?)+d abcd 0 4 3 3
+;a(b+|((c)*))+d abd 0 3 2 2 2 2 -1 -1
+;a(b+|((c)*))+d abcd 0 4 3 3 3 3 2 3
+
+; posix only:
+- match_default extended REG_EXTENDED REG_STARTEND
+
+a(b|c?)+d abcd 0 4 2 3
+a(b|((c)*))+d abcd 0 4 2 3 2 3 2 3
+a(b+|((c)*))+d abd 0 3 1 2 -1 -1 -1 -1
+a(b+|((c)*))+d abcd 0 4 2 3 2 3 2 3
+a(b|((c)*))+d ad 0 2 1 1 1 1 -1 -1
+a(b|((c)*))*d abcd 0 4 2 3 2 3 2 3
+a(b+|((c)*))*d abd 0 3 1 2 -1 -1 -1 -1
+a(b+|((c)*))*d abcd 0 4 2 3 2 3 2 3
+a(b|((c)*))*d ad 0 2 1 1 1 1 -1 -1
+
+- match_default normal REG_PERL
+; try to match C++ syntax elements:
+; line comment:
+//[^\n]* "++i //here is a line comment\n" 4 28
+; block comment:
+/\*([^*]|\*+[^*/])*\*+/ "/* here is a block comment */" 0 29 26 27
+/\*([^*]|\*+[^*/])*\*+/ "/**/" 0 4 -1 -1
+/\*([^*]|\*+[^*/])*\*+/ "/***/" 0 5 -1 -1
+/\*([^*]|\*+[^*/])*\*+/ "/****/" 0 6 -1 -1
+/\*([^*]|\*+[^*/])*\*+/ "/*****/" 0 7 -1 -1
+/\*([^*]|\*+[^*/])*\*+/ "/*****/*/" 0 7 -1 -1
+; preprossor directives:
+^[[:blank:]]*#([^\n]*\\[[:space:]]+)*[^\n]* "#define some_symbol" 0 19 -1 -1
+^[[:blank:]]*#([^\n]*\\[[:space:]]+)*[^\n]* "#define some_symbol(x) #x" 0 25 -1 -1
+; perl only:
+^[[:blank:]]*#([^\n]*\\[[:space:]]+)*[^\n]* "#define some_symbol(x) \\  \r\n  foo();\\\r\n   printf(#x);" 0 53 30 42
+; literals:
+((0x[[:xdigit:]]+)|([[:digit:]]+))u?((int(8|16|32|64))|L)? 0xFF         						0 4		0 4		0 4 	-1 -1 	-1 -1 	-1 -1 	-1 -1
+((0x[[:xdigit:]]+)|([[:digit:]]+))u?((int(8|16|32|64))|L)? 35 									0 2 	0 2		-1 -1 	0 2 	-1 -1 	-1 -1 	-1 -1
+((0x[[:xdigit:]]+)|([[:digit:]]+))u?((int(8|16|32|64))|L)? 0xFFu 								0 5		0 4		0 4 	-1 -1 	-1 -1 	-1 -1 	-1 -1
+((0x[[:xdigit:]]+)|([[:digit:]]+))u?((int(8|16|32|64))|L)? 0xFFL 								0 5		0 4		0 4 	-1 -1 	4 5 	-1 -1 	-1 -1
+((0x[[:xdigit:]]+)|([[:digit:]]+))u?((int(8|16|32|64))|L)? 0xFFFFFFFFFFFFFFFFuint64 			0 24	0 18	0 18 	-1 -1 	19 24 	19 24 	22 24
+; strings:
+'([^\\']|\\.)*' '\\x3A' 0 6 4 5
+'([^\\']|\\.)*' '\\'' 0 4 1 3
+'([^\\']|\\.)*' '\\n' 0 4 1 3
+
+; finally try some case insensitive matches:
+- match_default normal REG_EXTENDED REG_ICASE
+; upper and lower have no meaning here so they fail, however these
+; may compile with other libraries...
+;[[:lower:]] !
+;[[:upper:]] !
+0123456789@abcdefghijklmnopqrstuvwxyz\[\\\]\^_`ABCDEFGHIJKLMNOPQRSTUVWXYZ\{\|\} 0123456789@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]\^_`abcdefghijklmnopqrstuvwxyz\{\|\} 0 72
+
+; known and suspected bugs:
+- match_default normal REG_EXTENDED
+\( ( 0 1
+\) ) 0 1
+\$ $ 0 1
+\^ ^ 0 1
+\. . 0 1
+\* * 0 1
+\+ + 0 1
+\? ? 0 1
+\[ [ 0 1
+\] ] 0 1
+\| | 0 1
+\\ \\ 0 1
+# # 0 1
+\# # 0 1
+a- a- 0 2
+\- - 0 1
+\{ { 0 1
+\} } 0 1
+0 0 0 1
+1 1 0 1
+9 9 0 1
+b b 0 1
+B B 0 1
+< < 0 1
+> > 0 1
+w w 0 1
+W W 0 1
+` ` 0 1
+' ' 0 1
+\n \n 0 1
+, , 0 1
+a a 0 1
+f f 0 1
+n n 0 1
+r r 0 1
+t t 0 1
+v v 0 1
+c c 0 1
+x x 0 1
+: : 0 1
+(\.[[:alnum:]]+){2} "w.a.b " 1 5 3 5
+
+- match_default normal REG_EXTENDED REG_ICASE
+a A 0 1
+A a 0 1
+[abc]+ abcABC 0 6
+[ABC]+ abcABC 0 6
+[a-z]+ abcABC 0 6
+[A-Z]+ abzANZ 0 6
+[a-Z]+ abzABZ 0 6
+[A-z]+ abzABZ 0 6
+[[:lower:]]+ abyzABYZ 0 8
+[[:upper:]]+ abzABZ 0 6
+[[:alpha:]]+ abyzABYZ 0 8
+[[:alnum:]]+ 09abyzABYZ 0 10
+
+; word start:
+\<abcd "  abcd" 2 6
+\<ab cab -1 -1
+\<ab "\nab" 1 3
+\<tag ::tag 2 5
+;word end:
+abc\> abc 0 3
+abc\> abcd -1 -1
+abc\> abc\n 0 3
+abc\> abc:: 0 3
+
+; collating elements and rewritten set code:
+- match_default normal REG_EXTENDED REG_STARTEND
+;[[.zero.]] 0 0 1
+;[[.one.]] 1 0 1
+;[[.two.]] 2 0 1
+;[[.three.]] 3 0 1
+[[.a.]] baa 1 2
+;[[.right-curly-bracket.]] } 0 1
+;[[.NUL.]] \0 0 1
+[[:<:]z] !
+[a[:>:]] !
+[[=a=]] a 0 1
+;[[=right-curly-bracket=]] } 0 1
+- match_default normal REG_EXTENDED REG_STARTEND REG_ICASE
+[[.A.]] A 0 1
+[[.A.]] a 0 1
+[[.A.]-b]+ AaBb 0 4
+[A-[.b.]]+ AaBb 0 4
+[[.a.]-B]+ AaBb 0 4
+[a-[.B.]]+ AaBb 0 4
+- match_default normal REG_EXTENDED REG_STARTEND
+[[.a.]-c]+ abcd 0 3
+[a-[.c.]]+ abcd 0 3
+[[:alpha:]-a] !
+[a-[:alpha:]] !
+
+; try mutli-character ligatures:
+;[[.ae.]] ae 0 2
+;[[.ae.]] aE -1 -1
+;[[.AE.]] AE 0 2
+;[[.Ae.]] Ae 0 2
+;[[.ae.]-b] a -1 -1
+;[[.ae.]-b] b 0 1
+;[[.ae.]-b] ae 0 2
+;[a-[.ae.]] a 0 1
+;[a-[.ae.]] b -1 -1
+;[a-[.ae.]] ae 0 2
+- match_default normal REG_EXTENDED REG_STARTEND REG_ICASE
+;[[.ae.]] AE 0 2
+;[[.ae.]] Ae 0 2
+;[[.AE.]] Ae 0 2
+;[[.Ae.]] aE 0 2
+;[[.AE.]-B] a -1 -1
+;[[.Ae.]-b] b 0 1
+;[[.Ae.]-b] B 0 1
+;[[.ae.]-b] AE 0 2
+
+- match_default normal REG_EXTENDED REG_STARTEND REG_NO_POSIX_TEST
+\s+ "ab   ab" 2 5
+\S+ "  abc  " 2 5
+
+- match_default normal REG_EXTENDED REG_STARTEND
+\`abc abc 0 3
+\`abc aabc -1 -1
+abc\' abc 0 3
+abc\' abcd -1 -1
+abc\' abc\n\n -1 -1
+abc\' abc 0 3
+
+; extended repeat checking to exercise new algorithms:
+ab.*xy abxy_ 0 4
+ab.*xy ab_xy_ 0 5
+ab.*xy abxy 0 4
+ab.*xy ab_xy 0 5
+ab.* ab 0 2
+ab.* ab__ 0 4
+
+ab.{2,5}xy ab__xy_ 0 6
+ab.{2,5}xy ab____xy_ 0 8
+ab.{2,5}xy ab_____xy_ 0 9
+ab.{2,5}xy ab__xy 0 6
+ab.{2,5}xy ab_____xy 0 9
+ab.{2,5} ab__ 0 4
+ab.{2,5} ab_______ 0 7
+ab.{2,5}xy ab______xy -1 -1
+ab.{2,5}xy ab_xy -1 -1
+
+ab.*?xy abxy_ 0 4
+ab.*?xy ab_xy_ 0 5
+ab.*?xy abxy 0 4
+ab.*?xy ab_xy 0 5
+ab.*? ab 0 2
+ab.*? ab__ 0 4
+
+ab.{2,5}?xy ab__xy_ 0 6
+ab.{2,5}?xy ab____xy_ 0 8
+ab.{2,5}?xy ab_____xy_ 0 9
+ab.{2,5}?xy ab__xy 0 6
+ab.{2,5}?xy ab_____xy 0 9
+ab.{2,5}? ab__ 0 4
+ab.{2,5}? ab_______ 0 7
+ab.{2,5}?xy ab______xy -1 -1
+ab.{2,5}xy ab_xy -1 -1
+
+; again but with slower algorithm variant:
+- match_default REG_EXTENDED
+; now again for single character repeats:
+
+ab_*xy abxy_ 0 4
+ab_*xy ab_xy_ 0 5
+ab_*xy abxy 0 4
+ab_*xy ab_xy 0 5
+ab_* ab 0 2
+ab_* ab__ 0 4
+
+ab_{2,5}xy ab__xy_ 0 6
+ab_{2,5}xy ab____xy_ 0 8
+ab_{2,5}xy ab_____xy_ 0 9
+ab_{2,5}xy ab__xy 0 6
+ab_{2,5}xy ab_____xy 0 9
+ab_{2,5} ab__ 0 4
+ab_{2,5} ab_______ 0 7
+ab_{2,5}xy ab______xy -1 -1
+ab_{2,5}xy ab_xy -1 -1
+
+ab_*?xy abxy_ 0 4
+ab_*?xy ab_xy_ 0 5
+ab_*?xy abxy 0 4
+ab_*?xy ab_xy 0 5
+ab_*? ab 0 2
+ab_*? ab__ 0 4
+
+ab_{2,5}?xy ab__xy_ 0 6
+ab_{2,5}?xy ab____xy_ 0 8
+ab_{2,5}?xy ab_____xy_ 0 9
+ab_{2,5}?xy ab__xy 0 6
+ab_{2,5}?xy ab_____xy 0 9
+ab_{2,5}? ab__ 0 4
+ab_{2,5}? ab_______ 0 7
+ab_{2,5}?xy ab______xy -1 -1
+ab_{2,5}xy ab_xy -1 -1
+
+; and again for sets:
+ab[_,;]*xy abxy_ 0 4
+ab[_,;]*xy ab_xy_ 0 5
+ab[_,;]*xy abxy 0 4
+ab[_,;]*xy ab_xy 0 5
+ab[_,;]* ab 0 2
+ab[_,;]* ab__ 0 4
+
+ab[_,;]{2,5}xy ab__xy_ 0 6
+ab[_,;]{2,5}xy ab____xy_ 0 8
+ab[_,;]{2,5}xy ab_____xy_ 0 9
+ab[_,;]{2,5}xy ab__xy 0 6
+ab[_,;]{2,5}xy ab_____xy 0 9
+ab[_,;]{2,5} ab__ 0 4
+ab[_,;]{2,5} ab_______ 0 7
+ab[_,;]{2,5}xy ab______xy -1 -1
+ab[_,;]{2,5}xy ab_xy -1 -1
+
+ab[_,;]*?xy abxy_ 0 4
+ab[_,;]*?xy ab_xy_ 0 5
+ab[_,;]*?xy abxy 0 4
+ab[_,;]*?xy ab_xy 0 5
+ab[_,;]*? ab 0 2
+ab[_,;]*? ab__ 0 4
+
+ab[_,;]{2,5}?xy ab__xy_ 0 6
+ab[_,;]{2,5}?xy ab____xy_ 0 8
+ab[_,;]{2,5}?xy ab_____xy_ 0 9
+ab[_,;]{2,5}?xy ab__xy 0 6
+ab[_,;]{2,5}?xy ab_____xy 0 9
+ab[_,;]{2,5}? ab__ 0 4
+ab[_,;]{2,5}? ab_______ 0 7
+ab[_,;]{2,5}?xy ab______xy -1 -1
+ab[_,;]{2,5}xy ab_xy -1 -1
+
+; and again for tricky sets with digraphs:
+;ab[_[.ae.]]*xy abxy_ 0 4
+;ab[_[.ae.]]*xy ab_xy_ 0 5
+;ab[_[.ae.]]*xy abxy 0 4
+;ab[_[.ae.]]*xy ab_xy 0 5
+;ab[_[.ae.]]* ab 0 2
+;ab[_[.ae.]]* ab__ 0 4
+
+;ab[_[.ae.]]{2,5}xy ab__xy_ 0 6
+;ab[_[.ae.]]{2,5}xy ab____xy_ 0 8
+;ab[_[.ae.]]{2,5}xy ab_____xy_ 0 9
+;ab[_[.ae.]]{2,5}xy ab__xy 0 6
+;ab[_[.ae.]]{2,5}xy ab_____xy 0 9
+;ab[_[.ae.]]{2,5} ab__ 0 4
+;ab[_[.ae.]]{2,5} ab_______ 0 7
+;ab[_[.ae.]]{2,5}xy ab______xy -1 -1
+;ab[_[.ae.]]{2,5}xy ab_xy -1 -1
+
+;ab[_[.ae.]]*?xy abxy_ 0 4
+;ab[_[.ae.]]*?xy ab_xy_ 0 5
+;ab[_[.ae.]]*?xy abxy 0 4
+;ab[_[.ae.]]*?xy ab_xy 0 5
+;ab[_[.ae.]]*? ab 0 2
+;ab[_[.ae.]]*? ab__ 0 2
+
+;ab[_[.ae.]]{2,5}?xy ab__xy_ 0 6
+;ab[_[.ae.]]{2,5}?xy ab____xy_ 0 8
+;ab[_[.ae.]]{2,5}?xy ab_____xy_ 0 9
+;ab[_[.ae.]]{2,5}?xy ab__xy 0 6
+;ab[_[.ae.]]{2,5}?xy ab_____xy 0 9
+;ab[_[.ae.]]{2,5}? ab__ 0 4
+;ab[_[.ae.]]{2,5}? ab_______ 0 4
+;ab[_[.ae.]]{2,5}?xy ab______xy -1 -1
+;ab[_[.ae.]]{2,5}xy ab_xy -1 -1
+
+; new bugs detected in spring 2003:
+- normal match_continuous REG_NO_POSIX_TEST
+b abc 1 2
+
+() abc 0 0 0 0
+^() abc 0 0 0 0
+^()+ abc 0 0 0 0
+^(){1} abc 0 0 0 0
+^(){2} abc 0 0 0 0
+^((){2}) abc 0 0 0 0 0 0
+() "" 0 0 0 0
+()\1 "" 0 0 0 0
+()\1 a 0 0 0 0
+a()\1b ab 0 2 1 1
+a()b\1 ab 0 2 1 1
+
+; subtleties of matching with no sub-expressions marked
+- normal match_nosubs REG_NO_POSIX_TEST
+a(b?c)+d accd 0 4 
+(wee|week)(knights|night) weeknights 0 10 
+.* abc 0 3
+a(b|(c))d abd 0 3 
+a(b|(c))d acd 0 3
+a(b*|c|e)d abbd 0 4
+a(b*|c|e)d acd 0 3 
+a(b*|c|e)d ad 0 2
+a(b?)c abc 0 3
+a(b?)c ac 0 2
+a(b+)c abc 0 3
+a(b+)c abbbc 0 5
+a(b*)c ac 0 2
+(a|ab)(bc([de]+)f|cde) abcdef 0 6
+a([bc]?)c abc 0 3
+a([bc]?)c ac 0 2
+a([bc]+)c abc 0 3
+a([bc]+)c abcc 0 4
+a([bc]+)bc abcbc 0 5
+a(bb+|b)b abb 0 3
+a(bbb+|bb+|b)b abb 0 3
+a(bbb+|bb+|b)b abbb 0 4
+a(bbb+|bb+|b)bb abbb 0 4
+(.*).* abcdef 0 6
+(a*)* bc 0 0
+
+- normal nosubs REG_NO_POSIX_TEST
+a(b?c)+d accd 0 4 
+(wee|week)(knights|night) weeknights 0 10 
+.* abc 0 3
+a(b|(c))d abd 0 3 
+a(b|(c))d acd 0 3
+a(b*|c|e)d abbd 0 4
+a(b*|c|e)d acd 0 3 
+a(b*|c|e)d ad 0 2
+a(b?)c abc 0 3
+a(b?)c ac 0 2
+a(b+)c abc 0 3
+a(b+)c abbbc 0 5
+a(b*)c ac 0 2
+(a|ab)(bc([de]+)f|cde) abcdef 0 6
+a([bc]?)c abc 0 3
+a([bc]?)c ac 0 2
+a([bc]+)c abc 0 3
+a([bc]+)c abcc 0 4
+a([bc]+)bc abcbc 0 5
+a(bb+|b)b abb 0 3
+a(bbb+|bb+|b)b abb 0 3
+a(bbb+|bb+|b)b abbb 0 4
+a(bbb+|bb+|b)bb abbb 0 4
+(.*).* abcdef 0 6
+(a*)* bc 0 0
+
diff --git a/test/src/regex-resources/PCRE.tests b/test/src/regex-resources/PCRE.tests
new file mode 100644
index 0000000..0fb9cad
--- /dev/null
+++ b/test/src/regex-resources/PCRE.tests
@@ -0,0 +1,2386 @@
+# PCRE version 4.4 21-August-2003
+
+# Tests taken from PCRE and modified to suit glibc regex.
+#
+# PCRE LICENCE
+# ------------
+#
+# PCRE is a library of functions to support regular expressions whose syntax
+# and semantics are as close as possible to those of the Perl 5 language.
+#
+# Written by: Philip Hazel <ph10@cam.ac.uk>
+#
+# University of Cambridge Computing Service,
+# Cambridge, England. Phone: +44 1223 334714.
+#
+# Copyright (c) 1997-2003 University of Cambridge
+#
+# Permission is granted to anyone to use this software for any purpose on any
+# computer system, and to redistribute it freely, subject to the following
+# restrictions:
+#
+# 1. This software is distributed in the hope that it will be useful,
+#    but WITHOUT ANY WARRANTY; without even the implied warranty of
+#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+#
+# 2. The origin of this software must not be misrepresented, either by
+#    explicit claim or by omission. In practice, this means that if you use
+#    PCRE in software that you distribute to others, commercially or
+#    otherwise, you must put a sentence like this
+#
+#      Regular expression support is provided by the PCRE library package,
+#      which is open source software, written by Philip Hazel, and copyright
+#      by the University of Cambridge, England.
+#
+#    somewhere reasonably visible in your documentation and in any relevant
+#    files or online help data or similar. A reference to the ftp site for
+#    the source, that is, to
+#
+#      ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/
+#
+#    should also be given in the documentation. However, this condition is not
+#    intended to apply to whole chains of software. If package A includes PCRE,
+#    it must acknowledge it, but if package B is software that includes package
+#    A, the condition is not imposed on package B (unless it uses PCRE
+#    independently).
+#
+# 3. Altered versions must be plainly marked as such, and must not be
+#    misrepresented as being the original software.
+#
+# 4. If PCRE is embedded in any software that is released under the GNU
+#   General Purpose Licence (GPL), or Lesser General Purpose Licence (LGPL),
+#   then the terms of that licence shall supersede any condition above with
+#   which it is incompatible.
+#
+# The documentation for PCRE, supplied in the "doc" directory, is distributed
+# under the same terms as the software itself.
+#
+# End
+#
+
+/the quick brown fox/
+    the quick brown fox
+ 0: the quick brown fox
+    The quick brown FOX
+No match
+    What do you know about the quick brown fox?
+ 0: the quick brown fox
+    What do you know about THE QUICK BROWN FOX?
+No match
+
+/The quick brown fox/i
+    the quick brown fox
+ 0: the quick brown fox
+    The quick brown FOX
+ 0: The quick brown FOX
+    What do you know about the quick brown fox?
+ 0: the quick brown fox
+    What do you know about THE QUICK BROWN FOX?
+ 0: THE QUICK BROWN FOX
+
+/a*abc?xyz+pqr{3}ab{2,}xy{4,5}pq{0,6}AB{0,}zz/
+    abxyzpqrrrabbxyyyypqAzz
+ 0: abxyzpqrrrabbxyyyypqAzz
+    abxyzpqrrrabbxyyyypqAzz
+ 0: abxyzpqrrrabbxyyyypqAzz
+    aabxyzpqrrrabbxyyyypqAzz
+ 0: aabxyzpqrrrabbxyyyypqAzz
+    aaabxyzpqrrrabbxyyyypqAzz
+ 0: aaabxyzpqrrrabbxyyyypqAzz
+    aaaabxyzpqrrrabbxyyyypqAzz
+ 0: aaaabxyzpqrrrabbxyyyypqAzz
+    abcxyzpqrrrabbxyyyypqAzz
+ 0: abcxyzpqrrrabbxyyyypqAzz
+    aabcxyzpqrrrabbxyyyypqAzz
+ 0: aabcxyzpqrrrabbxyyyypqAzz
+    aaabcxyzpqrrrabbxyyyypAzz
+ 0: aaabcxyzpqrrrabbxyyyypAzz
+    aaabcxyzpqrrrabbxyyyypqAzz
+ 0: aaabcxyzpqrrrabbxyyyypqAzz
+    aaabcxyzpqrrrabbxyyyypqqAzz
+ 0: aaabcxyzpqrrrabbxyyyypqqAzz
+    aaabcxyzpqrrrabbxyyyypqqqAzz
+ 0: aaabcxyzpqrrrabbxyyyypqqqAzz
+    aaabcxyzpqrrrabbxyyyypqqqqAzz
+ 0: aaabcxyzpqrrrabbxyyyypqqqqAzz
+    aaabcxyzpqrrrabbxyyyypqqqqqAzz
+ 0: aaabcxyzpqrrrabbxyyyypqqqqqAzz
+    aaabcxyzpqrrrabbxyyyypqqqqqqAzz
+ 0: aaabcxyzpqrrrabbxyyyypqqqqqqAzz
+    aaaabcxyzpqrrrabbxyyyypqAzz
+ 0: aaaabcxyzpqrrrabbxyyyypqAzz
+    abxyzzpqrrrabbxyyyypqAzz
+ 0: abxyzzpqrrrabbxyyyypqAzz
+    aabxyzzzpqrrrabbxyyyypqAzz
+ 0: aabxyzzzpqrrrabbxyyyypqAzz
+    aaabxyzzzzpqrrrabbxyyyypqAzz
+ 0: aaabxyzzzzpqrrrabbxyyyypqAzz
+    aaaabxyzzzzpqrrrabbxyyyypqAzz
+ 0: aaaabxyzzzzpqrrrabbxyyyypqAzz
+    abcxyzzpqrrrabbxyyyypqAzz
+ 0: abcxyzzpqrrrabbxyyyypqAzz
+    aabcxyzzzpqrrrabbxyyyypqAzz
+ 0: aabcxyzzzpqrrrabbxyyyypqAzz
+    aaabcxyzzzzpqrrrabbxyyyypqAzz
+ 0: aaabcxyzzzzpqrrrabbxyyyypqAzz
+    aaaabcxyzzzzpqrrrabbxyyyypqAzz
+ 0: aaaabcxyzzzzpqrrrabbxyyyypqAzz
+    aaaabcxyzzzzpqrrrabbbxyyyypqAzz
+ 0: aaaabcxyzzzzpqrrrabbbxyyyypqAzz
+    aaaabcxyzzzzpqrrrabbbxyyyyypqAzz
+ 0: aaaabcxyzzzzpqrrrabbbxyyyyypqAzz
+    aaabcxyzpqrrrabbxyyyypABzz
+ 0: aaabcxyzpqrrrabbxyyyypABzz
+    aaabcxyzpqrrrabbxyyyypABBzz
+ 0: aaabcxyzpqrrrabbxyyyypABBzz
+    >>>aaabxyzpqrrrabbxyyyypqAzz
+ 0: aaabxyzpqrrrabbxyyyypqAzz
+    >aaaabxyzpqrrrabbxyyyypqAzz
+ 0: aaaabxyzpqrrrabbxyyyypqAzz
+    >>>>abcxyzpqrrrabbxyyyypqAzz
+ 0: abcxyzpqrrrabbxyyyypqAzz
+    *** Failers
+No match
+    abxyzpqrrabbxyyyypqAzz
+No match
+    abxyzpqrrrrabbxyyyypqAzz
+No match
+    abxyzpqrrrabxyyyypqAzz
+No match
+    aaaabcxyzzzzpqrrrabbbxyyyyyypqAzz
+No match
+    aaaabcxyzzzzpqrrrabbbxyyypqAzz
+No match
+    aaabcxyzpqrrrabbxyyyypqqqqqqqAzz
+No match
+
+/^(abc){1,2}zz/
+    abczz
+ 0: abczz
+ 1: abc
+    abcabczz
+ 0: abcabczz
+ 1: abc
+    *** Failers
+No match
+    zz
+No match
+    abcabcabczz
+No match
+    >>abczz
+No match
+
+/^(b+|a){1,2}c/
+    bc
+ 0: bc
+ 1: b
+    bbc
+ 0: bbc
+ 1: bb
+    bbbc
+ 0: bbbc
+ 1: bbb
+    bac
+ 0: bac
+ 1: a
+    bbac
+ 0: bbac
+ 1: a
+    aac
+ 0: aac
+ 1: a
+    abbbbbbbbbbbc
+ 0: abbbbbbbbbbbc
+ 1: bbbbbbbbbbb
+    bbbbbbbbbbbac
+ 0: bbbbbbbbbbbac
+ 1: a
+    *** Failers
+No match
+    aaac
+No match
+    abbbbbbbbbbbac
+No match
+
+/^[]cde]/
+    ]thing
+ 0: ]
+    cthing
+ 0: c
+    dthing
+ 0: d
+    ething
+ 0: e
+    *** Failers
+No match
+    athing
+No match
+    fthing
+No match
+
+/^[^]cde]/
+    athing
+ 0: a
+    fthing
+ 0: f
+    *** Failers
+ 0: *
+    ]thing
+No match
+    cthing
+No match
+    dthing
+No match
+    ething
+No match
+
+/^[0-9]+$/
+    0
+ 0: 0
+    1
+ 0: 1
+    2
+ 0: 2
+    3
+ 0: 3
+    4
+ 0: 4
+    5
+ 0: 5
+    6
+ 0: 6
+    7
+ 0: 7
+    8
+ 0: 8
+    9
+ 0: 9
+    10
+ 0: 10
+    100
+ 0: 100
+    *** Failers
+No match
+    abc
+No match
+
+/^.*nter/
+    enter
+ 0: enter
+    inter
+ 0: inter
+    uponter
+ 0: uponter
+
+/^xxx[0-9]+$/
+    xxx0
+ 0: xxx0
+    xxx1234
+ 0: xxx1234
+    *** Failers
+No match
+    xxx
+No match
+
+/^.+[0-9][0-9][0-9]$/
+    x123
+ 0: x123
+    xx123
+ 0: xx123
+    123456
+ 0: 123456
+    *** Failers
+No match
+    123
+No match
+    x1234
+ 0: x1234
+
+/^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/
+    abc!pqr=apquxz.ixr.zzz.ac.uk
+ 0: abc!pqr=apquxz.ixr.zzz.ac.uk
+ 1: abc
+ 2: pqr
+    *** Failers
+No match
+    !pqr=apquxz.ixr.zzz.ac.uk
+No match
+    abc!=apquxz.ixr.zzz.ac.uk
+No match
+    abc!pqr=apquxz:ixr.zzz.ac.uk
+No match
+    abc!pqr=apquxz.ixr.zzz.ac.ukk
+No match
+
+/:/
+    Well, we need a colon: somewhere
+ 0: :
+    *** Fail if we don't
+No match
+
+/([0-9a-f:]+)$/i
+    0abc
+ 0: 0abc
+ 1: 0abc
+    abc
+ 0: abc
+ 1: abc
+    fed
+ 0: fed
+ 1: fed
+    E
+ 0: E
+ 1: E
+    ::
+ 0: ::
+ 1: ::
+    5f03:12C0::932e
+ 0: 5f03:12C0::932e
+ 1: 5f03:12C0::932e
+    fed def
+ 0: def
+ 1: def
+    Any old stuff
+ 0: ff
+ 1: ff
+    *** Failers
+No match
+    0zzz
+No match
+    gzzz
+No match
+    Any old rubbish
+No match
+
+/^.*\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$/
+    .1.2.3
+ 0: .1.2.3
+ 1: 1
+ 2: 2
+ 3: 3
+    A.12.123.0
+ 0: A.12.123.0
+ 1: 12
+ 2: 123
+ 3: 0
+    *** Failers
+No match
+    .1.2.3333
+No match
+    1.2.3
+No match
+    1234.2.3
+No match
+
+/^([0-9]+)\s+IN\s+SOA\s+(\S+)\s+(\S+)\s*\(\s*$/
+    1 IN SOA non-sp1 non-sp2(
+ 0: 1 IN SOA non-sp1 non-sp2(
+ 1: 1
+ 2: non-sp1
+ 3: non-sp2
+    1    IN    SOA    non-sp1    non-sp2   (
+ 0: 1    IN    SOA    non-sp1    non-sp2   (
+ 1: 1
+ 2: non-sp1
+ 3: non-sp2
+    *** Failers
+No match
+    1IN SOA non-sp1 non-sp2(
+No match
+
+/^[a-zA-Z0-9][a-zA-Z0-9-]*(\.[a-zA-Z0-9][a-zA-z0-9-]*)*\.$/
+    a.
+ 0: a.
+    Z.
+ 0: Z.
+    2.
+ 0: 2.
+    ab-c.pq-r.
+ 0: ab-c.pq-r.
+ 1: .pq-r
+    sxk.zzz.ac.uk.
+ 0: sxk.zzz.ac.uk.
+ 1: .uk
+    x-.y-.
+ 0: x-.y-.
+ 1: .y-
+    *** Failers
+No match
+    -abc.peq.
+No match
+
+/^\*\.[a-z]([a-z0-9-]*[a-z0-9]+)?(\.[a-z]([a-z0-9-]*[a-z0-9]+)?)*$/
+    *.a
+ 0: *.a
+    *.b0-a
+ 0: *.b0-a
+ 1: 0-a
+    *.c3-b.c
+ 0: *.c3-b.c
+ 1: 3-b
+ 2: .c
+    *.c-a.b-c
+ 0: *.c-a.b-c
+ 1: -a
+ 2: .b-c
+ 3: -c
+    *** Failers
+No match
+    *.0
+No match
+    *.a-
+No match
+    *.a-b.c-
+No match
+    *.c-a.0-c
+No match
+
+/^[0-9a-f](\.[0-9a-f])*$/i
+    a.b.c.d
+ 0: a.b.c.d
+ 1: .d
+    A.B.C.D
+ 0: A.B.C.D
+ 1: .D
+    a.b.c.1.2.3.C
+ 0: a.b.c.1.2.3.C
+ 1: .C
+
+/^".*"\s*(;.*)?$/
+    "1234"
+ 0: "1234"
+    "abcd" ;
+ 0: "abcd" ;
+ 1: ;
+    "" ; rhubarb
+ 0: "" ; rhubarb
+ 1: ; rhubarb
+    *** Failers
+No match
+    "1234" : things
+No match
+
+/^(a(b(c)))(d(e(f)))(h(i(j)))(k(l(m)))$/
+    abcdefhijklm
+ 0: abcdefhijklm
+ 1: abc
+ 2: bc
+ 3: c
+ 4: def
+ 5: ef
+ 6: f
+ 7: hij
+ 8: ij
+ 9: j
+10: klm
+11: lm
+12: m
+
+/^a*\w/
+    z
+ 0: z
+    az
+ 0: az
+    aaaz
+ 0: aaaz
+    a
+ 0: a
+    aa
+ 0: aa
+    aaaa
+ 0: aaaa
+    a+
+ 0: a
+    aa+
+ 0: aa
+
+/^a+\w/
+    az
+ 0: az
+    aaaz
+ 0: aaaz
+    aa
+ 0: aa
+    aaaa
+ 0: aaaa
+    aa+
+ 0: aa
+
+/^[0-9]{8}\w{2,}/
+    1234567890
+ 0: 1234567890
+    12345678ab
+ 0: 12345678ab
+    12345678__
+ 0: 12345678__
+    *** Failers
+No match
+    1234567
+No match
+
+/^[aeiou0-9]{4,5}$/
+    uoie
+ 0: uoie
+    1234
+ 0: 1234
+    12345
+ 0: 12345
+    aaaaa
+ 0: aaaaa
+    *** Failers
+No match
+    123456
+No match
+
+/\`(abc|def)=(\1){2,3}\'/
+    abc=abcabc
+ 0: abc=abcabc
+ 1: abc
+ 2: abc
+    def=defdefdef
+ 0: def=defdefdef
+ 1: def
+ 2: def
+    *** Failers
+No match
+    abc=defdef
+No match
+
+/(cat(a(ract|tonic)|erpillar)) \1()2(3)/
+    cataract cataract23
+ 0: cataract cataract23
+ 1: cataract
+ 2: aract
+ 3: ract
+ 4: 
+ 5: 3
+    catatonic catatonic23
+ 0: catatonic catatonic23
+ 1: catatonic
+ 2: atonic
+ 3: tonic
+ 4: 
+ 5: 3
+    caterpillar caterpillar23
+ 0: caterpillar caterpillar23
+ 1: caterpillar
+ 2: erpillar
+ 3: <unset>
+ 4: 
+ 5: 3
+
+
+/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/
+    From abcd  Mon Sep 01 12:33:02 1997
+ 0: From abcd  Mon Sep 01 12:33
+ 1: abcd
+
+/^From\s+\S+\s+([a-zA-Z]{3}\s+){2}[0-9]{1,2}\s+[0-9][0-9]:[0-9][0-9]/
+    From abcd  Mon Sep 01 12:33:02 1997
+ 0: From abcd  Mon Sep 01 12:33
+ 1: Sep 
+    From abcd  Mon Sep  1 12:33:02 1997
+ 0: From abcd  Mon Sep  1 12:33
+ 1: Sep  
+    *** Failers
+No match
+    From abcd  Sep 01 12:33:02 1997
+No match
+
+/^(a)\1{2,3}(.)/
+    aaab
+ 0: aaab
+ 1: a
+ 2: b
+    aaaab
+ 0: aaaab
+ 1: a
+ 2: b
+    aaaaab
+ 0: aaaaa
+ 1: a
+ 2: a
+    aaaaaab
+ 0: aaaaa
+ 1: a
+ 2: a
+
+/^[ab]{1,3}(ab*|b)/
+    aabbbbb
+ 0: aabbbbb
+ 1: abbbbb
+
+/^(cow|)\1(bell)/
+    cowcowbell
+ 0: cowcowbell
+ 1: cow
+ 2: bell
+    bell
+ 0: bell
+ 1: 
+ 2: bell
+    *** Failers
+No match
+    cowbell
+No match
+
+/^(a|)\1+b/
+    aab
+ 0: aab
+ 1: a
+    aaaab
+ 0: aaaab
+ 1: a
+    b
+ 0: b
+ 1: 
+    *** Failers
+No match
+    ab
+No match
+
+/^(a|)\1{2}b/
+    aaab
+ 0: aaab
+ 1: a
+    b
+ 0: b
+ 1: 
+    *** Failers
+No match
+    ab
+No match
+    aab
+No match
+    aaaab
+No match
+
+/^(a|)\1{2,3}b/
+    aaab
+ 0: aaab
+ 1: a
+    aaaab
+ 0: aaaab
+ 1: a
+    b
+ 0: b
+ 1: 
+    *** Failers
+No match
+    ab
+No match
+    aab
+No match
+    aaaaab
+No match
+
+/ab{1,3}bc/
+    abbbbc
+ 0: abbbbc
+    abbbc
+ 0: abbbc
+    abbc
+ 0: abbc
+    *** Failers
+No match
+    abc
+No match
+    abbbbbc
+No match
+
+/([^.]*)\.([^:]*):[T ]+(.*)/
+    track1.title:TBlah blah blah
+ 0: track1.title:TBlah blah blah
+ 1: track1
+ 2: title
+ 3: Blah blah blah
+
+/([^.]*)\.([^:]*):[T ]+(.*)/i
+    track1.title:TBlah blah blah
+ 0: track1.title:TBlah blah blah
+ 1: track1
+ 2: title
+ 3: Blah blah blah
+
+/([^.]*)\.([^:]*):[t ]+(.*)/i
+    track1.title:TBlah blah blah
+ 0: track1.title:TBlah blah blah
+ 1: track1
+ 2: title
+ 3: Blah blah blah
+
+/^abc$/
+    abc
+ 0: abc
+    *** Failers
+No match
+
+/[-az]+/
+    az-
+ 0: az-
+    *** Failers
+ 0: a
+    b
+No match
+
+/[az-]+/
+    za-
+ 0: za-
+    *** Failers
+ 0: a
+    b
+No match
+
+/[a-z]+/
+    abcdxyz
+ 0: abcdxyz
+
+/[0-9-]+/
+    12-34
+ 0: 12-34
+    *** Failers
+No match
+    aaa
+No match
+
+/(abc)\1/i
+    abcabc
+ 0: abcabc
+ 1: abc
+    ABCabc
+ 0: ABCabc
+ 1: ABC
+    abcABC
+ 0: abcABC
+ 1: abc
+
+/a{0}bc/
+    bc
+ 0: bc
+
+/^([^a])([^b])([^c]*)([^d]{3,4})/
+    baNOTccccd
+ 0: baNOTcccc
+ 1: b
+ 2: a
+ 3: NOT
+ 4: cccc
+    baNOTcccd
+ 0: baNOTccc
+ 1: b
+ 2: a
+ 3: NOT
+ 4: ccc
+    baNOTccd
+ 0: baNOTcc
+ 1: b
+ 2: a
+ 3: NO
+ 4: Tcc
+    bacccd
+ 0: baccc
+ 1: b
+ 2: a
+ 3: 
+ 4: ccc
+    *** Failers
+ 0: *** Failers
+ 1: *
+ 2: *
+ 3: * Fail
+ 4: ers
+    anything
+No match
+    baccd
+No match
+
+/[^a]/
+    Abc
+ 0: A
+
+/[^a]/i
+    Abc 
+ 0: b
+
+/[^a]+/
+    AAAaAbc
+ 0: AAA
+
+/[^a]+/i
+    AAAaAbc
+ 0: bc
+
+/[^k]$/
+    abc
+ 0: c
+    *** Failers
+ 0: s
+    abk
+No match
+
+/[^k]{2,3}$/
+    abc
+ 0: abc
+    kbc
+ 0: bc
+    kabc
+ 0: abc
+    *** Failers
+ 0: ers
+    abk
+No match
+    akb
+No match
+    akk 
+No match
+
+/^[0-9]{8,}@.+[^k]$/
+    12345678@a.b.c.d
+ 0: 12345678@a.b.c.d
+    123456789@x.y.z
+ 0: 123456789@x.y.z
+    *** Failers
+No match
+    12345678@x.y.uk
+No match
+    1234567@a.b.c.d       
+No match
+
+/(a)\1{8,}/
+    aaaaaaaaa
+ 0: aaaaaaaaa
+ 1: a
+    aaaaaaaaaa
+ 0: aaaaaaaaaa
+ 1: a
+    *** Failers
+No match
+    aaaaaaa   
+No match
+
+/[^a]/
+    aaaabcd
+ 0: b
+    aaAabcd 
+ 0: A
+
+/[^a]/i
+    aaaabcd
+ 0: b
+    aaAabcd 
+ 0: b
+
+/[^az]/
+    aaaabcd
+ 0: b
+    aaAabcd 
+ 0: A
+
+/[^az]/i
+    aaaabcd
+ 0: b
+    aaAabcd 
+ 0: b
+
+/P[^*]TAIRE[^*]{1,6}LL/
+    xxxxxxxxxxxPSTAIREISLLxxxxxxxxx
+ 0: PSTAIREISLL
+
+/P[^*]TAIRE[^*]{1,}LL/
+    xxxxxxxxxxxPSTAIREISLLxxxxxxxxx
+ 0: PSTAIREISLL
+
+/(\.[0-9][0-9][1-9]?)[0-9]+/
+    1.230003938
+ 0: .230003938
+ 1: .23
+    1.875000282   
+ 0: .875000282
+ 1: .875
+    1.235  
+ 0: .235
+ 1: .23
+                  
+/\b(foo)\s+(\w+)/i
+    Food is on the foo table
+ 0: foo table
+ 1: foo
+ 2: table
+    
+/foo(.*)bar/
+    The food is under the bar in the barn.
+ 0: food is under the bar in the bar
+ 1: d is under the bar in the 
+    
+/(.*)([0-9]*)/
+    I have 2 numbers: 53147
+ 0: I have 2 numbers: 53147
+ 1: I have 2 numbers: 53147
+ 2: 
+    
+/(.*)([0-9]+)/
+    I have 2 numbers: 53147
+ 0: I have 2 numbers: 53147
+ 1: I have 2 numbers: 5314
+ 2: 7
+
+/(.*)([0-9]+)$/
+    I have 2 numbers: 53147
+ 0: I have 2 numbers: 53147
+ 1: I have 2 numbers: 5314
+ 2: 7
+
+/(.*)\b([0-9]+)$/
+    I have 2 numbers: 53147
+ 0: I have 2 numbers: 53147
+ 1: I have 2 numbers: 
+ 2: 53147
+
+/(.*[^0-9])([0-9]+)$/
+    I have 2 numbers: 53147
+ 0: I have 2 numbers: 53147
+ 1: I have 2 numbers: 
+ 2: 53147
+
+/[[:digit:]][[:digit:]]\/[[:digit:]][[:digit:]]\/[[:digit:]][[:digit:]][[:digit:]][[:digit:]]/
+    01/01/2000
+ 0: 01/01/2000
+
+/^(a){0,0}/
+    bcd
+ 0: 
+    abc
+ 0: 
+    aab     
+ 0: 
+
+/^(a){0,1}/
+    bcd
+ 0: 
+    abc
+ 0: a
+ 1: a
+    aab  
+ 0: a
+ 1: a
+
+/^(a){0,2}/
+    bcd
+ 0: 
+    abc
+ 0: a
+ 1: a
+    aab  
+ 0: aa
+ 1: a
+
+/^(a){0,3}/
+    bcd
+ 0: 
+    abc
+ 0: a
+ 1: a
+    aab
+ 0: aa
+ 1: a
+    aaa   
+ 0: aaa
+ 1: a
+
+/^(a){0,}/
+    bcd
+ 0: 
+    abc
+ 0: a
+ 1: a
+    aab
+ 0: aa
+ 1: a
+    aaa
+ 0: aaa
+ 1: a
+    aaaaaaaa    
+ 0: aaaaaaaa
+ 1: a
+
+/^(a){1,1}/
+    bcd
+No match
+    abc
+ 0: a
+ 1: a
+    aab  
+ 0: a
+ 1: a
+
+/^(a){1,2}/
+    bcd
+No match
+    abc
+ 0: a
+ 1: a
+    aab  
+ 0: aa
+ 1: a
+
+/^(a){1,3}/
+    bcd
+No match
+    abc
+ 0: a
+ 1: a
+    aab
+ 0: aa
+ 1: a
+    aaa   
+ 0: aaa
+ 1: a
+
+/^(a){1,}/
+    bcd
+No match
+    abc
+ 0: a
+ 1: a
+    aab
+ 0: aa
+ 1: a
+    aaa
+ 0: aaa
+ 1: a
+    aaaaaaaa    
+ 0: aaaaaaaa
+ 1: a
+
+/^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/
+    123456654321
+ 0: 123456654321
+
+/^[[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]]/
+    123456654321 
+ 0: 123456654321
+
+/^[abc]{12}/
+    abcabcabcabc
+ 0: abcabcabcabc
+    
+/^[a-c]{12}/
+    abcabcabcabc
+ 0: abcabcabcabc
+    
+/^(a|b|c){12}/
+    abcabcabcabc 
+ 0: abcabcabcabc
+ 1: c
+
+/^[abcdefghijklmnopqrstuvwxy0123456789]/
+    n
+ 0: n
+    *** Failers 
+No match
+    z 
+No match
+
+/abcde{0,0}/
+    abcd
+ 0: abcd
+    *** Failers
+No match
+    abce  
+No match
+
+/ab[cd]{0,0}e/
+    abe
+ 0: abe
+    *** Failers
+No match
+    abcde 
+No match
+    
+/ab(c){0,0}d/
+    abd
+ 0: abd
+    *** Failers
+No match
+    abcd   
+No match
+
+/a(b*)/
+    a
+ 0: a
+ 1: 
+    ab
+ 0: ab
+ 1: b
+    abbbb
+ 0: abbbb
+ 1: bbbb
+    *** Failers
+ 0: a
+ 1: 
+    bbbbb    
+No match
+    
+/ab[0-9]{0}e/
+    abe
+ 0: abe
+    *** Failers
+No match
+    ab1e   
+No match
+    
+/(A|B)*CD/
+    CD 
+ 0: CD
+
+/(AB)*\1/
+    ABABAB
+ 0: ABABAB
+ 1: AB
+    
+/([0-9]+)(\w)/
+    12345a
+ 0: 12345a
+ 1: 12345
+ 2: a
+    12345+ 
+ 0: 12345
+ 1: 1234
+ 2: 5
+
+/(abc|)+/
+    abc
+ 0: abc
+ 1: abc
+    abcabc
+ 0: abcabc
+ 1: abc
+    abcabcabc
+ 0: abcabcabc
+ 1: abc
+    xyz      
+ 0: 
+ 1: 
+
+/([a]*)*/
+    a
+ 0: a
+ 1: a
+    aaaaa 
+ 0: aaaaa
+ 1: aaaaa
+
+/([ab]*)*/
+    a
+ 0: a
+ 1: a
+    b
+ 0: b
+ 1: b
+    ababab
+ 0: ababab
+ 1: ababab
+    aaaabcde
+ 0: aaaab
+ 1: aaaab
+    bbbb    
+ 0: bbbb
+ 1: bbbb
+
+/([^a]*)*/
+    b
+ 0: b
+ 1: b
+    bbbb
+ 0: bbbb
+ 1: bbbb
+    aaa   
+ 0: 
+
+/([^ab]*)*/
+    cccc
+ 0: cccc
+ 1: cccc
+    abab  
+ 0: 
+
+/abc/
+    abc
+ 0: abc
+    xabcy
+ 0: abc
+    ababc
+ 0: abc
+    *** Failers
+No match
+    xbc
+No match
+    axc
+No match
+    abx
+No match
+
+/ab*c/
+    abc
+ 0: abc
+
+/ab*bc/
+    abc
+ 0: abc
+    abbc
+ 0: abbc
+    abbbbc
+ 0: abbbbc
+
+/.{1}/
+    abbbbc
+ 0: a
+
+/.{3,4}/
+    abbbbc
+ 0: abbb
+
+/ab{0,}bc/
+    abbbbc
+ 0: abbbbc
+
+/ab+bc/
+    abbc
+ 0: abbc
+    *** Failers
+No match
+    abc
+No match
+    abq
+No match
+
+/ab+bc/
+    abbbbc
+ 0: abbbbc
+
+/ab{1,}bc/
+    abbbbc
+ 0: abbbbc
+
+/ab{1,3}bc/
+    abbbbc
+ 0: abbbbc
+
+/ab{3,4}bc/
+    abbbbc
+ 0: abbbbc
+
+/ab{4,5}bc/
+    *** Failers
+No match
+    abq
+No match
+    abbbbc
+No match
+
+/ab?bc/
+    abbc
+ 0: abbc
+    abc
+ 0: abc
+
+/ab{0,1}bc/
+    abc
+ 0: abc
+
+/ab?c/
+    abc
+ 0: abc
+
+/ab{0,1}c/
+    abc
+ 0: abc
+
+/^abc$/
+    abc
+ 0: abc
+    *** Failers
+No match
+    abbbbc
+No match
+    abcc
+No match
+
+/^abc/
+    abcc
+ 0: abc
+
+/abc$/
+    aabc
+ 0: abc
+    *** Failers
+No match
+    aabc
+ 0: abc
+    aabcd
+No match
+
+/^/
+    abc
+ 0: 
+
+/$/
+    abc
+ 0: 
+
+/a.c/
+    abc
+ 0: abc
+    axc
+ 0: axc
+
+/a.*c/
+    axyzc
+ 0: axyzc
+
+/a[bc]d/
+    abd
+ 0: abd
+    *** Failers
+No match
+    axyzd
+No match
+    abc
+No match
+
+/a[b-d]e/
+    ace
+ 0: ace
+
+/a[b-d]/
+    aac
+ 0: ac
+
+/a[-b]/
+    a-
+ 0: a-
+
+/a[b-]/
+    a-
+ 0: a-
+
+/a[]]b/
+    a]b
+ 0: a]b
+
+/a[^bc]d/
+    aed
+ 0: aed
+    *** Failers
+No match
+    abd
+No match
+    abd
+No match
+
+/a[^-b]c/
+    adc
+ 0: adc
+
+/a[^]b]c/
+    adc
+ 0: adc
+    *** Failers
+No match
+    a-c
+ 0: a-c
+    a]c
+No match
+
+/\ba\b/
+    a-
+ 0: a
+    -a
+ 0: a
+    -a-
+ 0: a
+
+/\by\b/
+    *** Failers
+No match
+    xy
+No match
+    yz
+No match
+    xyz
+No match
+
+/\Ba\B/
+    *** Failers
+ 0: a
+    a-
+No match
+    -a
+No match
+    -a-
+No match
+
+/\By\b/
+    xy
+ 0: y
+
+/\by\B/
+    yz
+ 0: y
+
+/\By\B/
+    xyz
+ 0: y
+
+/\w/
+    a
+ 0: a
+
+/\W/
+    -
+ 0: -
+    *** Failers
+ 0: *
+    -
+ 0: -
+    a
+No match
+
+/a\sb/
+    a b
+ 0: a b
+
+/a\Sb/
+    a-b
+ 0: a-b
+    *** Failers
+No match
+    a-b
+ 0: a-b
+    a b
+No match
+
+/[0-9]/
+    1
+ 0: 1
+
+/[^0-9]/
+    -
+ 0: -
+    *** Failers
+ 0: *
+    -
+ 0: -
+    1
+No match
+
+/ab|cd/
+    abc
+ 0: ab
+    abcd
+ 0: ab
+
+/()ef/
+    def
+ 0: ef
+ 1: 
+
+/a\(b/
+    a(b
+ 0: a(b
+
+/a\(*b/
+    ab
+ 0: ab
+    a((b
+ 0: a((b
+
+/((a))/
+    abc
+ 0: a
+ 1: a
+ 2: a
+
+/(a)b(c)/
+    abc
+ 0: abc
+ 1: a
+ 2: c
+
+/a+b+c/
+    aabbabc
+ 0: abc
+
+/a{1,}b{1,}c/
+    aabbabc
+ 0: abc
+
+/(a+|b)*/
+    ab
+ 0: ab
+ 1: b
+
+/(a+|b){0,}/
+    ab
+ 0: ab
+ 1: b
+
+/(a+|b)+/
+    ab
+ 0: ab
+ 1: b
+
+/(a+|b){1,}/
+    ab
+ 0: ab
+ 1: b
+
+/(a+|b)?/
+    ab
+ 0: a
+ 1: a
+
+/(a+|b){0,1}/
+    ab
+ 0: a
+ 1: a
+
+/[^ab]*/
+    cde
+ 0: cde
+
+/abc/
+    *** Failers
+No match
+    b
+No match
+    
+
+/a*/
+    
+
+/([abc])*d/
+    abbbcd
+ 0: abbbcd
+ 1: c
+
+/([abc])*bcd/
+    abcd
+ 0: abcd
+ 1: a
+
+/a|b|c|d|e/
+    e
+ 0: e
+
+/(a|b|c|d|e)f/
+    ef
+ 0: ef
+ 1: e
+
+/abcd*efg/
+    abcdefg
+ 0: abcdefg
+
+/ab*/
+    xabyabbbz
+ 0: ab
+    xayabbbz
+ 0: a
+
+/(ab|cd)e/
+    abcde
+ 0: cde
+ 1: cd
+
+/[abhgefdc]ij/
+    hij
+ 0: hij
+
+/(abc|)ef/
+    abcdef
+ 0: ef
+ 1: 
+
+/(a|b)c*d/
+    abcd
+ 0: bcd
+ 1: b
+
+/(ab|ab*)bc/
+    abc
+ 0: abc
+ 1: a
+
+/a([bc]*)c*/
+    abc
+ 0: abc
+ 1: bc
+
+/a([bc]*)(c*d)/
+    abcd
+ 0: abcd
+ 1: bc
+ 2: d
+
+/a([bc]+)(c*d)/
+    abcd
+ 0: abcd
+ 1: bc
+ 2: d
+
+/a([bc]*)(c+d)/
+    abcd
+ 0: abcd
+ 1: b
+ 2: cd
+
+/a[bcd]*dcdcde/
+    adcdcde
+ 0: adcdcde
+
+/a[bcd]+dcdcde/
+    *** Failers
+No match
+    abcde
+No match
+    adcdcde
+No match
+
+/(ab|a)b*c/
+    abc
+ 0: abc
+ 1: ab
+
+/((a)(b)c)(d)/
+    abcd
+ 0: abcd
+ 1: abc
+ 2: a
+ 3: b
+ 4: d
+
+/[a-zA-Z_][a-zA-Z0-9_]*/
+    alpha
+ 0: alpha
+
+/^a(bc+|b[eh])g|.h$/
+    abh
+ 0: bh
+
+/(bc+d$|ef*g.|h?i(j|k))/
+    effgz
+ 0: effgz
+ 1: effgz
+    ij
+ 0: ij
+ 1: ij
+ 2: j
+    reffgz
+ 0: effgz
+ 1: effgz
+    *** Failers
+No match
+    effg
+No match
+    bcdd
+No match
+
+/((((((((((a))))))))))/
+    a
+ 0: a
+ 1: a
+ 2: a
+ 3: a
+ 4: a
+ 5: a
+ 6: a
+ 7: a
+ 8: a
+ 9: a
+10: a
+
+/((((((((((a))))))))))\9/
+    aa
+ 0: aa
+ 1: a
+ 2: a
+ 3: a
+ 4: a
+ 5: a
+ 6: a
+ 7: a
+ 8: a
+ 9: a
+10: a
+
+/(((((((((a)))))))))/
+    a
+ 0: a
+ 1: a
+ 2: a
+ 3: a
+ 4: a
+ 5: a
+ 6: a
+ 7: a
+ 8: a
+ 9: a
+
+/multiple words of text/
+    *** Failers
+No match
+    aa
+No match
+    uh-uh
+No match
+
+/multiple words/
+    multiple words, yeah
+ 0: multiple words
+
+/(.*)c(.*)/
+    abcde
+ 0: abcde
+ 1: ab
+ 2: de
+
+/\((.*), (.*)\)/
+    (a, b)
+ 0: (a, b)
+ 1: a
+ 2: b
+
+/abcd/
+    abcd
+ 0: abcd
+
+/a(bc)d/
+    abcd
+ 0: abcd
+ 1: bc
+
+/a[-]?c/
+    ac
+ 0: ac
+
+/(abc)\1/
+    abcabc
+ 0: abcabc
+ 1: abc
+
+/([a-c]*)\1/
+    abcabc
+ 0: abcabc
+ 1: abc
+
+/(a)|\1/
+    a
+ 0: a
+ 1: a
+    *** Failers
+ 0: a
+ 1: a
+    ab
+ 0: a
+ 1: a
+    x
+No match
+
+/abc/i
+    ABC
+ 0: ABC
+    XABCY
+ 0: ABC
+    ABABC
+ 0: ABC
+    *** Failers
+No match
+    aaxabxbaxbbx
+No match
+    XBC
+No match
+    AXC
+No match
+    ABX
+No match
+
+/ab*c/i
+    ABC
+ 0: ABC
+
+/ab*bc/i
+    ABC
+ 0: ABC
+    ABBC
+ 0: ABBC
+
+/ab+bc/i
+    *** Failers
+No match
+    ABC
+No match
+    ABQ
+No match
+
+/ab+bc/i
+    ABBBBC
+ 0: ABBBBC
+
+/^abc$/i
+    ABC
+ 0: ABC
+    *** Failers
+No match
+    ABBBBC
+No match
+    ABCC
+No match
+
+/^abc/i
+    ABCC
+ 0: ABC
+
+/abc$/i
+    AABC
+ 0: ABC
+
+/^/i
+    ABC
+ 0: 
+
+/$/i
+    ABC
+ 0: 
+
+/a.c/i
+    ABC
+ 0: ABC
+    AXC
+ 0: AXC
+
+/a.*c/i
+    *** Failers
+No match
+    AABC
+ 0: AABC
+    AXYZD
+No match
+
+/a[bc]d/i
+    ABD
+ 0: ABD
+
+/a[b-d]e/i
+    ACE
+ 0: ACE
+    *** Failers
+No match
+    ABC
+No match
+    ABD
+No match
+
+/a[b-d]/i
+    AAC
+ 0: AC
+
+/a[-b]/i
+    A-
+ 0: A-
+
+/a[b-]/i
+    A-
+ 0: A-
+
+/a[]]b/i
+    A]B
+ 0: A]B
+
+/a[^bc]d/i
+    AED
+ 0: AED
+
+/a[^-b]c/i
+    ADC
+ 0: ADC
+    *** Failers
+No match
+    ABD
+No match
+    A-C
+No match
+
+/a[^]b]c/i
+    ADC
+ 0: ADC
+
+/ab|cd/i
+    ABC
+ 0: AB
+    ABCD
+ 0: AB
+
+/()ef/i
+    DEF
+ 0: EF
+ 1: 
+
+/$b/i
+    *** Failers
+No match
+    A]C
+No match
+    B
+No match
+
+/a\(b/i
+    A(B
+ 0: A(B
+
+/a\(*b/i
+    AB
+ 0: AB
+    A((B
+ 0: A((B
+
+/((a))/i
+    ABC
+ 0: A
+ 1: A
+ 2: A
+
+/(a)b(c)/i
+    ABC
+ 0: ABC
+ 1: A
+ 2: C
+
+/a+b+c/i
+    AABBABC
+ 0: ABC
+
+/a{1,}b{1,}c/i
+    AABBABC
+ 0: ABC
+
+/(a+|b)*/i
+    AB
+ 0: AB
+ 1: B
+
+/(a+|b){0,}/i
+    AB
+ 0: AB
+ 1: B
+
+/(a+|b)+/i
+    AB
+ 0: AB
+ 1: B
+
+/(a+|b){1,}/i
+    AB
+ 0: AB
+ 1: B
+
+/(a+|b)?/i
+    AB
+ 0: A
+ 1: A
+
+/(a+|b){0,1}/i
+    AB
+ 0: A
+ 1: A
+
+/[^ab]*/i
+    CDE
+ 0: CDE
+
+/([abc])*d/i
+    ABBBCD
+ 0: ABBBCD
+ 1: C
+
+/([abc])*bcd/i
+    ABCD
+ 0: ABCD
+ 1: A
+
+/a|b|c|d|e/i
+    E
+ 0: E
+
+/(a|b|c|d|e)f/i
+    EF
+ 0: EF
+ 1: E
+
+/abcd*efg/i
+    ABCDEFG
+ 0: ABCDEFG
+
+/ab*/i
+    XABYABBBZ
+ 0: AB
+    XAYABBBZ
+ 0: A
+
+/(ab|cd)e/i
+    ABCDE
+ 0: CDE
+ 1: CD
+
+/[abhgefdc]ij/i
+    HIJ
+ 0: HIJ
+
+/^(ab|cd)e/i
+    ABCDE
+No match
+
+/(abc|)ef/i
+    ABCDEF
+ 0: EF
+ 1: 
+
+/(a|b)c*d/i
+    ABCD
+ 0: BCD
+ 1: B
+
+/(ab|ab*)bc/i
+    ABC
+ 0: ABC
+ 1: A
+
+/a([bc]*)c*/i
+    ABC
+ 0: ABC
+ 1: BC
+
+/a([bc]*)(c*d)/i
+    ABCD
+ 0: ABCD
+ 1: BC
+ 2: D
+
+/a([bc]+)(c*d)/i
+    ABCD
+ 0: ABCD
+ 1: BC
+ 2: D
+
+/a([bc]*)(c+d)/i
+    ABCD
+ 0: ABCD
+ 1: B
+ 2: CD
+
+/a[bcd]*dcdcde/i
+    ADCDCDE
+ 0: ADCDCDE
+
+/a[bcd]+dcdcde/i
+
+/(ab|a)b*c/i
+    ABC
+ 0: ABC
+ 1: AB
+
+/((a)(b)c)(d)/i
+    ABCD
+ 0: ABCD
+ 1: ABC
+ 2: A
+ 3: B
+ 4: D
+
+/[a-zA-Z_][a-zA-Z0-9_]*/i
+    ALPHA
+ 0: ALPHA
+
+/^a(bc+|b[eh])g|.h$/i
+    ABH
+ 0: BH
+
+/(bc+d$|ef*g.|h?i(j|k))/i
+    EFFGZ
+ 0: EFFGZ
+ 1: EFFGZ
+    IJ
+ 0: IJ
+ 1: IJ
+ 2: J
+    REFFGZ
+ 0: EFFGZ
+ 1: EFFGZ
+    *** Failers
+No match
+    ADCDCDE
+No match
+    EFFG
+No match
+    BCDD
+No match
+
+/((((((((((a))))))))))/i
+    A
+ 0: A
+ 1: A
+ 2: A
+ 3: A
+ 4: A
+ 5: A
+ 6: A
+ 7: A
+ 8: A
+ 9: A
+10: A
+
+/((((((((((a))))))))))\9/i
+    AA
+ 0: AA
+ 1: A
+ 2: A
+ 3: A
+ 4: A
+ 5: A
+ 6: A
+ 7: A
+ 8: A
+ 9: A
+10: A
+
+/(((((((((a)))))))))/i
+    A
+ 0: A
+ 1: A
+ 2: A
+ 3: A
+ 4: A
+ 5: A
+ 6: A
+ 7: A
+ 8: A
+ 9: A
+
+/multiple words of text/i
+    *** Failers
+No match
+    AA
+No match
+    UH-UH
+No match
+
+/multiple words/i
+    MULTIPLE WORDS, YEAH
+ 0: MULTIPLE WORDS
+
+/(.*)c(.*)/i
+    ABCDE
+ 0: ABCDE
+ 1: AB
+ 2: DE
+
+/\((.*), (.*)\)/i
+    (A, B)
+ 0: (A, B)
+ 1: A
+ 2: B
+
+/abcd/i
+    ABCD
+ 0: ABCD
+
+/a(bc)d/i
+    ABCD
+ 0: ABCD
+ 1: BC
+
+/a[-]?c/i
+    AC
+ 0: AC
+
+/(abc)\1/i
+    ABCABC
+ 0: ABCABC
+ 1: ABC
+
+/([a-c]*)\1/i
+    ABCABC
+ 0: ABCABC
+ 1: ABC
+
+/((foo)|(bar))*/
+    foobar
+ 0: foobar
+ 1: bar
+ 2: foo
+ 3: bar
+
+/^(.+)?B/
+    AB
+ 0: AB
+ 1: A
+
+/^([^a-z])|(\^)$/
+    .
+ 0: .
+ 1: .
+
+/^[<>]&/
+    <&OUT
+ 0: <&
+
+/^(){3,5}/
+    abc
+ 0: 
+ 1: 
+
+/^(a+)*ax/
+    aax
+ 0: aax
+ 1: a
+
+/^((a|b)+)*ax/
+    aax
+ 0: aax
+ 1: a
+ 2: a
+
+/^((a|bc)+)*ax/
+    aax
+ 0: aax
+ 1: a
+ 2: a
+
+/(a|x)*ab/
+    cab
+ 0: ab
+
+/(a)*ab/
+    cab
+ 0: ab
+
+/(ab)[0-9]\1/i
+    Ab4ab
+ 0: Ab4ab
+ 1: Ab
+    ab4Ab
+ 0: ab4Ab
+ 1: ab
+
+/foo\w*[0-9]{4}baz/
+    foobar1234baz
+ 0: foobar1234baz
+
+/(\w+:)+/
+    one:
+ 0: one:
+ 1: one:
+
+/((\w|:)+::)?(\w+)$/
+    abcd
+ 0: abcd
+ 1: <unset>
+ 2: <unset>
+ 3: abcd
+    xy:z:::abcd
+ 0: xy:z:::abcd
+ 1: xy:z:::
+ 2: :
+ 3: abcd
+
+/^[^bcd]*(c+)/
+    aexycd
+ 0: aexyc
+ 1: c
+
+/(a*)b+/
+    caab
+ 0: aab
+ 1: aa
+
+/((\w|:)+::)?(\w+)$/
+    abcd
+ 0: abcd
+ 1: <unset>
+ 2: <unset>
+ 3: abcd
+    xy:z:::abcd
+ 0: xy:z:::abcd
+ 1: xy:z:::
+ 2: :
+ 3: abcd
+    *** Failers
+ 0: Failers
+ 1: <unset>
+ 2: <unset>
+ 3: Failers
+    abcd:
+No match
+    abcd:
+No match
+
+/^[^bcd]*(c+)/
+    aexycd
+ 0: aexyc
+ 1: c
+
+/((Z)+|A)*/
+    ZABCDEFG
+ 0: ZA
+ 1: A
+ 2: Z
+
+/(Z()|A)*/
+    ZABCDEFG
+ 0: ZA
+ 1: A
+ 2: 
+
+/(Z(())|A)*/
+    ZABCDEFG
+ 0: ZA
+ 1: A
+ 2: 
+ 3: 
+
+/(.*)[0-9]+\1/
+    abc123abc
+ 0: abc123abc
+ 1: abc
+    abc123bc 
+ 0: bc123bc
+ 1: bc
+
+/((.*))[0-9]+\1/
+    abc123abc
+ 0: abc123abc
+ 1: abc
+ 2: abc
+    abc123bc  
+ 0: bc123bc
+ 1: bc
+ 2: bc
+
+/^a{2,5}$/
+    aa
+ 0: aa
+    aaa
+ 0: aaa
+    aaaa
+ 0: aaaa
+    aaaaa
+ 0: aaaaa
+    *** Failers
+No match
+    a
+No match
+    b
+No match
+    aaaaab
+No match
+    aaaaaa
diff --git a/test/src/regex-resources/PTESTS b/test/src/regex-resources/PTESTS
new file mode 100644
index 0000000..02b357c
--- /dev/null
+++ b/test/src/regex-resources/PTESTS
@@ -0,0 +1,341 @@
+# 2.8.2  Regular Expression General Requirement
+2¦4¦bb*¦abbbc¦
+2¦2¦bb*¦ababbbc¦
+7¦9¦A#*::¦A:A#:qA::qA#::qA##::q¦
+1¦5¦A#*::¦A##::A#::qA::qA#:q¦
+# 2.8.3.1.2  BRE Special Characters
+# GA108
+2¦2¦\.¦a.c¦
+2¦2¦\[¦a[c¦
+2¦2¦\\¦a\c¦
+2¦2¦\*¦a*c¦
+2¦2¦\^¦a^c¦
+2¦2¦\$¦a$c¦
+7¦11¦X\*Y\*8¦Y*8X*8X*Y*8¦
+# GA109
+2¦2¦[.]¦a.c¦
+2¦2¦[[]¦a[c¦
+-1¦-1¦[[]¦ac¦
+2¦2¦[\]¦a\c¦
+1¦1¦[\a]¦abc¦
+2¦2¦[\.]¦a\.c¦
+2¦2¦[\.]¦a.\c¦
+2¦2¦[*]¦a*c¦
+2¦2¦[$]¦a$c¦
+2¦2¦[X*Y8]¦7*8YX¦
+# GA110
+2¦2¦*¦a*c¦
+3¦4¦*a¦*b*a*c¦
+1¦5¦**9=¦***9=9¦
+# GA111
+1¦1¦^*¦*bc¦
+-1¦-1¦^*¦a*c¦
+-1¦-1¦^*¦^*ab¦
+1¦5¦^**9=¦***9=¦
+-1¦-1¦^*5<*9¦5<9*5<*9¦
+# GA112
+2¦3¦\(*b\)¦a*b¦
+-1¦-1¦\(*b\)¦ac¦
+1¦6¦A\(**9\)=¦A***9=79¦
+# GA113(1)
+1¦3¦\(^*ab\)¦*ab¦
+-1¦-1¦\(^*ab\)¦^*ab¦
+-1¦-1¦\(^*b\)¦a*b¦
+-1¦-1¦\(^*b\)¦^*b¦
+### GA113(2)			GNU regex implements GA113(1)
+##-1¦-1¦\(^*ab\)¦*ab¦
+##-1¦-1¦\(^*ab\)¦^*ab¦
+##1¦1¦\(^*b\)¦b¦
+##1¦3¦\(^*b\)¦^^b¦
+# GA114
+1¦3¦a^b¦a^b¦
+1¦3¦a\^b¦a^b¦
+1¦1¦^^¦^bc¦
+2¦2¦\^¦a^c¦
+1¦1¦[c^b]¦^abc¦
+1¦1¦[\^ab]¦^ab¦
+2¦2¦[\^ab]¦c\d¦
+-1¦-1¦[^^]¦^¦
+1¦3¦\(a^b\)¦a^b¦
+1¦3¦\(a\^b\)¦a^b¦
+2¦2¦\(\^\)¦a^b¦
+# GA115
+3¦3¦$$¦ab$¦
+-1¦-1¦$$¦$ab¦
+2¦3¦$c¦a$c¦
+2¦2¦[$]¦a$c¦
+1¦2¦\$a¦$a¦
+3¦3¦\$$¦ab$¦
+2¦6¦A\([34]$[34]\)B¦XA4$3BY¦
+# 2.8.3.1.3  Periods in BREs
+# GA116
+1¦1¦.¦abc¦
+-1¦-1¦.ab¦abc¦
+1¦3¦ab.¦abc¦
+1¦3¦a.b¦a,b¦
+-1¦-1¦.......¦PqRs6¦
+1¦7¦.......¦PqRs6T8¦
+# 2.8.3.2  RE Bracket Expression
+# GA118
+2¦2¦[abc]¦xbyz¦
+-1¦-1¦[abc]¦xyz¦
+2¦2¦[abc]¦xbay¦
+# GA119
+2¦2¦[^a]¦abc¦
+4¦4¦[^]cd]¦cd]ef¦
+2¦2¦[^abc]¦axyz¦
+-1¦-1¦[^abc]¦abc¦
+3¦3¦[^[.a.]b]¦abc¦
+3¦3¦[^[=a=]b]¦abc¦
+2¦2¦[^-ac]¦abcde-¦
+2¦2¦[^ac-]¦abcde-¦
+3¦3¦[^a-b]¦abcde¦
+3¦3¦[^a-bd-e]¦dec¦
+2¦2¦[^---]¦-ab¦
+16¦16¦[^a-zA-Z0-9]¦pqrstVWXYZ23579#¦
+# GA120(1)
+3¦3¦[]a]¦cd]ef¦
+1¦1¦[]-a]¦a_b¦
+3¦3¦[][.-.]-0]¦ab0-]¦
+1¦1¦[]^a-z]¦string¦
+# GA120(2)
+4¦4¦[^]cd]¦cd]ef¦
+0¦0¦[^]]*¦]]]]]]]]X¦
+0¦0¦[^]]*¦]]]]]]]]¦
+9¦9¦[^]]\{1,\}¦]]]]]]]]X¦
+-1¦-1¦[^]]\{1,\}¦]]]]]]]]¦
+# GA120(3)
+3¦3¦[c[.].]d]¦ab]cd¦
+2¦8¦[a-z]*[[.].]][A-Z]*¦Abcd]DEFg¦
+# GA121
+2¦2¦[[.a.]b]¦Abc¦
+1¦1¦[[.a.]b]¦aBc¦
+-1¦-1¦[[.a.]b]¦ABc¦
+3¦3¦[^[.a.]b]¦abc¦
+3¦3¦[][.-.]-0]¦ab0-]¦
+3¦3¦[A-[.].]c]¦ab]!¦
+# GA122
+-2¦-2¦[[.ch.]]¦abc¦
+-2¦-2¦[[.ab.][.CD.][.EF.]]¦yZabCDEFQ9¦
+# GA125
+2¦2¦[[=a=]b]¦Abc¦
+1¦1¦[[=a=]b]¦aBc¦
+-1¦-1¦[[=a=]b]¦ABc¦
+3¦3¦[^[=a=]b]¦abc¦
+# GA126
+#W the expected result for [[:alnum:]]* is 2-7 which is wrong
+0¦0¦[[:alnum:]]*¦ aB28gH¦
+2¦7¦[[:alnum:]][[:alnum:]]*¦ aB28gH¦
+#W the expected result for [^[:alnum:]]* is 2-5 which is wrong
+0¦0¦[^[:alnum:]]*¦2 	,\x7fa¦
+2¦5¦[^[:alnum:]][^[:alnum:]]*¦2 	,\x7fa¦
+#W the expected result for [[:alpha:]]* is 2-5 which is wrong
+0¦0¦[[:alpha:]]*¦ aBgH2¦
+2¦5¦[[:alpha:]][[:alpha:]]*¦ aBgH2¦
+1¦6¦[^[:alpha:]]*¦2 	8,\x7fa¦
+1¦2¦[[:blank:]]*¦ 	
\x7f¦
+1¦8¦[^[:blank:]]*¦aB28gH,\x7f ¦
+1¦2¦[[:cntrl:]]*¦	\x7f ¦
+1¦8¦[^[:cntrl:]]*¦aB2 8gh,¦
+#W the expected result for [[:digit:]]* is 2-3 which is wrong
+0¦0¦[[:digit:]]*¦a28¦
+2¦3¦[[:digit:]][[:digit:]]*¦a28¦
+1¦8¦[^[:digit:]]*¦aB 	gH,\x7f¦
+1¦7¦[[:graph:]]*¦aB28gH, ¦
+1¦3¦[^[:graph:]]*¦ 	\x7f,¦
+1¦2¦[[:lower:]]*¦agB¦
+1¦8¦[^[:lower:]]*¦B2 	8H,\x7fa¦
+1¦8¦[[:print:]]*¦aB2 8gH,	¦
+1¦2¦[^[:print:]]*¦	\x7f ¦
+#W the expected result for [[:punct:]]* is 2-2 which is wrong
+0¦0¦[[:punct:]]*¦a,2¦
+2¦3¦[[:punct:]][[:punct:]]*¦a,,2¦
+1¦9¦[^[:punct:]]*¦aB2 	8gH\x7f¦
+1¦3¦[[:space:]]*¦ 	
\x7f¦
+#W the expected result for [^[:space:]]* is 2-9 which is wrong
+0¦0¦[^[:space:]]*¦ aB28gH,\x7f	¦
+2¦9¦[^[:space:]][^[:space:]]*¦ aB28gH,\x7f	¦
+#W the expected result for [[:upper:]]* is 2-3 which is wrong
+0¦0¦[[:upper:]]*¦aBH2¦
+2¦3¦[[:upper:]][[:upper:]]*¦aBH2¦
+1¦8¦[^[:upper:]]*¦a2 	8g,\x7fB¦
+#W the expected result for [[:xdigit:]]* is 2-5 which is wrong
+0¦0¦[[:xdigit:]]*¦gaB28h¦
+2¦5¦[[:xdigit:]][[:xdigit:]]*¦gaB28h¦
+#W the expected result for [^[:xdigit:]]* is 2-7 which is wrong
+2¦7¦[^[:xdigit:]][^[:xdigit:]]*¦a 	gH,\x7f2¦
+# GA127
+-2¦-2¦[b-a]¦abc¦
+1¦1¦[a-c]¦bbccde¦
+2¦2¦[a-b]¦-bc¦
+3¦3¦[a-z0-9]¦AB0¦
+3¦3¦[^a-b]¦abcde¦
+3¦3¦[^a-bd-e]¦dec¦
+1¦1¦[]-a]¦a_b¦
+2¦2¦[+--]¦a,b¦
+2¦2¦[--/]¦a.b¦
+2¦2¦[^---]¦-ab¦
+3¦3¦[][.-.]-0]¦ab0-]¦
+3¦3¦[A-[.].]c]¦ab]!¦
+2¦6¦bc[d-w]xy¦abchxyz¦
+# GA129
+1¦1¦[a-cd-f]¦dbccde¦
+-1¦-1¦[a-ce-f]¦dBCCdE¦
+2¦4¦b[n-zA-M]Y¦absY9Z¦
+2¦4¦b[n-zA-M]Y¦abGY9Z¦
+# GA130
+3¦3¦[-xy]¦ac-¦
+2¦4¦c[-xy]D¦ac-D+¦
+2¦2¦[--/]¦a.b¦
+2¦4¦c[--/]D¦ac.D+b¦
+2¦2¦[^-ac]¦abcde-¦
+1¦3¦a[^-ac]c¦abcde-¦
+3¦3¦[xy-]¦zc-¦
+2¦4¦c[xy-]7¦zc-786¦
+2¦2¦[^ac-]¦abcde-¦
+2¦4¦a[^ac-]c¦5abcde-¦
+2¦2¦[+--]¦a,b¦
+2¦4¦a[+--]B¦Xa,By¦
+2¦2¦[^---]¦-ab¦
+4¦6¦X[^---]Y¦X-YXaYXbY¦
+# 2.8.3.3  BREs Matching Multiple Characters
+# GA131
+3¦4¦cd¦abcdeabcde¦
+1¦2¦ag*b¦abcde¦
+-1¦-1¦[a-c][e-f]¦abcdef¦
+3¦4¦[a-c][e-f]¦acbedf¦
+4¦8¦abc*XYZ¦890abXYZ#*¦
+4¦9¦abc*XYZ¦890abcXYZ#*¦
+4¦15¦abc*XYZ¦890abcccccccXYZ#*¦
+-1¦-1¦abc*XYZ¦890abc*XYZ#*¦
+# GA132
+2¦4¦\(*bc\)¦a*bc¦
+1¦2¦\(ab\)¦abcde¦
+1¦10¦\(a\(b\(c\(d\(e\(f\(g\)h\(i\(j\)\)\)\)\)\)\)\)¦abcdefghijk¦
+3¦8¦43\(2\(6\)*0\)AB¦654320ABCD¦
+3¦9¦43\(2\(7\)*0\)AB¦6543270ABCD¦
+3¦12¦43\(2\(7\)*0\)AB¦6543277770ABCD¦
+# GA133
+1¦10¦\(a\(b\(c\(d\(e\(f\(g\)h\(i\(j\)\)\)\)\)\)\)\)¦abcdefghijk¦
+-1¦-1¦\(a\(b\(c\(d\(e\(f\(g\)h\(i\(k\)\)\)\)\)\)\)\)¦abcdefghijk¦
+# GA134
+2¦4¦\(bb*\)¦abbbc¦
+2¦2¦\(bb*\)¦ababbbc¦
+1¦6¦a\(.*b\)¦ababbbc¦
+1¦2¦a\(b*\)¦ababbbc¦
+1¦20¦a\(.*b\)c¦axcaxbbbcsxbbbbbbbbc¦
+# GA135
+1¦7¦\(a\(b\(c\(d\(e\)\)\)\)\)\4¦abcdededede¦
+#W POSIX does not really specify whether a\(b\)*c\1 matches acb.
+#W back references are supposed to expand to the last match, but what
+#W if there never was a match as in this case?
+-1¦-1¦a\(b\)*c\1¦acb¦
+1¦11¦\(a\(b\(c\(d\(e\(f\(g\)h\(i\(j\)\)\)\)\)\)\)\)\9¦abcdefghijjk¦
+# GA136
+#W These two tests have the same problem as the test in GA135.  No match
+#W of a subexpression, why should the back reference be usable?
+#W 1 2 a\(b\)*c\1 acb
+#W 4 7 a\(b\(c\(d\(f\)*\)\)\)\4¦xYzabcdePQRST
+-1¦-1¦a\(b\)*c\1¦acb¦
+-1¦-1¦a\(b\(c\(d\(f\)*\)\)\)\4¦xYzabcdePQRST¦
+# GA137
+-2¦-2¦\(a\(b\)\)\3¦foo¦
+-2¦-2¦\(a\(b\)\)\(a\(b\)\)\5¦foo¦
+# GA138
+1¦2¦ag*b¦abcde¦
+1¦10¦a.*b¦abababvbabc¦
+2¦5¦b*c¦abbbcdeabbbbbbcde¦
+2¦5¦bbb*c¦abbbcdeabbbbbbcde¦
+1¦5¦a\(b\)*c\1¦abbcbbb¦
+-1¦-1¦a\(b\)*c\1¦abbdbd¦
+0¦0¦\([a-c]*\)\1¦abcacdef¦
+1¦6¦\([a-c]*\)\1¦abcabcabcd¦
+1¦2¦a^*b¦ab¦
+1¦5¦a^*b¦a^^^b¦
+# GA139
+1¦2¦a\{2\}¦aaaa¦
+1¦7¦\([a-c]*\)\{0,\}¦aabcaab¦
+1¦2¦\(a\)\1\{1,2\}¦aabc¦
+1¦3¦\(a\)\1\{1,2\}¦aaaabc¦
+#W the expression \(\(a\)\1\)\{1,2\} is ill-formed, using \2
+1¦4¦\(\(a\)\2\)\{1,2\}¦aaaabc¦
+# GA140
+1¦2¦a\{2\}¦aaaa¦
+-1¦-1¦a\{2\}¦abcd¦
+0¦0¦a\{0\}¦aaaa¦
+1¦64¦a\{64\}¦aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa¦
+# GA141
+1¦7¦\([a-c]*\)\{0,\}¦aabcaab¦
+#W the expected result for \([a-c]*\)\{2,\} is failure which isn't correct
+1¦3¦\([a-c]*\)\{2,\}¦abcdefg¦
+1¦3¦\([a-c]*\)\{1,\}¦abcdefg¦
+-1¦-1¦a\{64,\}¦aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa¦
+# GA142
+1¦3¦a\{2,3\}¦aaaa¦
+-1¦-1¦a\{2,3\}¦abcd¦
+0¦0¦\([a-c]*\)\{0,0\}¦foo¦
+1¦63¦a\{1,63\}¦aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa¦
+# 2.8.3.4  BRE Precedence
+# GA143
+#W There are numerous bugs in the original version.
+2¦19¦\^\[[[.].]]\\(\\1\\)\*\\{1,2\\}\$¦a^[]\(\1\)*\{1,2\}$b¦
+1¦6¦[[=*=]][[=\=]][[=]=]][[===]][[...]][[:punct:]]¦*\]=.;¦
+1¦6¦[$\(*\)^]*¦$\()*^¦
+1¦1¦[\1]¦1¦
+1¦1¦[\{1,2\}]¦{¦
+#W the expected result for \(*\)*\1* is 2-2 which isn't correct
+0¦0¦\(*\)*\1*¦a*b*11¦
+2¦3¦\(*\)*\1*b¦a*b*11¦
+#W the expected result for \(a\(b\{1,2\}\)\{1,2\}\) is 1-5 which isn't correct
+1¦3¦\(a\(b\{1,2\}\)\{1,2\}\)¦abbab¦
+1¦5¦\(a\(b\{1,2\}\)\)\{1,2\}¦abbab¦
+1¦1¦^\(^\(^a$\)$\)$¦a¦
+1¦2¦\(a\)\1$¦aa¦
+1¦3¦ab*¦abb¦
+1¦4¦ab\{2,4\}¦abbbc¦
+# 2.8.3.5  BRE Expression Anchoring
+# GA144
+1¦1¦^a¦abc¦
+-1¦-1¦^b¦abc¦
+-1¦-1¦^[a-zA-Z]¦99Nine¦
+1¦4¦^[a-zA-Z]*¦Nine99¦
+# GA145(1)
+1¦2¦\(^a\)\1¦aabc¦
+-1¦-1¦\(^a\)\1¦^a^abc¦
+1¦2¦\(^^a\)¦^a¦
+1¦1¦\(^^\)¦^^¦
+1¦3¦\(^abc\)¦abcdef¦
+-1¦-1¦\(^def\)¦abcdef¦
+### GA145(2)			GNU regex implements GA145(1)
+##-1¦-1¦\(^a\)\1¦aabc¦
+##1¦4¦\(^a\)\1¦^a^abc¦
+##-1¦-1¦\(^^a\)¦^a¦
+##1¦2¦\(^^\)¦^^¦
+# GA146
+3¦3¦a$¦cba¦
+-1¦-1¦a$¦abc¦
+5¦7¦[a-z]*$¦99ZZxyz¦
+#W the expected result for [a-z]*$ is failure which isn't correct
+10¦9¦[a-z]*$¦99ZZxyz99¦
+3¦3¦$$¦ab$¦
+-1¦-1¦$$¦$ab¦
+3¦3¦\$$¦ab$¦
+# GA147(1)
+-1¦-1¦\(a$\)\1¦bcaa¦
+-1¦-1¦\(a$\)\1¦ba$¦
+-1¦-1¦\(ab$\)¦ab$¦
+1¦2¦\(ab$\)¦ab¦
+4¦6¦\(def$\)¦abcdef¦
+-1¦-1¦\(abc$\)¦abcdef¦
+### GA147(2)			GNU regex implements GA147(1)
+##-1¦-1¦\(a$\)\1¦bcaa¦
+##2¦5¦\(a$\)\1¦ba$a$¦
+##-1¦-1¦\(ab$\)¦ab¦
+##1¦3¦\(ab$\)¦ab$¦
+# GA148
+0¦0¦^$¦¦
+1¦3¦^abc$¦abc¦
+-1¦-1¦^xyz$¦^xyz^¦
+-1¦-1¦^234$¦^234$¦
+1¦9¦^[a-zA-Z0-9]*$¦2aA3bB9zZ¦
+-1¦-1¦^[a-z0-9]*$¦2aA3b#B9zZ¦
diff --git a/test/src/regex-resources/TESTS b/test/src/regex-resources/TESTS
new file mode 100644
index 0000000..f2c9886
--- /dev/null
+++ b/test/src/regex-resources/TESTS
@@ -0,0 +1,167 @@
+0:(.*)*\1:xx
+0:^:
+0:$:
+0:^$:
+0:^a$:a
+0:abc:abc
+1:abc:xbc
+1:abc:axc
+1:abc:abx
+0:abc:xabcy
+0:abc:ababc
+0:ab*c:abc
+0:ab*bc:abc
+0:ab*bc:abbc
+0:ab*bc:abbbbc
+0:ab+bc:abbc
+1:ab+bc:abc
+1:ab+bc:abq
+0:ab+bc:abbbbc
+0:ab?bc:abbc
+0:ab?bc:abc
+1:ab?bc:abbbbc
+0:ab?c:abc
+0:^abc$:abc
+1:^abc$:abcc
+0:^abc:abcc
+1:^abc$:aabc
+0:abc$:aabc
+0:^:abc
+0:$:abc
+0:a.c:abc
+0:a.c:axc
+0:a.*c:axyzc
+1:a.*c:axyzd
+1:a[bc]d:abc
+0:a[bc]d:abd
+1:a[b-d]e:abd
+0:a[b-d]e:ace
+0:a[b-d]:aac
+0:a[-b]:a-
+0:a[b-]:a-
+2:a[b-a]:-
+2:a[]b:-
+2:a[:-
+0:a]:a]
+0:a[]]b:a]b
+0:a[^bc]d:aed
+1:a[^bc]d:abd
+0:a[^-b]c:adc
+1:a[^-b]c:a-c
+1:a[^]b]c:a]c
+0:a[^]b]c:adc
+0:ab|cd:abc
+0:ab|cd:abcd
+0:()ef:def
+0:()*:-
+2:*a:-
+2:^*:-
+2:$*:-
+2:(*)b:-
+1:$b:b
+2:a\:-
+0:a\(b:a(b
+0:a\(*b:ab
+0:a\(*b:a((b
+1:a\x:a\x
+1:abc):-
+2:(abc:-
+0:((a)):abc
+0:(a)b(c):abc
+0:a+b+c:aabbabc
+0:a**:-
+0:a*?:-
+0:(a*)*:-
+0:(a*)+:-
+0:(a|)*:-
+0:(a*|b)*:-
+0:(a+|b)*:ab
+0:(a+|b)+:ab
+0:(a+|b)?:ab
+0:[^ab]*:cde
+0:(^)*:-
+0:(ab|)*:-
+2:)(:-
+1:abc:
+1:abc:
+0:a*:
+0:([abc])*d:abbbcd
+0:([abc])*bcd:abcd
+0:a|b|c|d|e:e
+0:(a|b|c|d|e)f:ef
+0:((a*|b))*:-
+0:abcd*efg:abcdefg
+0:ab*:xabyabbbz
+0:ab*:xayabbbz
+0:(ab|cd)e:abcde
+0:[abhgefdc]ij:hij
+1:^(ab|cd)e:abcde
+0:(abc|)ef:abcdef
+0:(a|b)c*d:abcd
+0:(ab|ab*)bc:abc
+0:a([bc]*)c*:abc
+0:a([bc]*)(c*d):abcd
+0:a([bc]+)(c*d):abcd
+0:a([bc]*)(c+d):abcd
+0:a[bcd]*dcdcde:adcdcde
+1:a[bcd]+dcdcde:adcdcde
+0:(ab|a)b*c:abc
+0:((a)(b)c)(d):abcd
+0:[A-Za-z_][A-Za-z0-9_]*:alpha
+0:^a(bc+|b[eh])g|.h$:abh
+0:(bc+d$|ef*g.|h?i(j|k)):effgz
+0:(bc+d$|ef*g.|h?i(j|k)):ij
+1:(bc+d$|ef*g.|h?i(j|k)):effg
+1:(bc+d$|ef*g.|h?i(j|k)):bcdd
+0:(bc+d$|ef*g.|h?i(j|k)):reffgz
+1:((((((((((a)))))))))):-
+0:(((((((((a))))))))):a
+1:multiple words of text:uh-uh
+0:multiple words:multiple words, yeah
+0:(.*)c(.*):abcde
+1:\((.*),:(.*)\)
+1:[k]:ab
+0:abcd:abcd
+0:a(bc)d:abcd
+0:a[\x01-\x03]?c:a\x02c
+0:(....).*\1:beriberi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar Qaddafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Mo'ammar Gadhafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar Kaddafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar Qadhafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Moammar El Kadhafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar Gadafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Mu'ammar al-Qadafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Moamer El Kazzafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Moamar al-Gaddafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Mu'ammar Al Qathafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar Al Qathafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Mo'ammar el-Gadhafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Moamar El Kadhafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar al-Qadhafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Mu'ammar al-Qadhdhafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Mu'ammar Qadafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Moamar Gaddafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Mu'ammar Qadhdhafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar Khaddafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar al-Khaddafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Mu'amar al-Kadafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar Ghaddafy
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar Ghadafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar Ghaddafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muamar Kaddafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar Quathafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muammar Gheddafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Muamar Al-Kaddafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Moammar Khadafy 
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Moammar Qudhafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Mu'ammar al-Qaddafi
+0:M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]:Mulazim Awwal Mu'ammar Muhammad Abu Minyar al-Qadhafi
+0:[[:digit:]]+:01234
+1:[[:alpha:]]+:01234
+0:^[[:digit:]]*$:01234
+1:^[[:digit:]]*$:01234a
+0:^[[:alnum:]]*$:01234a
+0:^[[:xdigit:]]*$:01234a
+1:^[[:xdigit:]]*$:01234g
+0:^[[:alnum:][:space:]]*$:Hello world
-- 
2.8.0.rc3.226.g39d4020






^ permalink raw reply related	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH 2/7] Added driver for the regex tests
  2016-07-27 16:50   ` bug#24071: [PATCH 1/7] New regex tests imported from glibc 2.21 Michal Nazarewicz
@ 2016-07-27 16:50     ` Michal Nazarewicz
  2016-07-27 16:50     ` bug#24071: [PATCH 3/7] Fix reading of regex-resources in regex-tests Michal Nazarewicz
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Michal Nazarewicz @ 2016-07-27 16:50 UTC (permalink / raw)
  To: 24071; +Cc: Dima Kogan

From: Dima Kogan <dima@secretsauce.net>

* test/src/regex-tests.el (regex-tests): Test executing glibc tests
cases.

[mina86@mina86.com: merged test with existing file]
---
 test/src/regex-tests.el | 572 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 572 insertions(+)

diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
index 00165ab..13a9f86 100644
--- a/test/src/regex-tests.el
+++ b/test/src/regex-tests.el
@@ -20,6 +20,7 @@
 ;;; Code:
 
 (require 'ert)
+(require 'cl)
 
 (ert-deftest regex-word-cc-fallback-test ()
   "Test that ‘[[:cc:]]*x’ matches ‘x’ (bug#24020).
@@ -89,4 +90,575 @@ regex--test-cc
     (regex--test-cc "unibyte" "abcABC012 \t\n\1" "łą\u2622")
     (regex--test-cc "multibyte" "łą\u2622" "abcABC012 \t\n\1")))
 
+
+(defmacro regex-tests-generic-line (comment-char test-file whitelist &rest body)
+  "Reads a line of the test file TEST-FILE, skipping
+comments (defined by COMMENT-CHAR), and evaluates the tests in
+this line as defined in the BODY.  Line numbers in the WHITELIST
+are known failures, and are skipped."
+
+  `(with-temp-buffer
+    (modify-syntax-entry ?_ "w;; ") ; tests expect _ to be a word
+    (insert-file-contents ,(concat (file-name-directory (buffer-file-name)) test-file))
+
+    (let ((case-fold-search nil)
+          (line-number 1)
+          (whitelist-idx 0))
+
+      (goto-char (point-min))
+
+      (while (not (eobp))
+        (let ((start (point)))
+          (end-of-line)
+          (narrow-to-region start (point))
+
+          (goto-char (point-min))
+
+          (when
+              (and
+               ;; ignore comments
+               (save-excursion
+                 (re-search-forward ,(concat "^[^" (string comment-char) "]") nil t))
+
+               ;; skip lines in the whitelist
+               (let ((whitelist-next
+                      (condition-case nil
+                          (aref ,whitelist whitelist-idx) (args-out-of-range nil))))
+                 (cond
+                  ;; whitelist exhausted. do process this line
+                  ((null whitelist-next) t)
+
+                  ;; we're not yet at the next whitelist element. do
+                  ;; process this line
+                  ((< line-number whitelist-next) t)
+
+                  ;; we're past the next whitelist element. This
+                  ;; shouldn't happen
+                  ((> line-number whitelist-next)
+                   (error
+                    (format
+                     "We somehow skipped the next whitelist element: line %d" whitelist-next)))
+
+                  ;; we're at the next whitelist element. Skip this
+                  ;; line, and advance the whitelist index
+                  (t
+                   (setq whitelist-idx (1+ whitelist-idx)) nil))))
+            ,@body)
+
+          (widen)
+          (forward-line)
+          (beginning-of-line)
+          (setq line-number (1+ line-number)))))))
+
+(defun regex-tests-compare (string what-failed bounds-ref &optional substring-ref)
+  "I just ran a search, looking at STRING.  WHAT-FAILED describes
+what failed, if anything; valid values are 'search-failed,
+'compilation-failed and nil.  I compare the beginning/end of each
+group with their expected values.  This is done with either
+BOUNDS-REF or SUBSTRING-REF; one of those should be non-nil.
+BOUNDS-REF is a sequence \[start-ref0 end-ref0 start-ref1
+end-ref1 ....] while SUBSTRING-REF is the expected substring
+obtained by indexing the input string by start/end-ref.
+
+If the search was supposed to fail then start-ref0/substring-ref0
+is 'search-failed.  If the search wasn't even supposed to compile
+successfully, then start-ref0/substring-ref0 is
+'compilation-failed.  If I only care about a match succeeding,
+this can be set to t.
+
+This function returns a string that describes the failure, or nil
+on success"
+
+  (when (or
+         (and bounds-ref substring-ref)
+         (not (or bounds-ref substring-ref)))
+    (error "Exactly one of bounds-ref and bounds-ref should be non-nil"))
+
+  (let ((what-failed-ref (car (or bounds-ref substring-ref))))
+
+    (cond
+     ((eq what-failed 'search-failed)
+      (cond
+       ((eq what-failed-ref 'search-failed)
+        nil)
+       ((eq what-failed-ref 'compilation-failed)
+        "Expected pattern failure; but no match")
+       (t
+        "Expected match; but no match")))
+
+     ((eq what-failed 'compilation-failed)
+      (cond
+       ((eq what-failed-ref 'search-failed)
+        "Expected no match; but pattern failure")
+       ((eq what-failed-ref 'compilation-failed)
+        nil)
+       (t
+        "Expected match; but pattern failure")))
+
+     ;; The regex match succeeded
+     ((eq what-failed-ref 'search-failed)
+      "Expected no match; but match")
+     ((eq what-failed-ref 'compilation-failed)
+      "Expected pattern failure; but match")
+
+     ;; The regex match succeeded, as expected. I now check all the
+     ;; bounds
+     (t
+      (let ((idx 0)
+            msg
+            ref next-ref-function compare-ref-function mismatched-ref-function)
+
+        (if bounds-ref
+            (setq ref bounds-ref
+                  next-ref-function (lambda (x) (cddr x))
+                  compare-ref-function (lambda (ref start-pos end-pos)
+                                         (or (eq (car ref) t)
+                                             (and (eq start-pos (car ref))
+                                                  (eq end-pos   (cadr ref)))))
+                  mismatched-ref-function (lambda (ref start-pos end-pos)
+                                            (format
+                                             "beginning/end positions: %d/%s and %d/%s"
+                                             start-pos (car ref) end-pos (cadr ref))))
+          (setq ref substring-ref
+                next-ref-function (lambda (x) (cdr x))
+                compare-ref-function (lambda (ref start-pos end-pos)
+                                       (or (eq (car ref) t)
+                                           (string= (substring string start-pos end-pos) (car ref))))
+                mismatched-ref-function (lambda (ref start-pos end-pos)
+                                          (format
+                                           "beginning/end positions: %d/%s and %d/%s"
+                                           start-pos (car ref) end-pos (cadr ref)))))
+
+        (while (not (or (null ref) msg))
+
+          (let ((start (match-beginning idx))
+                (end   (match-end       idx)))
+
+            (when (not (funcall compare-ref-function ref start end))
+              (setq msg
+                    (format
+                     "Have expected match, but mismatch in group %d: %s" idx (funcall mismatched-ref-function ref start end))))
+
+            (setq ref (funcall next-ref-function ref)
+                  idx (1+ idx))))
+
+        (or msg
+            nil))))))
+
+
+
+(defun regex-tests-match (pattern string bounds-ref &optional substring-ref)
+  "I match the given STRING against PATTERN.  I compare the
+beginning/end of each group with their expected values.
+BOUNDS-REF is a sequence [start-ref0 end-ref0 start-ref1 end-ref1
+....].
+
+If the search was supposed to fail then start-ref0 is
+'search-failed.  If the search wasn't even supposed to compile
+successfully, then start-ref0 is 'compilation-failed.
+
+This function returns a string that describes the failure, or nil
+on success"
+
+  (if (string-match "\\[\\([\\.=]\\)..?\\1\\]" pattern)
+      ;; Skipping test: [.x.] and [=x=] forms not supported by emacs
+      nil
+
+    (regex-tests-compare
+     string
+     (condition-case nil
+         (if (string-match pattern string) nil 'search-failed)
+       ('invalid-regexp 'compilation-failed))
+     bounds-ref substring-ref)))
+
+
+(defconst regex-tests-re-even-escapes
+  "\\(?:^\\|[^\\\\]\\)\\(?:\\\\\\\\\\)*"
+  "Regex that matches an even number of \\ characters")
+
+(defconst regex-tests-re-odd-escapes
+  (concat regex-tests-re-even-escapes "\\\\")
+  "Regex that matches an odd number of \\ characters")
+
+
+(defun regex-tests-unextend (pattern)
+  "Basic conversion from extended regexen to emacs ones.  This is
+mostly a hack that adds \\ to () and | and {}, and removes it if
+it already exists.  We also change \\S (and \\s) to \\S- (and
+\\s-) because extended regexen see the former as whitespace, but
+emacs requires an extra symbol character"
+
+  (with-temp-buffer
+    (insert pattern)
+    (goto-char (point-min))
+
+    (while (re-search-forward "[()|{}]" nil t)
+      ;; point is past special character. If it is escaped, unescape
+      ;; it
+
+      (if (save-excursion
+            (re-search-backward (concat regex-tests-re-odd-escapes ".\\=") nil t))
+
+          ;; This special character is preceded by an odd number of \,
+          ;; so I unescape it by removing the last one
+          (progn
+            (forward-char -2)
+            (delete-char 1)
+            (forward-char 1))
+
+        ;; This special character is preceded by an even (possibly 0)
+        ;; number of \. I add an escape
+        (forward-char -1)
+        (insert "\\")
+        (forward-char 1)))
+
+    ;; convert \s to \s-
+    (goto-char (point-min))
+    (while (re-search-forward (concat regex-tests-re-odd-escapes "[Ss]") nil t)
+      (insert "-"))
+
+    (buffer-string)))
+
+(defun regex-tests-BOOST-frob-escapes (s ispattern)
+  "Mangle \\ the way it is done in frob_escapes() in
+regex-tests-BOOST.c in glibc: \\t, \\n, \\r are interpreted;
+\\\\, \\^, \{, \\|, \} are unescaped for the string (not
+pattern)"
+
+  ;; this is all similar to (regex-tests-unextend)
+  (with-temp-buffer
+    (insert s)
+
+    (let ((interpret-list (list "t" "n" "r")))
+      (while interpret-list
+        (goto-char (point-min))
+        (while (re-search-forward
+                (concat "\\(" regex-tests-re-even-escapes "\\)"
+                        "\\\\" (car interpret-list))
+                nil t)
+          (replace-match (concat "\\1" (car (read-from-string
+                                             (concat "\"\\" (car interpret-list) "\""))))))
+
+        (setq interpret-list (cdr interpret-list))))
+
+    (when (not ispattern)
+      ;; unescape \\, \^, \{, \|, \}
+      (let ((unescape-list (list "\\\\" "^" "{" "|" "}")))
+        (while unescape-list
+          (goto-char (point-min))
+          (while (re-search-forward
+                  (concat "\\(" regex-tests-re-even-escapes "\\)"
+                          "\\\\" (car unescape-list))
+                  nil t)
+            (replace-match (concat "\\1" (car unescape-list))))
+
+          (setq unescape-list (cdr unescape-list))))
+      )
+    (buffer-string)))
+
+
+
+
+(defconst regex-tests-BOOST-whitelist
+  [
+   ;; emacs is more stringent with regexen involving unbalanced )
+   63 65 69
+
+   ;; in emacs, regex . doesn't match \n
+   91
+
+   ;; emacs is more forgiving with * and ? that don't apply to
+   ;; characters
+   107 108 109 122 123 124 140 141 142
+
+   ;; emacs accepts regexen with {}
+   161
+
+   ;; emacs doesn't fail on bogus ranges such as [3-1] or [1-3-5]
+   222 223
+
+   ;; emacs doesn't match (ab*)[ab]*\1 greedily: only 4 chars of
+   ;; ababaaa match
+   284 294
+
+   ;; ambiguous groupings are ambiguous
+   443 444 445 446 448 449 450
+
+   ;; emacs doesn't know how to handle weird ranges such as [a-Z] and
+   ;; [[:alpha:]-a]
+   539 580 581
+
+   ;; emacs matches non-greedy regex ab.*? non-greedily
+   639 677 712
+   ]
+  "Line numbers in the boost test that should be skipped.  These
+are false-positive test failures that represent known/benign
+differences in behavior.")
+
+;; - Format
+;;   - Comments are lines starting with ;
+;;   - Lines starting with - set options passed to regcomp() and regexec():
+;;     - if no "REG_BASIC" is found, with have an extended regex
+;;     - These set a flag:
+;;       - REG_ICASE
+;;       - REG_NEWLINE
+;;       - REG_NOTBOL
+;;       - REG_NOTEOL
+;;
+;;   - Test lines are
+;;     pattern string start0 end0 start1 end1 ...
+;;
+;;   - pattern, string can have escapes
+;;   - string can have whitespace if enclosed in ""
+;;   - if string is "!", then the pattern is supposed to fail compilation
+;;   - start/end are of group0, group1, etc. group 0 is the full match
+;;   - start<0 indicates "no match"
+;;   - start is the 0-based index of the first character
+;;   - end   is the 0-based index of the first character past the group
+(defun regex-tests-BOOST ()
+  (let (failures
+        basic icase newline notbol noteol)
+    (regex-tests-generic-line
+     ?; "regex-resources/BOOST.tests" regex-tests-BOOST-whitelist
+     (if (save-excursion (re-search-forward "^-" nil t))
+         (setq basic   (save-excursion (re-search-forward "REG_BASIC" nil t))
+               icase   (save-excursion (re-search-forward "REG_ICASE" nil t))
+               newline (save-excursion (re-search-forward "REG_NEWLINE" nil t))
+               notbol  (save-excursion (re-search-forward "REG_NOTBOL" nil t))
+               noteol  (save-excursion (re-search-forward "REG_NOTEOL" nil t)))
+
+       (save-excursion
+         (or (re-search-forward "\\(\\S-+\\)\\s-+\"\\(.*\\)\"\\s-+?\\(.+\\)" nil t)
+             (re-search-forward "\\(\\S-+\\)\\s-+\\(\\S-+\\)\\s-+?\\(.+\\)"  nil t)
+             (re-search-forward "\\(\\S-+\\)\\s-+\\(!\\)"                    nil t)))
+
+       (let* ((pattern-raw   (match-string 1))
+              (string-raw    (match-string 2))
+              (positions-raw (match-string 3))
+              (pattern (regex-tests-BOOST-frob-escapes pattern-raw t))
+              (string  (regex-tests-BOOST-frob-escapes string-raw  nil))
+              (positions
+               (if (string= string "!")
+                   (list 'compilation-failed 0)
+                 (mapcar
+                  (lambda (x)
+                    (let ((x (string-to-number x)))
+                      (if (< x 0) nil x)))
+                  (split-string positions-raw)))))
+
+         (when (null (car positions))
+           (setcar positions 'search-failed))
+
+         (when (not basic)
+           (setq pattern (regex-tests-unextend pattern)))
+
+         ;; great. I now have all the data parsed. Let's use it to do
+         ;; stuff
+         (let* ((case-fold-search icase)
+                (msg (regex-tests-match pattern string positions)))
+
+           (if (and
+                ;; Skipping test: notbol/noteol not supported
+                (not notbol) (not noteol)
+
+                msg)
+
+               ;; store failure
+               (setq failures
+                     (cons (format "line number %d: Regex '%s': %s"
+                                   line-number pattern msg)
+                           failures)))))))
+
+    failures))
+
+(defconst regex-tests-PCRE-whitelist
+  [
+   ;; ambiguous groupings are ambiguous
+   610 611 1154 1157 1160 1168 1171 1176 1179 1182 1185 1188 1193 1196 1203
+  ]
+  "Line numbers in the PCRE test that should be skipped.  These
+are false-positive test failures that represent known/benign
+differences in behavior.")
+
+;; - Format
+;;
+;;  regex
+;;  input_string
+;;  group_num: group_match | "No match"
+;;  input_string
+;;  group_num: group_match | "No match"
+;;  input_string
+;;  group_num: group_match | "No match"
+;;  input_string
+;;  group_num: group_match | "No match"
+;;  ...
+(defun regex-tests-PCRE ()
+  (let (failures
+        pattern icase string what-failed matches-observed)
+    (regex-tests-generic-line
+     ?# "regex-resources/PCRE.tests" regex-tests-PCRE-whitelist
+
+     (cond
+
+      ;; pattern
+      ((save-excursion (re-search-forward "^/\\(.*\\)/\\(.*i?\\)$" nil t))
+       (setq icase (string= "i" (match-string 2))
+             pattern (regex-tests-unextend (match-string 1))))
+
+      ;; string. read it in, match against pattern, and save all the results
+      ((save-excursion (re-search-forward "^    \\(.*\\)" nil t))
+       (let ((case-fold-search icase))
+         (setq string (match-string 1)
+
+               ;; the regex match under test
+               what-failed
+               (condition-case nil
+                   (if (string-match pattern string) nil 'search-failed)
+                 ('invalid-regexp 'compilation-failed))
+
+               matches-observed
+               (loop for x from 0 to 20
+                     collect (and (not what-failed)
+                                  (or (match-string x string) "<unset>")))))
+       nil)
+
+      ;; verification line: failed match
+      ((save-excursion (re-search-forward "^No match" nil t))
+       (unless what-failed
+         (setq failures
+               (cons (format "line number %d: Regex '%s': Expected no match; but match"
+                             line-number pattern)
+                     failures))))
+
+      ;; verification line: succeeded match
+      ((save-excursion (re-search-forward "^ *\\([0-9]+\\): \\(.*\\)" nil t))
+       (let* ((match-ref (match-string 2))
+              (idx       (string-to-number (match-string 1))))
+
+         (if what-failed
+             "Expected match; but no match"
+           (unless (string= match-ref (elt matches-observed idx))
+             (setq failures
+                   (cons (format "line number %d: Regex '%s': Have expected match, but group %d is wrong: '%s'/'%s'"
+                                 line-number pattern
+                                 idx match-ref (elt matches-observed idx))
+                         failures))))))
+
+      ;; reset
+      (t (setq pattern nil) nil)))
+
+    failures))
+
+(defconst regex-tests-PTESTS-whitelist
+  [
+   ;; emacs doesn't barf on weird ranges such as [b-a], but simply
+   ;; fails to match
+   138
+
+   ;; emacs doesn't see DEL (0x78) as a [:cntrl:] character
+   168
+  ]
+  "Line numbers in the PTESTS test that should be skipped.  These
+are false-positive test failures that represent known/benign
+differences in behavior.")
+
+;; - Format
+;;   - fields separated by ¦ (note: this is not a |)
+;;   - start¦end¦pattern¦string
+;;   - start is the 1-based index of the first character
+;;   - end   is the 1-based index of the last  character
+(defun regex-tests-PTESTS ()
+  (let (failures)
+    (regex-tests-generic-line
+     ?# "regex-resources/PTESTS" regex-tests-PTESTS-whitelist
+     (let* ((fields (split-string (buffer-string) "¦"))
+
+            ;; string has 1-based index of first char in the
+            ;; match. -1 means "no match". -2 means "invalid
+            ;; regex".
+            ;;
+            ;; start-ref is 0-based index of first char in the
+            ;; match
+            ;;
+            ;; string==0 is a special case, and I have to treat
+            ;; it as start-ref = 0
+            (start-ref (let ((raw (string-to-number (elt fields 0))))
+                         (cond
+                          ((= raw -2) 'compilation-failed)
+                          ((= raw -1) 'search-failed)
+                          ((= raw  0) 0)
+                          (t          (1- raw)))))
+
+            ;; string has 1-based index of last char in the
+            ;; match. end-ref is 0-based index of first char past
+            ;; the match
+            (end-ref   (string-to-number (elt fields 1)))
+            (pattern   (elt fields 2))
+            (string    (elt fields 3)))
+
+       (let ((msg (regex-tests-match pattern string (list start-ref end-ref))))
+         (when msg
+           (setq failures
+                 (cons (format "line number %d: Regex '%s': %s"
+                               line-number pattern msg)
+                       failures))))))
+    failures))
+
+(defconst regex-tests-TESTS-whitelist
+  [
+   ;; emacs doesn't barf on weird ranges such as [b-a], but simply
+   ;; fails to match
+   42
+
+   ;; emacs is more forgiving with * and ? that don't apply to
+   ;; characters
+   57 58 59 60
+
+   ;; emacs is more stringent with regexen involving unbalanced )
+   67
+  ]
+  "Line numbers in the TESTS test that should be skipped.  These
+are false-positive test failures that represent known/benign
+differences in behavior.")
+
+;; - Format
+;;   - fields separated by :. Watch for [\[:xxx:]]
+;;   - expected:pattern:string
+;;
+;;   expected:
+;;   | 0 | successful match      |
+;;   | 1 | failed match          |
+;;   | 2 | regcomp() should fail |
+(defun regex-tests-TESTS ()
+  (let (failures)
+    (regex-tests-generic-line
+     ?# "regex-resources/TESTS" regex-tests-TESTS-whitelist
+     (if (save-excursion (re-search-forward "^\\([^:]+\\):\\(.*\\):\\([^:]*\\)$" nil t))
+         (let* ((what-failed
+                 (let ((raw (string-to-number (match-string 1))))
+                   (cond
+                    ((= raw 2) 'compilation-failed)
+                    ((= raw 1) 'search-failed)
+                    (t         t))))
+                (string  (match-string 3))
+                (pattern (regex-tests-unextend (match-string 2))))
+
+           (let ((msg (regex-tests-match pattern string nil (list what-failed))))
+             (when msg
+               (setq failures
+                     (cons (format "line number %d: Regex '%s': %s"
+                                   line-number pattern msg)
+                           failures)))))
+
+       (error "Error parsing TESTS file line: '%s'" (buffer-string))))
+    failures))
+
+(ert-deftest regex-tests ()
+  "Tests of the regular expression engine.  This evaluates the
+BOOST, PCRE, PTESTS and TESTS test cases from glibc."
+  (should-not (regex-tests-BOOST))
+  (should-not (regex-tests-PCRE))
+  (should-not (regex-tests-PTESTS))
+  (should-not (regex-tests-TESTS)))
+
 ;;; regex-tests.el ends here
-- 
2.8.0.rc3.226.g39d4020






^ permalink raw reply related	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH 3/7] Fix reading of regex-resources in regex-tests
  2016-07-27 16:50   ` bug#24071: [PATCH 1/7] New regex tests imported from glibc 2.21 Michal Nazarewicz
  2016-07-27 16:50     ` bug#24071: [PATCH 2/7] Added driver for the regex tests Michal Nazarewicz
@ 2016-07-27 16:50     ` Michal Nazarewicz
  2016-07-27 16:50     ` bug#24071: [PATCH 4/7] Don’t (require 'cl) Michal Nazarewicz
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Michal Nazarewicz @ 2016-07-27 16:50 UTC (permalink / raw)
  To: 24071

* test/src/regex-tests.el (regex-tests-generic-line): Referring to
‘buffer-file-name’ does not work when running the test from command
line, i.e. via make, which results in (wrong-type-argument stringp nil)
failures.  Replace it with hard-coded path.
(regex-tests-BOOST, regex-tests-PCRE, regex-tests-PTESTS-whitelist,
regex-tests-TESTS-whitelist): ‘regex-tests-generic-line’ now  includes
the ‘regex-resources’ path component so the tests don’t need to specify
it explicitly.
---
 test/src/regex-tests.el | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
index 13a9f86..1407441 100644
--- a/test/src/regex-tests.el
+++ b/test/src/regex-tests.el
@@ -99,8 +99,7 @@ regex-tests-generic-line
 
   `(with-temp-buffer
     (modify-syntax-entry ?_ "w;; ") ; tests expect _ to be a word
-    (insert-file-contents ,(concat (file-name-directory (buffer-file-name)) test-file))
-
+    (insert-file-contents ,(concat "src/regex-resources/" test-file))
     (let ((case-fold-search nil)
           (line-number 1)
           (whitelist-idx 0))
@@ -419,7 +418,7 @@ regex-tests-BOOST
   (let (failures
         basic icase newline notbol noteol)
     (regex-tests-generic-line
-     ?; "regex-resources/BOOST.tests" regex-tests-BOOST-whitelist
+     ?; "BOOST.tests" regex-tests-BOOST-whitelist
      (if (save-excursion (re-search-forward "^-" nil t))
          (setq basic   (save-excursion (re-search-forward "REG_BASIC" nil t))
                icase   (save-excursion (re-search-forward "REG_ICASE" nil t))
@@ -496,7 +495,7 @@ regex-tests-PCRE
   (let (failures
         pattern icase string what-failed matches-observed)
     (regex-tests-generic-line
-     ?# "regex-resources/PCRE.tests" regex-tests-PCRE-whitelist
+     ?# "PCRE.tests" regex-tests-PCRE-whitelist
 
      (cond
 
@@ -570,7 +569,7 @@ regex-tests-PTESTS-whitelist
 (defun regex-tests-PTESTS ()
   (let (failures)
     (regex-tests-generic-line
-     ?# "regex-resources/PTESTS" regex-tests-PTESTS-whitelist
+     ?# "PTESTS" regex-tests-PTESTS-whitelist
      (let* ((fields (split-string (buffer-string) "¦"))
 
             ;; string has 1-based index of first char in the
@@ -632,7 +631,7 @@ regex-tests-TESTS-whitelist
 (defun regex-tests-TESTS ()
   (let (failures)
     (regex-tests-generic-line
-     ?# "regex-resources/TESTS" regex-tests-TESTS-whitelist
+     ?# "TESTS" regex-tests-TESTS-whitelist
      (if (save-excursion (re-search-forward "^\\([^:]+\\):\\(.*\\):\\([^:]*\\)$" nil t))
          (let* ((what-failed
                  (let ((raw (string-to-number (match-string 1))))
-- 
2.8.0.rc3.226.g39d4020






^ permalink raw reply related	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH 4/7] Don’t (require 'cl)
  2016-07-27 16:50   ` bug#24071: [PATCH 1/7] New regex tests imported from glibc 2.21 Michal Nazarewicz
  2016-07-27 16:50     ` bug#24071: [PATCH 2/7] Added driver for the regex tests Michal Nazarewicz
  2016-07-27 16:50     ` bug#24071: [PATCH 3/7] Fix reading of regex-resources in regex-tests Michal Nazarewicz
@ 2016-07-27 16:50     ` Michal Nazarewicz
  2016-07-27 16:50     ` bug#24071: [PATCH 5/7] Split regex glibc test cases into separet tests Michal Nazarewicz
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Michal Nazarewicz @ 2016-07-27 16:50 UTC (permalink / raw)
  To: 24071

* test/src/regex-test.el: Don’t (require 'cl).
(regex-tests-PCRE): s/loop/cl-loop/
---
 test/src/regex-tests.el | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
index 1407441..97b9633 100644
--- a/test/src/regex-tests.el
+++ b/test/src/regex-tests.el
@@ -20,7 +20,6 @@
 ;;; Code:
 
 (require 'ert)
-(require 'cl)
 
 (ert-deftest regex-word-cc-fallback-test ()
   "Test that ‘[[:cc:]]*x’ matches ‘x’ (bug#24020).
@@ -516,9 +515,9 @@ regex-tests-PCRE
                  ('invalid-regexp 'compilation-failed))
 
                matches-observed
-               (loop for x from 0 to 20
-                     collect (and (not what-failed)
-                                  (or (match-string x string) "<unset>")))))
+               (cl-loop for x from 0 to 20
+                        collect (and (not what-failed)
+                                     (or (match-string x string) "<unset>")))))
        nil)
 
       ;; verification line: failed match
-- 
2.8.0.rc3.226.g39d4020






^ permalink raw reply related	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH 5/7] Split regex glibc test cases into separet tests
  2016-07-27 16:50   ` bug#24071: [PATCH 1/7] New regex tests imported from glibc 2.21 Michal Nazarewicz
                       ` (2 preceding siblings ...)
  2016-07-27 16:50     ` bug#24071: [PATCH 4/7] Don’t (require 'cl) Michal Nazarewicz
@ 2016-07-27 16:50     ` Michal Nazarewicz
  2016-07-27 16:50     ` bug#24071: [PATCH 6/7] Remove dead opcodes in regex bytecode Michal Nazarewicz
  2016-07-27 16:50     ` bug#24071: [PATCH 7/7] Refactor regex character class parsing in [:name:] Michal Nazarewicz
  5 siblings, 0 replies; 18+ messages in thread
From: Michal Nazarewicz @ 2016-07-27 16:50 UTC (permalink / raw)
  To: 24071

* test/src/regex-tests.el (regex-tests): Remove and split into multiple
tests cases.
(regex-tests-glbic-BOOST, regex-tests-glibc-BOOST,
regex-tests-glibc-PTESTS, regex-tests-glibc-TESTS): New test cases split
from ‘regex-tests’.
---
 test/src/regex-tests.el | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
index 97b9633..8981267 100644
--- a/test/src/regex-tests.el
+++ b/test/src/regex-tests.el
@@ -651,12 +651,24 @@ regex-tests-TESTS
        (error "Error parsing TESTS file line: '%s'" (buffer-string))))
     failures))
 
-(ert-deftest regex-tests ()
-  "Tests of the regular expression engine.  This evaluates the
-BOOST, PCRE, PTESTS and TESTS test cases from glibc."
-  (should-not (regex-tests-BOOST))
-  (should-not (regex-tests-PCRE))
-  (should-not (regex-tests-PTESTS))
+(ert-deftest regex-tests-BOOST ()
+  "Tests of the regular expression engine.
+This evaluates the BOOST test cases from glibc."
+  (should-not (regex-tests-BOOST)))
+
+(ert-deftest regex-tests-BOOST ()
+  "Tests of the regular expression engine.
+This evaluates the PCRE test cases from glibc."
+  (should-not (regex-tests-PCRE)))
+
+(ert-deftest regex-tests-PTESTS ()
+  "Tests of the regular expression engine.
+This evaluates the PTESTS test cases from glibc."
+  (should-not (regex-tests-PTESTS)))
+
+(ert-deftest regex-tests-TESTS ()
+  "Tests of the regular expression engine.
+This evaluates the TESTS test cases from glibc."
   (should-not (regex-tests-TESTS)))
 
 ;;; regex-tests.el ends here
-- 
2.8.0.rc3.226.g39d4020






^ permalink raw reply related	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH 6/7] Remove dead opcodes in regex bytecode
  2016-07-27 16:50   ` bug#24071: [PATCH 1/7] New regex tests imported from glibc 2.21 Michal Nazarewicz
                       ` (3 preceding siblings ...)
  2016-07-27 16:50     ` bug#24071: [PATCH 5/7] Split regex glibc test cases into separet tests Michal Nazarewicz
@ 2016-07-27 16:50     ` Michal Nazarewicz
  2016-07-27 16:50     ` bug#24071: [PATCH 7/7] Refactor regex character class parsing in [:name:] Michal Nazarewicz
  5 siblings, 0 replies; 18+ messages in thread
From: Michal Nazarewicz @ 2016-07-27 16:50 UTC (permalink / raw)
  To: 24071

There is no way to specify before_dot and after_dot opcodes in a regex
so code handling those ends up being dead.  Remove it.

* src/regex.c (print_partial_compiled_pattern, regex_compile,
analyze_first, re_match_2_internal): Remove handling and references to
before_dot and after_dot opcodes.
---
 src/regex.c | 28 +---------------------------
 1 file changed, 1 insertion(+), 27 deletions(-)

diff --git a/src/regex.c b/src/regex.c
index 1f2a1f08..cca00e9 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -669,9 +669,7 @@ typedef enum
   notsyntaxspec
 
 #ifdef emacs
-  ,before_dot,	/* Succeeds if before point.  */
-  at_dot,	/* Succeeds if at point.  */
-  after_dot,	/* Succeeds if after point.  */
+  , at_dot,	/* Succeeds if at point.  */
 
   /* Matches any character whose category-set contains the specified
      category.  The operator is followed by a byte which contains a
@@ -1053,18 +1051,10 @@ print_partial_compiled_pattern (re_char *start, re_char *end)
 	  break;
 
 # ifdef emacs
-	case before_dot:
-	  fprintf (stderr, "/before_dot");
-	  break;
-
 	case at_dot:
 	  fprintf (stderr, "/at_dot");
 	  break;
 
-	case after_dot:
-	  fprintf (stderr, "/after_dot");
-	  break;
-
 	case categoryspec:
 	  fprintf (stderr, "/categoryspec");
 	  mcnt = *p++;
@@ -3422,8 +3412,6 @@ regex_compile (const_re_char *pattern, size_t size, reg_syntax_t syntax,
 		 goto normal_char;
 
 #ifdef emacs
-	    /* There is no way to specify the before_dot and after_dot
-	       operators.  rms says this is ok.  --karl  */
 	    case '=':
 	      laststart = b;
 	      BUF_PUSH (at_dot);
@@ -4000,9 +3988,7 @@ analyze_first (const_re_char *p, const_re_char *pend, char *fastmap,
       /* All cases after this match the empty string.  These end with
 	 `continue'.  */
 
-	case before_dot:
 	case at_dot:
-	case after_dot:
 #endif /* !emacs */
 	case no_op:
 	case begline:
@@ -6150,24 +6136,12 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	  break;
 
 #ifdef emacs
-	case before_dot:
-	  DEBUG_PRINT ("EXECUTING before_dot.\n");
-	  if (PTR_BYTE_POS (d) >= PT_BYTE)
-	    goto fail;
-	  break;
-
 	case at_dot:
 	  DEBUG_PRINT ("EXECUTING at_dot.\n");
 	  if (PTR_BYTE_POS (d) != PT_BYTE)
 	    goto fail;
 	  break;
 
-	case after_dot:
-	  DEBUG_PRINT ("EXECUTING after_dot.\n");
-	  if (PTR_BYTE_POS (d) <= PT_BYTE)
-	    goto fail;
-	  break;
-
 	case categoryspec:
 	case notcategoryspec:
 	  {
-- 
2.8.0.rc3.226.g39d4020






^ permalink raw reply related	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH 7/7] Refactor regex character class parsing in [:name:]
  2016-07-27 16:50   ` bug#24071: [PATCH 1/7] New regex tests imported from glibc 2.21 Michal Nazarewicz
                       ` (4 preceding siblings ...)
  2016-07-27 16:50     ` bug#24071: [PATCH 6/7] Remove dead opcodes in regex bytecode Michal Nazarewicz
@ 2016-07-27 16:50     ` Michal Nazarewicz
  5 siblings, 0 replies; 18+ messages in thread
From: Michal Nazarewicz @ 2016-07-27 16:50 UTC (permalink / raw)
  To: 24071

re_wctype function is used in three separate places and in all of
those places almost exact code extracting the name from [:name:]
surrounds it.  Furthermore, re_wctype requires a NUL-terminated
string, so the name of the character class is copied to a temporary
buffer.

The code duplication and unnecessary memory copying can be avoided by
pushing the responsibility of parsing the whole [:name:] sequence to
the function.

Furthermore, since now the function has access to the length of the
character class name (since it’s doing the parsing), it can take
advantage of that information in skipping some string comparisons and
using a constant-length memcmp instead of strcmp which needs to take
care of NUL bytes.

* src/regex.c (re_wctype): Delete function.  Replace it with:
(re_wctype_parse): New function which parses a whole [:name:] string
and returns a RECC_* constant or -1 if the string is not of [:name:]
format.
(regex_compile): Use re_wctype_parse.
* src/syntax.c (skip_chars): Use re_wctype_parse.
---
 src/regex.c  | 312 +++++++++++++++++++++++++++++------------------------------
 src/regex.h  |  14 +--
 src/syntax.c |  96 +++++-------------
 3 files changed, 183 insertions(+), 239 deletions(-)

diff --git a/src/regex.c b/src/regex.c
index cca00e9..d482316 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -1959,29 +1959,98 @@ struct range_table_work_area
 \f
 #if ! WIDE_CHAR_SUPPORT
 
-/* Map a string to the char class it names (if any).  */
+/* Parse a character class, i.e. string such as "[:name:]".  *strp
+   points to the string to be parsed and limit is length, in bytes, of
+   that string.
+
+   If *strp point to a string that begins with "[:name:]", where name is
+   a non-empty sequence of lower case letters, *strp will be advanced past the
+   closing square bracket and RECC_* constant which maps to the name will be
+   returned.  If name is not a valid character class name zero, or RECC_ERROR,
+   is returned.
+
+   Otherwise, if *strp doesn’t begin with "[:name:]", -1 is returned.
+
+   The function can be used on ASCII and multibyte (UTF-8-encoded) strings.
+ */
 re_wctype_t
-re_wctype (const_re_char *str)
+re_wctype_parse (const unsigned char **strp, unsigned limit)
 {
-  const char *string = (const char *) str;
-  if      (STREQ (string, "alnum"))	return RECC_ALNUM;
-  else if (STREQ (string, "alpha"))	return RECC_ALPHA;
-  else if (STREQ (string, "word"))	return RECC_WORD;
-  else if (STREQ (string, "ascii"))	return RECC_ASCII;
-  else if (STREQ (string, "nonascii"))	return RECC_NONASCII;
-  else if (STREQ (string, "graph"))	return RECC_GRAPH;
-  else if (STREQ (string, "lower"))	return RECC_LOWER;
-  else if (STREQ (string, "print"))	return RECC_PRINT;
-  else if (STREQ (string, "punct"))	return RECC_PUNCT;
-  else if (STREQ (string, "space"))	return RECC_SPACE;
-  else if (STREQ (string, "upper"))	return RECC_UPPER;
-  else if (STREQ (string, "unibyte"))	return RECC_UNIBYTE;
-  else if (STREQ (string, "multibyte"))	return RECC_MULTIBYTE;
-  else if (STREQ (string, "digit"))	return RECC_DIGIT;
-  else if (STREQ (string, "xdigit"))	return RECC_XDIGIT;
-  else if (STREQ (string, "cntrl"))	return RECC_CNTRL;
-  else if (STREQ (string, "blank"))	return RECC_BLANK;
-  else return 0;
+  const char *beg = (const char *)*strp, *end, *it;
+
+  if (limit < 5 || beg[0] != '[' || beg[1] != ':')
+    return -1;
+
+  end = beg + limit - 2;  /* ‘- 2’ for the closing ":]" */
+  beg += 2;  /* skip opening "[:" */
+  it = beg;
+  while (it != end && *it >= 'a' && *it <= 'z')
+    ++it;
+  if (it[0] != ':' || it[1] != ']')
+    return -1;
+
+  *strp = (const unsigned char *)(it + 2);
+
+  /* Sort tests in the length=five case by frequency the classes to minimise
+     number of times we fail the comparison.  The frequencies of character class
+     names used in Emacs sources as of 2016-07-27:
+
+     $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + |
+           sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr
+         213 [:alnum:]
+         104 [:alpha:]
+          62 [:space:]
+          39 [:digit:]
+          36 [:blank:]
+          26 [:word:]
+          26 [:upper:]
+          21 [:lower:]
+          10 [:xdigit:]
+          10 [:punct:]
+          10 [:ascii:]
+           4 [:nonascii:]
+           4 [:graph:]
+           2 [:print:]
+           2 [:cntrl:]
+           1 [:ff:]
+
+     If you update this list, consider also updating chain of or’ed conditions
+     in execute_charset function.
+   */
+
+  switch (it - beg) {
+  case 4:
+    if (!memcmp (beg, "word", 4))      return RECC_WORD;
+    break;
+  case 5:
+    if (!memcmp (beg, "alnum", 5))     return RECC_ALNUM;
+    if (!memcmp (beg, "alpha", 5))     return RECC_ALPHA;
+    if (!memcmp (beg, "space", 5))     return RECC_SPACE;
+    if (!memcmp (beg, "digit", 5))     return RECC_DIGIT;
+    if (!memcmp (beg, "blank", 5))     return RECC_BLANK;
+    if (!memcmp (beg, "upper", 5))     return RECC_UPPER;
+    if (!memcmp (beg, "lower", 5))     return RECC_LOWER;
+    if (!memcmp (beg, "punct", 5))     return RECC_PUNCT;
+    if (!memcmp (beg, "ascii", 5))     return RECC_ASCII;
+    if (!memcmp (beg, "graph", 5))     return RECC_GRAPH;
+    if (!memcmp (beg, "print", 5))     return RECC_PRINT;
+    if (!memcmp (beg, "cntrl", 5))     return RECC_CNTRL;
+    break;
+  case 6:
+    if (!memcmp (beg, "xdigit", 6))    return RECC_XDIGIT;
+    break;
+  case 7:
+    if (!memcmp (beg, "unibyte", 7))   return RECC_UNIBYTE;
+    break;
+  case 8:
+    if (!memcmp (beg, "nonascii", 8))  return RECC_NONASCII;
+    break;
+  case 9:
+    if (!memcmp (beg, "multibyte", 9)) return RECC_MULTIBYTE;
+    break;
+  }
+
+  return RECC_ERROR;
 }
 
 /* True if CH is in the char class CC.  */
@@ -2766,10 +2835,74 @@ regex_compile (const_re_char *pattern, size_t size, reg_syntax_t syntax,
 	      {
 		boolean escaped_char = false;
 		const unsigned char *p2 = p;
+		re_wctype_t cc;
 		re_wchar_t ch;
 
 		if (p == pend) FREE_STACK_RETURN (REG_EBRACK);
 
+		/* See if we're at the beginning of a possible character
+		   class.  */
+		if (syntax & RE_CHAR_CLASSES &&
+		    (cc = re_wctype_parse(&p, pend - p)) != -1)
+		  {
+		    if (cc == 0)
+		      FREE_STACK_RETURN (REG_ECTYPE);
+
+		    if (p == pend)
+		      FREE_STACK_RETURN (REG_EBRACK);
+
+#ifndef emacs
+		    for (ch = 0; ch < (1 << BYTEWIDTH); ++ch)
+		      if (re_iswctype (btowc (ch), cc))
+			{
+			  c = TRANSLATE (ch);
+			  if (c < (1 << BYTEWIDTH))
+			    SET_LIST_BIT (c);
+			}
+#else  /* emacs */
+		    /* Most character classes in a multibyte match just set
+		       a flag.  Exceptions are is_blank, is_digit, is_cntrl, and
+		       is_xdigit, since they can only match ASCII characters.
+		       We don't need to handle them for multibyte.  They are
+		       distinguished by a negative wctype.  */
+
+		    /* Setup the gl_state object to its buffer-defined value.
+		       This hardcodes the buffer-global syntax-table for ASCII
+		       chars, while the other chars will obey syntax-table
+		       properties.  It's not ideal, but it's the way it's been
+		       done until now.  */
+		    SETUP_BUFFER_SYNTAX_TABLE ();
+
+		    for (ch = 0; ch < 256; ++ch)
+		      {
+			c = RE_CHAR_TO_MULTIBYTE (ch);
+			if (! CHAR_BYTE8_P (c)
+			    && re_iswctype (c, cc))
+			  {
+			    SET_LIST_BIT (ch);
+			    c1 = TRANSLATE (c);
+			    if (c1 == c)
+			      continue;
+			    if (ASCII_CHAR_P (c1))
+			      SET_LIST_BIT (c1);
+			    else if ((c1 = RE_CHAR_TO_UNIBYTE (c1)) >= 0)
+			      SET_LIST_BIT (c1);
+			  }
+		      }
+		    SET_RANGE_TABLE_WORK_AREA_BIT
+		      (range_table_work, re_wctype_to_bit (cc));
+#endif	/* emacs */
+		    /* In most cases the matching rule for char classes only
+		       uses the syntax table for multibyte chars, so that the
+		       content of the syntax-table is not hardcoded in the
+		       range_table.  SPACE and WORD are the two exceptions.  */
+		    if ((1 << cc) & ((1 << RECC_SPACE) | (1 << RECC_WORD)))
+		      bufp->used_syntax = 1;
+
+		    /* Repeat the loop. */
+		    continue;
+		  }
+
 		/* Don't translate yet.  The range TRANSLATE(X..Y) cannot
 		   always be determined from TRANSLATE(X) and TRANSLATE(Y)
 		   So the translation is done later in a loop.  Example:
@@ -2793,119 +2926,6 @@ regex_compile (const_re_char *pattern, size_t size, reg_syntax_t syntax,
 		      break;
 		  }
 
-		/* See if we're at the beginning of a possible character
-		   class.  */
-
-		if (!escaped_char &&
-		    syntax & RE_CHAR_CLASSES && c == '[' && *p == ':')
-		  {
-		    /* Leave room for the null.  */
-		    unsigned char str[CHAR_CLASS_MAX_LENGTH + 1];
-		    const unsigned char *class_beg;
-
-		    PATFETCH (c);
-		    c1 = 0;
-		    class_beg = p;
-
-		    /* If pattern is `[[:'.  */
-		    if (p == pend) FREE_STACK_RETURN (REG_EBRACK);
-
-		    for (;;)
-		      {
-		        PATFETCH (c);
-		        if ((c == ':' && *p == ']') || p == pend)
-		          break;
-			if (c1 < CHAR_CLASS_MAX_LENGTH)
-			  str[c1++] = c;
-			else
-			  /* This is in any case an invalid class name.  */
-			  str[0] = '\0';
-		      }
-		    str[c1] = '\0';
-
-		    /* If isn't a word bracketed by `[:' and `:]':
-		       undo the ending character, the letters, and
-		       leave the leading `:' and `[' (but set bits for
-		       them).  */
-		    if (c == ':' && *p == ']')
-		      {
-			re_wctype_t cc = re_wctype (str);
-
-			if (cc == 0)
-			  FREE_STACK_RETURN (REG_ECTYPE);
-
-                        /* Throw away the ] at the end of the character
-                           class.  */
-                        PATFETCH (c);
-
-                        if (p == pend) FREE_STACK_RETURN (REG_EBRACK);
-
-#ifndef emacs
-			for (ch = 0; ch < (1 << BYTEWIDTH); ++ch)
-			  if (re_iswctype (btowc (ch), cc))
-			    {
-			      c = TRANSLATE (ch);
-			      if (c < (1 << BYTEWIDTH))
-				SET_LIST_BIT (c);
-			    }
-#else  /* emacs */
-			/* Most character classes in a multibyte match
-			   just set a flag.  Exceptions are is_blank,
-			   is_digit, is_cntrl, and is_xdigit, since
-			   they can only match ASCII characters.  We
-			   don't need to handle them for multibyte.
-			   They are distinguished by a negative wctype.  */
-
-			/* Setup the gl_state object to its buffer-defined
-			   value.  This hardcodes the buffer-global
-			   syntax-table for ASCII chars, while the other chars
-			   will obey syntax-table properties.  It's not ideal,
-			   but it's the way it's been done until now.  */
-			SETUP_BUFFER_SYNTAX_TABLE ();
-
-			for (ch = 0; ch < 256; ++ch)
-			  {
-			    c = RE_CHAR_TO_MULTIBYTE (ch);
-			    if (! CHAR_BYTE8_P (c)
-				&& re_iswctype (c, cc))
-			      {
-				SET_LIST_BIT (ch);
-				c1 = TRANSLATE (c);
-				if (c1 == c)
-				  continue;
-				if (ASCII_CHAR_P (c1))
-				  SET_LIST_BIT (c1);
-				else if ((c1 = RE_CHAR_TO_UNIBYTE (c1)) >= 0)
-				  SET_LIST_BIT (c1);
-			      }
-			  }
-			SET_RANGE_TABLE_WORK_AREA_BIT
-			  (range_table_work, re_wctype_to_bit (cc));
-#endif	/* emacs */
-			/* In most cases the matching rule for char classes
-			   only uses the syntax table for multibyte chars,
-			   so that the content of the syntax-table is not
-			   hardcoded in the range_table.  SPACE and WORD are
-			   the two exceptions.  */
-			if ((1 << cc) & ((1 << RECC_SPACE) | (1 << RECC_WORD)))
-			  bufp->used_syntax = 1;
-
-			/* Repeat the loop. */
-			continue;
-		      }
-		    else
-		      {
-			/* Go back to right after the "[:".  */
-			p = class_beg;
-			SET_LIST_BIT ('[');
-
-			/* Because the `:' may start the range, we
-			   can't simply set bit and repeat the loop.
-			   Instead, just set it to C and handle below.  */
-			c = ':';
-		      }
-		  }
-
 		if (p < pend && p[0] == '-' && p[1] != ']')
 		  {
 
@@ -4645,28 +4665,8 @@ execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte)
       re_wchar_t range_start, range_end;
 
   /* Sort tests by the most commonly used classes with some adjustment to which
-     tests are easiest to perform.  Frequencies of character class names used in
-     Emacs sources as of 2016-07-15:
-
-     $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + |
-           sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr
-         213 [:alnum:]
-         104 [:alpha:]
-          62 [:space:]
-          39 [:digit:]
-          36 [:blank:]
-          26 [:upper:]
-          24 [:word:]
-          21 [:lower:]
-          10 [:punct:]
-          10 [:ascii:]
-           9 [:xdigit:]
-           4 [:nonascii:]
-           4 [:graph:]
-           2 [:print:]
-           2 [:cntrl:]
-           1 [:ff:]
-   */
+     tests are easiest to perform.  Take a look at comment in re_wctype_parse
+     for table with frequencies of character class names. */
 
       if ((class_bits & BIT_MULTIBYTE) ||
 	  (class_bits & BIT_ALNUM && ISALNUM (c)) ||
diff --git a/src/regex.h b/src/regex.h
index 817167a..01b659a 100644
--- a/src/regex.h
+++ b/src/regex.h
@@ -585,25 +585,13 @@ extern void regfree (regex_t *__preg);
 /* Solaris 2.5 has a bug: <wchar.h> must be included before <wctype.h>.  */
 # include <wchar.h>
 # include <wctype.h>
-#endif
 
-#if WIDE_CHAR_SUPPORT
-/* The GNU C library provides support for user-defined character classes
-   and the functions from ISO C amendment 1.  */
-# ifdef CHARCLASS_NAME_MAX
-#  define CHAR_CLASS_MAX_LENGTH CHARCLASS_NAME_MAX
-# else
-/* This shouldn't happen but some implementation might still have this
-   problem.  Use a reasonable default value.  */
-#  define CHAR_CLASS_MAX_LENGTH 256
-# endif
 typedef wctype_t re_wctype_t;
 typedef wchar_t re_wchar_t;
 # define re_wctype wctype
 # define re_iswctype iswctype
 # define re_wctype_to_bit(cc) 0
 #else
-# define CHAR_CLASS_MAX_LENGTH  9 /* Namely, `multibyte'.  */
 # ifndef emacs
 #  define btowc(c) c
 # endif
@@ -621,7 +609,7 @@ typedef enum { RECC_ERROR = 0,
 } re_wctype_t;
 
 extern char re_iswctype (int ch,    re_wctype_t cc);
-extern re_wctype_t re_wctype (const unsigned char* str);
+extern re_wctype_t re_wctype_parse (const unsigned char **strp, unsigned limit);
 
 typedef int re_wchar_t;
 
diff --git a/src/syntax.c b/src/syntax.c
index f8d987b..667de40 100644
--- a/src/syntax.c
+++ b/src/syntax.c
@@ -1691,44 +1691,22 @@ skip_chars (bool forwardp, Lisp_Object string, Lisp_Object lim,
       /* At first setup fastmap.  */
       while (i_byte < size_byte)
 	{
-	  c = str[i_byte++];
-
-	  if (handle_iso_classes && c == '['
-	      && i_byte < size_byte
-	      && str[i_byte] == ':')
+	  if (handle_iso_classes)
 	    {
-	      const unsigned char *class_beg = str + i_byte + 1;
-	      const unsigned char *class_end = class_beg;
-	      const unsigned char *class_limit = str + size_byte - 2;
-	      /* Leave room for the null.  */
-	      unsigned char class_name[CHAR_CLASS_MAX_LENGTH + 1];
-	      re_wctype_t cc;
-
-	      if (class_limit - class_beg > CHAR_CLASS_MAX_LENGTH)
-		class_limit = class_beg + CHAR_CLASS_MAX_LENGTH;
-
-	      while (class_end < class_limit
-		     && *class_end >= 'a' && *class_end <= 'z')
-		class_end++;
-
-	      if (class_end == class_beg
-		  || *class_end != ':' || class_end[1] != ']')
-		goto not_a_class_name;
-
-	      memcpy (class_name, class_beg, class_end - class_beg);
-	      class_name[class_end - class_beg] = 0;
-
-	      cc = re_wctype (class_name);
+	      const unsigned char *ch = str + i_byte;
+	      re_wctype_t cc = re_wctype_parse (&ch, size_byte - i_byte);
 	      if (cc == 0)
 		error ("Invalid ISO C character class");
-
-	      iso_classes = Fcons (make_number (cc), iso_classes);
-
-	      i_byte = class_end + 2 - str;
-	      continue;
+	      if (cc != -1)
+		{
+		  iso_classes = Fcons (make_number (cc), iso_classes);
+		  i_byte = ch - str;
+		  continue;
+		}
 	    }
 
-	not_a_class_name:
+	  c = str[i_byte++];
+
 	  if (c == '\\')
 	    {
 	      if (i_byte == size_byte)
@@ -1808,54 +1786,32 @@ skip_chars (bool forwardp, Lisp_Object string, Lisp_Object lim,
       while (i_byte < size_byte)
 	{
 	  int leading_code = str[i_byte];
-	  c = STRING_CHAR_AND_LENGTH (str + i_byte, len);
-	  i_byte += len;
 
-	  if (handle_iso_classes && c == '['
-	      && i_byte < size_byte
-	      && STRING_CHAR (str + i_byte) == ':')
+	  if (handle_iso_classes)
 	    {
-	      const unsigned char *class_beg = str + i_byte + 1;
-	      const unsigned char *class_end = class_beg;
-	      const unsigned char *class_limit = str + size_byte - 2;
-	      /* Leave room for the null.	 */
-	      unsigned char class_name[CHAR_CLASS_MAX_LENGTH + 1];
-	      re_wctype_t cc;
-
-	      if (class_limit - class_beg > CHAR_CLASS_MAX_LENGTH)
-		class_limit = class_beg + CHAR_CLASS_MAX_LENGTH;
-
-	      while (class_end < class_limit
-		     && *class_end >= 'a' && *class_end <= 'z')
-		class_end++;
-
-	      if (class_end == class_beg
-		  || *class_end != ':' || class_end[1] != ']')
-		goto not_a_class_name_multibyte;
-
-	      memcpy (class_name, class_beg, class_end - class_beg);
-	      class_name[class_end - class_beg] = 0;
-
-	      cc = re_wctype (class_name);
+	      const unsigned char *ch = str + i_byte;
+	      re_wctype_t cc = re_wctype_parse (&ch, size_byte - i_byte);
 	      if (cc == 0)
 		error ("Invalid ISO C character class");
-
-	      iso_classes = Fcons (make_number (cc), iso_classes);
-
-	      i_byte = class_end + 2 - str;
-	      continue;
+	      if (cc != -1)
+		{
+		  iso_classes = Fcons (make_number (cc), iso_classes);
+		  i_byte = ch - str;
+		  continue;
+		}
 	    }
 
-	not_a_class_name_multibyte:
-	  if (c == '\\')
+	  if (leading_code== '\\')
 	    {
-	      if (i_byte == size_byte)
+	      if (++i_byte == size_byte)
 		break;
 
 	      leading_code = str[i_byte];
-	      c = STRING_CHAR_AND_LENGTH (str + i_byte, len);
-	      i_byte += len;
 	    }
+	  c = STRING_CHAR_AND_LENGTH (str + i_byte, len);
+	  i_byte += len;
+
+
 	  /* Treat `-' as range character only if another character
 	     follows.  */
 	  if (i_byte + 1 < size_byte
-- 
2.8.0.rc3.226.g39d4020






^ permalink raw reply related	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH] Refactor regex character class parsing in [:name:]
  2016-07-27 16:28     ` Eli Zaretskii
@ 2016-07-27 18:30       ` Michal Nazarewicz
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Nazarewicz @ 2016-07-27 18:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: lists, 24071

On Wed, Jul 27 2016, Eli Zaretskii wrote:
> But this later patch looks like a beginning of a series of refactoring
> (is it?), hopefully followed by more features,

Well… I had bunch of patches prepared but then I discovered that
‘#ifndef emacs’ is used by etags and ‘#ifdef _LIBC’ cannot go away
either so all of them are now gone… :(

But I am wondering about a way to turn all of the ‘syntax & RE_FOO’
checks into compile-time expressions when compiling regex.c for Emacs.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»





^ permalink raw reply	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH] Refactor regex character class parsing in [:name:]
  2016-07-26 14:46 ` Eli Zaretskii
  2016-07-27 15:29   ` Michal Nazarewicz
  2016-07-27 16:50   ` bug#24071: [PATCH 1/7] New regex tests imported from glibc 2.21 Michal Nazarewicz
@ 2016-07-29  5:31   ` Dima Kogan
  2016-07-29  5:53     ` Eli Zaretskii
  2016-07-29 13:07     ` Michal Nazarewicz
  2 siblings, 2 replies; 18+ messages in thread
From: Dima Kogan @ 2016-07-29  5:31 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 24071, Michal Nazarewicz

Eli Zaretskii <eliz@gnu.org> writes:

> If we are going to make some serious refactoring in regex.c, I think
> we should start with having a test suite for it.  The
> dima_regex_embedded_modifiers branch, created by Dima Kogan (CC'ed) in
> the Emacs repository includes a suite taken from glibc.  Dima, could
> you perhaps merge the parts of the test suite that can already be used
> to the master branch, so that we could use them to verify changes in
> regex.c?

Hi. I just saw this. Do you still need me to get these merged? That
branch is very clean, so just cherry-picking the specific commits should
do it.





^ permalink raw reply	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH] Refactor regex character class parsing in [:name:]
  2016-07-29  5:31   ` bug#24071: [PATCH] " Dima Kogan
@ 2016-07-29  5:53     ` Eli Zaretskii
  2016-07-29 13:07     ` Michal Nazarewicz
  1 sibling, 0 replies; 18+ messages in thread
From: Eli Zaretskii @ 2016-07-29  5:53 UTC (permalink / raw)
  To: Dima Kogan; +Cc: 24071, mina86

> From: Dima Kogan <dima@secretsauce.net>
> Cc: Michal Nazarewicz <mina86@mina86.com>, 24071@debbugs.gnu.org
> Date: Thu, 28 Jul 2016 22:31:48 -0700
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > If we are going to make some serious refactoring in regex.c, I think
> > we should start with having a test suite for it.  The
> > dima_regex_embedded_modifiers branch, created by Dima Kogan (CC'ed) in
> > the Emacs repository includes a suite taken from glibc.  Dima, could
> > you perhaps merge the parts of the test suite that can already be used
> > to the master branch, so that we could use them to verify changes in
> > regex.c?
> 
> Hi. I just saw this. Do you still need me to get these merged?

No, I think Michal already did that, although the stiff is not yet on
the master branch.

Thanks.





^ permalink raw reply	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH] Refactor regex character class parsing in [:name:]
  2016-07-29  5:31   ` bug#24071: [PATCH] " Dima Kogan
  2016-07-29  5:53     ` Eli Zaretskii
@ 2016-07-29 13:07     ` Michal Nazarewicz
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Nazarewicz @ 2016-07-29 13:07 UTC (permalink / raw)
  To: Dima Kogan, Eli Zaretskii; +Cc: 24071

On Thu, Jul 28 2016, Dima Kogan wrote:
> Hi. I just saw this. Do you still need me to get these merged? That
> branch is very clean, so just cherry-picking the specific commits
> should do it.

It’s all sorted out.

I did not cherry pick your last two commits though:
    regex: support for new case-fold embedded modifiers
    regex: added tests for the case-fold embedded modifiers

The result is at https://github.com/mina86/emacs/commits/master and I’ll
push it next week or so.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»





^ permalink raw reply	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH] Refactor regex character class parsing in [:name:]
  2016-07-25 22:54 bug#24071: [PATCH] Refactor regex character class parsing in [:name:] Michal Nazarewicz
  2016-07-26 14:46 ` Eli Zaretskii
@ 2016-08-02 16:06 ` Michal Nazarewicz
  2016-08-02 16:46   ` Eli Zaretskii
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Nazarewicz @ 2016-08-02 16:06 UTC (permalink / raw)
  To: 24071-done

Pushed.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

^ permalink raw reply	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH] Refactor regex character class parsing in [:name:]
  2016-08-02 16:06 ` Michal Nazarewicz
@ 2016-08-02 16:46   ` Eli Zaretskii
  2016-08-02 17:54     ` Michal Nazarewicz
  0 siblings, 1 reply; 18+ messages in thread
From: Eli Zaretskii @ 2016-08-02 16:46 UTC (permalink / raw)
  To: Michal Nazarewicz; +Cc: 24071

> From: Michal Nazarewicz <mina86@mina86.com>
> Date: Tue, 02 Aug 2016 18:06:04 +0200
> 
> Pushed.

Thanks.  I get the following warning when compiling regex.c:

  regex.c:516:0: warning: macro "STREQ" is not used [-Wunused-macros]
   #define STREQ(s1, s2) ((strcmp (s1, s2) == 0))
   ^

When compiling the test suite, I get this warning:

Compiling src/regex-tests.el

  In toplevel form:
  src/regex-tests.el:416:1:Warning: Unused lexical variable `newline'





^ permalink raw reply	[flat|nested] 18+ messages in thread

* bug#24071: [PATCH] Refactor regex character class parsing in [:name:]
  2016-08-02 16:46   ` Eli Zaretskii
@ 2016-08-02 17:54     ` Michal Nazarewicz
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Nazarewicz @ 2016-08-02 17:54 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 24071

On Tue, Aug 02 2016, Eli Zaretskii wrote:
>> From: Michal Nazarewicz <mina86@mina86.com>
>> Date: Tue, 02 Aug 2016 18:06:04 +0200
>> 
>> Pushed.
>
> Thanks.  I get the following warning when compiling regex.c:
>
>   regex.c:516:0: warning: macro "STREQ" is not used [-Wunused-macros]
>    #define STREQ(s1, s2) ((strcmp (s1, s2) == 0))
>    ^
>
> When compiling the test suite, I get this warning:
>
> Compiling src/regex-tests.el
>
>   In toplevel form:
>   src/regex-tests.el:416:1:Warning: Unused lexical variable `newline'

Both fixed.  Thanks.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»





^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-08-02 17:54 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-07-25 22:54 bug#24071: [PATCH] Refactor regex character class parsing in [:name:] Michal Nazarewicz
2016-07-26 14:46 ` Eli Zaretskii
2016-07-27 15:29   ` Michal Nazarewicz
2016-07-27 16:28     ` Eli Zaretskii
2016-07-27 18:30       ` Michal Nazarewicz
2016-07-27 16:50   ` bug#24071: [PATCH 1/7] New regex tests imported from glibc 2.21 Michal Nazarewicz
2016-07-27 16:50     ` bug#24071: [PATCH 2/7] Added driver for the regex tests Michal Nazarewicz
2016-07-27 16:50     ` bug#24071: [PATCH 3/7] Fix reading of regex-resources in regex-tests Michal Nazarewicz
2016-07-27 16:50     ` bug#24071: [PATCH 4/7] Don’t (require 'cl) Michal Nazarewicz
2016-07-27 16:50     ` bug#24071: [PATCH 5/7] Split regex glibc test cases into separet tests Michal Nazarewicz
2016-07-27 16:50     ` bug#24071: [PATCH 6/7] Remove dead opcodes in regex bytecode Michal Nazarewicz
2016-07-27 16:50     ` bug#24071: [PATCH 7/7] Refactor regex character class parsing in [:name:] Michal Nazarewicz
2016-07-29  5:31   ` bug#24071: [PATCH] " Dima Kogan
2016-07-29  5:53     ` Eli Zaretskii
2016-07-29 13:07     ` Michal Nazarewicz
2016-08-02 16:06 ` Michal Nazarewicz
2016-08-02 16:46   ` Eli Zaretskii
2016-08-02 17:54     ` Michal Nazarewicz

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).