Patch for lookaround assertion in regexp

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* Patch for lookaround assertion in regexp
@ 2009-06-03 23:04 Tomohiro MATSUYAMA
  2009-06-04  4:47 ` Miles Bader
  0 siblings, 1 reply; 23+ messages in thread
From: Tomohiro MATSUYAMA @ 2009-06-03 23:04 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1388 bytes --]

Hi, all

I have attached a patch that enables you to
use lookaround assertion in regexp
with following syntax:

* Positive lookahead assertion
    \(?=...\)
* Negative lookahead assertion
    \(?!...\)
* Positive lookbehind assertion
    \(?<=...\)
* Negative lookbehind assertion
    \(?<!...\)

Basically, it works as same as Perl's one.

Spec:
* Any pattern is allowed in lookahead assertion.
* Nested looaround assertion is allowed.
* Capturing is allowed only in positive lookahead/lookbehind assertion.
* Duplication is allowed after such assertion.
* Variable length pattern is NOT yet allowed in lookbehind assertion.
  [x] \(?<=[0-9]+\)MB
  [o] \(?<=[0-9][0-9][0-9][0-9]\)MB
* Lookahead assertion over start bound is not allowed in re-search-backward.
  (re-search-backward "\(?<=a\)b") for buffer "abca_|_b"
  will seek to first "ab".

As of performace, I think there is no problem about lookahead assertion,
but lookbehind assertion is somewhat high cost.
You can check this patch works properly with a testcase I have attached
and also see performance:
    src/emacs --script regex-test.el perf

I saw that lookbehind assertion will spend 5 times than usual lookbehind alike
regexp. I think I have to improve its performance.

Anyway, please try it and review it.
And if like it, please merge it.
I believe that some people really want to use it.

Regards,
MATSUYAMA Tomohiro

[-- Attachment #2: regex-test.el --]
[-- Type: application/octet-stream, Size: 9452 bytes --]

;; -*-coding:utf-8-*-

(defvar test-counter 0)

(defmacro test (&rest form)
  `(princ-list (format "%d ... " (setq test-counter (1+ test-counter)))
               (condition-case nil
                   (if (progn ,@form) 'ok 'fail)
                 (error 'invalid))))

(defun expect-invalid (regexp)
  (test (condition-case nil
            (prog1 nil (string-match regexp ""))
          (error t))))

(defun expect-match (regexp string &optional group-number group-string)
  (test (and (string-match regexp string)
             (if group-number
                 (equal (match-string group-number string) group-string)
               t))))

(defun expect-not-match (regexp string)
  (test (not (string-match regexp string))))

(expect-match "\\(?=\\)" "")
(expect-not-match "\\(?=a\\)" "")
(expect-match "a\\(?=b\\)b" "ab")
(expect-not-match "a\\(?=b\\)c" "ab")
(expect-match "\\(?=a\\)a" "a")
(expect-not-match "\\(?=b\\)a" "a")
(expect-match "\\(?=^\\)a" "a")
(expect-match "a\\(?=$\\)$" "a")
(expect-match "a\\(?=\\)$" "a")
(expect-match "a\\(?=.*c\\)b" "abc")
(expect-not-match "a\\(?=.*d\\)b" "abc")
(expect-match "a\\(?=b\\|c\\|d\\|e\\)" "ae")
(expect-not-match "a\\(?=b\\|c\\|d\\|e\\)" "af")
(expect-match "a\\(?=\\(b\\)\\)b" "ab" 1 "b")
(expect-match "a\\(\\(?=b\\)\\)" "ab" 1 "")
(expect-match "a\\(?=\\(b\\)\\)" "ab" 1 "b")
(expect-match "\\(a\\(?=\\(b\\)\\)\\2\\)\\1" "abab" 1 "ab")
(expect-not-match "\\(a\\)\\(?=\\(b\\)\\)\\1" "ab")
(expect-match "\\(a\\(?=b\\(?=c\\)\\)\\)" "abc" 1 "a")
(expect-not-match "\\(a\\(?=b\\(?=c\\)\\)\\)" "abd")
(expect-not-match "\\(?!\\)" "")
(expect-match "\\(?!a\\)" "")
(expect-not-match "a\\(?!b\\)b" "ab")
(expect-match "a\\(?!b\\)c" "ac")
(expect-not-match "\\(?!a\\)a" "a")
(expect-match "\\(?!b\\)a" "a")
(expect-match "\\(?!^\\)a" "ba")
(expect-not-match "\\(?!^\\)a" "a")
(expect-not-match "a\\(?!$\\)$" "a")
(expect-not-match "a\\(?!\\)$" "a")
(expect-not-match "a\\(?!.*c\\)b" "abc")
(expect-match "a\\(?!.*d\\)b" "abc")
(expect-not-match "a\\(?!b\\|c\\|d\\|e\\)" "ae")
(expect-match "a\\(?!b\\|c\\|d\\|e\\)" "af")
(expect-match "a\\(?!\\(b\\)\\)c" "ac")
(expect-match "a\\(\\(?!b\\)\\)" "ac")
(expect-match "a\\(?!b\\(?!c\\)\\)" "abc")
(expect-not-match "a\\(?!b\\(?=\\(c\\)\\)\\)" "abc")
(expect-not-match "a\\(?!b\\(?!c\\)\\)" "abd")
(expect-match "\\(?<=\\)" "")
(expect-not-match "\\(?<=a\\)" "")
(expect-match "\\(?<=a\\)" "a")
(expect-not-match "\\(?<=b\\)" "a")
(expect-match "\\(?<=^\\)" "")
(expect-not-match "a\\(?<=^\\)" "")
(expect-match "\\(?<=$\\)" "")
(expect-not-match "\\(?<=$\\)a" "")
(expect-match "\\(?<=a\\)b" "ab")
(expect-not-match "\\(?<=c\\)b" "ab")
(expect-match "\\(?<=\\(?<=a\\)\\)b" "ab")
(expect-not-match "\\(?<=\\(?<=b\\)\\)b" "ab")
(expect-match "\\(?<=\\(?=a\\).\\)b" "ab")
(expect-match "\\(?<=\\(a\\)\\)b\\1" "aba" 1 "a")
(expect-match "\\(?<=.\\)a" "aa")
(expect-match "\\(?<=\\(.\\)\\)a" "aa")
(expect-match "\\(?<=\\w\\)a" "aa")
(expect-not-match "\\(?<=\\w\\)a" "!a")
(expect-match "\\(?<=\\sw\\)a" "aa")
(expect-not-match "\\(?<=\\sw\\)a" "!a")
(expect-match "\\(?<=\\cg\\)a" "λa")
(expect-not-match "\\(?<=\\Cg\\)a" "λa")
(expect-match "\\(?<=[a-z]\\)" "aa")
(expect-not-match "\\(?<=[a-z]\\)a" "1a")
(expect-match "\\(?<=[^a-z]\\)" "1a")
(expect-not-match "\\(?<=[^a-z]\\)" "aa")
(expect-match "\\(?<=[:ascii:]\\)a" "aa")
(expect-match "\\(?<=\\`\\)" "")
(expect-not-match "a\\(?<=\\`\\)" "a")
(expect-match "\\(?<=\\'\\)" "")
(expect-not-match "\\(?<=\\'\\)a" "a")
(expect-not-match "\\(?<=\\=\\)" "")
(expect-match "\\(?<=\\b\\)a" "a")
(expect-not-match "a\\(?<=\\b\\)b" "ab")
(expect-match "\\(?<=\\B\\)a" "aa")
(expect-not-match "\\(?<=\\B\\)a" " a")
(expect-match "\\(?<=\\<\\)a" "a")
(expect-not-match "a\\(?<=\\<\\)b" "ab")
(expect-match "a\\(?<=\\>\\)" "a")
(expect-not-match "a\\(?<=\\>\\)b" "ab")
(expect-match "\\(?<=\\_<\\)a" "a")
(expect-not-match "a\\(?<=\\_<\\)b" "ab")
(expect-match "a\\(?<=\\_>\\)" "a")
(expect-not-match "a\\(?<=\\_>\\)b" "ab")
(expect-invalid "\\(?<=\\(.\\)\\1\\)")  ; duplicate
(expect-invalid "\\(?<=a*\\)")          ; variable width
(expect-invalid "\\(?<=a*?\\)")         ; variable width
(expect-invalid "\\(?<=a+\\)")          ; variable width
(expect-invalid "\\(?<=a+?\\)")         ; variable width
(expect-invalid "\\(?<=a?\\)")          ; variable width
(expect-invalid "\\(?<=a??\\)")         ; variable width
(expect-invalid "\\(?<=a\\{1,4\\}\\)")  ; variable width
(expect-invalid "\\(?<=a\\|bb\\|ccc\\)") ; variable width
(expect-invalid "\\(?<=a\\{4\\}\\)")   ; fixed width but not supported yet
(expect-invalid "\\(?<=a\\|\\b\\c\\)")   ; fixed width but not supported yet
(expect-not-match "\\(?<!\\)" "")
(expect-match "\\(?<!a\\)" "")
(expect-match "\\(?<!a\\)" "a")
(expect-not-match "\\(?<!a\\)b" "ab")
(expect-match "\\(?<!b\\)" "a")
(expect-not-match "\\(?<!^\\)" "")
(expect-not-match "a\\(?<!^\\)" "")
(expect-not-match "\\(?<!$\\)" "")
(expect-match "\\(?<=a\\)b" "ab")
(expect-match "\\(?<!c\\)b" "ab")
(expect-match "\\(?<!\\(?<!a\\)\\)b" "ab")
(expect-not-match "\\(?<!\\(?<!b\\)\\)b" "ab")
(expect-match "\\(?<!\\(?!a\\).\\)b" "ab")
(expect-match "\\(?<!.\\)a" "aa")
(expect-not-match "\\(?<!.\\)b" "ab")
(expect-not-match "\\(?<!\\(.\\)\\)b" "ab")
(expect-not-match "\\(?<!\\w\\)b" "ab")
(expect-not-match "\\(?<!\\w\\)b" "ab")
(expect-not-match "\\(?<!\\sw\\)b" "ab")
(expect-match "\\(?<!\\sw\\)a" "!a")
(expect-not-match "\\(?<!\\cg\\)a" "λa")
(expect-match "\\(?<!\\Cg\\)a" "λa")
(expect-match "\\(?<![a-z]\\)" "aa")
(expect-match "\\(?<![a-z]\\)a" "1a")
(expect-not-match "\\(?<![^a-z]\\)a" "1a")
(expect-not-match "\\(?<![:ascii:]\\)b" "ab")
(expect-not-match "\\(?<!\\`\\)" "")
(expect-match "a\\(?<!\\`\\)" "a")
(expect-not-match "\\(?<!\\'\\)" "")
(expect-match "\\(?<!\\'\\)a" "a")
(expect-match "\\(?<!\\=\\)" "")
(expect-not-match "\\(?<!\\b\\)a" "a")
(expect-match "a\\(?<!\\b\\)b" "ab")
(expect-not-match "\\(?<!\\B\\)b" "ab")
(expect-match "\\(?<!\\B\\)a" " a")
(expect-not-match "\\(?<!\\<\\)a" "a")
(expect-match "a\\(?<!\\<\\)b" "ab")
(expect-not-match "a\\(?<!\\>\\)" "a")
(expect-match "a\\(?<!\\>\\)b" "ab")
(expect-not-match "\\(?<!\\_<\\)a" "a")
(expect-match "a\\(?<!\\_<\\)b" "ab")
(expect-not-match "a\\(?<!\\_>\\)" "a")
(expect-match "a\\(?<!\\_>\\)b" "ab")
(expect-invalid "\\(?<!\\(.\\)\\1\\)")  ; duplicate
(expect-invalid "\\(?<!a*\\)")          ; variable width
(expect-invalid "\\(?<!a*?\\)")         ; variable width
(expect-invalid "\\(?<!a+\\)")          ; variable width
(expect-invalid "\\(?<!a+?\\)")         ; variable width
(expect-invalid "\\(?<!a?\\)")          ; variable width
(expect-invalid "\\(?<!a??\\)")         ; variable width
(expect-invalid "\\(?<!a\\{1,4\\}\\)")  ; variable width
(expect-invalid "\\(?<!a\\|bb\\|ccc\\)") ; variable width
(expect-invalid "\\(?<!a\\{4\\}\\)")   ; fixed width but not supported yet
(expect-invalid "\\(?<!a\\|\\b\\c\\)")   ; fixed width but not supported yet

(expect-match "Hello, \\(?=世界\\)" "Hello, 世界!")
(expect-not-match "Hello, \\(?=せかい\\)" "Hello, 世界!")
(expect-match "Hello, \\(?!せかい\\)" "Hello, 世界!")
(expect-not-match "Hello, \\(?!世界\\)" "Hello, 世界!")
(expect-match "\\(?<=こんにちは\\), World!" "こんにちは, World!")
(expect-not-match "\\(?<=こんにちわ\\), World!" "こんにちは, World!")
(expect-match "\\(?<!こんにちわ\\), World!" "こんにちは, World!")
(expect-not-match "\\(?<!こんにちは\\), World!" "こんにちは, World!")

(require 'cl)

(with-temp-buffer
  (insert "abracadabra")
  (goto-char (point-min))
  (test (equal
         (loop while (re-search-forward "a\\(?=b\\)" nil t)
               collect (point))
         '(2 9))))

(with-temp-buffer
  (insert "abracadabra")
  (test (equal
         (loop while (re-search-backward "a\\(?=b\\)" nil t)
               collect (point))
         '(8 1))))

(with-temp-buffer
  (insert "abracadabra")
  (goto-char (point-min))
  (test (equal
         (loop while (re-search-forward "\\(?<=a\\)b" nil t)
               collect (point))
         '(3 10))))

(with-temp-buffer
  (insert "abracadabra")
  (test (equal
         (loop while (re-search-backward "\\(?<=a\\)b" nil t)
               collect (point))
         '(9 2))))

(with-temp-buffer
  (insert "abcdebc")
  (goto-char 3)
  (test (eq (re-search-forward "\\(?<=b\\)c" nil t) 4)))

(with-temp-buffer
  (insert "abcdebc")
  (goto-char 7)
  ;; search-backward with lookahead over bound is not supported yet
  (test (eq (re-search-backward "b\\(?=c\\)" nil t) 2)))

(when (member "perf" argv)
  ;; generate big file
  (require 'find-func)
  (let ((file (concat (or find-function-C-source-directory "~/src/emacs") "/xdisp.c"))
        count)
    (with-temp-buffer
      (insert-file-contents file)
      (dolist (pair '((point-min . re-search-forward) (point-max . re-search-backward)))
        (dolist (regexp '("unsigned \\(?:char\\|int\\|long\\)" "unsigned \\(?=char\\|int\\|long\\)"
                          "\\(?:unsigned \\)int" "\\(?<=unsigned \\)int"))
          (setq count 0)
          (princ-list regexp
                      ": "
                      (car
                       (benchmark-run
                        10
                        (progn
                          (goto-char (funcall (car pair)))
                          (while (funcall (cdr pair) regexp nil t)
                            (setq count (1+ count))))))
                      " elapsed (" count " found)"))))))

[-- Attachment #3: emacs-regex.patch --]
[-- Type: application/octet-stream, Size: 15834 bytes --]

Index: src/regex.c
===================================================================
RCS file: /sources/emacs/emacs/src/regex.c,v
retrieving revision 1.236
diff -u -r1.236 regex.c
--- src/regex.c	8 Jan 2009 03:15:54 -0000	1.236
+++ src/regex.c	3 Jun 2009 22:35:17 -0000
@@ -735,7 +735,14 @@
   syntaxspec,
 
 	/* Matches any character whose syntax is not that specified.  */
-  notsyntaxspec
+  notsyntaxspec,
+
+  lookahead,
+  lookahead_not,
+  lookbehind,
+  lookbehind_not,
+  lookaround_succeed,
+  lookaround_fail
 
 #ifdef emacs
   ,before_dot,	/* Succeeds if before point.  */
@@ -1033,6 +1040,36 @@
 	  fprintf (stderr, "/stop_memory/%d", *p++);
 	  break;
 
+        case lookahead:
+          extract_number_and_incr (&mcnt, &p);
+          fprintf (stderr, "/lookahead/%d", mcnt);
+          break;
+
+        case lookahead_not:
+          extract_number_and_incr (&mcnt, &p);
+          fprintf (stderr, "/lookahead_not/%d", mcnt);
+          break;
+
+        case lookbehind:
+          extract_number_and_incr (&mcnt, &p);
+          extract_number_and_incr (&mcnt2, &p);
+          fprintf (stderr, "/lookbehind/%d/%d", mcnt, mcnt2);
+          break;
+
+        case lookbehind_not:
+          extract_number_and_incr (&mcnt, &p);
+          extract_number_and_incr (&mcnt2, &p);
+          fprintf (stderr, "/lookbehind_not/%d/%d", mcnt, mcnt2);
+          break;
+
+        case lookaround_succeed:
+	  fprintf (stderr, "/lookaround_succeed");
+          break;
+
+        case lookaround_fail:
+          fprintf (stderr, "/lookaround_fail");
+          break;
+            
 	case duplicate:
 	  fprintf (stderr, "/duplicate/%d", *p++);
 	  break;
@@ -1600,11 +1637,17 @@
     }									\
   else									\
     {									\
-      regend[reg] = POP_FAILURE_POINTER ();				\
-      regstart[reg] = POP_FAILURE_POINTER ();				\
-      DEBUG_PRINT4 ("     Pop reg %d (spanning %p -> %p)\n",		\
-		    reg, regstart[reg], regend[reg]);			\
-    }									\
+      re_char *start, *end;                                             \
+      end = POP_FAILURE_POINTER ();                                     \
+      start = POP_FAILURE_POINTER ();                                   \
+      if (!discard_saved_regs)                                          \
+        {                                                               \
+          regstart[reg] = start;                                        \
+          regend[reg] = end;                                            \
+          DEBUG_PRINT4 ("     Pop reg %d (spanning %p -> %p)\n",        \
+                        reg, regstart[reg], regend[reg]);               \
+        }                                                               \
+    }                                                                   \
 } while (0)
 
 /* Check that we are not stuck in an infinite loop.  */
@@ -1702,7 +1745,7 @@
   while (fail_stack.frame < fail_stack.avail)				\
     POP_FAILURE_REG_OR_COUNT ();					\
 									\
-  pat = POP_FAILURE_POINTER ();				\
+  pat = POP_FAILURE_POINTER ();                                         \
   DEBUG_PRINT2 ("  Popping pattern %p: ", pat);				\
   DEBUG_PRINT_COMPILED_PATTERN (bufp, pat, pend);			\
 									\
@@ -1724,6 +1767,29 @@
 } while (0) /* POP_FAILURE_POINT */
 
 
+#define FINISH_LOOKAROUND()                                     \
+  do {                                                          \
+    re_char *str, *pat;                                         \
+    re_opcode_t op;                                             \
+    discard_saved_regs = 1;                                     \
+    while (!FAIL_STACK_EMPTY ())                                \
+      {                                                         \
+        POP_FAILURE_POINT (str, pat);                           \
+        op = (re_opcode_t) *pat;                                \
+        if (op == lookahead                                     \
+            || op == lookahead_not                              \
+            || op == lookbehind                                 \
+            || op == lookbehind_not)                            \
+          {                                                     \
+            d = str;                                            \
+            dend = ((d >= string1 && d <= end1)                 \
+                    ? end_match_1 : end_match_2);               \
+            break;                                              \
+          }                                                     \
+      }                                                         \
+    discard_saved_regs = 0;                                     \
+  } while (0);
+
 \f
 /* Registers are set to a sentinel when they haven't yet matched.  */
 #define REG_UNSET(e) ((e) == NULL)
@@ -1922,6 +1988,7 @@
   pattern_offset_t fixup_alt_jump;
   pattern_offset_t laststart_offset;
   regnum_t regnum;
+  int lookaround;
 } compile_stack_elt_t;
 
 
@@ -2522,6 +2589,8 @@
 						 compile_stack,
 						 regnum_t regnum));
 
+static int exact_chars_in_pattern_buffer _RE_ARGS ((struct re_pattern_buffer *bufp, re_char *p, re_char *pend));
+
 /* `regex_compile' compiles PATTERN (of length SIZE) according to SYNTAX.
    Returns one of error codes defined in `regex.h', or zero for success.
 
@@ -3261,6 +3330,7 @@
 	    handle_open:
 	      {
 		int shy = 0;
+                int lookaround = 0;
 		regnum_t regnum = 0;
 		if (p+1 < pend)
 		  {
@@ -3282,6 +3352,27 @@
 			      case '1': case '2': case '3': case '4':
 			      case '5': case '6': case '7': case '8': case '9':
 				regnum = 10*regnum + (c - '0'); break;
+                              case '=':
+                                /* Positive lookahead assertion.  */
+                                shy = lookaround = 1;
+                                break;
+                              case '!':
+                                /* Negative lookahead assertion.  */
+                                shy = lookaround = 2;
+                                break;
+                              case '<':
+                                {
+                                  PATFETCH (c);
+                                  if (c == '=')
+                                    /* Positive lookbehind assertion.  */
+                                    shy = lookaround = -1;
+                                  else if (c == '!')
+                                    /* Negative lookbehind assertion.  */
+                                    shy = lookaround = -2;
+                                  else
+                                    FREE_STACK_RETURN (REG_BADPAT);
+                                }
+                                break;
 			      default:
 				/* Only (?:...) is supported right now. */
 				FREE_STACK_RETURN (REG_BADPAT);
@@ -3328,6 +3419,7 @@
 		  = fixup_alt_jump ? fixup_alt_jump - bufp->buffer + 1 : 0;
 		COMPILE_STACK_TOP.laststart_offset = b - bufp->buffer;
 		COMPILE_STACK_TOP.regnum = regnum;
+                COMPILE_STACK_TOP.lookaround = lookaround;
 
 		/* Do not push a start_memory for groups beyond the last one
 		   we can represent in the compiled pattern.  */
@@ -3377,6 +3469,7 @@
 		   later groups should continue to be numbered higher,
 		   as in `(ab)c(de)' -- the second group is #2.  */
 		regnum_t regnum;
+                int lookaround;
 
 		compile_stack.avail--;
 		begalt = bufp->buffer + COMPILE_STACK_TOP.begalt_offset;
@@ -3389,13 +3482,40 @@
 		/* If we've reached MAX_REGNUM groups, then this open
 		   won't actually generate any code, so we'll have to
 		   clear pending_exact explicitly.  */
+                lookaround = COMPILE_STACK_TOP.lookaround;
 		pending_exact = 0;
 
 		/* We're at the end of the group, so now we know how many
 		   groups were inside this one.  */
 		if (regnum <= MAX_REGNUM && regnum > 0)
 		  BUF_PUSH_2 (stop_memory, regnum);
-	      }
+                else if (lookaround)
+                  {
+                    if (lookaround > 0)
+                      {
+                        /* Positive/negative lookahead assertion.  */
+                        GET_BUFFER_SPACE (3);
+                        INSERT_JUMP (lookaround == 1 ? lookahead : lookahead_not, laststart, b + 4);
+                        b += 3;
+                      }
+                    else
+                      {
+                        /* Positive/negative lookbehind assertion.  */
+                        int count = exact_chars_in_pattern_buffer (bufp, laststart, b);
+                        if (count == -1) /* variable length */
+                          FREE_STACK_RETURN (REG_BADPAT);
+
+                        GET_BUFFER_SPACE (5);
+                        INSERT_JUMP2 (lookaround == -1 ? lookbehind : lookbehind_not, laststart, b + 6, count);
+                        b += 5;
+                      }
+                    
+                    /* Negative form.  */
+                    if (lookaround > 1 || lookaround < -1)
+                      BUF_PUSH (lookaround_fail);
+                    BUF_PUSH (lookaround_succeed);
+                  }
+              }
 	      break;
 
 
@@ -3949,10 +4069,16 @@
        /* After an alternative?	 */
     || (*prev == '|' && (syntax & RE_NO_BK_VBAR || prev_prev_backslash))
        /* After a shy subexpression?  */
-    || ((syntax & RE_SHY_GROUPS) && prev - 2 >= pattern
-	&& prev[-1] == '?' && prev[-2] == '('
-	&& (syntax & RE_NO_BK_PARENS
-	    || (prev - 3 >= pattern && prev[-3] == '\\')));
+    || ((syntax & RE_SHY_GROUPS)
+        && ((prev - 2 >= pattern
+             && prev[-1] == '?' && prev[-2] == '('
+             && (syntax & RE_NO_BK_PARENS
+                 || (prev - 3 >= pattern && prev[-3] == '\\')))
+         || (prev - 3 >= pattern
+             && (*prev == '=' || *prev == '!')
+             && prev[-1] == '<' && prev[-2] == '?' && prev[-3] == '('
+             && (syntax & RE_NO_BK_PARENS
+                 || (prev - 4 >= pattern && prev[-4] == '\\')))));
 }
 
 
@@ -4198,6 +4324,13 @@
 	    }
 	  break;
 
+        case lookahead:
+        case lookahead_not:
+        case lookbehind:
+        case lookbehind_not:
+          if (!fastmap) break;
+          return -1;
+          
       /* All cases after this match the empty string.  These end with
 	 `continue'.  */
 
@@ -4827,7 +4960,7 @@
 	{
 	case start_memory:
 	case stop_memory:
-	  p += 2; break;
+          p += 2; break;
 	case no_op:
 	  p += 1; break;
 	case jump:
@@ -4843,6 +4976,93 @@
   return p;
 }
 
+static int
+exact_chars_in_pattern_buffer (bufp, p, pend)
+     struct re_pattern_buffer *bufp;
+     re_char *p, *pend;
+{
+  int count = 0;
+  while (p < pend)
+    {
+      switch (SWITCH_ENUM_CAST ((re_opcode_t) *p++))
+	{
+        case exactn:
+          {
+            int mcnt = *p++;
+            int buf_charlen;
+            while (mcnt > 0) {
+              STRING_CHAR_AND_LENGTH (p, p - pend, buf_charlen);
+              p += buf_charlen;
+              mcnt -= buf_charlen;
+              count++;
+            }
+          }
+          break;
+        case start_memory:
+        case stop_memory:
+          p++;
+          break;
+#ifdef emacs
+        case categoryspec:
+        case notcategoryspec:
+#endif /* emacs */
+        case syntaxspec:
+        case notsyntaxspec:
+          p++;
+        case anychar:
+          count++;
+          break;
+
+        case charset:
+        case charset_not:
+          if (CHARSET_RANGE_TABLE_EXISTS_P (p - 1))
+            {
+              int mcnt;
+              p = CHARSET_RANGE_TABLE (p - 1);
+              EXTRACT_NUMBER_AND_INCR (mcnt, p);
+              p = CHARSET_RANGE_TABLE_END (p, mcnt);
+            }
+          else
+            p += 1 + CHARSET_BITMAP_SIZE (p - 1);
+          count++;
+          break;
+
+#ifdef emacs
+	case before_dot:
+	case at_dot:
+	case after_dot:
+#endif /* emacs */
+	case no_op:
+	case begline:
+	case endline:
+	case begbuf:
+	case endbuf:
+	case wordbound:
+	case notwordbound:
+	case wordbeg:
+	case wordend:
+	case symbeg:
+	case symend:
+          /* Zero width.  */
+          continue;
+        case lookahead:
+        case lookahead_not:
+        case lookbehind:
+        case lookbehind_not:
+          /* Skip to lookaround_success.  */
+          while (p < pend)
+            {
+              if ((re_opcode_t) *p++ == lookaround_succeed)
+                break;
+            }
+          break;
+        default:
+          return -1;
+        }
+    }
+  return count;
+}
+
 /* Non-zero if "p1 matches something" implies "p2 fails".  */
 static int
 mutually_exclusive_p (bufp, p1, p2)
@@ -5200,6 +5420,9 @@
   re_char **best_regstart, **best_regend;
 #endif
 
+  /* Discard a saved register from the stack.  */
+  boolean discard_saved_regs = 0;
+
   /* Logically, this is `best_regend[0]'.  But we don't want to have to
      allocate space for that if we're not allocating space for anything
      else (see below).  Also, we never need info about register 0 for
@@ -5772,6 +5995,77 @@
 	  p += 1;
 	  break;
 
+        case lookahead:
+        case lookahead_not:
+          DEBUG_PRINT1 ((re_opcode_t) *(p - 1) == lookahead ? "EXECUTING lookahead.\n" : "EXECUTING lookahead_not.\n");
+
+          p += 2;
+          PUSH_FAILURE_POINT (p - 3, d);
+          break;
+
+        case lookbehind:
+        case lookbehind_not:
+          {
+            int mcnt, count;
+            boolean not = (re_opcode_t) *(p - 1) != lookbehind;
+
+            EXTRACT_NUMBER_AND_INCR (mcnt, p);
+            EXTRACT_NUMBER_AND_INCR (count, p);
+
+            DEBUG_PRINT2 (not
+                          ? "EXECUTING lookbehind_not %d.\n"
+                          : "EXECUTING lookbehind %d.\n", count);
+            
+            dfail = d;
+            while (d != string1 && count > 0)
+              {
+                if (d == string2)
+                  {
+                    if (!string1)
+                      break;
+                    d = end1;
+                    dend = end_match_1;
+                  }
+                
+                if (target_multibyte)
+                  {
+                    re_char *dhead = (d >= string1 && d <= end1) ? string1 : string2;
+                    PREV_CHAR_BOUNDARY (d, dhead);
+                  }
+                else
+                  d--;
+                count--;
+              }
+
+            if (count > 0)
+              {
+                if (not)
+                  {
+                    /* There is no enough string to match.
+                       So just make it succeeded here. */
+                    d = dfail;
+                    p = p - 2 + mcnt;
+                    break;
+                  }
+                else
+                  goto fail;
+              }
+
+            PUSH_FAILURE_POINT (p - 5, dfail);
+          }
+          break;
+
+        case lookaround_succeed:
+          DEBUG_PRINT1 ("EXECUTING lookaround_succeed.\n");
+          
+          FINISH_LOOKAROUND();
+          break;
+
+        case lookaround_fail:
+          DEBUG_PRINT1 ("EXECUTING lookaround_fail.\n");
+          
+          FINISH_LOOKAROUND();
+          goto fail;
 
 	/* \<digit> has been turned into a `duplicate' command which is
 	   followed by the numeric value of <digit> as the register number.  */
@@ -6413,12 +6707,16 @@
 	    case on_failure_jump_loop:
 	    case on_failure_jump:
 	    case succeed_n:
+            case lookahead_not:
+            case lookbehind_not:
 	      d = str;
 	    continue_failure_jump:
 	      EXTRACT_NUMBER_AND_INCR (mcnt, pat);
 	      p = pat + mcnt;
 	      break;
 
+            case lookahead:
+            case lookbehind:
 	    case no_op:
 	      /* A special frame used for nastyloops. */
 	      goto fail;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2009-06-03 23:04 Tomohiro MATSUYAMA
@ 2009-06-04  4:47 ` Miles Bader
  2009-06-04  8:27   ` Deniz Dogan
  0 siblings, 1 reply; 23+ messages in thread
From: Miles Bader @ 2009-06-04  4:47 UTC (permalink / raw)
  To: Tomohiro MATSUYAMA; +Cc: emacs-devel

Tomohiro MATSUYAMA <t.matsuyama.pub@gmail.com> writes:
> I have attached a patch that enables you to
> use lookaround assertion in regexp
...
> And if like it, please merge it.
> I believe that some people really want to use it.

This would be an excellent addition...

I often wish emacs regexps had such a feature.

-Miles

-- 
Run away!  Run away!




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2009-06-04  4:47 ` Miles Bader
@ 2009-06-04  8:27   ` Deniz Dogan
  0 siblings, 0 replies; 23+ messages in thread
From: Deniz Dogan @ 2009-06-04  8:27 UTC (permalink / raw)
  To: Miles Bader; +Cc: Tomohiro MATSUYAMA, emacs-devel

2009/6/4 Miles Bader <miles@gnu.org>:
> Tomohiro MATSUYAMA <t.matsuyama.pub@gmail.com> writes:
>> I have attached a patch that enables you to
>> use lookaround assertion in regexp
> ...
>> And if like it, please merge it.
>> I believe that some people really want to use it.
>
> This would be an excellent addition...
>
> I often wish emacs regexps had such a feature.

Seconded!

-- 
Deniz Dogan




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Patch for lookaround assertion in regexp
@ 2012-01-23 11:17 Colin Fraizer
  2012-01-23 14:11 ` Stefan Monnier
  2012-09-26  6:55 ` Tomohiro Matsuyama
  0 siblings, 2 replies; 23+ messages in thread
From: Colin Fraizer @ 2012-01-23 11:17 UTC (permalink / raw)
  To: emacs-devel; +Cc: t.matsuyama.pub

In 2009, Tomohiro Matsuyama sent a message to this list with a patch to add
lookahead/lookbehind assertions to Emacs regular expressions (regexps). Is
there any plan to incorporate this useful feature into an official release?

[IMHO, it is incredibly useful. The only responses to Matsuyama-san's
message were positive.]

Thanks,
--Colin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-23 11:17 Patch for lookaround assertion in regexp Colin Fraizer
@ 2012-01-23 14:11 ` Stefan Monnier
  2012-01-23 14:44   ` Tom
                     ` (2 more replies)
  2012-09-26  6:55 ` Tomohiro Matsuyama
  1 sibling, 3 replies; 23+ messages in thread
From: Stefan Monnier @ 2012-01-23 14:11 UTC (permalink / raw)
  To: Colin Fraizer; +Cc: t.matsuyama.pub, emacs-devel

> In 2009, Tomohiro Matsuyama sent a message to this list with a patch
> to add lookahead/lookbehind assertions to Emacs regular expressions
> (regexps).  Is there any plan to incorporate this useful feature into
> an official release?
> [IMHO, it is incredibly useful. The only responses to Matsuyama-san's
> message were positive.]

I'd like to replace the current regexp engine with one that does not
suffer from exponential blowup (i.e. using "Thompson's NFA").
Every feature added to the current regexp engine will make it more
difficult to switch, so I'm not too thrilled at the idea of adding
lookaround operators.
OTOH, noone has submitted code to replace the current regexp engine, and
I don't forsee I'll have the time to write it myself, so maybe I should
just give up on this plan.

        Stefan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-23 14:11 ` Stefan Monnier
@ 2012-01-23 14:44   ` Tom
  2012-01-23 14:50     ` Andreas Schwab
  2012-01-23 15:31     ` Stefan Monnier
  2012-01-24  8:41   ` Nikolai Weibull
  2012-02-14 17:53   ` Daniel Colascione
  2 siblings, 2 replies; 23+ messages in thread
From: Tom @ 2012-01-23 14:44 UTC (permalink / raw)
  To: emacs-devel

Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

> 
> OTOH, noone has submitted code to replace the current regexp engine, and
> I don't forsee I'll have the time to write it myself, so maybe I should
> just give up on this plan.
> 

Why not use simply PCRE with a bridge layer which translates from emacs
regexp format to PCRE? AFAIK the emacs regexes are more or less a subset
of PCRE, so the translation shouldn't be very difficult.

Why reinvent the wheel when PCRE is already there?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-23 14:44   ` Tom
@ 2012-01-23 14:50     ` Andreas Schwab
  2012-01-23 15:19       ` Tom
  2012-01-23 15:31     ` Stefan Monnier
  1 sibling, 1 reply; 23+ messages in thread
From: Andreas Schwab @ 2012-01-23 14:50 UTC (permalink / raw)
  To: Tom; +Cc: emacs-devel

Tom <adatgyujto@gmail.com> writes:

> Why not use simply PCRE with a bridge layer which translates from emacs
> regexp format to PCRE? AFAIK the emacs regexes are more or less a subset
> of PCRE, so the translation shouldn't be very difficult.

Does PCRE implement \c and \s?  Does PCRE provide an interface for
searching a memory region with a gap?

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-23 14:50     ` Andreas Schwab
@ 2012-01-23 15:19       ` Tom
  2012-01-23 16:14         ` Andreas Schwab
  0 siblings, 1 reply; 23+ messages in thread
From: Tom @ 2012-01-23 15:19 UTC (permalink / raw)
  To: emacs-devel

Andreas Schwab <schwab <at> linux-m68k.org> writes:

> 
> Tom <adatgyujto <at> gmail.com> writes:
> 
> > Why not use simply PCRE with a bridge layer which translates from emacs
> > regexp format to PCRE? AFAIK the emacs regexes are more or less a subset
> > of PCRE, so the translation shouldn't be very difficult.
> 
> Does PCRE implement \c and \s?  

If it doesn't then it's a job for the translation layer. Char syntaxes
and categories could be converted into the standard [...] format.

I don't know how efficient it would be, it should be tested, but 
pcre can compile regexps too. I don't know if emacs uses some precompiled
format internally if the same regexp is used again and again.

> Does PCRE provide an interface for searching a memory region with a gap?
> 

I don't know it should be checked, but other editors use PCRE as their
regex search engine, so there may be some support for that.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-23 14:44   ` Tom
  2012-01-23 14:50     ` Andreas Schwab
@ 2012-01-23 15:31     ` Stefan Monnier
  1 sibling, 0 replies; 23+ messages in thread
From: Stefan Monnier @ 2012-01-23 15:31 UTC (permalink / raw)
  To: Tom; +Cc: emacs-devel

>> OTOH, noone has submitted code to replace the current regexp engine, and
>> I don't forsee I'll have the time to write it myself, so maybe I should
>> just give up on this plan.
> Why not use simply PCRE with a bridge layer which translates from emacs
> regexp format to PCRE? AFAIK the emacs regexes are more or less a subset
> of PCRE, so the translation shouldn't be very difficult.

AFAIK PCRE uses a backtracking implementation, so suffers from the same
exponential blowup as the current code.

> Why reinvent the wheel when PCRE is already there?

I didn't mean to say that I want a fresh new engine written
from scratch.  But adding pcre to the list of libraries linked with
Emacs is not nearly enough: someone has to write the bridge layer and it
may turn out not to be as simple as it seems.

        Stefan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-23 15:19       ` Tom
@ 2012-01-23 16:14         ` Andreas Schwab
  2012-01-23 17:11           ` Stefan Monnier
  0 siblings, 1 reply; 23+ messages in thread
From: Andreas Schwab @ 2012-01-23 16:14 UTC (permalink / raw)
  To: Tom; +Cc: emacs-devel

Tom <adatgyujto@gmail.com> writes:

> Andreas Schwab <schwab <at> linux-m68k.org> writes:
>
>> 
>> Tom <adatgyujto <at> gmail.com> writes:
>> 
>> > Why not use simply PCRE with a bridge layer which translates from emacs
>> > regexp format to PCRE? AFAIK the emacs regexes are more or less a subset
>> > of PCRE, so the translation shouldn't be very difficult.
>> 
>> Does PCRE implement \c and \s?  
>
> If it doesn't then it's a job for the translation layer. Char syntaxes
> and categories could be converted into the standard [...] format.

Enumerating the syntax/category members is not an option.  There is no
easy way to do that.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-23 16:14         ` Andreas Schwab
@ 2012-01-23 17:11           ` Stefan Monnier
  2012-01-23 18:45             ` Štěpán Němec
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Monnier @ 2012-01-23 17:11 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Tom, emacs-devel

>> If it doesn't then it's a job for the translation layer. Char syntaxes
>> and categories could be converted into the standard [...] format.
> Enumerating the syntax/category members is not an option.

Indeed.

> There is no easy way to do that.

For `categories', there is a way, but the result is a *very* large [...]
chunk, so it's impractical.  For `syntax' there is indeed no way, since
the syntax of a char doesn't only depend on the char itself but also of
the `char-table' text-property that might be applied to that particular
character position (and of course, if we ignore this problem, we're
still back to the same problem of enormous [...] expressions, as is the
case for categories).
These entities really need to be implemented inside the regexp-engine
(but they're usually pretty easy to implement there).

        Stefan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-23 17:11           ` Stefan Monnier
@ 2012-01-23 18:45             ` Štěpán Němec
  2012-01-30  0:31               ` Juri Linkov
  0 siblings, 1 reply; 23+ messages in thread
From: Štěpán Němec @ 2012-01-23 18:45 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Andreas Schwab, Tom, emacs-devel

On Mon, 23 Jan 2012 18:11:26 +0100
Stefan Monnier wrote:

>>> If it doesn't then it's a job for the translation layer. Char syntaxes
>>> and categories could be converted into the standard [...] format.
>> Enumerating the syntax/category members is not an option.
>
> Indeed.
>
>> There is no easy way to do that.
>
> For `categories', there is a way, but the result is a *very* large [...]
> chunk, so it's impractical.  For `syntax' there is indeed no way, since
> the syntax of a char doesn't only depend on the char itself but also of
> the `char-table' text-property that might be applied to that particular
> character position (and of course, if we ignore this problem, we're
> still back to the same problem of enormous [...] expressions, as is the
> case for categories).
> These entities really need to be implemented inside the regexp-engine
> (but they're usually pretty easy to implement there).

OTOH using something like PCRE would finally fix the currently erroneous
implementation of classes like [:space:], which now is the same as \s-.
(And personally I would gladly forgo the syntax categories for standard
[:classes:], although I imagine the former might be used by the
font-locking or somewhere... I never felt the need for them.)

-- 
Štěpán



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-23 14:11 ` Stefan Monnier
  2012-01-23 14:44   ` Tom
@ 2012-01-24  8:41   ` Nikolai Weibull
  2012-01-24 14:40     ` Stefan Monnier
  2012-01-24 23:27     ` David De La Harpe Golden
  2012-02-14 17:53   ` Daniel Colascione
  2 siblings, 2 replies; 23+ messages in thread
From: Nikolai Weibull @ 2012-01-24  8:41 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel, Colin Fraizer, t.matsuyama.pub

On Mon, Jan 23, 2012 at 15:11, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>> In 2009, Tomohiro Matsuyama sent a message to this list with a patch
>> to add lookahead/lookbehind assertions to Emacs regular expressions
>> (regexps).  Is there any plan to incorporate this useful feature into
>> an official release?

> I'd like to replace the current regexp engine with one that does not
> suffer from exponential blowup (i.e. using "Thompson's NFA").

> OTOH, noone has submitted code to replace the current regexp engine, and
> I don't forsee I'll have the time to write it myself, so maybe I should
> just give up on this plan.

As an alternative to PCRE, which, as has already been pointed out,
doesn’t match any of these requirements, how about RE2?

http://code.google.com/p/re2/

It’s written in C++, which is a minus, but it should be simple enough
to extend it with \c and \s.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-24  8:41   ` Nikolai Weibull
@ 2012-01-24 14:40     ` Stefan Monnier
  2012-01-24 15:09       ` Nikolai Weibull
  2012-01-24 23:27     ` David De La Harpe Golden
  1 sibling, 1 reply; 23+ messages in thread
From: Stefan Monnier @ 2012-01-24 14:40 UTC (permalink / raw)
  To: Nikolai Weibull; +Cc: emacs-devel, Colin Fraizer, t.matsuyama.pub

>>> In 2009, Tomohiro Matsuyama sent a message to this list with a patch
>>> to add lookahead/lookbehind assertions to Emacs regular expressions
>>> (regexps).  Is there any plan to incorporate this useful feature into
>>> an official release?
>> I'd like to replace the current regexp engine with one that does not
>> suffer from exponential blowup (i.e. using "Thompson's NFA").
>> OTOH, noone has submitted code to replace the current regexp engine, and
>> I don't forsee I'll have the time to write it myself, so maybe I should
>> just give up on this plan.
> As an alternative to PCRE, which, as has already been pointed out,
> doesn’t match any of these requirements, how about RE2?
> http://code.google.com/p/re2/
> It’s written in C++, which is a minus, but it should be simple enough
> to extend it with \c and \s.

That might work, indeed (tho someone still has to write the
corresponding code).
Note that it does not support lookaround assertions.


        Stefan



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-24 14:40     ` Stefan Monnier
@ 2012-01-24 15:09       ` Nikolai Weibull
  2012-01-24 17:34         ` Stefan Monnier
  0 siblings, 1 reply; 23+ messages in thread
From: Nikolai Weibull @ 2012-01-24 15:09 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel, Colin Fraizer, t.matsuyama.pub

On Tue, Jan 24, 2012 at 15:40, Stefan Monnier <monnier@iro.umontreal.ca> wrote:

>> As an alternative to PCRE, which, as has already been pointed out,
>> doesn’t match any of these requirements, how about RE2?
>> http://code.google.com/p/re2/
>> It’s written in C++, which is a minus, but it should be simple enough
>> to extend it with \c and \s.

> That might work, indeed (tho someone still has to write the
> corresponding code).
> Note that it does not support lookaround assertions.

True, but you can, as far as I know, not do so without (allowing for)
exponential behavior.

I don’t want to detract from the merits of lookaround assertions (or
start a discussion on the subject), but I’ve always found them to be a
sign of improper use of (no longer) regular expressions.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-24 15:09       ` Nikolai Weibull
@ 2012-01-24 17:34         ` Stefan Monnier
  0 siblings, 0 replies; 23+ messages in thread
From: Stefan Monnier @ 2012-01-24 17:34 UTC (permalink / raw)
  To: Nikolai Weibull; +Cc: emacs-devel, Colin Fraizer, t.matsuyama.pub

>>> As an alternative to PCRE, which, as has already been pointed out,
>>> doesn’t match any of these requirements, how about RE2?
>>> http://code.google.com/p/re2/
>>> It’s written in C++, which is a minus, but it should be simple enough
>>> to extend it with \c and \s.
>> That might work, indeed (tho someone still has to write the
>> corresponding code).
>> Note that it does not support lookaround assertions.
> True, but you can, as far as I know, not do so without (allowing for)
> exponential behavior.

Actually, no.  Contrary to backreferences (which are outside of the
mathematical notion of regular expressions, and can't be matched in
linear time), lookahead assertions are "normal".  So RE2 may get support
for lookahead assertions in the future (maybe for lookbehind as well,
tho that's more difficult).

> I don’t want to detract from the merits of lookaround assertions (or
> start a discussion on the subject), but I’ve always found them to be a
> sign of improper use of (no longer) regular expressions.

I just pointed it out as supporting my argument that I'd rather not add
lookaround assertions since it may make it more difficult to change to
an linear-time matcher later on.

        Stefan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-24  8:41   ` Nikolai Weibull
  2012-01-24 14:40     ` Stefan Monnier
@ 2012-01-24 23:27     ` David De La Harpe Golden
  2012-01-25  6:07       ` Nikolai Weibull
  1 sibling, 1 reply; 23+ messages in thread
From: David De La Harpe Golden @ 2012-01-24 23:27 UTC (permalink / raw)
  To: emacs-devel

On 24/01/12 08:41, Nikolai Weibull wrote:
> As an alternative to PCRE, which, as has already been pointed out,
> doesn’t match any of these requirements, how about RE2?
>
> http://code.google.com/p/re2/
>
> It’s written in C++, which is a minus, but it should be simple enough
> to extend it with \c and \s.
>

If we're mentioning engines, CL-PPCRE by Dr. Edi Weitz is _in_ (common) 
lisp and generally nice (also 2-clause BSD-style licensed): 
http://weitz.de/cl-ppcre/





^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-24 23:27     ` David De La Harpe Golden
@ 2012-01-25  6:07       ` Nikolai Weibull
  0 siblings, 0 replies; 23+ messages in thread
From: Nikolai Weibull @ 2012-01-25  6:07 UTC (permalink / raw)
  To: David De La Harpe Golden; +Cc: emacs-devel

On Wed, Jan 25, 2012 at 00:27, David De La Harpe Golden
<david@harpegolden.net> wrote:
> On 24/01/12 08:41, Nikolai Weibull wrote:
>>
>> As an alternative to PCRE, which, as has already been pointed out,
>> doesn’t match any of these requirements, how about RE2?
>>
>> http://code.google.com/p/re2/
>>
>> It’s written in C++, which is a minus, but it should be simple enough
>> to extend it with \c and \s.

> If we're mentioning engines, CL-PPCRE by Dr. Edi Weitz is _in_ (common) lisp
> and generally nice (also 2-clause BSD-style licensed):
> http://weitz.de/cl-ppcre/

CL-PPCRE is implemented using a backtracing NFA, so it’s not suited to
Stefan’s requirements.  It would, had it not been for this fact, been
a nice alternative.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-23 18:45             ` Štěpán Němec
@ 2012-01-30  0:31               ` Juri Linkov
  0 siblings, 0 replies; 23+ messages in thread
From: Juri Linkov @ 2012-01-30  0:31 UTC (permalink / raw)
  To: Štěpán Němec
  Cc: emacs-devel, Andreas Schwab, Stefan Monnier, Tom

> OTOH using something like PCRE would finally fix the currently erroneous
> implementation of classes like [:space:], which now is the same as \s-.
> (And personally I would gladly forgo the syntax categories for standard
> [:classes:], although I imagine the former might be used by the
> font-locking or somewhere... I never felt the need for them.)

`pcrecallout' could help to translate \c and \s to PCRE.

From `man pcrecallout(3)':

    PCRE provides a feature called "callout", which is a means of
    temporarily passing control to the caller of PCRE in the middle of
    pattern matching. The caller of PCRE provides an external function
    by putting its entry point in the global variable pcre_callout. By
    default, this variable contains NULL, which disables all calling out.

    Within a regular expression, (?C) indicates the points at which the
    external function is to be called. Different callout points can be
    identified by putting a number less than 256 after the letter C.
    For example, this pattern has two callout points:

         (?C1)abc(?C2)def



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-23 14:11 ` Stefan Monnier
  2012-01-23 14:44   ` Tom
  2012-01-24  8:41   ` Nikolai Weibull
@ 2012-02-14 17:53   ` Daniel Colascione
  2012-02-14 18:36     ` Stefan Monnier
  2012-02-20 16:19     ` Dimitri Fontaine
  2 siblings, 2 replies; 23+ messages in thread
From: Daniel Colascione @ 2012-02-14 17:53 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel, Colin Fraizer, t.matsuyama.pub

[-- Attachment #1: Type: text/plain, Size: 2311 bytes --]

On 1/23/12 6:11 AM, Stefan Monnier wrote:
>> In 2009, Tomohiro Matsuyama sent a message to this list with a patch
>> to add lookahead/lookbehind assertions to Emacs regular expressions
>> (regexps).  Is there any plan to incorporate this useful feature into
>> an official release?
>> [IMHO, it is incredibly useful. The only responses to Matsuyama-san's
>> message were positive.]
> 
> I'd like to replace the current regexp engine with one that does not
> suffer from exponential blowup (i.e. using "Thompson's NFA").
> Every feature added to the current regexp engine will make it more
> difficult to switch, so I'm not too thrilled at the idea of adding
> lookaround operators.
> OTOH, noone has submitted code to replace the current regexp engine, and
> I don't forsee I'll have the time to write it myself, so maybe I should
> just give up on this plan.

Implementing a fully general NFA-based regular expression matching
engine that support submatches is hard. The only two useful
implementations of which I'm aware are RE2 and Ville Laurikari's TRE,
both of which are two-clause BSD licensed. Laurikari wrote his thesis
[2] on the latter. TRE is the better of the two libraries, IMHO,
because it's single-pass and can work over arbitrary kinds of input
stream (like characters in a gap buffer). TRE's approximate matching
support is occasionally useful as well.

That said, I'd actually prefer to head in the other direction and
allow users to express arbitrarily rich grammars using an extended rx
syntax. The idea is basically to support parser combinator grammars,
which can be efficiently matched using a scannerless GRL parser, which
is O(N^3) time, or a memozing and backtracking "packrat" parser, which
is O(N) time and O(n) space. The end result would look a bit like Perl
6's rules.

I have some ultra-preliminary work at
https://github.com/dcolascione/jezebel.

Basically, instead of matching regular expressions more efficiently,
we should add support for efficient parsing using
declaratively-constructed grammars, of which regular expressions are
simply a special case. Such support would be incredibly useful for
processing languages like Perl and C++.

[1] http://laurikari.net/tre/about/
[2] http://laurikari.net/ville/regex-submatch.pdf

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 235 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-02-14 17:53   ` Daniel Colascione
@ 2012-02-14 18:36     ` Stefan Monnier
  2012-02-20 16:19     ` Dimitri Fontaine
  1 sibling, 0 replies; 23+ messages in thread
From: Stefan Monnier @ 2012-02-14 18:36 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: emacs-devel, Colin Fraizer, t.matsuyama.pub

> Implementing a fully general NFA-based regular expression matching
> engine that support submatches is hard. The only two useful
> implementations of which I'm aware are RE2 and Ville Laurikari's TRE,
> both of which are two-clause BSD licensed. Laurikari wrote his thesis
> [2] on the latter. TRE is the better of the two libraries, IMHO,
> because it's single-pass and can work over arbitrary kinds of input
> stream (like characters in a gap buffer). TRE's approximate matching
> support is occasionally useful as well.

I'm familiar with the work, yes.  TRE seemed like the best option last
time I looked around.

> That said, I'd actually prefer to head in the other direction and
> allow users to express arbitrarily rich grammars using an extended rx
> syntax.

I think that would be orthogonal: we want regexp support because it's
efficient (yes, our current implementation is super slow in some cases,
but it's also efficient in many important cases).

I also would like a new regexp engine to fix the "backward matching"
problem so that looking-back can work the way most people would expect,
and doesn't need a `greedy' hack.  The fact that regexps are symmetric
is a very neat property (operator precedence grammars enjoy the same
property, which is one of the reasons why I chose them as the basis for
SMIE).

> The idea is basically to support parser combinator grammars,
> which can be efficiently matched using a scannerless GRL parser, which
> is O(N^3) time, or a memozing and backtracking "packrat" parser, which
> is O(N) time and O(n) space. The end result would look a bit like Perl
> 6's rules.

While these are algorithmically reasonably efficient, it can be
difficult to make them as efficient as a regexp-only matcher for many
typical regexps.  Also it can be difficult to make them work backwards.
IOW I don't think that can replace regexps given the amount of regexps
out there we have to support.

        Stefan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-02-14 17:53   ` Daniel Colascione
  2012-02-14 18:36     ` Stefan Monnier
@ 2012-02-20 16:19     ` Dimitri Fontaine
  1 sibling, 0 replies; 23+ messages in thread
From: Dimitri Fontaine @ 2012-02-20 16:19 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: t.matsuyama.pub, Stefan Monnier, Colin Fraizer, emacs-devel

Daniel Colascione <dancol@dancol.org> writes:
> Implementing a fully general NFA-based regular expression matching
> engine that support submatches is hard. The only two useful
> implementations of which I'm aware are RE2 and Ville Laurikari's TRE,
> both of which are two-clause BSD licensed. Laurikari wrote his thesis
> [2] on the latter. TRE is the better of the two libraries, IMHO,

There's also the Henry Spencer regexp code in use in TCL and PostgreSQL,
where providing an independent library for it is currently being
discussed. See this email and its thread:

  http://archives.postgresql.org/pgsql-hackers/2012-02/msg00782.php

Regards,
-- 
dim



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Patch for lookaround assertion in regexp
  2012-01-23 11:17 Patch for lookaround assertion in regexp Colin Fraizer
  2012-01-23 14:11 ` Stefan Monnier
@ 2012-09-26  6:55 ` Tomohiro Matsuyama
  1 sibling, 0 replies; 23+ messages in thread
From: Tomohiro Matsuyama @ 2012-09-26  6:55 UTC (permalink / raw)
  To: Colin Fraizer; +Cc: t.matsuyama.pub, emacs-devel

Just reminder.

I'd like to see soon the regexp lookaround assertion feature in trunk.
As Stefan said, the feature could be a bottleneck of the regexp engine replacement.
So, when such the replacement would happen?  I'm feeling I have to wait more 10 years.

I'm also skeptical in technical that Tompson's NFA is really good for Emacs.
In my understand, Tompson's NFA is an algorithm that reduces backtracks with restricting dynamism, and may require cpu and memory overhead in many cases.
Is that really what we want for regexp engines to improve scalability?

    Tomohiro

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2012-09-26  6:55 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-01-23 11:17 Patch for lookaround assertion in regexp Colin Fraizer
2012-01-23 14:11 ` Stefan Monnier
2012-01-23 14:44   ` Tom
2012-01-23 14:50     ` Andreas Schwab
2012-01-23 15:19       ` Tom
2012-01-23 16:14         ` Andreas Schwab
2012-01-23 17:11           ` Stefan Monnier
2012-01-23 18:45             ` Štěpán Němec
2012-01-30  0:31               ` Juri Linkov
2012-01-23 15:31     ` Stefan Monnier
2012-01-24  8:41   ` Nikolai Weibull
2012-01-24 14:40     ` Stefan Monnier
2012-01-24 15:09       ` Nikolai Weibull
2012-01-24 17:34         ` Stefan Monnier
2012-01-24 23:27     ` David De La Harpe Golden
2012-01-25  6:07       ` Nikolai Weibull
2012-02-14 17:53   ` Daniel Colascione
2012-02-14 18:36     ` Stefan Monnier
2012-02-20 16:19     ` Dimitri Fontaine
2012-09-26  6:55 ` Tomohiro Matsuyama
  -- strict thread matches above, loose matches on Subject: below --
2009-06-03 23:04 Tomohiro MATSUYAMA
2009-06-04  4:47 ` Miles Bader
2009-06-04  8:27   ` Deniz Dogan

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).