Fixing ill-conditioned regular expressions. Proof of concept.

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* Fixing ill-conditioned regular expressions.  Proof of concept.
@ 2015-02-23 18:12 Alan Mackenzie
  2015-02-23 19:55 ` Paul Eggert
  2015-03-13 22:53 ` Stefan Monnier
  0 siblings, 2 replies; 21+ messages in thread
From: Alan Mackenzie @ 2015-02-23 18:12 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 4158 bytes --]

Hello, Emacs.

Please refer to Martin Rudalics's bug #19846.  Briefly, in a C Mode
buffer, with auto-fill-mode enabled, auto-repeating the spacebar at the
beginning of a comment line caused Emacs to freeze for a long time (up to
several minutes).

The reason for the freeze was an ill-conditioned regexp being constructed
by CC Mode and forward-paragraph, and this being used in regexp searches.
Here, ill-conditioned means the regexp had several subexpressions
concatenated, each of which matches arbitrary amounts of whitespace.
This causes the backtracking algorithm in the regexp engine to go crazy.

One solution would be for hackers to write better regexps.  But since the
pertinent one here is constructed partly from a user configuration, this
is difficult to enforce.  It's also difficult when a core function like
forward-paragraph uses strings supplied by arbitrary packages.

Another solution is to fix these ill-conditioned regexps, which is the
approach taken here.  The supplied file fix-re.el contains algorithms
which convert the regexp to an internal tree form, analyse and, if
necessary, correct the tree, then convert back to a regexp string.

There are two entry points to the code:
(i) fix-re, which given a regexp returns either the corrected form or
  the same regex; it caches supplied regexps, so that fixing/non-fixing
  only need be done once for each supplied regexp.
(ii) fix-re-test, a slightly lower level function bypasses the cache and
  simply tests the regexp, returning the corrected version or nil.

The runtime of fix-re is minimal - around 2ms, as measured by elp.

To use fix-re.el, byte-compile and load it.  Apply the following patch to
.../textmodes/paragraphs.el:


diff --git a/lisp/textmodes/paragraphs.el b/lisp/textmodes/paragraphs.el
index 8bcc71e..45aa95d 100644
--- a/lisp/textmodes/paragraphs.el
+++ b/lisp/textmodes/paragraphs.el
@@ -247,7 +247,8 @@ Returns the count of paragraphs left to move."
 		      fill-prefix-regexp "[ \t]*$")
 	    parsep))
 	 ;; This is used for searching.
-	 (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)"))
+	 (sp-parstart (fix-re
+	  (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)")))
 	 start found-start)
     (while (and (< arg 0) (not (bobp)))
       (if (and (not (looking-at parsep))


.  With this in place, the freezing observed in bug #19846 no longer
happens.


The particular ill-conditioned sp-parstart which caused bug #19846 was


                           2             1    1            3             1
    ^[ \t]*\(?:[ \t]*\(//+\|\**\)[ \t]*$\|^^L\|[ \t]*\(//+\|\**\)[ \t]*$\|^^L\)
            1         2         2                     3         3             1
	                      ^                               ^

.  Note the two expressions matching arbitrary amounts of WS sandwiching
\(//+\|\**), which also matches the empty string.  Note in particular,
the marked character.  (N.B. Note also the repetition which results from
the way the regexp was constructed, which will double the execution time
of the RE engine here.)

Here is this regexp after processing by fix-re:

                         3                1    1                           1
    ^[ \t]*\(?:\(?:\(//+\|\*+\)[ \t]*\)?$\|^^L\|\(?:\(//+\|\*+\)[ \t]*\)?$\|^^L\)
            1   2   3       ^ 3       2          4   5      ^  5       4        1

.  Note how the subexpression 3 has become @dfn{de-emptified} - it no
longer matches the empty string, but matches everything else that the
original subexp did (and nothing more).

Abstractly, R*(R*ER*).... (where E is the subexp matching the empty
string) has been converted to R*((E@R*)?....), where E@ is the
de-emptification of E.

fix-re.el is currently a proof of concept, and is in a somewhat rough
state.  Comments and doc strings need attention, and the coding style is
not consistent throughout the file.  An additional transformation to
remove repeated alternatives from within \(..\) is also wanted.  But most
importantly, \{..\} expressions aren't handled at all.

I suggest that fix-re.el get incorporated into Emacs, after being tidied
up.

Comments?

-- 
Alan Mackenzie (Nuremberg, Germany).


[-- Attachment #2: fix-re.el --]
[-- Type: text/plain, Size: 28254 bytes --]

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; f i x - r e . e l
;;
;; Fix ill-conditioned regular expressions.
;; 
;; Written by Alan Mackenzie, 2015-02.
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; 
;; Format of the Abstract Syntax Table
;;
;; The AST is an internal representation of the regular expression string.
;; Elements are represented as follows:
;; 
;; 1. Ordinary characters are represented as themselves.  So "abc" will be
;; (?a ?b ?c)
;; 
;; 2. Enclosing parentheses are represented as a list whose car is a symbol
;; whose print-name is the start of the list, and each of whose alternatives
;; is itself a list.  So "\\(ab\\|cd\\)" will be ('\\( (?a ?b) (?c ?d)).  Neither
;; "\\|" or "\\)" is explicitly represented.  Openers like "\\(?6:" are
;; handled.
;;
;; 3. The top level of the AST is a special case of 2. whose symbol is '|.
;; Thus the entire regexp "ab\|cd" is represented by ('| (?a ?b) (?c ?d)).
;;
;; 4. Character alternatives have '[ in the car, and the internals of the
;; brackets in the cdr.  So "[^])]" becomes ('[ . (?^ ?] ?\)).  In place of a
;; character there may be an "escape list".  Character classes (like
;; [:alnum:]) are handled vaguely properly.  The terminating "]" is not
;; explicitly represented.
;;
;; 5. +, *, and ? are represented by conses with the pertinent symbol in the
;; car and the repreated/optional expression in the cdr.  So "a+" becomes
;; ('+ . ?a), \\(foo\\|bar\\)* would be ('* . ('\\( (?f ?o ?o) (?b ?a ?r)))).
;;
;; 6. Backslashed characters are represented by a list whose first element is
;; <esc>.  So "\_<foo" becomes ((?\e ?_ ?<) ?f ?o ?o).  Such a list may appear
;; in any position where a plain character can be.
;;
;; 7. The repetition operators like "\{2,3\}" are currently (2015-02-20) not
;; handled.  FIXME!!!
;;
;; There follow the routines for conversion from regexp to AST and back
;; again.
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

(defvar fix-re--i 0)
(defvar fix-re--len 0)
(defvar fix-re--re nil)

(defmacro fix-re--next-ch ()
  '(progn (setq ch (aref fix-re--re fix-re--i)
		fix-re--i (1+ fix-re--i))
	  ch))
(defmacro fix-re--peek-ch ()
  '(aref fix-re--re fix-re--i))
(defmacro fix-re--append-ch (c)
  ;; N.B. Here c need not be a character.  It might instead be a list or
  ;; symbol.
  `(setcar tree (cons ,c (car tree))))

(defmacro fix-re--is-\( (l)
  `(and (consp ,l)
	(symbolp (car ,l))
	(string-match "|\\|\\\\(\\(\\?[0-9]*:\\)?" (symbol-name (car ,l)))
	(car ,l)))

(defun fix-re--build-AST-char-alt (tree)
  (let (ch)
    (fix-re--next-ch)
    (when (eq ch ?^)
      (fix-re--append-ch ch)
      (fix-re--next-ch))
    (while (progn	     ; The first character might be ].  Don't test it!
	     (when (and (eq ch ?\[) (eq (fix-re--peek-ch) ?:)) ; e.g. [:alnum:]
	       (fix-re--append-ch ch) (fix-re--next-ch) (fix-re--append-ch ch) (fix-re--next-ch) ; the [:
	       (while (not (eq ch ?:))
		 (fix-re--append-ch ch) (fix-re--next-ch))
	       (fix-re--append-ch ch) (fix-re--next-ch)) ; leave the ] till later.
	     (when (eq ch ?\\)
	       (fix-re--append-ch ?\\)
	       (fix-re--next-ch)) ; Closing :
	     (fix-re--append-ch ch)
	     (fix-re--next-ch)
	     (not (eq ch ?\]))))
    (setcar tree (nreverse (car tree)))
    tree))

(defun fix-re--subbuild-AST (tree)
  "The recursive part of `fix-re--build-AST'.
TREE is a non-empty list, onto which we push the AST."
  (let (ch sym)
    (catch 'done
      (while (< fix-re--i fix-re--len)
	(fix-re--next-ch)
	(cond
	 ((eq ch ?\\)
	  (fix-re--next-ch)
	  (cond
	   ((eq ch ?\()
	    (setq sym '\\\()
	    (when (eq (fix-re--peek-ch) ??) ; shy group or explicitly numbered group
	      (fix-re--next-ch)		    ; ??
	      (fix-re--next-ch)		    ; 0-9 or ?:
	      (let ((str (concat "\\\(\?" (string ch))))
		  (while (not (eq ch ?:))
		    (fix-re--next-ch)
		    (setq str (concat str (string ch))))
		  (setq sym (intern str))))
	    (let ((subtree (list nil)))
	      (fix-re--append-ch `(,sym ,@(fix-re--subbuild-AST subtree)))))
	   ((eq ch ?\|)
	    (setcar tree (nreverse (car tree)))
	    (push nil tree))
	   ((eq ch ?\))
	    (setcar tree (nreverse (car tree)))
	    (setq tree (nreverse tree))
	    (throw 'done tree))
	   ((memq ch '(?_ ?s ?S ?c ?C))	; \_< or \s., etc.
	    (fix-re--append-ch (list ?\\ ch (fix-re--next-ch))))
	   (t
	    (fix-re--append-ch (list ?\\ ch)))))

	 ((memq ch '(?* ?+ ??))
	  (setq sym (intern (string ch)))
	  (setcar (car tree) (cons sym (caar tree))))

	  ((eq ch ?\[)
	   (let ((subtree (list nil)))
	     (fix-re--append-ch `(\[ ,@(car (fix-re--build-AST-char-alt subtree))))))

	  (t				; ordinary character
	   (fix-re--append-ch ch))))
      (setcar tree (nreverse (car tree)))
      (nreverse tree))))
	   
(defun fix-re--build-AST (regexp)
  "Construct the AST corresponding to the input regexp REGEXP.
REGEXP is assumed to be syntactically correct."
  (let ((tree (list nil))
	)
    (setq fix-re--i 0
	  fix-re--len (length regexp)
	  fix-re--re regexp)
    `(| ,@(fix-re--subbuild-AST tree))))

(defun fix-re--dump-AST (s)
  "Convert the AST S to a regexp string and return it."
  (cond
   ((null s)
    "")
   ((numberp s)				; i.e. a character
    (string s))
   ((eq (car s) '|)
    (mapconcat 'fix-re--dump-AST (cdr s) "\\|"))
   ((fix-re--is-\( s)
    (concat (symbol-name (car s)) (mapconcat 'fix-re--dump-AST (cdr s) "\\|") "\\)"))
   ((memq (car s) '(* + \?))
    (concat (fix-re--dump-AST (cdr s)) (symbol-name (car s))))
   ((eq (car s) '\[)
    (concat "[" (fix-re--dump-AST (cdr s)) "]"))
   ((consp (car s))
    (concat (fix-re--dump-AST (car s)) (fix-re--dump-AST (cdr s))))
   ((numberp (car s))			; i.e. a character
    (concat (string (car s)) (fix-re--dump-AST (cdr s))))
   (t (error "(car s) is %s" (car s)))))

\f
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; PTR and AD
;; ----------
;;
;; Whilst building and transforming trees, what we really need is pointers to
;; either the car or cdr of a cons cell, to enable us to overwrite them.  The
;; only tools we have for this job are `setcar' and `setcdr', but these
;; primitives, unadorned, are too cumbersome for the manipulations we will be
;; doing.  Enter `ptr' and `ad'.  Together, these emulate the notion of
;; pointer.
;;
;; `ad' takes one of the values 'car, 'cdr, or 'cadr.  'ptr' is always a cons
;; cell, `ad' specifiying how to reach the element we're interested in.  For
;; example, if we needed an overwrite routine `fix-re--overwrite-via-ptr' (we don't
;; seem to), the overwrite operation would be one of:
;; car:  (setcar ptr val);
;; cdr:  (setcdr ptr val);
;; cadr: (setcar (cdr ptr) val).
;;
;; Note that when `ptr'/`ad' are used for list operations, `ad' may not be
;; 'cdr.
;; 
;; A set of tree access primitives follows.  Throughout the rest of this
;; source file, `ptr' and `ad' will not be further extensively documented.
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(defmacro fix-re--ptr-get (ptr ad)
  "Get the element pointed to by PTR and AD."
  `(cond ((eq ,ad 'car) (car ,ptr))
	 ((eq ,ad 'cdr) (cdr ,ptr))
	 (t (cadr ,ptr))))

(defmacro fix-re--ptr-next (ptr ad)
  "Advance PTR/AD to the next in the list, and return what it then points to.
Here, PTR and AD must both be variables."
  `(cond
    ((eq ,ad 'car)
     (setq ,ad 'cadr)
     (cadr ,ptr))
    ((eq ,ad 'cadr)
     (setq ,ptr (cdr ,ptr))
     (cadr ,ptr))
    (t (error "'fix-re--ptr-next called with 'cdr"))))

(defun fix-re--bind-link-nil (ptr ad)
  "\"Bind\" PTR/AD's \"next field\" to nil, returning the old value."
  (cond
   ((eq ad 'car)
    (prog1
	(cdr ptr)
      (setcdr ptr nil)))
   ((eq ad 'cadr)
    (prog1
	(cddr ptr)
      (setcdr (cdr ptr) nil)))
   (t (error "fix-re--bind-link-nil called with 'cdr"))))

(defun fix-re--restore-link (lnk ptr ad)
  "Restore PTR/AD's \"next field\" to LNK."
  (cond
   ((eq ad 'car) (setcdr ptr lnk))
   ((eq ad 'cadr) (setcdr (cdr ptr) lnk))
   (t (error "fix-re--restore-link called with 'cdr"))))

(defmacro fix-re--chop (ptr ad)
  "Remove from the tree the element pointed to by PTR and AD.  Return the element."
  `(cond
    ((eq ,ad 'car)
     (prog1
	 (car ,ptr)
       (setcar ,ptr (cadr ,ptr))
       (setcdr ,ptr (cddr ,ptr))))
    ((eq ,ad 'cadr)
     (prog1
	 (cadr ,ptr)
       (setcdr ,ptr (cddr ,ptr))))
    (t (error "fix-re--chop called with 'cdr"))))

(defmacro fix-re--insert (elt ptr ad)
  "Insert ELT into the tree, just before the element point to by PTR and AD."
  `(cond
    ((eq ,ad 'car)
     (progn
       (setcdr ,ptr (cons (car ,ptr) (cdr ,ptr)))
       (setcar ,ptr ,elt)))
    ((eq ,ad 'cadr)
     (setcdr ,ptr (cons ,elt (cdr ,ptr))))
    (t (error "fix-re--insert called with 'cdr"))))

(defun fix-re--insert-after (elt ptr ad)
  "Insert ELT into the tree, just after the element pointed to by PTR/AD."
  (cond
   ((eq ad 'car)
    (setcdr ptr (cons elt (cdr ptr))))
   ((eq ad 'cadr)
    (setcdr (cdr ptr) (cons elt (cddr ptr))))
   (t (error "fix-re--insert-after called with 'cdr"))))

(defmacro fix-re--chop-+* (ptr ad)
  "Change R+, R*, or R? to R.
PTR points to a cons like (* . R).  Change it to R."
  `(cond
    ;; Remember: we have a dotted list here like ('+ . ?a)
    ((eq ,ad 'car)
     (setcar ,ptr (cdar ,ptr)))
    ((eq ,ad 'cdr)
     (setcdr ,ptr (cddr ,ptr)))
    (t (setcar (cdr ,ptr) (cdadr ,ptr)))))

(defmacro fix-re--+*ify (sym ptr ad)
  "Change R to R+, R*, or R?, where SYM is '+, '*, or '?"
  `(cond
    ((eq ,ad 'car)
     (setcar ,ptr (cons ,sym (car ,ptr))))
    ((eq ,ad 'cdr)
     (setcdr ,ptr (cons ,sym (cdr ,ptr))))
    (t (setcar (cdr ,ptr) (cons ,sym (cadr ,ptr))))))
    
(defmacro fix-re--splice-list (lst ptr ad)
  "Splice the list LST into the tree, just before the element pointed to by
  PTR and AD."
  `(when ,lst
     (cond
      ((eq ,ad 'car)
       (progn
	 (setcdr ,ptr (append (cdr ,lst) (cons (car ,ptr) (cdr ,ptr))))
	 (setcar ,ptr (car ,lst))))
      ((eq ,ad 'cadr)
       (setcdr ,ptr (append ,lst (cdr ,ptr))))
      (t (error "fix-re--splice-list called with 'cdr")))))

(defmacro fix-re--chop-\( (ptr ad)
  "PTR AD points to a \( construct with only one alternative.  Remove the
\(..\) from it."
  (error "fix-re--chop-\( hasn't been tested yet!!!  FIXME!!!")
  `(cond
    ((eq ,ad 'car)
     (setcar ,ptr (caadar ,ptr)))
    ((eq ,ad 'cdr)
     (setcdr ,ptr (caaddr ,ptr)))
    (t (setcar (caadr (cadr ,ptr))))))

;; (defmacro fix-re--wrap-in-\( (sym ptr ad)
;;   "Transform R to \\(R\\), where R is the element pointed to by PTR and AD.
;; SYM is a valid symbol to open such an expression, such as '\\\(, '\\\(\?6: or
;; '|."
;;   `(cond
;;     ((eq ,ad 'car)
;;      (setcar ,ptr (list ,sym (list (car ,ptr)))))
;;     ((eq ,ad 'cdr)
;;      (setcdr ,ptr (list ,sym (list (cdr ,ptr)))))
;;     (t (setcar (cdr ,ptr) (list ,sym (list (cadr ,ptr)))))))

(defun fix-re--wrap-in-\( (sym ptr ad)
  "Transform R to \\(R\\), where R is the element pointed to by PTR and AD.
SYM is a valid symbol to open such an expression, such as '\\\(, '\\\(\?6: or
'|."
  (cond
   ((eq ad 'car)
    (setcar ptr (list sym (list (car ptr)))))
   ((eq ad 'cdr)
    (setcdr ptr (list sym (list (cdr ptr)))))
   (t (setcar (cdr ptr) (list sym (list (cadr ptr)))))))

(defun fix-re--wrap-list-in-\( (sym ptr ad)
  "PTR points to the first element of a list.  Wrap that list in \\(..\\),
  without adding an extra layer of parentheses."
  (cond
   ((eq ad 'car)
    (setcar ptr (list sym (cons (car ptr) (cdr ptr))))
    (setcdr ptr nil))
   ((eq ad 'cadr)
    (setcar (cdr ptr) (list sym (cons (cadr ptr) (cddr ptr))))
    (setcdr (cdr ptr) nil))
   (t (error "fix-re--wrap-list-in-\( was called with 'cdr"))))

(defmacro fix-re--setres (call)
  "Call CALL, and set `res' to t if the result is non-nil."
  `(if ,call (setq res t)))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

(defun fix-re--common-head-from-alts (e)
  "E is a tree element begining with \(, etc.  Calculate the common head of all subelements."
  (let* ((elts (cdr e))
	 (head (copy-tree (car elts)))
	 elt)
    (while (and (> (length elts) 1)
		head)
      (setq elts (cdr elts)
	    elt (car elts))
      (setq head (fix-re--common-head head elt)))
    head))

(defun fix-re--common-head (e0 e1)
  "E0 and E1 could be any type of element."
  (let (
	)
    (cond
     ((or (atom e0) (atom e1))
      (and (equal e0 e1) e0))
     ((and (fix-re--is-\( e0)
	   (eq (car e0) (car e1)))
      (let* ((e0-common (fix-re--common-head-from-alts e0))
	     (e1-common (fix-re--common-head-from-alts e1)))
	(cons (car e0)
	      (fix-re--common-head e0-common e1-common))))
     ((and (symbolp (car e0))
	   (eq (car e0) (car e1)))
      (and (equal e0 e1) e0))
     ((not (or (symbolp (car e0)) (symbolp (car e1))))
      (let (acc elt)
	(while
	    (and e0 e1
		 (setq elt (fix-re--common-head (car e0) (car e1)))
		 (equal elt (car e0)))
	  (push elt acc)
	  (setq elt nil)
	  (setq e0 (cdr e0)
		e1 (cdr e1)))
	(if elt (push elt acc))
	(nreverse acc)))
     (t nil))))

(defun fix-re--remove-head (head ptr)
  "Remove HEAD from front of element which is the cadr of PTR, returning the remains.
HEAD is a list of elements which are known to match the
head of ELT."
  (let ((elt (cadr ptr))
	)
    (while head
       (cond
	((equal (car head) (car elt))
	 (setq head (cdr head)  elt (cdr elt))
	 (fix-re--chop (cadr ptr) 'car))
	((fix-re--is-\( (car elt))
	 (mapc (lambda (e)
		 (fix-re--remove-head head e))
	       (cdar elt))
	 (setq head nil))
	(t (error "fix-re--remove-head: head = %s, ptr = %s" head ptr))))
    ))

(defun fix-re--RA|RB->R\(A|B\) (ptr ad in-alt)
  "Transform any \(RA\|RB\|RC\) into R\(A\|B\|C\), or
\(R\(?:A\|B\|C\)\).  FIXME!!! Correct this doc string, since *ptr
isn't necessarily an alternatives list.

Extract the head common to all the alternatives in the list which
is pointed to by PTR and AD, and insert it into the tree
structure before that list.  If IN-ALT is non-nil, additionally wrap the final
expression in \(..\).  (We need to do this when the enclosing tree element is
itself a \( construct.  Otherwise, rather than creating a sequence, we would
end up with alternatives.)

Return non-nil when a transformation was done, else nil."
  (let ((alt-list (fix-re--ptr-get ptr ad))
	)
    (when (and
	   (fix-re--is-\( alt-list)
	   (> (length alt-list) 2))	; i.e. the construct has a \|
      (let ((head (fix-re--common-head-from-alts alt-list))
	     (inner-ptr ptr)
	     (inner-ad ad)
	    )
	(when head
	  (mapl (lambda (elt-ptr)
		  (when (cdr elt-ptr)
		    (fix-re--remove-head head elt-ptr)))
		alt-list)
	  (when in-alt
	    (fix-re--wrap-in-\( (car alt-list) ptr ad)
	    (setcar alt-list '\\\(\?:)
	    (setq inner-ptr (cadar ptr))
	    (setq inner-ad 'car))
	  (fix-re--splice-list head inner-ptr inner-ad)
	  t
	  )))))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


(defun fix-re--R+R*->R+ (ptr ad)
  "Transform any R+R* to R+.
All combinations of +, *, and ? are handled.
Return 'shortened, t or nil.
...
"
  (if (eq ad 'cdr) (error "fix-re--R+R*->R+ got 'cdr"))
  (let ((e0 (if (eq ad 'car) (car ptr) (cadr ptr)))
	(e1 (if (eq ad 'car) (cadr ptr) (caddr ptr))))
    (when (and (consp e0) (consp e1))
      (let ((op0 (car e0))
	    (op1 (car e1))
	    ;; (link (if (eq ad 'car) (cddr ptr) (cdddr ptr)))
	    op
	    )
	(when
	    (and (memq op0 '(+ * \?))
		 (memq op1 '(+ * \?))
		 (equal (cdr e0) (cdr e1))) ; Both Rs are the same.
	  (cond
	   ((and (eq op0 '\?) (eq op1 '\?)) ; Cant combine R?R?
	    nil)
	   ((and (eq op0 '+) (eq op1 '+)) ; R+R+ -> RR+
	    (fix-re--chop-+* ptr 'car)
	    t)
	   (t
	    (setq op
		    (cond
		     ((or (eq op0 '+) (eq op1 '+)) '+)
		     ((or (eq op0 '*) (eq op1 '*)) '*)))
	    (setcar e0 op)
	    (fix-re--chop (if (eq ad 'car) ptr (cdr ptr)) 'cadr)
	    'shortened))
	  )))))

(defun fix-re--multi-R+R*->R+ (ptr ad)
  "Transform R+R*R?.... into R+.  PTR/AD point at the first R."
  (let (res result
	)
    (while
	(and (>= (length ptr) (if (eq ad 'car) 2 3))
	     (setq res (fix-re--R+R*->R+ ptr ad))
	     (if res (setq result t))
	     (eq res 'shortened)))
    result))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

(defun fix-re--matches-empty-p (tree)
  "Return non-nil when the empty string matches TREE."
  (let (cur
	)
    (cond
     ((fix-re--is-\( tree)
      (or (null (cdr tree))
	  (progn
	    (setq cur (cdr tree))
	    (while (and cur
			(not (or
			      (null cur)
			      (fix-re--matches-empty-p (car cur)))))
	      (setq cur (cdr cur)))
	    cur)))
     ((atom tree)
      (not tree))
     ((memq (car tree) '(* \?)))
     ((eq (car tree) '+)
      (fix-re--matches-empty-p (cdr tree)))
     ((eq (car tree) '\[) nil)
     ((eq (car tree) ?\\) nil)
     (t					; Sequntial element.
      (setq cur tree)
      (while (and cur
		  (fix-re--matches-empty-p (car cur)))
	(setq cur (cdr cur)))
      (not cur)))))

(defun fix-re--de-emptify (ptr ad)
  "Destructively change PTR so that the tree it points to no longer matches the empty string,
but matches any non-empty string, and no others, that the original PTR did.

Expresssions like ^, $, \<, ...., which \"match the empty string\"
at particular places, are left unchanged."
  (let ((tree (fix-re--ptr-get ptr ad))
	cur cur-ptr cur-ad
	)
    (cond
     ((or
       (null tree)
       (atom tree)
       (memq (car tree) '(?\[))
       )
      nil)
     ((fix-re--is-\( tree)
      (setq cur-ptr tree
	    cur-ad 'car)
      (while (setq cur (fix-re--ptr-next cur-ptr cur-ad))
	(fix-re--de-emptify cur-ptr cur-ad))
      )
     ((eq (car tree) '+)
      (fix-re--de-emptify tree 'cdr))
     ((eq (car tree) '*)
      (fix-re--de-emptify tree 'cdr)
      (setcar tree '+))			; <-------
     ((eq (car tree) '\?)
      (fix-re--de-emptify tree 'cdr)
      (fix-re--chop-+* ptr ad))		; <-------
     ((eq (car tree) ?\\)
      nil)
     (t					; Sequential element.
      (when (fix-re--matches-empty-p tree) ; i.e. all elements match the empty string.
	(if (cdr tree)		      ; EF -> (EF@|E@), (where E@ is de-emptified E).
	    (let ((new (list (copy-tree (car tree))))); E   EF
	      (fix-re--de-emptify new 'car)		      ; E@    EF
	      (fix-re--de-emptify tree 'cdr)		      ; E@    EF@
	      (fix-re--wrap-list-in-\( '\\\(\?: tree 'car) ; E@    \(EF@\)
	      (setq tree (car (fix-re--ptr-get ptr ad)))
	      (fix-re--insert-after new tree 'cadr))       ; \(EF@\|E@\)
	  (fix-re--de-emptify tree 'car)))))))

(defun fix-re--do-R*ER*-transform (R0-ptr R0-ad empty0-ptr empty1-ptr R1-ptr)
  "Transformation R*ER* -> R*(E@R*)? or R*ER+ -> R*(E@R+|R).
Here, empty0/1-ad and R1-ad are always 'cadr, so we omit these from the
parameter list.
R0/R1-ptr point to the enclosing R*, R+ expressions,
empty0/1-ptr to the first and last expressions between them,
all of which match the empty string."
  (let* ((R0 (fix-re--ptr-get R0-ptr R0-ad))
	 (R0-R (cdr R0))
	 (R0-+* (car R0))
	 (R1-+* (car (fix-re--ptr-get R1-ptr 'cadr)))
	 new link
	 )
    ;; First "de-emptify" the empty expression sequence.
    (if (eq empty0-ptr empty1-ptr)
	;; Special case: just one empty matcher
	(fix-re--de-emptify empty0-ptr 'cadr)
      ;; General case: temporarily chop off the link on the last of
      ;; the list of empty matchers, so as to de-emptify them as a
      ;; sequence.
      (setq link (fix-re--bind-link-nil empty1-ptr 'cadr))
      (fix-re--de-emptify empty0-ptr 'cdr)
      (fix-re--restore-link link empty0-ptr 'cadr)
      ;; We've now lost R1-ptr.  Restore it.  The deemptification will have
      ;; left everything between empty0 and empty1 as a single list element.
      ;; So what now follows empty0 is R1.
      ;; We have also lost empty1-ptr, but we don't need it any more.
      (setq R1-ptr (cdr empty0-ptr)))	; R1-ad remains 'cadr.

    ;; Now transform according to whether we have R* or R+ or a mixture.
    ;; Do ER* -> (E@R*).
    ;; First set the link on R* to nil, to "terminate" the list.  This link
    ;; will be set on the newly formed \(..\) construct.
    (setq link (fix-re--bind-link-nil R1-ptr 'cadr))
    (fix-re--wrap-list-in-\( '\\\(\?: empty0-ptr 'cadr)
    ;; empty0-ptr/ad now points at the new \(..\).
    (fix-re--restore-link link empty0-ptr 'cadr)

    (if (eq R1-+* '*)
	;; Apply ? to \(..\) to give R*(E@R*)?
	(fix-re--+*ify '\? empty0-ptr 'cadr)
      ;; Change R*(E@R+) to R*(E@R+|R)
      (setq new (copy-tree R0-R))	; R   R*(E@R+)
      (fix-re--insert-after new (cadr empty0-ptr) 'cadr) ; R*(E@R+|R)
      )))

(defun fix-re--R+ER*->R+\(E@R*\)\? (ptr ad)
  "Do R+ER* -> R+(E@R*)? or R+ER+ -> R+(E@R+|R) on the whole list.
PTR/AD point at the first element of the sequential list.

Here, E is a non-empty sequence of elements which are matched by
the empty string, E@ is the \"de-emptified\" version of E."
  ;; We must perform the loop rightmost transformations first.  To see this,
  ;; consider R*ER*FR* done leftmost first.  The first transformation takes us
  ;; to R*(E@R*)?FR*.  We're now stuck, as the middle R* is no longer
  ;; "exposed" to the last R*, and the end expression is still ill-formed.
  ;; Done rightmost first, R*ER*FR* -> R*ER*(F@R*)? -> R*(E@R*)?(F@R*)?, which
  ;; is well-formed.
  (let (res)
    (let ((ptr ptr) (ad ad))
      (when (fix-re--ptr-next ptr ad)
	(setq res (fix-re--R+ER*->R+\(E@R*\)\? ptr ad))))

    (let* ((elt-ptr ptr)
	   (elt-ad ad)
	   (elt (fix-re--ptr-get elt-ptr elt-ad))
	   R0-R
	   empty0-ptr empty1-ptr R1-ptr	; No need for ..-ad's, since
					; these will always be 'cadr.
	   R1-+*
	   )
      (or
      ;; Isy `elt' R+ or R*?
       (when (and (consp elt)
		  (memq (car elt) '(+ *)))
	 (setq R0-R (cdr elt))
	 ;; Is the next element one matching the empty string, and which
	 ;; isn't R+ or R*?
	 (setq elt (fix-re--ptr-next elt-ptr elt-ad))
	 (when (and elt
		    (fix-re--matches-empty-p elt)
		    (not (and (consp elt)
			      (memq (car elt) '(+ *))
			      (equal (cdr elt) R0-R))))
	   (setq empty0-ptr elt-ptr ; Remember first empty. -ad is implicitly 'cadr
		 empty1-ptr elt-ptr )	; Remember last empty.
	   ;; Read the elements which match empty, but aren't R+ or R*.
	   (while (and (setq elt (fix-re--ptr-next elt-ptr elt-ad))
		       (fix-re--matches-empty-p elt)
		       (not (and (consp elt)
				 (memq (car elt) '(+ *))
				 (equal (cdr elt) R0-R))))
	     (setq empty1-ptr elt-ptr))
	   ;; Have we found the matching R+ or R*?
	   (when (and elt
		      (consp elt)
		      (memq (car elt) '(+ *))
		      (equal (cdr elt) R0-R))
	     ;; Yes.  We're in business.
	     (fix-re--do-R*ER*-transform ptr ad empty0-ptr empty1-ptr elt-ptr)
	     t)))
       res))))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; R*(R*A|B) -> R*(A|B)
(defun fix-re--do-R+\(R*A|B\)-transform (R-rep alt)
  "Attempt a R+(R*A|B) -> R+(A|B) transformation.
R-REP is a cons representing either R+ or R*.  ALT represents a
form of the form \(..\|..\|...\).
"
  (let* ((R-R (cdr R-rep))
	 (R-+* (car R-rep))
	 (ptr alt)
	 (ad 'cadr) ; Point to the second elt. of the list, the first being '\\\(
	 (elt (fix-re--ptr-get ptr ad))
	 res car-elt elt-+*
	 )
    (while elt				; (R*A)
      (when (and (consp elt)		; This should always be true
		 (setq car-elt (car elt)) ; This is now R*A
		 (consp car-elt)
		 (memq (car car-elt) '(+ *))
		 (equal (cdr car-elt) R-R))
	(setq elt-+* (car car-elt))
	(if (eq elt-+* '*)
	    (fix-re--chop elt 'car)
	  (fix-re--chop-+* elt 'car))
	(setq res t))
      (setq elt (fix-re--ptr-next ptr ad)))
    res))
	


(defun fix-re--R+\(R*A|B\)->R*\(A|B\) (ptr ad)
  "Do the transition on every pertinent element pairs in the sequence.
PTR/AD point to the first element in the sequential list."
  (let ((elt (fix-re--ptr-get ptr ad))
	R-rep res)
    (while elt
      (if (and (consp elt)
	       (memq (car elt) '(+ *)))
	  (progn
	    (setq R-rep elt
		  elt (fix-re--ptr-next ptr ad))
	    (when (fix-re--is-\( elt)
	      (if (fix-re--do-R+\(R*A|B\)-transform R-rep elt)
		  (setq res t))
	      (setq elt (fix-re--ptr-next ptr ad))))
	(setq elt (fix-re--ptr-next ptr ad))))
    res))

\f
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Top level AST transformation routines.
(defun fix-re--transform (ptr ad in-alt)
  "Transform the expression pointed at by PTR/AD.  Return non-nil when we did something.
IN-ALT is non-zero iff the expression is directly contained in \(..\)."
  (let ((e (fix-re--ptr-get ptr ad))
	cur res cur-ptr cur-ad
	)

    (when (and (consp e)
	       (not (eq (car e) ?\[))  ; character alternative, nothing to do.
	       (not (eq (car e) ?\e))) ; escape pair/triple, nothing to do.
      ;; First recurse on all subexpressions
      (cond
       ((fix-re--is-\( e)
	(setq cur e)
	(while (cdr cur)
	  (fix-re--setres (fix-re--transform cur 'cadr t))
	  (setq cur (cdr cur))))
       ((symbolp (car e))		; +, *, ?.
	(fix-re--setres (fix-re--transform e 'cdr nil)))
       (t				; Sequential list.
	(fix-re--setres (fix-re--transform e 'car nil))
	(setq cur e)
	(while (cdr cur)
	  (fix-re--setres (fix-re--transform cur 'cadr nil))
	  (setq cur (cdr cur)))))

      ;; Now operate on the supplied list.
      (cond
       ((fix-re--is-\( e)
	;; Extract any common head from inside \(..\).
	;; We're sending in a pointer to e to fix-re--RA|RB...
	(fix-re--setres (fix-re--RA|RB->R\(A|B\) ptr ad in-alt))
	)
       ((symbolp (car e))		; +, *, ?
	)				; Nothing to do here - 
       (t				; Sequential list.
	(setq cur-ptr e  cur-ad 'car)
	(while
	    (progn
	      (fix-re--setres (fix-re--multi-R+R*->R+ cur-ptr cur-ad))
	      (fix-re--ptr-next cur-ptr cur-ad)))
	(setq cur-ptr e  cur-ad 'car)
	(fix-re--setres (fix-re--R+ER*->R+\(E@R*\)\? e 'car))
	(fix-re--setres (fix-re--R+\(R*A|B\)->R*\(A|B\) e 'car)))))
	
    res))

(defun fix-re--transform-AST (tree)
  "The top-level transformation function.  TREE will have '| as its car.
Return the transformed (or newly expanded) tree.
"
   (let ((ptr (list tree)))
     (fix-re--transform ptr 'car t)))
	
\f
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Entry point routines.
(defun fix-re-test (re)
  "Attempt to fix the regexp RE.
If no fix is needed, return nil, otherwise return the fixed
regexp.

Note: this function doesn't touch the cache `fix-re--cache'."
  (string-match re "")			; Throw an error if re is invalid.
  (let* ((ast (fix-re--build-AST re)))
    (and (fix-re--transform-AST ast)
	 (fix-re--dump-AST ast))))

(defcustom fix-re-cache-limit 40
  "Maximum number of entries in the cache for `fix-re'."
  :type 'integer
  :group 'isearch)
(defvar fix-re--cache-overflowed nil
  "Non-nil when the association list `fix-re--cache' has overflowed its size limit `fix-re--cache-limit'")
(defvar fix-re--cache nil
  "An association list linking input regexps with fixed regexps.
The key of each element is the input regexp.  The value is nil if the key
regexp is OK, otherwise it's the replacement regexp.")

(defun fix-re (re)
  "Check the regexp RE for certain solecisms, and if any are found, fix them.
Return the fixed regexp, if a fix was done, otherwise RE.

Regexps passed to `fix-re' for the first time are inserted into a
cache `fix-re--cache', so that calling `fix-re' repeatedly with an
argument is fast.  The limit on `fix-re--cache''s size is the
configurable option `fix-re-cache-limit'.

The typical errors corrected are ......  FIXME!!!

If any fixes were needed, return the fixed regexp, otherwise return RE."
  (let ((elt (assoc re fix-re--cache)) fix)
    (if elt
	(progn
	  (unless (eq elt (car fix-re--cache))
	    (setq fix-re--cache (delq elt fix-re--cache))
	    (push elt fix-re--cache))
	  (or (cdr elt) re))
      (setq fix (fix-re-test re))
      (when (> (length fix-re--cache) (max fix-re-cache-limit 1))
	(or fix-re--cache-overflowed (setq fix-re--cache-overflowed re))
	(setq fix-re--cache (butlast fix-re--cache)))
      (push (cons re fix) fix-re--cache)
      (or fix re))))

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-23 18:12 Fixing ill-conditioned regular expressions. Proof of concept Alan Mackenzie
@ 2015-02-23 19:55 ` Paul Eggert
  2015-02-23 20:21   ` Alan Mackenzie
  2015-02-24  6:20   ` Philipp Stephani
  2015-03-13 22:53 ` Stefan Monnier
  1 sibling, 2 replies; 21+ messages in thread
From: Paul Eggert @ 2015-02-23 19:55 UTC (permalink / raw)
  To: Alan Mackenzie, emacs-devel

Would it be possible to fix the regular expression engine, so that 
programs don't have to worry about parsing and reformulating regexps so 
that they're "nice"?



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-23 19:55 ` Paul Eggert
@ 2015-02-23 20:21   ` Alan Mackenzie
  2015-02-23 22:19     ` Paul Eggert
  2015-02-24  6:20   ` Philipp Stephani
  1 sibling, 1 reply; 21+ messages in thread
From: Alan Mackenzie @ 2015-02-23 20:21 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

Hello, Paul.

On Mon, Feb 23, 2015 at 11:55:24AM -0800, Paul Eggert wrote:
> Would it be possible to fix the regular expression engine, so that 
> programs don't have to worry about parsing and reformulating regexps so 
> that they're "nice"?

Anything's possible.

Somebody once explained to me that it was all about NFAs and DFAs (which
I only understand exceptionally vaguely) and that Emacs has the "wrong"
one of these.

I talked briefly with Kaz Kylheku on comp.theory a little while ago, and
he opined:

"I suspect, though, that what you're calling "regexp engine" may
 actually be some Perl-inspired job which requires a stack for
 backtracking, and which basically interprets the regular expression, or
 at least partially."

, which, if you take out the disparagement, is what we have, I think.

About replacing it with "the other sort" of regexp engine - it might
well be that the one we have, complete with backtracking stack and what
have you, is generally faster, and that if we were to convert to a
regexp engine which could handle awkward regexps gracefully, that could
slow Emacs down.

It seems a bit like using RISC chips - they can be faster than CISC, but
you need an intelligent compiler to drive them.  My fix-re.el might be
analogous to that intelligent compiler.

But, basically, I've got little idea about regexp engines.  The one
we've got's earliest copyright date is 1993, which seem very late.  Perl
was first released in 1987.

-- 
Alan Mackenzie (Nuremberg, Germany).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-23 20:21   ` Alan Mackenzie
@ 2015-02-23 22:19     ` Paul Eggert
  2015-02-23 22:42       ` Alan Mackenzie
  2015-02-24 16:29       ` Stefan Monnier
  0 siblings, 2 replies; 21+ messages in thread
From: Paul Eggert @ 2015-02-23 22:19 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

On 02/23/2015 12:21 PM, Alan Mackenzie wrote:
> basically, I've got little idea about regexp engines.

That's OK, if you prefer a source-to-source transformation then you can 
use that instead, but the point is that this should be done for all uses 
of the regexp code, not just for some of them.

The Emacs regexp code isn't Perl-inspired, as far as I know.  It's an 
old copy of the glibc code, with a lot of hacks.  The glibc version 
mutated quite a bit when it added i18n support, and Emacs's version has 
mutated in different ways.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-23 22:19     ` Paul Eggert
@ 2015-02-23 22:42       ` Alan Mackenzie
  2015-02-23 23:07         ` Artur Malabarba
  2015-02-23 23:37         ` Paul Eggert
  2015-02-24 16:29       ` Stefan Monnier
  1 sibling, 2 replies; 21+ messages in thread
From: Alan Mackenzie @ 2015-02-23 22:42 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

Hi, Paul.

On Mon, Feb 23, 2015 at 02:19:03PM -0800, Paul Eggert wrote:
> On 02/23/2015 12:21 PM, Alan Mackenzie wrote:
> > basically, I've got little idea about regexp engines.

> That's OK, if you prefer a source-to-source transformation then you can 
> use that instead, but the point is that this should be done for all uses 
> of the regexp code, not just for some of them.

Brilliant idea!  Why not call fix-re from within
re-search-forward/backward, looking-at, ...  With its cache, the extra
runtime will be negligible.  After a bit of tidying up, debugging,
handling of \{..\}, proper testing, ....

> The Emacs regexp code isn't Perl-inspired, as far as I know.  It's an 
> old copy of the glibc code, with a lot of hacks.  The glibc version 
> mutated quite a bit when it added i18n support, and Emacs's version has 
> mutated in different ways.

Ah, right.

-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions. Proof of concept.
  2015-02-23 22:42       ` Alan Mackenzie
@ 2015-02-23 23:07         ` Artur Malabarba
  2015-02-23 23:37         ` Paul Eggert
  1 sibling, 0 replies; 21+ messages in thread
From: Artur Malabarba @ 2015-02-23 23:07 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Paul Eggert, emacs-devel

> Brilliant idea!  Why not call fix-re from within
> re-search-forward/backward, looking-at, ...  With its cache, the extra
> runtime will be negligible.

As long as that's properly tested and verified. :-)



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-23 22:42       ` Alan Mackenzie
  2015-02-23 23:07         ` Artur Malabarba
@ 2015-02-23 23:37         ` Paul Eggert
  2015-02-25 10:08           ` Alan Mackenzie
  1 sibling, 1 reply; 21+ messages in thread
From: Paul Eggert @ 2015-02-23 23:37 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

Alan Mackenzie wrote:
>   Why not call fix-re from within
> re-search-forward/backward, looking-at, ...

I was thinking of calling it from within regexp.c itself, so that the callers 
would not need to be changed; it's the same basic idea, just a bit easier on the 
maintainers.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions. Proof of concept.
  2015-02-23 19:55 ` Paul Eggert
  2015-02-23 20:21   ` Alan Mackenzie
@ 2015-02-24  6:20   ` Philipp Stephani
  1 sibling, 0 replies; 21+ messages in thread
From: Philipp Stephani @ 2015-02-24  6:20 UTC (permalink / raw)
  To: Paul Eggert, Alan Mackenzie, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 476 bytes --]

Paul Eggert <eggert@cs.ucla.edu> schrieb am Mon Feb 23 2015 at 20:55:54:

> Would it be possible to fix the regular expression engine, so that
> programs don't have to worry about parsing and reformulating regexps so
> that they're "nice"?
>
>
See http://swtch.com/~rsc/regexp/regexp1.html for a nice introduction into
RE engines. It might be worthwhile to investigate the performance
characteristics of using such a Thompson NFA in Emacs for regexes without
back references.

[-- Attachment #2: Type: text/html, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-23 22:19     ` Paul Eggert
  2015-02-23 22:42       ` Alan Mackenzie
@ 2015-02-24 16:29       ` Stefan Monnier
  1 sibling, 0 replies; 21+ messages in thread
From: Stefan Monnier @ 2015-02-24 16:29 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Alan Mackenzie, emacs-devel

> The Emacs regexp code isn't Perl-inspired, as far as I know.

No, indeed, but it uses the same kind of backtracking based
implementation technique.


        Stefan



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-23 23:37         ` Paul Eggert
@ 2015-02-25 10:08           ` Alan Mackenzie
  2015-02-26  1:11             ` Stephen J. Turnbull
  2015-02-26  8:46             ` Paul Eggert
  0 siblings, 2 replies; 21+ messages in thread
From: Alan Mackenzie @ 2015-02-25 10:08 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

Hi, Paul.

On Mon, Feb 23, 2015 at 03:37:40PM -0800, Paul Eggert wrote:
> Alan Mackenzie wrote:
> >   Why not call fix-re from within
> > re-search-forward/backward, looking-at, ...

> I was thinking of calling it from within regexp.c itself, so that the
> callers would not need to be changed; it's the same basic idea, just a
> bit easier on the maintainers.

regexp.c is a "pure" sort of module, not even containing the cache for
the compiled regexps, so that doesn't seem the place for fix-re.el.

Thinking about it, calling fix-re for every regexp search doesn't seem
right.  Most regexps are perfectly blameless, and filling up the fix-re
cache with them is a waste of time.  More to the point, it is sometimes
not possible to preserve the numbering of \(...\) constructs while
fixing a regexp, which would change the match-data.  But in the context
where one wants to call fix-re, mainly regexps constructed from
user/package configuration variables, the \(...\) numbers aren't
significant anyway.

So, how about the following: a new variable isearch-fix-re-flag which
would have to be bound to non-nil for fix-re to run, then binding this
variable in the few places it's needed, such as in forward-paragraph.

-- 
Alan Mackenzie (Nuremberg, Germany).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-25 10:08           ` Alan Mackenzie
@ 2015-02-26  1:11             ` Stephen J. Turnbull
  2015-02-26  8:46             ` Paul Eggert
  1 sibling, 0 replies; 21+ messages in thread
From: Stephen J. Turnbull @ 2015-02-26  1:11 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Paul Eggert, emacs-devel

 > Thinking about it, calling fix-re for every regexp search doesn't seem
 > right.  Most regexps are perfectly blameless, and filling up the fix-re
 > cache with them is a waste of time.

AFAIK Emacs supports string properties so the cache itself doesn't
have to be expensive, just set "blameless-regexp" to t on those.
Whether the analysis would be expensive or not, depends on to what
extent strings are constructed (probably new, although you could hash
them and catch re-constructed strings) and to what extent the same
string object is repeatedly used.

Your call, of course.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-25 10:08           ` Alan Mackenzie
  2015-02-26  1:11             ` Stephen J. Turnbull
@ 2015-02-26  8:46             ` Paul Eggert
  2015-02-26 10:11               ` Alan Mackenzie
  1 sibling, 1 reply; 21+ messages in thread
From: Paul Eggert @ 2015-02-26  8:46 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

Alan Mackenzie wrote:
> More to the point, it is sometimes
> not possible to preserve the numbering of \(...\) constructs while
> fixing a regexp, which would change the match-data.

Sure, but you could remember how the \(...\) constructs were renumbered, and fix 
the match data after the underlying regexp call returned.  It shouldn't be a big 
deal.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-26  8:46             ` Paul Eggert
@ 2015-02-26 10:11               ` Alan Mackenzie
  2015-02-26 11:05                 ` Tassilo Horn
  0 siblings, 1 reply; 21+ messages in thread
From: Alan Mackenzie @ 2015-02-26 10:11 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

Hello, Paul.

On Thu, Feb 26, 2015 at 12:46:58AM -0800, Paul Eggert wrote:
> Alan Mackenzie wrote:
> > More to the point, it is sometimes
> > not possible to preserve the numbering of \(...\) constructs while
> > fixing a regexp, which would change the match-data.

> Sure, but you could remember how the \(...\) constructs were
> renumbered, and fix the match data after the underlying regexp call
> returned.  It shouldn't be a big deal.

Unfortunately, it's not that simple.  Consider the RE

    \(R\)+E*\(R\)+
     1  1    2  2

.  This gets transformed to

    \(R\)+\(?:E+\(R\)+\|\(R\)\)
     1  1        2  2    2  2

.  What was subexpression 2 in the original has become two subexpressions
straddling an \| sign in the transformation.  I don't think there's a way
of transforming R+E*R+ that preserves the numbering of the
subexpressions.

-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-26 10:11               ` Alan Mackenzie
@ 2015-02-26 11:05                 ` Tassilo Horn
  2015-02-26 13:09                   ` Alan Mackenzie
  0 siblings, 1 reply; 21+ messages in thread
From: Tassilo Horn @ 2015-02-26 11:05 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Paul Eggert, emacs-devel

Alan Mackenzie <acm@muc.de> writes:

Hi Alan,

>> Sure, but you could remember how the \(...\) constructs were
>> renumbered, and fix the match data after the underlying regexp call
>> returned.  It shouldn't be a big deal.
>
> Unfortunately, it's not that simple.  Consider the RE
>
>     \(R\)+E*\(R\)+
>      1  1    2  2
>
> .  This gets transformed to
>
>     \(R\)+\(?:E+\(R\)+\|\(R\)\)
>      1  1        2  2    2  2
>
> .  What was subexpression 2 in the original has become two
> subexpressions straddling an \| sign in the transformation.  I don't
> think there's a way of transforming R+E*R+ that preserves the
> numbering of the subexpressions.

Couldn't you use explicitly numbered groups, i.e., the regex would
translate to

    \(?1:R\)+\(?:E+\(?2:R\)+\|\(?2:R\)\)

?

As long as the groups with the same number are exclusive there shouldn't
be a problem.

Bye,
Tassilo



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-26 11:05                 ` Tassilo Horn
@ 2015-02-26 13:09                   ` Alan Mackenzie
  2015-02-26 13:46                     ` Stefan Monnier
  0 siblings, 1 reply; 21+ messages in thread
From: Alan Mackenzie @ 2015-02-26 13:09 UTC (permalink / raw)
  To: emacs-devel; +Cc: Paul Eggert

Hello, Tassilo.

On Thu, Feb 26, 2015 at 12:05:37PM +0100, Tassilo Horn wrote:
> Alan Mackenzie <acm@muc.de> writes:

> Hi Alan,

> >> Sure, but you could remember how the \(...\) constructs were
> >> renumbered, and fix the match data after the underlying regexp call
> >> returned.  It shouldn't be a big deal.

> > Unfortunately, it's not that simple.  Consider the RE

> >     \(R\)+E*\(R\)+
> >      1  1    2  2

> > .  This gets transformed to

> >     \(R\)+\(?:E+\(R\)+\|\(R\)\)
> >      1  1        2  2    2  2

> > .  What was subexpression 2 in the original has become two
> > subexpressions straddling an \| sign in the transformation.  I don't
> > think there's a way of transforming R+E*R+ that preserves the
> > numbering of the subexpressions.

> Couldn't you use explicitly numbered groups, i.e., the regex would
> translate to

>     \(?1:R\)+\(?:E+\(?2:R\)+\|\(?2:R\)\)

> ?

Thanks.  That's a brilliant idea!  I think it would work.

> As long as the groups with the same number are exclusive there shouldn't
> be a problem.

I think that's true.  There might be a slight problem with groups which
match only the empty string.  Something like:

    R*\(\)R*

, but anybody who writes such regexps deserves what she gets.

> Bye,
> Tassilo

-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-26 13:09                   ` Alan Mackenzie
@ 2015-02-26 13:46                     ` Stefan Monnier
  2015-02-26 16:21                       ` Alan Mackenzie
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan Monnier @ 2015-02-26 13:46 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Paul Eggert, emacs-devel

> I think that's true.  There might be a slight problem with groups which
> match only the empty string.  Something like:
>     R*\(\)R*
> , but anybody who writes such regexps deserves what she gets.

What is it that I deserve to get?


        Stefan



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-26 13:46                     ` Stefan Monnier
@ 2015-02-26 16:21                       ` Alan Mackenzie
  2015-02-26 19:12                         ` Stefan Monnier
  0 siblings, 1 reply; 21+ messages in thread
From: Alan Mackenzie @ 2015-02-26 16:21 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Paul Eggert, emacs-devel

Hello, Stefan.

On Thu, Feb 26, 2015 at 08:46:22AM -0500, Stefan Monnier wrote:
> > I think that's true.  There might be a slight problem with groups which
> > match only the empty string.  Something like:
> >     R*\(\)R*
> > , but anybody who writes such regexps deserves what she gets.

> What is it that I deserve to get?

You deserve, perhaps, to lose (match-beginning 1) and (match-end 1),
which were ill-defined anyway.  You perhaps deserve to lose the entire
empty group.  Writing a regexp like that, you deserve pain.  ;-)

Have you really written a regexp like this (apart from for testing
purposes)?.  If so, what's it for?

By the way, how do you see the prospects of this file becoming
incorporated into Emacs at some stage?

>         Stefan

-- 
Alan Mackenzie (Nuremberg, Germany).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-26 16:21                       ` Alan Mackenzie
@ 2015-02-26 19:12                         ` Stefan Monnier
  2015-02-26 20:01                           ` Alan Mackenzie
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan Monnier @ 2015-02-26 19:12 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Paul Eggert, emacs-devel

>> >     R*\(\)R*
>> > , but anybody who writes such regexps deserves what she gets.
>> What is it that I deserve to get?
> You deserve, perhaps, to lose (match-beginning 1) and (match-end 1),
> which were ill-defined anyway.

Why do you think so?  They seem perfectly well-defined to me.
They're just always equal to one another, of course, but to the extent
that the regexp syntax only forces me to put "named positions" in pairs,
if I need a single position, it's fairly natural to just use \(\).

> Have you really written a regexp like this (apart from for testing
> purposes)?.  If so, what's it for?

   grep '\\\\(\\\\)' **/*.el

finds 27 matches.  Taking one example from the list:

   lisp/emacs-lisp/smie.el:     ((looking-at "\\s(\\|\\s)\\(\\)")

what this does is to let me use (match-beginning 1) to figure out which
of the two alternatives was matched.  I could have written this as

       ((looking-at "\\s(\\|\\(\\s)\\)")

but this would be (marginally) slower, because we'd always push
a "group-start" marker before try to match "\\s)", whereas with the
other rule, we only do that when we know "\\s)" has matched.

> By the way, how do you see the prospects of this file becoming
> incorporated into Emacs at some stage?

To be honest, I haven't looked at it at all, yet.
The vague understanding I have of what it might be sounds interesting.
It's just a patch trying to cover up the worst aspects of the
current regexp engine, but since there doesn't seem to be much interest
in improving/overhauling the regexp engine, maybe it's a good stop-gap.

        Stefan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-26 19:12                         ` Stefan Monnier
@ 2015-02-26 20:01                           ` Alan Mackenzie
  2015-02-27 13:45                             ` Stefan Monnier
  0 siblings, 1 reply; 21+ messages in thread
From: Alan Mackenzie @ 2015-02-26 20:01 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Paul Eggert, emacs-devel

Hello, Stefan.

On Thu, Feb 26, 2015 at 02:12:32PM -0500, Stefan Monnier wrote:
> >> >     R*\(\)R*
> >> > , but anybody who writes such regexps deserves what she gets.
> >> What is it that I deserve to get?
> > You deserve, perhaps, to lose (match-beginning 1) and (match-end 1),
> > which were ill-defined anyway.

> Why do you think so?  They seem perfectly well-defined to me.
> They're just always equal to one another, of course, but to the extent
> that the regexp syntax only forces me to put "named positions" in pairs,
> if I need a single position, it's fairly natural to just use \(\).

I really did mean R*\(\)R*, with R being the same on both sides of the
\(\), but the *s possibly being +s.  _That_ is nasty and undefined.

> > Have you really written a regexp like this (apart from for testing
> > purposes)?.  If so, what's it for?

>    grep '\\\\(\\\\)' **/*.el

> finds 27 matches.  Taking one example from the list:

>    lisp/emacs-lisp/smie.el:     ((looking-at "\\s(\\|\\s)\\(\\)")

> what this does is to let me use (match-beginning 1) to figure out which
> of the two alternatives was matched.  I could have written this as

>        ((looking-at "\\s(\\|\\(\\s)\\)")

> but this would be (marginally) slower, because we'd always push
> a "group-start" marker before try to match "\\s)", whereas with the
> other rule, we only do that when we know "\\s)" has matched.

OK.

> > By the way, how do you see the prospects of this file becoming
> > incorporated into Emacs at some stage?

> To be honest, I haven't looked at it at all, yet.
> The vague understanding I have of what it might be sounds interesting.
> It's just a patch trying to cover up the worst aspects of the
> current regexp engine, but since there doesn't seem to be much interest
> in improving/overhauling the regexp engine, maybe it's a good stop-gap.

Thanks.  I'll continue working on it, adding a decent set of test cases
too.

>         Stefan

-- 
Alan Mackenzie (Nuremberg, Germany).



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-26 20:01                           ` Alan Mackenzie
@ 2015-02-27 13:45                             ` Stefan Monnier
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Monnier @ 2015-02-27 13:45 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: Paul Eggert, emacs-devel

> I really did mean R*\(\)R*, with R being the same on both sides of the
> \(\), but the *s possibly being +s.

Ah, well R*<E>R* where <E> can match the empty string is definitely
nasty given the current implementation technique of our regexp
matcher, yes.
It's not specific to \(\), OTOH.

> _That_ is nasty and undefined.

Well, it's actually defined (by the "leftmost longest" rule), but
I agree it's a nasty case.

        Stefan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fixing ill-conditioned regular expressions.  Proof of concept.
  2015-02-23 18:12 Fixing ill-conditioned regular expressions. Proof of concept Alan Mackenzie
  2015-02-23 19:55 ` Paul Eggert
@ 2015-03-13 22:53 ` Stefan Monnier
  1 sibling, 0 replies; 21+ messages in thread
From: Stefan Monnier @ 2015-03-13 22:53 UTC (permalink / raw)
  To: Alan Mackenzie; +Cc: emacs-devel

> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
> ;;
> ;; f i x - r e . e l

This is not using the standard file format.  I strongly suggest you use
something like auto-insert-mode which will get that right for you.

> ;; The AST is an internal representation of the regular expression string.
> ;; Elements are represented as follows:

Why invent a new format, instead of using rx?  I'm pretty sure someone
has already provided a parser from string to rx format (and of course
we also already have a dumper from rx to string).

> ;; PTR and AD
> ;; ----------
> ;; Whilst building and transforming trees, what we really need is pointers to
> ;; either the car or cdr of a cons cell, to enable us to
> ;; overwrite them.

I think your code will gain a lot in readability if you made it
functional: take an rx expression and return a *new* rx expression
without modifying the original one.

As it stands, I find it rather impenetrable.

> Return 'shortened, t or nil.

Our conventions say this should be

  Return `shortened', t or nil.

>   (if (eq ad 'cdr) (error "fix-re--R+R*->R+ got 'cdr"))
>   (let ((e0 (if (eq ad 'car) (car ptr) (cadr ptr)))
> 	(e1 (if (eq ad 'car) (cadr ptr) (caddr ptr))))
>     (when (and (consp e0) (consp e1))
>       (let ((op0 (car e0))
> 	    (op1 (car e1))
> 	    ;; (link (if (eq ad 'car) (cddr ptr) (cdddr ptr)))
> 	    op
> 	    )
> 	(when
> 	    (and (memq op0 '(+ * \?))
> 		 (memq op1 '(+ * \?))
> 		 (equal (cdr e0) (cdr e1))) ; Both Rs are the same.
> 	  (cond
> 	   ((and (eq op0 '\?) (eq op1 '\?)) ; Cant combine R?R?
> 	    nil)
> 	   ((and (eq op0 '+) (eq op1 '+)) ; R+R+ -> RR+
> 	    (fix-re--chop-+* ptr 'car)
> 	    t)
> 	   (t
> 	    (setq op
> 		    (cond
> 		     ((or (eq op0 '+) (eq op1 '+)) '+)
> 		     ((or (eq op0 '*) (eq op1 '*)) '*)))
> 	    (setcar e0 op)
> 	    (fix-re--chop (if (eq ad 'car) ptr (cdr ptr)) 'cadr)
> 	    'shortened))
> 	  )))))

For example, I think that if you make it functional rather than
imperative, the above could turn into something along the lines of

  (pcase ptr                ;We don't have `ad' any more.
    (`((,op0 . ,r0)
       (,op1 . ,r1)
       . (and ,rest
              (guard (and (equal r0 r1)
                          (memq op0 '(+ * \?))
                          (memq op1 '(+ * \?)))))))
     (cond
      ((and (eq op0 '\?) (eq op1 '\?)) ; Cant combine R?R?
       ptr)
      ((and (eq op0 '+) (eq op1 '+)) ; R+R+ -> RR+
       `(,(car r0) ,(car ptr) . ,rest))
      (t
       `((,(cond
               ((or (eq op0 '+) (eq op1 '+)) '+)
               ((or (eq op0 '*) (eq op1 '*)) '*)) . r0)
         . ,rest)))))


-- Stefan>



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-03-13 22:53 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-02-23 18:12 Fixing ill-conditioned regular expressions. Proof of concept Alan Mackenzie
2015-02-23 19:55 ` Paul Eggert
2015-02-23 20:21   ` Alan Mackenzie
2015-02-23 22:19     ` Paul Eggert
2015-02-23 22:42       ` Alan Mackenzie
2015-02-23 23:07         ` Artur Malabarba
2015-02-23 23:37         ` Paul Eggert
2015-02-25 10:08           ` Alan Mackenzie
2015-02-26  1:11             ` Stephen J. Turnbull
2015-02-26  8:46             ` Paul Eggert
2015-02-26 10:11               ` Alan Mackenzie
2015-02-26 11:05                 ` Tassilo Horn
2015-02-26 13:09                   ` Alan Mackenzie
2015-02-26 13:46                     ` Stefan Monnier
2015-02-26 16:21                       ` Alan Mackenzie
2015-02-26 19:12                         ` Stefan Monnier
2015-02-26 20:01                           ` Alan Mackenzie
2015-02-27 13:45                             ` Stefan Monnier
2015-02-24 16:29       ` Stefan Monnier
2015-02-24  6:20   ` Philipp Stephani
2015-03-13 22:53 ` Stefan Monnier

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).