Dear Stefan and others

Some time ago I suggested an improvement in regexp-opt, factoring
similarities at the end of groups. At the end, Stefan wrote:

Em sex, 8 de fev de 2019 às 01:48, Stefan Monnier
<monnier@iro.umontreal.ca> escreveu:
> A much better approach is to go for a real "regexp to NFA/DFA
> conversion".  The `lex.el` package is one such example, but it's very
> inefficient (in terms of building the FA and in the size of the FA, not
> in terms of running the FA).

After some time, I had an idea of simplification by FA.  The base idea
is implement FA as "nodes" being lists of ARROWi and arrows being
(CHAR . NODE). For example the initial FA for strings ("abd" "acd") is

 >1 --a--> 2 --b--> 3 --d--> 4 --epsilon--> nil
           |                                 ^
           +---c--> 5 --d--> 6 --epsilon-----|
and inplemented as
(?a (?b (?d (0))) (?c (?d (0))))
= ((?a . ((?b . ((?d . ((0 . nil)))))
          (?c . ((?d . ((0 . nil)))))))

Note that nodes 3 and 5 are `equal'.  Simplification is to make them `eq'.

Also, there is simplification where a node is equal to a subset of a
parent node, resulting is a ? construction.
For example,
            +---f--> 9 ----------epsilon -------------\
            |                                          v
  >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon--> nil
            |                 ^
            +---d--> 6 --e---/
Node 4 is equal to a subnode derived from 2, resulting on
            +---epsilon-------+
            |                 v
  >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon--> nil
            |                 ^
            +---d--> 6 --e--> 7

I made an inplementation, a patch to regexp-opt.el

The pros:
If the resulting strings "came from" a regexp that is splittable, the
FA implementation always simplifies to it.  In pratice, these are
uncommun, and in most cases, the results are equivalent.

The cons:
The algorithm for FA seams to have greater computation complexity,
takes about 20 times to compute in average.

Example, the case on the

(setq strings '("cond" "if" "when" "unless" "while"
                "let" "let*" "progn" "prog1" "prog2"
                "save-restriction" "save-excursion" "save-window-excursion"
                "save-current-buffer" "save-match-data"
                "catch" "throw" "unwind-protect" "condition-case"))

(regexp-opt strings) ->
"\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\|save-\\(?:current-buffer\\|excursion\\|match-data\\|\\(?:restrict\\|window-excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\\)\\)"

(regexp-opt2 strings) ->
"\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\|save-\\(?:current-buffer\\|match-data\\|\\(?:restrict\\|\\(?:window-\\)?excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\\)\\)"

The difference is that FA algorithm

BUT my version takes 70 times more time to compute.

Example 2:
(setq strings2 '("car" "cdr"
                 "caar" "cadr" "cdar" "cddr"
                 "caaar" "caadr" "cadar" "caddr"
                 "cdaar" "cdadr" "cddar" "cdddr"))

(regexp-opt strings2) ->
"\\(?:c\\(?:\\(?:a\\(?:a[ad]\\|d[ad]\\|[ad]\\)\\|d\\(?:a[ad]\\|d[ad]\\|[ad]\\)\\|[ad]\\)r\\)\\)"

(regexp-opt2 strings2) ->
"\\(?:c[ad]\\(?:[ad][ad]?\\)?r\\)"

FA is 7 times slower here.

If this implementation is useful, I would like very much to contribute it.

I actually have the other implementation from previous idea. It is
faster than FA, same complexity of current regexp-opt, a bit slower of
course, but I like this better.

Best regards

Miguel Frasson

-- 
Miguel Vinicius Santini Frasson
mvsfrasson@gmail.com