Hi Adding some thoughts on why current fa implementation is so slow. It strives to keep well nested trees in all steps, so it spends a lot of time on branch compassion. This ensures well nested trees for the fa and eases final fa to regexp conversion. If we drop well-formness for fa and manage to convert an not well-formed fa to regexp, the process would be very fast... Regards Miguel Em sex, 12 de abr de 2019 12:06, Miguel V. S. Frasson escreveu: > Dear Stefan and others > > Some time ago I suggested an improvement in regexp-opt, factoring > similarities at the end of groups. At the end, Stefan wrote: > > Em sex, 8 de fev de 2019 às 01:48, Stefan Monnier > escreveu: > > A much better approach is to go for a real "regexp to NFA/DFA > > conversion". The `lex.el` package is one such example, but it's very > > inefficient (in terms of building the FA and in the size of the FA, not > > in terms of running the FA). > > After some time, I had an idea of simplification by FA. The base idea > is implement FA as "nodes" being lists of ARROWi and arrows being > (CHAR . NODE). For example the initial FA for strings ("abd" "acd") is > > >1 --a--> 2 --b--> 3 --d--> 4 --epsilon--> nil > | ^ > +---c--> 5 --d--> 6 --epsilon-----| > and inplemented as > (?a (?b (?d (0))) (?c (?d (0)))) > = ((?a . ((?b . ((?d . ((0 . nil))))) > (?c . ((?d . ((0 . nil))))))) > > Note that nodes 3 and 5 are `equal'. Simplification is to make them `eq'. > > Also, there is simplification where a node is equal to a subset of a > parent node, resulting is a ? construction. > For example, > +---f--> 9 ----------epsilon -------------\ > | v > >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon--> nil > | ^ > +---d--> 6 --e---/ > Node 4 is equal to a subnode derived from 2, resulting on > +---epsilon-------+ > | v > >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon--> nil > | ^ > +---d--> 6 --e--> 7 > > I made an inplementation, a patch to regexp-opt.el > > The pros: > If the resulting strings "came from" a regexp that is splittable, the > FA implementation always simplifies to it. In pratice, these are > uncommun, and in most cases, the results are equivalent. > > The cons: > The algorithm for FA seams to have greater computation complexity, > takes about 20 times to compute in average. > > Example, the case on the > > (setq strings '("cond" "if" "when" "unless" "while" > "let" "let*" "progn" "prog1" "prog2" > "save-restriction" "save-excursion" "save-window-excursion" > "save-current-buffer" "save-match-data" > "catch" "throw" "unwind-protect" "condition-case")) > > (regexp-opt strings) -> > > "\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\|save-\\(?:current-buffer\\|excursion\\|match-data\\|\\(?:restrict\\|window-excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\\)\\)" > > (regexp-opt2 strings) -> > > "\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\|save-\\(?:current-buffer\\|match-data\\|\\(?:restrict\\|\\(?:window-\\)?excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\\)\\)" > > The difference is that FA algorithm > > BUT my version takes 70 times more time to compute. > > Example 2: > (setq strings2 '("car" "cdr" > "caar" "cadr" "cdar" "cddr" > "caaar" "caadr" "cadar" "caddr" > "cdaar" "cdadr" "cddar" "cdddr")) > > (regexp-opt strings2) -> > > "\\(?:c\\(?:\\(?:a\\(?:a[ad]\\|d[ad]\\|[ad]\\)\\|d\\(?:a[ad]\\|d[ad]\\|[ad]\\)\\|[ad]\\)r\\)\\)" > > (regexp-opt2 strings2) -> > "\\(?:c[ad]\\(?:[ad][ad]?\\)?r\\)" > > FA is 7 times slower here. > > If this implementation is useful, I would like very much to contribute it. > > I actually have the other implementation from previous idea. It is > faster than FA, same complexity of current regexp-opt, a bit slower of > course, but I like this better. > > Best regards > > Miguel Frasson > > -- > Miguel Vinicius Santini Frasson > mvsfrasson@gmail.com >