Dear Stefan and others Some time ago I suggested an improvement in regexp-opt, factoring similarities at the end of groups. At the end, Stefan wrote: Em sex, 8 de fev de 2019 às 01:48, Stefan Monnier escreveu: > A much better approach is to go for a real "regexp to NFA/DFA > conversion". The `lex.el` package is one such example, but it's very > inefficient (in terms of building the FA and in the size of the FA, not > in terms of running the FA). After some time, I had an idea of simplification by FA. The base idea is implement FA as "nodes" being lists of ARROWi and arrows being (CHAR . NODE). For example the initial FA for strings ("abd" "acd") is >1 --a--> 2 --b--> 3 --d--> 4 --epsilon--> nil | ^ +---c--> 5 --d--> 6 --epsilon-----| and inplemented as (?a (?b (?d (0))) (?c (?d (0)))) = ((?a . ((?b . ((?d . ((0 . nil))))) (?c . ((?d . ((0 . nil))))))) Note that nodes 3 and 5 are `equal'. Simplification is to make them `eq'. Also, there is simplification where a node is equal to a subset of a parent node, resulting is a ? construction. For example, +---f--> 9 ----------epsilon -------------\ | v >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon--> nil | ^ +---d--> 6 --e---/ Node 4 is equal to a subnode derived from 2, resulting on +---epsilon-------+ | v >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon--> nil | ^ +---d--> 6 --e--> 7 I made an inplementation, a patch to regexp-opt.el The pros: If the resulting strings "came from" a regexp that is splittable, the FA implementation always simplifies to it. In pratice, these are uncommun, and in most cases, the results are equivalent. The cons: The algorithm for FA seams to have greater computation complexity, takes about 20 times to compute in average. Example, the case on the (setq strings '("cond" "if" "when" "unless" "while" "let" "let*" "progn" "prog1" "prog2" "save-restriction" "save-excursion" "save-window-excursion" "save-current-buffer" "save-match-data" "catch" "throw" "unwind-protect" "condition-case")) (regexp-opt strings) -> "\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\|save-\\(?:current-buffer\\|excursion\\|match-data\\|\\(?:restrict\\|window-excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\\)\\)" (regexp-opt2 strings) -> "\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\|save-\\(?:current-buffer\\|match-data\\|\\(?:restrict\\|\\(?:window-\\)?excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\\)\\)" The difference is that FA algorithm BUT my version takes 70 times more time to compute. Example 2: (setq strings2 '("car" "cdr" "caar" "cadr" "cdar" "cddr" "caaar" "caadr" "cadar" "caddr" "cdaar" "cdadr" "cddar" "cdddr")) (regexp-opt strings2) -> "\\(?:c\\(?:\\(?:a\\(?:a[ad]\\|d[ad]\\|[ad]\\)\\|d\\(?:a[ad]\\|d[ad]\\|[ad]\\)\\|[ad]\\)r\\)\\)" (regexp-opt2 strings2) -> "\\(?:c[ad]\\(?:[ad][ad]?\\)?r\\)" FA is 7 times slower here. If this implementation is useful, I would like very much to contribute it. I actually have the other implementation from previous idea. It is faster than FA, same complexity of current regexp-opt, a bit slower of course, but I like this better. Best regards Miguel Frasson -- Miguel Vinicius Santini Frasson mvsfrasson@gmail.com