From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail From: "Miguel V. S. Frasson" Newsgroups: gmane.emacs.devel Subject: Re: Improving regexp-opt Date: Fri, 12 Apr 2019 12:40:48 -0300 Message-ID: References: Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="0000000000003a67710586572167" Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226"; logging-data="177086"; mail-complaints-to="usenet@blaine.gmane.org" Cc: emacs-devel@gnu.org To: Stefan Monnier Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Apr 12 17:53:53 2019 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256) (Exim 4.89) (envelope-from ) id 1hEyUz-000juN-CT for ged-emacs-devel@m.gmane.org; Fri, 12 Apr 2019 17:53:53 +0200 Original-Received: from localhost ([127.0.0.1]:39053 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1hEyUy-0005cp-FD for ged-emacs-devel@m.gmane.org; Fri, 12 Apr 2019 11:53:52 -0400 Original-Received: from eggs.gnu.org ([209.51.188.92]:37922) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1hEyIa-0003cp-8B for emacs-devel@gnu.org; Fri, 12 Apr 2019 11:41:05 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1hEyIY-0003H0-MW for emacs-devel@gnu.org; Fri, 12 Apr 2019 11:41:04 -0400 Original-Received: from mail-it1-x132.google.com ([2607:f8b0:4864:20::132]:34996) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1hEyIY-0003Gl-9u for emacs-devel@gnu.org; Fri, 12 Apr 2019 11:41:02 -0400 Original-Received: by mail-it1-x132.google.com with SMTP id w15so16265079itc.0 for ; Fri, 12 Apr 2019 08:41:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Jjgqn1TRpCSDpC4tddlOByYzHQYmlj8OmoChPOW6deM=; b=YGQPQT2zkiKIvNA+HOoADJJ3/6wAY37ZiMXNFpIXWUieK5ayWlo7bVsXJd+Zg9/x+E Df30BgXyupCwea6P0H0+VipiDHbXcFAN9wzNdTjBp0Dnao3+MmHrKTEyUYiBmrSS/rQ9 m3uUsmMPcTa/T2aew0Lw+MS8iiBmy/sLDPrpWfTyTpPM1nhItLL/Xdp4o96/ICVyhsUJ abIN5I4RA7uvAhZEaLK4paRPuWMPGQXCdiVTZTTvYjYKR1ZdFA7oeQMGb/zX3Kc+CBlp M3iJpbXN2bMIVgcLovE0+/Ha4jgLOQzEzzvzZeoxNRqySRp12a1zMlTiGQTicZdqS/jq wMnA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Jjgqn1TRpCSDpC4tddlOByYzHQYmlj8OmoChPOW6deM=; b=nkn0vY6b8ZHObWkF2EAcTMwlmc2Y3HJvphl5GSunLUu3MGW+UPptbalmyzBTDemkoT koGTfHwAkUpnLNoEDddtwqwEYjwGlpUMfOMoL1L1g/azouwXRrajWFeijfnFf3ov7yCf b6orPXrTccVfv9BlMcL040Z56yj2mWpj8mNUTI2SvOMY/uWzJ6qHZQyXROvQEpi1FyUT q9c1tH3VVgTEqI6X7oRbcnz6guGUPeaadFDN0X6mXaGTnbXHylkJUc2PsK7UxyYDycu7 v9TYipHcJUqvZVJQjtbqPg5jQnXCJjUjnxKs1NoM88xvIuZm/Z4IjdJfqfNIwfTSdVVV aI1A== X-Gm-Message-State: APjAAAW5HvGdeVhIfrJEeE5YS5Y86VBCMCz0PpzMUfcxgTk8TIIV0V0y StgoA0VOwf3HWUOlMNVoKrTiEq1Oxsn8c4E5W3E= X-Google-Smtp-Source: APXvYqyCDOsvgiJCa51DoznE/xlCBMI813LenlVa4Vw6bqnR/xjFRrGXjfbZn7CT2nvoL1dFKWAltLlz8YMtWbDMiq4= X-Received: by 2002:a02:a916:: with SMTP id n22mr40303853jam.40.1555083661303; Fri, 12 Apr 2019 08:41:01 -0700 (PDT) In-Reply-To: X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:4864:20::132 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:235347 Archived-At: --0000000000003a67710586572167 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi Adding some thoughts on why current fa implementation is so slow. It strives to keep well nested trees in all steps, so it spends a lot of time on branch compassion. This ensures well nested trees for the fa and eases final fa to regexp conversion. If we drop well-formness for fa and manage to convert an not well-formed fa to regexp, the process would be very fast... Regards Miguel Em sex, 12 de abr de 2019 12:06, Miguel V. S. Frasson escreveu: > Dear Stefan and others > > Some time ago I suggested an improvement in regexp-opt, factoring > similarities at the end of groups. At the end, Stefan wrote: > > Em sex, 8 de fev de 2019 =C3=A0s 01:48, Stefan Monnier > escreveu: > > A much better approach is to go for a real "regexp to NFA/DFA > > conversion". The `lex.el` package is one such example, but it's very > > inefficient (in terms of building the FA and in the size of the FA, not > > in terms of running the FA). > > After some time, I had an idea of simplification by FA. The base idea > is implement FA as "nodes" being lists of ARROWi and arrows being > (CHAR . NODE). For example the initial FA for strings ("abd" "acd") is > > >1 --a--> 2 --b--> 3 --d--> 4 --epsilon--> nil > | ^ > +---c--> 5 --d--> 6 --epsilon-----| > and inplemented as > (?a (?b (?d (0))) (?c (?d (0)))) > =3D ((?a . ((?b . ((?d . ((0 . nil))))) > (?c . ((?d . ((0 . nil))))))) > > Note that nodes 3 and 5 are `equal'. Simplification is to make them `eq'= . > > Also, there is simplification where a node is equal to a subset of a > parent node, resulting is a ? construction. > For example, > +---f--> 9 ----------epsilon -------------\ > | v > >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon--> nil > | ^ > +---d--> 6 --e---/ > Node 4 is equal to a subnode derived from 2, resulting on > +---epsilon-------+ > | v > >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon--> nil > | ^ > +---d--> 6 --e--> 7 > > I made an inplementation, a patch to regexp-opt.el > > The pros: > If the resulting strings "came from" a regexp that is splittable, the > FA implementation always simplifies to it. In pratice, these are > uncommun, and in most cases, the results are equivalent. > > The cons: > The algorithm for FA seams to have greater computation complexity, > takes about 20 times to compute in average. > > Example, the case on the > > (setq strings '("cond" "if" "when" "unless" "while" > "let" "let*" "progn" "prog1" "prog2" > "save-restriction" "save-excursion" "save-window-excursio= n" > "save-current-buffer" "save-match-data" > "catch" "throw" "unwind-protect" "condition-case")) > > (regexp-opt strings) -> > > "\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\= |save-\\(?:current-buffer\\|excursion\\|match-data\\|\\(?:restrict\\|window= -excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\= \)\\)" > > (regexp-opt2 strings) -> > > "\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\= |save-\\(?:current-buffer\\|match-data\\|\\(?:restrict\\|\\(?:window-\\)?ex= curs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\\)\= \)" > > The difference is that FA algorithm > > BUT my version takes 70 times more time to compute. > > Example 2: > (setq strings2 '("car" "cdr" > "caar" "cadr" "cdar" "cddr" > "caaar" "caadr" "cadar" "caddr" > "cdaar" "cdadr" "cddar" "cdddr")) > > (regexp-opt strings2) -> > > "\\(?:c\\(?:\\(?:a\\(?:a[ad]\\|d[ad]\\|[ad]\\)\\|d\\(?:a[ad]\\|d[ad]\\|[a= d]\\)\\|[ad]\\)r\\)\\)" > > (regexp-opt2 strings2) -> > "\\(?:c[ad]\\(?:[ad][ad]?\\)?r\\)" > > FA is 7 times slower here. > > If this implementation is useful, I would like very much to contribute it= . > > I actually have the other implementation from previous idea. It is > faster than FA, same complexity of current regexp-opt, a bit slower of > course, but I like this better. > > Best regards > > Miguel Frasson > > -- > Miguel Vinicius Santini Frasson > mvsfrasson@gmail.com > --0000000000003a67710586572167 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi

Adding so= me thoughts on why current fa implementation is so slow.=C2=A0

It strives to keep well nested trees= in all steps, so it spends a lot of time on branch compassion. This ensure= s well nested trees for the fa and eases final fa to regexp conversion.=C2= =A0

If we drop well-form= ness for fa and manage to convert an not well-formed fa to regexp, the proc= ess would be very fast...=C2=A0

Regards

Miguel=

Em sex, 12 de abr de 2019 12:06, Miguel V. S. Frasson <mvsfrasson@gmail.com> escreveu:
Dear Stefan and others

Some time ago I suggested an improvement in regexp-opt, factoring
similarities at the end of groups. At the end, Stefan wrote:

Em sex, 8 de fev de 2019 =C3=A0s 01:48, Stefan Monnier
<monnier@iro.umontreal.ca> escreveu:
> A much better approach is to go for a real "regexp to NFA/DFA
> conversion".=C2=A0 The `lex.el` package is one such example, but = it's very
> inefficient (in terms of building the FA and in the size of the FA, no= t
> in terms of running the FA).

After some time, I had an idea of simplification by FA.=C2=A0 The base idea=
is implement FA as "nodes" being lists of ARROWi and arrows being=
(CHAR . NODE). For example the initial FA for strings ("abd" &quo= t;acd") is

=C2=A0>1 --a--> 2 --b--> 3 --d--> 4 --epsilon--> nil
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0^
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0+---c--> 5 --d--> 6 --epsilo= n-----|
and inplemented as
(?a (?b (?d (0))) (?c (?d (0))))
=3D ((?a . ((?b . ((?d . ((0 . nil)))))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 (?c . ((?d . ((0 . nil)))))))

Note that nodes 3 and 5 are `equal'.=C2=A0 Simplification is to make th= em `eq'.

Also, there is simplification where a node is equal to a subset of a
parent node, resulting is a ? construction.
For example,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 +---f--> 9 ----------epsilon -= ------------\
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 v
=C2=A0 >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon-->= ; nil
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0^
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 +---d--> 6 --e---/
Node 4 is equal to a subnode derived from 2, resulting on
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 +---epsilon-------+
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0v
=C2=A0 >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon-->= ; nil
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0^
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 +---d--> 6 --e--> 7

I made an inplementation, a patch to regexp-opt.el

The pros:
If the resulting strings "came from" a regexp that is splittable,= the
FA implementation always simplifies to it.=C2=A0 In pratice, these are
uncommun, and in most cases, the results are equivalent.

The cons:
The algorithm for FA seams to have greater computation complexity,
takes about 20 times to compute in average.

Example, the case on the

(setq strings '("cond" "if" "when" "= unless" "while"
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 "let" &qu= ot;let*" "progn" "prog1" "prog2"
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 "save-restrict= ion" "save-excursion" "save-window-excursion"
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 "save-current-= buffer" "save-match-data"
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 "catch" &= quot;throw" "unwind-protect" "condition-case"))
(regexp-opt strings) ->
"\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n= ]\\|save-\\(?:current-buffer\\|excursion\\|match-data\\|\\(?:restrict\\|win= dow-excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|i= le\\)\\)"

(regexp-opt2 strings) ->
"\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n= ]\\|save-\\(?:current-buffer\\|match-data\\|\\(?:restrict\\|\\(?:window-\\)= ?excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\= \)\\)"

The difference is that FA algorithm

BUT my version takes 70 times more time to compute.

Example 2:
(setq strings2 '("car" "cdr"
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0"caar&qu= ot; "cadr" "cdar" "cddr"
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0"caaar&q= uot; "caadr" "cadar" "caddr"
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0"cdaar&q= uot; "cdadr" "cddar" "cdddr"))

(regexp-opt strings2) ->
"\\(?:c\\(?:\\(?:a\\(?:a[ad]\\|d[ad]\\|[ad]\\)\\|d\\(?:a[ad]\\|d[ad]\\= |[ad]\\)\\|[ad]\\)r\\)\\)"

(regexp-opt2 strings2) ->
"\\(?:c[ad]\\(?:[ad][ad]?\\)?r\\)"

FA is 7 times slower here.

If this implementation is useful, I would like very much to contribute it.<= br>
I actually have the other implementation from previous idea. It is
faster than FA, same complexity of current regexp-opt, a bit slower of
course, but I like this better.

Best regards

Miguel Frasson

--
Miguel Vinicius Santini Frasson
mvsfrasson@gmail.com
--0000000000003a67710586572167--