From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: "Miguel V. S. Frasson" <mvsfrasson@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: Improving regexp-opt
Date: Fri, 12 Apr 2019 12:40:48 -0300
Message-ID: <CAARdmY04bv2V1xJ+-C4_+B2WuyyuYNFE=1G=SiUFL4EF4BL_eg@mail.gmail.com>
References: <CAARdmY1m2+ROr+orDMoUK7zXkFbf94o-89z6bATjzPuoVt=nBA@mail.gmail.com>
	<jwvimxvt0h5.fsf-monnier+emacs@gnu.org>
	<CAARdmY0VxaC9LdACyv_uzGZ-9Sinj3BmknwFGR6dWvGk21gtkQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="0000000000003a67710586572167"
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
	logging-data="177086"; mail-complaints-to="usenet@blaine.gmane.org"
Cc: emacs-devel@gnu.org
To: Stefan Monnier <monnier@iro.umontreal.ca>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Apr 12 17:53:53 2019
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256)
	(Exim 4.89)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1hEyUz-000juN-CT
	for ged-emacs-devel@m.gmane.org; Fri, 12 Apr 2019 17:53:53 +0200
Original-Received: from localhost ([127.0.0.1]:39053 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1hEyUy-0005cp-FD
	for ged-emacs-devel@m.gmane.org; Fri, 12 Apr 2019 11:53:52 -0400
Original-Received: from eggs.gnu.org ([209.51.188.92]:37922)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mvsfrasson@gmail.com>) id 1hEyIa-0003cp-8B
	for emacs-devel@gnu.org; Fri, 12 Apr 2019 11:41:05 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mvsfrasson@gmail.com>) id 1hEyIY-0003H0-MW
	for emacs-devel@gnu.org; Fri, 12 Apr 2019 11:41:04 -0400
Original-Received: from mail-it1-x132.google.com ([2607:f8b0:4864:20::132]:34996)
	by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
	(Exim 4.71) (envelope-from <mvsfrasson@gmail.com>)
	id 1hEyIY-0003Gl-9u
	for emacs-devel@gnu.org; Fri, 12 Apr 2019 11:41:02 -0400
Original-Received: by mail-it1-x132.google.com with SMTP id w15so16265079itc.0
	for <emacs-devel@gnu.org>; Fri, 12 Apr 2019 08:41:02 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
	h=mime-version:references:in-reply-to:from:date:message-id:subject:to
	:cc; bh=Jjgqn1TRpCSDpC4tddlOByYzHQYmlj8OmoChPOW6deM=;
	b=YGQPQT2zkiKIvNA+HOoADJJ3/6wAY37ZiMXNFpIXWUieK5ayWlo7bVsXJd+Zg9/x+E
	Df30BgXyupCwea6P0H0+VipiDHbXcFAN9wzNdTjBp0Dnao3+MmHrKTEyUYiBmrSS/rQ9
	m3uUsmMPcTa/T2aew0Lw+MS8iiBmy/sLDPrpWfTyTpPM1nhItLL/Xdp4o96/ICVyhsUJ
	abIN5I4RA7uvAhZEaLK4paRPuWMPGQXCdiVTZTTvYjYKR1ZdFA7oeQMGb/zX3Kc+CBlp
	M3iJpbXN2bMIVgcLovE0+/Ha4jgLOQzEzzvzZeoxNRqySRp12a1zMlTiGQTicZdqS/jq
	wMnA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:mime-version:references:in-reply-to:from:date
	:message-id:subject:to:cc;
	bh=Jjgqn1TRpCSDpC4tddlOByYzHQYmlj8OmoChPOW6deM=;
	b=nkn0vY6b8ZHObWkF2EAcTMwlmc2Y3HJvphl5GSunLUu3MGW+UPptbalmyzBTDemkoT
	koGTfHwAkUpnLNoEDddtwqwEYjwGlpUMfOMoL1L1g/azouwXRrajWFeijfnFf3ov7yCf
	b6orPXrTccVfv9BlMcL040Z56yj2mWpj8mNUTI2SvOMY/uWzJ6qHZQyXROvQEpi1FyUT
	q9c1tH3VVgTEqI6X7oRbcnz6guGUPeaadFDN0X6mXaGTnbXHylkJUc2PsK7UxyYDycu7
	v9TYipHcJUqvZVJQjtbqPg5jQnXCJjUjnxKs1NoM88xvIuZm/Z4IjdJfqfNIwfTSdVVV
	aI1A==
X-Gm-Message-State: APjAAAW5HvGdeVhIfrJEeE5YS5Y86VBCMCz0PpzMUfcxgTk8TIIV0V0y
	StgoA0VOwf3HWUOlMNVoKrTiEq1Oxsn8c4E5W3E=
X-Google-Smtp-Source: APXvYqyCDOsvgiJCa51DoznE/xlCBMI813LenlVa4Vw6bqnR/xjFRrGXjfbZn7CT2nvoL1dFKWAltLlz8YMtWbDMiq4=
X-Received: by 2002:a02:a916:: with SMTP id n22mr40303853jam.40.1555083661303; 
	Fri, 12 Apr 2019 08:41:01 -0700 (PDT)
In-Reply-To: <CAARdmY0VxaC9LdACyv_uzGZ-9Sinj3BmknwFGR6dWvGk21gtkQ@mail.gmail.com>
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-Received-From: 2607:f8b0:4864:20::132
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel/>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: "Emacs-devel" <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.devel:235347
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/235347>

--0000000000003a67710586572167
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi

Adding some thoughts on why current fa implementation is so slow.

It strives to keep well nested trees in all steps, so it spends a lot of
time on branch compassion. This ensures well nested trees for the fa and
eases final fa to regexp conversion.

If we drop well-formness for fa and manage to convert an not well-formed fa
to regexp, the process would be very fast...

Regards

Miguel

Em sex, 12 de abr de 2019 12:06, Miguel V. S. Frasson <mvsfrasson@gmail.com=
>
escreveu:

> Dear Stefan and others
>
> Some time ago I suggested an improvement in regexp-opt, factoring
> similarities at the end of groups. At the end, Stefan wrote:
>
> Em sex, 8 de fev de 2019 =C3=A0s 01:48, Stefan Monnier
> <monnier@iro.umontreal.ca> escreveu:
> > A much better approach is to go for a real "regexp to NFA/DFA
> > conversion".  The `lex.el` package is one such example, but it's very
> > inefficient (in terms of building the FA and in the size of the FA, not
> > in terms of running the FA).
>
> After some time, I had an idea of simplification by FA.  The base idea
> is implement FA as "nodes" being lists of ARROWi and arrows being
> (CHAR . NODE). For example the initial FA for strings ("abd" "acd") is
>
>  >1 --a--> 2 --b--> 3 --d--> 4 --epsilon--> nil
>            |                                 ^
>            +---c--> 5 --d--> 6 --epsilon-----|
> and inplemented as
> (?a (?b (?d (0))) (?c (?d (0))))
> =3D ((?a . ((?b . ((?d . ((0 . nil)))))
>           (?c . ((?d . ((0 . nil)))))))
>
> Note that nodes 3 and 5 are `equal'.  Simplification is to make them `eq'=
.
>
> Also, there is simplification where a node is equal to a subset of a
> parent node, resulting is a ? construction.
> For example,
>             +---f--> 9 ----------epsilon -------------\
>             |                                          v
>   >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon--> nil
>             |                 ^
>             +---d--> 6 --e---/
> Node 4 is equal to a subnode derived from 2, resulting on
>             +---epsilon-------+
>             |                 v
>   >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon--> nil
>             |                 ^
>             +---d--> 6 --e--> 7
>
> I made an inplementation, a patch to regexp-opt.el
>
> The pros:
> If the resulting strings "came from" a regexp that is splittable, the
> FA implementation always simplifies to it.  In pratice, these are
> uncommun, and in most cases, the results are equivalent.
>
> The cons:
> The algorithm for FA seams to have greater computation complexity,
> takes about 20 times to compute in average.
>
> Example, the case on the
>
> (setq strings '("cond" "if" "when" "unless" "while"
>                 "let" "let*" "progn" "prog1" "prog2"
>                 "save-restriction" "save-excursion" "save-window-excursio=
n"
>                 "save-current-buffer" "save-match-data"
>                 "catch" "throw" "unwind-protect" "condition-case"))
>
> (regexp-opt strings) ->
>
> "\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\=
|save-\\(?:current-buffer\\|excursion\\|match-data\\|\\(?:restrict\\|window=
-excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\=
\)\\)"
>
> (regexp-opt2 strings) ->
>
> "\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\=
|save-\\(?:current-buffer\\|match-data\\|\\(?:restrict\\|\\(?:window-\\)?ex=
curs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\\)\=
\)"
>
> The difference is that FA algorithm
>
> BUT my version takes 70 times more time to compute.
>
> Example 2:
> (setq strings2 '("car" "cdr"
>                  "caar" "cadr" "cdar" "cddr"
>                  "caaar" "caadr" "cadar" "caddr"
>                  "cdaar" "cdadr" "cddar" "cdddr"))
>
> (regexp-opt strings2) ->
>
> "\\(?:c\\(?:\\(?:a\\(?:a[ad]\\|d[ad]\\|[ad]\\)\\|d\\(?:a[ad]\\|d[ad]\\|[a=
d]\\)\\|[ad]\\)r\\)\\)"
>
> (regexp-opt2 strings2) ->
> "\\(?:c[ad]\\(?:[ad][ad]?\\)?r\\)"
>
> FA is 7 times slower here.
>
> If this implementation is useful, I would like very much to contribute it=
.
>
> I actually have the other implementation from previous idea. It is
> faster than FA, same complexity of current regexp-opt, a bit slower of
> course, but I like this better.
>
> Best regards
>
> Miguel Frasson
>
> --
> Miguel Vinicius Santini Frasson
> mvsfrasson@gmail.com
>

--0000000000003a67710586572167
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"auto">Hi<div dir=3D"auto"><br></div><div dir=3D"auto">Adding so=
me thoughts on why current fa implementation is so slow.=C2=A0</div><div di=
r=3D"auto"><br></div><div dir=3D"auto">It strives to keep well nested trees=
 in all steps, so it spends a lot of time on branch compassion. This ensure=
s well nested trees for the fa and eases final fa to regexp conversion.=C2=
=A0</div><div dir=3D"auto"><br></div><div dir=3D"auto">If we drop well-form=
ness for fa and manage to convert an not well-formed fa to regexp, the proc=
ess would be very fast...=C2=A0</div><div dir=3D"auto"><br></div><div dir=
=3D"auto">Regards</div><div dir=3D"auto"><br></div><div dir=3D"auto">Miguel=
</div></div><br><div class=3D"gmail_quote"><div dir=3D"ltr" class=3D"gmail_=
attr">Em sex, 12 de abr de 2019 12:06, Miguel V. S. Frasson &lt;<a href=3D"=
mailto:mvsfrasson@gmail.com">mvsfrasson@gmail.com</a>&gt; escreveu:<br></di=
v><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:=
1px #ccc solid;padding-left:1ex">Dear Stefan and others<br>
<br>
Some time ago I suggested an improvement in regexp-opt, factoring<br>
similarities at the end of groups. At the end, Stefan wrote:<br>
<br>
Em sex, 8 de fev de 2019 =C3=A0s 01:48, Stefan Monnier<br>
&lt;<a href=3D"mailto:monnier@iro.umontreal.ca" target=3D"_blank" rel=3D"no=
referrer">monnier@iro.umontreal.ca</a>&gt; escreveu:<br>
&gt; A much better approach is to go for a real &quot;regexp to NFA/DFA<br>
&gt; conversion&quot;.=C2=A0 The `lex.el` package is one such example, but =
it&#39;s very<br>
&gt; inefficient (in terms of building the FA and in the size of the FA, no=
t<br>
&gt; in terms of running the FA).<br>
<br>
After some time, I had an idea of simplification by FA.=C2=A0 The base idea=
<br>
is implement FA as &quot;nodes&quot; being lists of ARROWi and arrows being=
<br>
(CHAR . NODE). For example the initial FA for strings (&quot;abd&quot; &quo=
t;acd&quot;) is<br>
<br>
=C2=A0&gt;1 --a--&gt; 2 --b--&gt; 3 --d--&gt; 4 --epsilon--&gt; nil<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0^<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0+---c--&gt; 5 --d--&gt; 6 --epsilo=
n-----|<br>
and inplemented as<br>
(?a (?b (?d (0))) (?c (?d (0))))<br>
=3D ((?a . ((?b . ((?d . ((0 . nil)))))<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 (?c . ((?d . ((0 . nil)))))))<br>
<br>
Note that nodes 3 and 5 are `equal&#39;.=C2=A0 Simplification is to make th=
em `eq&#39;.<br>
<br>
Also, there is simplification where a node is equal to a subset of a<br>
parent node, resulting is a ? construction.<br>
For example,<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 +---f--&gt; 9 ----------epsilon -=
------------\<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 v<br>
=C2=A0 &gt;1 --a--&gt; 2 --b--&gt; 3 --c--&gt; 4 --f--&gt; 5 --epsilon--&gt=
; nil<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0^<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 +---d--&gt; 6 --e---/<br>
Node 4 is equal to a subnode derived from 2, resulting on<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 +---epsilon-------+<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0v<br>
=C2=A0 &gt;1 --a--&gt; 2 --b--&gt; 3 --c--&gt; 4 --f--&gt; 5 --epsilon--&gt=
; nil<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 |=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0^<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 +---d--&gt; 6 --e--&gt; 7<br>
<br>
I made an inplementation, a patch to regexp-opt.el<br>
<br>
The pros:<br>
If the resulting strings &quot;came from&quot; a regexp that is splittable,=
 the<br>
FA implementation always simplifies to it.=C2=A0 In pratice, these are<br>
uncommun, and in most cases, the results are equivalent.<br>
<br>
The cons:<br>
The algorithm for FA seams to have greater computation complexity,<br>
takes about 20 times to compute in average.<br>
<br>
Example, the case on the<br>
<br>
(setq strings &#39;(&quot;cond&quot; &quot;if&quot; &quot;when&quot; &quot;=
unless&quot; &quot;while&quot;<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 &quot;let&quot; &qu=
ot;let*&quot; &quot;progn&quot; &quot;prog1&quot; &quot;prog2&quot;<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 &quot;save-restrict=
ion&quot; &quot;save-excursion&quot; &quot;save-window-excursion&quot;<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 &quot;save-current-=
buffer&quot; &quot;save-match-data&quot;<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 &quot;catch&quot; &=
quot;throw&quot; &quot;unwind-protect&quot; &quot;condition-case&quot;))<br=
>
<br>
(regexp-opt strings) -&gt;<br>
&quot;\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n=
]\\|save-\\(?:current-buffer\\|excursion\\|match-data\\|\\(?:restrict\\|win=
dow-excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|i=
le\\)\\)&quot;<br>
<br>
(regexp-opt2 strings) -&gt;<br>
&quot;\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n=
]\\|save-\\(?:current-buffer\\|match-data\\|\\(?:restrict\\|\\(?:window-\\)=
?excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\=
\)\\)&quot;<br>
<br>
The difference is that FA algorithm<br>
<br>
BUT my version takes 70 times more time to compute.<br>
<br>
Example 2:<br>
(setq strings2 &#39;(&quot;car&quot; &quot;cdr&quot;<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0&quot;caar&qu=
ot; &quot;cadr&quot; &quot;cdar&quot; &quot;cddr&quot;<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0&quot;caaar&q=
uot; &quot;caadr&quot; &quot;cadar&quot; &quot;caddr&quot;<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0&quot;cdaar&q=
uot; &quot;cdadr&quot; &quot;cddar&quot; &quot;cdddr&quot;))<br>
<br>
(regexp-opt strings2) -&gt;<br>
&quot;\\(?:c\\(?:\\(?:a\\(?:a[ad]\\|d[ad]\\|[ad]\\)\\|d\\(?:a[ad]\\|d[ad]\\=
|[ad]\\)\\|[ad]\\)r\\)\\)&quot;<br>
<br>
(regexp-opt2 strings2) -&gt;<br>
&quot;\\(?:c[ad]\\(?:[ad][ad]?\\)?r\\)&quot;<br>
<br>
FA is 7 times slower here.<br>
<br>
If this implementation is useful, I would like very much to contribute it.<=
br>
<br>
I actually have the other implementation from previous idea. It is<br>
faster than FA, same complexity of current regexp-opt, a bit slower of<br>
course, but I like this better.<br>
<br>
Best regards<br>
<br>
Miguel Frasson<br>
<br>
-- <br>
Miguel Vinicius Santini Frasson<br>
<a href=3D"mailto:mvsfrasson@gmail.com" target=3D"_blank" rel=3D"noreferrer=
">mvsfrasson@gmail.com</a><br>
</blockquote></div>

--0000000000003a67710586572167--