From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Le Wang <l26wang@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: Emacs needs truely useful flex matching
Date: Mon, 15 Apr 2013 08:14:44 +0800
Message-ID: <CAM=K+irSz6aDT8TN+JDNz-goZr4cE6MeUExePJ_gUNGZrk9Z_Q@mail.gmail.com>
References: <CAM=K+irfX2-z0s6uYBfOZiTXP4PGSo1x7MizLsed3zUa5tMByw@mail.gmail.com>
	<877gl0od6x.fsf@wanadoo.es> <jwvwqt0wbvh.fsf-monnier+emacs@gnu.org>
	<CAM=K+iprKKQx0sqvxwDxpd65rKcs1=WfsLSDfuBkppvZ7qnofA@mail.gmail.com>
	<CAM=K+ipDy92SpLfR_mMEW+Rbhzz3KpJv4uypxKwHhhQOfJp4kg@mail.gmail.com>
	<jwvvc7pf08y.fsf-monnier+emacs@gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=047d7beb95c0c9fc4604da5b2321
X-Trace: ger.gmane.org 1365984888 25300 80.91.229.3 (15 Apr 2013 00:14:48 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Mon, 15 Apr 2013 00:14:48 +0000 (UTC)
Cc: =?ISO-8859-1?Q?=D3scar_Fuentes?= <ofv@wanadoo.es>, emacs-devel@gnu.org
To: Stefan Monnier <monnier@iro.umontreal.ca>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Apr 15 02:14:52 2013
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1URX4M-0004KG-RC
	for ged-emacs-devel@m.gmane.org; Mon, 15 Apr 2013 02:14:51 +0200
Original-Received: from localhost ([::1]:42084 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1URX4M-0000D6-6j
	for ged-emacs-devel@m.gmane.org; Sun, 14 Apr 2013 20:14:50 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:47119)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <l26wang@gmail.com>) id 1URX4I-0000D1-Ju
	for emacs-devel@gnu.org; Sun, 14 Apr 2013 20:14:47 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <l26wang@gmail.com>) id 1URX4H-0005JE-8Y
	for emacs-devel@gnu.org; Sun, 14 Apr 2013 20:14:46 -0400
Original-Received: from mail-wi0-x22a.google.com ([2a00:1450:400c:c05::22a]:50102)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <l26wang@gmail.com>) id 1URX4G-0005J9-VC
	for emacs-devel@gnu.org; Sun, 14 Apr 2013 20:14:45 -0400
Original-Received: by mail-wi0-f170.google.com with SMTP id hm11so1055642wib.3
	for <emacs-devel@gnu.org>; Sun, 14 Apr 2013 17:14:44 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:x-received:in-reply-to:references:date:message-id
	:subject:from:to:cc:content-type;
	bh=4elRamuTZVPbJhuLdoVzTEhf8Ytopyqq+pvWUyY0/Os=;
	b=SnZljZWukenqphHoMzqS1ce1TlCvLIuq3BUCVx0xcnzCMLwGKceAiyasnWb3whutg4
	Aq9VgbwDDCssRZf42hVJKLfZtvD9FSnx6lc7xq6pS5dbC9NimpH1X14vNjvPragXha0B
	avB6UADDx8zVSqkUDEDsSrbqXk/3RD01TsFipmkGXOJBWqRuMVHoNskAXe9/Xzz1jfqq
	S6V6dCGAfcp70PGV5JCh1inUTvVVfmSN3kM1VPnL2PGeo8q7ZWyQX3hbncypI8Sz8MEj
	uTPK9vPN6g9l4tbVNwo8X8+EfYFJmn8XWQ6DKs2+/bENWTBj+8v8kerv2oxwYjnE7y8R
	Dr6A==
X-Received: by 10.194.82.34 with SMTP id f2mr26620634wjy.25.1365984884159;
	Sun, 14 Apr 2013 17:14:44 -0700 (PDT)
Original-Received: by 10.217.116.8 with HTTP; Sun, 14 Apr 2013 17:14:44 -0700 (PDT)
In-Reply-To: <jwvvc7pf08y.fsf-monnier+emacs@gnu.org>
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
	(bad octet value).
X-Received-From: 2a00:1450:400c:c05::22a
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:158910
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/158910>

--047d7beb95c0c9fc4604da5b2321
Content-Type: text/plain; charset=ISO-8859-1

On Mon, Apr 15, 2013 at 2:18 AM, Stefan Monnier <monnier@iro.umontreal.ca>wrote:

> >>
> "\\(\\<\\)?a\\([^b]*\\)\\(\\<\\)?b\\([^c]*\\)\\(\\<\\)?c\\([^d]*\\)\\(\\<\\)?d"
> >> > the regexp matching should be fairly efficient and you should be able
> to
> >> > compute the score efficiently as well (at least if
> >> > you ignore the camelCase boundaries).
> >> I hadn't thought of this, and I'll try it soon.
> > I gave this a good try.  :-)
> > Since we are favouring beginning of word anchors (often but not always),
> I
> > actually had to insert "\\<" in the various slots in front of characters.
> > That is all permutations of 4x"\\<", then 3x, then 2x, then 1x, etc.
>
> I don't understand.  The example I gave already has all the "\\<" we
> need, doesn't it?


The regexp you gave works gave works the same as "a.*?b.*?c.*?d".  It
doesn't really prefer word boundaries.

Consider "fafbfcfd-bcd", your regexp will match "afbfcfd" from the first
word, but we want to prefer "a" from the first word, "bcd" from the second
word.

I could only match the preferred parts by inserting the max number of "\\<"
and then one less (but insert them in different places preferring places
closer to the beginning of the string.


> The algorithm works fast.  I believe it has feature parity with Sublime
> > Text's fuzzy search.
> > However it uses a fair bit of memory,
> > 1. it's allocating one hash table and one same-length vector for each
> string
> > 2. it's allocating one cons cell for each character.
>
> Sounds great.  It does seem to require a fair bit of pre-processing, tho.
> When you say it's fast, does that include the time it takes to build the
> hash tables from the list of potential candidates?
>

I missed describing the top level caching layer.  I keep two global caches
-- one for filenames, one for other strings.  The key is a completion
string and the value is the analysis for that string (hashtable + heatmap).

The algorithm works differently for files than other strings, scoring
different path segments differently (basepath is especially preferred).

When I say fast I mean the matching and scoring is fast, once the static
analysis is done.

Warming up the cache with 6300 filenames (average length 44 characters) it
takes 0.7 seconds and invokes the garbage collector 8 times.

Actually can I skip the GC somehow since I know I'm about to allocate a
bunch of memory that never gets freed.

-- 
Le

--047d7beb95c0c9fc4604da5b2321
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">On Mon, Apr 15, 2013 at 2:18 AM, Stefan Monnier <span dir=
=3D"ltr">&lt;<a href=3D"mailto:monnier@iro.umontreal.ca" target=3D"_blank">=
monnier@iro.umontreal.ca</a>&gt;</span> wrote:<br><div class=3D"gmail_extra=
"><div class=3D"gmail_quote">
<blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-=
left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;p=
adding-left:1ex"><div class=3D"im">&gt;&gt; &quot;\\(\\&lt;\\)?a\\([^b]*\\)=
\\(\\&lt;\\)?b\\([^c]*\\)\\(\\&lt;\\)?c\\([^d]*\\)\\(\\&lt;\\)?d&quot;<br>

&gt;&gt; &gt; the regexp matching should be fairly efficient and you should=
 be able to<br>
&gt;&gt; &gt; compute the score efficiently as well (at least if<br>
&gt;&gt; &gt; you ignore the camelCase boundaries).<br>
&gt;&gt; I hadn&#39;t thought of this, and I&#39;ll try it soon.<br>
&gt; I gave this a good try. =A0:-)<br>
&gt; Since we are favouring beginning of word anchors (often but not always=
), I<br>
&gt; actually had to insert &quot;\\&lt;&quot; in the various slots in fron=
t of characters.<br>
&gt; That is all permutations of 4x&quot;\\&lt;&quot;, then 3x, then 2x, th=
en 1x, etc.<br>
<br>
</div>I don&#39;t understand. =A0The example I gave already has all the &qu=
ot;\\&lt;&quot; we<br>
need, doesn&#39;t it?</blockquote><div></div><div><br></div><div style>The =
regexp you gave works gave works the same as &quot;a.*?b.*?c.*?d&quot;. =A0=
It doesn&#39;t really prefer word boundaries.</div><div style><br></div>
<div style>Consider &quot;fafbfcfd-bcd&quot;, your regexp will match &quot;=
afbfcfd&quot; from the first word, but we want to prefer &quot;a&quot; from=
 the first word, &quot;bcd&quot; from the second word.</div><div style>
<br></div><div style>I could only match the preferred parts by inserting th=
e max number of &quot;\\&lt;&quot; and then one less (but insert them in di=
fferent places preferring places closer to the beginning of the string.</di=
v>
<div><br></div><div><br></div><blockquote class=3D"gmail_quote" style=3D"ma=
rgin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,=
204);border-left-style:solid;padding-left:1ex"><div class=3D"im">
&gt; The algorithm works fast. =A0I believe it has feature parity with Subl=
ime<br>
&gt; Text&#39;s fuzzy search.<br>
&gt; However it uses a fair bit of memory,<br>
&gt; 1. it&#39;s allocating one hash table and one same-length vector for e=
ach string<br>
&gt; 2. it&#39;s allocating one cons cell for each character.<br>
<br>
</div>Sounds great. =A0It does seem to require a fair bit of pre-processing=
, tho.<br>
When you say it&#39;s fast, does that include the time it takes to build th=
e<br>
hash tables from the list of potential candidates?<br></blockquote><div><br=
></div><div style>I missed describing the top level caching layer. =A0I kee=
p two global caches -- one for filenames, one for other strings. =A0The key=
 is a completion string and the value is the analysis for that string (hash=
table + heatmap).</div>
<div style><br></div><div style>The algorithm works differently for files t=
han other strings, scoring different path segments differently (basepath is=
 especially preferred).</div><div style><br></div><div style>When I say fas=
t I mean the matching and scoring is fast, once the static analysis is done=
.</div>
<div style><br></div><div style>Warming up the cache with 6300 filenames (a=
verage length 44 characters) it takes 0.7 seconds and invokes the garbage c=
ollector 8 times. =A0</div><div style><br></div><div style>Actually can I s=
kip the GC somehow since I know I&#39;m about to allocate a bunch of memory=
 that never gets freed.</div>
<div><br></div></div>-- <br>Le
</div></div>

--047d7beb95c0c9fc4604da5b2321--