From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Le Wang <l26wang@gmail.com>
Newsgroups: gmane.emacs.devel
Subject: Re: Emacs needs truely useful flex matching
Date: Mon, 15 Apr 2013 00:48:21 +0800
Message-ID: <CAM=K+ipDy92SpLfR_mMEW+Rbhzz3KpJv4uypxKwHhhQOfJp4kg@mail.gmail.com>
References: <CAM=K+irfX2-z0s6uYBfOZiTXP4PGSo1x7MizLsed3zUa5tMByw@mail.gmail.com>
	<877gl0od6x.fsf@wanadoo.es> <jwvwqt0wbvh.fsf-monnier+emacs@gnu.org>
	<CAM=K+iprKKQx0sqvxwDxpd65rKcs1=WfsLSDfuBkppvZ7qnofA@mail.gmail.com>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary=047d7beb95c0696b1104da54e72f
X-Trace: ger.gmane.org 1365958105 6900 80.91.229.3 (14 Apr 2013 16:48:25 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sun, 14 Apr 2013 16:48:25 +0000 (UTC)
Cc: =?ISO-8859-1?Q?=D3scar_Fuentes?= <ofv@wanadoo.es>, emacs-devel@gnu.org
To: Stefan Monnier <monnier@iro.umontreal.ca>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun Apr 14 18:48:29 2013
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1URQ6O-0001GQ-Kj
	for ged-emacs-devel@m.gmane.org; Sun, 14 Apr 2013 18:48:28 +0200
Original-Received: from localhost ([::1]:58438 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1URQ6O-00029f-9H
	for ged-emacs-devel@m.gmane.org; Sun, 14 Apr 2013 12:48:28 -0400
Original-Received: from eggs.gnu.org ([208.118.235.92]:43225)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <l26wang@gmail.com>) id 1URQ6K-00029O-ND
	for emacs-devel@gnu.org; Sun, 14 Apr 2013 12:48:26 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <l26wang@gmail.com>) id 1URQ6J-0004vz-21
	for emacs-devel@gnu.org; Sun, 14 Apr 2013 12:48:24 -0400
Original-Received: from mail-wg0-f46.google.com ([74.125.82.46]:50993)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <l26wang@gmail.com>) id 1URQ6I-0004vd-PR
	for emacs-devel@gnu.org; Sun, 14 Apr 2013 12:48:22 -0400
Original-Received: by mail-wg0-f46.google.com with SMTP id e11so1755510wgh.1
	for <emacs-devel@gnu.org>; Sun, 14 Apr 2013 09:48:21 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:x-received:in-reply-to:references:date:message-id
	:subject:from:to:cc:content-type;
	bh=hK22aFdRVDt1mUXHBIYP8gPTwnaUcExPbiHM+B3dPlU=;
	b=ss/5APGaZAzOQuws/38+TtfzlsrIpsmWnOXjiCOOVXDo6UgHRDwf96gjTXJ8aHiJPF
	eAhxAY4nDNE9bSGYedl/Y+TUAHbgmsO+WMW9QsGFP7ostqpJbC+oNIfT5hwLRCxKf4vH
	6Jsvn3/cTiPfF0Oq1B90+YTgulyjYcpki3COBpMwU/fIZB3PZKIb9rwSaioZBhB++DID
	p+wChUk2W3/DMPFquqXo4Cv+H+1AMYRr2hqwVfbB58rSVhCTQ0F9xALNBcUona68FxfE
	+h57J83REMYPCUebIA3YjkcPmiRl/jY7/B5GnqTpNbemyuyerDKrPpfvxtvUjD+Oqt0l
	+yQQ==
X-Received: by 10.194.82.34 with SMTP id f2mr25486933wjy.25.1365958101394;
	Sun, 14 Apr 2013 09:48:21 -0700 (PDT)
Original-Received: by 10.217.116.8 with HTTP; Sun, 14 Apr 2013 09:48:21 -0700 (PDT)
In-Reply-To: <CAM=K+iprKKQx0sqvxwDxpd65rKcs1=WfsLSDfuBkppvZ7qnofA@mail.gmail.com>
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x [fuzzy]
X-Received-From: 74.125.82.46
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:158905
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/158905>

--047d7beb95c0696b1104da54e72f
Content-Type: text/plain; charset=ISO-8859-1

On Fri, Mar 22, 2013 at 9:00 AM, Le Wang <l26wang@gmail.com> wrote:

> On Fri, Mar 22, 2013 at 7:58 AM, Stefan Monnier
> <monnier@iro.umontreal.ca> wrote:
> >>> The sorting algorithm is roughly this for a query: "abcd"
> >>>
> >>> 1. Get all matches for "a.*b.*c.*c"
> >>> 2. Calculate score of each match
> >>> - contiguous matched chars gets a boost
> >>> - matches at word and camelCase boundaries (abbreviation) get a boost
> >>> - matches with smallest starting index gets a boost
> >>> 2. Sort list according to score.
> >
> > I think that if you turn "abcd" into a regexp of the form
> >
> "\\(\\<\\)?a\\([^b]*\\)\\(\\<\\)?b\\([^c]*\\)\\(\\<\\)?c\\([^d]*\\)\\(\\<\\)?d"
> > the regexp matching should be fairly efficient and you should be able to
> > compute the score efficiently as well (at least if
> > you ignore the camelCase boundaries).
>
> I hadn't thought of this, and I'll try it soon.


I gave this a good try.  :-)

Since we are favouring beginning of word anchors (often but not always), I
actually had to insert "\\<" in the various slots in front of characters.
 That is all permutations of 4x"\\<", then 3x, then 2x, then 1x, etc.  I
bumped into the regexp length limit very quickly and it wasn't fast enough
even when it did work.

However, it turns out that emacs-lisp is fast enough with a tweaked
algorithm -- no regexps at all. To wit,

1. For each string, we allocate a hash keyed by a character and value is a
sorted list indices into the string for occurence of the key.
 (char-indices-hash)

2. For each string, we allocate a vector of same length do static analysis
to assign a heat value to each position.  (call this the heatmap-vector)

3. For a query "abcd" we can quickly find all combinations of ascending
indices using char-indices-hash.

4. For each of these combinations, we can compute a heat value based on the
heatmap-vector.  We take the max of all heat values as the "score" of the
query against the string.

5. Order matches by score.

The algorithm works fast.  I believe it has feature parity with Sublime
Text's fuzzy search.

However it uses a fair bit of memory,

1. it's allocating one hash table and one same-length vector for each string

2. it's allocating one cons cell for each character.


Does anyone have any optimisation ideas to use less memory?  I will have
code for review in the coming days.


-- 
Le

--047d7beb95c0696b1104da54e72f
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><br><div class=3D"gmail_extra"><br><br><div class=3D"gmail=
_quote">On Fri, Mar 22, 2013 at 9:00 AM, Le Wang <span dir=3D"ltr">&lt;<a h=
ref=3D"mailto:l26wang@gmail.com" target=3D"_blank">l26wang@gmail.com</a>&gt=
;</span> wrote:<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-=
left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;p=
adding-left:1ex"><div class=3D"im">On Fri, Mar 22, 2013 at 7:58 AM, Stefan =
Monnier<br>

&lt;<a href=3D"mailto:monnier@iro.umontreal.ca">monnier@iro.umontreal.ca</a=
>&gt; wrote:<br>
&gt;&gt;&gt; The sorting algorithm is roughly this for a query: &quot;abcd&=
quot;<br>
&gt;&gt;&gt;<br>
&gt;&gt;&gt; 1. Get all matches for &quot;a.*b.*c.*c&quot;<br>
&gt;&gt;&gt; 2. Calculate score of each match<br>
&gt;&gt;&gt; - contiguous matched chars gets a boost<br>
&gt;&gt;&gt; - matches at word and camelCase boundaries (abbreviation) get =
a boost<br>
&gt;&gt;&gt; - matches with smallest starting index gets a boost<br>
&gt;&gt;&gt; 2. Sort list according to score.<br>
&gt;<br>
&gt; I think that if you turn &quot;abcd&quot; into a regexp of the form<br=
>
&gt; &quot;\\(\\&lt;\\)?a\\([^b]*\\)\\(\\&lt;\\)?b\\([^c]*\\)\\(\\&lt;\\)?c=
\\([^d]*\\)\\(\\&lt;\\)?d&quot;<br>
&gt; the regexp matching should be fairly efficient and you should be able =
to<br>
&gt; compute the score efficiently as well (at least if<br>
&gt; you ignore the camelCase boundaries).<br>
<br>
</div>I hadn&#39;t thought of this, and I&#39;ll try it soon.</blockquote><=
div><br></div><div><div class=3D"gmail_quote"><div><br class=3D"">I gave th=
is a good try. =A0:-)</div><div><br></div><div>Since we are favouring begin=
ning of word anchors (often but not always), I actually had to insert &quot=
;\\&lt;&quot; in the various slots in front of characters. =A0That is all p=
ermutations of 4x&quot;\\&lt;&quot;, then 3x, then 2x, then 1x, etc. =A0I b=
umped into the regexp length limit very quickly and it wasn&#39;t fast enou=
gh even when it did work.</div>
<div><br></div><div>However, it turns out that emacs-lisp is fast enough wi=
th a tweaked algorithm -- no regexps at all. To wit,</div><div><br></div><d=
iv>1. For each string, we allocate a hash keyed by a character and value is=
 a sorted list indices into the string for occurence of the key. =A0(char-i=
ndices-hash)</div>
<div><br></div><div>2. For each string, we allocate a vector of same length=
 do static analysis to assign a heat value to each position. =A0(call this =
the heatmap-vector)</div><div><br></div><div>3. For a query &quot;abcd&quot=
; we can quickly find all combinations of ascending indices using char-indi=
ces-hash.</div>
<div><br></div><div>4. For each of these combinations, we can compute a hea=
t value based on the heatmap-vector. =A0We take the max of all heat values =
as the &quot;score&quot; of the query against the string.</div><div><br></d=
iv>
<div>5. Order matches by score.</div><div><br></div><div>The algorithm work=
s fast. =A0I believe it has feature parity with Sublime Text&#39;s fuzzy se=
arch.</div><div><br></div><div>However it uses a fair bit of memory,</div>
<div><br></div><div>1. it&#39;s allocating one hash table and one same-leng=
th vector for each string</div><div><br></div><div>2. it&#39;s allocating o=
ne cons cell for each character.</div><div><br></div><div><br></div><div>
Does anyone have any=A0optimisation=A0ideas to use less memory? =A0I will h=
ave code for review in the coming days.<br><br></div><div><br></div></div><=
/div></div>-- <br>Le
</div></div>

--047d7beb95c0696b1104da54e72f--