From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Le Wang Newsgroups: gmane.emacs.devel Subject: Re: Emacs needs truely useful flex matching Date: Mon, 15 Apr 2013 00:48:21 +0800 Message-ID: References: <877gl0od6x.fsf@wanadoo.es> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: multipart/alternative; boundary=047d7beb95c0696b1104da54e72f X-Trace: ger.gmane.org 1365958105 6900 80.91.229.3 (14 Apr 2013 16:48:25 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Sun, 14 Apr 2013 16:48:25 +0000 (UTC) Cc: =?ISO-8859-1?Q?=D3scar_Fuentes?= , emacs-devel@gnu.org To: Stefan Monnier Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Sun Apr 14 18:48:29 2013 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1URQ6O-0001GQ-Kj for ged-emacs-devel@m.gmane.org; Sun, 14 Apr 2013 18:48:28 +0200 Original-Received: from localhost ([::1]:58438 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1URQ6O-00029f-9H for ged-emacs-devel@m.gmane.org; Sun, 14 Apr 2013 12:48:28 -0400 Original-Received: from eggs.gnu.org ([208.118.235.92]:43225) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1URQ6K-00029O-ND for emacs-devel@gnu.org; Sun, 14 Apr 2013 12:48:26 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1URQ6J-0004vz-21 for emacs-devel@gnu.org; Sun, 14 Apr 2013 12:48:24 -0400 Original-Received: from mail-wg0-f46.google.com ([74.125.82.46]:50993) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1URQ6I-0004vd-PR for emacs-devel@gnu.org; Sun, 14 Apr 2013 12:48:22 -0400 Original-Received: by mail-wg0-f46.google.com with SMTP id e11so1755510wgh.1 for ; Sun, 14 Apr 2013 09:48:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=hK22aFdRVDt1mUXHBIYP8gPTwnaUcExPbiHM+B3dPlU=; b=ss/5APGaZAzOQuws/38+TtfzlsrIpsmWnOXjiCOOVXDo6UgHRDwf96gjTXJ8aHiJPF eAhxAY4nDNE9bSGYedl/Y+TUAHbgmsO+WMW9QsGFP7ostqpJbC+oNIfT5hwLRCxKf4vH 6Jsvn3/cTiPfF0Oq1B90+YTgulyjYcpki3COBpMwU/fIZB3PZKIb9rwSaioZBhB++DID p+wChUk2W3/DMPFquqXo4Cv+H+1AMYRr2hqwVfbB58rSVhCTQ0F9xALNBcUona68FxfE +h57J83REMYPCUebIA3YjkcPmiRl/jY7/B5GnqTpNbemyuyerDKrPpfvxtvUjD+Oqt0l +yQQ== X-Received: by 10.194.82.34 with SMTP id f2mr25486933wjy.25.1365958101394; Sun, 14 Apr 2013 09:48:21 -0700 (PDT) Original-Received: by 10.217.116.8 with HTTP; Sun, 14 Apr 2013 09:48:21 -0700 (PDT) In-Reply-To: X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x [fuzzy] X-Received-From: 74.125.82.46 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:158905 Archived-At: --047d7beb95c0696b1104da54e72f Content-Type: text/plain; charset=ISO-8859-1 On Fri, Mar 22, 2013 at 9:00 AM, Le Wang wrote: > On Fri, Mar 22, 2013 at 7:58 AM, Stefan Monnier > wrote: > >>> The sorting algorithm is roughly this for a query: "abcd" > >>> > >>> 1. Get all matches for "a.*b.*c.*c" > >>> 2. Calculate score of each match > >>> - contiguous matched chars gets a boost > >>> - matches at word and camelCase boundaries (abbreviation) get a boost > >>> - matches with smallest starting index gets a boost > >>> 2. Sort list according to score. > > > > I think that if you turn "abcd" into a regexp of the form > > > "\\(\\<\\)?a\\([^b]*\\)\\(\\<\\)?b\\([^c]*\\)\\(\\<\\)?c\\([^d]*\\)\\(\\<\\)?d" > > the regexp matching should be fairly efficient and you should be able to > > compute the score efficiently as well (at least if > > you ignore the camelCase boundaries). > > I hadn't thought of this, and I'll try it soon. I gave this a good try. :-) Since we are favouring beginning of word anchors (often but not always), I actually had to insert "\\<" in the various slots in front of characters. That is all permutations of 4x"\\<", then 3x, then 2x, then 1x, etc. I bumped into the regexp length limit very quickly and it wasn't fast enough even when it did work. However, it turns out that emacs-lisp is fast enough with a tweaked algorithm -- no regexps at all. To wit, 1. For each string, we allocate a hash keyed by a character and value is a sorted list indices into the string for occurence of the key. (char-indices-hash) 2. For each string, we allocate a vector of same length do static analysis to assign a heat value to each position. (call this the heatmap-vector) 3. For a query "abcd" we can quickly find all combinations of ascending indices using char-indices-hash. 4. For each of these combinations, we can compute a heat value based on the heatmap-vector. We take the max of all heat values as the "score" of the query against the string. 5. Order matches by score. The algorithm works fast. I believe it has feature parity with Sublime Text's fuzzy search. However it uses a fair bit of memory, 1. it's allocating one hash table and one same-length vector for each string 2. it's allocating one cons cell for each character. Does anyone have any optimisation ideas to use less memory? I will have code for review in the coming days. -- Le --047d7beb95c0696b1104da54e72f Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable




2. For each string, we allocate a vector of same length= do static analysis to assign a heat value to each position. =A0(call this = the heatmap-vector)

3. For a query "abcd"= ; we can quickly find all combinations of ascending indices using char-indi= ces-hash.

4. For each of these combinations, we can compute a hea= t value based on the heatmap-vector. =A0We take the max of all heat values = as the "score" of the query against the string.

5. Order matches by score.

The algorithm work= s fast. =A0I believe it has feature parity with Sublime Text's fuzzy se= arch.

However it uses a fair bit of memory,

1. it's allocating one hash table and one same-leng= th vector for each string

2. it's allocating o= ne cons cell for each character.


Does anyone have any=A0optimisation=A0ideas to use less memory? =A0I will h= ave code for review in the coming days.


<= /div>
--
Le
--047d7beb95c0696b1104da54e72f--