From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Le Wang Newsgroups: gmane.emacs.devel Subject: Re: Emacs needs truely useful flex matching Date: Mon, 15 Apr 2013 08:14:44 +0800 Message-ID: References: <877gl0od6x.fsf@wanadoo.es> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: multipart/alternative; boundary=047d7beb95c0c9fc4604da5b2321 X-Trace: ger.gmane.org 1365984888 25300 80.91.229.3 (15 Apr 2013 00:14:48 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Mon, 15 Apr 2013 00:14:48 +0000 (UTC) Cc: =?ISO-8859-1?Q?=D3scar_Fuentes?= , emacs-devel@gnu.org To: Stefan Monnier Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Apr 15 02:14:52 2013 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1URX4M-0004KG-RC for ged-emacs-devel@m.gmane.org; Mon, 15 Apr 2013 02:14:51 +0200 Original-Received: from localhost ([::1]:42084 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1URX4M-0000D6-6j for ged-emacs-devel@m.gmane.org; Sun, 14 Apr 2013 20:14:50 -0400 Original-Received: from eggs.gnu.org ([208.118.235.92]:47119) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1URX4I-0000D1-Ju for emacs-devel@gnu.org; Sun, 14 Apr 2013 20:14:47 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1URX4H-0005JE-8Y for emacs-devel@gnu.org; Sun, 14 Apr 2013 20:14:46 -0400 Original-Received: from mail-wi0-x22a.google.com ([2a00:1450:400c:c05::22a]:50102) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1URX4G-0005J9-VC for emacs-devel@gnu.org; Sun, 14 Apr 2013 20:14:45 -0400 Original-Received: by mail-wi0-f170.google.com with SMTP id hm11so1055642wib.3 for ; Sun, 14 Apr 2013 17:14:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=4elRamuTZVPbJhuLdoVzTEhf8Ytopyqq+pvWUyY0/Os=; b=SnZljZWukenqphHoMzqS1ce1TlCvLIuq3BUCVx0xcnzCMLwGKceAiyasnWb3whutg4 Aq9VgbwDDCssRZf42hVJKLfZtvD9FSnx6lc7xq6pS5dbC9NimpH1X14vNjvPragXha0B avB6UADDx8zVSqkUDEDsSrbqXk/3RD01TsFipmkGXOJBWqRuMVHoNskAXe9/Xzz1jfqq S6V6dCGAfcp70PGV5JCh1inUTvVVfmSN3kM1VPnL2PGeo8q7ZWyQX3hbncypI8Sz8MEj uTPK9vPN6g9l4tbVNwo8X8+EfYFJmn8XWQ6DKs2+/bENWTBj+8v8kerv2oxwYjnE7y8R Dr6A== X-Received: by 10.194.82.34 with SMTP id f2mr26620634wjy.25.1365984884159; Sun, 14 Apr 2013 17:14:44 -0700 (PDT) Original-Received: by 10.217.116.8 with HTTP; Sun, 14 Apr 2013 17:14:44 -0700 (PDT) In-Reply-To: X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2a00:1450:400c:c05::22a X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.emacs.devel:158910 Archived-At: --047d7beb95c0c9fc4604da5b2321 Content-Type: text/plain; charset=ISO-8859-1 On Mon, Apr 15, 2013 at 2:18 AM, Stefan Monnier wrote: > >> > "\\(\\<\\)?a\\([^b]*\\)\\(\\<\\)?b\\([^c]*\\)\\(\\<\\)?c\\([^d]*\\)\\(\\<\\)?d" > >> > the regexp matching should be fairly efficient and you should be able > to > >> > compute the score efficiently as well (at least if > >> > you ignore the camelCase boundaries). > >> I hadn't thought of this, and I'll try it soon. > > I gave this a good try. :-) > > Since we are favouring beginning of word anchors (often but not always), > I > > actually had to insert "\\<" in the various slots in front of characters. > > That is all permutations of 4x"\\<", then 3x, then 2x, then 1x, etc. > > I don't understand. The example I gave already has all the "\\<" we > need, doesn't it? The regexp you gave works gave works the same as "a.*?b.*?c.*?d". It doesn't really prefer word boundaries. Consider "fafbfcfd-bcd", your regexp will match "afbfcfd" from the first word, but we want to prefer "a" from the first word, "bcd" from the second word. I could only match the preferred parts by inserting the max number of "\\<" and then one less (but insert them in different places preferring places closer to the beginning of the string. > The algorithm works fast. I believe it has feature parity with Sublime > > Text's fuzzy search. > > However it uses a fair bit of memory, > > 1. it's allocating one hash table and one same-length vector for each > string > > 2. it's allocating one cons cell for each character. > > Sounds great. It does seem to require a fair bit of pre-processing, tho. > When you say it's fast, does that include the time it takes to build the > hash tables from the list of potential candidates? > I missed describing the top level caching layer. I keep two global caches -- one for filenames, one for other strings. The key is a completion string and the value is the analysis for that string (hashtable + heatmap). The algorithm works differently for files than other strings, scoring different path segments differently (basepath is especially preferred). When I say fast I mean the matching and scoring is fast, once the static analysis is done. Warming up the cache with 6300 filenames (average length 44 characters) it takes 0.7 seconds and invokes the garbage collector 8 times. Actually can I skip the GC somehow since I know I'm about to allocate a bunch of memory that never gets freed. -- Le --047d7beb95c0c9fc4604da5b2321 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
On Mon, Apr 15, 2013 at 2:18 AM, Stefan Monnier <= monnier@iro.umontreal.ca> wrote:
>> "\\(\\<\\)?a\\([^b]*\\)= \\(\\<\\)?b\\([^c]*\\)\\(\\<\\)?c\\([^d]*\\)\\(\\<\\)?d"
>> > the regexp matching should be fairly efficient and you should= be able to
>> > compute the score efficiently as well (at least if
>> > you ignore the camelCase boundaries).
>> I hadn't thought of this, and I'll try it soon.
> I gave this a good try. =A0:-)
> Since we are favouring beginning of word anchors (often but not always= ), I
> actually had to insert "\\<" in the various slots in fron= t of characters.
> That is all permutations of 4x"\\<", then 3x, then 2x, th= en 1x, etc.

I don't understand. =A0The example I gave already has all the &qu= ot;\\<" we
need, doesn't it?

The = regexp you gave works gave works the same as "a.*?b.*?c.*?d". =A0= It doesn't really prefer word boundaries.

Consider "fafbfcfd-bcd", your regexp will match "= afbfcfd" from the first word, but we want to prefer "a" from= the first word, "bcd" from the second word.

I could only match the preferred parts by inserting th= e max number of "\\<" and then one less (but insert them in di= fferent places preferring places closer to the beginning of the string.


> The algorithm works fast. =A0I believe it has feature parity with Subl= ime
> Text's fuzzy search.
> However it uses a fair bit of memory,
> 1. it's allocating one hash table and one same-length vector for e= ach string
> 2. it's allocating one cons cell for each character.

Sounds great. =A0It does seem to require a fair bit of pre-processing= , tho.
When you say it's fast, does that include the time it takes to build th= e
hash tables from the list of potential candidates?
I missed describing the top level caching layer. =A0I kee= p two global caches -- one for filenames, one for other strings. =A0The key= is a completion string and the value is the analysis for that string (hash= table + heatmap).

The algorithm works differently for files t= han other strings, scoring different path segments differently (basepath is= especially preferred).

When I say fas= t I mean the matching and scoring is fast, once the static analysis is done= .

Warming up the cache with 6300 filenames (a= verage length 44 characters) it takes 0.7 seconds and invokes the garbage c= ollector 8 times. =A0

Actually can I s= kip the GC somehow since I know I'm about to allocate a bunch of memory= that never gets freed.

--
Le
--047d7beb95c0c9fc4604da5b2321--