On Mon, Apr 15, 2013 at 2:18 AM, Stefan Monnier wrote:
> >> > "\\(\\<\\)?a\\([^b]*\\)\\(\\<\\)?b\\([^c]*\\)\\(\\<\\)?c\\([^d]*\\)\\(\\<\\)?d"
> >> > the regexp matching should be fairly efficient and you should be able
> >> > to compute the score efficiently as well (at least if you ignore the
> >> > camelCase boundaries).
> >> I hadn't thought of this, and I'll try it soon.
> > I gave this a good try. :-)
> > Since we are favouring beginning of word anchors (often but not always),
> > I actually had to insert "\\<" in the various slots in front of characters.
> > That is all permutations of 4x"\\<", then 3x, then 2x, then 1x, etc.
>
> I don't understand.  The example I gave already has all the "\\<" we
> need, doesn't it?

The regexp you gave works the same as "a.*?b.*?c.*?d".  It doesn't
really prefer word boundaries.  Consider "fafbfcfd-bcd": your regexp
will match "afbfcfd" from the first word, but we want to prefer "a"
from the first word and "bcd" from the second word.  I could only
match the preferred parts by inserting the maximum number of "\\<",
then one fewer, and so on, trying the anchors in different slots and
preferring slots closer to the beginning of the string.

> > The algorithm works fast.  I believe it has feature parity with Sublime
> > Text's fuzzy search.
> > However it uses a fair bit of memory:
> > 1. it's allocating one hash table and one same-length vector for each
> >    string
> > 2. it's allocating one cons cell for each character.
>
> Sounds great.  It does seem to require a fair bit of pre-processing, tho.
> When you say it's fast, does that include the time it takes to build the
> hash tables from the list of potential candidates?

I missed describing the top-level caching layer.  I keep two global
caches -- one for filenames, one for other strings.  The key is a
completion string and the value is the analysis for that string (hash
table + heatmap).  The algorithm works differently for files than for
other strings, scoring different path segments differently (the
basepath is especially preferred).

When I say fast I mean the matching and scoring are fast, once the
static analysis is done.  Warming up the cache with 6300 filenames
(average length 44 characters) takes 0.7 seconds and invokes the
garbage collector 8 times.  Actually, can I skip the GC somehow, since
I know I'm about to allocate a bunch of memory that never gets freed?

--
Le
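
A small Emacs Lisp illustration of the word-boundary point above, not taken
from flx itself.  It assumes the 4-letter regexp quoted at the top and the
"fafbfcfd-bcd" example; the anchored variant uses non-greedy ".*?" separators
instead of the original "[^b]*" ones, since a mandatory "\\<" may need to
skip over an earlier "b":

;; Stefan's regexp matches greedily within the first "word":
(let ((str "fafbfcfd-bcd"))
  (string-match
   "\\(\\<\\)?a\\([^b]*\\)\\(\\<\\)?b\\([^c]*\\)\\(\\<\\)?c\\([^d]*\\)\\(\\<\\)?d"
   str)
  (match-string 0 str))
;; => "afbfcfd" -- the optional \< groups record word boundaries but
;; never force them, so "bcd" after the dash is ignored.

;; Forcing a boundary before "b" (one of the permutations Le describes,
;; rewritten with non-greedy separators) picks up the second word,
;; assuming "-" is not a word constituent in the current syntax table:
(let ((str "fafbfcfd-bcd"))
  (string-match "a.*?\\<b.*?c.*?d" str)
  (match-string 0 str))
;; => "afbfcfd-bcd" -- now b, c and d all come from "bcd".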
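
A minimal sketch of the caching layer as described above; the variable and
function names (my-fuzzy--file-cache, my-fuzzy--get-analysis, etc.) are
illustrative rather than flx's actual code, and the analysis body is only a
stand-in for the real heatmap computation:

(defvar my-fuzzy--file-cache (make-hash-table :test 'equal)
  "Completion string -> analysis, for filename candidates.")

(defvar my-fuzzy--string-cache (make-hash-table :test 'equal)
  "Completion string -> analysis, for all other candidates.")

(defun my-fuzzy--analyze (str filep)
  "Hypothetical static analysis of STR: char positions plus a heatmap.
When FILEP is non-nil, characters in the basepath get extra weight."
  (let ((positions (make-hash-table :test 'eq))
        (heatmap (make-vector (length str) 0))
        (base (if filep (length (or (file-name-directory str) "")) 0)))
    (dotimes (i (length str))
      (push i (gethash (downcase (aref str i)) positions))
      (when (>= i base)               ; crude "prefer the basepath" bonus
        (aset heatmap i 1)))
    (cons positions heatmap)))

(defun my-fuzzy--get-analysis (str filep)
  "Return the analysis for STR, computing and caching it on a miss."
  (let ((cache (if filep my-fuzzy--file-cache my-fuzzy--string-cache)))
    (or (gethash str cache)
        (puthash str (my-fuzzy--analyze str filep) cache))))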
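
On the GC question, one common workaround (not something proposed in this
thread) is to defer collection during a known allocation burst by
let-binding gc-cons-threshold around the warm-up; my-fuzzy-warm-cache and
my-fuzzy--get-analysis are the hypothetical names from the sketch above:

(defun my-fuzzy-warm-cache (filenames)
  "Pre-compute analyses for FILENAMES, deferring GC while allocating."
  (let ((gc-cons-threshold (* 128 1024 1024)))  ; postpone GC during the burst
    (dolist (str filenames)
      (my-fuzzy--get-analysis str t))))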