On Fri, Mar 22, 2013 at 9:00 AM, Le Wang <l26wang@gmail.com> wrote:

> On Fri, Mar 22, 2013 at 7:58 AM, Stefan Monnier
> <monnier@iro.umontreal.ca> wrote:
> >>> The sorting algorithm is roughly this for a query: "abcd"
> >>>
> >>> 1. Get all matches for "a.*b.*c.*c"
> >>> 2. Calculate score of each match
> >>> - contiguous matched chars gets a boost
> >>> - matches at word and camelCase boundaries (abbreviation) get a boost
> >>> - matches with smallest starting index gets a boost
> >>> 2. Sort list according to score.
> >
> > I think that if you turn "abcd" into a regexp of the form
> >
> "\\(\\<\\)?a\\([^b]*\\)\\(\\<\\)?b\\([^c]*\\)\\(\\<\\)?c\\([^d]*\\)\\(\\<\\)?d"
> > the regexp matching should be fairly efficient and you should be able to
> > compute the score efficiently as well (at least if
> > you ignore the camelCase boundaries).
>
> I hadn't thought of this, and I'll try it soon.


I gave this a good try.  :-)

Since we are favouring beginning of word anchors (often but not always), I
actually had to insert "\\<" in the various slots in front of characters.
 That is all permutations of 4x"\\<", then 3x, then 2x, then 1x, etc.  I
bumped into the regexp length limit very quickly and it wasn't fast enough
even when it did work.

However, it turns out that emacs-lisp is fast enough with a tweaked
algorithm -- no regexps at all. To wit,

1. For each string, we allocate a hash keyed by a character and value is a
sorted list indices into the string for occurence of the key.
 (char-indices-hash)

2. For each string, we allocate a vector of same length do static analysis
to assign a heat value to each position.  (call this the heatmap-vector)

3. For a query "abcd" we can quickly find all combinations of ascending
indices using char-indices-hash.

4. For each of these combinations, we can compute a heat value based on the
heatmap-vector.  We take the max of all heat values as the "score" of the
query against the string.

5. Order matches by score.

The algorithm works fast.  I believe it has feature parity with Sublime
Text's fuzzy search.

However it uses a fair bit of memory,

1. it's allocating one hash table and one same-length vector for each string

2. it's allocating one cons cell for each character.


Does anyone have any optimisation ideas to use less memory?  I will have
code for review in the coming days.


-- 
Le