On Fri, Mar 22, 2013 at 9:00 AM, Le Wang <l26wang@gmail.com> wrote:
On Fri, Mar 22, 2013 at 7:58 AM, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:
>>> The sorting algorithm is roughly this for a query: "abcd"
>>>
>>> 1. Get all matches for "a.*b.*c.*c"
>>> 2. Calculate score of each match
>>> - contiguous matched chars gets a boost
>>> - matches at word and camelCase boundaries (abbreviation) get a boost
>>> - matches with smallest starting index gets a boost
>>> 2. Sort list according to score.
>
> I think that if you turn "abcd" into a regexp of the form
> "\\(\\<\\)?a\\([^b]*\\)\\(\\<\\)?b\\([^c]*\\)\\(\\<\\)?c\\([^d]*\\)\\(\\<\\)?d"
> the regexp matching should be fairly efficient and you should be able to
> compute the score efficiently as well (at least if
> you ignore the camelCase boundaries).

I hadn't thought of this, and I'll try it soon.


I gave this a good try.  :-)

Since we are favouring beginning of word anchors (often but not always), I actually had to insert "\\<" in the various slots in front of characters.  That is all permutations of 4x"\\<", then 3x, then 2x, then 1x, etc.  I bumped into the regexp length limit very quickly and it wasn't fast enough even when it did work.

However, it turns out that emacs-lisp is fast enough with a tweaked algorithm -- no regexps at all. To wit,

1. For each string, we allocate a hash keyed by a character and value is a sorted list indices into the string for occurence of the key.  (char-indices-hash)

2. For each string, we allocate a vector of same length do static analysis to assign a heat value to each position.  (call this the heatmap-vector)

3. For a query "abcd" we can quickly find all combinations of ascending indices using char-indices-hash.

4. For each of these combinations, we can compute a heat value based on the heatmap-vector.  We take the max of all heat values as the "score" of the query against the string.

5. Order matches by score.

The algorithm works fast.  I believe it has feature parity with Sublime Text's fuzzy search.

However it uses a fair bit of memory,

1. it's allocating one hash table and one same-length vector for each string

2. it's allocating one cons cell for each character.


Does anyone have any optimisation ideas to use less memory?  I will have code for review in the coming days.


--
Le