On Fri, Mar 22, 2013 at 9:00 AM, Le Wang wrote: > On Fri, Mar 22, 2013 at 7:58 AM, Stefan Monnier > wrote: > >>> The sorting algorithm is roughly this for a query: "abcd" > >>> > >>> 1. Get all matches for "a.*b.*c.*c" > >>> 2. Calculate score of each match > >>> - contiguous matched chars gets a boost > >>> - matches at word and camelCase boundaries (abbreviation) get a boost > >>> - matches with smallest starting index gets a boost > >>> 2. Sort list according to score. > > > > I think that if you turn "abcd" into a regexp of the form > > > "\\(\\<\\)?a\\([^b]*\\)\\(\\<\\)?b\\([^c]*\\)\\(\\<\\)?c\\([^d]*\\)\\(\\<\\)?d" > > the regexp matching should be fairly efficient and you should be able to > > compute the score efficiently as well (at least if > > you ignore the camelCase boundaries). > > I hadn't thought of this, and I'll try it soon. I gave this a good try. :-) Since we are favouring beginning of word anchors (often but not always), I actually had to insert "\\<" in the various slots in front of characters. That is all permutations of 4x"\\<", then 3x, then 2x, then 1x, etc. I bumped into the regexp length limit very quickly and it wasn't fast enough even when it did work. However, it turns out that emacs-lisp is fast enough with a tweaked algorithm -- no regexps at all. To wit, 1. For each string, we allocate a hash keyed by a character and value is a sorted list indices into the string for occurence of the key. (char-indices-hash) 2. For each string, we allocate a vector of same length do static analysis to assign a heat value to each position. (call this the heatmap-vector) 3. For a query "abcd" we can quickly find all combinations of ascending indices using char-indices-hash. 4. For each of these combinations, we can compute a heat value based on the heatmap-vector. We take the max of all heat values as the "score" of the query against the string. 5. Order matches by score. The algorithm works fast. I believe it has feature parity with Sublime Text's fuzzy search. However it uses a fair bit of memory, 1. it's allocating one hash table and one same-length vector for each string 2. it's allocating one cons cell for each character. Does anyone have any optimisation ideas to use less memory? I will have code for review in the coming days. -- Le