unofficial mirror of emacs-devel@gnu.org 
From: Le Wang <l26wang@gmail.com>
To: Stefan Monnier <monnier@iro.umontreal.ca>
Cc: "Óscar Fuentes" <ofv@wanadoo.es>, emacs-devel@gnu.org
Subject: Re: Emacs needs truely useful flex matching
Date: Mon, 15 Apr 2013 00:48:21 +0800
Message-ID: <CAM=K+ipDy92SpLfR_mMEW+Rbhzz3KpJv4uypxKwHhhQOfJp4kg@mail.gmail.com>
In-Reply-To: <CAM=K+iprKKQx0sqvxwDxpd65rKcs1=WfsLSDfuBkppvZ7qnofA@mail.gmail.com>

On Fri, Mar 22, 2013 at 9:00 AM, Le Wang <l26wang@gmail.com> wrote:

> On Fri, Mar 22, 2013 at 7:58 AM, Stefan Monnier
> <monnier@iro.umontreal.ca> wrote:
> >>> The sorting algorithm is roughly this for a query: "abcd"
> >>>
> >>> 1. Get all matches for "a.*b.*c.*d"
> >>> 2. Calculate score of each match
> >>> - contiguous matched chars get a boost
> >>> - matches at word and camelCase boundaries (abbreviation) get a boost
> >>> - matches with the smallest starting index get a boost
> >>> 3. Sort list according to score.
> >
> > I think that if you turn "abcd" into a regexp of the form
> >
> >   "\\(\\<\\)?a\\([^b]*\\)\\(\\<\\)?b\\([^c]*\\)\\(\\<\\)?c\\([^d]*\\)\\(\\<\\)?d"
> >
> > the regexp matching should be fairly efficient and you should be able to
> > compute the score efficiently as well (at least if
> > you ignore the camelCase boundaries).
>
> I hadn't thought of this, and I'll try it soon.



I gave this a good try.  :-)

Since we favour beginning-of-word anchors (often but not always), I
actually had to insert "\\<" in the various slots in front of characters:
every combination with 4x "\\<", then 3x, then 2x, then 1x, etc.  I
bumped into the regexp length limit very quickly, and it wasn't fast enough
even when it did work.
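
To illustrate why this approach blows up, here is a rough sketch in Python
(not the Emacs Lisp used in the experiment). The function name
anchored_variants, the ".*" joiner, and the decision to include the variant
with no anchors at all are assumptions made for illustration; the exact
regexps that were tried are not shown in this message. The query is assumed
to contain only letters and digits, so no regexp escaping is done.

```python
from itertools import combinations

def anchored_variants(query):
    """Yield one Emacs-style regexp per choice of slots that get a word anchor.

    Hypothetical helper: assumes the query is plain alphanumeric text.
    """
    slots = range(len(query))
    # Most anchors first: all slots anchored, then one fewer, down to none.
    for k in range(len(query), -1, -1):
        for anchored in combinations(slots, k):
            parts = [("\\<" if i in anchored else "") + ch
                     for i, ch in enumerate(query)]
            yield ".*".join(parts)

def combined_regexp(query):
    """Join every variant with the Emacs alternation operator."""
    return "\\|".join(anchored_variants(query))
```

Under these assumptions a query of n characters produces 2^n alternatives,
so "abcd" already yields 16 of them and every extra character doubles the
length of the combined regexp.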

However, it turns out that Emacs Lisp is fast enough with a tweaked
algorithm -- no regexps at all.  To wit,

1. For each string, we allocate a hash table keyed by character; each value
is a sorted list of indices into the string where that character occurs.
 (char-indices-hash)

2. For each string, we allocate a vector of the same length and do static
analysis to assign a heat value to each position.  (Call this the
heatmap-vector.)

3. For a query "abcd" we can quickly find all combinations of ascending
indices using char-indices-hash.

4. For each of these combinations, we can compute a heat value based on the
heatmap-vector.  We take the max of all heat values as the "score" of the
query against the string.

5. Order matches by score.
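
The five steps above can be sketched roughly as follows. This is a Python
illustration, not the Emacs Lisp implementation, which has not been posted
yet. All function names, the case folding, and every heat weight (10 for a
word start, 8 for a camelCase boundary, 5 for adjacent matches) are made-up
assumptions; the adjacency bonus in particular is added on top of the
per-position heat, and the real code may handle it differently.

```python
from collections import defaultdict

def char_indices(s):
    """Step 1: map each (case-folded) character to its ascending positions."""
    table = defaultdict(list)
    for i, ch in enumerate(s):
        table[ch.lower()].append(i)
    return table

def heatmap(s):
    """Step 2: one heat value per position. The weights are assumptions."""
    heat = []
    for i, ch in enumerate(s):
        h = 1
        if i == 0 or not s[i - 1].isalnum():
            h += 10   # start of the string or of a word
        elif ch.isupper() and s[i - 1].islower():
            h += 8    # camelCase boundary
        heat.append(h)
    return heat

def ascending_combinations(query, table, start=0):
    """Step 3: yield every tuple of ascending indices that spells the query."""
    if not query:
        yield ()
        return
    for i in table.get(query[0], ()):
        if i >= start:
            for rest in ascending_combinations(query[1:], table, i + 1):
                yield (i,) + rest

def score(query, s):
    """Step 4: maximum heat over all combinations, or None if no match."""
    table, heat = char_indices(s), heatmap(s)
    best = None
    for combo in ascending_combinations(query.lower(), table):
        total = sum(heat[i] for i in combo)
        # Assumed bonus for matched characters that sit next to each other.
        total += sum(5 for a, b in zip(combo, combo[1:]) if b == a + 1)
        if best is None or total > best:
            best = total
    return best

def flex_sort(query, candidates):
    """Step 5: keep the matching candidates, highest score first."""
    scored = [(score(query, c), c) for c in candidates]
    matched = [pair for pair in scored if pair[0] is not None]
    return [c for _, c in sorted(matched, key=lambda pair: -pair[0])]
```

With these made-up weights, flex_sort("fb", ["fabric", "FooBar", "foo-bar"])
ranks "foo-bar" first, then "FooBar", then "fabric", because both matched
characters of "foo-bar" sit at word starts.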

The algorithm works fast.  I believe it has feature parity with Sublime
Text's fuzzy search.

However, it uses a fair bit of memory:

1. It allocates one hash table and one same-length vector for each string.

2. It allocates one cons cell for each character.
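
To get a feel for the scale, the two points above can be turned into a
simple count. This Python sketch only tallies objects; the function name
allocation_estimate is hypothetical, and it says nothing about actual byte
sizes inside Emacs.

```python
def allocation_estimate(strings):
    """Count the objects the per-string indexes described above would need."""
    total_chars = sum(len(s) for s in strings)
    return {
        "hash_tables": len(strings),   # one per string
        "vectors": len(strings),       # one heatmap vector per string
        "vector_slots": total_chars,   # each vector is as long as its string
        "cons_cells": total_chars,     # one list cell per character
    }
```

For a candidate list of 10,000 strings averaging 40 characters, this comes
to 10,000 hash tables, 10,000 vectors and 400,000 cons cells.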


Does anyone have any optimisation ideas to use less memory?  I will have
code for review in the coming days.


-- 
Le
