From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail From: Daniel Pittman Newsgroups: gmane.emacs.devel Subject: Re: new-flex-completion-style Date: Thu, 14 Feb 2019 10:35:49 -0500 Message-ID: References: <20190202232827.27331.87300@vcs0.savannah.gnu.org> <20190202232828.4AE452159A@vcs0.savannah.gnu.org> <87lg2mynrg.fsf@gmail.com> <871s4czm5n.fsf@gmail.com> Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="000000000000d643a80581dc6bab" Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226"; logging-data="204868"; mail-complaints-to="usenet@blaine.gmane.org" Cc: Stefan Monnier , emacs-devel To: =?UTF-8?B?Sm/Do28gVMOhdm9yYQ==?= Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Thu Feb 14 16:46:43 2019 Return-path: Envelope-to: ged-emacs-devel@m.gmane.org Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256) (Exim 4.89) (envelope-from ) id 1guJDn-000rC2-Bc for ged-emacs-devel@m.gmane.org; Thu, 14 Feb 2019 16:46:43 +0100 Original-Received: from localhost ([127.0.0.1]:50536 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1guJDm-0008JU-Aj for ged-emacs-devel@m.gmane.org; Thu, 14 Feb 2019 10:46:42 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:59136) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1guJ3y-0001RE-To for emacs-devel@gnu.org; Thu, 14 Feb 2019 10:36:36 -0500 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1guJ3v-0007nb-UQ for emacs-devel@gnu.org; Thu, 14 Feb 2019 10:36:33 -0500 Original-Received: from mail-it1-x130.google.com ([2607:f8b0:4864:20::130]:56186) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1guJ3u-0007fk-EU for emacs-devel@gnu.org; Thu, 14 Feb 2019 10:36:31 -0500 Original-Received: by mail-it1-x130.google.com with SMTP id z131so5572486itf.5 for ; Thu, 14 Feb 2019 07:36:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=VYOtdO/FyjYD8pwu10IjdU2JcLpy8ME8/gH3GbN+tn0=; b=fmavtn4Xe4mgNrGVcUsqvoioU+EjrEgkTZylYxuei4KiG9mbJfiQY9dUBmhzgDnM1N cB3JcC18O5HN+/tNu18iMXjffnEe0jmksmP1mxUNTqs/DJEanazczyYRm0EyzmrrJnGr 4PrkxmtNCVpPXQM2Mhd9fAFYBVvs2IXPnNgK6NQtdj+7t76E7SJ6+9CwVyYbJuky+RbZ 7ZotQZVljKYbAWOCC92j07W+eIXK1f+12zxqQcdom4fTCC/tkGXsopHWtK/i094yU+rY RfA2j1HqNswAw7xrnWZee4mGblWy739hWKYjsYafJoxRqvrPTL7mkcvmSIV1Xm7Uq3K/ 9rfg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=VYOtdO/FyjYD8pwu10IjdU2JcLpy8ME8/gH3GbN+tn0=; b=uho6vevWos9lWPHhrD5yXJzsB3QNQJy+IdHX6jt6VCUyMUwUbLmfnDM2OCyFj+o5sx rYveId8k3EEAAa7g+3mwe9PyR/9u4jmmxH4OohG/WxBga4IxrqnWBB47slW5sKmMRzK8 mfWfJuPSxHSFJmC9oodJ8aseK76p1ml3CIbY0NAwGABMklOZEanULz6d8Nm6Hpsaa6kG CTctk9pAZpy1PirBlul0TVICUpVwz10pmX4wHU6FyI7BdsQo4fEevqNWUymD+omPJKvo YpGgw3vsWVIYnBaGHc4oWjOfwYqmKKqcG5I4vDOkXXxnXbmwQGkQGyFpQxJdaH+6UHPk cesQ== X-Gm-Message-State: AHQUAuZeycQwiPkC6g/o2xSH/gsG4YhRsGb3UIrHNAVma44ZBaQ4f6nJ 2i4zSO/hkGA+mogR0H3QBktQbPR1ojFUY14FKRF7Eg== X-Google-Smtp-Source: AHgI3IbpUs1IRkVi/s+gcIGcljUUYePHdVLwP5BB0TUDhPH3wo6a8hRQVPzW7zfl2tXN8kHwujupRMDFIx3I4x1t1jI= X-Received: by 2002:a6b:9256:: with SMTP id u83mr2513480iod.208.1550158585519; Thu, 14 Feb 2019 07:36:25 -0800 (PST) In-Reply-To: X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:4864:20::130 X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.org gmane.emacs.devel:233329 Archived-At: --000000000000d643a80581dc6bab Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Thu, Feb 14, 2019 at 10:16 AM Jo=C3=A3o T=C3=A1vora wrote: > On Thu, Feb 14, 2019 at 1:55 PM Jo=C3=A3o T=C3=A1vora wrote: > > On Thu, Feb 14, 2019 at 1:36 PM Stefan Monnier > wrote: > > > > > > > I don't think it's odd. I call the former "scattered match" > > > > and the latter a "tighter match". > > > > > > Don't you find it odd that "foo" gives a better score to > > > "fotttttttttttttttto" than to "foto" ? > > > > Perhaps. But as it stands it doesn't make sense to > > compare scores across different length strings. > > > > It does make sense to compare the score between > > foo's matches of foot and foto. > > > > > [ Given them the same score sounds acceptable, tho. ] > > > > Feel free to add some kind of normalization to the function > > if you can come up with it. Or come up with another function > > entirely. As I said, a function that measures the distance > > in "boring editing operations" between the pattern and the target > > would be a nicer measure. > The Levenshtein distance and close variants generally do a reasonable job of approximating human expectations, yes. Close variants (eg: only delete/insert, no mutate) also work well. The cost of building the Levenshtein edit distance matrix for picking or ordering candidates is quite high, though; https://en.wikipedia.org/wiki/Approximate_string_matching gives a good summary, and you really don't want to use the most naive option if you can avoid it. Computation costs go through the roof very fast; the cost is roughly proportional to the length of the candidate cubed, using the brute force approach. The other shortfall of using only edit distance is that most of the time people expect to front-weight the search: you may have a closer edit distance match later in the string, but people usually want a less exact match, closer to the start, intuitively. > I've been experimenting with a simpler function that just counts the > number of "holes" and the length of those holes separately > in the denominator. The numerator is the same and a perfect > match is still a 1. It seems to fare better for your cases. For foo > I'd very, very strongly encourage y'all to take a look at the existing work on the problem, not least because of the algorithmic complexity costs when you apply this to a large corpus of candidates. Consider, for example, completion over the set of all defined symbols in Emacs, which is ~ 55,000 candidates, and IIRC I calculated the average length to 9 characters, so brute force matching would be be on the order of 40 **million** comparisons for the first pattern character, though thankfully dropping fast as you eliminate non-matching candidates entirely. (Raw comparisons, accounting for all possible matches in all possible candidate, of that pattern, so each candidate gets $length / $pattern_length comparisons performed.) Anyway, the point is: this is an area where serious study has been invested, and I strongly urge you to take advantage of that, rather than trying to reinvent this surprisingly complex wheel. This isn't simple to do, and it definitely isn't simple to do in a high performance way. PS: whatever else you do, make it trivial to configure what function is used for ordering the candidate results. That matters far more that the details of the default match / order choice, because it allows easy innovation over time. PPS: the best modern tools seem to be derived from the problem of genome sequence matching, but heuristics are definitely helpful too. Again, fzf is the best version of that I have found. --000000000000d643a80581dc6bab Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Thu, Feb 14, 2019 at 10:16 AM Jo=C3=A3o T=C3=A1vora <joaotavora@gmail.com> wrote:
On Thu, Feb 14, 2019 at 1:55 PM Jo=C3=A3o T=C3=A1vora <joaotavora@gmail.com>= wrote:
> On Thu, Feb 14, 2019 at 1:36 PM Stefan Monnier <monnier@iro.umontrea= l.ca> wrote:
> >
> > > I don't think it's odd.=C2=A0 I call the former &quo= t;scattered match"
> > > and the latter a "tighter match".
> >
> > Don't you find it odd that "foo" gives a better sco= re to
> > "fotttttttttttttttto" than to "foto" ?
>
> Perhaps.=C2=A0 But as it stands it doesn't make sense to
> compare scores across different length strings.
>
> It does make sense to compare the score between
> foo's matches of foot and foto.
>
> > [ Given them the same score sounds acceptable, tho.=C2=A0 ]
>
> Feel free to add some kind of normalization to the function
> if you can come up with it.=C2=A0 Or come up with another function
> entirely.=C2=A0 As I said, a function that measures the distance
> in "boring editing operations" between the pattern and the t= arget
> would be a nicer measure.

The=C2= =A0Levenshtein distance and close variants generally do a reasonable job of= approximating human expectations, yes.=C2=A0 Close variants (eg: only dele= te/insert, no mutate) also work well.

The cost of = building the Levenshtein edit distance matrix for picking or ordering candi= dates is quite high, though;=C2=A0https://en.wikipedia.org/wiki/Approximate_stri= ng_matching gives a good summary, and you really don't want to use = the most naive option if you can avoid it.=C2=A0 Computation costs go throu= gh the roof very fast; the cost is roughly proportional to the length of th= e candidate cubed, using the brute force approach.

The other shortfall of using only edit distance is that most of the time p= eople expect to front-weight the search: you may have a closer edit distanc= e match later in the string, but people usually want a less exact match, cl= oser to the start, intuitively.
=C2=A0
I've been experimenting with a simpler function that just counts the number of "holes" and the length of those holes separately
in the denominator. The numerator is the same and a perfect
match is still a 1. It seems to fare better for your cases. For foo

I'd very, very strongly encourage y'al= l to take a look at the existing work on the problem, not least because of = the algorithmic complexity costs when you apply this to a large corpus of c= andidates.=C2=A0 Consider, for example, completion over the set of all defi= ned symbols in Emacs, which is ~ 55,000 candidates, and IIRC I calculated t= he average length to 9 characters, so brute force matching would be be on t= he order of=C2=A040 **million** comparisons for the first pattern character= , though thankfully dropping fast as you eliminate non-matching candidates = entirely.=C2=A0 (Raw comparisons, accounting for all possible matches in al= l possible candidate, of that pattern, so each candidate gets $length / $pa= ttern_length comparisons performed.)

Anyway, the p= oint is: this is an area where serious study has been invested, and I stron= gly urge you to take advantage of that, rather than trying to reinvent this= surprisingly complex wheel.=C2=A0 This isn't simple to do, and it defi= nitely isn't simple to do in a high performance way.

PS: whatever else you do, make it trivial to configure what function= is used for ordering the candidate results.=C2=A0 That matters far more th= at the details of the default match / order choice, because it allows easy = innovation over time.

PPS: the best modern tools s= eem to be derived from the problem of genome sequence matching, but heurist= ics are definitely helpful too.=C2=A0 Again, fzf is the best version of tha= t I have found.
--000000000000d643a80581dc6bab--