From: zimoun <zimon.toutoune@gmail.com>
To: Taylan Kammer <taylan.kammer@gmail.com>
Cc: LibreMiami <packaging-guix@libremiami.org>,
	Raghav Gururajan <rg@raghavgururajan.name>,
	jgart <jgart@dismail.de>,
	Nicolas Goaziou <mail@nicolasgoaziou.fr>,
	Guix Devel <guix-devel@gnu.org>
Subject: Re: Search improvements (Was: Opposition to new single-letter package name "t")
Date: Tue, 9 Mar 2021 16:12:24 +0100
Message-ID: <CAJ3okZ3E3bhZ5pROZS68wEKdKOcZ8SpXsvdi-bnB=9Jz3mPahA@mail.gmail.com>
In-Reply-To: <446cc95b-9068-e07d-80ee-fbcc887c2c65@gmail.com>

Hi,

On Tue, 9 Mar 2021 at 14:37, Taylan Kammer <taylan.kammer@gmail.com> wrote:

> This discussion made me realize that "guix search" might benefit from
> the following improvement though:  I think the relevance score for a
> search result should be increased significantly if the searched word is
> a standalone (not substring) part of a package's name when the name is
> split into dash-separated words.

Currently, a perfect match scores 5 and a substring match scores 1.
You are proposing to add something in between, say 3, for a perfect
match on a dash-delimited part of the name.  Why not.
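
A minimal sketch of the three tiers, written in Python only for
illustration.  It is not the code from (guix ui); the function name
'match_score' and the constants are hypothetical, and the value 3 is
only the "say 3" suggested above.

```python
# Hypothetical scores: 5 and 1 are the values quoted in this message,
# 3 is the proposed in-between value (an assumption, not Guix code).
EXACT, WORD, SUBSTRING = 5, 3, 1


def match_score(query: str, name: str) -> int:
    """Score QUERY against a package NAME using three tiers."""
    query, name = query.lower(), name.lower()
    if query == name:
        return EXACT          # the whole name matches
    if query in name.split("-"):
        return WORD           # matches one dash-delimited part
    if query in name:
        return SUBSTRING      # plain substring match
    return 0                  # no match at all


print(match_score("todo", "emacs-hl-todo"))   # 3
print(match_score("todo", "emacs-mastodon"))  # 1
```

With such a tier, emacs-hl-todo would outrank emacs-mastodon on the
name field alone.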

> For instance, the package "emacs-hl-todo" should get a much higher score
> than "emacs-mastodon" when searching for "todo".  Currently the Mastodon
> one has score 11 and the todo one only 9.

Here is how the relevance score reads:

query: todo
| field       |  emacs-hl-todo |  emacs-mastodon | weight |
|-------------+----------------+-----------------+--------|
| name        |              1 |               1 |      4 |
| synopsis    |              1 |               1 |      3 |
| description |              1 |               2 |      2 |
|-------------+----------------+-----------------+--------|
| total       | 1*4+1*3+1*2= 9 | 1*4+1*3+2*2= 11 |        |

Therefore, something looks wrong here: the score for emacs-hl-todo
should be 1*4+1*5*3+1*5*2= 29 because the term TODO in the synopsis
and in the description should be considered as a perfect match (5
instead of 1) for the query todo.
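
The arithmetic can be checked with a short sketch.  The field weights
and per-field scores are the ones from the table; the helper name
'relevance' and the dictionary layout are only assumptions made for
this example, not the structure used in (guix ui).

```python
# Field weights as listed in the table above.
WEIGHTS = {"name": 4, "synopsis": 3, "description": 2}


def relevance(field_scores: dict) -> int:
    """Sum of per-field score times field weight."""
    return sum(WEIGHTS[field] * score for field, score in field_scores.items())


hl_todo = {"name": 1, "synopsis": 1, "description": 1}
mastodon = {"name": 1, "synopsis": 1, "description": 2}
# TODO counted as a perfect match (5) in synopsis and description.
hl_todo_expected = {"name": 1, "synopsis": 5, "description": 5}

print(relevance(hl_todo))           # 9
print(relevance(mastodon))          # 11
print(relevance(hl_todo_expected))  # 29
```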


> The same thing goes for the synopsis and description of the package, but
> with respectively lower increases to the score.  (I.e. name > synopsis >
> description.)

Your proposal only needs a tweak to 'score' in the function
'relevance' from (guix ui).  The per-field weights are defined
separately (see %package-metrics in (guix ui)).


> Handling of plurals like "todos" instead of "todo" would also be great
> but could be left to a later step.

The issue is that this depends strongly on the natural language.
Therefore, an external natural language processing library would have
to be added, and I am not convinced it is worth it at the CLI level.
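
To illustrate why a simple rule is not enough, here is a deliberately
naive sketch.  The function 'naive_singular' is hypothetical and is
not something Guix provides; it only strips a trailing "s".

```python
def naive_singular(word: str) -> str:
    """Strip one trailing 's'; an English-only, error-prone heuristic."""
    if len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word


print(naive_singular("todos"))    # todo
print(naive_singular("glass"))    # glas  (wrong: "glass" is not a plural)
print(naive_singular("chevaux"))  # chevaux  (French plural left untouched)
```

It handles "todos" but mangles "glass" and misses plurals in other
languages, which is why a real solution needs language-aware tooling.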


> Any thoughts about / objections to this idea?  To be honest I haven't
> checked if there's maybe already a bug report about this.

If you are interested, there is such a discussion in this long thread:

<http://issues.guix.gnu.org/issue/39258>

And the 'relevance' function could be improved, for sure.  For
example, I proposed TF-IDF here:

<https://lists.gnu.org/archive/html/guix-devel/2019-07/msg00252.html>

and I did some small optimization calculations to compute "better"
relevance weights (%package-metrics), but the current choices are not
so bad and are simple enough. :-)
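
For readers unfamiliar with TF-IDF, a minimal sketch of the textbook
formula follows.  It is not the code from the linked proposal; the
function name 'tf_idf' and the toy descriptions are made up for the
example, and tokenization is a plain whitespace split.

```python
import math
from collections import Counter


def tf_idf(query: str, docs: list) -> list:
    """Return one TF-IDF score per document for a single-word QUERY."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing the query.
    df = sum(1 for words in tokenized if query in words)
    if df == 0:
        return [0.0] * n_docs
    idf = math.log(n_docs / df)
    scores = []
    for words in tokenized:
        # Term frequency: share of the document's words equal to the query.
        tf = Counter(words)[query] / len(words) if words else 0.0
        scores.append(tf * idf)
    return scores


# Toy descriptions, invented for this example.
docs = [
    "highlight todo keywords",
    "client for mastodon",
    "todo list manager todo",
]
print(tf_idf("todo", docs))
```

A term that appears in every description gets an IDF of zero, so very
common words stop dominating the ranking.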

Last week, I started to examine a strategy based on Bag-of-Words and
some word embedding strategies, mimicking a simple autoencoder [1]
such as Word2Vec [2].  Since the Guile tools are poor in this field, I
have started with Julia first, to see whether such a solution is worth
implementing.  My idea is to see how the packages cluster based on the
synopsis+description information; then, ideally, we should be able to
define package similarity and "synonyms" from these clusters.
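
As a rough idea of the Bag-of-Words part only (no embedding, no
autoencoder), here is a small sketch in Python rather than Julia.  The
helper names and the sample texts are hypothetical; real input would
be the synopsis+description of each package.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count word occurrences, ignoring case."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Sample texts, invented for this example.
todo_a = bag_of_words("highlight todo keywords in emacs buffers")
todo_b = bag_of_words("manage todo lists in emacs")
other = bag_of_words("client for the mastodon social network")

print(cosine_similarity(todo_a, todo_b) > cosine_similarity(todo_a, other))  # True
```

Packages whose texts share many words end up close to each other,
which is the raw material for clustering.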

Well, if you are a student and you are looking for a cool project
about Machine Learning and Data Science, ping me. :-)

1: <https://en.wikipedia.org/wiki/Autoencoder>
2: <https://en.wikipedia.org/wiki/Word2vec>


Cheers,
simon

