From: zimoun <zimon.toutoune@gmail.com>
To: Tobias Geerinckx-Rice <me@tobias.gr>
Cc: LibreMiami <packaging-guix@libremiami.org>,
	Raghav Gururajan <rg@raghavgururajan.name>,
	jgart <jgart@dismail.de>,
	Nicolas Goaziou <mail@nicolasgoaziou.fr>,
	Guix Devel <guix-devel@gnu.org>
Subject: Re: Search improvements (Was: Opposition to new single-letter package name "t")
Date: Tue, 9 Mar 2021 19:37:23 +0100
Message-ID: <CAJ3okZ3+hn0nJP98OhnZYLWJvhLGpdTUK+jB0hoM5JArQxO=zw@mail.gmail.com>
In-Reply-To: <87h7lkmhb5.fsf@nckx>

Hi Tobias,

On Tue, 9 Mar 2021 at 18:14, Tobias Geerinckx-Rice <me@tobias.gr> wrote:

> For most upstreams whether or not dashes were in vogue[0] when
> they named their project is literally arbitrary.  We'd penalise
> many other packages like texlive-todonotes, open{ssh,vpn,*},
> ktexteditor, r-performanceanalytics, qutebrowser, ...  It's not a
> net win.

I am not sure I understand what you mean here.

> If I might pet my own peeve, I think clever heuristics appear
> necessary in part because %package-metrics grossly overscores
> package names.  Rank them *below* synopsis & description--which
> will contain the name anyway--with a metric of 1, maybe 2.  Enough
> to keep the relevant stuff above the irrelevant stuff (python- >
> ruby-, etc.) without distorting things as they do now.

I actually did the math, i.e., wrote down the scoring function,
something like (to simplify)

  score(package, query) = sum_{term in query} (wS cS + wD cD + wN cN)

where wS, wD, wN are the weights for synopsis, description, and name,
and cS, cD, cN are the corresponding numbers of occurrences.  Then, for
example, I computed the Jacobian and so on in order to see the relation
between the weights w* and the numbers of occurrences c*.  I also
looked at the condition to have:

  score(package_1, query) = score(package_2, query)

and basically, with the linear relevance as it currently is, the
weights (%package-metrics) are not so bad; you cannot find a much
better heuristic.  Another conclusion: it really depends on the
number of terms in the query.  Basically, if you type one term, you
know what you are looking for; it is the package name, but you are
not sure of it.  For more terms, the result currently depends strongly
on the quality of the synopsis and description.  For instance, try:

  guix search gnu compiler

and compare the descriptions of all the packages with a relevance
higher than 4 (gcc-toolchain).  Well, with a linear and local scoring
function as it is currently, you cannot improve much, IMHO.  By local,
I mean considering only the words of one package, independently of the
words of other packages.  Hence TF-IDF [1].  For a concrete
example, see <https://lists.gnu.org/archive/html/guix-devel/2019-07/msg00252.html>.
Once you have TF-IDF, the natural scoring is BM25 [2].  Well, it
is included in Xapian, and there is a patch by Arun using Xapian as a
backend for "guix search"; see
<http://issues.guix.gnu.org/issue/39258#14>.  It is missing a good
evaluation, i.e., example queries.  I asked for such examples (what
query a user types and what they expect) here
<https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00190.html>
but no one replied, and since I am comfortable enough searching
with Guix and other bugs are more annoying for my workflow, I moved on
to other stuff.
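
To make the linear, local score from the top of this reply concrete,
here is a minimal Python sketch.  It is only an illustration, not the
code of "guix search": the weight values, the substring counting, and
the two toy package records are all assumptions made up for the
example, and the real weights live in %package-metrics.

```python
# Hypothetical weights (wN, wS, wD); NOT the values from %package-metrics.
WEIGHTS = {"name": 4, "synopsis": 3, "description": 2}


def count(term, text):
    """Case-insensitive count of non-overlapping occurrences of term."""
    return text.lower().count(term.lower())


def score(package, query):
    """Linear, local score: sum over terms of wN*cN + wS*cS + wD*cD.

    "Local" because only the fields of this one package are used.
    """
    return sum(
        WEIGHTS[field] * count(term, package[field])
        for term in query
        for field in WEIGHTS
    )


# Toy records, invented for the example.
gcc = {
    "name": "gcc",
    "synopsis": "GNU Compiler Collection",
    "description": "GCC is the GNU Compiler Collection.",
}
hello = {
    "name": "hello",
    "synopsis": "Hello, GNU world",
    "description": "GNU Hello prints a greeting.",
}

query = ["gnu", "compiler"]
print(score(gcc, query), score(hello, query))
```

With these made-up weights, each extra query term only adds another
weighted count, so for multi-term queries the ranking is driven by how
often the synopsis and description happen to repeat the terms.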

For another discussion on the topic, see
<https://lists.gnu.org/archive/html/guix-devel/2020-01/msg00222.html>.
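
As a rough illustration of why a non-local score such as BM25 [2]
behaves differently, here is a small self-contained Python sketch.  It
is neither Xapian's implementation nor the patch mentioned above: the
function name bm25_scores, the whitespace tokenizer, the parameters
k1 and b, and the four toy descriptions are assumptions for the
example.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Return {name: BM25 score} for every document in docs.

    docs maps a package name to its (toy) description text.
    """
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for t in tokens.values() for term in set(t))

    scores = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        total = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        scores[name] = total
    return scores


# Toy corpus, invented for the example.
docs = {
    "gcc": "gnu compiler collection",
    "hello": "gnu hello greeting program",
    "ghc": "haskell compiler",
    "sed": "gnu stream editor",
}
scores = bm25_scores(docs, "gnu compiler")
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

In this toy corpus "gnu" occurs in three documents out of four and
"compiler" in two, so a match on "compiler" weighs more than a match
on "gnu".  That information comes from the other packages, which is
exactly what a local score cannot see.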


Since 2020, I have read bits about "word embedding" (part of the vogue[0]
graph neural nets), and I think it would be a great project: first, some
vogue[0] stats to evaluate how the packages cluster together, i.e., is
emacs-foo closer to emacs-bar or to python-foo?  And second, depending on
the results, implement such an embedding to improve "guix search".  The
first means using Julia (or packaging PyTorch for Guix ;-)) and the second
means an implementation targeting Guile (it could be awesome to have an
equivalent to Zygote [3,4] for Guile).
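
For what "closer" could mean once packages have embedding vectors,
here is a minimal Python sketch using cosine similarity.  The vectors
below are hand-written placeholders, not learned embeddings and not
measurements, so the answer is built into the example; the sketch only
shows how the comparison would be computed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest(name, embedding):
    """Name of the other package whose vector is most similar."""
    others = (n for n in embedding if n != name)
    return max(others, key=lambda n: cosine(embedding[name], embedding[n]))


# Made-up 3-dimensional vectors, for illustration only.
embedding = {
    "emacs-foo": [0.9, 0.1, 0.3],
    "emacs-bar": [0.8, 0.2, 0.1],
    "python-foo": [0.1, 0.9, 0.3],
}

print(closest("emacs-foo", embedding))
```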

0: Not a joke. :-)
1: <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>
2: <https://en.wikipedia.org/wiki/Okapi_BM25>
3: <https://github.com/FluxML/Zygote.jl>
4: <https://arxiv.org/pdf/1810.07951.pdf>


Cheers,
simon


