From: "Ludovic Courtès" <ludo@gnu.org>
To: zimoun <zimon.toutoune@gmail.com>
Cc: Guix Devel <guix-devel@gnu.org>, Jesse Gibbons <jgibbons2357@gmail.com>
Subject: Improving ‘guix search’ scoring
Date: Wed, 17 Jul 2019 23:27:47 +0200
Message-ID: <878ssw8i7g.fsf_-_@gnu.org>
In-Reply-To: <CAJ3okZ0LaJzWDBA7bjqZew_jAmtt1rj9PJhevwrtBiA_COXENg@mail.gmail.com> (zimoun's message of "Tue, 16 Jul 2019 19:04:26 +0200")

Hello zimoun!

zimoun <zimon.toutoune@gmail.com> skribis:

> However, a kind of tf-idf [1] should be used to better self organize
> the packages when searching.
>
> [1] https://en.wikipedia.org/wiki/Tf%E2%80%93idf
>
>
> For example, I have 10146 package definitions:
>   guix search ' ' | recsel -P name -C | wc -l
>   10146
> and 46 contain the word 'drawing'.
> So, the Inverse-Document-Frequency is:
>  IDF(drawing) = log(10146 / 46)
>
> Let us consider the 3 most relevant packages (by the current score).
> The term `drawing` appears:
>    for pkg in $(guix search drawing | recsel -C -P name | head -n3);\
>    do\
>       echo $pkg ; guix package --show=$pkg | grep -c drawing ;\
>    done
>
> FREQ(drawing, texlive-latex-eepic) = 5
> FREQ(drawing, tuxpaint) = 2
> FREQ(drawing, xfig) = 2
>
> Let us normalize by the length of the document:
>    for pkg in $(guix search drawing | recsel -C -P name | head -n3);\
>    do\
>       echo $pkg ; guix package --show=$pkg \
>       | recsel -P synopsis,description | wc -w ;\
>    done
>
> LEN(texlive-latex-eepic) = 68
> LEN(tuxpaint) = 60
> LEN(xfig) = 76
>
> Then one definition of the Term-Frequency is:
>
> TF(drawing, texlive-latex-eepic) = 5 / 68
> TF(drawing, tuxpaint) = 2 / 60
> TF(drawing, xfig) = 2 / 76
>
>
> The TF-IDF reads:
>
> TF-IDF(drawing, texlive-latex-eepic) = 5/68 * log(10146/46) = 0.3968
> TF-IDF(drawing, tuxpaint) = 2/60 * log(10146/46) = 0.1799
> TF-IDF(drawing, xfig) = 2/76 * log(10146/46) = 0.1420
>
>
> This does not change the current result much, but it shows more
> clearly which words are good filters.
>
> Let us consider the word `program` and the package `tuxpaint`.
> The current relevance score is 5 for `program`. The term appears 2
> times (note that `software` appears in the synopsis and should be
> replaced by `program`).
> The current relevance score is 7 for `drawing`. The term also appears 2 times.
> The difference is due only to the per-field weights.
>
> However, the TF-IDF is totally different:
>
> TF-IDF(drawing, tuxpaint) = 2/60 * log(10146/46) = 0.1799
> TF-IDF(program, tuxpaint) = 2/60 * log(10146/1056) = 0.0754
>
> So the term `drawing` carries more information than the term `program`
> for the package tuxpaint.

That’s insightful!

I guess computing TF-IDF could improve the results compared to the
current scoring mechanism.  It would be worth trying to implement it.
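
As a starting point, the arithmetic quoted above can be sketched in a few
lines of shell.  This is only an illustration of the formula, not how
‘guix search’ computes relevance: the `tf_idf` helper and its argument
order are made up for this example, the counts are the ones quoted above,
and `log` is the natural logarithm (which is what the quoted figures
imply).

```shell
#!/bin/sh
# tf_idf FREQ LEN NDOCS DOCFREQ
# Hypothetical helper, not part of Guix.  Prints
#   (FREQ / LEN) * ln(NDOCS / DOCFREQ)
# rounded to 4 decimals, where:
#   FREQ    = occurrences of the term in the package's text
#   LEN     = word count of synopsis + description
#   NDOCS   = total number of packages
#   DOCFREQ = number of packages mentioning the term
tf_idf() {
  # LC_ALL=C keeps "." as the decimal separator whatever the locale.
  LC_ALL=C awk -v freq="$1" -v len="$2" -v ndocs="$3" -v docfreq="$4" \
    'BEGIN { printf "%.4f\n", (freq / len) * log(ndocs / docfreq) }'
}

tf_idf 5 68 10146 46      # drawing, texlive-latex-eepic
tf_idf 2 60 10146 46      # drawing, tuxpaint
tf_idf 2 76 10146 46      # drawing, xfig
tf_idf 2 60 10146 1056    # program, tuxpaint
```

With the counts above this should reproduce the quoted values (0.3968,
0.1799, 0.1420 and 0.0754).  A real implementation would have to obtain
the counts from the package database rather than from hard-coded numbers.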

The bottom line though, as you wrote, is that this all depends on the
quality of synopses and descriptions, and there’s only so much we can
draw from 5-line descriptions.

Thanks,
Ludo’.
