all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: zimoun <zimon.toutoune@gmail.com>
To: Arun Isaac <arunisaac@systemreboot.net>
Cc: Guix Devel <guix-devel@gnu.org>
Subject: Re: Inverted index to accelerate guix package search
Date: Mon, 13 Jan 2020 18:54:18 +0100	[thread overview]
Message-ID: <CAJ3okZ23f4kuygoj_HS0+uKYGCad2ZV=u7DWuZ=-3vQXM_QYtQ@mail.gmail.com> (raw)
In-Reply-To: <cu7a76re6nj.fsf@systemreboot.net>

Hi Arun,

For sure, it is a good direction... but it is not that simple. :-)


On Mon, 13 Jan 2020 at 16:08, Arun Isaac <arunisaac@systemreboot.net> wrote:

> > Those measures don't seem precise enough to draw a good conclusion.
> > Could you increase the sample size (or maybe just loop?) so that all
> > times reach over a second or so?
>
> Indeed, my bad! Here are better timing results. I have repeated each of
> the searches a 1000 times, and I'm getting around 90x speedup! :-)
>
> Building index
> clock utime stime cutime cstime gctime
>  9.76 12.50  0.18   0.00   0.00   6.16
>
> Brute force search (repeated 1000 times)
> clock utime stime cutime cstime gctime
> 195.95 264.13  0.90   0.00   0.00 142.62

`fold-packages` traverses all the packages so yes it is "slow".


> Inverted index search (repeated 1000 times)
> clock utime stime cutime cstime gctime
>  2.18  2.48  0.00   0.00   0.00   0.60

Yes, looking up into an hash table is really really faster. :-)



However, the core issue about "search" is not 'find quickly an item'
but 'find the relevant item'.
For example, Bloom filter [1] or B-tree could be interesting data
structure if we are talking about fast retrieval. Note that this part
should even be delegated to SQLite, IMHO.


[1] https://en.wikipedia.org/wiki/Bloom_filter
[2] https://en.wikipedia.org/wiki/B-tree


Well, the issue is 'scoring' a query. For example, try:

  guix search youtube | recsel -p name,relevance

--8<---------------cut here---------------start------------->8---
name: youtube-dl-gui
relevance: 15

name: youtube-viewer
relevance: 11

name: youtube-dl
relevance: 11

name: mps-youtube
relevance: 11

name: emacs-youtube-dl
relevance: 11
--8<---------------cut here---------------end--------------->8---

Why the package youtube-dl-gui is more relevant? Because the function
that computes the relevance score not enough "smart". The function
guix/ui.scm(relevance) is too simple: the score is the number of
ocurences of the pattern (weighted by the field where the pattern
matches). Therefore, if you compare "guix show" the packages
"youtube-dl-gui" and "youtube-dl", you will see that the term
"youtube" (case-insentive) appears 5 times for youtube-dl-gui against
only 3 times for youtube-dl. Does the difference (5 vs 3) provide more
information? I do not think so... The issue is: 1. the description is
not well-written (for the current scoring system) and 2. the scoring
function is too simple.

I have tried to described such issues/views here [3] [4] and since I
have not read so much about recommender systems: state-of-the-art, NLP
tools in Guile, etc.

[3] https://lists.gnu.org/archive/html/guix-devel/2019-07/msg00252.html
[4] https://lists.gnu.org/archive/html/guix-devel/2019-12/msg00160.html



Last, I have not read the 2 links you pointed but I think you have a
bug in your code. The retrieval "game" using the index returns the
package "r-mgcv" which does not contain the term "game" in its
description. Well, the query "game" using the brute force does not
return this very package.

Then, in the function
"guix/scripts/package.scm(find-packages-by-description)", you could
try to replace the `fold-packages' by something using the inverted
index. (Note that the name `find-packages-by-descripton' is odd
because it uses synopsis, name, etc. too.).


Hope that helps.

Chhers,
simon

  reply	other threads:[~2020-01-13 17:54 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-12 15:03 Inverted index to accelerate guix package search Arun Isaac
2020-01-13  8:48 ` Pierre Neidhardt
2020-01-13 15:08   ` Arun Isaac
2020-01-13 17:54     ` zimoun [this message]
2020-01-14  3:53       ` Bengt Richter
2020-01-14 14:21       ` Pierre Neidhardt
2020-01-14 15:27         ` Giovanni Biscuolo
2020-01-14 20:55           ` zimoun
2020-01-14 21:41             ` Pierre Neidhardt
2020-01-14 21:59               ` zimoun
2020-01-15  9:12                 ` Pierre Neidhardt
2020-01-15 14:25                   ` zimoun
2020-01-15 11:53             ` Giovanni Biscuolo
2020-01-15 14:49               ` zimoun
2020-01-16 19:06         ` Arun Isaac
2020-01-16 20:08           ` zimoun
2020-01-17 17:13           ` Pierre Neidhardt
2020-01-17 19:11             ` Pierre Neidhardt
2020-01-18 20:31               ` Arun Isaac
2020-01-15  5:44       ` Arun Isaac
2020-01-15  9:06         ` Pierre Neidhardt
2020-01-15 12:00           ` zimoun
2020-01-15 13:41             ` Pierre Neidhardt
2020-01-15 11:54         ` zimoun
2020-01-15 21:03         ` Ricardo Wurmus
2020-01-15 21:19           ` zimoun
2020-01-16 14:44           ` Ludovic Courtès
2020-01-16 15:04             ` zimoun
2020-01-16 19:11               ` Arun Isaac
2020-01-16 19:53                 ` zimoun
2020-01-17 19:29                   ` Arun Isaac
2020-01-20 19:14                     ` zimoun
2020-01-20 20:42                       ` Arun Isaac
2020-01-21 10:46                     ` Ludovic Courtès
2020-01-23 20:49                       ` Arun Isaac

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJ3okZ23f4kuygoj_HS0+uKYGCad2ZV=u7DWuZ=-3vQXM_QYtQ@mail.gmail.com' \
    --to=zimon.toutoune@gmail.com \
    --cc=arunisaac@systemreboot.net \
    --cc=guix-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.