Inverted index to accelerate guix package search

all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed

* Inverted index to accelerate guix package search
@ 2020-01-12 15:03 Arun Isaac
  2020-01-13  8:48 ` Pierre Neidhardt
  0 siblings, 1 reply; 35+ messages in thread
From: Arun Isaac @ 2020-01-12 15:03 UTC (permalink / raw)
  To: guix-devel

[-- Attachment #1.1: Type: text/plain, Size: 1910 bytes --]

Hi,

I have a proof of concept inverted index based accelerator for guix
package search. The attached proof of concept script searches for
packages whose description contains a certain query string, and achieves
at least a 10x speedup (maybe more!) over brute force search.

Example proof of concept script run below:

Building index
clock utime stime cutime cstime gctime
 9.15 11.62  0.16   0.00   0.00   5.63

Brute force search
clock utime stime cutime cstime gctime
 0.12  0.12  0.00   0.00   0.00   0.00

Inverted index search
clock utime stime cutime cstime gctime
 0.00  0.00  0.00   0.00   0.00   0.00

* Method

We iterate over all package descriptions, decompose them into trigrams
(substrings of length 3), and build an "inverted index" -- a hash table
mapping trigrams to package names. Building this index needs to be done
only once, say during 'guix pull'. When the user provides a search
string, we decompose the search string into trigrams and use the
inverted index to look up packages whose descriptions contain those
trigrams. For typical search strings, this only has an average
complexity of O(1) whereas brute force search over all package
descriptions scales as O(n).

* Extension to regexp search

The attached proof of concept script only works for exact string
search. However, I can extend this to work for arbitrary regular
expressions (see references) if there is sufficient interest to have
this in Guix.

I think performance of guix package search will become even more
important as the number of packages in Guix grows. And, even as it is,
guix package search is relatively slow. Having blazing fast package
search could do wonders for user experience and user perception of
Guix's speed! :-)

WDYT?

* References

- https://en.wikipedia.org/wiki/Inverted_index
- http://wonconsulting.com/~cho/papers/cho-regex.pdf
- https://swtch.com/~rsc/regexp/regexp4.html

Regards,
Arun

[-- Attachment #1.2: inverted-index-poc.scm --]
[-- Type: text/plain, Size: 2600 bytes --]

(use-modules (gnu packages)
             (guix packages)
             (ice-9 match)
             (ice-9 time)
             (srfi srfi-1)
             (srfi srfi-26))

;;; Implementation of sets using hash tables

(define make-set make-hash-table)
(define set-element-of? hash-ref)
(define set-add! (cut hash-set! <> <> #t))
(define set->list (cut hash-map->list (lambda (element _) element) <>))

(define (set-intersection set1 set2)
  "Return the intersection of SET1 with SET2"
  (let ((result (make-set)))
    (hash-for-each (lambda (element _)
                     (when (set-element-of? set2 element)
                       (set-add! result element)))
                   set1)
    result))

(define (trigrams str)
  "Return all trigrams in STR."
  (if (< (string-length str) 3)
      (list)
      (map (lambda (start)
             (substring str start (+ start 3)))
           (iota (- (string-length str) 2)))))

;; Inverted index
(define index (make-hash-table))

(define (build-index!)
  "Return an inverted index of all trigrams in the descriptions of all
packages."
  (fold-packages (lambda (package _)
                   (for-each (lambda (trigram)
                               (let ((set (hash-ref index trigram (make-set))))
                                 (set-add! set (package-name package))
                                 (hash-set! index trigram set)))
                             (trigrams (package-description package))))
                 #f)
  index)

(define (brute-force-retrieve query)
  "Return names of all packages whose descriptions contain the string QUERY.
Search brute force by folding over all packages."
  (fold-packages (lambda (package result)
                   (if (string-contains (package-description package) query)
                       (cons (package-name package) result)
                       result))
                 (list)))

(define (inverted-index-retrieve query)
  "Return names of all packages whose descriptions contain the string QUERY.
Search using the pre-built inverted index."
  (match (trigrams query)
    ((first-trigram . remaining-trigrams)
     (set->list
      (fold (lambda (trigram result)
              (set-intersection result (hash-ref index trigram)))
            (hash-ref index first-trigram)
            remaining-trigrams)))
    (() (brute-force-retrieve query))))

(display "Building index")
(newline)
(time (build-index!))
(newline)

(display "Brute force search")
(newline)
(time (brute-force-retrieve "game"))
(newline)

(display "Inverted index search")
(newline)
(time (inverted-index-retrieve "game"))
(newline)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-12 15:03 Inverted index to accelerate guix package search Arun Isaac
@ 2020-01-13  8:48 ` Pierre Neidhardt
  2020-01-13 15:08   ` Arun Isaac
  0 siblings, 1 reply; 35+ messages in thread
From: Pierre Neidhardt @ 2020-01-13  8:48 UTC (permalink / raw)
  To: Arun Isaac, guix-devel

[-- Attachment #1: Type: text/plain, Size: 823 bytes --]

Hi Arun!

Nice!  This could be a very welcome addition!
I in particular would enjoy this with my work on improving search on
Guix :)

Question:

Arun Isaac <arunisaac@systemreboot.net> writes:

> Building index
> clock utime stime cutime cstime gctime
>  9.15 11.62  0.16   0.00   0.00   5.63
>
> Brute force search
> clock utime stime cutime cstime gctime
>  0.12  0.12  0.00   0.00   0.00   0.00
>
> Inverted index search
> clock utime stime cutime cstime gctime
>  0.00  0.00  0.00   0.00   0.00   0.00

Those measures don't seem precise enough to draw a good conclusion.
Could you increase the sample size (or maybe just loop?) so that all
times reach over a second or so?

Great research, curious to see where this will reach!

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-13  8:48 ` Pierre Neidhardt
@ 2020-01-13 15:08   ` Arun Isaac
  2020-01-13 17:54     ` zimoun
  0 siblings, 1 reply; 35+ messages in thread
From: Arun Isaac @ 2020-01-13 15:08 UTC (permalink / raw)
  To: Pierre Neidhardt, guix-devel


[-- Attachment #1.1: Type: text/plain, Size: 724 bytes --]


> Those measures don't seem precise enough to draw a good conclusion.
> Could you increase the sample size (or maybe just loop?) so that all
> times reach over a second or so?

Indeed, my bad! Here are better timing results. I have repeated each of
the searches a 1000 times, and I'm getting around 90x speedup! :-)

Building index
clock utime stime cutime cstime gctime
 9.76 12.50  0.18   0.00   0.00   6.16

Brute force search (repeated 1000 times)
clock utime stime cutime cstime gctime
195.95 264.13  0.90   0.00   0.00 142.62

Inverted index search (repeated 1000 times)
clock utime stime cutime cstime gctime
 2.18  2.48  0.00   0.00   0.00   0.60

Also, find attached the updated proof of concept script.

Cheers!


[-- Attachment #1.2: inverted-index-poc.scm --]
[-- Type: text/plain, Size: 2786 bytes --]

(use-modules (gnu packages)
             (guix packages)
             (ice-9 match)
             (ice-9 time)
             (srfi srfi-1)
             (srfi srfi-26))

;;; Implementation of sets using hash tables

(define make-set make-hash-table)
(define set-element-of? hash-ref)
(define set-add! (cut hash-set! <> <> #t))
(define set->list (cut hash-map->list (lambda (element _) element) <>))

(define (set-intersection set1 set2)
  "Return the intersection of SET1 with SET2"
  (let ((result (make-set)))
    (hash-for-each (lambda (element _)
                     (when (set-element-of? set2 element)
                       (set-add! result element)))
                   set1)
    result))

(define (trigrams str)
  "Return all trigrams in STR."
  (if (< (string-length str) 3)
      (list)
      (map (lambda (start)
             (substring str start (+ start 3)))
           (iota (- (string-length str) 2)))))

;; Inverted index
(define index (make-hash-table))

(define (build-index!)
  "Return an inverted index of all trigrams in the descriptions of all
packages."
  (fold-packages (lambda (package _)
                   (for-each (lambda (trigram)
                               (let ((set (hash-ref index trigram (make-set))))
                                 (set-add! set (package-name package))
                                 (hash-set! index trigram set)))
                             (trigrams (package-description package))))
                 #f)
  index)

(define (brute-force-retrieve query)
  "Return names of all packages whose descriptions contain the string QUERY.
Search brute force by folding over all packages."
  (fold-packages (lambda (package result)
                   (if (string-contains (package-description package) query)
                       (cons (package-name package) result)
                       result))
                 (list)))

(define (inverted-index-retrieve query)
  "Return names of all packages whose descriptions contain the string QUERY.
Search using the pre-built inverted index."
  (match (trigrams query)
    ((first-trigram . remaining-trigrams)
     (set->list
      (fold (lambda (trigram result)
              (set-intersection result (hash-ref index trigram)))
            (hash-ref index first-trigram)
            remaining-trigrams)))
    (() (brute-force-retrieve query))))

(define (repeat thunk times)
  (unless (zero? times)
    (thunk)
    (repeat thunk (1- times))))

(display "Building index")
(newline)
(time (build-index!))
(newline)

(display "Brute force search (repeated 1000 times)")
(newline)
(time (repeat (cut brute-force-retrieve "strategy") 1000))
(newline)

(display "Inverted index search (repeated 1000 times)")
(newline)
(time (repeat (cut inverted-index-retrieve "strategy") 1000))
(newline)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-13 15:08   ` Arun Isaac
@ 2020-01-13 17:54     ` zimoun
  2020-01-14  3:53       ` Bengt Richter
                         ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: zimoun @ 2020-01-13 17:54 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Guix Devel

Hi Arun,

For sure, it is a good direction... but it is not that simple. :-)

On Mon, 13 Jan 2020 at 16:08, Arun Isaac <arunisaac@systemreboot.net> wrote:

> > Those measures don't seem precise enough to draw a good conclusion.
> > Could you increase the sample size (or maybe just loop?) so that all
> > times reach over a second or so?
>
> Indeed, my bad! Here are better timing results. I have repeated each of
> the searches a 1000 times, and I'm getting around 90x speedup! :-)
>
> Building index
> clock utime stime cutime cstime gctime
>  9.76 12.50  0.18   0.00   0.00   6.16
>
> Brute force search (repeated 1000 times)
> clock utime stime cutime cstime gctime
> 195.95 264.13  0.90   0.00   0.00 142.62

`fold-packages` traverses all the packages so yes it is "slow".

> Inverted index search (repeated 1000 times)
> clock utime stime cutime cstime gctime
>  2.18  2.48  0.00   0.00   0.00   0.60

Yes, looking up into an hash table is really really faster. :-)

However, the core issue about "search" is not 'find quickly an item'
but 'find the relevant item'.
For example, Bloom filter [1] or B-tree could be interesting data
structure if we are talking about fast retrieval. Note that this part
should even be delegated to SQLite, IMHO.

[1] https://en.wikipedia.org/wiki/Bloom_filter
[2] https://en.wikipedia.org/wiki/B-tree

Well, the issue is 'scoring' a query. For example, try:

  guix search youtube | recsel -p name,relevance

--8<---------------cut here---------------start------------->8---
name: youtube-dl-gui
relevance: 15

name: youtube-viewer
relevance: 11

name: youtube-dl
relevance: 11

name: mps-youtube
relevance: 11

name: emacs-youtube-dl
relevance: 11
--8<---------------cut here---------------end--------------->8---

Why the package youtube-dl-gui is more relevant? Because the function
that computes the relevance score not enough "smart". The function
guix/ui.scm(relevance) is too simple: the score is the number of
ocurences of the pattern (weighted by the field where the pattern
matches). Therefore, if you compare "guix show" the packages
"youtube-dl-gui" and "youtube-dl", you will see that the term
"youtube" (case-insentive) appears 5 times for youtube-dl-gui against
only 3 times for youtube-dl. Does the difference (5 vs 3) provide more
information? I do not think so... The issue is: 1. the description is
not well-written (for the current scoring system) and 2. the scoring
function is too simple.

I have tried to described such issues/views here [3] [4] and since I
have not read so much about recommender systems: state-of-the-art, NLP
tools in Guile, etc.

[3] https://lists.gnu.org/archive/html/guix-devel/2019-07/msg00252.html
[4] https://lists.gnu.org/archive/html/guix-devel/2019-12/msg00160.html

Last, I have not read the 2 links you pointed but I think you have a
bug in your code. The retrieval "game" using the index returns the
package "r-mgcv" which does not contain the term "game" in its
description. Well, the query "game" using the brute force does not
return this very package.

Then, in the function
"guix/scripts/package.scm(find-packages-by-description)", you could
try to replace the `fold-packages' by something using the inverted
index. (Note that the name `find-packages-by-descripton' is odd
because it uses synopsis, name, etc. too.).

Hope that helps.

Chhers,
simon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-13 17:54     ` zimoun
@ 2020-01-14  3:53       ` Bengt Richter
  2020-01-14 14:21       ` Pierre Neidhardt
  2020-01-15  5:44       ` Arun Isaac
  2 siblings, 0 replies; 35+ messages in thread
From: Bengt Richter @ 2020-01-14  3:53 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel

Hi simoun,  Guix,

On +2020-01-13 18:54:18 +0100, zimoun wrote:
> Hi Arun,
> 
> For sure, it is a good direction... but it is not that simple. :-)
> 
> 
> On Mon, 13 Jan 2020 at 16:08, Arun Isaac <arunisaac@systemreboot.net> wrote:
> 
> > > Those measures don't seem precise enough to draw a good conclusion.
> > > Could you increase the sample size (or maybe just loop?) so that all
> > > times reach over a second or so?
> >
> > Indeed, my bad! Here are better timing results. I have repeated each of
> > the searches a 1000 times, and I'm getting around 90x speedup! :-)
> >
> > Building index
> > clock utime stime cutime cstime gctime
> >  9.76 12.50  0.18   0.00   0.00   6.16
> >
> > Brute force search (repeated 1000 times)
> > clock utime stime cutime cstime gctime
> > 195.95 264.13  0.90   0.00   0.00 142.62
> 
> `fold-packages` traverses all the packages so yes it is "slow".
> 
> 
> > Inverted index search (repeated 1000 times)
> > clock utime stime cutime cstime gctime
> >  2.18  2.48  0.00   0.00   0.00   0.60
> 
> Yes, looking up into an hash table is really really faster. :-)
>

If you want to instrument particular calls, as you know, you can probably
get sub-microsecond resolution (try the following snippet on your platform
to see how long it takes to get a gettimeofday value -- even on this
Acer Swift with its Intel(R) Pentium(R) CPU N4200 @ 1.10GHz I get about 10
calls per microsecond).

At that kind of resolution, timing at various milestones in your program will
show interesting regularities and outliers. You may even detect timing subsets
caused by such things as the CPU taking a shortcut when multiplying by zero.

If you time a thousand calls of A in a loop followed by the same for B,
and then do a loop that calls them in succession inside the loop, you will
most likely discover cache or other resource competitions between A and B.

My point is that benchmarking meaningfully is subtle, and it is easy to
be fooled by statistical summary measures of distributions that are really
combinations of separate distributions (that are easily, sometimes beautifully,
seen in scattergrams).

--8<---------------cut here---------------start------------->8---
#!/usr/bin/guile -s
!#
;;;; ttime -- see how fast gettimeofday is
;;;; 2017-08-24 20:44:21 -- bokr AT bokr DOT com

(use-modules (ice-9 pretty-print))
(use-modules (ice-9 format))

(define loop-count 0)
(catch #t (lambda () (set! loop-count (string->number (cadr (command-line)))))
          (lambda ( k . rest )  (begin 
          (display "\nFirst arg should be integer loop count\n")
          (display "Suggested usage from bash: ./ttime 500|uniq -c\n")
          (display "to see calls/useq\n\n")
          (exit))))
(define start-loop (gettimeofday))
(define   end-loop start-loop)

(display
    (let lp ((x loop-count) (tl `()))
        (if (positive? x) 
            (lp (- x 1) (cons (gettimeofday) tl))
            (begin
                (set! end-loop (gettimeofday))
            (format #f "~y" tl)))))
(format #t "start loop: ~a\n" start-loop)
(format #t "  end loop: ~a\n" end-loop)
(format #t "  end disp: ~a\n" (gettimeofday))
(newline)
--8<---------------cut here---------------end--------------->8---
(pls excuse the odd use of catch -- I was playing around,
meaning later to catch other errors like bad count etc., but never
got a round tuit)
So just chmod 755 ttime and then ./ttime 500|uniq -c
as it will suggest if you run it without args.

> 
> 
> However, the core issue about "search" is not 'find quickly an item'
> but 'find the relevant item'.
> For example, Bloom filter [1] or B-tree could be interesting data
> structure if we are talking about fast retrieval. Note that this part
> should even be delegated to SQLite, IMHO.
> 
> 
> [1] https://en.wikipedia.org/wiki/Bloom_filter
> [2] https://en.wikipedia.org/wiki/B-tree
> 
> 
> Well, the issue is 'scoring' a query. For example, try:
> 
>   guix search youtube | recsel -p name,relevance
> 
> --8<---------------cut here---------------start------------->8---
> name: youtube-dl-gui
> relevance: 15
> 
> name: youtube-viewer
> relevance: 11
> 
> name: youtube-dl
> relevance: 11
> 
> name: mps-youtube
> relevance: 11
> 
> name: emacs-youtube-dl
> relevance: 11
> --8<---------------cut here---------------end--------------->8---
> 
> Why the package youtube-dl-gui is more relevant? Because the function
> that computes the relevance score not enough "smart". The function
> guix/ui.scm(relevance) is too simple: the score is the number of
> ocurences of the pattern (weighted by the field where the pattern
> matches). Therefore, if you compare "guix show" the packages
> "youtube-dl-gui" and "youtube-dl", you will see that the term
> "youtube" (case-insentive) appears 5 times for youtube-dl-gui against
> only 3 times for youtube-dl. Does the difference (5 vs 3) provide more
> information? I do not think so... The issue is: 1. the description is
> not well-written (for the current scoring system) and 2. the scoring
> function is too simple.
> 
> I have tried to described such issues/views here [3] [4] and since I
> have not read so much about recommender systems: state-of-the-art, NLP
> tools in Guile, etc.
> 
> [3] https://lists.gnu.org/archive/html/guix-devel/2019-07/msg00252.html

From [3]:
┌────────────────────────────────────────────────────────────────────────┐
│ From my opinion, text mining or search engine strategies seem a better │
│ approach to improve the `guix search` than rigidify the filename tree. │
│ And some data mining to see how the packages cluster (depending on the │
│ metric) should be helpful to first understand how to improve `guix     │
│ search`.                                                               │
│ I do not know if my words make sense.                                  │
└────────────────────────────────────────────────────────────────────────┘
They make sense to me :)

> [4] https://lists.gnu.org/archive/html/guix-devel/2019-12/msg00160.html
More stuff, skimmed, but mostly LGTM.

I would note that librarians have been helping people find relevant
information for centuries, so IWT there is something to learn from them?
Maybe not a Dewey Decimal System book classification for guix, but I am
not sure it's good to exclude tagging ideas altogether.

Another aspect of search is _who_ is doing the searching. A guix developer
wanting to find implementation examples for bytecode jit is different
from a newbie struggling to form an overview of what makes his app
stutter.

I note that papers and dissertations etc often contain prefatory
paragraphs indicating intended audience, and noting parts that are
only for specialists vs no-previous-knowledge-required.

If synopses and such contained an audience hint in a standard syntax,
might that serve as a basis for on-the-fly derivation of a search filter,
maybe with a --audience guix-dev,specialist:guile:cps option?

For object-oriented oops goops stuff, where you might like to find
a collection of methods that might serve as mixin methods for an
object of your own, how could you search for that?

(think provides/ensures/requires or other type contraints ??)

> 
> Last, I have not read the 2 links you pointed but I think you have a
> bug in your code. The retrieval "game" using the index returns the
> package "r-mgcv" which does not contain the term "game" in its
> description. Well, the query "game" using the brute force does not
> return this very package.
> 
> Then, in the function
> "guix/scripts/package.scm(find-packages-by-description)", you could
> try to replace the `fold-packages' by something using the inverted
> index. (Note that the name `find-packages-by-descripton' is odd
> because it uses synopsis, name, etc. too.).
> 
> 
> Hope that helps.
> 
> Chhers,
> simon
> 

HTH in some way :)

-- 
Regards,
Bengt Richter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-13 17:54     ` zimoun
  2020-01-14  3:53       ` Bengt Richter
@ 2020-01-14 14:21       ` Pierre Neidhardt
  2020-01-14 15:27         ` Giovanni Biscuolo
  2020-01-16 19:06         ` Arun Isaac
  2020-01-15  5:44       ` Arun Isaac
  2 siblings, 2 replies; 35+ messages in thread
From: Pierre Neidhardt @ 2020-01-14 14:21 UTC (permalink / raw)
  To: zimoun, Arun Isaac; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 213 bytes --]

By the way, what about using Xapian in Guix?

https://en.wikipedia.org/wiki/Xapian

If it's relevant, maybe we can follow up with a discussion in a new thread.

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-14 14:21       ` Pierre Neidhardt
@ 2020-01-14 15:27         ` Giovanni Biscuolo
  2020-01-14 20:55           ` zimoun
  2020-01-16 19:06         ` Arun Isaac
  1 sibling, 1 reply; 35+ messages in thread
From: Giovanni Biscuolo @ 2020-01-14 15:27 UTC (permalink / raw)
  To: Pierre Neidhardt, zimoun, Arun Isaac; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 838 bytes --]

Pierre Neidhardt <mail@ambrevar.xyz> writes:

> By the way, what about using Xapian in Guix?
>
> https://en.wikipedia.org/wiki/Xapian
>
> If it's relevant, maybe we can follow up with a discussion in a new
> thread.

IMHO it's relevant (you decide if it's worth a new thread)

Actually I would love to search (tag?!?) for packages/services/machines
(profiles?) the same way I query my email database with notmuch: it
would give me a fourth dimension in "DevOps" :-)

IMHO this is not a priority, but probably should go in the roadmap as a
wishlist

I cannot find Guile in Xapian bindings [1] but NotDeft [2] have some
Emacs Lisp code to interact with Xapian

WDYT?

Thanks, Gio'


[1] https://xapian.org/docs/bindings/

[2] https://tero.hasu.is/notdeft/

-- 
Giovanni Biscuolo

Xelera IT Infrastructures

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-14 15:27         ` Giovanni Biscuolo
@ 2020-01-14 20:55           ` zimoun
  2020-01-14 21:41             ` Pierre Neidhardt
  2020-01-15 11:53             ` Giovanni Biscuolo
  0 siblings, 2 replies; 35+ messages in thread
From: zimoun @ 2020-01-14 20:55 UTC (permalink / raw)
  To: Giovanni Biscuolo; +Cc: Guix Devel

Hi,

On Tue, 14 Jan 2020 at 16:27, Giovanni Biscuolo <g@xelera.eu> wrote:

> Pierre Neidhardt <mail@ambrevar.xyz> writes:

> > By the way, what about using Xapian in Guix?

Xapian could be really cool!
However, the size of this dependency should be evaluated; then
optional feature or not.

> Actually I would love to search (tag?!?) for packages/services/machines
> (profiles?) the same way I query my email database with notmuch: it
> would give me a fourth dimension in "DevOps" :-)

Currently, you can output all the packages with 'guix search ""' and
index them using the Python API. It is not so nice but it should give
an idea about what to expose first to Guile.

> I cannot find Guile in Xapian bindings [1] but NotDeft [2] have some
> Emacs Lisp code to interact with Xapian

There is some work. IMHO.

If Guix goes to Xapian, all the Git history of all the packages should
be indexed. Today, it is really hard to find the right commit
corresponding to the right version -- the only solution I have is "git
log"+grep; not really friendly.

Some discussion spoke about using an external index, fetched for
example on the Guix Data Service. But I personally do not like to rely
on external service accessible through the network. However, note that
index all the packages of all the Guix history using
guile-git+fold-packages is not straightforward: resource consuming and
piece of well-thought data structures.

Well, now "guix time-machine" is here, Guix is lacking tools to browse
the history of all the packages.

Cheers,
simon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-14 20:55           ` zimoun
@ 2020-01-14 21:41             ` Pierre Neidhardt
  2020-01-14 21:59               ` zimoun
  2020-01-15 11:53             ` Giovanni Biscuolo
  1 sibling, 1 reply; 35+ messages in thread
From: Pierre Neidhardt @ 2020-01-14 21:41 UTC (permalink / raw)
  To: zimoun, Giovanni Biscuolo; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 858 bytes --]

zimoun <zimon.toutoune@gmail.com> writes:

> Hi,
>
>
> On Tue, 14 Jan 2020 at 16:27, Giovanni Biscuolo <g@xelera.eu> wrote:
>
>> Pierre Neidhardt <mail@ambrevar.xyz> writes:
>
>> > By the way, what about using Xapian in Guix?
>
> Xapian could be really cool!
> However, the size of this dependency should be evaluated; then
> optional feature or not.

--8<---------------cut here---------------start------------->8---
$ guix size guix
...
total: 390.5 MiB

$ guix size guix xapian
...
total: 414.0 MiB
--8<---------------cut here---------------end--------------->8---

Some 24 MiB.  Actually Xapien comes with about 6 MiB of doc, so we could
move that out to a separate output.

Conclusion: Xapian is very light.
The only input that Xapian drags that's not in Guix is util-linux.

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-14 21:41             ` Pierre Neidhardt
@ 2020-01-14 21:59               ` zimoun
  2020-01-15  9:12                 ` Pierre Neidhardt
  0 siblings, 1 reply; 35+ messages in thread
From: zimoun @ 2020-01-14 21:59 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: Guix Devel

On Tue, 14 Jan 2020 at 22:41, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> >> > By the way, what about using Xapian in Guix?
> >
> > Xapian could be really cool!
> > However, the size of this dependency should be evaluated; then
> > optional feature or not.
>
> --8<---------------cut here---------------start------------->8---
> $ guix size guix
> ...
> total: 390.5 MiB
>
> $ guix size guix xapian
> ...
> total: 414.0 MiB
> --8<---------------cut here---------------end--------------->8---
>
> Some 24 MiB.  Actually Xapien comes with about 6 MiB of doc, so we could
> move that out to a separate output.
>
> Conclusion: Xapian is very light.
> The only input that Xapian drags that's not in Guix is util-linux.

Cool!
Already nice!

This size is the runtime size, right?
What about the "build time" size? Other said, all one needs to build
to have xapian, i.e., the size of the full bag: starting from the
bootstrap seed and having Guix+Xapian (vs. Guix alone).


Cheers,
simon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-13 17:54     ` zimoun
  2020-01-14  3:53       ` Bengt Richter
  2020-01-14 14:21       ` Pierre Neidhardt
@ 2020-01-15  5:44       ` Arun Isaac
  2020-01-15  9:06         ` Pierre Neidhardt
                           ` (2 more replies)
  2 siblings, 3 replies; 35+ messages in thread
From: Arun Isaac @ 2020-01-15  5:44 UTC (permalink / raw)
  To: Pierre Neidhardt, Bengt Richter, zimoun; +Cc: Guix Devel


[-- Attachment #1.1: Type: text/plain, Size: 3214 bytes --]


zimoun <zimon.toutoune@gmail.com> writes:

> However, the core issue about "search" is not 'find quickly an item'
> but 'find the relevant item'.
> For example, Bloom filter [1] or B-tree could be interesting data
> structure if we are talking about fast retrieval. Note that this part
> should even be delegated to SQLite, IMHO.
>
> [1] https://en.wikipedia.org/wiki/Bloom_filter
> [2] https://en.wikipedia.org/wiki/B-tree

I just read about Bloom filters. They are fascinating indeed!

> Well, the issue is 'scoring' a query.

I think the issue of whether to use an inverted index is orthogonal to
the quest to improve the relevance of the search results. Implementing
tf-idf like you suggested could greatly benefit from having fast
searches.

> Last, I have not read the 2 links you pointed but I think you have a
> bug in your code. The retrieval "game" using the index returns the
> package "r-mgcv" which does not contain the term "game" in its
> description. Well, the query "game" using the brute force does not
> return this very package.

Indeed, I found the bug and fixed it. Please find attached the updated
code. Also, the inverted index hash table stores package objects instead
of package names and all comparisons are now done with eq?. This gets us
an even higher speedup of around 300x.

Built index in 8.556596040725708 seconds
Brute force search took 0.4107058048248291 seconds
Inverted index search took 0.0012979507446289062 seconds

Bengt Richter <bokr@bokr.com> writes:

> If you want to instrument particular calls, as you know, you can probably
> get sub-microsecond resolution (try the following snippet on your platform
> to see how long it takes to get a gettimeofday value -- even on this
> Acer Swift with its Intel(R) Pentium(R) CPU N4200 @ 1.10GHz I get about 10
> calls per microsecond).

gettimeofday is indeed a good idea. I have used it in the latest version
of my proof of concept script. Like you said, I also noticed some
peculiar outliers which probably got averaged out when I was running the
search 1000 times. For the purpose of the discussion above, I have only
reported the most common time.

Pierre Neidhardt <mail@ambrevar.xyz> writes:

> By the way, what about using Xapian in Guix?
>
> https://en.wikipedia.org/wiki/Xapian
>
> If it's relevant, maybe we can follow up with a discussion in a new
> thread.

I feel xapian is too much work (considering that we don't yet have guile
bindings) compared to our own simple implementation of an inverted
index. But, of course, I am biased since I wrote the inverted index
code! :-)

But, on a more serious note, if we move to xapian, we will not be able
to support regular expression based search queries that we support
today. On the other hand, I can extend the inverted index implementation
to support regular expression searches. Personally, I don't use regular
expression based search queries, and don't think they are very useful
especially if we make use of xapian's stemming. What do people think?

On the question of whether xapian is too heavy, I think we should make
it an optional dependency of Guix so that it remains possible to build
and use a more minimalistic Guix for those interested in such things.


[-- Attachment #1.2: inverted-index-poc.scm --]
[-- Type: text/plain, Size: 2961 bytes --]

(use-modules (gnu packages)
             (guix packages)
             (ice-9 match)
             (ice-9 time)
             (srfi srfi-1)
             (srfi srfi-26))

;;; Implementation of sets using hash tables

(define make-set make-hash-table)
(define set-element-of? hashq-ref)
(define set-add! (cut hashq-set! <> <> #t))
(define (set-fold proc init set)
  (hash-fold (lambda (element _ result)
               (proc element result))
             init set))

(define (set-intersection set1 set2)
  "Return the intersection of SET1 with SET2."
  (let ((result (make-set)))
    (hash-for-each (lambda (element _)
                     (when (set-element-of? set2 element)
                       (set-add! result element)))
                   set1)
    result))

(define (trigrams str)
  "Return all trigrams in STR."
  (if (< (string-length str) 3)
      '()
      (map (lambda (start)
             (substring str start (+ start 3)))
           (iota (- (string-length str) 2)))))

;; Inverted index
(define index (make-hash-table))

(define (build-index!)
  "Return an inverted index of all trigrams in the descriptions of all
packages."
  (fold-packages (lambda (package _)
                   (for-each (lambda (trigram)
                               (let ((set (hash-ref index trigram (make-set))))
                                 (set-add! set package)
                                 (hash-set! index trigram set)))
                             (trigrams (package-description package))))
                 #f)
  index)

(define (match-package-fold-proc query package result)
  (if (string-contains (package-description package) query)
      (cons package result)
      result))

(define (brute-force-retrieve query)
  "Return names of all packages whose descriptions contain the string QUERY.
Search brute force by folding over all packages."
  (fold-packages (cut match-package-fold-proc query <> <>) '()))

(define (inverted-index-retrieve query)
  "Return names of all packages whose descriptions contain the string QUERY.
Search using the pre-built inverted index."
  (match (trigrams query)
    ((first-trigram . remaining-trigrams)
     (set-fold
      (cut match-package-fold-proc query <> <>)
      '()
      (fold (lambda (trigram result)
              (set-intersection result (hash-ref index trigram (make-set))))
            (hash-ref index first-trigram (make-set))
            remaining-trigrams)))
    (() (brute-force-retrieve query))))

(define (time-us thunk)
  (define pair->sec
    (match-lambda
      ((sec . usec)
       (+ sec (/ usec 1e6)))))

  (let ((start (gettimeofday)))
    (thunk)
    (let ((stop (gettimeofday)))
      (- (pair->sec stop) (pair->sec start)))))

(format #t "Built index in ~a seconds
Brute force search took ~a seconds
Inverted index search took ~a seconds
"
        (time-us build-index!)
        (time-us (cut brute-force-retrieve "strategy"))
        (time-us (cut inverted-index-retrieve "strategy")))

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-15  5:44       ` Arun Isaac
@ 2020-01-15  9:06         ` Pierre Neidhardt
  2020-01-15 12:00           ` zimoun
  2020-01-15 11:54         ` zimoun
  2020-01-15 21:03         ` Ricardo Wurmus
  2 siblings, 1 reply; 35+ messages in thread
From: Pierre Neidhardt @ 2020-01-15  9:06 UTC (permalink / raw)
  To: Arun Isaac, Bengt Richter, zimoun; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 1623 bytes --]

Arun Isaac <arunisaac@systemreboot.net> writes:

> I feel xapian is too much work (considering that we don't yet have guile
> bindings) compared to our own simple implementation of an inverted
> index. But, of course, I am biased since I wrote the inverted index
> code! :-)

Indeed, xapian bindings would be the biggest obstacle.

> But, on a more serious note, if we move to xapian, we will not be able
> to support regular expression based search queries that we support
> today.

We can always keep our current regexp search (which is trivial) for
those who really want it.  I believe that Xapian is much more usable
than regexps on a daily basis.

> On the other hand, I can extend the inverted index implementation
> to support regular expression searches. Personally, I don't use regular
> expression based search queries, and don't think they are very useful
> especially if we make use of xapian's stemming. What do people think?

Agreed!

I see this with my emails (Notmuch): I type whatever words I remember
and whoever names was involved in a thread and I systematically find
it.  I've used it for months and it never missed! :)

> On the question of whether xapian is too heavy, I think we should make
> it an optional dependency of Guix so that it remains possible to build
> and use a more minimalistic Guix for those interested in such things.

I suppose it wouldn't be too hard to make it optional.  That said, with
this little overhead and this much benefit, it seems to be a very nice
default at first glance.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-14 21:59               ` zimoun
@ 2020-01-15  9:12                 ` Pierre Neidhardt
  2020-01-15 14:25                   ` zimoun
  0 siblings, 1 reply; 35+ messages in thread
From: Pierre Neidhardt @ 2020-01-15  9:12 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 862 bytes --]

zimoun <zimon.toutoune@gmail.com> writes:

> This size is the runtime size, right?

Yes.

> What about the "build time" size? Other said, all one needs to build
> to have xapian, i.e., the size of the full bag: starting from the
> bootstrap seed and having Guix+Xapian (vs. Guix alone).

As shows with guix graph?

This yields a small graph:
--8<---------------cut here---------------start------------->8---
guix graph --type=bag-emerged xapian | dot -Tpdf > dag.pdf
--8<---------------cut here---------------end--------------->8---

This yields a big graph:
--8<---------------cut here---------------start------------->8---
guix graph --type=bag xapian | dot -Tpdf > dag.pdf
--8<---------------cut here---------------end--------------->8---

I don't know how it compares to other packages.

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-14 20:55           ` zimoun
  2020-01-14 21:41             ` Pierre Neidhardt
@ 2020-01-15 11:53             ` Giovanni Biscuolo
  2020-01-15 14:49               ` zimoun
  1 sibling, 1 reply; 35+ messages in thread
From: Giovanni Biscuolo @ 2020-01-15 11:53 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 2905 bytes --]

Hi Simon and all other developers

first and foremost: I'm sorry I still cannot help in developing this
"xapian feature" or even explore pros and cons of this feature

for now mine it's just a user and possible beta-tester POV... let's call
it food for thought :-)

zimoun <zimon.toutoune@gmail.com> writes:

[...]

>> Actually I would love to search (tag?!?) for packages/services/machines
>> (profiles?) the same way I query my email database with notmuch: it
>> would give me a fourth dimension in "DevOps" :-)
>
> Currently, you can output all the packages with 'guix search ""' and
> index them using the Python API. It is not so nice but it should give
> an idea about what to expose first to Guile.

interesting approach for an indexer external to Guix, let's call it a
proof of concept...

I'd really like to experiment this directly in Guile but I'm not able to
help with Guile bindings

>> I cannot find Guile in Xapian bindings [1] but NotDeft [2] have some
>> Emacs Lisp code to interact with Xapian
>
> There is some work. IMHO.

for sure :-) this is why i defined this low-pri in the general Guix
plans

> If Guix goes to Xapian, all the Git history of all the packages should
> be indexed. Today, it is really hard to find the right commit
> corresponding to the right version -- the only solution I have is "git
> log"+grep; not really friendly.

indexing the all the history could be very interesting, but it's enough
interesting to me having a system to query (and tag if needed) Guix info
from a point in time, i.e. index creation

thinking about implementation, IMHO indexing the output of "guix search"
is doable but indexing the commit logs of git is complex, probably too
much complex

> Some discussion spoke about using an external index, fetched for
> example on the Guix Data Service. But I personally do not like to rely
> on external service accessible through the network.

me too, as I said I'd like something like notmuch - guixmuch :-) - that
indexes my mailbox; in Guix terms I see "my mailbox" [1] as
packages/services/machines and all other useful metadata I have in all
my channels/profiles

...then I have muchsync to sync the notmuch database across different
machines, and a similar feature whould be nice for guixmuch :-O


[1] where IMAP is replaced by git (in various repos, for
packages/config/channels) and offlineimap is replaced by guix pull

> However, note that index all the packages of all the Guix history
> using guile-git+fold-packages is not straightforward: resource
> consuming and piece of well-thought data structures.
>
> Well, now "guix time-machine" is here, Guix is lacking tools to browse
> the history of all the packages.

we are asking too much to Guix, but it is an interesting feature

Thanks! Gio'

[...]

-- 
Giovanni Biscuolo

Xelera IT Infrastructures

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-15  5:44       ` Arun Isaac
  2020-01-15  9:06         ` Pierre Neidhardt
@ 2020-01-15 11:54         ` zimoun
  2020-01-15 21:03         ` Ricardo Wurmus
  2 siblings, 0 replies; 35+ messages in thread
From: zimoun @ 2020-01-15 11:54 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Guix Devel

Hi,

On Wed, 15 Jan 2020 at 06:44, Arun Isaac <arunisaac@systemreboot.net> wrote:

> > Well, the issue is 'scoring' a query.
>
> I think the issue of whether to use an inverted index is orthogonal to
> the quest to improve the relevance of the search results. Implementing
> tf-idf like you suggested could greatly benefit from having fast
> searches.

I think it is not so orthogonal. In general, a fast and good system is
a combination of relevant scoring adapted to the good data structure,
IMHO.


However, I agree that adding an inverted index will improve the
current situation of "guix search" -- keeping the current scoring
function -- and ease the end-user experience.


> Pierre Neidhardt <mail@ambrevar.xyz> writes:
>
> > By the way, what about using Xapian in Guix?
> >
> > https://en.wikipedia.org/wiki/Xapian
> >
> > If it's relevant, maybe we can follow up with a discussion in a new
> > thread.
>
> I feel xapian is too much work (considering that we don't yet have guile
> bindings) compared to our own simple implementation of an inverted
> index. But, of course, I am biased since I wrote the inverted index
> code! :-)

It depends on how long run we are talking. :-)
Xapian avoids to reinvent the wheel. ;-)

> But, on a more serious note, if we move to xapian, we will not be able
> to support regular expression based search queries that we support
> today.

I am not convinced...

> On the question of whether xapian is too heavy, I think we should make
> it an optional dependency of Guix so that it remains possible to build
> and use a more minimalistic Guix for those interested in such things.

Guix comes with SQLite and it is ok.
The question is: how Xapian is minimalist. :-)
(need some investigation)


All the best,
simon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-15  9:06         ` Pierre Neidhardt
@ 2020-01-15 12:00           ` zimoun
  2020-01-15 13:41             ` Pierre Neidhardt
  0 siblings, 1 reply; 35+ messages in thread
From: zimoun @ 2020-01-15 12:00 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: Guix Devel

On Wed, 15 Jan 2020 at 10:06, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> We can always keep our current regexp search (which is trivial) for
> those who really want it.  I believe that Xapian is much more usable
> than regexps on a daily basis.

What do you mean by trivial?

The command "guix search" supports the full Guile regexp engine, if I
remember well.


> > On the other hand, I can extend the inverted index implementation
> > to support regular expression searches. Personally, I don't use regular
> > expression based search queries, and don't think they are very useful
> > especially if we make use of xapian's stemming. What do people think?
>
> Agreed!
>
> I see this with my emails (Notmuch): I type whatever words I remember
> and whoever names was involved in a thread and I systematically find
> it.  I've used it for months and it never missed! :)

The key point here is the scoring.

Type whatever words you remember in the GNU mailing search engine.

https://lists.gnu.org/archive/cgi-bin/namazu.cgi?query=load-path&submit=Search%21&idxname=guix-devel&max=20&result=normal&sort=score


All the best,
simon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-15 12:00           ` zimoun
@ 2020-01-15 13:41             ` Pierre Neidhardt
  0 siblings, 0 replies; 35+ messages in thread
From: Pierre Neidhardt @ 2020-01-15 13:41 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 475 bytes --]

zimoun <zimon.toutoune@gmail.com> writes:

> On Wed, 15 Jan 2020 at 10:06, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>
>> We can always keep our current regexp search (which is trivial) for
>> those who really want it.  I believe that Xapian is much more usable
>> than regexps on a daily basis.
>
> What do you mean by trivial?

I meant "it's trivial to keep the current regexp search alongside
something else".

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-15  9:12                 ` Pierre Neidhardt
@ 2020-01-15 14:25                   ` zimoun
  0 siblings, 0 replies; 35+ messages in thread
From: zimoun @ 2020-01-15 14:25 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: Guix Devel

Hi Pierre,

On Wed, 15 Jan 2020 at 10:12, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> > What about the "build time" size? Other said, all one needs to build
> > to have xapian, i.e., the size of the full bag: starting from the
> > bootstrap seed and having Guix+Xapian (vs. Guix alone).
>
> As shows with guix graph?
>
> This yields a small graph:
> --8<---------------cut here---------------start------------->8---
> guix graph --type=bag-emerged xapian | dot -Tpdf > dag.pdf
> --8<---------------cut here---------------end--------------->8---
>
> This yields a big graph:
> --8<---------------cut here---------------start------------->8---
> guix graph --type=bag xapian | dot -Tpdf > dag.pdf
> --8<---------------cut here---------------end--------------->8---
>
> I don't know how it compares to other packages.

I do not know neither. But I guess that Xapian does not add extra
dependencies then itself.


I do not know if this filter is correct.

--8<---------------cut here---------------start------------->8---
guixdot=/tmp/guix.dot

guix graph --type=bag guix > $guixdot

for what in `guix graph --type=bag xapian | grep '@' | cut -d'[' -f1`;
do
    grep '@' $guixdot | grep $what > /dev/null
    if [ $? -eq 0 ]
    then
        echo "OK -- $what"
    else
        echo "KO -- $what"
    fi
done
--8<---------------cut here---------------end--------------->8---

Roughly, it compares the '/gnu/store'  hashes of the nodes. And all
the nodes of Xapian are already in the graph of Guix, except Xapian
itself. So there is not extra cost (really low). I mean if I
understand well.


Well, guile-xapian bindings is the blocking point.
Any taker? ;-)


All the best,
simon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-15 11:53             ` Giovanni Biscuolo
@ 2020-01-15 14:49               ` zimoun
  0 siblings, 0 replies; 35+ messages in thread
From: zimoun @ 2020-01-15 14:49 UTC (permalink / raw)
  To: Giovanni Biscuolo; +Cc: Guix Devel

Hi Giovanni,

On Wed, 15 Jan 2020 at 12:53, Giovanni Biscuolo <g@xelera.eu> wrote:

> zimoun <zimon.toutoune@gmail.com> writes:

> > If Guix goes to Xapian, all the Git history of all the packages should
> > be indexed. Today, it is really hard to find the right commit
> > corresponding to the right version -- the only solution I have is "git
> > log"+grep; not really friendly.
>
> indexing the all the history could be very interesting, but it's enough
> interesting to me having a system to query (and tag if needed) Guix info
> from a point in time, i.e. index creation

Currently, "guix search" already does a good job (modulo the scoring
already discussed in length :-)).
If you read the doc about Guile regexp and recutils then you can query
(almost) all you want; or explain what you are not able to query. :-)

However, the current implementation cannot scale for all the packages;
other said find removed packages, or old version. And it is very
frustrating, especially when you use "guix time-machine".

A concrete example is given here [1] and "git log --grep" done in [2]
is not convenient at all. Another concrete example, see point 1. in
[3].

[1] https://lists.gnu.org/archive/html/help-guix/2019-06/msg00094.html
[2] https://lists.gnu.org/archive/html/help-guix/2019-06/msg00098.html
[3] https://lists.gnu.org/archive/html/help-guix/2020-01/msg00087.html

> thinking about implementation, IMHO indexing the output of "guix search"
> is doable but indexing the commit logs of git is complex, probably too
> much complex

I do not want to index the commit log message. But today, AFAIK, use
these commit logs is the only way to find the commit providing the
version an user want. And it is not cool, IMHO.

The whishlist is: be able to search through all the packages you can
manipulate with Guix. Other said index all the packages after the big
overhaul (inferior).

(Note that we are talking about packages but it is the same thing for services.)

> me too, as I said I'd like something like notmuch - guixmuch :-) - that
> indexes my mailbox; in Guix terms I see "my mailbox" [1] as
> packages/services/machines and all other useful metadata I have in all
> my channels/profiles

What do you mean by "other useful metadata"?

> ...then I have muchsync to sync the notmuch database across different
> machines, and a similar feature whould be nice for guixmuch :-O

It is already possible using manifests.

> [1] where IMAP is replaced by git (in various repos, for
> packages/config/channels) and offlineimap is replaced by guix pull

What you want already exist: manifest. :-)
And "guix publish" to expose your own substitutes if the package is
not build in the Guix farm and you have already built it.

> > However, note that index all the packages of all the Guix history
> > using guile-git+fold-packages is not straightforward: resource
> > consuming and piece of well-thought data structures.
> >
> > Well, now "guix time-machine" is here, Guix is lacking tools to browse
> > the history of all the packages.
>
> we are asking too much to Guix, but it is an interesting feature

I do not think so it is "too much". :-)

All the best,
simon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-15  5:44       ` Arun Isaac
  2020-01-15  9:06         ` Pierre Neidhardt
  2020-01-15 11:54         ` zimoun
@ 2020-01-15 21:03         ` Ricardo Wurmus
  2020-01-15 21:19           ` zimoun
  2020-01-16 14:44           ` Ludovic Courtès
  2 siblings, 2 replies; 35+ messages in thread
From: Ricardo Wurmus @ 2020-01-15 21:03 UTC (permalink / raw)
  To: Arun Isaac; +Cc: guix-devel


Arun Isaac <arunisaac@systemreboot.net> writes:

> Indeed, I found the bug and fixed it. Please find attached the updated
> code. Also, the inverted index hash table stores package objects instead
> of package names and all comparisons are now done with eq?. This gets us
> an even higher speedup of around 300x.
>
> Built index in 8.556596040725708 seconds
> Brute force search took 0.4107058048248291 seconds
> Inverted index search took 0.0012979507446289062 seconds

Neat!

We could build and install the index when Guix is built and installed
and then use it for the search.  I can’t think of a downside to adopting
an index compared to what we have now.

-- 
Ricardo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-15 21:03         ` Ricardo Wurmus
@ 2020-01-15 21:19           ` zimoun
  2020-01-16 14:44           ` Ludovic Courtès
  1 sibling, 0 replies; 35+ messages in thread
From: zimoun @ 2020-01-15 21:19 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: Guix Devel

Hi Ricardo,

On Wed, 15 Jan 2020 at 22:03, Ricardo Wurmus <rekado@elephly.net> wrote:

> We could build and install the index when Guix is built and installed
> and then use it for the search.  I can’t think of a downside to adopting
> an index compared to what we have now.

I agree that it is an easy move that improve the current search experience. :-)

So, yes work in progress... using the code that Arun sent and the code
already available: sets.scm and pull.scm etc..
And how to adapt the lookup to the current fold-packages+regexp-matching.

Do you think `vhash' is better than `hash-table'?

However, I am not clear about how "guix pull" works so how to update
the index. Still trying to figure out. :-)
Well I am not clear about what is cached... I mean my folder
~/.cache/guix is crowdy. ;-)

Cheers,
simon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-15 21:03         ` Ricardo Wurmus
  2020-01-15 21:19           ` zimoun
@ 2020-01-16 14:44           ` Ludovic Courtès
  2020-01-16 15:04             ` zimoun
  1 sibling, 1 reply; 35+ messages in thread
From: Ludovic Courtès @ 2020-01-16 14:44 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

Hello,

Ricardo Wurmus <rekado@elephly.net> skribis:

> Arun Isaac <arunisaac@systemreboot.net> writes:
>
>> Indeed, I found the bug and fixed it. Please find attached the updated
>> code. Also, the inverted index hash table stores package objects instead
>> of package names and all comparisons are now done with eq?. This gets us
>> an even higher speedup of around 300x.
>>
>> Built index in 8.556596040725708 seconds
>> Brute force search took 0.4107058048248291 seconds
>> Inverted index search took 0.0012979507446289062 seconds
>
> Neat!
>
> We could build and install the index when Guix is built and installed
> and then use it for the search.  I can’t think of a downside to adopting
> an index compared to what we have now.

The possible downsides are (1) ‘guix pull’ will take an additional 8
seconds, (2) there’ll be some extra complexity because the current
implementation needs to be kept anyway for when the pre-built index is
not authoritative—i.e., when ‘GUIX_PACKAGE_PATH’ is set or when ‘-L’ is
used; see ‘cache-is-authoritative?’ in (gnu packages).

I don’t find ‘guix search’ to be excessively slow currently (on an SSD
at least), but I agree that the speedup would be welcome!

Thanks,
Ludo’.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-16 14:44           ` Ludovic Courtès
@ 2020-01-16 15:04             ` zimoun
  2020-01-16 19:11               ` Arun Isaac
  0 siblings, 1 reply; 35+ messages in thread
From: zimoun @ 2020-01-16 15:04 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Guix Devel

Hi Ludo,

On Thu, 16 Jan 2020 at 15:46, Ludovic Courtès <ludo@gnu.org> wrote:

> > We could build and install the index when Guix is built and installed
> > and then use it for the search.  I can’t think of a downside to adopting
> > an index compared to what we have now.
>
> The possible downsides are (1) ‘guix pull’ will take an additional 8
> seconds, (2) there’ll be some extra complexity because the current
> implementation needs to be kept anyway for when the pre-built index is
> not authoritative—i.e., when ‘GUIX_PACKAGE_PATH’ is set or when ‘-L’ is
> used; see ‘cache-is-authoritative?’ in (gnu packages).

About (1), let implement something experimental and time it to compare
apples with apples. :-)
I mean I am working on it. As said elsewhere, "guix search" could be
improved in different area and the inverted index is an easy first
step, IMHO.

About (2), I have not figured out yet how "guix pull" works and all
the relative folders in ~/.cache/guix.
I will report my issues later. :-)

> I don’t find ‘guix search’ to be excessively slow currently (on an SSD
> at least), but I agree that the speedup would be welcome!

One next step would to search in all the packages and/or services that
"guix time-machine" can manipulate.

Cheers,
simon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-14 14:21       ` Pierre Neidhardt
  2020-01-14 15:27         ` Giovanni Biscuolo
@ 2020-01-16 19:06         ` Arun Isaac
  2020-01-16 20:08           ` zimoun
  2020-01-17 17:13           ` Pierre Neidhardt
  1 sibling, 2 replies; 35+ messages in thread
From: Arun Isaac @ 2020-01-16 19:06 UTC (permalink / raw)
  To: Pierre Neidhardt, zimoun; +Cc: Guix Devel


[-- Attachment #1.1: Type: text/plain, Size: 1749 bytes --]

Pierre Neidhardt <mail@ambrevar.xyz> writes:

> By the way, what about using Xapian in Guix?

I looked up xapian's features at https://xapian.org/features and it is
quite impressive. I was introduced to xapian through notmuch. notmuch
does not utilize xapian to the fullest and I therefore ended up
underestimating its value. Of particular importance might be the
following.

- Relevance feedback - given one or more documents, Xapian can suggest
  the most relevant index terms to expand a query, suggest related
  documents, categorise documents, etc.
- Phrase and proximity searching - users can search for words occurring
  in an exact phrase or within a specified number of words, either in a
  specified order, or in any order.
- Supports stemming of search terms (e.g. a search for "football" would
  match documents which mention "footballs" or "footballer")

I think these features would really help in Pierre's work trying to
improve search and discoverability on Guix. If we are planning to have a
"Software Center" like interface at some point in the future, xapian's
search could come in handy.

Not directly related to Guix, but I also wonder if info manuals would be
a lot more useful if they had good full text search using xapian.

For the time being, since we don't have xapian bindings, I think we
should settle for sqlite's full text search capabilities.

https://www.sqlite.org/fts5.html

I have attached a short proof of concept script for an sqlite based
search. Speedup is around 200x, and populating the database only takes
around 2.5 seconds. Here is a sample run.

Sqlite database populated in 2.5516340732574463 seconds
Brute force search took 0.11850595474243164 seconds
Sqlite search took 5.459785461425781e-4 seconds


[-- Attachment #1.2: sqlite-search.scm --]
[-- Type: text/plain, Size: 2145 bytes --]

(use-modules (gnu packages)
             (guix packages)
             (ice-9 match)
             (sqlite3)
             (srfi srfi-26))

(define db (sqlite-open "/tmp/index.sqlite"))

(define schema
  "CREATE VIRTUAL TABLE packages USING fts5(name, description)")

(define (build-sqlite-database)
  (sqlite-exec db schema)
  (sqlite-exec db "BEGIN")
  (fold-packages
   (lambda (package _)
     (let ((statement (sqlite-prepare db "INSERT INTO packages(name, description) VALUES(:name, :description)")))
       (sqlite-bind-arguments statement
                              #:name (package-name package)
                              #:description (package-description package))
       (sqlite-fold cons '() statement)
       (sqlite-finalize statement)))
   #f)
  (sqlite-exec db "COMMIT;"))

(define (sqlite-retrieve query)
  (let ((statement (sqlite-prepare db "SELECT name FROM packages WHERE description MATCH :query")))
    (sqlite-bind-arguments statement #:query query)
    (let ((result (sqlite-fold (lambda (v result)
                                 (match v
                                   (#(name) (cons name result))))
                               '() statement)))
      (sqlite-finalize statement)
      result)))

(define (brute-force-retrieve query)
  "Return names of all packages whose descriptions contain the string QUERY.
Search brute force by folding over all packages."
  (fold-packages (lambda (package result)
                   (if (string-contains (package-description package) query)
                       (cons package result)
                       result))
                 '()))

(define (time-us thunk)
  (define pair->sec
    (match-lambda
      ((sec . usec)
       (+ sec (/ usec 1e6)))))

  (let ((start (gettimeofday)))
    (thunk)
    (let ((stop (gettimeofday)))
      (- (pair->sec stop) (pair->sec start)))))

(format #t "Sqlite database populated in ~a seconds
Brute force search took ~a seconds
Sqlite search took ~a seconds
"
        (time-us build-sqlite-database)
        (time-us (cut brute-force-retrieve "strategy"))
        (time-us (cut sqlite-retrieve "strategy")))

(sqlite-close db)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-16 15:04             ` zimoun
@ 2020-01-16 19:11               ` Arun Isaac
  2020-01-16 19:53                 ` zimoun
  0 siblings, 1 reply; 35+ messages in thread
From: Arun Isaac @ 2020-01-16 19:11 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 336 bytes --]

zimoun <zimon.toutoune@gmail.com> writes:

> About (1), let implement something experimental and time it to compare
> apples with apples. :-)
> I mean I am working on it.

I am interested in working on an experimental sqlite based
implementation too. So, do let me know your plans so we don't duplicate
too much work. :-)

Cheers!
Arun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-16 19:11               ` Arun Isaac
@ 2020-01-16 19:53                 ` zimoun
  2020-01-17 19:29                   ` Arun Isaac
  0 siblings, 1 reply; 35+ messages in thread
From: zimoun @ 2020-01-16 19:53 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Guix Devel

Hi Arun,

On Thu, 16 Jan 2020 at 20:11, Arun Isaac <arunisaac@systemreboot.net> wrote:
>
> zimoun <zimon.toutoune@gmail.com> writes:
>
> > About (1), let implement something experimental and time it to compare
> > apples with apples. :-)
> > I mean I am working on it.
>
> I am interested in working on an experimental sqlite based
> implementation too. So, do let me know your plans so we don't duplicate
> too much work. :-)

Why not directly go for SQLite. But some details are not clear to me
and they are clearer with a Guile hash table or VHash.

What is not clear to me right now in both implementations are.

1.
How to update the index.
Give a look at the "pull" code and the ~/.cache/guix folder.

2.
How to deal with regexp.
It is more or less clear to me how to deal with using the trigram keys
but I do not know with SQLite; I have not thought about yet.

Basically, the current search works like that: the CLI arguments are
transformed in 'patterns', then transformed in 'regexps' which is
basically a list of terms; then 'find-packages-by-description' is
applied.

--8<---------------cut here---------------start------------->8---
(let* ((patterns (filter-map (match-lambda
                              (('query 'search rx) rx)
                              (_                   #f))
                             opts))
       (regexps  (map (cut make-regexp* <> regexp/icase) patterns))
       (matches  (find-packages-by-description regexps)))
--8<---------------cut here---------------end--------------->8---

The 'find-packages-by-description' applies a 'fold-packages' where
each term of the regexps list is lookup in each package (name,
synopsis, description, location, etc.) computing the relevance with
'package-relevance'.

--8<---------------cut here---------------start------------->8---
(let ((matches (fold-packages (lambda (package result)
                                (if (package-superseded package)
                                    result
                                  (match (package-relevance package
                                                            regexps)
                                         ((? zero?)
                                          result)
                                         (score
                                          (cons (cons package score)
                                                result)))))
                              '())))
--8<---------------cut here---------------end--------------->8---

Therefore, this call to 'fold-packages' needs to be replaced.

Using the trigram keys, I see more or less how to output the list of
packages where the key matches one term from the list of regexps. Some
False Positive will be filtered out then by the 'package-relevance'
function.

Using the SQLite, I do not know now. I need to read a bit about SQL query. :-)

If you want to implement it, go ahead. :-)
Otherwise, I will try to finish next week what I started yesterday
evening using VHash. :-)
(note that to avoid duplicate , the file sets.scm can be relevant)

All the best,
simon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-16 19:06         ` Arun Isaac
@ 2020-01-16 20:08           ` zimoun
  2020-01-17 17:13           ` Pierre Neidhardt
  1 sibling, 0 replies; 35+ messages in thread
From: zimoun @ 2020-01-16 20:08 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Guix Devel

Hi,

On Thu, 16 Jan 2020 at 20:06, Arun Isaac <arunisaac@systemreboot.net> wrote:

> I looked up xapian's features at https://xapian.org/features and it is
> quite impressive. I was introduced to xapian through notmuch. notmuch
> does not utilize xapian to the fullest and I therefore ended up
> underestimating its value. Of particular importance might be the
> following.
>
> - Relevance feedback - given one or more documents, Xapian can suggest
>   the most relevant index terms to expand a query, suggest related
>   documents, categorise documents, etc.
> - Phrase and proximity searching - users can search for words occurring
>   in an exact phrase or within a specified number of words, either in a
>   specified order, or in any order.
> - Supports stemming of search terms (e.g. a search for "football" would
>   match documents which mention "footballs" or "footballer")

Yes Xapian rocks! :-)


> For the time being, since we don't have xapian bindings, I think we
> should settle for sqlite's full text search capabilities.
>
> https://www.sqlite.org/fts5.html

Thank you for the pointer. I am looking at it.

Maybe a good ol' SQL query could improve. :-)



All the best,
simon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-16 19:06         ` Arun Isaac
  2020-01-16 20:08           ` zimoun
@ 2020-01-17 17:13           ` Pierre Neidhardt
  2020-01-17 19:11             ` Pierre Neidhardt
  1 sibling, 1 reply; 35+ messages in thread
From: Pierre Neidhardt @ 2020-01-17 17:13 UTC (permalink / raw)
  To: Arun Isaac, zimoun; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 980 bytes --]

Arun Isaac <arunisaac@systemreboot.net> writes:

> For the time being, since we don't have xapian bindings, I think we
> should settle for sqlite's full text search capabilities.
>
> https://www.sqlite.org/fts5.html
>
> I have attached a short proof of concept script for an sqlite based
> search. Speedup is around 200x, and populating the database only takes
> around 2.5 seconds. Here is a sample run.
>
> Sqlite database populated in 2.5516340732574463 seconds
> Brute force search took 0.11850595474243164 seconds
> Sqlite search took 5.459785461425781e-4 seconds

This is really cool!  And quite simple too!
So now I suppose the test would be to try with some real world examples :)
I don't know the kind of tests we can write for this though.

If the results are convincing enough, I'd agree with you: we can first
settle for SQLite before moving to something more sophisticated like
Xapian.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-17 17:13           ` Pierre Neidhardt
@ 2020-01-17 19:11             ` Pierre Neidhardt
  2020-01-18 20:31               ` Arun Isaac
  0 siblings, 1 reply; 35+ messages in thread
From: Pierre Neidhardt @ 2020-01-17 19:11 UTC (permalink / raw)
  To: Arun Isaac, zimoun; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 176 bytes --]

Oh, and is there a Guile library to write SQL statements in a more lispy
way?

E.g. like https://github.com/fukamachi/sxql.

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-16 19:53                 ` zimoun
@ 2020-01-17 19:29                   ` Arun Isaac
  2020-01-20 19:14                     ` zimoun
  2020-01-21 10:46                     ` Ludovic Courtès
  0 siblings, 2 replies; 35+ messages in thread
From: Arun Isaac @ 2020-01-17 19:29 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 2653 bytes --]

> What is not clear to me right now in both implementations are.
>
> 1.
> How to update the index.
> Give a look at the "pull" code and the ~/.cache/guix folder.

We don't "update" the index. At every guix pull we create it
anew. Currently, generate-package-cache in gnu/packages.scm does
this. generate-package-cache is called by package-cache-file in
guix/channels.scm. package-cache-file is a channel profile hook listed
under %channel-profile-hooks.

Now, what I am unclear about is how to test my sqlite index building
code without actually pushing to master and running a guix pull. I will
go through the various tests in Guix and see if I can figure something
out, but any pointers would be much appreciated.

> 2.
> How to deal with regexp.
> It is more or less clear to me how to deal with using the trigram keys
> but I do not know with SQLite; I have not thought about yet.

I think it is not possible to search using regular expressions in sqlite
unless some external module is loaded. See
https://stackoverflow.com/questions/5071601/how-do-i-use-regex-in-a-sqlite-query/8338515#8338515

I think we should remove regex support altogether. I don't think a good
search interface should expect the user to provide regexes for
search. Certainly, it will be a lot less useful if and when we have
xapian. However, just to keep backward compatibility, we can fall back
to brute force fold-packages search for regexes. As Ludo pointed out, we
can't remove the brute force code since we need to support cases when
the cache is not authoritative.

> If you want to implement it, go ahead. :-)

Yes, I'll give it a shot. :-) I have some other commitments over the
weekend, but hopefully I'll have something by Monday night.

> Otherwise, I will try to finish next week what I started yesterday
> evening using VHash. :-)

About sqlite versus an inverted index using vhashes, I don't know if it
is possible to serialize a vhash onto disk. Even if that were possible,
we'll have to load the entire vhash based inverted index into memory for
every invocation of guix search, and that could hit
performance. Something like guile-gdbm could have helped, but that's
another story.

Also, I now agree with your earlier assessment that we should delegate
all this to sqlite. :-) That guix already uses sqlite for other things
is all the more reason.

> (note that to avoid duplicate , the file sets.scm can be relevant)

I didn't know about sets.scm when I wrote my first proof of concept
inverted index script. That is why I reinvented the set using hash
tables. I don't know how hash tables are different from VHashes or which
is better.

Cheers! :-)
Arun.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-17 19:11             ` Pierre Neidhardt
@ 2020-01-18 20:31               ` Arun Isaac
  0 siblings, 0 replies; 35+ messages in thread
From: Arun Isaac @ 2020-01-18 20:31 UTC (permalink / raw)
  To: Pierre Neidhardt, zimoun; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 256 bytes --]

Pierre Neidhardt <mail@ambrevar.xyz> writes:

> Oh, and is there a Guile library to write SQL statements in a more lispy
> way?
>
> E.g. like https://github.com/fukamachi/sxql.

I am not aware of any similar guile library but it would be nice to have
one.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-17 19:29                   ` Arun Isaac
@ 2020-01-20 19:14                     ` zimoun
  2020-01-20 20:42                       ` Arun Isaac
  2020-01-21 10:46                     ` Ludovic Courtès
  1 sibling, 1 reply; 35+ messages in thread
From: zimoun @ 2020-01-20 19:14 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Guix Devel

Hi Arun,


On Fri, 17 Jan 2020 at 20:29, Arun Isaac <arunisaac@systemreboot.net> wrote:

> > 1.
> > How to update the index.
> > Give a look at the "pull" code and the ~/.cache/guix folder.
>
> We don't "update" the index. At every guix pull we create it
> anew. Currently, generate-package-cache in gnu/packages.scm does
> this. generate-package-cache is called by package-cache-file in
> guix/channels.scm. package-cache-file is a channel profile hook listed
> under %channel-profile-hooks.

I would like to be able to search the packages in all the history of
all the commits, and not only in only the packages for one specific
commit.



> Now, what I am unclear about is how to test my sqlite index building
> code without actually pushing to master and running a guix pull. I will
> go through the various tests in Guix and see if I can figure something
> out, but any pointers would be much appreciated.

To test "guix pull", simple "make as-derivation". Disclaim: can take
some time :-)

Then the issue is more to avoid to pollute your ~/.cache/guix and
~/.config/guix :-)


1. Update Guix with the result in /tmp/test

guix pull -p /tmp/test --url=/path/to/guix/repo

2. Create your SQL index

/tmp/test/bin/guix pull -p /tmp/trash

Now your index should be created with all the packages currently in master.
To have something reproducible (and faster), I suggest to add
--commit= and always pull against the same commit.

3. Test the index

/tmp/test/bin/guix search foo


I mean something along these lines. ;-)


> > 2.
> > How to deal with regexp.
> > It is more or less clear to me how to deal with using the trigram keys
> > but I do not know with SQLite; I have not thought about yet.
>
> I think it is not possible to search using regular expressions in sqlite

I think it is possible. I imagine something using multiple query.
I will give a look at the Guile module.


> I think we should remove regex support altogether. I don't think a good
> search interface should expect the user to provide regexes for
> search. Certainly, it will be a lot less useful if and when we have
> xapian. However, just to keep backward compatibility, we can fall back
> to brute force fold-packages search for regexes. As Ludo pointed out, we
> can't remove the brute force code since we need to support cases when
> the cache is not authoritative.

I disagree. We should keep the regexp. Otherwise we cannot include
under "guix search" or "guix package --search=" because arguments
about backward compatibility.

The end user interface (CLI) has to be exactly the same when using
brute force or the index. And the results too.


> About sqlite versus an inverted index using vhashes, I don't know if it
> is possible to serialize a vhash onto disk. Even if that were possible,
> we'll have to load the entire vhash based inverted index into memory for
> every invocation of guix search, and that could hit
> performance. Something like guile-gdbm could have helped, but that's
> another story.

And your first test was not fair. ;-)
Because you compared when the hash table was already in memory.
I mean to know the real performance, only timing can talk. :-)


> I didn't know about sets.scm when I wrote my first proof of concept
> inverted index script. That is why I reinvented the set using hash
> tables. I don't know how hash tables are different from VHashes or which
> is better.

VHashes is a bit confused in my mind too. ;-)

https://www.gnu.org/software/guile/manual/html_node/VHashes.html


Cheers,
simon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-20 19:14                     ` zimoun
@ 2020-01-20 20:42                       ` Arun Isaac
  0 siblings, 0 replies; 35+ messages in thread
From: Arun Isaac @ 2020-01-20 20:42 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 2032 bytes --]


I've replaced the cache building code in gnu/packages.scm with code that
builds a sqlite database instead. I haven't finished hooking this up to
the guix search code. I'll have it ready in another day or two.

> To test "guix pull", simple "make as-derivation". Disclaim: can take
> some time :-)
>
> Then the issue is more to avoid to pollute your ~/.cache/guix and
> ~/.config/guix :-)
>
> 1. Update Guix with the result in /tmp/test
>
> guix pull -p /tmp/test --url=/path/to/guix/repo
>
> 2. Create your SQL index
>
> /tmp/test/bin/guix pull -p /tmp/trash
>
> Now your index should be created with all the packages currently in master.
> To have something reproducible (and faster), I suggest to add
> --commit= and always pull against the same commit.
>
> 3. Test the index
>
> /tmp/test/bin/guix search foo
>
> I mean something along these lines. ;-)

Thanks, I got the idea. I'll try it out.

>> I think it is not possible to search using regular expressions in sqlite
>
> I think it is possible. I imagine something using multiple query.
> I will give a look at the Guile module.

Sure, let me know if you find something.

> I disagree. We should keep the regexp. Otherwise we cannot include
> under "guix search" or "guix package --search=" because arguments
> about backward compatibility.

I agree about backward compatibility.

>> About sqlite versus an inverted index using vhashes, I don't know if it
>> is possible to serialize a vhash onto disk. Even if that were possible,
>> we'll have to load the entire vhash based inverted index into memory for
>> every invocation of guix search, and that could hit
>> performance. Something like guile-gdbm could have helped, but that's
>> another story.
>
> And your first test was not fair. ;-)
> Because you compared when the hash table was already in memory.
> I mean to know the real performance, only timing can talk. :-)

Yes, it wasn't completely fair. :-P I was assuming there was some way to
efficiently serialize to disk and read from it without too much overhead.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-17 19:29                   ` Arun Isaac
  2020-01-20 19:14                     ` zimoun
@ 2020-01-21 10:46                     ` Ludovic Courtès
  2020-01-23 20:49                       ` Arun Isaac
  1 sibling, 1 reply; 35+ messages in thread
From: Ludovic Courtès @ 2020-01-21 10:46 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Guix Devel

Hello!

Arun Isaac <arunisaac@systemreboot.net> skribis:

>> What is not clear to me right now in both implementations are.
>>
>> 1.
>> How to update the index.
>> Give a look at the "pull" code and the ~/.cache/guix folder.
>
> We don't "update" the index. At every guix pull we create it
> anew. Currently, generate-package-cache in gnu/packages.scm does
> this. generate-package-cache is called by package-cache-file in
> guix/channels.scm. package-cache-file is a channel profile hook listed
> under %channel-profile-hooks.

Indeed.

To build the search index, you could add another hook in there.  But
note that it would have to be either fast or substitutable (the latter
is tricky because people can have arbitrary sets of channels).

> Now, what I am unclear about is how to test my sqlite index building
> code without actually pushing to master and running a guix pull. I will
> go through the various tests in Guix and see if I can figure something
> out, but any pointers would be much appreciated.

You could write a hook similar to ‘generate-package-cache’ and commit it
locally.  Then you can run:

  ./pre-inst-env guix pull --url=$PWD --branch=the-right-branch -p /tmp/test

(You need ./pre-inst-env so that the hook actually runs.)

Thanks to both of you for working on it!

Ludo’.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Inverted index to accelerate guix package search
  2020-01-21 10:46                     ` Ludovic Courtès
@ 2020-01-23 20:49                       ` Arun Isaac
  0 siblings, 0 replies; 35+ messages in thread
From: Arun Isaac @ 2020-01-23 20:49 UTC (permalink / raw)
  To: Guix Devel

[-- Attachment #1: Type: text/plain, Size: 491 bytes --]


I'm having some difficulty with populating the sqlite database from
within guix pull. I have sent a WIP patch to
https://issues.guix.info/issue/39258 . Let's continue the conversation
there.

> You could write a hook similar to ‘generate-package-cache’ and commit it
> locally.  Then you can run:
>
>   ./pre-inst-env guix pull --url=$PWD --branch=the-right-branch -p /tmp/test
>
> (You need ./pre-inst-env so that the hook actually runs.)

Thank you, I'm using this now.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2020-01-23 20:49 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-12 15:03 Inverted index to accelerate guix package search Arun Isaac
2020-01-13  8:48 ` Pierre Neidhardt
2020-01-13 15:08   ` Arun Isaac
2020-01-13 17:54     ` zimoun
2020-01-14  3:53       ` Bengt Richter
2020-01-14 14:21       ` Pierre Neidhardt
2020-01-14 15:27         ` Giovanni Biscuolo
2020-01-14 20:55           ` zimoun
2020-01-14 21:41             ` Pierre Neidhardt
2020-01-14 21:59               ` zimoun
2020-01-15  9:12                 ` Pierre Neidhardt
2020-01-15 14:25                   ` zimoun
2020-01-15 11:53             ` Giovanni Biscuolo
2020-01-15 14:49               ` zimoun
2020-01-16 19:06         ` Arun Isaac
2020-01-16 20:08           ` zimoun
2020-01-17 17:13           ` Pierre Neidhardt
2020-01-17 19:11             ` Pierre Neidhardt
2020-01-18 20:31               ` Arun Isaac
2020-01-15  5:44       ` Arun Isaac
2020-01-15  9:06         ` Pierre Neidhardt
2020-01-15 12:00           ` zimoun
2020-01-15 13:41             ` Pierre Neidhardt
2020-01-15 11:54         ` zimoun
2020-01-15 21:03         ` Ricardo Wurmus
2020-01-15 21:19           ` zimoun
2020-01-16 14:44           ` Ludovic Courtès
2020-01-16 15:04             ` zimoun
2020-01-16 19:11               ` Arun Isaac
2020-01-16 19:53                 ` zimoun
2020-01-17 19:29                   ` Arun Isaac
2020-01-20 19:14                     ` zimoun
2020-01-20 20:42                       ` Arun Isaac
2020-01-21 10:46                     ` Ludovic Courtès
2020-01-23 20:49                       ` Arun Isaac

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.