From: Arun Isaac <arunisaac@systemreboot.net>
To: Pierre Neidhardt <mail@ambrevar.xyz>,
Bengt Richter <bokr@bokr.com>, zimoun <zimon.toutoune@gmail.com>
Cc: Guix Devel <guix-devel@gnu.org>
Subject: Re: Inverted index to accelerate guix package search
Date: Wed, 15 Jan 2020 11:14:20 +0530 [thread overview]
Message-ID: <cu74kwxe0jv.fsf@systemreboot.net> (raw)
In-Reply-To: <CAJ3okZ23f4kuygoj_HS0+uKYGCad2ZV=u7DWuZ=-3vQXM_QYtQ@mail.gmail.com>
[-- Attachment #1.1: Type: text/plain, Size: 3214 bytes --]
zimoun <zimon.toutoune@gmail.com> writes:
> However, the core issue about "search" is not 'find quickly an item'
> but 'find the relevant item'.
> For example, Bloom filter [1] or B-tree could be interesting data
> structure if we are talking about fast retrieval. Note that this part
> should even be delegated to SQLite, IMHO.
>
> [1] https://en.wikipedia.org/wiki/Bloom_filter
> [2] https://en.wikipedia.org/wiki/B-tree
I just read about Bloom filters. They are fascinating indeed!
> Well, the issue is 'scoring' a query.
I think the issue of whether to use an inverted index is orthogonal to
the quest to improve the relevance of the search results. Implementing
tf-idf like you suggested could greatly benefit from having fast
searches.
> Last, I have not read the 2 links you pointed but I think you have a
> bug in your code. The retrieval "game" using the index returns the
> package "r-mgcv" which does not contain the term "game" in its
> description. Well, the query "game" using the brute force does not
> return this very package.
Indeed, I found the bug and fixed it. Please find attached the updated
code. Also, the inverted index hash table stores package objects instead
of package names and all comparisons are now done with eq?. This gets us
an even higher speedup of around 300x.
Built index in 8.556596040725708 seconds
Brute force search took 0.4107058048248291 seconds
Inverted index search took 0.0012979507446289062 seconds
Bengt Richter <bokr@bokr.com> writes:
> If you want to instrument particular calls, as you know, you can probably
> get sub-microsecond resolution (try the following snippet on your platform
> to see how long it takes to get a gettimeofday value -- even on this
> Acer Swift with its Intel(R) Pentium(R) CPU N4200 @ 1.10GHz I get about 10
> calls per microsecond).
gettimeofday is indeed a good idea. I have used it in the latest version
of my proof of concept script. Like you said, I also noticed some
peculiar outliers which probably got averaged out when I was running the
search 1000 times. For the purpose of the discussion above, I have only
reported the most common time.
Pierre Neidhardt <mail@ambrevar.xyz> writes:
> By the way, what about using Xapian in Guix?
>
> https://en.wikipedia.org/wiki/Xapian
>
> If it's relevant, maybe we can follow up with a discussion in a new
> thread.
I feel xapian is too much work (considering that we don't yet have guile
bindings) compared to our own simple implementation of an inverted
index. But, of course, I am biased since I wrote the inverted index
code! :-)
But, on a more serious note, if we move to xapian, we will not be able
to support regular expression based search queries that we support
today. On the other hand, I can extend the inverted index implementation
to support regular expression searches. Personally, I don't use regular
expression based search queries, and don't think they are very useful
especially if we make use of xapian's stemming. What do people think?
On the question of whether xapian is too heavy, I think we should make
it an optional dependency of Guix so that it remains possible to build
and use a more minimalistic Guix for those interested in such things.
[-- Attachment #1.2: inverted-index-poc.scm --]
[-- Type: text/plain, Size: 2961 bytes --]
(use-modules (gnu packages)
(guix packages)
(ice-9 match)
(ice-9 time)
(srfi srfi-1)
(srfi srfi-26))
;;; Implementation of sets using hash tables
(define make-set make-hash-table)
(define set-element-of? hashq-ref)
(define set-add! (cut hashq-set! <> <> #t))
(define (set-fold proc init set)
(hash-fold (lambda (element _ result)
(proc element result))
init set))
(define (set-intersection set1 set2)
"Return the intersection of SET1 with SET2."
(let ((result (make-set)))
(hash-for-each (lambda (element _)
(when (set-element-of? set2 element)
(set-add! result element)))
set1)
result))
(define (trigrams str)
"Return all trigrams in STR."
(if (< (string-length str) 3)
'()
(map (lambda (start)
(substring str start (+ start 3)))
(iota (- (string-length str) 2)))))
;; Inverted index
(define index (make-hash-table))
(define (build-index!)
"Return an inverted index of all trigrams in the descriptions of all
packages."
(fold-packages (lambda (package _)
(for-each (lambda (trigram)
(let ((set (hash-ref index trigram (make-set))))
(set-add! set package)
(hash-set! index trigram set)))
(trigrams (package-description package))))
#f)
index)
(define (match-package-fold-proc query package result)
(if (string-contains (package-description package) query)
(cons package result)
result))
(define (brute-force-retrieve query)
"Return names of all packages whose descriptions contain the string QUERY.
Search brute force by folding over all packages."
(fold-packages (cut match-package-fold-proc query <> <>) '()))
(define (inverted-index-retrieve query)
"Return names of all packages whose descriptions contain the string QUERY.
Search using the pre-built inverted index."
(match (trigrams query)
((first-trigram . remaining-trigrams)
(set-fold
(cut match-package-fold-proc query <> <>)
'()
(fold (lambda (trigram result)
(set-intersection result (hash-ref index trigram (make-set))))
(hash-ref index first-trigram (make-set))
remaining-trigrams)))
(() (brute-force-retrieve query))))
(define (time-us thunk)
(define pair->sec
(match-lambda
((sec . usec)
(+ sec (/ usec 1e6)))))
(let ((start (gettimeofday)))
(thunk)
(let ((stop (gettimeofday)))
(- (pair->sec stop) (pair->sec start)))))
(format #t "Built index in ~a seconds
Brute force search took ~a seconds
Inverted index search took ~a seconds
"
(time-us build-index!)
(time-us (cut brute-force-retrieve "strategy"))
(time-us (cut inverted-index-retrieve "strategy")))
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]
next prev parent reply other threads:[~2020-01-15 5:44 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-01-12 15:03 Inverted index to accelerate guix package search Arun Isaac
2020-01-13 8:48 ` Pierre Neidhardt
2020-01-13 15:08 ` Arun Isaac
2020-01-13 17:54 ` zimoun
2020-01-14 3:53 ` Bengt Richter
2020-01-14 14:21 ` Pierre Neidhardt
2020-01-14 15:27 ` Giovanni Biscuolo
2020-01-14 20:55 ` zimoun
2020-01-14 21:41 ` Pierre Neidhardt
2020-01-14 21:59 ` zimoun
2020-01-15 9:12 ` Pierre Neidhardt
2020-01-15 14:25 ` zimoun
2020-01-15 11:53 ` Giovanni Biscuolo
2020-01-15 14:49 ` zimoun
2020-01-16 19:06 ` Arun Isaac
2020-01-16 20:08 ` zimoun
2020-01-17 17:13 ` Pierre Neidhardt
2020-01-17 19:11 ` Pierre Neidhardt
2020-01-18 20:31 ` Arun Isaac
2020-01-15 5:44 ` Arun Isaac [this message]
2020-01-15 9:06 ` Pierre Neidhardt
2020-01-15 12:00 ` zimoun
2020-01-15 13:41 ` Pierre Neidhardt
2020-01-15 11:54 ` zimoun
2020-01-15 21:03 ` Ricardo Wurmus
2020-01-15 21:19 ` zimoun
2020-01-16 14:44 ` Ludovic Courtès
2020-01-16 15:04 ` zimoun
2020-01-16 19:11 ` Arun Isaac
2020-01-16 19:53 ` zimoun
2020-01-17 19:29 ` Arun Isaac
2020-01-20 19:14 ` zimoun
2020-01-20 20:42 ` Arun Isaac
2020-01-21 10:46 ` Ludovic Courtès
2020-01-23 20:49 ` Arun Isaac
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://guix.gnu.org/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cu74kwxe0jv.fsf@systemreboot.net \
--to=arunisaac@systemreboot.net \
--cc=bokr@bokr.com \
--cc=guix-devel@gnu.org \
--cc=mail@ambrevar.xyz \
--cc=zimon.toutoune@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://git.savannah.gnu.org/cgit/guix.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).