From mboxrd@z Thu Jan 1 00:00:00 1970 From: Arun Isaac Subject: Re: Inverted index to accelerate guix package search Date: Wed, 15 Jan 2020 11:14:20 +0530 Message-ID: References: <87a76r68u6.fsf@ambrevar.xyz> Mime-Version: 1.0 Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" Return-path: Received: from eggs.gnu.org ([2001:470:142:3::10]:55125) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1irbU0-0000VX-PN for guix-devel@gnu.org; Wed, 15 Jan 2020 00:44:51 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1irbTy-0001Ds-AX for guix-devel@gnu.org; Wed, 15 Jan 2020 00:44:47 -0500 Received: from mugam.systemreboot.net ([139.59.75.54]:33262) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1irbTx-0001BT-58 for guix-devel@gnu.org; Wed, 15 Jan 2020 00:44:46 -0500 In-Reply-To: List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane-mx.org@gnu.org Sender: "Guix-devel" To: Pierre Neidhardt , Bengt Richter , zimoun Cc: Guix Devel --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain zimoun writes: > However, the core issue about "search" is not 'find quickly an item' > but 'find the relevant item'. > For example, Bloom filter [1] or B-tree could be interesting data > structure if we are talking about fast retrieval. Note that this part > should even be delegated to SQLite, IMHO. > > [1] https://en.wikipedia.org/wiki/Bloom_filter > [2] https://en.wikipedia.org/wiki/B-tree I just read about Bloom filters. They are fascinating indeed! > Well, the issue is 'scoring' a query. I think the issue of whether to use an inverted index is orthogonal to the quest to improve the relevance of the search results. Implementing tf-idf like you suggested could greatly benefit from having fast searches. > Last, I have not read the 2 links you pointed but I think you have a > bug in your code. The retrieval "game" using the index returns the > package "r-mgcv" which does not contain the term "game" in its > description. Well, the query "game" using the brute force does not > return this very package. Indeed, I found the bug and fixed it. Please find attached the updated code. Also, the inverted index hash table stores package objects instead of package names and all comparisons are now done with eq?. This gets us an even higher speedup of around 300x. Built index in 8.556596040725708 seconds Brute force search took 0.4107058048248291 seconds Inverted index search took 0.0012979507446289062 seconds Bengt Richter writes: > If you want to instrument particular calls, as you know, you can probably > get sub-microsecond resolution (try the following snippet on your platform > to see how long it takes to get a gettimeofday value -- even on this > Acer Swift with its Intel(R) Pentium(R) CPU N4200 @ 1.10GHz I get about 10 > calls per microsecond). gettimeofday is indeed a good idea. I have used it in the latest version of my proof of concept script. Like you said, I also noticed some peculiar outliers which probably got averaged out when I was running the search 1000 times. For the purpose of the discussion above, I have only reported the most common time. Pierre Neidhardt writes: > By the way, what about using Xapian in Guix? > > https://en.wikipedia.org/wiki/Xapian > > If it's relevant, maybe we can follow up with a discussion in a new > thread. I feel xapian is too much work (considering that we don't yet have guile bindings) compared to our own simple implementation of an inverted index. But, of course, I am biased since I wrote the inverted index code! :-) But, on a more serious note, if we move to xapian, we will not be able to support regular expression based search queries that we support today. On the other hand, I can extend the inverted index implementation to support regular expression searches. Personally, I don't use regular expression based search queries, and don't think they are very useful especially if we make use of xapian's stemming. What do people think? On the question of whether xapian is too heavy, I think we should make it an optional dependency of Guix so that it remains possible to build and use a more minimalistic Guix for those interested in such things. --=-=-= Content-Type: text/plain Content-Disposition: attachment; filename=inverted-index-poc.scm (use-modules (gnu packages) (guix packages) (ice-9 match) (ice-9 time) (srfi srfi-1) (srfi srfi-26)) ;;; Implementation of sets using hash tables (define make-set make-hash-table) (define set-element-of? hashq-ref) (define set-add! (cut hashq-set! <> <> #t)) (define (set-fold proc init set) (hash-fold (lambda (element _ result) (proc element result)) init set)) (define (set-intersection set1 set2) "Return the intersection of SET1 with SET2." (let ((result (make-set))) (hash-for-each (lambda (element _) (when (set-element-of? set2 element) (set-add! result element))) set1) result)) (define (trigrams str) "Return all trigrams in STR." (if (< (string-length str) 3) '() (map (lambda (start) (substring str start (+ start 3))) (iota (- (string-length str) 2))))) ;; Inverted index (define index (make-hash-table)) (define (build-index!) "Return an inverted index of all trigrams in the descriptions of all packages." (fold-packages (lambda (package _) (for-each (lambda (trigram) (let ((set (hash-ref index trigram (make-set)))) (set-add! set package) (hash-set! index trigram set))) (trigrams (package-description package)))) #f) index) (define (match-package-fold-proc query package result) (if (string-contains (package-description package) query) (cons package result) result)) (define (brute-force-retrieve query) "Return names of all packages whose descriptions contain the string QUERY. Search brute force by folding over all packages." (fold-packages (cut match-package-fold-proc query <> <>) '())) (define (inverted-index-retrieve query) "Return names of all packages whose descriptions contain the string QUERY. Search using the pre-built inverted index." (match (trigrams query) ((first-trigram . remaining-trigrams) (set-fold (cut match-package-fold-proc query <> <>) '() (fold (lambda (trigram result) (set-intersection result (hash-ref index trigram (make-set)))) (hash-ref index first-trigram (make-set)) remaining-trigrams))) (() (brute-force-retrieve query)))) (define (time-us thunk) (define pair->sec (match-lambda ((sec . usec) (+ sec (/ usec 1e6))))) (let ((start (gettimeofday))) (thunk) (let ((stop (gettimeofday))) (- (pair->sec stop) (pair->sec start))))) (format #t "Built index in ~a seconds Brute force search took ~a seconds Inverted index search took ~a seconds " (time-us build-index!) (time-us (cut brute-force-retrieve "strategy")) (time-us (cut inverted-index-retrieve "strategy"))) --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQEzBAEBCAAdFiEEf3MDQ/Lwnzx3v3nTLiXui2GAK7MFAl4eprQACgkQLiXui2GA K7PwWggAoxYrq/FR0YLopKapgDu4S6jx399+mw+yNAZ7s9yS/MG2dVNMjx8kVs+q nZD3ELU5350HJRy7+6BcsWWToy4E6C49oSgSvj0T4cnUPzW8zK2QEzLu5h9u68GY xn00lVJtsWz35qrVxY7edUx4pY4immjdv0WhD9/HTAotNsJUphMcqeqFo3FTrIy5 pKsXETlUxt09HKfv6SlT384hKcpnpVEVMwD4Ab5WhOMDm5sml7YCKtx20C1kV6cG bYj972dF5FcoQE9FKTmu9uW0T8Wkw0YufenER55Whrjr3dDjR+rn7nZf+MSLvAGm 9t3rZFlG2L09aSED89q4Eq0ZsrSeKw== =Hx1f -----END PGP SIGNATURE----- --==-=-=--