From mboxrd@z Thu Jan 1 00:00:00 1970 From: Arun Isaac Subject: Re: Inverted index to accelerate guix package search Date: Mon, 13 Jan 2020 20:38:00 +0530 Message-ID: References: <87a76r68u6.fsf@ambrevar.xyz> Mime-Version: 1.0 Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" Return-path: Received: from eggs.gnu.org ([2001:470:142:3::10]:49861) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1ir1KL-0006CA-DW for guix-devel@gnu.org; Mon, 13 Jan 2020 10:08:27 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ir1KJ-0007f3-9B for guix-devel@gnu.org; Mon, 13 Jan 2020 10:08:24 -0500 Received: from mugam.systemreboot.net ([139.59.75.54]:43882) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1ir1KI-0007Zi-2J for guix-devel@gnu.org; Mon, 13 Jan 2020 10:08:23 -0500 In-Reply-To: <87a76r68u6.fsf@ambrevar.xyz> List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane-mx.org@gnu.org Sender: "Guix-devel" To: Pierre Neidhardt , guix-devel@gnu.org --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain > Those measures don't seem precise enough to draw a good conclusion. > Could you increase the sample size (or maybe just loop?) so that all > times reach over a second or so? Indeed, my bad! Here are better timing results. I have repeated each of the searches a 1000 times, and I'm getting around 90x speedup! :-) Building index clock utime stime cutime cstime gctime 9.76 12.50 0.18 0.00 0.00 6.16 Brute force search (repeated 1000 times) clock utime stime cutime cstime gctime 195.95 264.13 0.90 0.00 0.00 142.62 Inverted index search (repeated 1000 times) clock utime stime cutime cstime gctime 2.18 2.48 0.00 0.00 0.00 0.60 Also, find attached the updated proof of concept script. Cheers! --=-=-= Content-Type: text/plain Content-Disposition: attachment; filename=inverted-index-poc.scm (use-modules (gnu packages) (guix packages) (ice-9 match) (ice-9 time) (srfi srfi-1) (srfi srfi-26)) ;;; Implementation of sets using hash tables (define make-set make-hash-table) (define set-element-of? hash-ref) (define set-add! (cut hash-set! <> <> #t)) (define set->list (cut hash-map->list (lambda (element _) element) <>)) (define (set-intersection set1 set2) "Return the intersection of SET1 with SET2" (let ((result (make-set))) (hash-for-each (lambda (element _) (when (set-element-of? set2 element) (set-add! result element))) set1) result)) (define (trigrams str) "Return all trigrams in STR." (if (< (string-length str) 3) (list) (map (lambda (start) (substring str start (+ start 3))) (iota (- (string-length str) 2))))) ;; Inverted index (define index (make-hash-table)) (define (build-index!) "Return an inverted index of all trigrams in the descriptions of all packages." (fold-packages (lambda (package _) (for-each (lambda (trigram) (let ((set (hash-ref index trigram (make-set)))) (set-add! set (package-name package)) (hash-set! index trigram set))) (trigrams (package-description package)))) #f) index) (define (brute-force-retrieve query) "Return names of all packages whose descriptions contain the string QUERY. Search brute force by folding over all packages." (fold-packages (lambda (package result) (if (string-contains (package-description package) query) (cons (package-name package) result) result)) (list))) (define (inverted-index-retrieve query) "Return names of all packages whose descriptions contain the string QUERY. Search using the pre-built inverted index." (match (trigrams query) ((first-trigram . remaining-trigrams) (set->list (fold (lambda (trigram result) (set-intersection result (hash-ref index trigram))) (hash-ref index first-trigram) remaining-trigrams))) (() (brute-force-retrieve query)))) (define (repeat thunk times) (unless (zero? times) (thunk) (repeat thunk (1- times)))) (display "Building index") (newline) (time (build-index!)) (newline) (display "Brute force search (repeated 1000 times)") (newline) (time (repeat (cut brute-force-retrieve "strategy") 1000)) (newline) (display "Inverted index search (repeated 1000 times)") (newline) (time (repeat (cut inverted-index-retrieve "strategy") 1000)) (newline) --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQEzBAEBCAAdFiEEf3MDQ/Lwnzx3v3nTLiXui2GAK7MFAl4ch9AACgkQLiXui2GA K7ODcAf8D2Q3PmrLuxJ7o6ygpQaD1L1q3+Fw8VWfoomXHwu2XvbpVYUoqy1HjScb PcemmPbMvSHDz9Y/UNDM1iIa5Zoo/e8R862oipuxzX8p5xtXR2qbpzzwDvmRVW0L 2v6zRGq9yokdJoz8A0KrKL+46hq8wYZwbMZSROoVLul/JKXT3hHFNWo7s2lBonjL X9AMSh5Yh6auoxido+NrJwdSG51hFDhLDjEEO7kkA4sP7BJEaxEkcZsZEJcMNmQy kTHBf438qZ8DsaTQHribkkIMLTSalUwfseLe7OhnxNR4CJR519erdMuP0D9wU1rc U5diSgkQODQRx3Cz13Ngjk7kV1AoOw== =KExf -----END PGP SIGNATURE----- --==-=-=--