From mboxrd@z Thu Jan 1 00:00:00 1970 From: Arun Isaac Subject: Re: Inverted index to accelerate guix package search Date: Fri, 17 Jan 2020 00:36:37 +0530 Message-ID: References: <87a76r68u6.fsf@ambrevar.xyz> <87muaqnmod.fsf@ambrevar.xyz> Mime-Version: 1.0 Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" Return-path: Received: from eggs.gnu.org ([2001:470:142:3::10]:42181) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1isATu-00017H-2i for guix-devel@gnu.org; Thu, 16 Jan 2020 14:07:06 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1isATp-0001RO-Fy for guix-devel@gnu.org; Thu, 16 Jan 2020 14:07:01 -0500 Received: from mugam.systemreboot.net ([139.59.75.54]:41744) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1isATo-0001Mq-8j for guix-devel@gnu.org; Thu, 16 Jan 2020 14:06:57 -0500 In-Reply-To: <87muaqnmod.fsf@ambrevar.xyz> List-Id: "Development of GNU Guix and the GNU System distribution." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane-mx.org@gnu.org Sender: "Guix-devel" To: Pierre Neidhardt , zimoun Cc: Guix Devel --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain Pierre Neidhardt writes: > By the way, what about using Xapian in Guix? I looked up xapian's features at https://xapian.org/features and it is quite impressive. I was introduced to xapian through notmuch. notmuch does not utilize xapian to the fullest and I therefore ended up underestimating its value. Of particular importance might be the following. - Relevance feedback - given one or more documents, Xapian can suggest the most relevant index terms to expand a query, suggest related documents, categorise documents, etc. - Phrase and proximity searching - users can search for words occurring in an exact phrase or within a specified number of words, either in a specified order, or in any order. - Supports stemming of search terms (e.g. a search for "football" would match documents which mention "footballs" or "footballer") I think these features would really help in Pierre's work trying to improve search and discoverability on Guix. If we are planning to have a "Software Center" like interface at some point in the future, xapian's search could come in handy. Not directly related to Guix, but I also wonder if info manuals would be a lot more useful if they had good full text search using xapian. For the time being, since we don't have xapian bindings, I think we should settle for sqlite's full text search capabilities. https://www.sqlite.org/fts5.html I have attached a short proof of concept script for an sqlite based search. Speedup is around 200x, and populating the database only takes around 2.5 seconds. Here is a sample run. Sqlite database populated in 2.5516340732574463 seconds Brute force search took 0.11850595474243164 seconds Sqlite search took 5.459785461425781e-4 seconds --=-=-= Content-Type: text/plain Content-Disposition: attachment; filename=sqlite-search.scm (use-modules (gnu packages) (guix packages) (ice-9 match) (sqlite3) (srfi srfi-26)) (define db (sqlite-open "/tmp/index.sqlite")) (define schema "CREATE VIRTUAL TABLE packages USING fts5(name, description)") (define (build-sqlite-database) (sqlite-exec db schema) (sqlite-exec db "BEGIN") (fold-packages (lambda (package _) (let ((statement (sqlite-prepare db "INSERT INTO packages(name, description) VALUES(:name, :description)"))) (sqlite-bind-arguments statement #:name (package-name package) #:description (package-description package)) (sqlite-fold cons '() statement) (sqlite-finalize statement))) #f) (sqlite-exec db "COMMIT;")) (define (sqlite-retrieve query) (let ((statement (sqlite-prepare db "SELECT name FROM packages WHERE description MATCH :query"))) (sqlite-bind-arguments statement #:query query) (let ((result (sqlite-fold (lambda (v result) (match v (#(name) (cons name result)))) '() statement))) (sqlite-finalize statement) result))) (define (brute-force-retrieve query) "Return names of all packages whose descriptions contain the string QUERY. Search brute force by folding over all packages." (fold-packages (lambda (package result) (if (string-contains (package-description package) query) (cons package result) result)) '())) (define (time-us thunk) (define pair->sec (match-lambda ((sec . usec) (+ sec (/ usec 1e6))))) (let ((start (gettimeofday))) (thunk) (let ((stop (gettimeofday))) (- (pair->sec stop) (pair->sec start))))) (format #t "Sqlite database populated in ~a seconds Brute force search took ~a seconds Sqlite search took ~a seconds " (time-us build-sqlite-database) (time-us (cut brute-force-retrieve "strategy")) (time-us (cut sqlite-retrieve "strategy"))) (sqlite-close db) --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQEzBAEBCAAdFiEEf3MDQ/Lwnzx3v3nTLiXui2GAK7MFAl4gtD0ACgkQLiXui2GA K7MYkQgAlrgOYBrrxY5KbQJTO9VlZ6HSEdppjaZr4pf60VZ3pWinfJYsp7FzujkP 0C72Rle5aE+idwTmh/M41imKdCuqxh6wl5i3zGAlB4vn4kuQ85vBvWxDAu3XYHs0 erl3IgwIppRvMfreT8UIHbSb9mp0DmxoMTNmnqdWCctkgx+SJvPlMc9XhdElthVJ ei+M0mx3nXcZ7y9b1MStozvWypI6YrfCju2eFqEGMOPxVeWqbxrcxip7Z1SKAtRM Ik3iYXlgUpk9G64wK7lO20r6lXFBk1bwFccmoJ1EMWSS2r76jlr3wJRmChMzWgs8 XSc81kHkJjUD51+fCU+Zm0gAjEYe2A== =DeVc -----END PGP SIGNATURE----- --==-=-=--