Subject: [bug#39258] [PATCH 4/4] gnu: Use xapian index for package search.
References: <20200227204150.30985-1-arunisaac@systemreboot.net> <20200227204150.30985-5-arunisaac@systemreboot.net>
In-Reply-To: <20200227204150.30985-5-arunisaac@systemreboot.net>
From: zimoun
Date: Tue, 3 Mar 2020 20:21:46 +0100
To: Arun Isaac
Cc: Ludovic Courtès, 39258@debbugs.gnu.org

Hi Arun,

On Thu, 27 Feb 2020 at 21:42, Arun Isaac wrote:
>
> * gnu/packages.scm (search-package-index): New function.
> * guix/scripts/package.scm (find-packages-by-description): Search using the
> xapian package index if search patterns are literal strings. Else, search
> using fold-packages.
> ---
>  gnu/packages.scm         | 17 +++++++++++-
>  guix/scripts/package.scm | 57 +++++++++++++++++++++++-----------------
>  2 files changed, 49 insertions(+), 25 deletions(-)
>
> diff --git a/gnu/packages.scm b/gnu/packages.scm
> index e91753e2a8..5b5b29bf84 100644
> --- a/gnu/packages.scm
> +++ b/gnu/packages.scm
> @@ -67,7 +67,8 @@
>              specifications->manifest
>
>              generate-package-cache
> -            generate-package-search-index))
> +            generate-package-search-index
> +            search-package-index))
>
>  ;;; Commentary:
>  ;;;
> @@ -453,6 +454,20 @@ reducing the memory footprint."
>
>      db-path)
>
> +(define (search-package-index profile querystring)
> +  (let ((offset 0)
> +        (pagesize 10))

Why this value of 10?  It fixes the number of packages returned, no?
I tried replacing it with 100 and got 100 packages. :-)

> +    (call-with-database (string-append profile %package-search-index)
> +      (lambda (db)
> +        (let ((query (parse-query querystring #:stemmer (make-stem "en"))))
> +          (mset-fold (lambda (item result)

I do not know what the naming convention for the bindings is.  But since
there is 'fold-packages', I would be inclined toward 'fold-msets' or
something in that flavour.

> +                       (match (find-packages-by-name
> +                               (document-data (mset-item-document item)))
> +                         ((package _ ...)
> +                          (append result `((,package . ,(mset-item-weight item)))))))
> +                     '()
> +                     (enquire-mset (enquire db query) offset pagesize)))))))
> +
>
>  (define %sigint-prompt
>    ;; The prompt to jump to upon SIGINT.
> diff --git a/guix/scripts/package.scm b/guix/scripts/package.scm
> index 1cb0d382bf..6a3b9002dd 100644
> --- a/guix/scripts/package.scm
> +++ b/guix/scripts/package.scm
> @@ -7,6 +7,7 @@
>  ;;; Copyright © 2016 Benz Schenk
>  ;;; Copyright © 2016 Chris Marusich
>  ;;; Copyright © 2019 Tobias Geerinckx-Rice
> +;;; Copyright © 2020 Arun Isaac
>  ;;;
>  ;;; This file is part of GNU Guix.
>  ;;;
> @@ -178,31 +179,40 @@ hooks\" run when building the profile."
>  ;;; Package specifications.
>  ;;;
>
> -(define (find-packages-by-description regexps)
> +(define (find-packages-by-description patterns)
>    "Return a list of pairs: packages whose name, synopsis, description,
>  or output matches at least one of REGEXPS sorted by relevance, and its
>  non-zero relevance score."
> -  (let ((matches (fold-packages (lambda (package result)
> -                                  (if (package-superseded package)
> -                                      result
> -                                      (match (package-relevance package
> -                                                                regexps)
> -                                        ((? zero?)
> -                                         result)
> -                                        (score
> -                                         (cons (cons package score)
> -                                               result)))))
> -                                '())))
> -    (sort matches
> -          (lambda (m1 m2)
> -            (match m1
> -              ((package1 . score1)
> -               (match m2
> -                 ((package2 . score2)
> -                  (if (= score1 score2)
> -                      (string>? (package-full-name package1)
> -                                (package-full-name package2))
> -                      (> score1 score2))))))))))
> +  (define (regexp? str)
> +    (string-any
> +     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
> +     str))

Instead of reverting this, I would leave the current
'find-packages-by-description' as it is and add
'find-packages-by-description-indexed', doing just
'(search-package-index (current-profile) (string-join patterns " "))'.
And maybe refactor the sorting of scores.

Then I would put the test branch in 'guix/scripts/package.scm'...

> +  (if (and (current-profile)
> +           (not (any regexp? patterns)))
> +      (search-package-index (current-profile) (string-join patterns " "))
> +      (let* ((regexps (map (cut make-regexp* <> regexp/icase) patterns))
> +             (matches (fold-packages (lambda (package result)
> +                                       (if (package-superseded package)
> +                                           result
> +                                           (match (package-relevance package

Note that I am in the process of implementing the BM25 weights as
'package-relevance'; or at least really thinking about it! :-)
I have already talked about TF-IDF as relevance, for example here [1].
And from reading the Xapian documentation [2], it seems affordable.  Or
not ;-) because of the regexps...  It needs some thought...  That is what
I mean by "in the process".
;-)

And in this case, replacing 'fold-packages' with 'mset-fold' is almost a
drop-in change; it should add some flexibility and make the code more
unified.

(Aside from searching, IMHO 'package-relevance' should also help in
linting badly written descriptions, but that is another story. ;-))

[1] https://lists.gnu.org/archive/html/guix-devel/2019-07/msg00252.html
[2] https://xapian.org/docs/bm25.html

> +                                                                     regexps)
> +                                             ((? zero?)
> +                                              result)
> +                                             (score
> +                                              (cons (cons package score)
> +                                                    result)))))
> +                                     '())))
> +        (sort matches
> +              (lambda (m1 m2)
> +                (match m1
> +                  ((package1 . score1)
> +                   (match m2
> +                     ((package2 . score2)
> +                      (if (= score1 score2)
> +                          (string>? (package-full-name package1)
> +                                    (package-full-name package2))
> +                          (> score1 score2)))))))))))
>
>  (define (transaction-upgrade-entry store entry transaction)
>    "Return a variant of TRANSACTION that accounts for the upgrade of ENTRY, a
> @@ -777,8 +787,7 @@ processed, #f otherwise."

...here.

+  (define (regexp? str)
+    (string-any
+     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
+     str))

>                                         (('query 'search rx) rx)
>                                         (_ #f))
>                                       opts))
>
> -                  (regexps  (map (cut make-regexp* <> regexp/icase) patterns))
> -                  (matches  (find-packages-by-description regexps)))

+                  (matches (if (any regexp? patterns)
+                               (find-packages-by-description
+                                (map (cut make-regexp* <> regexp/icase)
+                                     patterns))
+                               (find-packages-by-description-indexed patterns))))

I mean something like that.

>           (leave-on-EPIPE
>            (display-search-results matches (current-output-port)))
>           #t))
> --
> 2.23.0

All the best,
simon