unofficial mirror of guix-patches@gnu.org 
* [bug#39258] Faster guix search using an sqlite cache
@ 2020-01-23 19:51 Arun Isaac
  2020-01-29 23:33 ` zimoun
                   ` (8 more replies)
  0 siblings, 9 replies; 126+ messages in thread
From: Arun Isaac @ 2020-01-23 19:51 UTC (permalink / raw)
  To: 39258


[-- Attachment #1.1: Type: text/plain, Size: 1818 bytes --]


Hi,

As discussed on guix-devel at
https://lists.gnu.org/archive/html/guix-devel/2020-01/msg00310.html , I
am working on an sqlite cache to improve guix search performance. I have
attached a highly incomplete WIP patch. The patch attempts to
reimplement the package-cache-file hook in guix/channels.scm using a
sqlite database. To this end, it rewrites most of the
generate-package-cache and cache-lookup functions in gnu/packages.scm. I
have yet to hook this up to guix search.

At the moment, I am having some difficulty populating the sqlite
database. generate-package-cache populates the database correctly when
invoked from a normal guile REPL using geiser, but fails to do so when
run by the guix daemon during guix pull.

I ran guix pull using

$ ./pre-inst-env guix pull --url=$PWD --branch=search -p /tmp/test

where search is the branch I am working on.

Running

$ ls /tmp/test/lib/guix -lh

shows

total 2.1M
-r--r--r-- 2 root root 2.1M ஜன.   1  1970 package-cache.sqlite
-r--r--r-- 2 root root  26K ஜன.   1  1970 package-cache.sqlite-journal

On examining package-cache.sqlite, I find that no records have been
written. And, there is a lingering journal file that shouldn't be
there. For some reason, populating the sqlite database does not work
with guix pull. sqlite probably crashes and leaves the journal file.

If I try to populate the database with each package record being
inserted in its own transaction, at least some of the insertions
work. But the journal file still lingers. My unverified guess is that
everything except the last transaction was successful.

Any ideas what's going on?

Also, inserting each package in its own transaction is ridiculously slow
and so that is out of the question. See https://www.sqlite.org/faq.html#q19
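
For illustration only, here is a minimal sketch of the single-transaction
approach described above. It uses only guile-sqlite3 procedures that already
appear in the attached patch. The helper name, the one-column "names" table
and the scratch database path are assumptions made for the example; they are
not part of the patch.

```scheme
(use-modules (sqlite3))

;; Hypothetical helper: insert every string in NAMES into a one-column
;; table called "names", wrapping all insertions in one transaction so
;; that the cost of committing is paid once rather than once per row.
(define (insert-names-in-one-transaction db names)
  (sqlite-exec db "BEGIN")
  (for-each (lambda (name)
              (let ((statement
                     (sqlite-prepare
                      db "INSERT INTO names(name) VALUES(:name)")))
                (sqlite-bind-arguments statement #:name name)
                ;; Step the statement to completion, as the patch does.
                (sqlite-fold cons '() statement)
                (sqlite-finalize statement)))
            names)
  (sqlite-exec db "COMMIT"))

;; Example use against a scratch database (the path is an assumption):
;; (let ((db (sqlite-open "/tmp/scratch.sqlite")))
;;   (sqlite-exec db "CREATE TABLE names (name text)")
;;   (insert-names-in-one-transaction db '("hello" "guile" "guix"))
;;   (sqlite-close db))
```

Committing after each row instead would mean issuing BEGIN and COMMIT inside
the loop, which is the slow variant that the SQLite FAQ entry warns about.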


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: 0001-fast-search.patch --]
[-- Type: text/x-patch, Size: 13197 bytes --]

From d1305351a90a84eb75e4769284d5e06927eade3e Mon Sep 17 00:00:00 2001
From: Arun Isaac <arunisaac@systemreboot.net>
Date: Tue, 21 Jan 2020 20:45:43 +0530
Subject: [PATCH] fast search

---
 build-aux/build-self.scm |   5 +
 gnu/packages.scm         | 207 +++++++++++++++++++++++----------------
 2 files changed, 128 insertions(+), 84 deletions(-)

diff --git a/build-aux/build-self.scm b/build-aux/build-self.scm
index fc13032b73..c123ad3b11 100644
--- a/build-aux/build-self.scm
+++ b/build-aux/build-self.scm
@@ -264,6 +264,9 @@ interface (FFI) of Guile.")
   (define fake-git
     (scheme-file "git.scm" #~(define-module (git))))
 
+  (define fake-sqlite3
+    (scheme-file "sqlite3.scm" #~(define-module (sqlite3))))
+
   (with-imported-modules `(((guix config)
                             => ,(make-config.scm))
 
@@ -278,6 +281,8 @@ interface (FFI) of Guile.")
                            ;; (git) to placate it.
                            ((git) => ,fake-git)
 
+                           ((sqlite3) => ,fake-sqlite3)
+
                            ,@(source-module-closure `((guix store)
                                                       (guix self)
                                                       (guix derivations)
diff --git a/gnu/packages.scm b/gnu/packages.scm
index d22c992bb1..4e2c52e62d 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -43,6 +43,7 @@
   #:use-module (srfi srfi-34)
   #:use-module (srfi srfi-35)
   #:use-module (srfi srfi-39)
+  #:use-module (sqlite3)
   #:export (search-patch
             search-patches
             search-auxiliary-file
@@ -204,10 +205,8 @@ PROC is called along these lines:
 PROC can use #:allow-other-keys to ignore the bits it's not interested in.
 When a package cache is available, this procedure does not actually load any
 package module."
-  (define cache
-    (load-package-cache (current-profile)))
-
-  (if (and cache (cache-is-authoritative?))
+  (if (and (cache-is-authoritative?)
+           (current-profile))
       (vhash-fold (lambda (name vector result)
                     (match vector
                       (#(name version module symbol outputs
@@ -220,7 +219,7 @@ package module."
                              #:supported? supported?
                              #:deprecated? deprecated?))))
                   init
-                  cache)
+                  (cache-lookup (current-profile)))
       (fold-packages (lambda (package result)
                        (proc (package-name package)
                              (package-version package)
@@ -252,31 +251,7 @@ is guaranteed to never traverse the same package twice."
 
 (define %package-cache-file
   ;; Location of the package cache.
-  "/lib/guix/package.cache")
-
-(define load-package-cache
-  (mlambda (profile)
-    "Attempt to load the package cache.  On success return a vhash keyed by
-package names.  Return #f on failure."
-    (match profile
-      (#f #f)
-      (profile
-       (catch 'system-error
-         (lambda ()
-           (define lst
-             (load-compiled (string-append profile %package-cache-file)))
-           (fold (lambda (item vhash)
-                   (match item
-                     (#(name version module symbol outputs
-                             supported? deprecated?
-                             file line column)
-                      (vhash-cons name item vhash))))
-                 vlist-null
-                 lst))
-         (lambda args
-           (if (= ENOENT (system-error-errno args))
-               #f
-               (apply throw args))))))))
+  "/lib/guix/package-cache.sqlite")
 
 (define find-packages-by-name/direct              ;bypass the cache
   (let ((packages (delay
@@ -297,25 +272,57 @@ decreasing version order."
                     matching)
             matching)))))
 
-(define (cache-lookup cache name)
+(define* (cache-lookup profile #:optional name)
   "Lookup package NAME in CACHE.  Return a list sorted in increasing version
 order."
   (define (package-version<? v1 v2)
     (version>? (vector-ref v2 1) (vector-ref v1 1)))
 
-  (sort (vhash-fold* cons '() name cache)
-        package-version<?))
+  (define (int->boolean n)
+    (case n
+      ((0) #f)
+      ((1) #t)))
+
+  (define (string->list str)
+    (call-with-input-string str read))
+
+  (define select-statement
+    (string-append
+     "SELECT name, version, module, symbol, outputs, supported, superseded, locationFile, locationLine, locationColumn from packages"
+     (if name " WHERE name = :name" "")))
+
+  (define cache-file
+    (string-append profile %package-cache-file))
+
+  (let* ((db (sqlite-open cache-file SQLITE_OPEN_READONLY))
+         (statement (sqlite-prepare db select-statement)))
+    (when name
+      (sqlite-bind-arguments statement #:name name))
+    (let ((result (sqlite-fold (lambda (v result)
+                                 (match v
+                                   (#(name version module symbol outputs supported superseded file line column)
+                                    (cons
+                                     (vector name
+                                             version
+                                             (string->list module)
+                                             (string->symbol symbol)
+                                             (string->list outputs)
+                                             (int->boolean supported)
+                                             (int->boolean superseded)
+                                             (list file line column))
+                                     result))))
+                               '() statement)))
+      (sqlite-finalize statement)
+      (sqlite-close db)
+      (sort result package-version<?))))
 
 (define* (find-packages-by-name name #:optional version)
   "Return the list of packages with the given NAME.  If VERSION is not #f,
 then only return packages whose version is prefixed by VERSION, sorted in
 decreasing version order."
-  (define cache
-    (load-package-cache (current-profile)))
-
-  (if (and (cache-is-authoritative?) cache)
-      (match (cache-lookup cache name)
-        (#f #f)
+  (if (and (cache-is-authoritative?)
+           (current-profile))
+      (match (cache-lookup (current-profile) name)
         ((#(_ versions modules symbols _ _ _ _ _ _) ...)
          (fold (lambda (version* module symbol result)
                  (if (or (not version)
@@ -331,12 +338,9 @@ decreasing version order."
 (define* (find-package-locations name #:optional version)
   "Return a list of version/location pairs corresponding to each package
 matching NAME and VERSION."
-  (define cache
-    (load-package-cache (current-profile)))
-
-  (if (and cache (cache-is-authoritative?))
-      (match (cache-lookup cache name)
-        (#f '())
+  (if (and (cache-is-authoritative?)
+           (current-profile))
+      (match (cache-lookup (current-profile) name)
         ((#(name versions modules symbols outputs
                  supported? deprecated?
                  files lines columns) ...)
@@ -372,6 +376,9 @@ VERSION."
 ;; Prevent Guile 3 from inlining this procedure so we can mock it in tests.
 (set! find-best-packages-by-name find-best-packages-by-name)
 
+(define (list->string x)
+  (call-with-output-string (cut write x <>)))
+
 (define (generate-package-cache directory)
   "Generate under DIRECTORY a cache of all the available packages.
 
@@ -381,49 +388,81 @@ reducing the memory footprint."
   (define cache-file
     (string-append directory %package-cache-file))
 
-  (define (expand-cache module symbol variable result+seen)
+  (define schema
+    "CREATE TABLE packages (name text,
+version text,
+module text,
+symbol text,
+outputs text,
+supported int,
+superseded int,
+locationFile text,
+locationLine int,
+locationColumn int);
+CREATE VIRTUAL TABLE packageSearch USING fts5(name, searchText);")
+
+  (define insert-statement
+    "INSERT INTO packages(name, version, module, symbol, outputs, supported, superseded, locationFile, locationLine, locationColumn)
+VALUES(:name, :version, :module, :symbol, :outputs, :supported, :superseded, :locationfile, :locationline, :locationcolumn)")
+
+  (define insert-package-search-statement
+    "INSERT INTO packageSearch(name, searchText) VALUES(:name, :searchtext)")
+
+  (define (boolean->int x)
+    (if x 1 0))
+
+  (define (list->string x)
+    (call-with-output-string (cut write x <>)))
+
+  (define (insert-package db module symbol variable seen)
     (match (false-if-exception (variable-ref variable))
       ((? package? package)
-       (match result+seen
-         ((result . seen)
-          (if (or (vhash-assq package seen)
-                  (hidden-package? package))
-              (cons result seen)
-              (cons (cons `#(,(package-name package)
-                             ,(package-version package)
-                             ,(module-name module)
-                             ,symbol
-                             ,(package-outputs package)
-                             ,(->bool (supported-package? package))
-                             ,(->bool (package-superseded package))
-                             ,@(let ((loc (package-location package)))
-                                 (if loc
-                                     `(,(location-file loc)
-                                       ,(location-line loc)
-                                       ,(location-column loc))
-                                     '(#f #f #f))))
-                          result)
-                    (vhash-consq package #t seen))))))
-      (_
-       result+seen)))
-
-  (define exp
-    (first
-     (fold-module-public-variables* expand-cache
-                                    (cons '() vlist-null)
-                                    (all-modules (%package-module-path)
-                                                 #:warn
-                                                 warn-about-load-error))))
+       (cond
+        ((or (vhash-assq package seen)
+             (hidden-package? package))
+         seen)
+        (else
+         (let ((statement (sqlite-prepare db insert-statement)))
+           (sqlite-bind-arguments statement
+                                  #:name (package-name package)
+                                  #:version (package-version package)
+                                  #:module (list->string (module-name module))
+                                  #:symbol (symbol->string symbol)
+                                  #:outputs (list->string (package-outputs package))
+                                  #:supported (boolean->int (supported-package? package))
+                                  #:superseded (boolean->int (package-superseded package))
+                                  #:locationfile (cond
+                                                  ((package-location package) => location-file)
+                                                  (else #f))
+                                  #:locationline (cond
+                                                  ((package-location package) => location-line)
+                                                  (else #f))
+                                  #:locationcolumn (cond
+                                                    ((package-location package) => location-column)
+                                                    (else #f)))
+           (sqlite-fold cons '() statement)
+           (sqlite-finalize statement))
+         (let ((statement (sqlite-prepare db insert-package-search-statement)))
+           (sqlite-bind-arguments statement
+                                  #:name (package-name package)
+                                  #:searchtext (package-description package))
+           (sqlite-fold cons '() statement)
+           (sqlite-finalize statement))
+         (vhash-consq package #t seen))))
+      (_ seen)))
 
   (mkdir-p (dirname cache-file))
-  (call-with-output-file cache-file
-    (lambda (port)
-      ;; Store the cache as a '.go' file.  This makes loading fast and reduces
-      ;; heap usage since some of the static data is directly mmapped.
-      (put-bytevector port
-                      (compile `'(,@exp)
-                               #:to 'bytecode
-                               #:opts '(#:to-file? #t)))))
+  (let ((db (sqlite-open cache-file)))
+    (sqlite-exec db schema)
+    (sqlite-exec db "BEGIN")
+    (fold-module-public-variables* (cut insert-package db <> <> <> <>)
+                                   vlist-null
+                                   (all-modules (%package-module-path)
+                                                #:warn
+                                                warn-about-load-error))
+    (sqlite-exec db "COMMIT;")
+    (sqlite-close db))
+
   cache-file)
 
 \f
-- 
2.23.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* [bug#39258] Faster guix search using an sqlite cache
  2020-01-23 19:51 [bug#39258] Faster guix search using an sqlite cache Arun Isaac
@ 2020-01-29 23:33 ` zimoun
  2020-01-30 13:48   ` Arun Isaac
  2020-02-27 20:41 ` [bug#39258] [PATCH 0/4] Xapian for Guix package search Arun Isaac
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 126+ messages in thread
From: zimoun @ 2020-01-29 23:33 UTC (permalink / raw)
  To: Arun Isaac; +Cc: 39258

Hi Arun,

Thank you for the patch!
Cool! :-)

I have not tested it yet. Sorry.


On Thu, 23 Jan 2020 at 20:53, Arun Isaac <arunisaac@systemreboot.net> wrote:

> At the moment, I am having some difficulty populating the sqlite
> database. generate-package-cache populates the database correctly when
> invoked from a normal guile REPL using geiser, but fails to do so when
> run by the guix daemon during guix pull.

[...]

> On examining package-cache.sqlite, I find that no records have been
> written. And, there is a lingering journal file that shouldn't be
> there. For some reason, populating the sqlite database does not work
> with guix pull. sqlite probably crashes and leaves the journal file.

Hum? weird...
Is it possible that a module is loaded when Guile repl is used and not
with Guix pull?
What about "guix repl"?


> If I try to populate the database with each package record being
> inserted in its own transaction, at least some of the insertions

You mean 'commit' the database after each insertion, right?


> work. But the journal file still lingers. My unverified guess is that
> everything except the last transaction was successful.

And this does not happen with the repl, right?


> Any ideas what's going on?

I have no idea.
Weird.

What about adding 'last-insert-row-id' from 'guix/store/database.scm'?
I mean without really understanding and just grepping
'sqlite-finalize' spots that 'last-insert-row-id' seems often around.
:-)


> Also, inserting each package in its own transaction is ridiculously slow
> and so that is out of the question. See https://www.sqlite.org/faq.html#q19

Agree that it is not an option. :-)


Otherwise, 'list->string' is defined twice. And the first one is not
necessary, I guess.
The docstring of 'cache-lookup' is not coherent anymore. :-)
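
A side note on the 'list->string' helper mentioned above: Guile already
provides core procedures named 'list->string' and 'string->list' that convert
between strings and lists of characters, so the patch's helpers shadow them.
A minimal sketch of the same write/read round trip under non-clashing names
follows. Both names are hypothetical and do not appear in the patch.

```scheme
;; Hypothetical names chosen so as not to shadow Guile's core
;; 'list->string' and 'string->list'.

(define (object->sexp-string x)
  ;; Serialize X with 'write' so that 'read' can parse it back.
  (call-with-output-string
    (lambda (port) (write x port))))

(define (sexp-string->object str)
  ;; Parse the first datum found in STR.
  (call-with-input-string str read))

;; Round trip of a module name and of a list of outputs:
;; (sexp-string->object (object->sexp-string '(gnu packages base)))
;; (sexp-string->object (object->sexp-string '("out" "doc")))
```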


Cheers,
simon


* [bug#39258] Faster guix search using an sqlite cache
  2020-01-29 23:33 ` zimoun
@ 2020-01-30 13:48   ` Arun Isaac
  2020-01-31 12:48     ` zimoun
  0 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-01-30 13:48 UTC (permalink / raw)
  To: zimoun; +Cc: 39258

[-- Attachment #1: Type: text/plain, Size: 1452 bytes --]


> Hum? weird...
> Is it possible that a module is loaded when Guile repl is used and not
> with Guix pull?

It could be. But I don't know how to confirm this theory.

> What about "guix repl"?

I just tried 'guix repl'. It worked correctly, just like guile repl.

>> If I try to populate the database with each package record being
>> inserted in its own transaction, at least some of the insertions
>
> You mean 'commit' the database after each insertion, right?

Yes, that is what I mean.

>> work. But the journal file still lingers. My unverified guess is that
>> everything except the last transaction was successful.
>
> And this does not happen with the repl, right?

No, this does not happen with the repl.

>> Any ideas what's going on?
>
> I have no idea.
> Weird.

Do you know of any way sqlite can create an error log to report what's
going on? That might really help debug this issue.

> What about adding 'last-insert-row-id' from 'guix/store/database.scm'?
> I mean without really understanding and just grepping
> 'sqlite-finalize' spots that 'last-insert-row-id' seems often around.
> :-)

I tried this just now, but still the journal lingers.

> Otherwise, 'list->string' is defined twice. And the first one is not
> necessary, I guess.

Ah, thanks for catching this!

> The docstring of 'cache-lookup' is not coherent anymore. :-)

Yes, I haven't gotten around to fixing up all those yet. I thought I'd
get the code working first.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* [bug#39258] Faster guix search using an sqlite cache
  2020-01-30 13:48   ` Arun Isaac
@ 2020-01-31 12:48     ` zimoun
  2020-02-02 21:16       ` Arun Isaac
  0 siblings, 1 reply; 126+ messages in thread
From: zimoun @ 2020-01-31 12:48 UTC (permalink / raw)
  To: Arun Isaac; +Cc: 39258

Hi,

On Thu, 30 Jan 2020 at 14:49, Arun Isaac <arunisaac@systemreboot.net> wrote:

> >> Any ideas what's going on?
[...]
> Do you know of any way sqlite can create an error log to report what's
> going on? That might really help debug this issue.

Danny told me something like:

  (catch (sqlite-error

I have not tried yet.



> > The docstring of 'cache-lookup' is not coherent anymore. :-)
>
> Yes, I haven't gotten around to fixing up all those yet. I thought I'd
> get the code working first.

Yes, I imagine. Just to notice. :-)



All the best,
simon


* [bug#39258] Faster guix search using an sqlite cache
  2020-01-31 12:48     ` zimoun
@ 2020-02-02 21:16       ` Arun Isaac
  2020-02-04 10:19         ` zimoun
  0 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-02-02 21:16 UTC (permalink / raw)
  To: zimoun; +Cc: 39258

[-- Attachment #1: Type: text/plain, Size: 771 bytes --]


> Danny told me something like:
>
>   (catch (sqlite-error
>
> I have not tried yet.

Thank you, this was useful. I was able to catch and report the error. I
also found the log file for the guix-package-cache profile hook. It says

(repl-version 0 0)
Generating package cache for '/gnu/store/b6f9b5qbcn4r932whrr6m15rdimbgrhs-profile'...
(exception sqlite-error (value sqlite-open) (value 14) (value "Unable to open the database file"))

This could be a permission error, or something to do with the existence
or lack thereof of certain directories (such as /var) in the chroot of
the build daemon. I'm still figuring it out.
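
For reference, a minimal sketch of how such an error can be caught and
printed. The handler's argument order (procedure name, numeric code, message)
follows the exception shown in the log above; the helper name is an
assumption made for the example. Code 14 is SQLite's SQLITE_CANTOPEN.

```scheme
(use-modules (sqlite3))

;; Hypothetical helper: try to open FILE; on failure, print which
;; procedure failed, the numeric SQLite code and the message, then
;; return #f instead of propagating the exception.
(define (open-database-or-report file)
  (catch 'sqlite-error
    (lambda ()
      (sqlite-open file))
    (lambda (key who code message)
      (format (current-error-port)
              "~a failed with SQLite code ~a: ~a~%"
              who code message)
      #f)))

;; Example (the path is an assumption and is expected not to exist):
;; (open-database-or-report "/nonexistent-directory/package-cache.sqlite")
```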

I'm also in half a mind to get some guile xapian bindings ready so we
can just do that instead of messing with sqlite here. But, let's
see. :-P

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* [bug#39258] Faster guix search using an sqlite cache
  2020-02-02 21:16       ` Arun Isaac
@ 2020-02-04 10:19         ` zimoun
  2020-02-06  1:58           ` Arun Isaac
  0 siblings, 1 reply; 126+ messages in thread
From: zimoun @ 2020-02-04 10:19 UTC (permalink / raw)
  To: Arun Isaac; +Cc: 39258

Hi,

On Sun, 2 Feb 2020 at 22:16, Arun Isaac <arunisaac@systemreboot.net> wrote:

> Thank you, this was useful. I was able to catch and report the error. I

Where have you reported the error?


> also found the log file for the guix-package-cache profile hook. It says
>
> (repl-version 0 0)
> Generating package cache for '/gnu/store/b6f9b5qbcn4r932whrr6m15rdimbgrhs-profile'...
> (exception sqlite-error (value sqlite-open) (value 14) (value "Unable to open the database file"))
>
> This could be a permission error, or something to do with the existence
> or lack thereof of certain directories (such as /var) in the chroot of
> the build daemon. I'm still figuring it out.

Hum? And this should explain why it is working with the REPL and not
with the CLI, right?


> I'm also in half a mind to get some guile xapian bindings ready so we
> can just do that instead of messing with sqlite here. But, let's
> see. :-P

Cool!
Let me know if you push something somewhere.


Cheers,
simon


* [bug#39258] Faster guix search using an sqlite cache
  2020-02-04 10:19         ` zimoun
@ 2020-02-06  1:58           ` Arun Isaac
  2020-02-11 16:29             ` Ludovic Courtès
  0 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-02-06  1:58 UTC (permalink / raw)
  To: zimoun; +Cc: 39258


[-- Attachment #1.1: Type: text/plain, Size: 1213 bytes --]


>> Thank you, this was useful. I was able to catch and report the error. I
>
> Where have you reported the error?

I reported the error to the derivation log. For example, if the
derivation for the guix-package-cache derivation is
/gnu/store/cyf2h3frcjxm147dii5qic8d6kpm39nq-guix-package-cache.drv, the
log file will be at
/var/log/guix/drvs/cy/f2h3frcjxm147dii5qic8d6kpm39nq-guix-package-cache.drv.bz2. Notice
that the directory name under drvs is the first two letters of the hash,
and the file name under that directory is the remaining letters.
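
The naming rule above can be sketched in a few lines of plain Guile. The
procedure name is hypothetical, and the log directory is taken from the
example path in this message rather than from any configuration.

```scheme
;; Hypothetical helper: map a derivation file name to the location of
;; its build log, following the rule described above.  The log
;; directory is hard-coded from the example path and is an assumption.
(define (derivation-log-file drv)
  (let ((base (basename drv)))
    (string-append "/var/log/guix/drvs/"
                   (string-take base 2)   ;first two letters of the hash
                   "/"
                   (string-drop base 2)   ;remainder of the file name
                   ".bz2")))

;; (derivation-log-file
;;  "/gnu/store/cyf2h3frcjxm147dii5qic8d6kpm39nq-guix-package-cache.drv")
;; => "/var/log/guix/drvs/cy/f2h3frcjxm147dii5qic8d6kpm39nq-guix-package-cache.drv.bz2"
```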

Also please find attached a dump of my code so far.

>> This could be a permission error, or something to do with the existence
>> or lack thereof of certain directories (such as /var) in the chroot of
>> the build daemon. I'm still figuring it out.
>
> Hum? And this should explain why it is working with the REPL and not
> with the CLI, right?

This could explain it, but I am not sure if this is the correct
explanation.

>> I'm also in half a mind to get some guile xapian bindings ready so we
>> can just do that instead of messing with sqlite here. But, let's
>> see. :-P
>
> Cool!
> Let me know if you push something somewhere.

Sure, will let you know.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: 0001-fast-search.patch --]
[-- Type: text/x-patch, Size: 14366 bytes --]

From 4c883fcff1f44339b28df6ccdb2b10c906439e3d Mon Sep 17 00:00:00 2001
From: Arun Isaac <arunisaac@systemreboot.net>
Date: Tue, 21 Jan 2020 20:45:43 +0530
Subject: [PATCH] fast search

---
 build-aux/build-self.scm |   5 +
 gnu/packages.scm         | 234 +++++++++++++++++++++++++--------------
 2 files changed, 155 insertions(+), 84 deletions(-)

diff --git a/build-aux/build-self.scm b/build-aux/build-self.scm
index fc13032b73..c123ad3b11 100644
--- a/build-aux/build-self.scm
+++ b/build-aux/build-self.scm
@@ -264,6 +264,9 @@ interface (FFI) of Guile.")
   (define fake-git
     (scheme-file "git.scm" #~(define-module (git))))
 
+  (define fake-sqlite3
+    (scheme-file "sqlite3.scm" #~(define-module (sqlite3))))
+
   (with-imported-modules `(((guix config)
                             => ,(make-config.scm))
 
@@ -278,6 +281,8 @@ interface (FFI) of Guile.")
                            ;; (git) to placate it.
                            ((git) => ,fake-git)
 
+                           ((sqlite3) => ,fake-sqlite3)
+
                            ,@(source-module-closure `((guix store)
                                                       (guix self)
                                                       (guix derivations)
diff --git a/gnu/packages.scm b/gnu/packages.scm
index d22c992bb1..0ae5b84284 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -43,6 +43,7 @@
   #:use-module (srfi srfi-34)
   #:use-module (srfi srfi-35)
   #:use-module (srfi srfi-39)
+  #:use-module (sqlite3)
   #:export (search-patch
             search-patches
             search-auxiliary-file
@@ -204,10 +205,8 @@ PROC is called along these lines:
 PROC can use #:allow-other-keys to ignore the bits it's not interested in.
 When a package cache is available, this procedure does not actually load any
 package module."
-  (define cache
-    (load-package-cache (current-profile)))
-
-  (if (and cache (cache-is-authoritative?))
+  (if (and (cache-is-authoritative?)
+           (current-profile))
       (vhash-fold (lambda (name vector result)
                     (match vector
                       (#(name version module symbol outputs
@@ -220,7 +219,7 @@ package module."
                              #:supported? supported?
                              #:deprecated? deprecated?))))
                   init
-                  cache)
+                  (cache-lookup (current-profile)))
       (fold-packages (lambda (package result)
                        (proc (package-name package)
                              (package-version package)
@@ -252,31 +251,7 @@ is guaranteed to never traverse the same package twice."
 
 (define %package-cache-file
   ;; Location of the package cache.
-  "/lib/guix/package.cache")
-
-(define load-package-cache
-  (mlambda (profile)
-    "Attempt to load the package cache.  On success return a vhash keyed by
-package names.  Return #f on failure."
-    (match profile
-      (#f #f)
-      (profile
-       (catch 'system-error
-         (lambda ()
-           (define lst
-             (load-compiled (string-append profile %package-cache-file)))
-           (fold (lambda (item vhash)
-                   (match item
-                     (#(name version module symbol outputs
-                             supported? deprecated?
-                             file line column)
-                      (vhash-cons name item vhash))))
-                 vlist-null
-                 lst))
-         (lambda args
-           (if (= ENOENT (system-error-errno args))
-               #f
-               (apply throw args))))))))
+  "/lib/guix/package-cache.sqlite")
 
 (define find-packages-by-name/direct              ;bypass the cache
   (let ((packages (delay
@@ -297,25 +272,57 @@ decreasing version order."
                     matching)
             matching)))))
 
-(define (cache-lookup cache name)
+(define* (cache-lookup profile #:optional name)
   "Lookup package NAME in CACHE.  Return a list sorted in increasing version
 order."
   (define (package-version<? v1 v2)
     (version>? (vector-ref v2 1) (vector-ref v1 1)))
 
-  (sort (vhash-fold* cons '() name cache)
-        package-version<?))
+  (define (int->boolean n)
+    (case n
+      ((0) #f)
+      ((1) #t)))
+
+  (define (string->list str)
+    (call-with-input-string str read))
+
+  (define select-statement
+    (string-append
+     "SELECT name, version, module, symbol, outputs, supported, superseded, locationFile, locationLine, locationColumn from packages"
+     (if name " WHERE name = :name" "")))
+
+  (define cache-file
+    (string-append profile %package-cache-file))
+
+  (let* ((db (sqlite-open cache-file SQLITE_OPEN_READONLY))
+         (statement (sqlite-prepare db select-statement)))
+    (when name
+      (sqlite-bind-arguments statement #:name name))
+    (let ((result (sqlite-fold (lambda (v result)
+                                 (match v
+                                   (#(name version module symbol outputs supported superseded file line column)
+                                    (cons
+                                     (vector name
+                                             version
+                                             (string->list module)
+                                             (string->symbol symbol)
+                                             (string->list outputs)
+                                             (int->boolean supported)
+                                             (int->boolean superseded)
+                                             (list file line column))
+                                     result))))
+                               '() statement)))
+      (sqlite-finalize statement)
+      (sqlite-close db)
+      (sort result package-version<?))))
 
 (define* (find-packages-by-name name #:optional version)
   "Return the list of packages with the given NAME.  If VERSION is not #f,
 then only return packages whose version is prefixed by VERSION, sorted in
 decreasing version order."
-  (define cache
-    (load-package-cache (current-profile)))
-
-  (if (and (cache-is-authoritative?) cache)
-      (match (cache-lookup cache name)
-        (#f #f)
+  (if (and (cache-is-authoritative?)
+           (current-profile))
+      (match (cache-lookup (current-profile) name)
         ((#(_ versions modules symbols _ _ _ _ _ _) ...)
          (fold (lambda (version* module symbol result)
                  (if (or (not version)
@@ -331,12 +338,9 @@ decreasing version order."
 (define* (find-package-locations name #:optional version)
   "Return a list of version/location pairs corresponding to each package
 matching NAME and VERSION."
-  (define cache
-    (load-package-cache (current-profile)))
-
-  (if (and cache (cache-is-authoritative?))
-      (match (cache-lookup cache name)
-        (#f '())
+  (if (and (cache-is-authoritative?)
+           (current-profile))
+      (match (cache-lookup (current-profile) name)
         ((#(name versions modules symbols outputs
                  supported? deprecated?
                  files lines columns) ...)
@@ -372,6 +376,33 @@ VERSION."
 ;; Prevent Guile 3 from inlining this procedure so we can mock it in tests.
 (set! find-best-packages-by-name find-best-packages-by-name)
 
+;; (generate-package-cache "/tmp/test")
+
+;; XXX: missing in guile-sqlite3@0.1.0
+(define SQLITE_BUSY 5)
+
+(define (call-with-transaction db proc)
+  "Start a transaction with DB (make as many attempts as necessary) and run
+PROC.  If PROC exits abnormally, abort the transaction, otherwise commit the
+transaction after it finishes."
+  (catch 'sqlite-error
+    (lambda ()
+      ;; We use begin immediate here so that if we need to retry, we
+      ;; figure that out immediately rather than because some SQLITE_BUSY
+      ;; exception gets thrown partway through PROC - in which case the
+      ;; part already executed (which may contain side-effects!) would be
+      ;; executed again for every retry.
+      (sqlite-exec db "begin immediate;")
+      (let ((result (proc)))
+        (sqlite-exec db "commit;")
+        result))
+    (lambda (key who error description)
+      (if (= error SQLITE_BUSY)
+          (call-with-transaction db proc)
+          (begin
+            (sqlite-exec db "rollback;")
+            (throw 'sqlite-error who error description))))))
+
 (define (generate-package-cache directory)
   "Generate under DIRECTORY a cache of all the available packages.
 
@@ -381,49 +412,84 @@ reducing the memory footprint."
   (define cache-file
     (string-append directory %package-cache-file))
 
-  (define (expand-cache module symbol variable result+seen)
+  (define schema
+    "CREATE TABLE packages (name text,
+version text,
+module text,
+symbol text,
+outputs text,
+supported int,
+superseded int,
+locationFile text,
+locationLine int,
+locationColumn int);
+CREATE VIRTUAL TABLE packageSearch USING fts5(name, searchText);")
+
+  (define insert-statement
+    "INSERT INTO packages(name, version, module, symbol, outputs, supported, superseded, locationFile, locationLine, locationColumn)
+VALUES(:name, :version, :module, :symbol, :outputs, :supported, :superseded, :locationfile, :locationline, :locationcolumn)")
+
+  (define insert-package-search-statement
+    "INSERT INTO packageSearch(name, searchText) VALUES(:name, :searchtext)")
+
+  (define (boolean->int x)
+    (if x 1 0))
+
+  (define (list->string x)
+    (call-with-output-string (cut write x <>)))
+
+  (define (insert-package db module symbol variable seen)
     (match (false-if-exception (variable-ref variable))
       ((? package? package)
-       (match result+seen
-         ((result . seen)
-          (if (or (vhash-assq package seen)
-                  (hidden-package? package))
-              (cons result seen)
-              (cons (cons `#(,(package-name package)
-                             ,(package-version package)
-                             ,(module-name module)
-                             ,symbol
-                             ,(package-outputs package)
-                             ,(->bool (supported-package? package))
-                             ,(->bool (package-superseded package))
-                             ,@(let ((loc (package-location package)))
-                                 (if loc
-                                     `(,(location-file loc)
-                                       ,(location-line loc)
-                                       ,(location-column loc))
-                                     '(#f #f #f))))
-                          result)
-                    (vhash-consq package #t seen))))))
-      (_
-       result+seen)))
-
-  (define exp
-    (first
-     (fold-module-public-variables* expand-cache
-                                    (cons '() vlist-null)
-                                    (all-modules (%package-module-path)
-                                                 #:warn
-                                                 warn-about-load-error))))
+       (cond
+        ((or (vhash-assq package seen)
+             (hidden-package? package))
+         seen)
+        (else
+         (let ((statement (sqlite-prepare db insert-statement)))
+           (sqlite-bind-arguments statement
+                                  #:name (package-name package)
+                                  #:version (package-version package)
+                                  #:module (list->string (module-name module))
+                                  #:symbol (symbol->string symbol)
+                                  #:outputs (list->string (package-outputs package))
+                                  #:supported (boolean->int (supported-package? package))
+                                  #:superseded (boolean->int (package-superseded package))
+                                  #:locationfile (cond
+                                                  ((package-location package) => location-file)
+                                                  (else #f))
+                                  #:locationline (cond
+                                                  ((package-location package) => location-line)
+                                                  (else #f))
+                                  #:locationcolumn (cond
+                                                    ((package-location package) => location-column)
+                                                    (else #f)))
+           (sqlite-fold cons '() statement)
+           (sqlite-finalize statement))
+         (let ((statement (sqlite-prepare db insert-package-search-statement)))
+           (sqlite-bind-arguments statement
+                                  #:name (package-name package)
+                                  #:searchtext (package-description package))
+           (sqlite-fold cons '() statement)
+           (sqlite-finalize statement))
+         (vhash-consq package #t seen))))
+      (_ seen)))
 
   (mkdir-p (dirname cache-file))
-  (call-with-output-file cache-file
-    (lambda (port)
-      ;; Store the cache as a '.go' file.  This makes loading fast and reduces
-      ;; heap usage since some of the static data is directly mmapped.
-      (put-bytevector port
-                      (compile `'(,@exp)
-                               #:to 'bytecode
-                               #:opts '(#:to-file? #t)))))
+  (let ((tmp (string-append (dirname cache-file) "/tmp")))
+    (mkdir-p tmp)
+    (setenv "SQLITE_TMPDIR" tmp))
+  (let ((db (sqlite-open cache-file)))
+    (sqlite-exec db schema)
+    (call-with-transaction db
+        (lambda ()
+          (fold-module-public-variables* (cut insert-package db <> <> <> <>)
+                                         vlist-null
+                                         (all-modules (%package-module-path)
+                                                      #:warn
+                                                      warn-about-load-error))))
+    (sqlite-close db))
+
   cache-file)
 
 \f
-- 
2.23.0



^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [bug#39258] Faster guix search using an sqlite cache
  2020-02-06  1:58           ` Arun Isaac
@ 2020-02-11 16:29             ` Ludovic Courtès
  2020-02-11 18:21               ` zimoun
  0 siblings, 1 reply; 126+ messages in thread
From: Ludovic Courtès @ 2020-02-11 16:29 UTC (permalink / raw)
  To: Arun Isaac; +Cc: 39258, zimoun

Hello Arun!

Arun Isaac <arunisaac@systemreboot.net> skribis:

> From 4c883fcff1f44339b28df6ccdb2b10c906439e3d Mon Sep 17 00:00:00 2001
> From: Arun Isaac <arunisaac@systemreboot.net>
> Date: Tue, 21 Jan 2020 20:45:43 +0530
> Subject: [PATCH] fast search

[...]

> --- a/gnu/packages.scm
> +++ b/gnu/packages.scm
> @@ -43,6 +43,7 @@
>    #:use-module (srfi srfi-34)
>    #:use-module (srfi srfi-35)
>    #:use-module (srfi srfi-39)
> +  #:use-module (sqlite3)
>    #:export (search-patch
>              search-patches
>              search-auxiliary-file
> @@ -204,10 +205,8 @@ PROC is called along these lines:
>  PROC can use #:allow-other-keys to ignore the bits it's not interested in.
>  When a package cache is available, this procedure does not actually load any
>  package module."
> -  (define cache
> -    (load-package-cache (current-profile)))
> -
> -  (if (and cache (cache-is-authoritative?))
> +  (if (and (cache-is-authoritative?)
> +           (current-profile))
>        (vhash-fold (lambda (name vector result)
>                      (match vector
>                        (#(name version module symbol outputs
> @@ -220,7 +219,7 @@ package module."
>                               #:supported? supported?
>                               #:deprecated? deprecated?))))
>                    init
> -                  cache)
> +                  (cache-lookup (current-profile)))
>        (fold-packages (lambda (package result)
>                         (proc (package-name package)
>                               (package-version package)
> @@ -252,31 +251,7 @@ is guaranteed to never traverse the same package twice."
>  
>  (define %package-cache-file
>    ;; Location of the package cache.
> -  "/lib/guix/package.cache")
> -
> -(define load-package-cache

[...]

> +(define* (cache-lookup profile #:optional name)
>    "Lookup package NAME in CACHE.  Return a list sorted in increasing version
>  order."
>    (define (package-version<? v1 v2)
>      (version>? (vector-ref v2 1) (vector-ref v1 1)))
>  
> -  (sort (vhash-fold* cons '() name cache)
> -        package-version<?))
> +  (define (int->boolean n)
> +    (case n
> +      ((0) #f)
> +      ((1) #t)))
> +
> +  (define (string->list str)
> +    (call-with-input-string str read))
> +
> +  (define select-statement
> +    (string-append
> +     "SELECT name, version, module, symbol, outputs, supported, superseded, locationFile, locationLine, locationColumn from packages"
> +     (if name " WHERE name = :name" "")))

I would rather keep the current package cache as-is instead of inserting
sqlite in here.  I don’t expect it to bring much compared
performance-wise to the current simple cache (especially if we look at
load time), and it does increase complexity quite a bit.

However, using sqlite for keyword search as you initially proposed on
guix-devel does sound like a great idea to me.

WDYT?

Thanks,
Ludo’.


* [bug#39258] Faster guix search using an sqlite cache
  2020-02-11 16:29             ` Ludovic Courtès
@ 2020-02-11 18:21               ` zimoun
  2020-02-11 18:39                 ` Ludovic Courtès
  0 siblings, 1 reply; 126+ messages in thread
From: zimoun @ 2020-02-11 18:21 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Arun Isaac, 39258

Hi Ludo,

On Tue, 11 Feb 2020 at 17:29, Ludovic Courtès <ludo@gnu.org> wrote:

> I would rather keep the current package cache as-is instead of inserting
> sqlite in here.  I don’t expect it to bring much compared
> performance-wise to the current simple cache (especially if we look at
> load time), and it does increase complexity quite a bit.

Complexity is about taste. ;-)
About performance, the idea was to first implement something with
sqlite and then see if it makes the difference. I mean I have
understood that.

> However, using sqlite for keyword search as you initially proposed on
> guix-devel does sound like a great idea to me.

If I understand correctly, you are proposing 2 caches, right?
Or are you proposing an inverted index (VHash/VList table) based on
trigrams to speed up the lookup?

Cheers,
simon


* [bug#39258] Faster guix search using an sqlite cache
  2020-02-11 18:21               ` zimoun
@ 2020-02-11 18:39                 ` Ludovic Courtès
  2020-02-11 19:07                   ` Arun Isaac
  2020-02-11 20:13                   ` zimoun
  0 siblings, 2 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-02-11 18:39 UTC (permalink / raw)
  To: zimoun; +Cc: Arun Isaac, 39258

zimoun <zimon.toutoune@gmail.com> skribis:

> On Tue, 11 Feb 2020 at 17:29, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> I would rather keep the current package cache as-is instead of inserting
>> sqlite in here.  I don’t expect it to bring much compared
>> performance-wise to the current simple cache (especially if we look at
>> load time), and it does increase complexity quite a bit.
>
> Complexity is about taste. ;-)

It’s measurable to some extent (lines of code, cyclomatic complexity,
etc.)

> About performance, the idea was to first implement something with
> sqlite and then see if it makes the difference. I mean I have
> understood that.

Yes.  But keep in mind that this package cache is used exclusively for
package lookups by name.  Namely, the goal is to speed package lookup in
operations like “guix install foo” (mapping “foo” to the right <package>
in the right module without walking through all the modules) and “guix
package -A” (which is what the shell completion hooks use).

Currently “guix package -A” runs in .5s on my laptop, and I suspect it’s
going to be hard to do better just by touching the cache.
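
To illustrate the kind of lookup described above, here is a minimal sketch of a
name-keyed table.  It is not the actual Guix cache (which is a compiled '.go'
file loaded into a vhash); the entries and the procedure names
build-name-table and lookup-by-name are hypothetical.

```scheme
(use-modules (srfi srfi-1))             ; first, second

;; Hypothetical cache entries: (name version module-name).
(define %entries
  '(("hello" "2.10" (gnu packages base))
    ("guile" "2.2.6" (gnu packages guile))
    ("guile" "3.0.0" (gnu packages guile))))

(define (build-name-table entries)
  "Return a hash table mapping each package name to the list of its entries."
  (let ((table (make-hash-table)))
    (for-each (lambda (entry)
                (let ((name (first entry)))
                  ;; Prepend ENTRY to whatever is already stored under NAME.
                  (hash-set! table name
                             (cons entry (hash-ref table name '())))))
              entries)
    table))

(define (lookup-by-name table name)
  "Return the entries recorded for NAME, or the empty list if there are none."
  (hash-ref table name '()))
```

With such a table, mapping "guile" to its candidate versions and modules is a
single hash lookup and does not require loading any package module.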

>> However, using sqlite for keyword search as you initially proposed on
>> guix-devel does sound like a great idea to me.
>
> If I understand correctly, you are proposing 2 caches, right?
> Or are you proposing an inverted index (VHash/VList table) based on
> trigrams to speed up the lookup?

Arun started the discussion on guix-devel with the idea of an inverted
index, and I thought this would become a second index (possibly
implemented using SQLite).  Perhaps I misunderstood the discussion all
along though, let me know!  :-)

Thanks,
Ludo’.


* [bug#39258] Faster guix search using an sqlite cache
  2020-02-11 18:39                 ` Ludovic Courtès
@ 2020-02-11 19:07                   ` Arun Isaac
  2020-02-11 20:20                     ` zimoun
  2020-02-15 14:50                     ` Arun Isaac
  2020-02-11 20:13                   ` zimoun
  1 sibling, 2 replies; 126+ messages in thread
From: Arun Isaac @ 2020-02-11 19:07 UTC (permalink / raw)
  To: Ludovic Courtès, zimoun; +Cc: 39258



>>> I would rather keep the current package cache as-is instead of inserting
>>> sqlite in here.  I don’t expect it to bring much compared
>>> performance-wise to the current simple cache (especially if we look at
>>> load time), and it does increase complexity quite a bit.
>>
>> Complexity is about taste. ;-)
>
> It’s measurable to some extent (lines of code, cyclomatic complexity,
> etc.)

I agree with Ludo here. I think it does increase the complexity, and
probably unnecessarily so.

> Arun started the discussion on guix-devel with the idea of an inverted
> index, and I thought this would become a second index (possibly
> implemented using SQLite).  Perhaps I misunderstood the discussion all
> along though, let me know!  :-)

No, you didn't misunderstand. That's where it began. But, while
implementing it, I thought I might as well replace the existing cache.

Also, I've started working on guile-xapian bindings. With that, it seems
simpler to keep the current package cache and add a xapian index only to
speed up package search.


* [bug#39258] Faster guix search using an sqlite cache
  2020-02-11 18:39                 ` Ludovic Courtès
  2020-02-11 19:07                   ` Arun Isaac
@ 2020-02-11 20:13                   ` zimoun
  1 sibling, 0 replies; 126+ messages in thread
From: zimoun @ 2020-02-11 20:13 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Arun Isaac, 39258

Hi Ludo,

On Tue, 11 Feb 2020 at 19:39, Ludovic Courtès <ludo@gnu.org> wrote:

> > About performance, the idea was to first implement something with
> > sqlite and then see if it makes the difference. I mean I have
> > understood that.
>
> Yes.  But keep in mind that this package cache is used exclusively for
> package lookups by name.  Namely, the goal is to speed package lookup in

I agree that there is some confusion here, and that this cache cannot really
be improved.


> >> However, using sqlite for keyword search as you initially proposed on
> >> guix-devel does sound like a great idea to me.
> >
> > If I understand correctly, you are proposing 2 caches, right?
> > Or are you proposing an inverted index (VHash/VList table) based on
> > trigrams to speed up the lookup?
>
> Arun started the discussion on guix-devel with the idea of an inverted
> index, and I thought this would become a second index (possibly
> implemented using SQLite).  Perhaps I misunderstood the discussion all
> along though, let me know!  :-)

Well, your suggestion is very welcome, and two caches are needed: one for
lookup by name, which already exists (and does a good job), and another one
to speed up "guix search", backed by SQLite or something else (based on an
inverted index or otherwise).
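
As a rough illustration of the inverted-index idea, the sketch below maps each
term to the names of the packages whose description contains it.  It is only
a toy under stated assumptions: the sample data and the names tokenize,
build-inverted-index and search are hypothetical, and it does no stemming,
ranking or trigram matching.

```scheme
(use-modules (srfi srfi-1))             ; reduce, lset-intersection

;; Hypothetical sample data: (name . description) pairs.
(define %packages
  '(("hello"  . "Hello, GNU world: an example GNU package")
    ("xapian" . "Search engine library")
    ("recoll" . "Desktop full-text search tool based on Xapian")))

(define (tokenize text)
  "Split TEXT into lowercase alphanumeric terms."
  (filter (negate string-null?)
          (string-split (string-downcase text)
                        (char-set-complement char-set:letter+digit))))

(define (build-inverted-index packages)
  "Return a hash table mapping each term to the list of package names whose
description contains that term."
  (let ((index (make-hash-table)))
    (for-each (lambda (package)
                (let ((name (car package))
                      (description (cdr package)))
                  (for-each (lambda (term)
                              (hash-set! index term
                                         (cons name (hash-ref index term '()))))
                            ;; Record each term at most once per package.
                            (delete-duplicates (tokenize description)))))
              packages)
    index))

(define (search index query)
  "Return the names of the packages that match every term of QUERY."
  (reduce (lambda (names result)
            (lset-intersection string=? names result))
          '()
          (map (lambda (term) (hash-ref index term '()))
               (tokenize query))))
```

A query then costs one hash lookup per term plus a list intersection, instead
of running a regexp over every package description.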


Thanks,
simon


* [bug#39258] Faster guix search using an sqlite cache
  2020-02-11 19:07                   ` Arun Isaac
@ 2020-02-11 20:20                     ` zimoun
  2020-02-15 14:50                     ` Arun Isaac
  1 sibling, 0 replies; 126+ messages in thread
From: zimoun @ 2020-02-11 20:20 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, 39258

Hi Arun,

On Tue, 11 Feb 2020 at 20:07, Arun Isaac <arunisaac@systemreboot.net> wrote:

> > Arun started the discussion on guix-devel with the idea of an inverted
> > index, and I thought this would become a second index (possibly
> > implemented using SQLite).  Perhaps I misunderstood the discussion all
> > along though, let me know!  :-)
>
> No, you didn't misunderstand. That's where it began. But, while
> implementing it, I thought I might as well replace the existing cache.

An inverted index, whether backed by Guile as you did on guix-devel or by
SQLite, seems like a good improvement for "guix search".

> Also, I've started working on guile-xapian bindings. With that, it seems
> simpler to keep the current package cache and add a xapian index only to
> speed up package search.

Xapian would be cool!
An SQLite-based index also seems like an easier way to index the packages
and their history locally. The Guix Data Service already does something like
this, but AFAIK using PostgreSQL and some magic. :-)

http://data.guix.gnu.org/repository/1/branch/master/package/git


All the best,
simon


* [bug#39258] Faster guix search using an sqlite cache
  2020-02-11 19:07                   ` Arun Isaac
  2020-02-11 20:20                     ` zimoun
@ 2020-02-15 14:50                     ` Arun Isaac
  1 sibling, 0 replies; 126+ messages in thread
From: Arun Isaac @ 2020-02-15 14:50 UTC (permalink / raw)
  To: Ludovic Courtès, zimoun; +Cc: Pierre Neidhardt, 39258



> Also, I've started working on guile-xapian bindings. With that, it seems
> simpler to keep the current package cache and add a xapian index only to
> speed up package search.

I have published the first version of guile-xapian! Feedback is
welcome. :-)

https://git.systemreboot.net/guile-xapian/about/

I will now move on to building a Xapian index for Guix's package search.


* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-01-23 19:51 [bug#39258] Faster guix search using an sqlite cache Arun Isaac
  2020-01-29 23:33 ` zimoun
@ 2020-02-27 20:41 ` Arun Isaac
  2020-02-27 20:41   ` [bug#39258] [PATCH 1/4] gnu: Add guile-xapian Arun Isaac
                     ` (6 more replies)
  2020-03-07 13:31 ` [bug#39258] [PATCH v2 0/3] " Arun Isaac
                   ` (6 subsequent siblings)
  8 siblings, 7 replies; 126+ messages in thread
From: Arun Isaac @ 2020-02-27 20:41 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, mail, ludo, zimon.toutoune

Hi,

I have finally got xapian working for package search. Some comments follow.

* Speed improvement

Despite search-package-index in gnu/packages.scm taking only around 1.5ms, I
see an overall speedup in `guix search` of only a factor of 2 -- from around
2s to around 1s. I wonder what else in `guix search` is taking up so much
time.

* Currently indexing only the package descriptions

In this patchset, I have only indexed the package descriptions. In the next
version of this patchset, I will index all other terms as specified in
%package-metrics of guix/ui.scm.
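
To make the idea of several weighted fields concrete, here is a minimal
sketch of weighted scoring.  It assumes a list that pairs fields with
weights; the fields, the weights and the names %field-weights, count-matches
and score are hypothetical and not taken from guix/ui.scm.

```scheme
(use-modules (srfi srfi-1))             ; fold

;; Hypothetical field/weight pairs; the real fields and weights may differ.
(define %field-weights
  '((name . 4) (synopsis . 3) (description . 2)))

(define (count-matches term text)
  "Return the number of non-overlapping, case-insensitive occurrences of TERM
in TEXT.  An empty TERM counts as zero matches."
  (if (string-null? term)
      0
      (let loop ((start 0) (count 0))
        (let ((index (string-contains-ci text term start)))
          (if index
              (loop (+ index (string-length term)) (+ count 1))
              count)))))

(define (score package term)
  "PACKAGE is an alist of (field . text) pairs.  Return the sum over all
fields of the field weight times the number of matches of TERM."
  (fold (lambda (field+weight total)
          (let ((text (or (assq-ref package (car field+weight)) "")))
            (+ total
               (* (cdr field+weight) (count-matches term text)))))
        0
        %field-weights))
```

In a Xapian index, a comparable effect might be obtained by indexing each
field with a different within-document frequency increment, though that is
left as an assumption here.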

* Should I add guile-xapian as a propagated input to guix in
  gnu/packages/package-management.scm?

* Drop regexp search support

In this patchset, I have retained the older regexp search support. But, I
think we should drop it and only have xapian search. In cases where the search
index is not authoritative, we can build an in-memory xapian search index on
the fly and use it to search. This will slow down the search, but will ensure
our search results are consistent and do not depend on the authoritativeness
of the search index.

* Commit messages

Except for patch 1, I am not sure what prefixes (build-self, gnu, etc.) to use
in the first line of the commit message. Some advice there would be helpful.

Regards,
Arun.

Arun Isaac (4):
  gnu: Add guile-xapian.
  build-self: Add guile-xapian to Guix dependencies.
  gnu: Generate xapian package search index.
  gnu: Use xapian index for package search.

 build-aux/build-self.scm   | 11 ++++++++
 gnu/packages.scm           | 44 ++++++++++++++++++++++++++++-
 gnu/packages/guile-xyz.scm | 50 ++++++++++++++++++++++++++++++++-
 guix/channels.scm          | 34 ++++++++++++++++++++++-
 guix/scripts/package.scm   | 57 ++++++++++++++++++++++----------------
 guix/self.scm              |  7 ++++-
 6 files changed, 175 insertions(+), 28 deletions(-)

-- 
2.23.0


* [bug#39258] [PATCH 1/4] gnu: Add guile-xapian.
  2020-02-27 20:41 ` [bug#39258] [PATCH 0/4] Xapian for Guix package search Arun Isaac
@ 2020-02-27 20:41   ` Arun Isaac
  2020-03-03 16:29     ` zimoun
  2020-02-27 20:41   ` [bug#39258] [PATCH 2/4] build-self: Add guile-xapian to Guix dependencies Arun Isaac
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-02-27 20:41 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, ludo, zimon.toutoune

* gnu/packages/guile-xyz.scm (guile-xapian, guile3.0-xapian): New variables.
---
 gnu/packages/guile-xyz.scm | 50 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/gnu/packages/guile-xyz.scm b/gnu/packages/guile-xyz.scm
index 37a5198e4e..75aba83593 100644
--- a/gnu/packages/guile-xyz.scm
+++ b/gnu/packages/guile-xyz.scm
@@ -17,7 +17,7 @@
 ;;; Copyright © 2017 ng0 <ng0@n0.is>
 ;;; Copyright © 2017, 2018 Tobias Geerinckx-Rice <me@tobias.gr>
 ;;; Copyright © 2018 Maxim Cournoyer <maxim.cournoyer@gmail.com>
-;;; Copyright © 2018, 2019 Arun Isaac <arunisaac@systemreboot.net>
+;;; Copyright © 2018, 2019, 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;; Copyright © 2018 Pierre-Antoine Rouby <pierre-antoine.rouby@inria.fr>
 ;;; Copyright © 2018 Eric Bavier <bavier@member.fsf.org>
 ;;; Copyright © 2019 swedebugia <swedebugia@riseup.net>
@@ -80,8 +80,10 @@
   #:use-module (gnu packages python)
   #:use-module (gnu packages readline)
   #:use-module (gnu packages sdl)
+  #:use-module (gnu packages search)
   #:use-module (gnu packages slang)
   #:use-module (gnu packages sqlite)
+  #:use-module (gnu packages swig)
   #:use-module (gnu packages tex)
   #:use-module (gnu packages texinfo)
   #:use-module (gnu packages tls)
@@ -3109,3 +3111,49 @@ currently a re-implementation of the lentes library for Clojure.  Lenses
 provide composable procedures, which can be used to focus, apply functions
 over, or update a value in arbitrary data structures.")
       (license license:gpl3+))))
+
+(define-public guile-xapian
+  (let ((commit "bfad1b0e2a88bfe1d4c100046da0d585b96d2a73")
+        (revision "1"))
+    (package
+      (name "guile-xapian")
+      (version (git-version "0.1.0" revision commit))
+      (home-page "https://git.systemreboot.net/guile-xapian")
+      (source
+       (origin
+         (method git-fetch)
+         (uri (git-reference (url home-page)
+                             (commit commit)))
+         (file-name (git-file-name name version))
+         (sha256
+          (base32
+           "1nrs23abb0lxx7gw14jw5k8jgbma0gi21gzahw0jgv6b25d9jdwp"))))
+      (build-system gnu-build-system)
+      (arguments
+       '(#:make-flags '("GUILE_AUTO_COMPILE=0"))) ; to prevent guild warnings
+      (inputs
+       `(("guile" ,guile-2.2)
+         ("xapian" ,xapian)
+         ("zlib" ,zlib)))
+      (native-inputs
+       `(("autoconf" ,autoconf)
+         ("autoconf-archive" ,autoconf-archive)
+         ("automake" ,automake)
+         ("libtool" ,libtool)
+         ("pkg-config" ,pkg-config)
+         ("swig" ,swig)))
+      (synopsis "Guile bindings for Xapian")
+      (description "@code{guile-xapian} provides Guile bindings for Xapian, a
+search engine library.  Xapian is a highly adaptable toolkit which allows
+developers to easily add advanced indexing and search facilities to their own
+applications.  It has built-in support for several families of weighting
+models and also supports a rich set of boolean query operators.")
+      (license license:gpl2+))))
+
+(define-public guile3.0-xapian
+  (package
+    (inherit guile-xapian)
+    (name "guile3.0-xapian")
+    (inputs
+     `(("guile" ,guile-next)
+       ,@(alist-delete "guile" (package-inputs guile-xapian))))))
-- 
2.23.0


* [bug#39258] [PATCH 2/4] build-self: Add guile-xapian to Guix dependencies.
  2020-02-27 20:41 ` [bug#39258] [PATCH 0/4] Xapian for Guix package search Arun Isaac
  2020-02-27 20:41   ` [bug#39258] [PATCH 1/4] gnu: Add guile-xapian Arun Isaac
@ 2020-02-27 20:41   ` Arun Isaac
  2020-02-27 20:41   ` [bug#39258] [PATCH 3/4] gnu: Generate xapian package search index Arun Isaac
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 126+ messages in thread
From: Arun Isaac @ 2020-02-27 20:41 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, ludo, zimon.toutoune

* build-aux/build-self.scm (build-program): Import fake guile-xapian module.
* guix/self.scm (compiled-guix): Add guile-xapian to Guix dependencies.
---
 build-aux/build-self.scm | 11 +++++++++++
 guix/self.scm            |  7 ++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/build-aux/build-self.scm b/build-aux/build-self.scm
index f2e785b7f1..05d0353ccf 100644
--- a/build-aux/build-self.scm
+++ b/build-aux/build-self.scm
@@ -1,5 +1,6 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2014, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -261,6 +262,10 @@ interface (FFI) of Guile.")
                  #~(define-module (gcrypt hash)
                      #:export (sha1 sha256))))
 
+  (define fake-xapian-hash
+    ;; Fake (xapian xapian) module; see below.
+    (scheme-file "xapian.scm" #~(define-module (xapian xapian))))
+
   (define fake-git
     (scheme-file "git.scm" #~(define-module (git))))
 
@@ -273,6 +278,12 @@ interface (FFI) of Guile.")
                            ;; adjust %LOAD-PATH later on.
                            ((gcrypt hash) => ,fake-gcrypt-hash)
 
+                           ;; To avoid relying on 'with-extensions', which was
+                           ;; introduced in 0.15.0, provide a fake (xapian
+                           ;; xapian) just so that we can build modules, and
+                           ;; adjust %LOAD-PATH later on.
+                           ((xapian xapian) => ,fake-xapian-hash)
+
                            ;; (guix git-download) depends on (git) but only
                            ;; for peripheral functionality.  Provide a dummy
                            ;; (git) to placate it.
diff --git a/guix/self.scm b/guix/self.scm
index 6b633f9bc0..a4f40574d1 100644
--- a/guix/self.scm
+++ b/guix/self.scm
@@ -1,5 +1,6 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -54,6 +55,7 @@
       ("guile-git"  (ref '(gnu packages guile) 'guile3.0-git))
       ("guile-sqlite3" (ref '(gnu packages guile) 'guile3.0-sqlite3))
       ("guile-gcrypt"  (ref '(gnu packages gnupg) 'guile3.0-gcrypt))
+      ("guile-xapian"  (ref '(gnu packages guile-xyz) 'guile3.0-xapian))
       ("gnutls"     (ref '(gnu packages tls) 'guile3.0-gnutls))
       ("zlib"       (ref '(gnu packages compression) 'zlib))
       ("lzlib"      (ref '(gnu packages compression) 'lzlib))
@@ -682,6 +684,9 @@ Info manual."
   (define guile-gcrypt
     (specification->package "guile-gcrypt"))
 
+  (define guile-xapian
+    (specification->package "guile-xapian"))
+
   (define gnutls
     (specification->package "gnutls"))
 
@@ -690,7 +695,7 @@ Info manual."
                          (cons (list "x" package)
                                (package-transitive-propagated-inputs package)))
                        (list guile-gcrypt gnutls guile-git guile-json
-                             guile-ssh guile-sqlite3))
+                             guile-ssh guile-sqlite3 guile-xapian))
       (((labels packages _ ...) ...)
        packages)))
 
-- 
2.23.0


* [bug#39258] [PATCH 3/4] gnu: Generate xapian package search index.
  2020-02-27 20:41 ` [bug#39258] [PATCH 0/4] Xapian for Guix package search Arun Isaac
  2020-02-27 20:41   ` [bug#39258] [PATCH 1/4] gnu: Add guile-xapian Arun Isaac
  2020-02-27 20:41   ` [bug#39258] [PATCH 2/4] build-self: Add guile-xapian to Guix dependencies Arun Isaac
@ 2020-02-27 20:41   ` Arun Isaac
  2020-02-28  8:04     ` Pierre Neidhardt
  2020-03-03 18:29     ` zimoun
  2020-02-27 20:41   ` [bug#39258] [PATCH 4/4] gnu: Use xapian index for package search Arun Isaac
                     ` (3 subsequent siblings)
  6 siblings, 2 replies; 126+ messages in thread
From: Arun Isaac @ 2020-02-27 20:41 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, ludo, zimon.toutoune

* gnu/packages.scm (%package-search-index): New variable.
(generate-package-search-index): New function.
* guix/channels.scm (package-search-index): New function.
(%channel-profile-hooks): Add package-search-index.
---
 gnu/packages.scm  | 29 ++++++++++++++++++++++++++++-
 guix/channels.scm | 34 +++++++++++++++++++++++++++++++++-
 2 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/gnu/packages.scm b/gnu/packages.scm
index d22c992bb1..e91753e2a8 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -4,6 +4,7 @@
 ;;; Copyright © 2014 Eric Bavier <bavier@member.fsf.org>
 ;;; Copyright © 2016, 2017 Alex Kost <alezost@gmail.com>
 ;;; Copyright © 2016 Mathieu Lirzin <mthl@gnu.org>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -43,6 +44,7 @@
   #:use-module (srfi srfi-34)
   #:use-module (srfi srfi-35)
   #:use-module (srfi srfi-39)
+  #:use-module (xapian xapian)
   #:export (search-patch
             search-patches
             search-auxiliary-file
@@ -64,7 +66,8 @@
             specification->location
             specifications->manifest
 
-            generate-package-cache))
+            generate-package-cache
+            generate-package-search-index))
 
 ;;; Commentary:
 ;;;
@@ -426,6 +429,30 @@ reducing the memory footprint."
                                #:opts '(#:to-file? #t)))))
   cache-file)
 
+(define %package-search-index
+  ;; Location of the package search-index
+  "/lib/guix/package-search.index")
+
+(define (generate-package-search-index directory)
+  "Generate under DIRECTORY a xapian index of all the available packages."
+  (define db-path
+    (string-append directory %package-search-index))
+
+  (mkdir-p (dirname db-path))
+  (call-with-writable-database db-path
+    (lambda (db)
+      (fold-packages (lambda (package _)
+                       (let* ((idterm (string-append "Q" (package-name package)))
+                              (doc (make-document #:data (package-name package)
+                                                  #:terms `((,idterm . 0))))
+                              (term-generator (make-term-generator #:stem (make-stem "en")
+                                                                   #:document doc)))
+                         (index-text! term-generator (package-description package))
+                         (replace-document! db idterm doc)))
+                     #f)))
+
+  db-path)
+
 \f
 (define %sigint-prompt
   ;; The prompt to jump to upon SIGINT.
diff --git a/guix/channels.scm b/guix/channels.scm
index f0261dc2da..c70c70938c 100644
--- a/guix/channels.scm
+++ b/guix/channels.scm
@@ -2,6 +2,7 @@
 ;;; Copyright © 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
 ;;; Copyright © 2018 Ricardo Wurmus <rekado@elephly.net>
 ;;; Copyright © 2019 Jan (janneke) Nieuwenhuizen <janneke@gnu.org>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -581,9 +582,40 @@ be used as a profile hook."
                                                  (hook . package-cache))
                                   #:local-build? #t)))
 
+(define (package-search-index manifest)
+  "Build a package search index for the instance in MANIFEST.  This is meant
+to be used as a profile hook."
+  (mlet %store-monad ((profile (profile-derivation manifest
+                                                   #:hooks '())))
+
+    (define build
+      #~(begin
+          (use-modules (gnu packages))
+
+          (if (defined? 'generate-package-search-index)
+              (begin
+                ;; Delegate package search index generation to the inferior.
+                (format (current-error-port)
+                        "Generating package search index for '~a'...~%"
+                        #$profile)
+                (generate-package-search-index #$output))
+              (mkdir #$output))))
+
+    (gexp->derivation-in-inferior "guix-package-search-index" build
+                                  profile
+
+                                  ;; If the Guix in PROFILE is too old and
+                                  ;; lacks 'guix repl', don't build the cache
+                                  ;; instead of failing.
+                                  #:silent-failure? #t
+
+                                  #:properties '((type . profile-hook)
+                                                 (hook . package-search-index))
+                                  #:local-build? #t)))
+
 (define %channel-profile-hooks
   ;; The default channel profile hooks.
-  (cons package-cache-file %default-profile-hooks))
+  (cons* package-cache-file package-search-index %default-profile-hooks))
 
 (define (channel-instances->derivation instances)
   "Return the derivation of the profile containing INSTANCES, a list of
-- 
2.23.0

* [bug#39258] [PATCH 4/4] gnu: Use xapian index for package search.
  2020-02-27 20:41 ` [bug#39258] [PATCH 0/4] Xapian for Guix package search Arun Isaac
                     ` (2 preceding siblings ...)
  2020-02-27 20:41   ` [bug#39258] [PATCH 3/4] gnu: Generate xapian package search index Arun Isaac
@ 2020-02-27 20:41   ` Arun Isaac
  2020-02-28  8:11     ` Pierre Neidhardt
  2020-03-03 19:21     ` zimoun
  2020-02-28  8:13   ` [bug#39258] [PATCH 0/4] Xapian for Guix " Pierre Neidhardt
                     ` (2 subsequent siblings)
  6 siblings, 2 replies; 126+ messages in thread
From: Arun Isaac @ 2020-02-27 20:41 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, ludo, zimon.toutoune

* gnu/packages.scm (search-package-index): New function.
* guix/scripts/package.scm (find-packages-by-description): Search using the
xapian package index if search patterns are literal strings. Else, search
using fold-packages.
---
 gnu/packages.scm         | 17 +++++++++++-
 guix/scripts/package.scm | 57 +++++++++++++++++++++++-----------------
 2 files changed, 49 insertions(+), 25 deletions(-)

diff --git a/gnu/packages.scm b/gnu/packages.scm
index e91753e2a8..5b5b29bf84 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -67,7 +67,8 @@
             specifications->manifest
 
             generate-package-cache
-            generate-package-search-index))
+            generate-package-search-index
+            search-package-index))
 
 ;;; Commentary:
 ;;;
@@ -453,6 +454,20 @@ reducing the memory footprint."
 
   db-path)
 
+(define (search-package-index profile querystring)
+  (let ((offset 0)
+        (pagesize 10))
+    (call-with-database (string-append profile %package-search-index)
+      (lambda (db)
+        (let ((query (parse-query querystring #:stemmer (make-stem "en"))))
+          (mset-fold (lambda (item result)
+                       (match (find-packages-by-name
+                               (document-data (mset-item-document item)))
+                         ((package _ ...)
+                          (append result `((,package . ,(mset-item-weight item)))))))
+                     '()
+                     (enquire-mset (enquire db query) offset pagesize)))))))
+
 \f
 (define %sigint-prompt
   ;; The prompt to jump to upon SIGINT.
diff --git a/guix/scripts/package.scm b/guix/scripts/package.scm
index 1cb0d382bf..6a3b9002dd 100644
--- a/guix/scripts/package.scm
+++ b/guix/scripts/package.scm
@@ -7,6 +7,7 @@
 ;;; Copyright © 2016 Benz Schenk <benz.schenk@uzh.ch>
 ;;; Copyright © 2016 Chris Marusich <cmmarusich@gmail.com>
 ;;; Copyright © 2019 Tobias Geerinckx-Rice <me@tobias.gr>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -178,31 +179,40 @@ hooks\" run when building the profile."
 ;;; Package specifications.
 ;;;
 
-(define (find-packages-by-description regexps)
+(define (find-packages-by-description patterns)
   "Return a list of pairs: packages whose name, synopsis, description,
 or output matches at least one of REGEXPS sorted by relevance, and its
 non-zero relevance score."
-  (let ((matches (fold-packages (lambda (package result)
-                                  (if (package-superseded package)
-                                      result
-                                      (match (package-relevance package
-                                                                regexps)
-                                        ((? zero?)
-                                         result)
-                                        (score
-                                         (cons (cons package score)
-                                               result)))))
-                                '())))
-    (sort matches
-          (lambda (m1 m2)
-            (match m1
-              ((package1 . score1)
-               (match m2
-                 ((package2 . score2)
-                  (if (= score1 score2)
-                      (string>? (package-full-name package1)
-                                (package-full-name package2))
-                      (> score1 score2))))))))))
+  (define (regexp? str)
+    (string-any
+     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
+     str))
+
+  (if (and (current-profile)
+           (not (any regexp? patterns)))
+      (search-package-index (current-profile) (string-join patterns " "))
+      (let* ((regexps (map (cut make-regexp* <> regexp/icase) patterns))
+             (matches (fold-packages (lambda (package result)
+                                       (if (package-superseded package)
+                                           result
+                                           (match (package-relevance package
+                                                                     regexps)
+                                             ((? zero?)
+                                              result)
+                                             (score
+                                              (cons (cons package score)
+                                                    result)))))
+                                     '())))
+        (sort matches
+              (lambda (m1 m2)
+                (match m1
+                  ((package1 . score1)
+                   (match m2
+                     ((package2 . score2)
+                      (if (= score1 score2)
+                          (string>? (package-full-name package1)
+                                    (package-full-name package2))
+                          (> score1 score2)))))))))))
 
 (define (transaction-upgrade-entry store entry transaction)
   "Return a variant of TRANSACTION that accounts for the upgrade of ENTRY, a
@@ -777,8 +787,7 @@ processed, #f otherwise."
                                       (('query 'search rx) rx)
                                       (_                   #f))
                                     opts))
-              (regexps  (map (cut make-regexp* <> regexp/icase) patterns))
-              (matches  (find-packages-by-description regexps)))
+              (matches  (find-packages-by-description patterns)))
          (leave-on-EPIPE
           (display-search-results matches (current-output-port)))
          #t))
-- 
2.23.0

* [bug#39258] [PATCH 3/4] gnu: Generate xapian package search index.
  2020-02-27 20:41   ` [bug#39258] [PATCH 3/4] gnu: Generate xapian package search index Arun Isaac
@ 2020-02-28  8:04     ` Pierre Neidhardt
  2020-03-05 20:26       ` Arun Isaac
  2020-03-03 18:29     ` zimoun
  1 sibling, 1 reply; 126+ messages in thread
From: Pierre Neidhardt @ 2020-02-28  8:04 UTC (permalink / raw)
  To: Arun Isaac; +Cc: ludo, 39258, zimon.toutoune

[-- Attachment #1: Type: text/plain, Size: 1237 bytes --]

Arun Isaac <arunisaac@systemreboot.net> writes:

> +(define (generate-package-search-index directory)
> +  "Generate under DIRECTORY a xapian index of all the available packages."
> +  (define db-path
> +    (string-append directory %package-search-index))
> +
> +  (mkdir-p (dirname db-path))
> +  (call-with-writable-database db-path
> +    (lambda (db)
> +      (fold-packages (lambda (package _)
> +                       (let* ((idterm (string-append "Q" (package-name package)))
> +                              (doc (make-document #:data (package-name package)
> +                                                  #:terms `((,idterm . 0))))
> +                              (term-generator (make-term-generator #:stem (make-stem "en")
> +                                                                   #:document doc)))
> +                         (index-text! term-generator (package-description package))
> +                         (replace-document! db idterm doc)))

I guess these non-functional functions (index-text!, replace-document!)
represent how Xapian works at the C++ level.  Would it be possible to
make more functional bindings nonetheless?

-- 
Pierre Neidhardt
https://ambrevar.xyz/


* [bug#39258] [PATCH 4/4] gnu: Use xapian index for package search.
  2020-02-27 20:41   ` [bug#39258] [PATCH 4/4] gnu: Use xapian index for package search Arun Isaac
@ 2020-02-28  8:11     ` Pierre Neidhardt
  2020-03-03 19:21     ` zimoun
  1 sibling, 0 replies; 126+ messages in thread
From: Pierre Neidhardt @ 2020-02-28  8:11 UTC (permalink / raw)
  To: Arun Isaac; +Cc: ludo, 39258, zimon.toutoune

[-- Attachment #1: Type: text/plain, Size: 2673 bytes --]

Arun Isaac <arunisaac@systemreboot.net> writes:

> @@ -453,6 +454,20 @@ reducing the memory footprint."
>  
>    db-path)
>  
> +(define (search-package-index profile querystring)

Maybe `query-string'?

>  \f
> --- a/guix/scripts/package.scm
> +++ b/guix/scripts/package.scm
> @@ -7,6 +7,7 @@
>  ;;; Copyright © 2016 Benz Schenk <benz.schenk@uzh.ch>
>  ;;; Copyright © 2016 Chris Marusich <cmmarusich@gmail.com>
>  ;;; Copyright © 2019 Tobias Geerinckx-Rice <me@tobias.gr>
> +;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
>  ;;;
>  ;;; This file is part of GNU Guix.
>  ;;;
> @@ -178,31 +179,40 @@ hooks\" run when building the profile."
>  ;;; Package specifications.
>  ;;;
>  
> -(define (find-packages-by-description regexps)
> +(define (find-packages-by-description patterns)
>    "Return a list of pairs: packages whose name, synopsis, description,
>  or output matches at least one of REGEXPS sorted by relevance, and its
>  non-zero relevance score."

Need to update the docstring.

> -  (let ((matches (fold-packages (lambda (package result)
> -                                  (if (package-superseded package)
> -                                      result
> -                                      (match (package-relevance package
> -                                                                regexps)
> -                                        ((? zero?)
> -                                         result)
> -                                        (score
> -                                         (cons (cons package score)
> -                                               result)))))
> -                                '())))
> -    (sort matches
> -          (lambda (m1 m2)
> -            (match m1
> -              ((package1 . score1)
> -               (match m2
> -                 ((package2 . score2)
> -                  (if (= score1 score2)
> -                      (string>? (package-full-name package1)
> -                                (package-full-name package2))
> -                      (> score1 score2))))))))))
> +  (define (regexp? str)
> +    (string-any
> +     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
> +     str))
> +
> +  (if (and (current-profile)
> +           (not (any regexp? patterns)))

I would not put characters like ".", "$", or "+" here, lest we mistake a
Xapian pattern for a regexp.

As you said, I don't think both are compatible without ambiguity
anyways, so we should probably drop regexp (or at least toggle them with
a command line argument).
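
To make the ambiguity concrete, here is a small Xapian-free sketch that
mirrors the `regexp?' helper quoted above.  `classify' and `%regexp-chars'
are hypothetical names used only for illustration; they are not part of
the patch.

```scheme
;; Sketch only: reproduces the character test from the patch to show
;; which queries would fall back to the regexp search path.
(use-modules (srfi srfi-1))

(define %regexp-chars
  (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$))

(define (regexp? str)
  ;; True if STR contains at least one character from %REGEXP-CHARS.
  (string-any %regexp-chars str))

(define (classify patterns)
  ;; Return 'xapian when no pattern looks like a regexp, else 'regexp.
  (if (any regexp? patterns) 'regexp 'xapian))

(classify '("emacs" "mail"))   ; 'xapian
(classify '("c++"))            ; 'regexp, although "c++" is a plausible literal query
(classify '("node.js"))        ; 'regexp, because of the "."
```

With this test, a query only reaches the Xapian index when it contains none
of the listed characters, so literal queries such as "c++" silently take
the slower path.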


-- 
Pierre Neidhardt
https://ambrevar.xyz/


* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-02-27 20:41 ` [bug#39258] [PATCH 0/4] Xapian for Guix package search Arun Isaac
                     ` (3 preceding siblings ...)
  2020-02-27 20:41   ` [bug#39258] [PATCH 4/4] gnu: Use xapian index for package search Arun Isaac
@ 2020-02-28  8:13   ` Pierre Neidhardt
  2020-02-28 12:39     ` zimoun
  2020-02-28 15:36     ` Arun Isaac
  2020-02-28 12:36   ` zimoun
  2020-03-05 16:46   ` Ludovic Courtès
  6 siblings, 2 replies; 126+ messages in thread
From: Pierre Neidhardt @ 2020-02-28  8:13 UTC (permalink / raw)
  To: Arun Isaac, 39258; +Cc: ludo, zimon.toutoune

[-- Attachment #1: Type: text/plain, Size: 1080 bytes --]

Fantastic, thank you so much for this great feature!

I can't build your patch though:

--8<---------------cut here---------------start------------->8---
[ 30%] LOAD     guix/scripts/package.scm
;;; note: source file ./guix/scripts/package.scm
;;;       newer than compiled /home/ambrevar/projects/guix/guix/scripts/package.go
;;; note: source file ./guix/scripts/package.scm
;;;       newer than compiled /home/ambrevar/projects/guix/guix/scripts/package.go
;;; note: source file ./gnu/packages.scm
;;;       newer than compiled /home/ambrevar/projects/guix/gnu/packages.go
;;; note: source file ./gnu/packages.scm
;;;       newer than compiled /home/ambrevar/projects/guix/gnu/packages.go
error: failed to load 'gnu/packages.scm':
ice-9/eval.scm:293:34: no code for module (xapian xapian)
--8<---------------cut here---------------end--------------->8---

Besides this issue, how do you test it?  I guess we first need to install
a bunch of packages with `pre-inst-env guix ...` and then do a `pre-inst-env guix search`?

-- 
Pierre Neidhardt
https://ambrevar.xyz/


* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-02-27 20:41 ` [bug#39258] [PATCH 0/4] Xapian for Guix package search Arun Isaac
                     ` (4 preceding siblings ...)
  2020-02-28  8:13   ` [bug#39258] [PATCH 0/4] Xapian for Guix " Pierre Neidhardt
@ 2020-02-28 12:36   ` zimoun
  2020-03-05 16:46   ` Ludovic Courtès
  6 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-02-28 12:36 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, Pierre Neidhardt, 39258

Hi Arun,

Really cool! Thank you!


On Thu, 27 Feb 2020 at 21:42, Arun Isaac <arunisaac@systemreboot.net> wrote:

> * Speed improvement
>
> Despite search-package-index in gnu/packages.scm taking only around 1.5ms, I
> see an overall speedup in `guix search` of only a factor of 2 -- from around
> 2s to around 1s. I wonder what else in `guix search` is taking up so much
> time.

Interesting... maybe a hidden 'fold-packages'?
Well, I have not yet looked into your code.


> * Currently indexing only the package descriptions
>
> In this patchset, I have only indexed the package descriptions. In the next
> version of this patchset, I will index all other terms as specified in
> %package-metrics of guix/ui.scm.

Yes, that looks to me like a detail that should be easy to fix. I mean, it
does not seem blocking.


> * Should I add guile-xapian as a propagated input to guix in
>   gnu/packages/package-management.scm?

IMHO, yes.
I mean, I guess. :-)


> * Drop regexp search support
>
> In this patchset, I have retained the older regexp search support. But, I
> think we should drop it and only have xapian search. In cases where the search
> index is not authoritative, we can build an in-memory xapian search index on
> the fly and use it to search. This will slow down the search, but will ensure
> our search results are consistent and do not depend on the authoritativeness
> of the search index.

I understand why you have turned off the regexp support. It is not
necessary for a first experiment whose aim is to see whether the addition
is worth it.
So, before investigating how better regexp support could work with
Xapian, let's start by benchmarking Xapian vs plain 'fold-packages'.


> * Commit messages
>
> Except for patch 1, I am not sure what prefixes (build-self, gnu, etc.) to use
> in the first line of the commit message. Some advice there would be helpful.

I cannot help. )-:


All the best,
simon

* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-02-28  8:13   ` [bug#39258] [PATCH 0/4] Xapian for Guix " Pierre Neidhardt
@ 2020-02-28 12:39     ` zimoun
  2020-02-28 12:49       ` Pierre Neidhardt
  2020-02-28 15:36     ` Arun Isaac
  1 sibling, 1 reply; 126+ messages in thread
From: zimoun @ 2020-02-28 12:39 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: Arun Isaac, Ludovic Courtès, 39258

Hi Pierre,

On Fri, 28 Feb 2020 at 09:13, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> Besides this issue, how do you test it?  I guess we first need to install
> a bunch of packages with `pre-inst-env guix ...` and then do a `pre-inst-env guix search`?

It does not search among the installed packages but among all packages.
So, to test it, you need to run "./pre-inst-env guix pull -p" or something
like that to populate the Xapian index database. Then "./pre-inst-env
guix search" will query that index.
I mean, that is how I understand it should work. I have not yet looked
into the code.


Cheers,
simon

* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-02-28 12:39     ` zimoun
@ 2020-02-28 12:49       ` Pierre Neidhardt
  0 siblings, 0 replies; 126+ messages in thread
From: Pierre Neidhardt @ 2020-02-28 12:49 UTC (permalink / raw)
  To: zimoun; +Cc: Arun Isaac, Ludovic Courtès, 39258

[-- Attachment #1: Type: text/plain, Size: 858 bytes --]

zimoun <zimon.toutoune@gmail.com> writes:

> Hi Pierre,
>
> On Fri, 28 Feb 2020 at 09:13, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>
>> Besides this issue, how do you test it?  I guess we first need to install
>> a bunch of packages with `pre-inst-env guix ...` and then do a `pre-inst-env guix search`?
>
> It does not search among the installed packages but among all packages.
> So, to test it, you need to run "./pre-inst-env guix pull -p" or something
> like that to populate the Xapian index database. Then "./pre-inst-env
> guix search" will query that index.
> I mean, that is how I understand it should work. I have not yet looked
> into the code.

What I meant by "install a bunch of packages" is "guix pull -p", as
you said.  The Xapian cache is populated by a hook of guix pull, if I got
it correctly.

-- 
Pierre Neidhardt
https://ambrevar.xyz/


* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-02-28  8:13   ` [bug#39258] [PATCH 0/4] Xapian for Guix " Pierre Neidhardt
  2020-02-28 12:39     ` zimoun
@ 2020-02-28 15:36     ` Arun Isaac
  2020-02-28 16:04       ` Arun Isaac
  2020-02-29  8:25       ` Arun Isaac
  1 sibling, 2 replies; 126+ messages in thread
From: Arun Isaac @ 2020-02-28 15:36 UTC (permalink / raw)
  To: Pierre Neidhardt, 39258; +Cc: ludo, zimon.toutoune

[-- Attachment #1: Type: text/plain, Size: 1264 bytes --]


> I can't build your patch though:
>
> ice-9/eval.scm:293:34: no code for module (xapian xapian)

Sorry, I forgot to mention this in my patch cover letter. The above
error is happening because of the new guile-xapian dependency. It's a
little tricky to get right at the moment. Here goes.

Drop into a guix development environment.

$ guix environment guix

Commit patch 1 (the patch that adds guile-xapian) alone, and build.

$ git am 0001-gnu-Add-guile-xapian.patch
$ make

Then, drop into an environment where guile-xapian is available.

$ ./pre-inst-env guix environment guix --ad-hoc guile-xapian

Apply the other 3 patches and build.

$ git am 0002-build-self-Add-guile-xapian-to-Guix-dependencies.patch 0003-gnu-Generate-xapian-package-search-index.patch 0004-gnu-Use-xapian-index-for-package-search.patch
$ make

Now, the build should have completed successfully. Let's do a test guix
pull to actually test the new guix search.

$ ./pre-inst-env guix pull -p /tmp/test

Then, run the guix search in /tmp/test.

$ /tmp/test/bin/guix search game

That's it! :-)

This whole process will be simpler if the guile-xapian package is pushed
to master and guile-xapian added as an input to the guix package in
gnu/packages/package-management.scm. But, for now...

* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-02-28 15:36     ` Arun Isaac
@ 2020-02-28 16:04       ` Arun Isaac
  2020-03-02 18:37         ` zimoun
  2020-02-29  8:25       ` Arun Isaac
  1 sibling, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-02-28 16:04 UTC (permalink / raw)
  To: Pierre Neidhardt, 39258; +Cc: ludo, zimon.toutoune

[-- Attachment #1: Type: text/plain, Size: 320 bytes --]


> $ ./pre-inst-env guix pull -p /tmp/test

One mistake. This command should be

./pre-inst-env guix pull --url=$PWD --branch=xapian -p /tmp/test

where xapian is the name of the branch you committed the patches to.

Also, I acknowledge the corrections you both suggested. I will
incorporate them in v2 of the patchset.

* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-02-28 15:36     ` Arun Isaac
  2020-02-28 16:04       ` Arun Isaac
@ 2020-02-29  8:25       ` Arun Isaac
  2020-03-02 18:27         ` zimoun
  1 sibling, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-02-29  8:25 UTC (permalink / raw)
  To: Pierre Neidhardt, 39258; +Cc: zimon.toutoune

[-- Attachment #1: Type: text/plain, Size: 259 bytes --]


> This whole process will be simpler if the guile-xapian package is pushed
> to master and guile-xapian added as an input to the guix package in
> gnu/packages/package-management.scm. But, for now...

Shall I push patch 1 (add guile-xapian) alone to master?

* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-02-29  8:25       ` Arun Isaac
@ 2020-03-02 18:27         ` zimoun
  0 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-02 18:27 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Pierre Neidhardt, 39258

Hi Arun,

On Sat, 29 Feb 2020 at 09:25, Arun Isaac <arunisaac@systemreboot.net> wrote:

> Shall I push patch 1 (add guile-xapian) alone to master?

Yes, it seems a good idea and it will ease the process for building
and then benchmarking the "guix search" via Xapian.


All the best,
simon

* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-02-28 16:04       ` Arun Isaac
@ 2020-03-02 18:37         ` zimoun
  2020-03-02 19:13           ` zimoun
  0 siblings, 1 reply; 126+ messages in thread
From: zimoun @ 2020-03-02 18:37 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, Pierre Neidhardt, 39258

Hi Arun,

Do you have some benchmark in mind?


On Fri, 28 Feb 2020 at 17:05, Arun Isaac <arunisaac@systemreboot.net> wrote:

> ./pre-inst-env guix pull --url=$PWD --branch=xapian -p /tmp/test

We need to benchmark the new "guix pull" on different machines. Well,
it is nothing compared to the derivation computations. :-)
And more importantly, run 'make as-derivation' to avoid breaking "guix pull".

Then, on cold caches, benchmark the new "guix search" for a couple of queries.

There is not much inspiration in tests/. :-)
Ah, do not forget to adapt some tests.


All the best,
simon

* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-03-02 18:37         ` zimoun
@ 2020-03-02 19:13           ` zimoun
  2020-03-03 20:04             ` zimoun
  0 siblings, 1 reply; 126+ messages in thread
From: zimoun @ 2020-03-02 19:13 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, Pierre Neidhardt, 39258

Hi,

After a quick benchmark:

 a. It is faster. Between x2 and x3. Really?
 b. The Xapian relevance should be truncated and examined in more detail.

--8<---------------cut here---------------start------------->8---
time guix search emacs | recsel -p name,relevance | head -n18
name: emacs
relevance: 33

name: emacs-with-editor
relevance: 19

name: emacs-restart-emacs
relevance: 19

name: emacs-epkg
relevance: 18

name: guile-emacs
relevance: 17

name: emacs-xwidgets
relevance: 17


real    0m1.530s
user    0m1.827s
sys     0m0.074s
--8<---------------cut here---------------end--------------->8---


--8<---------------cut here---------------start------------->8---
time /tmp/test/bin/guix search emacs | recsel -p name,relevance | head -n18
name: emacs-helm-pass
relevance: 5.0774748262821685

name: emacs-spark
relevance: 4.898640632723127

name: emacs-evil-smartparens
relevance: 4.898640632723127

name: emacs-howm
relevance: 4.8638448958830685

name: emacs-el-mock
relevance: 4.8638448958830685

name: emacs-strace-mode
relevance: 4.693676055650271


real    0m0.440s
user    0m0.482s
sys     0m0.058s
--8<---------------cut here---------------end--------------->8---


Here, for example, Xapian does not return the package 'emacs' itself
first. Worse, it does not return it at all.
That said, I do not know if it is really faster, since:

--8<---------------cut here---------------start------------->8---
guix search emacs | recsel -C -P name | wc -l
829
--8<---------------cut here---------------end--------------->8---

and

--8<---------------cut here---------------start------------->8---
/tmp/test/bin/guix search emacs | recsel -C -P name | wc -l
10
--8<---------------cut here---------------end--------------->8---

Maybe I am making a mistake.
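
The count of 10 is consistent with patch 4/4: `search-package-index' binds
`offset' to 0 and `pagesize' to 10 and passes both to `enquire-mset', so
only the first page of matches comes back.  A rough, Xapian-free sketch of
that windowing follows; `window' and the fake hit list are made up for
illustration and are not part of the patch.

```scheme
;; Sketch only: emulates an offset/pagesize window over a plain list.
(use-modules (srfi srfi-1))

(define (window results offset pagesize)
  ;; Return at most PAGESIZE elements of RESULTS, skipping the first OFFSET.
  (let* ((tail (if (> offset (length results))
                   '()
                   (drop results offset)))
         (n (min pagesize (length tail))))
    (take tail n)))

;; 829 fake hits, the number the regexp search reports above.
(define hits (iota 829))

(length (window hits 0 10))    ; 10
(length (window hits 820 10))  ; 9
```

So the timings above compare a search that returns 829 records with one
that returns 10, which makes the speedup hard to interpret.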


Well, thank you Arun for the Xapian bindings which will improve the
searching experience. :-)
And now it needs some polishing.


All the best,
simon

* [bug#39258] [PATCH 1/4] gnu: Add guile-xapian.
  2020-02-27 20:41   ` [bug#39258] [PATCH 1/4] gnu: Add guile-xapian Arun Isaac
@ 2020-03-03 16:29     ` zimoun
  0 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-03 16:29 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, 39258

Hi Arun,

On Thu, 27 Feb 2020 at 21:42, Arun Isaac <arunisaac@systemreboot.net> wrote:

> * gnu/packages/guile-xyz.scm (guile-xapian, guile3.0-xapian): New variables.

I am a bit lost with the Guile update. Shouldn't the convention now be
the opposite: guile-xapian using 3.0 and guile2.2-xapian using 2.2
(or simply 2.2, since 2.0 does not really seem to be used)?

Otherwise, feel free to push it. :-)
(It will make it easier to reach a large audience of testers for "guix search" ;-))

All the best,
simon

* [bug#39258] [PATCH 3/4] gnu: Generate xapian package search index.
  2020-02-27 20:41   ` [bug#39258] [PATCH 3/4] gnu: Generate xapian package search index Arun Isaac
  2020-02-28  8:04     ` Pierre Neidhardt
@ 2020-03-03 18:29     ` zimoun
  1 sibling, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-03 18:29 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, 39258

Hi Arun,

In the commit message, I would capitalize Xapian.


On Thu, 27 Feb 2020 at 21:42, Arun Isaac <arunisaac@systemreboot.net> wrote:
>
> * gnu/packages.scm (%package-search-index): New variable.
> (generate-package-search-index): New function.
> * guix/channels.scm (package-search-index): New function.
> (%channel-profile-hooks): Add package-search-index.
> ---
>  gnu/packages.scm  | 29 ++++++++++++++++++++++++++++-
>  guix/channels.scm | 34 +++++++++++++++++++++++++++++++++-
>  2 files changed, 61 insertions(+), 2 deletions(-)
>
> diff --git a/gnu/packages.scm b/gnu/packages.scm
> index d22c992bb1..e91753e2a8 100644
> --- a/gnu/packages.scm
> +++ b/gnu/packages.scm
> @@ -4,6 +4,7 @@
>  ;;; Copyright © 2014 Eric Bavier <bavier@member.fsf.org>
>  ;;; Copyright © 2016, 2017 Alex Kost <alezost@gmail.com>
>  ;;; Copyright © 2016 Mathieu Lirzin <mthl@gnu.org>
> +;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
>  ;;;
>  ;;; This file is part of GNU Guix.
>  ;;;
> @@ -43,6 +44,7 @@
>    #:use-module (srfi srfi-34)
>    #:use-module (srfi srfi-35)
>    #:use-module (srfi srfi-39)
> +  #:use-module (xapian xapian)
>    #:export (search-patch
>              search-patches
>              search-auxiliary-file
> @@ -64,7 +66,8 @@
>              specification->location
>              specifications->manifest
>
> -            generate-package-cache))
> +            generate-package-cache
> +            generate-package-search-index))
>
>  ;;; Commentary:
>  ;;;
> @@ -426,6 +429,30 @@ reducing the memory footprint."
>                                 #:opts '(#:to-file? #t)))))
>    cache-file)
>
> +(define %package-search-index
> +  ;; Location of the package search-index
> +  "/lib/guix/package-search.index")
> +
> +(define (generate-package-search-index directory)
> +  "Generate under DIRECTORY a xapian index of all the available packages."

Xapian with capital.


> +  (define db-path
> +    (string-append directory %package-search-index))
> +
> +  (mkdir-p (dirname db-path))
> +  (call-with-writable-database db-path
> +    (lambda (db)
> +      (fold-packages (lambda (package _)
> +                       (let* ((idterm (string-append "Q" (package-name package)))
> +                              (doc (make-document #:data (package-name package)
> +                                                  #:terms `((,idterm . 0))))
> +                              (term-generator (make-term-generator #:stem (make-stem "en")
> +                                                                   #:document doc)))
> +                         (index-text! term-generator (package-description package))

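Instead, this:

(index-text! term-generator
             (string-append (package-synopsis package) " "
                            (package-description package)))

should index both 'synopsis' and 'description'. The " " separator keeps
the last word of the synopsis from being fused with the first word of
the description.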


Is (make-stem "en") for the locale?


> +                         (replace-document! db idterm doc)))
> +                     #f)))
> +
> +  db-path)
> +
>
>  (define %sigint-prompt
>    ;; The prompt to jump to upon SIGINT.
> diff --git a/guix/channels.scm b/guix/channels.scm
> index f0261dc2da..c70c70938c 100644
> --- a/guix/channels.scm
> +++ b/guix/channels.scm
> @@ -2,6 +2,7 @@
>  ;;; Copyright © 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
>  ;;; Copyright © 2018 Ricardo Wurmus <rekado@elephly.net>
>  ;;; Copyright © 2019 Jan (janneke) Nieuwenhuizen <janneke@gnu.org>
> +;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
>  ;;;
>  ;;; This file is part of GNU Guix.
>  ;;;
> @@ -581,9 +582,40 @@ be used as a profile hook."
>                                                   (hook . package-cache))
>                                    #:local-build? #t)))
>
> +(define (package-search-index manifest)
> +  "Build a package search index for the instance in MANIFEST.  This is meant
> +to be used as a profile hook."
> +  (mlet %store-monad ((profile (profile-derivation manifest
> +                                                   #:hooks '())))
> +
> +    (define build
> +      #~(begin
> +          (use-modules (gnu packages))
> +
> +          (if (defined? 'generate-package-search-index)
> +              (begin
> +                ;; Delegate package search index generation to the inferior.
> +                (format (current-error-port)
> +                        "Generating package search index for '~a'...~%"
> +                        #$profile)
> +                (generate-package-search-index #$output))
> +              (mkdir #$output))))
> +
> +    (gexp->derivation-in-inferior "guix-package-search-index" build
> +                                  profile
> +
> +                                  ;; If the Guix in PROFILE is too old and
> +                                  ;; lacks 'guix repl', don't build the cache
> +                                  ;; instead of failing.
> +                                  #:silent-failure? #t
> +
> +                                  #:properties '((type . profile-hook)
> +                                                 (hook . package-search-index))
> +                                  #:local-build? #t)))
> +

package-search-index and package-cache-file could be refactored
because they share almost all of their code.


>  (define %channel-profile-hooks
>    ;; The default channel profile hooks.
> -  (cons package-cache-file %default-profile-hooks))
> +  (cons* package-cache-file package-search-index %default-profile-hooks))
>
>  (define (channel-instances->derivation instances)
>    "Return the derivation of the profile containing INSTANCES, a list of
> --
> 2.23.0
>


* [bug#39258] [PATCH 4/4] gnu: Use xapian index for package search.
  2020-02-27 20:41   ` [bug#39258] [PATCH 4/4] gnu: Use xapian index for package search Arun Isaac
  2020-02-28  8:11     ` Pierre Neidhardt
@ 2020-03-03 19:21     ` zimoun
  2020-03-03 19:51       ` zimoun
  1 sibling, 1 reply; 126+ messages in thread
From: zimoun @ 2020-03-03 19:21 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, 39258

Hi Arun,


On Thu, 27 Feb 2020 at 21:42, Arun Isaac <arunisaac@systemreboot.net> wrote:
>
> * gnu/packages.scm (search-package-index): New function.
> * guix/scripts/package.scm (find-packages-by-description): Search using the
> xapian package index if search patterns are literal strings. Else, search
> using fold-packages.
> ---
>  gnu/packages.scm         | 17 +++++++++++-
>  guix/scripts/package.scm | 57 +++++++++++++++++++++++-----------------
>  2 files changed, 49 insertions(+), 25 deletions(-)
>
> diff --git a/gnu/packages.scm b/gnu/packages.scm
> index e91753e2a8..5b5b29bf84 100644
> --- a/gnu/packages.scm
> +++ b/gnu/packages.scm
> @@ -67,7 +67,8 @@
>              specifications->manifest
>
>              generate-package-cache
> -            generate-package-search-index))
> +            generate-package-search-index
> +            search-package-index))
>
>  ;;; Commentary:
>  ;;;
> @@ -453,6 +454,20 @@ reducing the memory footprint."
>
>    db-path)
>
> +(define (search-package-index profile querystring)
> +  (let ((offset 0)
> +        (pagesize 10))

Why this value of 10?
This fixes the number of packages returned. Hum?
I have tried to replace by 100 and I got 100 packages. :-)


> +    (call-with-database (string-append profile %package-search-index)
> +      (lambda (db)
> +        (let ((query (parse-query querystring #:stemmer (make-stem "en"))))
> +          (mset-fold (lambda (item result)

I do not know what is the convention for the bindings.
But there is 'fold-packages' so I would be inclined to 'fold-msets' or
something in this flavour.


> +                       (match (find-packages-by-name
> +                               (document-data (mset-item-document item)))
> +                         ((package _ ...)
> +                          (append result `((,package . ,(mset-item-weight item)))))))
> +                     '()
> +                     (enquire-mset (enquire db query) offset pagesize)))))))
> +
>
>  (define %sigint-prompt
>    ;; The prompt to jump to upon SIGINT.
> diff --git a/guix/scripts/package.scm b/guix/scripts/package.scm
> index 1cb0d382bf..6a3b9002dd 100644
> --- a/guix/scripts/package.scm
> +++ b/guix/scripts/package.scm
> @@ -7,6 +7,7 @@
>  ;;; Copyright © 2016 Benz Schenk <benz.schenk@uzh.ch>
>  ;;; Copyright © 2016 Chris Marusich <cmmarusich@gmail.com>
>  ;;; Copyright © 2019 Tobias Geerinckx-Rice <me@tobias.gr>
> +;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
>  ;;;
>  ;;; This file is part of GNU Guix.
>  ;;;
> @@ -178,31 +179,40 @@ hooks\" run when building the profile."
>  ;;; Package specifications.
>  ;;;
>
> -(define (find-packages-by-description regexps)
> +(define (find-packages-by-description patterns)
>    "Return a list of pairs: packages whose name, synopsis, description,
>  or output matches at least one of REGEXPS sorted by relevance, and its
>  non-zero relevance score."
> -  (let ((matches (fold-packages (lambda (package result)
> -                                  (if (package-superseded package)
> -                                      result
> -                                      (match (package-relevance package
> -                                                                regexps)
> -                                        ((? zero?)
> -                                         result)
> -                                        (score
> -                                         (cons (cons package score)
> -                                               result)))))
> -                                '())))
> -    (sort matches
> -          (lambda (m1 m2)
> -            (match m1
> -              ((package1 . score1)
> -               (match m2
> -                 ((package2 . score2)
> -                  (if (= score1 score2)
> -                      (string>? (package-full-name package1)
> -                                (package-full-name package2))
> -                      (> score1 score2))))))))))
> +  (define (regexp? str)
> +    (string-any
> +     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
> +     str))

Instead of reverting this, I would leave the current
'find-packages-by-description' as it is and add
'find-packages-by-description-indexed' doing just
'(search-package-index (current-profile) (string-join patterns " "))'.
And maybe refactor the sorting of scores. Then I would put the test
branch in 'guix/scripts/package.scm'...


> +  (if (and (current-profile)
> +           (not (any regexp? patterns)))
> +      (search-package-index (current-profile) (string-join patterns " "))
> +      (let* ((regexps (map (cut make-regexp* <> regexp/icase) patterns))
> +             (matches (fold-packages (lambda (package result)
> +                                       (if (package-superseded package)
> +                                           result
> +                                           (match (package-relevance package

Note that I am in the process of implementing the BM25 weights as
'package-relevance'; at least really thinking about it! :-)
I have already talked about TF-IDF as relevance, for example here [1].
And reading the Xapian documentation [2], it seems affordable. Or not
;-) because of the regexp... Need some thoughts... I mean "in the
process". ;-)
And in this case, it is almost a drop-in replacement of
'fold-packages' by 'mset-fold'; well it should add some flexibility
and a more unified code.

(Aside from searching, IMHO 'package-relevance' should also help when
linting badly written descriptions, but that is another story. ;-)

[1] https://lists.gnu.org/archive/html/guix-devel/2019-07/msg00252.html
[2] https://xapian.org/docs/bm25.html
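
For reference, here is a minimal sketch of the textbook BM25 weight of
one query term, in plain Guile. The procedure name, its parameters and
the defaults k1 = 1.2 and b = 0.75 are illustrative assumptions; this is
neither Xapian's implementation nor existing Guix code.

```scheme
;; Textbook BM25 weight of one query term in one document (a sketch).
;; TF: occurrences of the term in the document; DF: number of documents
;; containing the term; N: total number of documents; LEN and AVG-LEN:
;; length of this document and average document length.
(define* (bm25-term-weight tf df n len avg-len #:key (k1 1.2) (b 0.75))
  (let ((idf  (log (+ 1 (/ (+ (- n df) 0.5) (+ df 0.5)))))
        (norm (+ (- 1 b) (* b (/ len avg-len)))))
    (* idf (/ (* tf (+ k1 1))
              (+ tf (* k1 norm))))))

;; A term absent from the document contributes nothing, and a rarer
;; term (smaller DF) weighs more than a common one.
(display (zero? (bm25-term-weight 0 10 1000 50 50)))
(newline)
(display (> (bm25-term-weight 2 10 1000 50 50)
            (bm25-term-weight 2 500 1000 50 50)))
(newline)
```

The score of a package for a whole query would then be the sum of these
weights over the query terms.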


> +                                                                     regexps)
> +                                             ((? zero?)
> +                                              result)
> +                                             (score
> +                                              (cons (cons package score)
> +                                                    result)))))
> +                                     '())))
> +        (sort matches
> +              (lambda (m1 m2)
> +                (match m1
> +                  ((package1 . score1)
> +                   (match m2
> +                     ((package2 . score2)
> +                      (if (= score1 score2)
> +                          (string>? (package-full-name package1)
> +                                    (package-full-name package2))
> +                          (> score1 score2)))))))))))
>
>  (define (transaction-upgrade-entry store entry transaction)
>    "Return a variant of TRANSACTION that accounts for the upgrade of ENTRY, a
> @@ -777,8 +787,7 @@ processed, #f otherwise."

...here.

+  (define (regexp? str)
+    (string-any
+     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
+     str))

>                                        (('query 'search rx) rx)
>                                        (_                   #f))
>                                      opts))
>
> -              (regexps  (map (cut make-regexp* <> regexp/icase) patterns))
> -              (matches  (find-packages-by-description regexps)))

+              (regexps  (map (cut make-regexp* <> regexp/icase) patterns))
+              (matches  (if (any regexp? patterns)
+                            (find-packages-by-description regexps)
+                            (find-packages-by-description-indexed patterns))))

I mean something like that.
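
The dispatch described above can be tried in isolation with the sketch
below. 'regexp?' is copied from the patch; 'search-with-regexps',
'search-with-index' and 'search' are hypothetical stand-ins for the two
back ends and are not part of Guix.

```scheme
(use-modules (srfi srfi-1))   ; for 'any'

(define (regexp? str)
  ;; True if STR contains a character that is special in a regexp.
  (string-any
   (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
   str))

;; Hypothetical stand-ins for the two search back ends; they only
;; report which one was chosen and with what argument.
(define (search-with-regexps patterns)
  (list 'regexp patterns))
(define (search-with-index patterns)
  (list 'index (string-join patterns " ")))

(define (search patterns)
  ;; Fall back to the regexp search as soon as one pattern looks like a
  ;; regexp; otherwise query the index with the joined patterns.
  (if (any regexp? patterns)
      (search-with-regexps patterns)
      (search-with-index patterns)))

(display (search '("vector" "editor")))
(newline)
(display (search '("^emacs-")))
(newline)
```

With this split, a plain word containing "." or "+" is still sent to the
regexp search, which is the ambiguity between the two syntaxes.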

>           (leave-on-EPIPE
>            (display-search-results matches (current-output-port)))
>           #t))
> --
> 2.23.0


All the best,
simon


* [bug#39258] [PATCH 4/4] gnu: Use xapian index for package search.
  2020-03-03 19:21     ` zimoun
@ 2020-03-03 19:51       ` zimoun
  0 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-03 19:51 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, 39258

On Tue, 3 Mar 2020 at 20:21, zimoun <zimon.toutoune@gmail.com> wrote:
> On Thu, 27 Feb 2020 at 21:42, Arun Isaac <arunisaac@systemreboot.net> wrote:

> > +(define (search-package-index profile querystring)
> > +  (let ((offset 0)
> > +        (pagesize 10))
>
> Why this value of 10?
> This fixes the number of packages returned. Hum?
> I have tried to replace by 100 and I got 100 packages. :-)

I propose the value of 4294967295 for pagesize.


* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-03-02 19:13           ` zimoun
@ 2020-03-03 20:04             ` zimoun
  0 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-03 20:04 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, Pierre Neidhardt, 39258

Hi,

On Mon, 2 Mar 2020 at 20:13, zimoun <zimon.toutoune@gmail.com> wrote:

> --8<---------------cut here---------------start------------->8---
> /tmp/test/bin/guix search emacs | recsel -C -P name | wc -l
> 10
> --8<---------------cut here---------------end--------------->8---
>
> Maybe I am making a mistake.

I think this issue is fixed when changing the 'pagesize' value.

Well, with '(pagesize 4294967295)' and using the same commit
(c1febbbf94), I get:

--8<---------------cut here---------------start------------->8---
guix time-machine --commit=c1febbbf94 -- guix search games | recsel -C -p name | wc -l
247

./pre-inst-env guix search games | recsel -C -p name | wc -l
236
--8<---------------cut here---------------end--------------->8---

(I modified the patches in order to pull once to generate the index at
commit c1febbbf94 and then do some stuff.)


Note that the old "guix search" does not output blender, whereas Xapian
does: the term 'games' is not in the description, but 'game' is.
Well, I am comparing the two lists, i.e., "guix search games |
recsel -C -P name | sort", to see which packages are in one list and not
in the other.
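
Such a comparison can also be sketched in Guile with SRFI-1. The two
name lists below are made up for illustration (only the blender case
comes from the observation above), and 'only-in' is a hypothetical
helper.

```scheme
(use-modules (srfi srfi-1))   ; for 'lset-difference'

;; Made-up name lists, standing in for the output of
;; 'recsel -C -P name' on the old and on the Xapian-based search.
(define old-results '("0ad" "gnubg" "supertux"))
(define new-results '("0ad" "blender" "supertux"))

(define (only-in a b)
  ;; Names present in list A but missing from list B.
  (lset-difference string=? a b))

;; Names returned only by the new search, then only by the old one.
(display (only-in new-results old-results))
(newline)
(display (only-in old-results new-results))
(newline)
```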

But before going further, let's polish the patches a bit so that they
are easier to test without the double environment, etc.
Also, I am using a good old HDD, so a comparison on an SSD would be welcome.


All the best,
simon


* [bug#39258] [PATCH 0/4] Xapian for Guix package search
  2020-02-27 20:41 ` [bug#39258] [PATCH 0/4] Xapian for Guix package search Arun Isaac
                     ` (5 preceding siblings ...)
  2020-02-28 12:36   ` zimoun
@ 2020-03-05 16:46   ` Ludovic Courtès
  6 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-03-05 16:46 UTC (permalink / raw)
  To: Arun Isaac; +Cc: mail, 39258, zimon.toutoune

Hello Arun,

Arun Isaac <arunisaac@systemreboot.net> skribis:

> * Speed improvement
>
> Despite search-package-index in gnu/packages.scm taking only around 1.5ms, I
> see an overall speedup in `guix search` of only a factor of 2 -- from around
> 2s to around 1s. I wonder what else in `guix search` is taking up so much
> time.

Note that ‘guix search’ time is largely dominated by I/O.  On my laptop,
I get (first measurement is cold cache, second one is warm cache):

--8<---------------cut here---------------start------------->8---
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time guix search foo >/dev/null

real    0m2.631s
user    0m1.134s
sys     0m0.124s
$ time guix search foo >/dev/null

real    0m0.836s
user    0m1.027s
sys     0m0.053s
--8<---------------cut here---------------end--------------->8---

It’s hard to do better on the warm cache case because at this level,
there may be other things to optimize having little to do with searching
itself.

Note that this is on an SSD; the cold-cache case must be worse on NFS or
on a spinning disk, and there we could gain a lot.

I think we should weigh the pros and cons on all these aspects: speed,
complexity and maintenance cost, search result quality, search features,
etc.

Thanks,
Ludo’.

PS: I have not yet looked at the whole series as I’m just coming back to
    the keyboard.  :-)


* [bug#39258] [PATCH 3/4] gnu: Generate xapian package search index.
  2020-02-28  8:04     ` Pierre Neidhardt
@ 2020-03-05 20:26       ` Arun Isaac
  0 siblings, 0 replies; 126+ messages in thread
From: Arun Isaac @ 2020-03-05 20:26 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: ludo, 39258, zimon.toutoune



>> +      (fold-packages (lambda (package _)
>> +                       (let* ((idterm (string-append "Q" (package-name package)))
>> +                              (doc (make-document #:data (package-name package)
>> +                                                  #:terms `((,idterm . 0))))
>> +                              (term-generator (make-term-generator #:stem (make-stem "en")
>> +                                                                   #:document doc)))
>> +                         (index-text! term-generator (package-description package))
>> +                         (replace-document! db idterm doc)))
>
> I guess these non-functional functions (index-text!, replace-document!)
> represent how Xapian works at the C++ level.  Would it be possible to
> make more functional bindings nonetheless?

I somehow overlooked this particular email and am reading it just
now. Yes, the non-functional bindings are a bit ugly. But, I'm not able
to think of a clean way to make functional bindings without supporting
all features offered by xapian. Any suggestions you have in this regard
would be useful. Look through xapian/termgenerator.h for more
details. In particular, look at functions increase_termpos,
index_text_without_positions.



* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-01-23 19:51 [bug#39258] Faster guix search using an sqlite cache Arun Isaac
  2020-01-29 23:33 ` zimoun
  2020-02-27 20:41 ` [bug#39258] [PATCH 0/4] Xapian for Guix package search Arun Isaac
@ 2020-03-07 13:31 ` Arun Isaac
  2020-03-07 13:31   ` [bug#39258] [PATCH v2 1/3] build-self: Add guile-xapian to Guix dependencies Arun Isaac
                     ` (5 more replies)
  2020-03-27 16:26 ` [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search Arun Isaac
                   ` (5 subsequent siblings)
  8 siblings, 6 replies; 126+ messages in thread
From: Arun Isaac @ 2020-03-07 13:31 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, mail, ludo, zimon.toutoune

Hi,

Here is the second iteration of my Xapian Guix package search patchset. I have
found the reason the earlier patchset did not show significant speedup. It
turns out that most of the time is spent in printing and texinfo rendering of
the search results. So, in this patchset, I pre-render the search results
while building the Xapian index and stuff them into the Xapian database
itself. Therefore, during `guix search`, I just pull out the pre-rendered
search results and print it on the screen. This is much faster. See comparison
below.

--8<---------------cut here---------------start------------->8---
With a warm cache,
$ time guix search inkscape

real	0m1.787s
user	0m1.745s
sys	0m0.111s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
$ time /tmp/test/bin/guix search inkscape

real	0m0.199s
user	0m0.182s
sys	0m0.024s
--8<---------------cut here---------------end--------------->8---

If most of the speedup comes from pre-rendering the results, it might seem
that the Xapian search is not so useful. We might as well have stuffed the
pre-rendered search results into the existing package cache generated by
generate-package-cache, or so it might seem. But, there are the following
arguments in favor of Xapian.

- The package cache would grow in size, and lookup would be slowed down
  because we need to load the entire cache into memory. Xapian, on the other
  hand, need only look up the specific packages that match the search query.
- Xapian can provide superior search results due to its stemming and language
  models.
- Xapian can provide spelling correction and query expansion -- that is,
  suggest search terms to improve search results. Note that I haven't
  implemented this yet, and it is out of scope for this patchset.

* Simplify our package search results

Why not use a simpler package search results format like Arch Linux or Debian
does? We could just display the package name, version and synopsis like so.

inkscape 0.92.4
    Vector graphics editor
inklingreader 0.8
    Wacom Inkling sketch format conversion and manipulation

Why do we need the entire recutils format? If the user is interested, they can
always use `guix package --show` to get the full recutils formatted
info. Having shorter search results will make everything even faster and much
more readable. WDYT?
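
A minimal sketch of such a printer in plain Guile follows. The
procedure name and its arguments are hypothetical; in Guix the strings
would come from the package record.

```scheme
;; Print one search result as "NAME VERSION", followed by the synopsis
;; indented by four spaces on its own line.
(define (display-short-result name version synopsis port)
  (format port "~a ~a~%    ~a~%" name version synopsis))

(display-short-result "inkscape" "0.92.4" "Vector graphics editor"
                      (current-output-port))
```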

* How to test this patchset

To get guile-xapian, run a `guix pull`, if you haven't already. Then in your
Guix source directory, drop into an environment with guix dependencies and
guile-xapian.

$ guix environment guix --ad-hoc guile-xapian

Apply patches and build.

$ git am v2-0001-build-self-Add-guile-xapian-to-Guix-dependencies.patch v2-0002-gnu-Generate-Xapian-package-search-index.patch v2-0003-gnu-Use-Xapian-index-for-package-search.patch
$ make

Run a test guix pull.

$ ./pre-inst-env guix pull --url=$PWD --branch=xapian -p /tmp/test

where xapian is the name of the branch you committed the patches to.

Then, run the guix search in /tmp/test.

$ /tmp/test/bin/guix search game

* Comments

Pierre Neidhardt <mail@ambrevar.xyz> writes:

>> +(define (search-package-index profile querystring)
>
> Maybe `query-string'?

Done in this patchset.

>> +  (define (regexp? str)
>> +    (string-any
>> +     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
>> +     str))
>> +
>> +  (if (and (current-profile)
>> +           (not (any regexp? patterns)))
>
> I would not put characters like ".", "$", or "+" here, lest we mistake a
> Xapian pattern for a regexp.
>
> As you said, I don't think both are compatible without ambiguity
> anyways, so we should probably drop regexp (or at least toggle them with
> a command line argument).

I agree.

zimoun <zimon.toutoune@gmail.com> writes:

> In the commit message, I would capitalize Xapian.

Done in this patchset.

>> +(define (generate-package-search-index directory)
>> +  "Generate under DIRECTORY a xapian index of all the available packages."
>
> Xapian with capital.

Done in this patchset.

> Is (make-stem "en") for the locale?

I still have English hard-coded. I haven't yet figured out how to detect the
locale and stem accordingly. But, there is a larger problem. Since we cannot
anticipate what locale the user will run guix search with, should we build the
Xapian index for all locales? That is, should we index not only the English
versions of the packages but also all other translations as well?
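
As a starting point for the detection part only, here is a sketch in
plain Guile that extracts the language code from a locale name.
'locale->language' is a hypothetical helper; whether Xapian has a
stemmer for the resulting code is not checked here.

```scheme
;; Extract the language code from a locale name such as "fr_FR.UTF-8".
;; Names without a language part ("C", "POSIX", "") fall back to "en".
(define (locale->language locale)
  (let* ((end  (or (string-index locale (char-set #\_ #\. #\@))
                   (string-length locale)))
         (lang (substring locale 0 end)))
    (if (member lang '("" "C" "POSIX"))
        "en"
        lang)))

(display (locale->language "fr_FR.UTF-8"))
(newline)
(display (locale->language "C"))
(newline)
```

The locale name itself could come from (setlocale LC_MESSAGES), which in
Guile returns the current setting when called without a second argument.
This does not answer the question of which translations to index.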

> package-search-index and package-cache-file could be refactored
> because they share all the same code.

Yes, they could be. However, I'll postpone that to the next iteration of the
patchset.

> I do not know what is the convention for the bindings.
> But there is 'fold-packages' so I would be inclined to 'fold-msets' or
> something in this flavour.

Well, everywhere else in guile we have such things as vhash-fold, string-fold,
hash-fold, stream-fold, etc. That's why I went with mset-fold. Also, we are
folding over a single mset (match-set). So, mset should be in the singular.
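
For illustration, two of these folds from core Guile. The hash table
and its contents are made up for the example.

```scheme
;; Each fold takes a procedure, an initial value and the collection;
;; the accumulator is the last argument passed to the procedure.

;; string-fold: collect the characters of a string, last one first.
(define reversed-chars (string-fold cons '() "abc"))

;; hash-fold: sum the values stored in a hash table.
(define table (make-hash-table))
(hash-set! table 'emacs 3)
(hash-set! table 'vim 4)
(define total
  (hash-fold (lambda (key value sum) (+ value sum)) 0 table))

(display reversed-chars)
(newline)
(display total)
(newline)
```

mset-fold in the patch follows the same shape: a procedure taking the
item and the accumulated result, an initial value, then the mset.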

> And more importantly, 'make as-derivations' to avoid a "guix pull" breakage,
> Ah do not forget to adapt some tests.

Will do this once we have consensus about the other features of this patchset.

>  b. The xapian relevance should be truncated

Done in this patchset.

> Xapian does not return the package 'emacs' itself as the first. And worse,
> it is not returned at all.

In this patchset, since we're indexing the package name as well, emacs is
returned but it is still far from the beginning.

> I propose the value of 4294967295 for pagesize.

In this patchset, I pass (database-document-count db) as the #:maximum-items
keyword argument to enquire-mset. This is the upstream recommended way to get
all search results. I hadn't done this earlier since I hadn't yet wrapped
database-document-count in guile-xapian.

>> In this patchset, I have only indexed the package descriptions. In the next
>> version of this patchset, I will index all other terms as specified in
>> %package-metrics of guix/ui.scm.
>
> Yes, it appears to me a detail that should be easy to fix. I mean, it
> does not seems blocking.

Done in this patchset.

Ludovic Courtès <ludo@gnu.org> writes:

> Note that ‘guix search’ time is largely dominated by I/O.

Yes, `guix search` is I/O intensive. That is why I expect Xapian to do better
since it only needs to access matching packages not all packages. Also, the
Xapian index is fast at all times. It is not very dependent on a warm
filesystem cache.

> On my laptop,
> I get (first measurement is cold cache, second one is warm cache):
>
> --8<---------------cut here---------------start------------->8---
> $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
> $ time guix search foo >/dev/null
>
> real    0m2.631s
> user    0m1.134s
> sys     0m0.124s
> $ time guix search foo >/dev/null
>
> real    0m0.836s
> user    0m1.027s
> sys     0m0.053s
> --8<---------------cut here---------------end--------------->8---
>
> It’s hard to do better on the warm cache case because at this level,
> there may be other things to optimize having little to do with searching
> itself.
>
> Note that this is on an SSD; the cold-cache case must be worse on NFS or
> on a spinning disk, and there we could gain a lot.

My laptop is quite old with a particularly slow HDD. Hence my motivation to
improve guix search performance!

> I think we should weigh the pros and cons on all these aspects: speed,
> complexity and maintenance cost, search result quality, search features,
> etc.

I agree.

> PS: I have not yet looked at the whole series as I’m just coming back to
>     the keyboard.  :-)

Welcome back! :-)

Arun Isaac (3):
  build-self: Add guile-xapian to Guix dependencies.
  gnu: Generate Xapian package search index.
  gnu: Use Xapian index for package search.

 build-aux/build-self.scm | 11 +++++++
 gnu/packages.scm         | 62 +++++++++++++++++++++++++++++++++++++++-
 guix/channels.scm        | 34 +++++++++++++++++++++-
 guix/scripts/package.scm |  7 +++--
 guix/self.scm            |  7 ++++-
 guix/ui.scm              | 37 ++++++++++++++++++++++++
 6 files changed, 153 insertions(+), 5 deletions(-)

-- 
2.25.1


* [bug#39258] [PATCH v2 1/3] build-self: Add guile-xapian to Guix dependencies.
  2020-03-07 13:31 ` [bug#39258] [PATCH v2 0/3] " Arun Isaac
@ 2020-03-07 13:31   ` Arun Isaac
  2020-03-09 18:14     ` zimoun
  2020-03-09 23:40     ` Jonathan Brielmaier
  2020-03-07 13:31   ` [bug#39258] [PATCH v2 2/3] gnu: Generate Xapian package search index Arun Isaac
                     ` (4 subsequent siblings)
  5 siblings, 2 replies; 126+ messages in thread
From: Arun Isaac @ 2020-03-07 13:31 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, mail, ludo, zimon.toutoune

* build-aux/build-self.scm (build-program): Import fake guile-xapian module.
* guix/self.scm (compiled-guix): Add guile-xapian to Guix dependencies.
---
 build-aux/build-self.scm | 11 +++++++++++
 guix/self.scm            |  7 ++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/build-aux/build-self.scm b/build-aux/build-self.scm
index f2e785b7f1..05d0353ccf 100644
--- a/build-aux/build-self.scm
+++ b/build-aux/build-self.scm
@@ -1,5 +1,6 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2014, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -261,6 +262,10 @@ interface (FFI) of Guile.")
                  #~(define-module (gcrypt hash)
                      #:export (sha1 sha256))))
 
+  (define fake-xapian-hash
+    ;; Fake (xapian xapian) module; see below.
+    (scheme-file "xapian.scm" #~(define-module (xapian xapian))))
+
   (define fake-git
     (scheme-file "git.scm" #~(define-module (git))))
 
@@ -273,6 +278,12 @@ interface (FFI) of Guile.")
                            ;; adjust %LOAD-PATH later on.
                            ((gcrypt hash) => ,fake-gcrypt-hash)
 
+                           ;; To avoid relying on 'with-extensions', which was
+                           ;; introduced in 0.15.0, provide a fake (xapian
+                           ;; xapian) just so that we can build modules, and
+                           ;; adjust %LOAD-PATH later on.
+                           ((xapian xapian) => ,fake-xapian-hash)
+
                            ;; (guix git-download) depends on (git) but only
                            ;; for peripheral functionality.  Provide a dummy
                            ;; (git) to placate it.
diff --git a/guix/self.scm b/guix/self.scm
index 6b633f9bc0..a4f40574d1 100644
--- a/guix/self.scm
+++ b/guix/self.scm
@@ -1,5 +1,6 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -54,6 +55,7 @@
       ("guile-git"  (ref '(gnu packages guile) 'guile3.0-git))
       ("guile-sqlite3" (ref '(gnu packages guile) 'guile3.0-sqlite3))
       ("guile-gcrypt"  (ref '(gnu packages gnupg) 'guile3.0-gcrypt))
+      ("guile-xapian"  (ref '(gnu packages guile-xyz) 'guile3.0-xapian))
       ("gnutls"     (ref '(gnu packages tls) 'guile3.0-gnutls))
       ("zlib"       (ref '(gnu packages compression) 'zlib))
       ("lzlib"      (ref '(gnu packages compression) 'lzlib))
@@ -682,6 +684,9 @@ Info manual."
   (define guile-gcrypt
     (specification->package "guile-gcrypt"))
 
+  (define guile-xapian
+    (specification->package "guile-xapian"))
+
   (define gnutls
     (specification->package "gnutls"))
 
@@ -690,7 +695,7 @@ Info manual."
                          (cons (list "x" package)
                                (package-transitive-propagated-inputs package)))
                        (list guile-gcrypt gnutls guile-git guile-json
-                             guile-ssh guile-sqlite3))
+                             guile-ssh guile-sqlite3 guile-xapian))
       (((labels packages _ ...) ...)
        packages)))
 
-- 
2.25.1


* [bug#39258] [PATCH v2 2/3] gnu: Generate Xapian package search index.
  2020-03-07 13:31 ` [bug#39258] [PATCH v2 0/3] " Arun Isaac
  2020-03-07 13:31   ` [bug#39258] [PATCH v2 1/3] build-self: Add guile-xapian to Guix dependencies Arun Isaac
@ 2020-03-07 13:31   ` Arun Isaac
  2020-03-09 18:19     ` zimoun
  2020-03-07 13:31   ` [bug#39258] [PATCH v2 3/3] gnu: Use Xapian index for package search Arun Isaac
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-03-07 13:31 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, mail, ludo, zimon.toutoune

* guix/ui.scm: Export %package-metrics.
* gnu/packages.scm (%package-search-index): New variable.
(generate-package-search-index): New function.
* guix/channels.scm (package-search-index): New function.
(%channel-profile-hooks): Add package-search-index.
---
 gnu/packages.scm  | 42 +++++++++++++++++++++++++++++++++++++++++-
 guix/channels.scm | 34 +++++++++++++++++++++++++++++++++-
 guix/ui.scm       |  2 ++
 3 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/gnu/packages.scm b/gnu/packages.scm
index d22c992bb1..c8e221de68 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -4,6 +4,7 @@
 ;;; Copyright © 2014 Eric Bavier <bavier@member.fsf.org>
 ;;; Copyright © 2016, 2017 Alex Kost <alezost@gmail.com>
 ;;; Copyright © 2016 Mathieu Lirzin <mthl@gnu.org>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -43,6 +44,7 @@
   #:use-module (srfi srfi-34)
   #:use-module (srfi srfi-35)
   #:use-module (srfi srfi-39)
+  #:use-module (xapian xapian)
   #:export (search-patch
             search-patches
             search-auxiliary-file
@@ -64,7 +66,8 @@
             specification->location
             specifications->manifest
 
-            generate-package-cache))
+            generate-package-cache
+            generate-package-search-index))
 
 ;;; Commentary:
 ;;;
@@ -426,6 +429,43 @@ reducing the memory footprint."
                                #:opts '(#:to-file? #t)))))
   cache-file)
 
+(define %package-search-index
+  ;; Location of the package search-index
+  "/lib/guix/package-search.index")
+
+(define (generate-package-search-index directory)
+  "Generate under DIRECTORY a Xapian index of all the available packages."
+  (define db-path
+    (string-append directory %package-search-index))
+
+  (mkdir-p (dirname db-path))
+  (call-with-writable-database db-path
+    (lambda (db)
+      (fold-packages (lambda (package _)
+                       (let* ((idterm (string-append "Q" (package-name package)))
+                              (doc (make-document #:data (string-trim-right
+                                                          (call-with-output-string
+                                                            (cut package->recutils package <>))
+                                                          #\newline)
+                                                  #:terms `((,idterm . 0))))
+                              (term-generator (make-term-generator #:stem (make-stem "en")
+                                                                   #:document doc)))
+                         (for-each (match-lambda
+                                     ((field . weight)
+                                      (match (field package)
+                                        ((? string? str)
+                                         (index-text! term-generator str
+                                                      #:wdf-increment weight))
+                                        ((lst ...)
+                                         (for-each (cut index-text! term-generator <>
+                                                        #:wdf-increment weight)
+                                                   lst)))
+                                      (replace-document! db idterm doc)))
+                                   %package-metrics)))
+                     #f)))
+
+  db-path)
+
 \f
 (define %sigint-prompt
   ;; The prompt to jump to upon SIGINT.
diff --git a/guix/channels.scm b/guix/channels.scm
index f0261dc2da..c70c70938c 100644
--- a/guix/channels.scm
+++ b/guix/channels.scm
@@ -2,6 +2,7 @@
 ;;; Copyright © 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
 ;;; Copyright © 2018 Ricardo Wurmus <rekado@elephly.net>
 ;;; Copyright © 2019 Jan (janneke) Nieuwenhuizen <janneke@gnu.org>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -581,9 +582,40 @@ be used as a profile hook."
                                                  (hook . package-cache))
                                   #:local-build? #t)))
 
+(define (package-search-index manifest)
+  "Build a package search index for the instance in MANIFEST.  This is meant
+to be used as a profile hook."
+  (mlet %store-monad ((profile (profile-derivation manifest
+                                                   #:hooks '())))
+
+    (define build
+      #~(begin
+          (use-modules (gnu packages))
+
+          (if (defined? 'generate-package-search-index)
+              (begin
+                ;; Delegate package search index generation to the inferior.
+                (format (current-error-port)
+                        "Generating package search index for '~a'...~%"
+                        #$profile)
+                (generate-package-search-index #$output))
+              (mkdir #$output))))
+
+    (gexp->derivation-in-inferior "guix-package-search-index" build
+                                  profile
+
+                                  ;; If the Guix in PROFILE is too old and
+                                  ;; lacks 'guix repl', don't build the cache
+                                  ;; instead of failing.
+                                  #:silent-failure? #t
+
+                                  #:properties '((type . profile-hook)
+                                                 (hook . package-search-index))
+                                  #:local-build? #t)))
+
 (define %channel-profile-hooks
   ;; The default channel profile hooks.
-  (cons package-cache-file %default-profile-hooks))
+  (cons* package-cache-file package-search-index %default-profile-hooks))
 
 (define (channel-instances->derivation instances)
   "Return the derivation of the profile containing INSTANCES, a list of
diff --git a/guix/ui.scm b/guix/ui.scm
index fbe2b70485..3bc82111a5 100644
--- a/guix/ui.scm
+++ b/guix/ui.scm
@@ -14,6 +14,7 @@
 ;;; Copyright © 2019 Chris Marusich <cmmarusich@gmail.com>
 ;;; Copyright © 2019 Tobias Geerinckx-Rice <me@tobias.gr>
 ;;; Copyright © 2019 Simon Tournier <zimon.toutoune@gmail.com>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -120,6 +121,7 @@
             relevance
             package-relevance
             display-search-results
+            %package-metrics
 
             with-profile-lock
             string->generations
-- 
2.25.1


* [bug#39258] [PATCH v2 3/3] gnu: Use Xapian index for package search.
  2020-03-07 13:31 ` [bug#39258] [PATCH v2 0/3] " Arun Isaac
  2020-03-07 13:31   ` [bug#39258] [PATCH v2 1/3] build-self: Add guile-xapian to Guix dependencies Arun Isaac
  2020-03-07 13:31   ` [bug#39258] [PATCH v2 2/3] gnu: Generate Xapian package search index Arun Isaac
@ 2020-03-07 13:31   ` Arun Isaac
  2020-03-07 20:33   ` [bug#39258] [PATCH v2 0/3] Xapian for Guix " Ludovic Courtès
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 126+ messages in thread
From: Arun Isaac @ 2020-03-07 13:31 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, mail, ludo, zimon.toutoune

* gnu/packages.scm (search-package-index): New function.
* guix/ui.scm (display-package-search-results): New function.
* guix/scripts/package.scm (process-query): Search using the Xapian package
index if current profile is available. Else, search using regexps.
---
 gnu/packages.scm         | 22 +++++++++++++++++++++-
 guix/scripts/package.scm |  7 +++++--
 guix/ui.scm              | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/gnu/packages.scm b/gnu/packages.scm
index c8e221de68..3cbd7c63e3 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -67,7 +67,8 @@
             specifications->manifest
 
             generate-package-cache
-            generate-package-search-index))
+            generate-package-search-index
+            search-package-index))
 
 ;;; Commentary:
 ;;;
@@ -466,6 +467,25 @@ reducing the memory footprint."
 
   db-path)
 
+(define (search-package-index profile query-string)
+  "Search Xapian index in PROFILE for packages matching the Xapian query
+QUERY-STRING.  Return a list of search result texts each corresponding to one
+matching package."
+  (call-with-database (string-append profile %package-search-index)
+    (lambda (db)
+      (let ((query (parse-query query-string #:stemmer (make-stem "en"))))
+        (mset-fold (lambda (item result)
+                     (let ((search-result-text
+                            (call-with-output-string
+                              (cut format <> "~a~%relevance: ~a~%~%"
+                                   (document-data (mset-item-document item))
+                                   ;; Round score to one decimal place.
+                                   (/ (round (* 10 (mset-item-weight item))) 10)))))
+                       (append result (list search-result-text))))
+                   '()
+                   (enquire-mset (enquire db query)
+                                 #:maximum-items (database-document-count db)))))))
+
 \f
 (define %sigint-prompt
   ;; The prompt to jump to upon SIGINT.
diff --git a/guix/scripts/package.scm b/guix/scripts/package.scm
index d2f4f1ccd3..91c975b168 100644
--- a/guix/scripts/package.scm
+++ b/guix/scripts/package.scm
@@ -7,6 +7,7 @@
 ;;; Copyright © 2016 Benz Schenk <benz.schenk@uzh.ch>
 ;;; Copyright © 2016 Chris Marusich <cmmarusich@gmail.com>
 ;;; Copyright © 2019 Tobias Geerinckx-Rice <me@tobias.gr>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -781,9 +782,11 @@ processed, #f otherwise."
                                       (_                   #f))
                                     opts))
               (regexps  (map (cut make-regexp* <> regexp/icase) patterns))
-              (matches  (find-packages-by-description regexps)))
+              (matches  (if (current-profile)
+                            (search-package-index (current-profile) (string-join patterns " "))
+                            (find-packages-by-description regexps))))
          (leave-on-EPIPE
-          (display-search-results matches (current-output-port)))
+          (display-package-search-results matches (current-output-port)))
          #t))
 
       (('show requested-name)
diff --git a/guix/ui.scm b/guix/ui.scm
index 3bc82111a5..163042054c 100644
--- a/guix/ui.scm
+++ b/guix/ui.scm
@@ -121,6 +121,7 @@
             relevance
             package-relevance
             display-search-results
+            display-package-search-results
             %package-metrics
 
             with-profile-lock
@@ -1490,6 +1491,40 @@ to view all the results.")
       (()
        #t))))
 
+(define* (display-package-search-results search-results port
+                                         #:key
+                                         (command "guix search"))
+  "Display SEARCH-RESULTS, a list of search result texts each corresponding to
+one matching package.  If PORT is a terminal, print at most a full screen of
+results."
+  (define first-line
+    (port-line port))
+
+  (define max-rows
+    (and first-line (isatty? port)
+         (terminal-rows port)))
+
+  (define (line-count str)
+    (string-count str #\newline))
+
+  (let loop ((search-results search-results))
+    (match search-results
+      ((text rest ...)
+       (if (and (not (getenv "INSIDE_EMACS"))
+                max-rows
+                (> (port-line port) first-line) ;print at least one result
+                (> (+ 4 (line-count text) (port-line port))
+                   max-rows))
+           (unless (null? rest)
+             (display-hint (format #f (G_ "Run @code{~a ... | less} \
+to view all the results.")
+                                   command)))
+           (begin
+             (display text port)
+             (loop rest))))
+      (()
+       #t))))
+
 \f
 (define (string->generations str)
   "Return the list of generations matching a pattern in STR.  This function
-- 
2.25.1


* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-07 13:31 ` [bug#39258] [PATCH v2 0/3] " Arun Isaac
                     ` (2 preceding siblings ...)
  2020-03-07 13:31   ` [bug#39258] [PATCH v2 3/3] gnu: Use Xapian index for package search Arun Isaac
@ 2020-03-07 20:33   ` Ludovic Courtès
  2020-03-08  9:01     ` Arun Isaac
  2020-03-09 12:34     ` zimoun
  2020-03-08 20:27   ` zimoun
  2020-03-09 12:28   ` zimoun
  5 siblings, 2 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-03-07 20:33 UTC (permalink / raw)
  To: Arun Isaac; +Cc: mail, 39258, zimon.toutoune

Hello,

Arun Isaac <arunisaac@systemreboot.net> skribis:

> Here is the second iteration of my Xapian Guix package search patchset. I have
> found the reason the earlier patchset did not show significant speedup. It
> turns out that most of the time is spent in printing and texinfo rendering of
> the search results. So, in this patchset, I pre-render the search results
> while building the Xapian index and stuff them into the Xapian database
> itself. Therefore, during `guix search`, I just pull out the pre-rendered
> search results and print them on the screen. This is much faster. See comparison
> below.
>
> With a warm cache,
> $ time guix search inkscape
>
> real	0m1.787s
> user	0m1.745s
> sys	0m0.111s
>
> $ time /tmp/test/bin/guix search inkscape
>
> real	0m0.199s
> user	0m0.182s
> sys	0m0.024s

Nice!

In general, pre-rendering doesn’t seem practical to me: the output of
‘guix search’ is locale-dependent (it speaks the user’s language) and
adjusts to the terminal width (well, this is temporarily broken on
Guile 3.0.0, but see ‘%text-width’ in (guix ui)).

Also, if the 12K+ descriptions need to be rendered at the time the user
runs ‘guix pull’, the experience may not be great, because it could take
a bit of time.

WDYT?

> Why not use a simpler package search results format like Arch Linux or Debian
> does? We could just display the package name, version and synopsis like so.
>
> inkscape 0.92.4
>     Vector graphics editor
> inklingreader 0.8
>     Wacom Inkling sketch format conversion and manipulation
>
> Why do we need the entire recutils format? If the user is interested, they can
> always use `guix package --show` to get the full recutils formatted
> info. Having shorter search results will make everything even faster and much
> more readable. WDYT?

What I like about the recutils format in this context is that it’s both
human- and machine-readable.  The examples in the manual show how it can
be useful to select the information displayed or to refine the search
(info "(guix) Invoking guix package").

Also: I’d recommend tackling one thing at a time.  :-)

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Note that ‘guix search’ time is largely dominated by I/O.
>
> Yes, `guix search` is I/O intensive. That is why I expect Xapian to do better
> since it only needs to access matching packages not all packages. Also, the
> Xapian index is fast at all times. It is not very dependent on a warm
> filesystem cache.

Yes, indeed.

>> On my laptop,
>> I get (first measurement is cold cache, second one is warm cache):
>>
>> --8<---------------cut here---------------start------------->8---
>> $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
>> $ time guix search foo >/dev/null
>>
>> real    0m2.631s
>> user    0m1.134s
>> sys     0m0.124s
>> $ time guix search foo >/dev/null
>>
>> real    0m0.836s
>> user    0m1.027s
>> sys     0m0.053s
>> --8<---------------cut here---------------end--------------->8---
>>
>> It’s hard to do better on the warm cache case because at this level,
>> there may be other things to optimize having little to do with searching
>> itself.
>>
>> Note that this is on an SSD; the cold-cache case must be worse on NFS or
>> on a spinning disk, and there we could gain a lot.
>
> My laptop is quite old with a particularly slow HDD. Hence my motivation to
> improve guix search performance!

Were you able to measure the cost of rendering specifically?

Here’s what I see when I turn ‘package->recutils’ into a no-op:

--8<---------------cut here---------------start------------->8---
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time ./pre-inst-env guix search foo 

real	0m1.617s
user	0m0.812s
sys	0m0.094s
$ time ./pre-inst-env guix search foo 

real	0m0.595s
user	0m0.747s
sys	0m0.043s
--8<---------------cut here---------------end--------------->8---

To compare with:

--8<---------------cut here---------------start------------->8---
$ time ./pre-inst-env guix search foo >/dev/null

real	0m0.829s
user	0m1.026s
sys	0m0.046s
--8<---------------cut here---------------end--------------->8---

I think we should look at a profile of ‘package->recutils’, there’s
probably room for improvement there.

Thoughts?

Ludo’.


* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-07 20:33   ` [bug#39258] [PATCH v2 0/3] Xapian for Guix " Ludovic Courtès
@ 2020-03-08  9:01     ` Arun Isaac
  2020-03-08 11:33       ` Ludovic Courtès
  2020-03-09 12:40       ` zimoun
  2020-03-09 12:34     ` zimoun
  1 sibling, 2 replies; 126+ messages in thread
From: Arun Isaac @ 2020-03-08  9:01 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: mail, 39258, zimon.toutoune


>> It turns out that most of the time is spent in printing and texinfo
>> rendering of the search results.

Also, when we put all package metadata into the Xapian index, we don't
have to look up any of the package variables in (gnu packages *) during
`guix search` time. This also contributes substantially to the speedup.

> In general, pre-rendering doesn’t seem practical to me: the output of
> ‘guix search’ is locale-dependent (it speaks the user’s language) and

Note that we already need to index package synopses and descriptions in
all languages. I still haven't implemented this, though.

> adjusts to the terminal width (well, this is temporarily broken on
> Guile 3.0.0, but see ‘%text-width’ in (guix ui)).

This could be accomplished even with pre-rendering. Xapian provides
"slots" to store arbitrary strings with a document. Instead of storing
the pre-rendered document as a whole, we could store pre-rendered fields
in separate slots. Then, during `guix search` time, we can assemble the
result from these pre-rendered fields.

> Also, if the 12K+ descriptions need to be rendered at the time the user
> runs ‘guix pull’, the experience may not be great, because it could take
> a bit of time.

This is a problem, but I would see it as a necessary "compilation"
step. :-P In fact, this whole patchset speeds up `guix search` by doing
part of the work of `guix search` ahead of time. So, some such cost is
unavoidable.

> What I like about the recutils format in this context is that it’s both
> human- and machine-readable.  The examples in the manual show how it can
> be useful to select the information displayed or to refine the search
> (info "(guix) Invoking guix package").

Xapian's query language is much more natural (as in natural language)
than the regexp based techniques we need to use with recutils. I have
hardly ever used the regexp based search and I suspect many others
haven't either. Also, refining the search query should be easier to do
with Xapian. We could even use Xapian's query expansion feature to
suggest improved queries to the user.

That said, if we want the recutils format, we can still keep it in a
simplified form like so.

name: inkscape
version: 0.92.4
synopsis: Vector graphics editor

name: inklingreader
version: 0.8
synopsis: Wacom Inkling sketch format conversion and manipulation

> Also: I’d recommend tackling one thing at a time.  :-)

I totally agree, but I'm tempted to say that pre-rendering would be a
lot cheaper with the simplified form of search results. :-)

> Were you able to measure the cost of rendering specifically?

generate-package-search-index takes around 50 seconds. If I modify
generate-package-search-index to not pre-render but simply store the
package description alone, it takes around 20 seconds. So pre-rendering
itself costs roughly 30 seconds.

> I think we should look at a profile of ‘package->recutils’, there’s
> probably room for improvement there.

On quick inspection, most of the time in package->recutils is spent in
texinfo rendering the description. Unless we use the simplified search
results format as discussed above, we cannot avoid it.


* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-08  9:01     ` Arun Isaac
@ 2020-03-08 11:33       ` Ludovic Courtès
  2020-03-08 20:27         ` Arun Isaac
                           ` (2 more replies)
  2020-03-09 12:40       ` zimoun
  1 sibling, 3 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-03-08 11:33 UTC (permalink / raw)
  To: Arun Isaac; +Cc: mail, 39258, zimon.toutoune

Hi,

Arun Isaac <arunisaac@systemreboot.net> skribis:

>>> It turns out that most of the time is spent in printing and texinfo
>>> rendering of the search results.
>
> Also, when we put all package metadata into the Xapian index, we don't
> have to look up any of the package variables in (gnu packages *) during
> `guix search` time. This also contributes substantially to the speedup.

Yup.

>> In general, pre-rendering doesn’t seem practical to me: the output of
>> ‘guix search’ is locale-dependent (it speaks the user’s language) and
>
> Note that we already need to index package synopses and descriptions in
> all languages. I still haven't implemented this, though.

Oh, right.  Tricky!

>> adjusts to the terminal width (well, this is temporarily broken on
>> Guile 3.0.0, but see ‘%text-width’ in (guix ui)).
>
> This could be accomplished even with pre-rendering. Xapian provides
> "slots" to store arbitrary strings with a document. Instead of storing
> the pre-rendered document as a whole, we could store pre-rendered fields
> in separate slots. Then, during `guix search` time, we can assemble the
> result from these pre-rendered fields.

I’m not sure I understand.  The index wouldn’t store pre-rendered
strings for every possible terminal width, right?

>> Also, if the 12K+ descriptions need to be rendered at the time the user
>> runs ‘guix pull’, the experience may not be great, because it could take
>> a bit of time.
>
> This is a problem, but I would see it as a necessary "compilation"
> step. :-P In fact, this whole patchset speeds up `guix search` by doing
> part of the work of `guix search` ahead of time. So, some such cost is
> unavoidable.

Yeah.  I think we need to take the whole user experience into account,
not just ‘guix search’.  ‘guix pull’ already feels very slow, and it’s a
fairly common operation.  Conversely, ‘guix search’ takes roughly
between 0.5 and 2 seconds and is an uncommon operation on a “slow path”
(in the sense that when you’re searching for software, you’ll probably
have to spend more than a couple of seconds to find what you’re looking
for.)

>> What I like about the recutils format in this context is that it’s both
>> human- and machine-readable.  The examples in the manual show how it can
>> be useful to select the information displayed or to refine the search
>> (info "(guix) Invoking guix package").
>
> Xapian's query language is much more natural (as in natural language)
> than the regexp based techniques we need to use with recutils. I have
> hardly ever used the regexp based search and I suspect many others
> haven't either. Also, refining the search query should be easier to do
> with Xapian. We could even use Xapian's query expansion feature to
> suggest improved queries to the user.

I’m not sufficiently familiar with Xapian’s query language.  The
examples I had in mind were:

  guix search malloc | recsel -p name,version,relevance
  guix search | recsel -p name -e 'license ~ "LGPL 3"'
  guix search crypto library | \
    recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis

It’s not so much about regexps as it is about selecting individual
fields.

>> Were you able to measure the cost of rendering specifically?
>
> generate-package-search-index takes around 50 seconds. If I modify
> generate-package-search-index to not pre-render but simply store the
> package description alone, it takes around 20 seconds. So pre-rendering
> itself costs roughly 30 seconds.

To me, adding 20–50 seconds on ‘guix pull’ would be undesirable.  :-/

>> I think we should look at a profile of ‘package->recutils’, there’s
>> probably room for improvement there.
>
> On quick inspection, most of the time in package->recutils is spent in
> texinfo rendering the description. Unless we use the simplified search
> results format as discussed above, we cannot avoid it.

What I meant was that we could use (statprof) to see whether/how Texinfo
rendering/parsing can be optimized.
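
As a rough sketch of what such a profiling run could look like (not
tested against Guix itself): the helper name `profile-rendering' and the
stand-in renderer below are hypothetical and only serve to keep the
example self-contained.  With Guix on the load path, one would instead
pass `package->recutils' and the list of packages.

```scheme
(use-modules (statprof))

(define (profile-rendering render items)
  "Call RENDER on each of ITEMS, sending output to a void port, and let
statprof print a flat profile of where the time went."
  (let ((port (%make-void-port "w")))   ;discard the rendered text
    (statprof
     (lambda ()
       (for-each (lambda (item)
                   (render item port))
                 items)))))

;; Hypothetical stand-in for the real renderer; it only needs to accept
;; an item and an output port, like 'package->recutils' does.
(define (dummy-render item port)
  (display (string-upcase item) port)
  (newline port))

(profile-rendering dummy-render '("inkscape" "inklingreader"))
```

On such a tiny workload statprof will most likely report that no
samples were recorded; the point is only the shape of the call.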

Thanks,
Ludo’.


* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-07 13:31 ` [bug#39258] [PATCH v2 0/3] " Arun Isaac
                     ` (3 preceding siblings ...)
  2020-03-07 20:33   ` [bug#39258] [PATCH v2 0/3] Xapian for Guix " Ludovic Courtès
@ 2020-03-08 20:27   ` zimoun
  2020-03-08 20:40     ` Arun Isaac
  2020-03-09 12:28   ` zimoun
  5 siblings, 1 reply; 126+ messages in thread
From: zimoun @ 2020-03-08 20:27 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, Pierre Neidhardt, 39258

Hi Arun,

Thank you for that.
I will probably be away from the keyboard for a couple of days for
medical reasons* and will not be able to look into this second set of
patches soon.

Cheers,
simon

*Nothing to worry about, even if I am currently typing this in a
hospital; it is just the unfortunate consequence of sports. ;-)


* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-08 11:33       ` Ludovic Courtès
@ 2020-03-08 20:27         ` Arun Isaac
  2020-03-09  7:42           ` Pierre Neidhardt
  2020-03-09 10:35           ` Ludovic Courtès
  2020-03-09  7:50         ` Pierre Neidhardt
  2020-03-09 12:47         ` zimoun
  2 siblings, 2 replies; 126+ messages in thread
From: Arun Isaac @ 2020-03-08 20:27 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: mail, 39258, zimon.toutoune


>> This could be accomplished even with pre-rendering. Xapian provides
>> "slots" to store arbitrary strings with a document. Instead of storing
>> the pre-rendered document as a whole, we could store pre-rendered fields
>> in separate slots. Then, during `guix search` time, we can assemble the
>> result from these pre-rendered fields.
>
> I’m not sure I understand.  The index wouldn’t store pre-rendered
> strings for every possible terminal width, right?

No, it wouldn't. It would store a partially pre-rendered string, that is
without fill-paragraph. We run fill-paragraph at `guix search` time to
complete the rendering.
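
To illustrate the split, here is a minimal sketch, assuming the index
stores each description as unfilled text.  `wrap-text' is a hypothetical
stand-in for the fill-paragraph step, written out here only so that the
example runs in plain Guile; it is not what the patchset does today.

```scheme
(define (wrap-text text width)
  "Greedily re-flow TEXT so that lines are at most WIDTH columns wide
(a single word longer than WIDTH still gets a line of its own)."
  (let loop ((words (string-tokenize text)) ;split on whitespace
             (line  '())                    ;words of current line, reversed
             (len   0)                      ;length of current line
             (lines '()))                   ;finished lines, reversed
    (define (finish)
      ;; Push the current line, if any, onto the finished lines.
      (if (null? line)
          lines
          (cons (string-join (reverse line) " ") lines)))
    (cond ((null? words)
           (string-join (reverse (finish)) "\n"))
          ((or (null? line)
               (<= (+ len 1 (string-length (car words))) width))
           ;; The next word fits (or the line is empty): append it.
           (loop (cdr words)
                 (cons (car words) line)
                 (+ len (if (null? line) 0 1) (string-length (car words)))
                 lines))
          (else
           ;; The next word does not fit: start a new line.
           (loop words '() 0 (finish))))))

;; At search time only the cheap filling step remains; the width would
;; come from the terminal.
(define stored-description
  "Wacom Inkling sketch format conversion and manipulation")

(display (wrap-text stored-description 30))
(newline)
```

The expensive Texinfo rendering would then happen once, at index time,
while the width-dependent part stays at `guix search` time.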

> I think we need to take the whole user experience into account, not
> just ‘guix search’.  ‘guix pull’ already feels very slow, and it’s a
> fairly common operation.  Conversely, ‘guix search’ takes roughly
> between 0.5 and 2 seconds and is an uncommon operation on a “slow
> path” (in the sense that when you’re searching for software, you’ll
> probably have to spend more than a couple of seconds to find what
> you’re looking for.)

I agree we can't compromise too much on `guix pull` performance.

> To me, adding 20–50 seconds on ‘guix pull’ would be undesirable.  :-/

Maybe I'm missing something here. guix pull takes around 40 minutes on
my machine. In comparison to that, is another 20-50 seconds (roughly 1
minute) a big deal? How much time would it be acceptable to spend on
building the Xapian index?

Also, is it possible to somehow provide substitutes for the Xapian index
so that the user does not have to actually build it locally during `guix
pull` time?

> I’m not sufficiently familiar with Xapian’s query language.  The
> examples I had in mind were:
> It’s not so much about regexps as it is about selecting individual
> fields.

I have totally not tested this, but I imagine that equivalent Xapian
queries might look something like:

>   guix search | recsel -p name -e 'license ~ "LGPL 3"'

guix search license:LGPL3

>   guix search crypto library | \
>     recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis

guix search crypto library AND (NOT ghc) AND (NOT perl) AND (NOT python)
AND (NOT ruby)

> What I meant was that we could use (statprof) to see whether/how Texinfo
> rendering/parsing can be optimized.

Oh, ok. I'll try this if we decide not to pre-render.


* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-08 20:27   ` zimoun
@ 2020-03-08 20:40     ` Arun Isaac
  0 siblings, 0 replies; 126+ messages in thread
From: Arun Isaac @ 2020-03-08 20:40 UTC (permalink / raw)
  To: zimoun; +Cc: Ludovic Courtès, Pierre Neidhardt, 39258


zimoun <zimon.toutoune@gmail.com> writes:

> I will probably be away from the keyboard for a couple of days for
> medical reasons* and will not be able to look into this second set of
> patches soon.

No problem, take care! :-)

> *Nothing to worry about, even if I am currently typing this in a
> hospital; it is just the unfortunate consequence of sports. ;-)


* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-08 20:27         ` Arun Isaac
@ 2020-03-09  7:42           ` Pierre Neidhardt
  2020-03-09 12:50             ` zimoun
  2020-03-09 10:35           ` Ludovic Courtès
  1 sibling, 1 reply; 126+ messages in thread
From: Pierre Neidhardt @ 2020-03-09  7:42 UTC (permalink / raw)
  To: Arun Isaac, Ludovic Courtès; +Cc: 39258, zimon.toutoune

Arun Isaac <arunisaac@systemreboot.net> writes:

>> I’m not sufficiently familiar with Xapian’s query language.  The
>> examples I had in mind were:
>> It’s not so much about regexps as it is about selecting individual
>> fields.
>
> I have totally not tested this, but I imagine that equivalent Xapian
> queries might look something like:
>
>>   guix search | recsel -p name -e 'license ~ "LGPL 3"'
>
> guix search license:LGPL3
>
>>   guix search crypto library | \
>>     recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis
>
> guix search crypto library AND (NOT ghc) AND (NOT perl) AND (NOT python)
> AND (NOT ruby)

Indeed, if you look at the notmuch-search-terms man page, you'll see
that you can select fields.
In my opinion, the recsel format is fully superseded by Xapian.

-- 
Pierre Neidhardt
https://ambrevar.xyz/


* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-08 11:33       ` Ludovic Courtès
  2020-03-08 20:27         ` Arun Isaac
@ 2020-03-09  7:50         ` Pierre Neidhardt
  2020-03-09 10:28           ` Ludovic Courtès
  2020-03-09 12:53           ` zimoun
  2020-03-09 12:47         ` zimoun
  2 siblings, 2 replies; 126+ messages in thread
From: Pierre Neidhardt @ 2020-03-09  7:50 UTC (permalink / raw)
  To: Ludovic Courtès, Arun Isaac; +Cc: 39258, zimon.toutoune

[-- Attachment #1: Type: text/plain, Size: 1762 bytes --]

Ludovic Courtès <ludo@gnu.org> writes:

> Yeah.  I think we need to take the whole user experience into account,
> not just ‘guix search’.  ‘guix pull’ already feels very slow, and it’s a
> fairly common operation.  Conversely, ‘guix search’ takes roughly
> between 0.5 and 2 seconds and is an uncommon operation on a “slow path”
> (in the sense that when you’re searching for software, you’ll probably
> have to spend more than a couple of seconds to find what you’re looking
> for.)

I think I disagree with "guix search" being an uncommon operation and a
slow path.

- The slowness of `guix search' (and the awkwardness of recutils) is
  maybe what makes it uncommon: users refrain from using it because it's
  too impractical.

- Searches are typically refined, i.e. you run a search multiple times
  by refining the terms, so in that sense I believe `guix search` is a
  very common operation.  Or should be.

Anyways, one of the key issues here is the inherent limitation of the
shell interface that does not allow us to directly and contextually
process the output of a command (at least not without rerunning it).

This issue can only be tackled with a GUI: there the user would be able
to interact with the results of the search, without having to
re-run the search.

Concretely, the GUI search would only return the package name, version
and synopses.  No need for the Texinfo / recutils juggling.

Then the user would select the packages of interest to display more
details.  This allows us to query the full details just-in-time.



Back to the topic: I believe that Xapian is a huge win both for the
shell and the future GUI :)

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-09  7:50         ` Pierre Neidhardt
@ 2020-03-09 10:28           ` Ludovic Courtès
  2020-03-09 13:03             ` zimoun
  2020-03-09 12:53           ` zimoun
  1 sibling, 1 reply; 126+ messages in thread
From: Ludovic Courtès @ 2020-03-09 10:28 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: Arun Isaac, 39258, zimon.toutoune

Hello,

Pierre Neidhardt <mail@ambrevar.xyz> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Yeah.  I think we need to take the whole user experience into account,
>> not just ‘guix search’.  ‘guix pull’ already feels very slow, and it’s a
>> fairly common operation.  Conversely, ‘guix search’ takes roughly
>> between 0.5 and 2 seconds and is an uncommon operation on a “slow path”
>> (in the sense that when you’re searching for software, you’ll probably
>> have to spend more than a couple of seconds to find what you’re looking
>> for.)
>
> I think I disagree with "guix search" being an uncommon operation and a
> slow path.

(Not “and” but “on” a slow path.)

> - The slowness of `guix search' (and the awkwardness of recutils) is
>   maybe what makes it uncommon: users refrain from using it because it's
>   too impractical.

I think “slowness” and “awkwardness” are overstatements.  I’m not saying
this is perfect, but to me it’s not bad.  (Of course I’m biased :-), but
I’ve used other similar tools and this one looks rather good compared to
what I’ve used.)

> - Searches are typically refined, i.e. you run a search multiple times
>   by refining the terms, so in that sense I believe `guix search` is a
>   very common operation.  Or should be.
>
> Anyways, one of the key issues here is the inherent limitation of the
> shell interface that does not allow us to directly and contextually
> process the output of a command (at least not without rerunning it).

I agree, but ‘guix search’ is a shell command, so we have to adapt to
that context.

> Concretely, the GUI search would only return the package name, version
> and synopses.  No need for the Texinfo / recutils juggling.
>
> Then the user would select the packages of interest to display more
> details.  This allows us to query the full details just-in-time.

Note that Emacs-Guix does that, although it doesn’t use the search
facility of (guix ui) with relevance metrics.

> Back to the topic: I believe that Xapian is a huge win both for the
> shell and the future GUI :)

It could be, but we need to consider all the aspects of the story,
including the maintenance cost and overhead moved to ‘guix pull’.  So
it’s not so much about “beliefs” at this point, but rather about
demonstrating what can be done, and I’m glad Arun is exploring that
space!

Ludo’.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-08 20:27         ` Arun Isaac
  2020-03-09  7:42           ` Pierre Neidhardt
@ 2020-03-09 10:35           ` Ludovic Courtès
  2020-03-10 14:17             ` Arun Isaac
  1 sibling, 1 reply; 126+ messages in thread
From: Ludovic Courtès @ 2020-03-09 10:35 UTC (permalink / raw)
  To: Arun Isaac; +Cc: mail, 39258, zimon.toutoune

Hello!

Arun Isaac <arunisaac@systemreboot.net> skribis:

>>> This could be accomplished even with pre-rendering. Xapian provides
>>> "slots" to store arbitrary strings with a document. Instead of storing
>>> the pre-rendered document as a whole, we could store pre-rendered fields
>>> in separate slots. Then, during `guix search` time, we can assemble the
>>> result from these pre-rendered fields.
>>
>> I’m not sure I understand.  The index wouldn’t store pre-rendered
>> strings for every possible terminal width, right?
>
> No, it wouldn't. It would store a partially pre-rendered string, that is
> without fill-paragraph. We run fill-paragraph at `guix search` time to
> complete the rendering.

Note that Texinfo rendering doesn’t use (@ (guix ui) fill-paragraph).
It has its own paragraph-filling code.  We cannot use ‘fill-paragraph’
after Texinfo rendering anyway, since Texinfo knows where things can be
filled and where they cannot—e.g., @example.

>> I think we need to take the whole user experience into account, not
>> just ‘guix search’.  ‘guix pull’ already feels very slow, and it’s a
>> fairly common operation.  Conversely, ‘guix search’ takes roughly
>> between 0.5 and 2 seconds and is an uncommon operation on a “slow
>> path” (in the sense that when you’re searching for software, you’ll
>> probably have to spend more than a couple of seconds to find what
>> you’re looking for.)
>
> I agree we can't compromise too much on `guix pull` performance.
>
>> To me, adding 20–50 seconds on ‘guix pull’ would be undesirable.  :-/
>
> Maybe I'm missing something here. guix pull takes around 40 minutes on
> my machine. In comparison to that, is another 20-50 seconds (roughly 1
> minute) a big deal? How much time would it be acceptable to spend on
> building the Xapian index?

On my laptop, in the best case, when all the substitutes are available
(not uncommon), it takes 2 minutes.  Sometimes, when some substitutes
are missing, it takes 15 minutes.

So of course, the 20–50 seconds matter only in the best case.  But they
matter primarily because that index build may not be substitutable: it’s
possibly unique to each profile (see below).  That means we know we’re
often going to pay for it.

> Also, is it possible to somehow provide substitutes for the Xapian index
> so that the user does not have to actually build it locally during `guix
> pull` time?

We could provide a substitute for users who use only the official 'guix
channel.  However, as soon as users combine multiple channels, they’ll
have to build the index locally.

>> I’m not sufficiently familiar with Xapian’s query language.  The
>> examples I had in mind were:
>> It’s not so much about regexps than it is about selecting individual
>> fields.
>
> I have totally not tested this, but I imagine that equivalent Xapian
> queries might look something like:
>
>>   guix search | recsel -p name -e 'license ~ "LGPL 3"'
>
> guix search license:LGPL3

Nice.

>>   guix search crypto library | \
>>     recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis
>
> guix search crypto library AND (NOT ghc) AND (NOT perl) AND (NOT python)
> AND (NOT ruby)

This one is not quite equivalent I guess, but yeah.  :-)

>> What I meant was that we could use (statprof) to see whether/how Texinfo
>> rendering/parsing can be optimized.
>
> Oh, ok. I'll try this if we decide not to pre-render.

It’d be beneficial anyways.

Thank you!

Ludo’.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-07 13:31 ` [bug#39258] [PATCH v2 0/3] " Arun Isaac
                     ` (4 preceding siblings ...)
  2020-03-08 20:27   ` zimoun
@ 2020-03-09 12:28   ` zimoun
  5 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-09 12:28 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, Pierre Neidhardt, 39258

Hi,

On Sat, 7 Mar 2020 at 14:31, Arun Isaac <arunisaac@systemreboot.net> wrote:

> --8<---------------cut here---------------start------------->8---
> With a warm cache,
> $ time guix search inkscape
>
> real    0m1.787s
> user    0m1.745s
> sys     0m0.111s
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> $ time /tmp/test/bin/guix search inkscape
>
> real    0m0.199s
> user    0m0.182s
> sys     0m0.024s
> --8<---------------cut here---------------end--------------->8---

IMHO, it would be interesting to compare the list of results and their
order for both queries, as I did with Emacs.
Speed is one thing, and it was the initial motivation. But accuracy is
maybe more important.


> - The package cache would grow in size, and lookup would be slowed down
>   because we need to load the entire cache into memory. Xapian, on the other
>   hand, need only look up the specific packages that match the search query.

I agree that 'fold-packages' could soon become a bottleneck.

IMHO, 'mset-fold' should be a drop-in replacement for 'fold-packages' in
the search function.


> - Xapian can provide superior search results due to its stemming and language
>   models.
> - Xapian can provide spelling correction and query expansion -- that is,
>   suggest search terms to improve search results. Note that I haven't
>   implemented this yet and it is out of scope for this patchset.

I agree too that Xapian should improve the user experience when searching.


> * Simplify our package search results
>
> Why not use a simpler package search results format like Arch Linux or Debian
> does? We could just display the package name, version and synopsis like so.
>
> inkscape 0.92.4
>     Vector graphics editor
> inklingreader 0.8
>     Wacom Inkling sketch format conversion and manipulation
>
> Why do we need the entire recutils format? If the user is interested, they can
> always use `guix package --show` to get the full recutils formatted
> info. Having shorter search results will make everything even faster and much
> more readable. WDYT?

I disagree.

What I proposed some time ago was to have different flavours of the
search output, as with e.g. 'git log --pretty=oneline'.
For example, the default should be what you suggest. Then "guix
search --format=full" should output the current format. And we could
imagine mimicking the Git log strategy: "guix search --format='%name
%version\n%license'", etc.
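
A rough sketch of how such a format string could be expanded, in plain
Guile. The procedure name 'expand-format' and the alist representation
of a package record are made up for illustration and are not part of
(guix ui); placeholders with no matching field are simply left as they
are. The sample values are the ones quoted above.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper: expand "%field" placeholders in TEMPLATE using
;; RECORD, an association list of (field-name . value) strings.
;; Placeholders that name an unknown field are left untouched.
(define (expand-format template record)
  (regexp-substitute/global
   #f "%([a-z]+)" template
   'pre
   (lambda (m)
     (or (assoc-ref record (match:substring m 1))
         (match:substring m 0)))
   'post))

;; Sample record built from the search result quoted in this thread.
(define sample-record
  '(("name" . "inkscape")
    ("version" . "0.92.4")
    ("synopsis" . "Vector graphics editor")))

(display (expand-format "%name %version\n%synopsis" sample-record))
(newline)
```

A real implementation would presumably also need an escape for a
literal "%" and a decision on how to handle multi-valued fields.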

WDYT?



> > Is (make-stem "en") for the locale?
>
> I still have English hard-coded. I haven't yet figured out how to detect the
> locale and stem accordingly. But, there is a larger problem. Since we cannot
> anticipate what locale the user will run guix search with, should we build the
> Xapian index for all locales? That is, should we index not only the English
> versions of the packages but also all other translations as well?

I understand. Let's consider that for the next round.


> > package-search-index and package-cache-file could be refactored
> > because they share all the same code.
>
> Yes, they could be. However, I'll postpone to the next iteration of the
> patchset.

Ok.


> > I do not know what is the convention for the bindings.
> > But there is 'fold-packages' so I would be inclined to 'fold-msets' or
> > something in this flavour.
>
> Well, everywhere else in guile we have such things as vhash-fold, string-fold,
> hash-fold, stream-fold, etc. That's why I went with mset-fold. Also, we are
> folding over a single mset (match-set). So, mset should be in the singular.

I understand.


> > And more importantly, 'make as-derivations' to avoid a "guix pull" breakage,
> > Ah do not forget to adapt some tests.
>
> Will do this once we have consensus about the other features of this patchset.

And we should test that on different machines and states.



> > Xapian does not return the package 'emacs' itself as the first. And worse,
> > it is not returned at all.
>
> In this patchset, since we're indexing the package name as well, emacs is
> returned but it is still far from the beginning.

This is an issue.

IMHO, it is because of the BM25 score. It is too rough and some weights
should be applied. But that is another story.
The fix is:
 a- provide a scoring function to Xapian, as the doc explains
 b- replace 'fold-packages' with 'mset-fold' in
'find-packages-by-description', implement our own version of BM25, then
use it in 'relevance'
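
For option b, the textbook Okapi BM25 formula is small enough to sketch
in plain Guile. This is only the standard formula, not what Xapian
computes internally (its weighting scheme has more parameters), and the
names 'bm25', 'idf', etc. are hypothetical. Documents are assumed to be
already tokenized into lists of lower-case words; k1 = 1.2 and b = 0.75
are just commonly used defaults, and the three sample documents are
invented.

```scheme
(use-modules (srfi srfi-1))

;; Number of times TERM occurs in DOC, a list of words.
(define (term-frequency term doc)
  (count (lambda (word) (string=? word term)) doc))

;; Number of documents in DOCS that contain TERM at least once.
(define (document-frequency term docs)
  (count (lambda (doc) (member term doc)) docs))

;; Inverse document frequency: rare terms get a higher value.
(define (idf term docs)
  (let ((n (length docs))
        (df (document-frequency term docs)))
    (log (+ 1 (/ (+ (- n df) 0.5) (+ df 0.5))))))

;; BM25 score of DOC for QUERY (a list of terms), relative to the whole
;; collection DOCS, which supplies the IDF and the average length.
(define* (bm25 query doc docs #:key (k1 1.2) (b 0.75))
  (let ((avgdl (/ (apply + (map length docs)) (length docs)))
        (dl (length doc)))
    (fold (lambda (term score)
            (let ((tf (term-frequency term doc)))
              (+ score
                 (* (idf term docs)
                    (/ (* tf (+ k1 1))
                       (+ tf (* k1 (+ (- 1 b) (* b (/ dl avgdl))))))))))
          0
          query)))

;; Invented sample collection, for illustration only.
(define sample-docs
  '(("emacs" "extensible" "text" "editor")
    ("emacs" "interface" "for" "git")
    ("vector" "graphics" "editor")))
```

The point is that the IDF factor and the length normalization depend on
the whole collection, not only on the package being scored.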


> > I propose the value of 4294967295 for pagesize.
>
> In this patchset, I pass (database-document-count db) as the #:maximum-items
> keyword argument to enquire-mset. This is the upstream recommended way to get
> all search results. I hadn't done this earlier since I hadn't yet wrapped
> database-document-count in guile-xapian.

Cool!



> My laptop is quite old with a particularly slow HDD. Hence my motivation to
> improve guix search performance!

I agree.
But performance is not all. Accuracy counts more! :-)


> > I think we should weigh the pros and cons on all these aspects: speed,
> > complexity and maintenance cost, search result quality, search features,
> > etc.
>
> I agree.

I agree too.
We should write a benchmark, for example using Emacs as the query, or
a more complex one we could think of.


All the best,
simon

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-07 20:33   ` [bug#39258] [PATCH v2 0/3] Xapian for Guix " Ludovic Courtès
  2020-03-08  9:01     ` Arun Isaac
@ 2020-03-09 12:34     ` zimoun
  1 sibling, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-09 12:34 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Arun Isaac, Pierre Neidhardt, 39258

Hi,

On Sat, 7 Mar 2020 at 21:33, Ludovic Courtès <ludo@gnu.org> wrote:
> Arun Isaac <arunisaac@systemreboot.net> skribis:

> > Why not use a simpler package search results format like Arch Linux or Debian
> > does? We could just display the package name, version and synopsis like so.
> >
> > inkscape 0.92.4
> >     Vector graphics editor
> > inklingreader 0.8
> >     Wacom Inkling sketch format conversion and manipulation
> >
> > Why do we need the entire recutils format? If the user is interested, they can
> > always use `guix package --show` to get the full recutils formatted
> > info. Having shorter search results will make everything even faster and much
> > more readable. WDYT?
>
> What I like about the recutils format in this context is that it’s both
> human- and machine-readable.  The examples in the manual show how it can
> be useful to select the information displayed or to refine the search
> (info "(guix) Invoking guix package").
>
> Also: I’d recommend tackling one thing at a time.  :-)

I agree with Ludo.

And IMHO, we should add "guix search --format=<options>" mimicking how
"git log" works.
By default, it would display results as Arun proposes. Using
"--format=full" would give what is done now by default.
And we could imagine "--format=%name \t %version \n %description" etc.



> I think we should look at a profile of ‘package->recutils’, there’s
> probably room for improvement there.

Interesting. Note that speed was the initial motivation but accuracy
is another important one, as we discussed earlier when I showed an
example with TF-IDF. And Xapian implements the state of the art (BM25)
for scoring.


All the best,
simon

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-08  9:01     ` Arun Isaac
  2020-03-08 11:33       ` Ludovic Courtès
@ 2020-03-09 12:40       ` zimoun
  1 sibling, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-09 12:40 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, Pierre Neidhardt, 39258

On Sun, 8 Mar 2020 at 10:02, Arun Isaac <arunisaac@systemreboot.net> wrote:

> >> It turns out that most of the time is spent in printing and texinfo
> >> rendering of the search results.
>
> Also, when we put all package metadata into the Xapian index, we don't
> have to look up any of the package variables in (gnu packages *) during
> `guix search` time. This also contributes substantially to the speedup.

Yes, magic power of inverted index. ;-)
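
For readers who have not met the term: an inverted index maps each word
to the packages that mention it, so a query only touches the matching
entries instead of folding over every package. A toy version in plain
Guile, with hypothetical names ('build-inverted-index', 'lookup') and
the two packages quoted in this thread as data. Xapian's real on-disk
structure is of course far more elaborate.

```scheme
;; Build a hash table mapping each lower-case word to the list of
;; package names whose text contains it.  PACKAGES is an alist of
;; (name . text) pairs.
(define (build-inverted-index packages)
  (let ((index (make-hash-table)))
    (for-each
     (lambda (entry)
       (let ((name (car entry)))
         (for-each
          (lambda (word)
            (let ((postings (hash-ref index word '())))
              ;; Record each package at most once per word.
              (unless (member name postings)
                (hash-set! index word (cons name postings)))))
          (string-tokenize (string-downcase (cdr entry))
                           char-set:letter))))
     packages)
    index))

;; Return the names of the packages that mention WORD.
(define (lookup index word)
  (hash-ref index (string-downcase word) '()))

(define sample-index
  (build-inverted-index
   '(("inkscape" . "Vector graphics editor")
     ("inklingreader"
      . "Wacom Inkling sketch format conversion and manipulation"))))

(display (lookup sample-index "editor"))
(newline)
```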



> > Also, if the 12K+ descriptions need to be rendered at the time the user
> > runs ‘guix pull’, the experience may not be great, because it could take
> > a bit of time.
>
> This is a problem, but I would see it as a necessary "compilation"
> step. :-P In fact, this whole patchset speeds up `guix search` by doing
> part of the work of `guix search` ahead of time. So, some such cost is
> unavoidable.

Currently "guix pull" is rather slow on my machine. I would accept a
couple of seconds more (even minutes).
So this compilation step could be done at "guix pull" time.
Or we could even imagine something indexing in the background.


> > What I like about the recutils format in this context is that it’s both
> > human- and machine-readable.  The examples in the manual show how it can
> > be useful to select the information displayed or to refine the search
> > (info "(guix) Invoking guix package").
>
> Xapian's query language is much more natural (as in natural language)
> than the regexp based techniques we need to use with recutils. I have
> hardly ever used the regexp based search and I suspect many others
> haven't either. Also, refining the search query should be easier to do
> with Xapian. We could even use Xapian's query expansion feature to
> suggest improved queries to the user.
>
> That said, if we want the recutils format, we can still keep it in a
> simplified form like so.
>
> name: inkscape
> version: 0.92.4
> synopsis: Vector graphics editor
>
> name: inklingreader
> version: 0.8
> synopsis: Wacom Inkling sketch format conversion and manipulation
>
> > Also: I’d recommend tackling one thing at a time.  :-)
>
> I totally agree, but I'm tempted to say that pre-rendering would be a
> lot cheaper with the simplified form of search results. :-)

IMHO, we "just" need to offer different outputs, mimicking "git log
--format". Something like "guix search --format=".

What do you think?



All the best,
simon

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-08 11:33       ` Ludovic Courtès
  2020-03-08 20:27         ` Arun Isaac
  2020-03-09  7:50         ` Pierre Neidhardt
@ 2020-03-09 12:47         ` zimoun
  2 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-09 12:47 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Arun Isaac, Pierre Neidhardt, 39258

On Sun, 8 Mar 2020 at 12:33, Ludovic Courtès <ludo@gnu.org> wrote:
> Arun Isaac <arunisaac@systemreboot.net> skribis:

> > This is a problem, but I would see it as a necessary "compilation"
> > step. :-P In fact, this whole patchset speeds up `guix search` by doing
> > part of the work of `guix search` ahead of time. So, some such cost is
> > unavoidable.
>
> Yeah.  I think we need to take the whole user experience into account,
> not just ‘guix search’.  ‘guix pull’ already feels very slow, and it’s a
> fairly common operation.  Conversely, ‘guix search’ takes roughly
> between 0.5 and 2 seconds and is an uncommon operation on a “slow path”
> (in the sense that when you’re searching for software, you’ll probably
> have to spend more than a couple of seconds to find what you’re looking
> for.)

We could imagine something doing the job of indexing in the
background; using the daemon or whatever.


> >> What I like about the recutils format in this context is that it’s both
> >> human- and machine-readable.  The examples in the manual show how it can
> >> be useful to select the information displayed or to refine the search
> >> (info "(guix) Invoking guix package").
> >
> > Xapian's query language is much more natural (as in natural language)
> > than the regexp based techniques we need to use with recutils. I have
> > hardly ever used the regexp based search and I suspect many others
> > haven't either. Also, refining the search query should be easier to do
> > with Xapian. We could even use Xapian's query expansion feature to
> > suggest improved queries to the user.
>
> I’m not sufficiently familiar with Xapian’s query language.  The
> examples I had in mind were:
>
>   guix search malloc | recsel -p name,version,relevance
>   guix search | recsel -p name -e 'license ~ "LGPL 3"'
>   guix search crypto library | \
>     recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis

I think these examples are good ones to benchmark the different approaches.
Speed is one thing; accuracy is another.

Let's cut the "slow path" by providing a better experience when searching. ;-)


> It’s not so much about regexps than it is about selecting individual
> fields.

The regexp should actually be provided directly to "guix search";
'recsel' is only a "filter" that allows dealing differently with the
fields.



> To me, adding 20–50 seconds on ‘guix pull’ would be undesirable.  :-/

Ok, at least it is clear. :-)
And computing in the background?


All the best,
simon

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-09  7:42           ` Pierre Neidhardt
@ 2020-03-09 12:50             ` zimoun
  0 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-09 12:50 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: Arun Isaac, Ludovic Courtès, 39258

On Mon, 9 Mar 2020 at 08:42, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>
> Arun Isaac <arunisaac@systemreboot.net> writes:
>
> >> I’m not sufficiently familiar with Xapian’s query language.  The
> >> examples I had in mind were:
> >> It’s not so much about regexps than it is about selecting individual
> >> fields.
> >
> > I have totally not tested this, but I imagine that equivalent Xapian
> > queries might look something like:
> >
> >>   guix search | recsel -p name -e 'license ~ "LGPL 3"'
> >
> > guix search license:LGPL3
> >
> >>   guix search crypto library | \
> >>     recsel -e '! (name ~ "^(ghc|perl|python|ruby)")' -p name,synopsis
> >
> > guix search crypto library AND (NOT ghc) AND (NOT perl) AND (NOT python)
> > AND (NOT ruby)
>
> Indeed, if you look at the notmuch-search-terms man page, you'll see
> that you can select fields.
> In my opinion, the recsel format is fully superseded by Xapian.

No!
Because implementing the "fields" using Xapian is not done yet and it is
not as straightforward as it seems.
For sure, Xapian could do a lot of things. But we should move one step
at a time.

Let's first focus on speed and accuracy. For example, the fact that
"guix search emacs" does not return the package 'emacs' first when using
Xapian is really an issue.

Cheers,
simon

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-09  7:50         ` Pierre Neidhardt
  2020-03-09 10:28           ` Ludovic Courtès
@ 2020-03-09 12:53           ` zimoun
  1 sibling, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-09 12:53 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: Ludovic Courtès, Arun Isaac, 39258

On Mon, 9 Mar 2020 at 08:50, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> Back to the topic: I believe that Xapian is a huge win both for the
> shell and the future GUI :)

I agree.
The big win is to test the inverted index strategy and,
at the same time, the state of the art in scoring (relevance).

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-09 10:28           ` Ludovic Courtès
@ 2020-03-09 13:03             ` zimoun
  0 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-09 13:03 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Arun Isaac, Pierre Neidhardt, 39258

On Mon, 9 Mar 2020 at 11:29, Ludovic Courtès <ludo@gnu.org> wrote:

> > Back to the topic: I believe that Xapian is a huge win both for the
> > shell and the future GUI :)
>
> It could be, but we need to consider all the aspects of the story,
> including the maintenance cost and overhead moved to ‘guix pull’.  So
> it’s not so much about “beliefs” at this point, but rather about
> demonstrating what can be done, and I’m glad Arun is exploring that
> space!

I agree.
What is currently tested with Xapian is:
 1- speeding up (or not) using an inverted index
 2- the accuracy using the state of the art in information retrieval (BM25)

About 1- I do not have a strong opinion, even if I find "guix search"
terribly slow, as I mentioned earlier (one year ago ;-)).

About 2- as I mentioned earlier, the 'relevance' function could be
improved. Currently, the score is computed by considering only the
package itself and not the other packages (the words they use, their
number, etc.). BM25 is the state of the art and uses what I tried to
explain some time ago when I showed TF-IDF as an example. So the
question is what the best move is to improve the accuracy. The
improvement necessarily uses a global index (of terms, at least). But on
the other hand, the improvement might not pay off, because it would add
complexity and burden that outweigh the improvement itself.

Without testing, we cannot say. Thank you Arun for pushing forward.


All the best,
simon

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v2 1/3] build-self: Add guile-xapian to Guix dependencies.
  2020-03-07 13:31   ` [bug#39258] [PATCH v2 1/3] build-self: Add guile-xapian to Guix dependencies Arun Isaac
@ 2020-03-09 18:14     ` zimoun
  2020-03-09 23:40     ` Jonathan Brielmaier
  1 sibling, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-09 18:14 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, Pierre Neidhardt, 39258

On Sat, 7 Mar 2020 at 14:31, Arun Isaac <arunisaac@systemreboot.net> wrote:

> diff --git a/build-aux/build-self.scm b/build-aux/build-self.scm
> index f2e785b7f1..05d0353ccf 100644
> --- a/build-aux/build-self.scm
> +++ b/build-aux/build-self.scm
> @@ -1,5 +1,6 @@
>  ;;; GNU Guix --- Functional package management for GNU
>  ;;; Copyright © 2014, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
> +;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
>  ;;;
>  ;;; This file is part of GNU Guix.
>  ;;;
> @@ -261,6 +262,10 @@ interface (FFI) of Guile.")
>                   #~(define-module (gcrypt hash)
>                       #:export (sha1 sha256))))
>
> +  (define fake-xapian-hash
> +    ;; Fake (xapian xapian) module; see below.
> +    (scheme-file "xapian.scm" #~(define-module (xapian xapian))))
> +

Why 'fake-xapian-hash' and not simply 'fake-xapian'?

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v2 2/3] gnu: Generate Xapian package search index.
  2020-03-07 13:31   ` [bug#39258] [PATCH v2 2/3] gnu: Generate Xapian package search index Arun Isaac
@ 2020-03-09 18:19     ` zimoun
  0 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-09 18:19 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, Pierre Neidhardt, 39258

On Sat, 7 Mar 2020 at 14:31, Arun Isaac <arunisaac@systemreboot.net> wrote:

> diff --git a/gnu/packages.scm b/gnu/packages.scm
> index d22c992bb1..c8e221de68 100644
> --- a/gnu/packages.scm
> +++ b/gnu/packages.scm

[...]

> @@ -426,6 +429,43 @@ reducing the memory footprint."
>                                 #:opts '(#:to-file? #t)))))
>    cache-file)
>
> +(define %package-search-index
> +  ;; Location of the package search-index
> +  "/lib/guix/package-search.index")
> +
> +(define (generate-package-search-index directory)
> +  "Generate under DIRECTORY a Xapian index of all the available packages."
> +  (define db-path
> +    (string-append directory %package-search-index))
> +
> +  (mkdir-p (dirname db-path))
> +  (call-with-writable-database db-path
> +    (lambda (db)
> +      (fold-packages (lambda (package _)
> +                       (let* ((idterm (string-append "Q" (package-name package)))
> +                              (doc (make-document #:data (string-trim-right
> +                                                          (call-with-output-string
> +                                                            (cut package->recutils package <>))
> +                                                          #\newline)
> +                                                  #:terms `((,idterm . 0))))
> +                              (term-generator (make-term-generator #:stem (make-stem "en")
> +                                                                   #:document doc)))
> +                         (for-each (match-lambda
> +                                     ((field . weight)
> +                                      (match (field package)
> +                                        ((? string? str)
> +                                         (index-text! term-generator str
> +                                                      #:wdf-increment weight))
> +                                        ((lst ...)
> +                                         (for-each (cut index-text! term-generator <>
> +                                                        #:wdf-increment weight)
> +                                                   lst)))
> +                                      (replace-document! db idterm doc)))
> +                                   %package-metrics)))
> +                     #f)))
> +
> +  db-path)

If I understand correctly, the index is stored with a weight coming
from '%package-metrics', right? Well, I am not convinced it is the
correct way but I have not tried by myself yet. :-)
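
My reading of the patch, as a toy model in plain Guile: each occurrence
of a term in a field adds that field's weight to the term's
within-document frequency (wdf), which is what index-text! with
#:wdf-increment appears to do. The procedure name 'weighted-wdf', the
weights and the description text below are made up; they are not the
real values from %package-metrics.

```scheme
;; FIELDS is a list of (text . weight) pairs.  Return a hash table
;; mapping each lower-case term to its accumulated weight: every
;; occurrence of a term adds the weight of the field it occurs in.
(define (weighted-wdf fields)
  (let ((table (make-hash-table)))
    (for-each
     (lambda (field)
       (for-each
        (lambda (term)
          (hash-set! table term
                     (+ (cdr field) (hash-ref table term 0))))
        (string-tokenize (string-downcase (car field))
                         char-set:letter)))
     fields)
    table))

;; Invented weights and description, for illustration only.
(define inkscape-wdf
  (weighted-wdf
   '(("inkscape" . 4)                                ;name
     ("Vector graphics editor" . 3)                  ;synopsis
     ("Inkscape is a vector graphics editor" . 2)))) ;description
```

With this model a term that appears in the name and again in the
description ends up with the sum of both weights, so the field weights
are baked into the index at "guix pull" time rather than applied when
searching.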

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v2 1/3] build-self: Add guile-xapian to Guix dependencies.
  2020-03-07 13:31   ` [bug#39258] [PATCH v2 1/3] build-self: Add guile-xapian to Guix dependencies Arun Isaac
  2020-03-09 18:14     ` zimoun
@ 2020-03-09 23:40     ` Jonathan Brielmaier
  2020-03-10  5:24       ` Arun Isaac
  1 sibling, 1 reply; 126+ messages in thread
From: Jonathan Brielmaier @ 2020-03-09 23:40 UTC (permalink / raw)
  To: Arun Isaac, 39258; +Cc: ludo, mail, zimon.toutoune

On 07.03.20 14:31, Arun Isaac wrote:
> * build-aux/build-self.scm (build-program): Import fake guile-xapian module.
> * guix/self.scm (compiled-guix): Add guile-xapian to Guix dependencies.

Could you please trigger a release (e.g. 0.0.1) for guile-xapian? This
would make it easier for distributions like openSUSE to package it.

* [bug#39258] [PATCH v2 1/3] build-self: Add guile-xapian to Guix dependencies.
  2020-03-09 23:40     ` Jonathan Brielmaier
@ 2020-03-10  5:24       ` Arun Isaac
  0 siblings, 0 replies; 126+ messages in thread
From: Arun Isaac @ 2020-03-10  5:24 UTC (permalink / raw)
  To: Jonathan Brielmaier; +Cc: 39258

Jonathan Brielmaier <jonathan.brielmaier@web.de> writes:

> Could you please trigger a release (e.g. 0.0.1) for guile-xapian? This
> would make it easier for distributions like openSUSE to package it.

Done! :-) See
https://git.systemreboot.net/guile-xapian/snapshot/guile-xapian-0.1.0.tar.xz

But I must warn you that guile-xapian is terribly incomplete and
terribly unstable. The API may change at any time without notice.

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-09 10:35           ` Ludovic Courtès
@ 2020-03-10 14:17             ` Arun Isaac
  2020-03-10 14:33               ` zimoun
  2020-03-11 13:50               ` Ludovic Courtès
  0 siblings, 2 replies; 126+ messages in thread
From: Arun Isaac @ 2020-03-10 14:17 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: mail, 39258, zimon.toutoune

> Note that Texinfo rendering doesn’t use (@ (guix ui) fill-paragraph).
> It has its own paragraph-filling code.  We cannot use ‘fill-paragraph’
> after Texinfo rendering anyway, since Texinfo knows where things can be
> filled and where they cannot—e.g., @example.

True, I did not think of this.

> We could provide a substitute for users who use only the official 'guix
> channel.  However, as soon as users combine multiple channels, they’ll
> have to build the index locally.

We could build a separate Xapian database for each channel. Xapian does
support searching across multiple databases at once and will handle
merging the results together appropriately. If I understand correctly,
this means we can provide substitutes for at least the official guix
channel and let the user build the index locally for other channels. Is
that correct?
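
The merging idea can be sketched in plain Guile, without Xapian. This is
only an illustration under assumptions: the result lists, their names and
their scores are hypothetical, and a real Xapian search over several
databases does the merging (and keeps the weights comparable) itself.

```scheme
(use-modules (srfi srfi-1))

;; Hypothetical per-channel results: lists of (NAME . SCORE) pairs, one
;; list per index.  The names and scores are made up for illustration.
(define guix-results  '(("inkscape" . 12) ("scribus" . 3)))
(define other-results '(("inkscape-next" . 7)))

(define (merge-results . result-lists)
  ;; Concatenate the per-channel lists and order them by decreasing score.
  ;; 'stable-sort' keeps the relative order of entries with equal scores.
  (stable-sort (concatenate result-lists)
               (lambda (a b) (> (cdr a) (cdr b)))))

(define merged (merge-results guix-results other-results))
```

Here MERGED interleaves entries from both channels by score. The sketch
assumes the scores of the different indexes are on the same scale, which is
the part a search library has to get right.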

Also, could someone please build the patchset v2 on their machine and
measure the time taken by generate-package-search-index? My laptop, and
particularly its HDD, is slow even as far as HDDs go, so my figure of
20-50 seconds may not be representative.

Thanks,
Arun.

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-10 14:17             ` Arun Isaac
@ 2020-03-10 14:33               ` zimoun
  2020-03-11 13:50               ` Ludovic Courtès
  1 sibling, 0 replies; 126+ messages in thread
From: zimoun @ 2020-03-10 14:33 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, Pierre Neidhardt, 39258

Hi Arun,

On Tue, 10 Mar 2020 at 15:18, Arun Isaac <arunisaac@systemreboot.net> wrote:

> > We could provide a substitute for users who use only the official 'guix
> > channel.  However, as soon as users combine multiple channels, they’ll
> > have to build the index locally.
>
> We could build a separate Xapian database for each channel. Xapian does
> support searching across multiple databases at once and will handle
> merging the results together appropriately. If I understand correctly,
> this means we can provide substitutes for at least the official guix
> channel and let the user build the index locally for other channels. Is
> that correct?

To add to that, one could also imagine indexing the whole history in
the same way as any other channel. It needs some thought, but it seems
a path I would like to follow.


> Also, could someone please build the patchset v2 on their machine and
> measure the time taken by generate-package-search-index? My laptop, and
> particularly its HDD, is slow even as far as HDDs go, so my figure of
> 20-50 seconds may not be representative.

I will do so once I am fully back. :-)


All the best,
simon

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-10 14:17             ` Arun Isaac
  2020-03-10 14:33               ` zimoun
@ 2020-03-11 13:50               ` Ludovic Courtès
  2020-03-13  5:37                 ` Arun Isaac
  1 sibling, 1 reply; 126+ messages in thread
From: Ludovic Courtès @ 2020-03-11 13:50 UTC (permalink / raw)
  To: Arun Isaac; +Cc: mail, 39258, zimon.toutoune

Hello!

Arun Isaac <arunisaac@systemreboot.net> skribis:

>> We could provide a substitute for users who use only the official 'guix
>> channel.  However, as soon as users combine multiple channels, they’ll
>> have to build the index locally.
>
> We could build a separate Xapian database for each channel. Xapian does
> support searching across multiple databases at once and will handle
> merging the results together appropriately.

Nice!

> If I understand correctly, this means we can provide substitutes for
> at least the official guix channel and let the user build the index
> locally for other channels. Is that correct?

I’m afraid not, or at least not trivially.

Currently, profile hooks such as those in ‘%channel-profile-hooks’ receive
a complete profile—in this case, the composition of all the channels the
user chose.

So if we want to achieve what you propose, we’d need to find another way
to hook database generation.


BTW, there’s also the problem of modules added dynamically with
$GUIX_PACKAGE_PATH or ‘-L’.  With the proposed scheme, it seems that
they could no longer be searched.  Is that correct?

(Conversely the package cache is optional: it’s only used when it’s
considered authoritative, see (gnu packages).  The API and behavior are
exactly the same whether or not the package cache is used.)
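
The "optional cache" pattern described above can be sketched as follows.
This is a simplified illustration, not the code in (gnu packages): the
procedure names, the association lists and the AUTHORITATIVE? flag are all
assumptions made for the example.

```scheme
;; Hypothetical stand-in for the full package set, as (NAME . VERSION)
;; pairs.  In the real code this would mean loading package modules.
(define all-packages '(("hello" . "2.10") ("inkscape" . "0.92.4")))

(define (slow-lookup name)
  ;; Stand-in for the uncached code path.
  (assoc-ref all-packages name))

(define (make-lookup cache authoritative?)
  ;; Return a lookup procedure with the same interface whether or not
  ;; CACHE is used.  CACHE is consulted only when it is present and
  ;; AUTHORITATIVE? is true; otherwise fall back to the slow path.
  (lambda (name)
    (if (and cache authoritative?)
        (assoc-ref cache name)
        (slow-lookup name))))

;; A hypothetical cache that covers only part of the package set.
(define cached-lookup   (make-lookup '(("hello" . "2.10")) #t))
(define fallback-lookup (make-lookup '(("hello" . "2.10")) #f))
```

With these definitions, FALLBACK-LOOKUP still finds "inkscape" because it
ignores the cache, whereas CACHED-LOOKUP trusts the cache and does not.
That is why the cache may only be trusted when it is known to be complete.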

Thanks,
Ludo’.

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-11 13:50               ` Ludovic Courtès
@ 2020-03-13  5:37                 ` Arun Isaac
  2020-03-15 20:40                   ` Ludovic Courtès
  0 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-03-13  5:37 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: mail, 39258, zimon.toutoune

> Currently, profile hooks such as ‘%channel-profile-hooks’, receive a
> complete profile—in this case, the composition of all the channels the
> user chose.
>
> So if we want to achieve what you propose, we’d need to find another way
> to hook database generation.

Hmmm. Tough luck, I suppose. Do you have suggestions for anywhere else
to hook database generation?

> BTW, there’s also the problem of modules added dynamically with
> $GUIX_PACKAGE_PATH or ‘-L’.  With the proposed scheme, it seems that
> they could no longer be searched.  Is that correct?

Unfortunately, that is correct. To address this, we discussed retaining
the current search implementation along with the new xapian
implementation. But, that changes the search query behaviour and
adds a lot of complexity. I'll think of some other way out.

> (Conversely the package cache is optional: it’s only used when it’s
> considered authoritative, see (gnu packages).  The API and behavior are
> exactly the same whether or not the package cache is used.)

Thanks,
Arun

* [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
  2020-03-13  5:37                 ` Arun Isaac
@ 2020-03-15 20:40                   ` Ludovic Courtès
  0 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-03-15 20:40 UTC (permalink / raw)
  To: Arun Isaac; +Cc: mail, 39258, zimon.toutoune

Hi Arun,

Arun Isaac <arunisaac@systemreboot.net> skribis:

>> Currently, profile hooks such as ‘%channel-profile-hooks’, receive a
>> complete profile—in this case, the composition of all the channels the
>> user chose.
>>
>> So if we want to achieve what you propose, we’d need to find another way
>> to hook database generation.
>
> Hmmm. Tough luck, I suppose. Do you have suggestions for anywhere else
> to hook database generation?

For the core database (packages that come with Guix), (guix self) could
take care of it.

>> BTW, there’s also the problem of modules added dynamically with
>> $GUIX_PACKAGE_PATH or ‘-L’.  With the proposed scheme, it seems that
>> they could no longer be searched.  Is that correct?
>
> Unfortunately, that is correct. To address this, we discussed retaining
> the current search implementation along with the new xapian
> implementation. But, that changes the search query behaviour and
> adds a lot of complexity. I'll think of some other way out.

Yeah, I think we’d want to have roughly a single implementation.

I wonder if the relevant metrics that Xapian implements, like zimoun
mentioned, could be available directly in Scheme in a way that allows us
to compute them at run time when the pre-built cache is unavailable.  Or
would that necessarily be too slow?

If so, perhaps a slightly less fancy metric could work with better
performance?
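
To make "a slightly less fancy metric" concrete, here is a minimal sketch
of a field-weighted match count in plain Guile. It is an assumption made
for illustration, not the metric Xapian implements and not the existing
‘relevance’ procedure in (guix ui): the record is an association list, and
the field names and weights are hypothetical.

```scheme
(use-modules (ice-9 regex)
             (srfi srfi-1))

;; Hypothetical record standing in for a package.
(define sample
  '((name . "inkscape")
    (synopsis . "Vector graphics editor")
    (description . "Inkscape is a vector graphics editor.")))

;; Hypothetical metrics: (FIELD . WEIGHT) pairs.
(define %metrics
  '((name . 4) (synopsis . 3) (description . 2)))

(define (count-matches regexp str)
  ;; Number of non-overlapping matches of REGEXP in STR.
  (fold-matches regexp str 0
                (lambda (m count) (+ count 1))))

(define (simple-relevance record regexps metrics)
  ;; Sum, over every regexp and every field, of the number of matches
  ;; multiplied by the weight of the field.
  (fold (lambda (regexp total)
          (fold (lambda (metric total)
                  (let ((str (or (assq-ref record (car metric)) "")))
                    (+ total
                       (* (cdr metric) (count-matches regexp str)))))
                total
                metrics))
        0
        regexps))

(define score
  (simple-relevance sample
                    (list (make-regexp "inkscape" regexp/icase))
                    %metrics))
```

For the sample record, "inkscape" matches the name once (weight 4) and the
description once (weight 2), so SCORE is 6. A metric of this kind needs no
pre-built index, but it does not account for term rarity or document
length the way a search library typically does.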

Thanks,
Ludo’.

* [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search
  2020-01-23 19:51 [bug#39258] Faster guix search using an sqlite cache Arun Isaac
                   ` (2 preceding siblings ...)
  2020-03-07 13:31 ` [bug#39258] [PATCH v2 0/3] " Arun Isaac
@ 2020-03-27 16:26 ` Arun Isaac
  2020-03-27 16:26   ` [bug#39258] [PATCH v3 1/3] guix: Generate package metadata cache Arun Isaac
                     ` (4 more replies)
  2020-04-26  3:54 ` [bug#39258] benchmark search: default vs v2 vs v3 zimoun
                   ` (4 subsequent siblings)
  8 siblings, 5 replies; 126+ messages in thread
From: Arun Isaac @ 2020-03-27 16:26 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, mail, ludo, zimon.toutoune

Hi everyone,

This is v3 of my attempt to make guix search faster. In this version, I have
abandoned use of xapian. Instead I build a cache of the metadata of all
packages in a profile hook. Then, I use that cache to search and display
search results. This way, package guile modules are not loaded during guix
search.
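
The general shape of the idea, writing metadata once and reading it back
at search time, can be sketched as below. Note the swap: the patches
compile the metadata to bytecode and load it with load-compiled, whereas
this sketch uses plain ‘write’ and ‘read’ to stay self-contained. The file
name and the three-field vectors are hypothetical.

```scheme
(use-modules (ice-9 match))

;; Hypothetical metadata: one vector per package (name, version, synopsis).
(define metadata
  '(#("inkscape" "0.92.4" "Vector graphics editor")
    #("hello" "2.10" "Hello, GNU world")))

;; Hypothetical location of the demo cache file.
(define demo-cache-file
  (string-append (or (getenv "TMPDIR") "/tmp")
                 "/package-metadata-demo.scm"))

(define (write-cache file entries)
  ;; Serialize ENTRIES to FILE as a single S-expression.
  (call-with-output-file file
    (lambda (port)
      (write entries port))))

(define (read-cache file)
  ;; Read back the S-expression written by 'write-cache'.
  (call-with-input-file file read))

(define (cached-names file)
  ;; Extract package names from the cache without loading any package
  ;; module.
  (map (match-lambda
         (#(name version synopsis) name))
       (read-cache file)))

(write-cache demo-cache-file metadata)
(define names (cached-names demo-cache-file))
```

Searching then only has to walk the deserialized vectors, which is where
the time saved over loading every package module comes from.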

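Speedup is around 2x. Both measurements below are with a warm cache: the
first is the current guix search, the second is guix search from a profile
built with this patchset.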

--8<---------------cut here---------------start------------->8---
$ time guix search inkscape

real	0m1.722s
user	0m1.776s
sys	0m0.097s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
$ time /tmp/test/bin/guix search inkscape

real	0m0.749s
user	0m0.770s
sys	0m0.020s
--8<---------------cut here---------------end--------------->8---

This patchset does not affect the search API nor does it improve the relevance
of search results. If there is interest in this approach, I'll complete this
patchset properly. But, in the long run, I do think we should aim to get
xapian or the like for guix search. WDYT?

Unfortunately, generate-package-metadata-cache takes 43 seconds to build the
cache on my relatively slow computer. Performance should be better on other
people's machines.

Meanwhile, it would still be useful if someone built patchset v2 on their
machine and reported the time it took to build the xapian index.

* How to test this patchset

Apply patches and build as usual. Do a guix pull into a temporary profile.

$ ./pre-inst-env guix pull --url=$PWD --branch=the-name-of-the-branch-you-applied-patches-to -p /tmp/test

Then, run guix search from the built profile

$ /tmp/test/bin/guix search inkscape

Thanks!

Arun Isaac (3):
  guix: Generate package metadata cache.
  guix: Search package metadata cache.
  guix: Use package metadata cache for package search.

 gnu/packages.scm         |  88 +++++++++++++++++++++++++-
 guix/channels.scm        |  34 +++++++++-
 guix/packages.scm        |  32 ++++++++++
 guix/scripts/package.scm |   5 +-
 guix/ui.scm              | 132 ++++++++++++++++++++++++++++++++++++---
 5 files changed, 277 insertions(+), 14 deletions(-)

-- 
2.25.1

* [bug#39258] [PATCH v3 1/3] guix: Generate package metadata cache.
  2020-03-27 16:26 ` [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search Arun Isaac
@ 2020-03-27 16:26   ` Arun Isaac
  2020-04-24 20:48     ` Ludovic Courtès
  2020-03-27 16:26   ` [bug#39258] [PATCH v3 2/3] guix: Search " Arun Isaac
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-03-27 16:26 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, mail, ludo, zimon.toutoune

* gnu/packages.scm (%package-metadata-cache-file): New variable.
(generate-package-metadata-cache): New function.
* guix/channels.scm (package-metadata-cache-file): New function.
(%channel-profile-hooks): Add package-metadata-cache-file.
---
 gnu/packages.scm  | 50 ++++++++++++++++++++++++++++++++++++++++++++++-
 guix/channels.scm | 34 +++++++++++++++++++++++++++++++-
 2 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/gnu/packages.scm b/gnu/packages.scm
index d22c992bb1..c0b527acf0 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -4,6 +4,7 @@
 ;;; Copyright © 2014 Eric Bavier <bavier@member.fsf.org>
 ;;; Copyright © 2016, 2017 Alex Kost <alezost@gmail.com>
 ;;; Copyright © 2016 Mathieu Lirzin <mthl@gnu.org>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -64,7 +65,8 @@
             specification->location
             specifications->manifest
 
-            generate-package-cache))
+            generate-package-cache
+            generate-package-metadata-cache))
 
 ;;; Commentary:
 ;;;
@@ -426,6 +428,52 @@ reducing the memory footprint."
                                #:opts '(#:to-file? #t)))))
   cache-file)
 
+(define %package-metadata-cache-file
+  ;; Location of the package metadata cache.
+  "/lib/guix/package-metadata.cache")
+
+(define (generate-package-metadata-cache directory)
+  "Generate under DIRECTORY a cache of the metadata of all available packages.
+
+The primary purpose of this cache is to speed up package metadata lookup
+during package search so that we don't have to traverse and load all the
+package modules."
+  (define cache-file
+    (string-append directory %package-metadata-cache-file))
+
+  (define (package<? p1 p2)
+    (string<? (package-full-name p1) (package-full-name p2)))
+
+  (define (expand-cache package result)
+    (cons `#(,(package-name package)
+             ,(package-version package)
+             ,(delete-duplicates
+               (map package-full-name
+                    (sort (filter package? (package-direct-inputs package))
+                          package<?)))
+             ,(package-outputs package)
+             ,(package-supported-systems package)
+             ,(package-synopsis package)
+             ,(package-description package)
+             ,(package-home-page package)
+             ,(let ((location (package-location package)))
+                (list (location-file location)
+                      (location-line location)
+                      (location-column location))))
+          result))
+
+  (define exp
+    (fold-packages expand-cache '()))
+
+  (mkdir-p (dirname cache-file))
+  (call-with-output-file cache-file
+    (lambda (port)
+      (put-bytevector port
+                      (compile `'(,@exp)
+                               #:to 'bytecode
+                               #:opts '(#:to-file? #t)))))
+  cache-file)
+
 \f
 (define %sigint-prompt
   ;; The prompt to jump to upon SIGINT.
diff --git a/guix/channels.scm b/guix/channels.scm
index f0261dc2da..c4efaa7300 100644
--- a/guix/channels.scm
+++ b/guix/channels.scm
@@ -2,6 +2,7 @@
 ;;; Copyright © 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
 ;;; Copyright © 2018 Ricardo Wurmus <rekado@elephly.net>
 ;;; Copyright © 2019 Jan (janneke) Nieuwenhuizen <janneke@gnu.org>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -581,9 +582,40 @@ be used as a profile hook."
                                                  (hook . package-cache))
                                   #:local-build? #t)))
 
+(define (package-metadata-cache-file manifest)
+  "Build a package metadata cache file for the instance in MANIFEST.  This is
+meant to be used as a profile hook."
+  (mlet %store-monad ((profile (profile-derivation manifest
+                                                   #:hooks '())))
+
+    (define build
+      #~(begin
+          (use-modules (gnu packages))
+
+          (if (defined? 'generate-package-metadata-cache)
+              (begin
+                ;; Delegate package cache generation to the inferior.
+                (format (current-error-port)
+                        "Generating package metadata cache for '~a'...~%"
+                        #$profile)
+                (generate-package-metadata-cache #$output))
+              (mkdir #$output))))
+
+    (gexp->derivation-in-inferior "guix-package-metadata-cache" build
+                                  profile
+
+                                  ;; If the Guix in PROFILE is too old and
+                                  ;; lacks 'guix repl', don't build the cache
+                                  ;; instead of failing.
+                                  #:silent-failure? #t
+
+                                  #:properties '((type . profile-hook)
+                                                 (hook . package-cache))
+                                  #:local-build? #t)))
+
 (define %channel-profile-hooks
   ;; The default channel profile hooks.
-  (cons package-cache-file %default-profile-hooks))
+  (cons* package-cache-file package-metadata-cache-file %default-profile-hooks))
 
 (define (channel-instances->derivation instances)
   "Return the derivation of the profile containing INSTANCES, a list of
-- 
2.25.1

* [bug#39258] [PATCH v3 2/3] guix: Search package metadata cache.
  2020-03-27 16:26 ` [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search Arun Isaac
  2020-03-27 16:26   ` [bug#39258] [PATCH v3 1/3] guix: Generate package metadata cache Arun Isaac
@ 2020-03-27 16:26   ` Arun Isaac
  2020-04-24 20:58     ` Ludovic Courtès
  2020-03-27 16:26   ` [bug#39258] [PATCH v3 3/3] guix: Use package metadata cache for package search Arun Isaac
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-03-27 16:26 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, mail, ludo, zimon.toutoune

* gnu/packages.scm (search-packages): New function.
* guix/packages.scm (<package-metadata>): New record type.
---
 gnu/packages.scm  | 38 ++++++++++++++++++++++++++++++++++++++
 guix/packages.scm | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/gnu/packages.scm b/gnu/packages.scm
index c0b527acf0..2510b1fe49 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -59,6 +59,7 @@
             find-packages-by-name
             find-package-locations
             find-best-packages-by-name
+            search-packages
 
             specification->package
             specification->package+output
@@ -474,6 +475,43 @@ package modules."
                                #:opts '(#:to-file? #t)))))
   cache-file)
 
+(define (search-packages profile regexps)
+  "Return a list of pairs: <package-metadata> objects corresponding to
+packages whose name, synopsis, description, or output matches at least one of
+REGEXPS sorted by relevance, and its non-zero relevance score."
+  (define cache-file
+    (string-append profile %package-metadata-cache-file))
+
+  (define cache
+    (catch 'system-error
+      (lambda ()
+        (map (match-lambda
+               (#(name version dependencies outputs systems
+                  synopsis description home-page (file line column))
+                (make-package-metadata
+                 name version dependencies outputs systems
+                 synopsis description home-page
+                 (location file line column))))
+             (load-compiled cache-file)))
+      (lambda args
+        (if (= ENOENT (system-error-errno args))
+            #f
+            (apply throw args)))))
+
+  (let ((matches
+         (filter-map (lambda (package-metadata)
+                       (let ((score (package-relevance package-metadata regexps)))
+                         (and (positive? score)
+                              (cons package-metadata score))))
+                     cache)))
+    (sort matches
+          (lambda (m1 m2)
+            (match m1
+              ((package1 . score1)
+               (match m2
+                 ((package2 . score2)
+                  (> score1 score2)))))))))
+
 \f
 (define %sigint-prompt
   ;; The prompt to jump to upon SIGINT.
diff --git a/guix/packages.scm b/guix/packages.scm
index 70b1478c91..bb06baa1ee 100644
--- a/guix/packages.scm
+++ b/guix/packages.scm
@@ -5,6 +5,7 @@
 ;;; Copyright © 2016 Alex Kost <alezost@gmail.com>
 ;;; Copyright © 2017, 2019 Efraim Flashner <efraim@flashner.co.il>
 ;;; Copyright © 2019 Marius Bakke <mbakke@fastmail.com>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -115,6 +116,21 @@
 
             transitive-input-references
 
+            package-metadata
+            make-package-metadata
+            package-metadata?
+            this-package-metadata
+            package-metadata-name
+            package-metadata-version
+            package-metadata-dependencies
+            package-metadata-outputs
+            package-metadata-synopsis
+            package-metadata-description
+            package-metadata-license
+            package-metadata-home-page
+            package-metadata-supported-systems
+            package-metadata-location
+
             %supported-systems
             %hurd-systems
             %hydra-supported-systems
@@ -310,6 +326,22 @@ name of its URI."
                                                        package)
                                                       16)))))
 
+(define-record-type* <package-metadata>
+  package-metadata make-package-metadata
+  package-metadata?
+  this-package-metadata
+  (name package-metadata-name)
+  (version package-metadata-version)
+  (dependencies package-metadata-dependencies)
+  (outputs package-metadata-outputs)
+  (supported-systems package-metadata-supported-systems)
+  (synopsis package-metadata-synopsis)
+  (description package-metadata-description)
+  ;; TODO: Add license
+  ;; (license package-metadata-license)
+  (home-page package-metadata-home-page)
+  (location package-metadata-location))
+
 (define (package-upstream-name package)
   "Return the upstream name of PACKAGE, which could be different from the name
 it has in Guix."
-- 
2.25.1

* [bug#39258] [PATCH v3 3/3] guix: Use package metadata cache for package search.
  2020-03-27 16:26 ` [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search Arun Isaac
  2020-03-27 16:26   ` [bug#39258] [PATCH v3 1/3] guix: Generate package metadata cache Arun Isaac
  2020-03-27 16:26   ` [bug#39258] [PATCH v3 2/3] guix: Search " Arun Isaac
@ 2020-03-27 16:26   ` Arun Isaac
  2020-04-24 21:03     ` Ludovic Courtès
  2020-04-05 14:08   ` [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search Ludovic Courtès
  2020-04-24 21:05   ` Ludovic Courtès
  4 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-03-27 16:26 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, mail, ludo, zimon.toutoune

* guix/scripts/package.scm (process-query): Call search-packages and
display-package-search-results instead of find-packages-by-description and
display-search-results respectively.
* guix/ui.scm (package-metadata->recutils): New function.
(%package-metrics): Use package-metadata record field accessors.
(package-relevance): Rename argument package to package-metadata.
(display-package-search-results): New function.
---
 guix/scripts/package.scm |   5 +-
 guix/ui.scm              | 132 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 125 insertions(+), 12 deletions(-)

diff --git a/guix/scripts/package.scm b/guix/scripts/package.scm
index 110d4f2977..c11f92f5a2 100644
--- a/guix/scripts/package.scm
+++ b/guix/scripts/package.scm
@@ -7,6 +7,7 @@
 ;;; Copyright © 2016 Benz Schenk <benz.schenk@uzh.ch>
 ;;; Copyright © 2016 Chris Marusich <cmmarusich@gmail.com>
 ;;; Copyright © 2019 Tobias Geerinckx-Rice <me@tobias.gr>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -770,9 +771,9 @@ processed, #f otherwise."
                                       (_                   #f))
                                     opts))
               (regexps  (map (cut make-regexp* <> regexp/icase) patterns))
-              (matches  (find-packages-by-description regexps)))
+              (matches  (search-packages (current-profile) regexps)))
          (leave-on-EPIPE
-          (display-search-results matches (current-output-port)))
+          (display-package-search-results matches (current-output-port)))
          #t))
 
       (('show requested-name)
diff --git a/guix/ui.scm b/guix/ui.scm
index 1e24fe5dca..934699f065 100644
--- a/guix/ui.scm
+++ b/guix/ui.scm
@@ -14,6 +14,7 @@
 ;;; Copyright © 2019 Chris Marusich <cmmarusich@gmail.com>
 ;;; Copyright © 2019 Tobias Geerinckx-Rice <me@tobias.gr>
 ;;; Copyright © 2019 Simon Tournier <zimon.toutoune@gmail.com>
+;;; Copyright © 2020 Arun Isaac <arunisaac@systemreboot.net>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -112,6 +113,7 @@
             package-synopsis-string
             string->recutils
             package->recutils
+            package-metadata->recutils
             package-specification->name+version+output
 
             supports-hyperlinks?
@@ -122,6 +124,7 @@
             relevance
             package-relevance
             display-search-results
+            display-package-search-results
 
             with-profile-lock
             string->generations
@@ -1484,6 +1487,75 @@ HYPERLINKS? is true, emit hyperlink escape sequences when appropriate."
             extra-fields)
   (newline port))
 
+(define* (package-metadata->recutils p port #:optional (width (%text-width))
+                                     #:key
+                                     (hyperlinks? (supports-hyperlinks? port))
+                                     (extra-fields '()))
+  "Write to PORT a `recutils' record of <package-metadata> object P, arranging
+to fit within WIDTH columns.  EXTRA-FIELDS is a list of symbol/value pairs to
+emit.  When HYPERLINKS? is true, emit hyperlink escape sequences when
+appropriate."
+  (define width*
+    ;; The available number of columns once we've taken into account space for
+    ;; the initial "+ " prefix.
+    (if (> width 2) (- width 2) width))
+
+  ;; Note: Don't i18n field names so that people can post-process it.
+  (format port "name: ~a~%" (package-metadata-name p))
+  (format port "version: ~a~%" (package-metadata-version p))
+  (format port "outputs: ~a~%" (string-join (package-metadata-outputs p)))
+  (format port "systems: ~a~%"
+          (string-join (package-metadata-supported-systems p)))
+  (format port "dependencies: ~a~%"
+          (string-join (package-metadata-dependencies p) " "))
+  (format port "location: ~a~%"
+          (or (and=> (package-metadata-location p)
+                     (if hyperlinks? location->hyperlink location->string))
+              (G_ "unknown")))
+
+  ;; Note: Starting from version 1.6 of recutils, hyphens are not allowed in
+  ;; field identifiers.
+  (format port "homepage: ~a~%" (package-metadata-home-page p))
+
+  ;; TODO: Print license
+  ;; (format port "license: ~a~%"
+  ;;         (match (package-metadata-license p)
+  ;;           (((? license? licenses) ...)
+  ;;            (string-join (map license-name licenses)
+  ;;                         ", "))
+  ;;           ((? license? license)
+  ;;            (let ((text (license-name license))
+  ;;                  (uri  (license-uri license)))
+  ;;              (if (and hyperlinks? uri (string-prefix? "http" uri))
+  ;;                  (hyperlink uri text)
+  ;;                  text)))
+  ;;           (x
+  ;;            (G_ "unknown"))))
+  (format port "synopsis: ~a~%"
+          (string-map (match-lambda
+                        (#\newline #\space)
+                        (chr       chr))
+                      (or (and=> (package-metadata-synopsis p) P_)
+                          "")))
+  (format port "~a~%"
+          (string->recutils
+           (string-trim-right
+            (parameterize ((%text-width width*))
+              (texi->plain-text
+               (string-append "description: "
+                              (or (and=> (package-metadata-description p) P_)
+                                  ""))))
+            #\newline)))
+  (for-each (match-lambda
+              ((field . value)
+               (let ((field (symbol->string field)))
+                 (format port "~a: ~a~%"
+                         field
+                         (fill-paragraph (object->string value) width*
+                                         (string-length field))))))
+            extra-fields)
+  (newline port))
+
 \f
 ;;;
 ;;; Searching.
@@ -1528,34 +1600,74 @@ score, the more relevant OBJ is to REGEXPS."
 (define %package-metrics
   ;; Metrics used to compute the "relevance score" of a package against a set
   ;; of regexps.
-  `((,package-name . 4)
+  `((,package-metadata-name . 4)
 
     ;; Match against uncommon outputs.
-    (,(lambda (package)
+    (,(lambda (package-metadata)
         (filter (lambda (output)
                   (not (member output
                                ;; Some common outpus shared by many packages.
                                '("out" "doc" "debug" "lib" "include" "bin"))))
-                (package-outputs package)))
+                (package-metadata-outputs package-metadata)))
      . 1)
 
     ;; Match regexps on the raw Texinfo since formatting it is quite expensive
     ;; and doesn't have much of an effect on search results.
-    (,(lambda (package)
-        (and=> (package-synopsis package) P_)) . 3)
-    (,(lambda (package)
-        (and=> (package-description package) P_)) . 2)
+    (,(lambda (package-metadata)
+        (and=> (package-metadata-synopsis package-metadata) P_)) . 3)
+    (,(lambda (package-metadata)
+        (and=> (package-metadata-description package-metadata) P_)) . 2)
 
     (,(lambda (type)
-        (match (and=> (package-location type) location-file)
+        (match (and=> (package-metadata-location type) location-file)
           ((? string? file) (basename file ".scm"))
           (#f "")))
      . 1)))
 
-(define (package-relevance package regexps)
+(define (package-relevance package-metadata regexps)
   "Return a score denoting the relevance of PACKAGE for REGEXPS.  A score of
 zero means that PACKAGE does not match any of REGEXPS."
-  (relevance package regexps %package-metrics))
+  (relevance package-metadata regexps %package-metrics))
+
+(define* (display-package-search-results matches port
+                                 #:key
+                                 (command "guix search"))
+  "Display MATCHES, a list of <package-metadata>/score pairs.  If PORT is a
+terminal, print at most a full screen of results."
+  (define first-line
+    (port-line port))
+
+  (define max-rows
+    (and first-line (isatty? port)
+         (terminal-rows port)))
+
+  (define (line-count str)
+    (string-count str #\newline))
+
+  (let loop ((matches matches))
+    (match matches
+      (((package-metadata . score) rest ...)
+       (let* ((links? (supports-hyperlinks? port))
+              (text   (call-with-output-string
+                        (lambda (port)
+                          (package-metadata->recutils package-metadata port
+                                                      #:hyperlinks? links?
+                                                      #:extra-fields
+                                                      `((relevance . ,score)))))))
+         (if (and (not (getenv "INSIDE_EMACS"))
+                  max-rows
+                  (> (port-line port) first-line) ;print at least one result
+                  (> (+ 4 (line-count text) (port-line port))
+                     max-rows))
+             (unless (null? rest)
+               (display-hint (format #f (G_ "Run @code{~a ... | less} \
+to view all the results.")
+                                     command)))
+             (begin
+               (display text port)
+               (loop rest)))))
+      (()
+       #t))))
 
 (define* (display-search-results matches port
                                  #:key
-- 
2.25.1


* [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search
  2020-03-27 16:26 ` [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search Arun Isaac
                     ` (2 preceding siblings ...)
  2020-03-27 16:26   ` [bug#39258] [PATCH v3 3/3] guix: Use package metadata cache for package search Arun Isaac
@ 2020-04-05 14:08   ` Ludovic Courtès
  2020-04-24 21:05   ` Ludovic Courtès
  4 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-04-05 14:08 UTC (permalink / raw)
  To: Arun Isaac; +Cc: mail, 39258, zimon.toutoune

Hi Arun,

Arun Isaac <arunisaac@systemreboot.net> skribis:

> This is v3 of my attempt to make guix search faster. In this version, I have
> abandoned use of xapian. Instead I build a cache of the metadata of all
> packages in a profile hook. Then, I use that cache to search and display
> search results. This way, package guile modules are not loaded during guix
> search.
>
> Speedup is around 2x. Both measurements below are with a warm cache.

Sorry for the delay!  Just to say that I like the approach, and I’ll
take a closer look once the release is out…

Thank you!

Ludo’.


* [bug#39258] [PATCH v3 1/3] guix: Generate package metadata cache.
  2020-03-27 16:26   ` [bug#39258] [PATCH v3 1/3] guix: Generate package metadata cache Arun Isaac
@ 2020-04-24 20:48     ` Ludovic Courtès
  2020-04-26  9:48       ` zimoun
  0 siblings, 1 reply; 126+ messages in thread
From: Ludovic Courtès @ 2020-04-24 20:48 UTC (permalink / raw)
  To: Arun Isaac; +Cc: mail, 39258, zimon.toutoune

Hi Arun,

Arun Isaac <arunisaac@systemreboot.net> skribis:

> * gnu/packages.scm (%package-metadata-cache-file): New variable.
> (generate-package-metadata-cache): New function.
> * guix/channels.scm (package-metadata-cache-file): New function.
> (%channel-profile-hooks): Add package-metadata-cache-file.

This is short and sweet, nice!

> +  (define (expand-cache package result)
> +    (cons `#(,(package-name package)
> +             ,(package-version package)
> +             ,(delete-duplicates
> +               (map package-full-name
> +                    (sort (filter package? (package-direct-inputs package))
> +                          package<?)))
> +             ,(package-outputs package)
> +             ,(package-supported-systems package)
> +             ,(package-synopsis package)
> +             ,(package-description package)
> +             ,(package-home-page package)
> +             ,(let ((location (package-location package)))
> +                (list (location-file location)
> +                      (location-line location)
> +                      (location-column location))))

I was wondering if we could omit inputs, which are not that useful.

Apart from that it LGTM.

Note that this is probably the place where we could eventually add the
computation of an inverted index like zimoun suggested in
<https://lists.gnu.org/archive/html/guix-devel/2020-01/msg00243.html>.

> +                                  #:properties '((type . profile-hook)
> +                                                 (hook . package-cache))

‘package-metadata-cache’, even (it’s for UI purposes).

Nitpick: I’d use “packages:” as the prefix in the subject line.

Thanks,
Ludo’.


* [bug#39258] [PATCH v3 2/3] guix: Search package metadata cache.
  2020-03-27 16:26   ` [bug#39258] [PATCH v3 2/3] guix: Search " Arun Isaac
@ 2020-04-24 20:58     ` Ludovic Courtès
  0 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-04-24 20:58 UTC (permalink / raw)
  To: Arun Isaac; +Cc: mail, 39258, zimon.toutoune

Arun Isaac <arunisaac@systemreboot.net> skribis:

> * gnu/packages.scm (search-packages): New function.
> * guix/packages.scm (<package-metadata>): New record type.

[...]

> +(define (search-packages profile regexps)
> +  "Return a list of pairs: <package-metadata> objects corresponding to
> +packages whose name, synopsis, description, or output matches at least one of
> +REGEXPS sorted by relevance, and its non-zero relevance score."
> +  (define cache-file
> +    (string-append profile %package-metadata-cache-file))

Here we’re missing something that checks if the cache is authoritative
and falls back to the old method if it’s not, akin to what
‘fold-available-packages’ does.
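
A minimal sketch of that fallback, assuming nothing about the actual
patch: the procedure name and its arguments are hypothetical, and the
"authoritative?" check and both search paths are passed in as plain
procedures so that the snippet runs without Guix.

```scheme
;; Hypothetical sketch, not code from the patch.  Use the cache only when
;; it is authoritative and the file exists; otherwise take the slow path.
(define (search-with-fallback cache-file authoritative?
                              search-cache search-slow)
  (if (and (authoritative?) (file-exists? cache-file))
      (search-cache cache-file)
      (search-slow)))

;; Toy usage: the cache file does not exist, so the slow path is taken.
(define (demo)
  (search-with-fallback "/nonexistent/package-metadata.cache"
                        (lambda () #t)
                        (lambda (file) 'from-cache)
                        (lambda () 'from-modules)))
```

The point of the design is that a missing or non-authoritative cache
never changes the results, only the speed.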

> +  (define cache
> +    (catch 'system-error
> +      (lambda ()
> +        (map (match-lambda
> +               (#(name version dependencies outputs systems
> +                  synopsis description home-page (file line column))
> +                (make-package-metadata
> +                 name version dependencies outputs systems
> +                 synopsis description home-page
> +                 (location file line column))))
> +             (load-compiled cache-file)))

I realize the other cache also has that problem, but it would be nice to
add a version tag to the cache.  Basically emit something like:

  (package-metadata-cache (version 0) VECTOR …)

instead of just:

  (VECTOR …)
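
A sketch of how a reader could check such a tag, assuming the layout
shown above; ‘%cache-version’ and ‘cache-entries’ are illustrative
names, not part of the patch.

```scheme
(use-modules (ice-9 match))

;; Assumed current format version; the name is illustrative.
(define %cache-version 0)

(define (cache-entries sexp)
  "Return the list of entries in SEXP if it is a cache tagged with the
expected version, or #f so that the caller can fall back."
  (match sexp
    (('package-metadata-cache ('version (? integer? version)) entries ...)
     (and (= version %cache-version) entries))
    (_ #f)))
```

An untagged ‘(VECTOR …)’ list or a different version number yields #f
rather than a misparsed cache.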

> +(define-record-type* <package-metadata>
> +  package-metadata make-package-metadata
> +  package-metadata?
> +  this-package-metadata
> +  (name package-metadata-name)
> +  (version package-metadata-version)
> +  (dependencies package-metadata-dependencies)
> +  (outputs package-metadata-outputs)
> +  (supported-systems package-metadata-supported-systems)
> +  (synopsis package-metadata-synopsis)
> +  (description package-metadata-description)
> +  ;; TODO: Add license
> +  ;; (license package-metadata-license)
> +  (home-page package-metadata-home-page)
> +  (location package-metadata-location))

I’m not comfortable with this data structure duplication, especially
right in (guix packages), but I’m not sure it’s avoidable.
‘fold-available-packages’ avoids it by passing all the fields as
arguments to the fold procedure; I’m not sure if that’s applicable here.

Thanks,
Ludo’.


* [bug#39258] [PATCH v3 3/3] guix: Use package metadata cache for package search.
  2020-03-27 16:26   ` [bug#39258] [PATCH v3 3/3] guix: Use package metadata cache for package search Arun Isaac
@ 2020-04-24 21:03     ` Ludovic Courtès
  0 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-04-24 21:03 UTC (permalink / raw)
  To: Arun Isaac; +Cc: mail, 39258, zimon.toutoune

Hello,

Arun Isaac <arunisaac@systemreboot.net> skribis:

> * guix/scripts/package.scm (process-query): Call search-packages and
> display-package-search-results instead of find-packages-by-description and
> display-search-results respectively.
> * guix/ui.scm (package-metadata->recutils): New function.
> (%package-metrics): Use package-metadata record field accessors.
> (package-relevance): Rename argument package to package-metadata.
> (display-package-search-results): New function.

[...]

> +(define* (package-metadata->recutils p port #:optional (width (%text-width))
> +                                     #:key
> +                                     (hyperlinks? (supports-hyperlinks? port))
> +                                     (extra-fields '()))
> +  "Write to PORT a `recutils' record of <package-metadata> object P, arranging
> +to fit within WIDTH columns.  EXTRA-FIELDS is a list of symbol/value pairs to
> +emit.  When HYPERLINKS? is true, emit hyperlink escape sequences when
> +appropriate."

I think we should avoid copy/paste of ‘package->recutils’.

How about factorizing by having a common procedure that takes the fields
as keyword arguments instead of taking a record?
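
Something along these lines, as a rough sketch only: the procedure
name and the reduced set of fields are made up for illustration, and
the real ‘package->recutils’ does much more (line filling, hyperlinks,
and so on).

```scheme
;; Hypothetical helper: it takes plain field values as keyword arguments,
;; so it does not care which record type they were extracted from.
(define* (write-recutils-fields port #:key name version synopsis
                                (extra-fields '()))
  "Write a recutils-style record made of the given field values to PORT."
  (format port "name: ~a~%" name)
  (format port "version: ~a~%" version)
  (format port "synopsis: ~a~%" synopsis)
  (for-each (lambda (field)
              (format port "~a: ~a~%" (car field) (cdr field)))
            extra-fields)
  (newline port))

;; Toy usage: render one record to a string.
(define (demo)
  (call-with-output-string
    (lambda (port)
      (write-recutils-fields port
                             #:name "hello" #:version "2.10"
                             #:synopsis "Hello, GNU world"
                             #:extra-fields '((relevance . 4))))))
```

Both the <package> and the <package-metadata> front ends would then
only extract fields and call the shared procedure.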

>  (define %package-metrics
>    ;; Metrics used to compute the "relevance score" of a package against a set
>    ;; of regexps.
> -  `((,package-name . 4)
> +  `((,package-metadata-name . 4)

Here we would also need to arrange so that this can apply to both a
<package> and <package-metadata> (or whatever), perhaps by defining the
two sets of metrics at once, or defining the second one by mapping over
the first one.
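
For instance, here is a sketch of the “define both sets at once” idea.
The two toy record types stand in for <package> and
<package-metadata>, and ‘score’ is a simplified substring-based
stand-in for ‘relevance’, so all names below are assumptions.

```scheme
(use-modules (srfi srfi-1) (srfi srfi-9))

;; Toy stand-ins for <package> and <package-metadata>.
(define-record-type <pkg>
  (make-pkg name synopsis) pkg?
  (name pkg-name) (synopsis pkg-synopsis))

(define-record-type <pkg-metadata>
  (make-pkg-metadata name synopsis) pkg-metadata?
  (name pkg-metadata-name) (synopsis pkg-metadata-synopsis))

;; The weights are stated once, keyed by field name.
(define %field-weights '((name . 4) (synopsis . 3)))

(define (make-metrics accessors)
  "ACCESSORS is an alist mapping field names to accessor procedures.
Return an alist of accessor/weight pairs."
  (map (lambda (field+weight)
         (cons (assq-ref accessors (car field+weight))
               (cdr field+weight)))
       %field-weights))

(define %pkg-metrics
  (make-metrics `((name . ,pkg-name) (synopsis . ,pkg-synopsis))))

(define %pkg-metadata-metrics
  (make-metrics `((name . ,pkg-metadata-name)
                  (synopsis . ,pkg-metadata-synopsis))))

;; Simplified score: sum the weights of the fields containing SUBSTRING.
(define (score obj metrics substring)
  (fold (lambda (metric total)
          (if (string-contains ((car metric) obj) substring)
              (+ total (cdr metric))
              total))
        0 metrics))
```

With this layout a weight is changed in one place and both record
types pick it up.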

Thanks,
Ludo’.


* [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search
  2020-03-27 16:26 ` [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search Arun Isaac
                     ` (3 preceding siblings ...)
  2020-04-05 14:08   ` [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search Ludovic Courtès
@ 2020-04-24 21:05   ` Ludovic Courtès
  4 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-04-24 21:05 UTC (permalink / raw)
  To: Arun Isaac; +Cc: mail, 39258, zimon.toutoune

Hi,

Arun Isaac <arunisaac@systemreboot.net> skribis:

> Speedup is around 2x. Both measurements below are with a warm cache.
>
> $ time guix search inkscape
>
> real	0m1.722s
> user	0m1.776s
> sys	0m0.097s
>
> $ time /tmp/test/bin/guix search inkscape
>
> real	0m0.749s
> user	0m0.770s
> sys	0m0.020s
>
> This patchset does not affect the search API nor does it improve the relevance
> of search results. If there is interest in this approach, I'll complete this
> patchset properly. But, in the long run, I do think we should aim to get
> xapian or the like for guix search. WDYT?
>
> Unfortunately, generate-package-metadata-cache takes 43 seconds to build the
> cache on my relatively slow computer. Performance should be better on other
> people's machines.

43 seconds is a lot.  How do these 43 seconds compare to current ‘guix
search’ on your computer?  (Both do roughly the same thing.)

Thanks,
Ludo’.


* [bug#39258] benchmark search: default vs v2 vs v3
  2020-01-23 19:51 [bug#39258] Faster guix search using an sqlite cache Arun Isaac
                   ` (3 preceding siblings ...)
  2020-03-27 16:26 ` [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search Arun Isaac
@ 2020-04-26  3:54 ` zimoun
  2020-04-26  7:29   ` Pierre Neidhardt
  2020-04-26 15:49   ` Ludovic Courtès
  2020-05-03 15:01 ` [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3) zimoun
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 126+ messages in thread
From: zimoun @ 2020-04-26  3:54 UTC (permalink / raw)
  To: 39258, Arun Isaac, Ludovic Courtès, Pierre Neidhardt

Hi,

Thank you Arun for the patches and all the work.  Sorry for the delay.


TLDR:

 1) around 25 seconds added to "guix pull"... but more often than not I
am waiting around 10 minutes when pulling.
 2) the speedup is clear: more than 2x.


The question is the tradeoff between: the slowdown of pull vs the
speedup of search. What is acceptable?


Let's benchmark 3 versions of Guix:

 - default is a357849f5b
 - v2 rebased on default and based on Xapian
 - v3 rebased on default too and based on "custom" index

and compare the time of "guix pull" and then of "guix search".
Because v2 uses Xapian, the accuracy is different and so the list of
results differs depending on the query; the impact on
performance seems minimal.  Let's discuss accuracy and BM25 elsewhere
and focus on performance for now.


* guix pull
-----------

The idea is: measure if computing the new index is expensive or not,
compared to all of what "guix pull" computes.


** Reference
------------

Maybe I have misconfigured something, or my laptop is really
not powerful at all, but here are some numbers.

(Note: /proc/cpuinfo says 4 times Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
and /sys/block/sda/queue/rotational says 0 which is SSD.)

--8<---------------cut here---------------start------------->8---
$ guix describe
Generation 8    Apr 25 2020 09:00:01    (current)
  guix f84b036
    repository URL: https://git.savannah.gnu.org/git/guix.git
    branch: master
    commit: f84b0363053e5479464f6ce6ded45f80360d90fc
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
$ time guix pull -C ~/.config/guix/default-channels.scm
Updating channel 'guix' from Git repository at
'https://git.savannah.gnu.org/git/guix.git'...
Building from this channel:
  guix      https://git.savannah.gnu.org/git/guix.git   8cf6d15
downloading from
https://ci.guix.gnu.org/nar/gzip/xgakzpfs3rz57m666hsk1v3d3zcy7wgn-config.scm
...
 config.scm

[...]

building fonts directory...
building directory of Info manuals...
building database for manual pages...
building profile with 1 package...
building /gnu/store/kq1zlj5rxz8wrxc3ha8vck2wv2iakfnb-inferior-script.scm.drv...
building package cache...
building profile with 1 package...
New in this revision:
  2 new packages: cl-osicat, sbcl-osicat


real    13m37.997s
user    1m38.129s
sys     0m0.856s
--8<---------------cut here---------------end--------------->8---


And because "guix search" is used, say, 10 times more often than "guix
pull", a 10% increase in the time of "guix pull" would still improve the
overall experience of the user if it makes "guix search" faster, IMHO.

Therefore, because "guix pull" takes around 13 minutes, the extra cost
to index all the packages can be roughly 1min30s (at most).


Then, if I pull back from 8cf6d15 to '--commit=a357849f5b' then it takes:

real    2m13.693s
user    1m37.418s
sys     0m0.666s

so in this case 10% means around 13s.  But after 1 minute of waiting,
the command already feels long to me, so personally I do not mind much
whether it takes 2m13s or 3m00s.


Well, it is hard to draw a clear line about what could be accepted as
the time of indexing because the time of pulling is already highly
variable.


What is the average time of "guix pull"?

It could be really interesting to survey the users.  They could report:
 - guix describe
 - time guix pull
whichever channels they use.

Just to have an idea about what should be the acceptable extra time
added by indexing.  For sure it depends on the hardware, but it would
provide an idea and help to see if the extra time is worth it or not.

WDYT?



** Let's compare the index time
-------------------------------

Let's pull for the 3 cases and populate the store with all the necessary
items.  It could take long (about 20 minutes)!  For example, for version
2 of the patches, living in my branch 'search-v2' using a worktree:

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env guix pull -p /tmp/v2 \
     --url=$PWD --branch=search-v2 \
     -C ~/.config/guix/default-channels.scm
--8<---------------cut here---------------end--------------->8---

and then let's locate the index file for each version:

--8<---------------cut here---------------start------------->8---
# ls -l /tmp/default/lib/guix
/gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache

readlink /tmp/v2/lib/guix/package-search.index
/gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index/lib/guix/package-search.index

readlink /tmp/v3/lib/guix/package-metadata.cache
/gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache/lib/guix/package-metadata.cache
--8<---------------cut here---------------end--------------->8---

Well, let's remove the profiles and garbage collect the index files:

--8<---------------cut here---------------start------------->8---
rm /tmp/default /tmp/v{2,3}*
guix gc -D \
   /gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache \
   /gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index \
   /gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache
--8<---------------cut here---------------end--------------->8---


And then re-run "guix pull". We are now comparing apples to apples, I guess.


| time | default   | v2        | v3        |
|------+-----------+-----------+-----------|
| real | 1m11.899s | 1m30.806s | 1m34.341s |
| user | 1m23.845s | 1m24.160s | 1m24.233s |
| sys  | 0m0.570s  | 0m0.563s  | 0m0.529s  |


Therefore the extra cost is less than 20s for v2 and less than 25s for v3.


The whole question is which "guix pull" time this extra 25s is compared to:
 - more than 13m: adding 25s is acceptable
 - less than 2m: adding 25s is questionable
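
To put numbers on these two cases, here is a small sketch (the
procedure and variable names are mine) that expresses the extra 25s as
a percentage of the two "guix pull" timings reported above, 13m37.997s
and 2m13.693s.

```scheme
(define (relative-overhead extra total)
  "Return EXTRA as a percentage of TOTAL, both given in seconds."
  (* 100. (/ extra total)))

;; "real" times of the two pulls reported above, in seconds.
(define slow-pull (+ (* 13 60) 37.997))
(define fast-pull (+ (* 2 60) 13.693))

(relative-overhead 25 slow-pull)        ;roughly 3%
(relative-overhead 25 fast-pull)        ;roughly 19%
```

So the same 25s is well below the 10% budget in the first case and
well above it in the second.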

Usually, my feeling about "guix pull" is... I am waiting!  Therefore,
I will not see this extra 25s because it is masked by all the other
work "guix pull" is doing.


* guix search
-------------

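Let's compare cold (echo 3 | sudo tee /proc/sys/vm/drop_caches) and warm
cache, for example with the query 'inkscape'.  In each timing table
below, the first block of rows is the cold cache and the second block
is the warm cache.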


| time | default  | v2       | v3       |
|------+----------+----------+----------|
| real | 0m1.842s | 0m0.331s | 0m0.437s |
| user | 0m1.270s | 0m0.179s | 0m0.336s |
| sys  | 0m0.142s | 0m0.047s | 0m0.052s |
|------+----------+----------+----------|
| real | 0m0.898s | 0m0.132s | 0m0.292s |
| user | 0m1.069s | 0m0.168s | 0m0.353s |
| sys  | 0m0.072s | 0m0.008s | 0m0.019s |


Therefore the speedup is at least 3x for this query.

| cache | default-vs-v2 | default-vs-v3 |
|-------+---------------+---------------|
| cold  |           5.6 |           4.2 |
| warm  |           6.8 |           3.1 |
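
For the record, the ratios above are simply the "real" time of default
divided by the "real" time of v2 or v3.  A sketch of the computation,
where the helper names are assumptions and the time strings are copied
from the table:

```scheme
(define (time->seconds str)
  "Convert a `time' string such as \"0m1.842s\" to seconds."
  (let* ((m-index (string-index str #\m))
         (minutes (string->number (substring str 0 m-index)))
         (seconds (string->number
                   (substring str (+ m-index 1)
                              (- (string-length str) 1)))))
    (+ (* 60 minutes) seconds)))

(define (speedup baseline candidate)
  "Return how many times faster CANDIDATE is compared to BASELINE."
  (/ (time->seconds baseline) (time->seconds candidate)))

(define (round-to-tenth x)
  (/ (round (* 10 x)) 10))

;; Print one ratio per line for the 'inkscape' query.
(for-each (lambda (row)
            (format #t "~a: ~a~%" (car row)
                    (round-to-tenth (speedup (cadr row) (caddr row)))))
          '(("cold default-vs-v2" "0m1.842s" "0m0.331s")
            ("cold default-vs-v3" "0m1.842s" "0m0.437s")
            ("warm default-vs-v2" "0m0.898s" "0m0.132s")
            ("warm default-vs-v3" "0m0.898s" "0m0.292s")))
```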


Another query:

--8<---------------cut here---------------start------------->8---
time guix search crypto library | recsel -P name | grep libb2
--8<---------------cut here---------------end--------------->8---

| time | default  | v2       | v3       |
|------+----------+----------+----------|
| real | 0m2.216s | 0m1.109s | 0m0.689s |
| user | 0m1.655s | 0m1.309s | 0m0.683s |
| sys  | 0m0.193s | 0m0.073s | 0m0.035s |
|------+----------+----------+----------|
| real | 0m1.197s | 0m0.490s | 0m0.491s |
| user | 0m1.448s | 0m0.819s | 0m0.625s |
| sys  | 0m0.089s | 0m0.034s | 0m0.039s |


| cache | default-vs-v2 | default-vs-v3 |
|-------+---------------+---------------|
| cold  |           2.0 |           3.2 |
| warm  |           2.4 |           2.4 |




Before going further, especially about any more sophisticated
inverted index (BM25), it seems important to me to settle what
"cost" on "guix pull" the users are ready to pay, because
somehow the inverted index has to be computed.  And without an
inverted index, it seems difficult to improve the accuracy.

One solution could be to compute the inverted index in the
background with a low priority.  If the index is not ready yet when
"guix search" is called, then fall back to the current default
behaviour.


WDYT?


Cheers,
simon


* [bug#39258] benchmark search: default vs v2 vs v3
  2020-04-26  3:54 ` [bug#39258] benchmark search: default vs v2 vs v3 zimoun
@ 2020-04-26  7:29   ` Pierre Neidhardt
  2020-04-26 15:49   ` Ludovic Courtès
  1 sibling, 0 replies; 126+ messages in thread
From: Pierre Neidhardt @ 2020-04-26  7:29 UTC (permalink / raw)
  To: zimoun, 39258, Arun Isaac, Ludovic Courtès

Hi Simon,

Thanks for taking the time to benchmark this, this is very insightful!

> Usually, my feeling about "guix pull" is... I am waiting!  Therefore,
> I will not see this extra 25s because it is masked by all the other
> work "guix pull" is doing.

I agree and this is a very good point in my opinion.
While I don't expect nor do I need "guix pull" to complete immediately,
this is not true of "guix search".

As Simon suggested, maybe we can put together a benchmark script, post
it on the mailing list and ask members to report their results.  Maybe
a few dozen results would give us a better idea of the numbers we are
dealing with.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/


* [bug#39258] [PATCH v3 1/3] guix: Generate package metadata cache.
  2020-04-24 20:48     ` Ludovic Courtès
@ 2020-04-26  9:48       ` zimoun
  2020-04-26 14:35         ` Ludovic Courtès
  0 siblings, 1 reply; 126+ messages in thread
From: zimoun @ 2020-04-26  9:48 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Arun Isaac, Pierre Neidhardt, 39258

On Fri, 24 Apr 2020 at 22:48, Ludovic Courtès <ludo@gnu.org> wrote:

> > +  (define (expand-cache package result)
> > +    (cons `#(,(package-name package)
> > +             ,(package-version package)
> > +             ,(delete-duplicates
> > +               (map package-full-name
> > +                    (sort (filter package? (package-direct-inputs package))
> > +                          package<?)))
> > +             ,(package-outputs package)
> > +             ,(package-supported-systems package)
> > +             ,(package-synopsis package)
> > +             ,(package-description package)
> > +             ,(package-home-page package)
> > +             ,(let ((location (package-location package)))
> > +                (list (location-file location)
> > +                      (location-line location)
> > +                      (location-column location))))
>
> I was wondering if we could omit inputs, which are not that useful.

Agree.


> Note that this is probably the place where we could eventually add the
> computation of an inverted index like zimoun suggested in
> <https://lists.gnu.org/archive/html/guix-devel/2020-01/msg00243.html>.

We should first agree on the extra cost (time) we are ready to pay to
build improvements.
See the lengthy message [1], which is only about caching (the "inverted
index") while keeping the current 'relevance' scoring function.

[1] http://issues.guix.gnu.org/39258#78



Cheers,
simon


* [bug#39258] [PATCH v3 1/3] guix: Generate package metadata cache.
  2020-04-26  9:48       ` zimoun
@ 2020-04-26 14:35         ` Ludovic Courtès
  2020-04-26 14:54           ` Pierre Neidhardt
  2020-04-26 15:05           ` zimoun
  0 siblings, 2 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-04-26 14:35 UTC (permalink / raw)
  To: zimoun; +Cc: Arun Isaac, Pierre Neidhardt, 39258

Hi Simon,

zimoun <zimon.toutoune@gmail.com> skribis:

> On Fri, 24 Apr 2020 at 22:48, Ludovic Courtès <ludo@gnu.org> wrote:

[...]

>> Note that this is probably the place where we could eventually add the
>> computation of an inverted index like zimoun suggested in
>> <https://lists.gnu.org/archive/html/guix-devel/2020-01/msg00243.html>.
>
> We should first agree on the extra cost (time) we are ready to pay to
> build improvements.

It’s complicated.  As it stands, I’d rather not add overhead to ‘guix
pull’, especially since current ‘guix search’ on my SSD is fast enough
and can hardly be made any faster.

Realistically though, I understand that things are different on slower
machines and/or spinning disks.  That’s why I’m interested in seeing how
Arun’s proposed changes can affect such machines.

If, as a bonus, it allows us to have an inverted index and thus improve
the quality of search results, that’s great!

Thanks,
Ludo’.


* [bug#39258] [PATCH v3 1/3] guix: Generate package metadata cache.
  2020-04-26 14:35         ` Ludovic Courtès
@ 2020-04-26 14:54           ` Pierre Neidhardt
  2020-04-26 15:33             ` Ludovic Courtès
  2020-04-26 15:05           ` zimoun
  1 sibling, 1 reply; 126+ messages in thread
From: Pierre Neidhardt @ 2020-04-26 14:54 UTC (permalink / raw)
  To: Ludovic Courtès, zimoun; +Cc: Arun Isaac, 39258

Hi Ludo!

Ludovic Courtès <ludo@gnu.org> writes:

> It’s complicated.  As it stands, I’d rather not add overhead to ‘guix
> pull’, especially since current ‘guix search’ on my SSD is fast enough
> and can hardly be made any faster.

The question is, what is fast enough?  I have an NVMe here that has a
throughput of some 2GB/s, and yet

--8<---------------cut here---------------start------------->8---
time guix search emacs > /dev/null
real    0m1.545s
user    0m1.938s
sys     0m0.080s
--8<---------------cut here---------------end--------------->8---

on a hot cache, which is too slow in my opinion :p

Mildly impatient users might be slightly discouraged from iterating
search queries.

It also makes `guix search` very impractical to use in (non-Guile)
scripts, which is too bad considering that the recsel formatting makes
`guix search` a very good candidate for scripting.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/


* [bug#39258] [PATCH v3 1/3] guix: Generate package metadata cache.
  2020-04-26 14:35         ` Ludovic Courtès
  2020-04-26 14:54           ` Pierre Neidhardt
@ 2020-04-26 15:05           ` zimoun
  1 sibling, 0 replies; 126+ messages in thread
From: zimoun @ 2020-04-26 15:05 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Arun Isaac, Pierre Neidhardt, 39258

Hi Ludo,

On Sun, 26 Apr 2020 at 16:35, Ludovic Courtès <ludo@gnu.org> wrote:

> Realistically though, I understand that things are different on slower
> machines and/or spinning disks.  That’s why I’m interested in seeing how
> Arun’s proposed changes can affect such machines.

I understand. I have done a small benchmark [1] of the 3 ways: the
current one, v2 using Xapian (which is not an option in the long term)
and v3.

My "slower" machine is at my office... but it provides already
interesting numbers, IMHO.

[1] http://issues.guix.gnu.org/39258#78


> If, as a bonus, it allows us to have an inverted index and thus improve
> the quality of search results, that’s great!

The "issue" is that any improvement in either performance or
accuracy would add some extra cost.  The question is: what is the
maximum that users would accept to pay?


Well, it is complicated as you said. :-)
A trade-off between extra cost, maintenance, complexity, etc. is not
easy to find, as you said too elsewhere.
I am seeing all that as experimental: exploring ideas to see if they are
worth it or not.
And what is concluded now could change in the (near) future;
for example if the computations of derivations become faster, resulting
in a faster "guix pull", etc.


Cheers,
simon


* [bug#39258] [PATCH v3 1/3] guix: Generate package metadata cache.
  2020-04-26 14:54           ` Pierre Neidhardt
@ 2020-04-26 15:33             ` Ludovic Courtès
  0 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-04-26 15:33 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: Arun Isaac, 39258, zimoun

Hey!

Pierre Neidhardt <mail@ambrevar.xyz> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> It’s complicated.  As it stands, I’d rather not add overhead to ‘guix
>> pull’, especially since current ‘guix search’ on my SSD is fast enough
>> and can hardly be made any faster.
>
> The question is, what is fast enough?  I have an NVMe here that has a
> throughput of some 2GB/s, and yet
>
> time guix search emacs > /dev/null
> real    0m1.545s
> user    0m1.938s
> sys     0m0.080s

That accounts for the time to render 864 entries:

  $ guix search emacs| grep ^name| wc -l
  864

Compare with:

  $ time guix search emacs | head -100 > /dev/null

  real    0m0.674s
  user    0m0.802s
  sys     0m0.048s

Again, this is not to say it cannot be improved, but it’s quite a
challenge to do better on such hardware.

Though as discussed with Arun, there may be low-hanging optimization
fruits in Texinfo parsing and rendering.  I guess we need to go ahead
and fire up statprof now.  :-)
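
A minimal sketch of such a profiling session, assuming only Guile’s
(statprof) module.  The workload is a toy stand-in for the search
loop, not the actual ‘guix search’ code, and its names are made up.

```scheme
(use-modules (statprof) (srfi srfi-1))

;; Toy stand-in for the search loop: count the strings matching REGEXP.
(define (count-matching-strings regexp strings)
  (let ((rx (make-regexp regexp regexp/icase)))
    (count (lambda (str) (regexp-exec rx str)) strings)))

;; Made-up corpus, only there to give the profiler something to sample.
(define %descriptions
  (map (lambda (i) (format #f "package number ~a, an Emacs mode" i))
       (iota 10000)))

;; Run the thunk 20 times and print a flat profile to the current
;; output port.
(statprof (lambda ()
            (count-matching-strings "emacs" %descriptions))
          #:loop 20)
```

Profiling the real thing would mean replacing the thunk with a call
into the search and rendering code.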

Ludo’.


* [bug#39258] benchmark search: default vs v2 vs v3
  2020-04-26  3:54 ` [bug#39258] benchmark search: default vs v2 vs v3 zimoun
  2020-04-26  7:29   ` Pierre Neidhardt
@ 2020-04-26 15:49   ` Ludovic Courtès
  2020-04-26 17:01     ` zimoun
  2020-04-30 13:10     ` zimoun
  1 sibling, 2 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-04-26 15:49 UTC (permalink / raw)
  To: zimoun; +Cc: Arun Isaac, Pierre Neidhardt, 39258

Hi,

zimoun <zimon.toutoune@gmail.com> skribis:

>  1) around 25 seconds added to "guix pull"... but more often than not I
> am waiting around 10 minutes when pulling.
>  2) the speedup is clear: more than 2x.

Nice!

It does seem like Arun’s v3 (or maybe even v2) would work nicely.

> The question is the tradeoff between: the slowdown of pull vs the
> speedup of search. What is acceptable?

That’s only one criterion among others.  I hear the argument that 25s is
“nothing” compared to the rest, but it’s really a tradeoff.  Like, if I
spent a day optimizing ‘guix pull’ and managed to save 25s, I would find
it nice.  :-)

> $ time guix pull -C ~/.config/guix/default-channels.scm

It also depends on what’s in that file, of course.

> Then, if I pull back from 8cf6d15 to '--commit=a357849f5b' then it takes:
>
> real    2m13.693s
> user    1m37.418s
> sys     0m0.666s

For me:

--8<---------------cut here---------------start------------->8---
$ guix describe
Generacio 139   Apr 13 2020 21:50:08    (nuna)
  guix bad368b
    repository URL: https://git.savannah.gnu.org/git/guix.git
    branch: master
    commit: bad368b0d794689f3a8a11b58f1ea4987938682e
$ time guix pull -p /tmp/test --commit=bad368b0d794689f3a8a11b58f1ea4987938682e
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
Building from this channel:
  guix      https://git.savannah.gnu.org/git/guix.git   bad368b

[...]

real    0m57.916s
user    1m1.017s
sys     0m0.609s
--8<---------------cut here---------------end--------------->8---

(On a 2.6 GHz i7 though.)

> Well, let's remove the profiles and garbage collect the index files:
>
> rm /tmp/default /tmp/v{2,3}*
> guix gc -D \
>    /gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache \
>    /gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index \
>    /gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache

Could you do, for v2 and v3:

  time guix build /gnu/store/…-package-metadata-cache.drv --check

?

That’ll give us the exact cost of that part.  It’ll be interesting
especially in the Xapian case, where we expected the cost to be higher.

Thanks for the insightful benchmarks!

Ludo’.


* [bug#39258] benchmark search: default vs v2 vs v3
  2020-04-26 15:49   ` Ludovic Courtès
@ 2020-04-26 17:01     ` zimoun
  2020-04-26 20:22       ` Ludovic Courtès
  2020-04-30 13:10     ` zimoun
  1 sibling, 1 reply; 126+ messages in thread
From: zimoun @ 2020-04-26 17:01 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Arun Isaac, Pierre Neidhardt, 39258

Hi Ludo,

On Sun, 26 Apr 2020 at 17:49, Ludovic Courtès <ludo@gnu.org> wrote:

> It does seem like Arun’s v3 (or maybe even v2) would work nicely.

v3 is more interesting because it changes neither the relevance
scoring nor the set of dependencies.
However, v2 makes it easy to test BM25, which is another
relevance scoring... work in progress. :-)


> > The question is the tradeoff between: the slowdown of pull vs the
> > speedup of search. What is acceptable?
>
> That’s only one criterion among others.  I hear the argument that 25s is
> “nothing” compared to the rest, but it’s really a tradeoff.  Like, if I
> spent a day optimizing ‘guix pull’ and managed to save 25s, I would find
> it nice.  :-)

And I expect that the mid-term roadmap will reduce the
computation of derivations even further. ;-)



> > $ time guix pull -C ~/.config/guix/default-channels.scm
>
> It also depends on what’s in that file, of course.

It contains only one line: %default-channels

See my wishlist ;-)
https://lists.gnu.org/archive/html/guix-devel/2020-04/msg00393.html



me:  2m13.693s
you: 0m57.916s

As we already discussed elsewhere, it is hard to "test" 'guix pull'.
Would it make sense to measure "guix pull", as Chris (Marusich) did for
the CDN?


> > Well, let's remove the profiles and garbage collect the index files:
> >
> > rm /tmp/default /tmp/v{2,3}*
> > guix gc -D \
> >    /gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache \
> >    /gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index \
> >    /gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache
>
> Could you do, for v2 and v3:
>
>   time guix build /gnu/store/…-package-metadata-cache.drv --check

Newbie me! :-)

Two points:

   1. It may not be reproducible... I am checking.
   2. The time seems similar (v2=26s and v3=29s) considering the time
to start Guile and so on.

--8<---------------cut here---------------start------------->8---
guix gc --list-live | grep metadata
time /tmp/v3/bin/guix build \
  /gnu/store/jxs0abica8kjz1ppym95df97jk0qa9by-guix-package-metadata-cache.drv \
  --check
The following profile hook will be built:
   /gnu/store/jxs0abica8kjz1ppym95df97jk0qa9by-guix-package-metadata-cache.drv
building package cache...
(repl-version 0 1 1)
Generating package metadata cache for
'/gnu/store/95mi525syinh08jmcd3q7a7a8mr1sykb-profile'...
(values (value "/gnu/store/zhp7wv87vr6iis0fa3ff925i5r04i08q-guix-package-metadata-cache/lib/guix/package-metadata.cache"))
guix build: error: derivation
`/gnu/store/jxs0abica8kjz1ppym95df97jk0qa9by-guix-package-metadata-cache.drv'
may not be deterministic: output
`/gnu/store/zhp7wv87vr6iis0fa3ff925i5r04i08q-guix-package-metadata-cache'
differs

real    0m29.788s
user    0m0.535s
sys    0m0.025s
--8<---------------cut here---------------end--------------->8---


> That will give us the exact cost of that part.  It’ll be interesting
> especially in the Xapian case, which we expected to be higher.

--8<---------------cut here---------------start------------->8---
time /tmp/v2/bin/guix build \
  /gnu/store/w0dhl2n3ngi4v2ld8lprkqjl1g1q2m4p-guix-package-search-index.drv \
  --check
The following profile hook will be built:
   /gnu/store/w0dhl2n3ngi4v2ld8lprkqjl1g1q2m4p-guix-package-search-index.drv
running profile hook of type 'package-search-index'...
(repl-version 0 1 1)
Generating package search index for
'/gnu/store/wiinj9nrb45wlf2cgbgkjl9chxz9cb9b-profile'...
(values (value "/gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index/lib/guix/package-search.index"))
guix build: error: derivation
`/gnu/store/w0dhl2n3ngi4v2ld8lprkqjl1g1q2m4p-guix-package-search-index.drv'
may not be deterministic: output
`/gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index'
differs

real    0m26.552s
user    0m0.626s
sys    0m0.046s
--8<---------------cut here---------------end--------------->8---

It is not higher. Why should it be?


Setting aside the issue of reproducibility -- and it is one! --
should it be possible to download the index file like any other
substitute?


Cheers,
simon


* [bug#39258] benchmark search: default vs v2 vs v3
  2020-04-26 17:01     ` zimoun
@ 2020-04-26 20:22       ` Ludovic Courtès
  0 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-04-26 20:22 UTC (permalink / raw)
  To: zimoun; +Cc: Arun Isaac, Pierre Neidhardt, 39258

zimoun <zimon.toutoune@gmail.com> writes:

>> Could you do, for v2 and v3:
>>
>>   time guix build /gnu/store/…-package-metadata-cache.drv --check
>
> Newbie me! :-)
>
> Two points:
>
>    1. It may not be reproducible... I am checking.
>    2. The time seems similar (v2=26s and v3=29s) considering the time
> to start Guile and so on.

Good, so it means that’s not Xapian taking time here.

Thanks again!

Ludo’.


* [bug#39258] benchmark search: default vs v2 vs v3
  2020-04-26 15:49   ` Ludovic Courtès
  2020-04-26 17:01     ` zimoun
@ 2020-04-30 13:10     ` zimoun
  1 sibling, 0 replies; 126+ messages in thread
From: zimoun @ 2020-04-30 13:10 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Arun Isaac, Pierre Neidhardt, 39258

Hi Ludo,

On Sun, 26 Apr 2020 at 17:49, Ludovic Courtès <ludo@gnu.org> wrote:

> That’s only one criterion among others.  I hear the argument that 25s is
> “nothing” compared to the rest, but it’s really a tradeoff.  Like, if I
> spent a day optimizing ‘guix pull’ and managed to save 25s, I would find
> it nice.  :-)

I am not sure I understand everything that "guix pull" does.
Does "guix pull" compile all the Scheme files under 'gnu/'? Or does it
only recompile the "new" files?

I do not know if it makes sense, but I noted these differences:

 1. Search with none of the files under 'gnu/packages/' compiled
 2. Compile all the files under 'gnu/packages/', then search
 3. Search with only the file gnu/packages/emacs-xyz.scm not compiled
    (all the other files are compiled)
 4. Compile the file above, then search

3b and 4b are the same as 3 and 4, but with gnu/packages/cobol.scm, which
is smaller than emacs-xyz.scm.


Results:

1) 1m43.312s
2) 0m1.301s (but 9m51.801s compiling)

3) 0m6.526s
4) 0m1.389s (1m8.670s compiling)

3b) 0m0.921s
4b) 0m0.924s (0m1.884s compiling)

Therefore, one option to reduce the time spent pulling would be to relax
the "compilation" of 'gnu/packages/' and 'gnu/services'; something
less optimized, since the packages and services "just" need to be
transformed into bytecode to improve IO when reading them. Perhaps I
am missing something...

And maybe it is similar to what Andy Wingo is proposing in [1].

[1] https://lists.gnu.org/archive/html/guix-devel/2020-04/msg00444.html


Cheers,
simon

--8<---------------cut here---------------start------------->8---
find gnu/packages -name "*.scm" -type f -exec touch {} \;
time ./pre-inst-env guix search gmsh | recsel -C -p name

;;; note: source file /home/simon/src/guix/wk/tmp/gnu/packages/abduco.scm
;;;       newer than compiled /home/simon/src/guix/wk/tmp/gnu/packages/abduco.go

[...]

;;; note: source file /home/simon/src/guix/wk/tmp/gnu/packages/zwave.scm
;;;       newer than compiled /home/simon/src/guix/wk/tmp/gnu/packages/zwave.go
name: gmsh

real    1m43.312s
user    2m19.318s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
find gnu/packages -name "*.scm" -type f -exec touch {} \;
time make -j4 && time ./pre-inst-env guix search gmsh | recsel -C -p name

make  all-recursive
make[1]: Entering directory '/home/simon/src/guix/wk/tmp'
Making all in po/guix
make[2]: Entering directory '/home/simon/src/guix/wk/tmp/po/guix'
make[2]: Leaving directory '/home/simon/src/guix/wk/tmp/po/guix'
Making all in po/packages
make[2]: Entering directory '/home/simon/src/guix/wk/tmp/po/packages'
make[2]: Leaving directory '/home/simon/src/guix/wk/tmp/po/packages'
make[2]: Entering directory '/home/simon/src/guix/wk/tmp'
Compiling Scheme modules...
[  0%] LOAD     gnu/packages/abduco.scm
;;; note: source file ./gnu/packages/abduco.scm
;;;       newer than compiled /home/simon/src/guix/wk/tmp/gnu/packages/abduco.go

[...]

[100%] GUILEC   gnu/packages/zwave.go
make[2]: Leaving directory '/home/simon/src/guix/wk/tmp'
make[1]: Leaving directory '/home/simon/src/guix/wk/tmp'

real    9m51.801s
user    29m18.938s
sys     0m5.822s
name: gmsh

real    0m1.301s
user    0m1.266s
sys     0m0.101s
--8<---------------cut here---------------end--------------->8---





* [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3)
  2020-01-23 19:51 [bug#39258] Faster guix search using an sqlite cache Arun Isaac
                   ` (4 preceding siblings ...)
  2020-04-26  3:54 ` [bug#39258] benchmark search: default vs v2 vs v3 zimoun
@ 2020-05-03 15:01 ` zimoun
  2020-05-03 15:01   ` [bug#39258] [PATCH v4 1/3] DRAFT packages: Add fields to packages cache zimoun
                     ` (3 more replies)
  2020-06-01  0:00 ` [bug#39258] [PATCH 0/4] Optimize guix search Arun Isaac
                   ` (2 subsequent siblings)
  8 siblings, 4 replies; 126+ messages in thread
From: zimoun @ 2020-05-03 15:01 UTC (permalink / raw)
  To: 39258; +Cc: arunisaac, mail, ludo, zimoun

Dear,

The aim of v4 is to keep the same search performance as v3 while drastically reducing the time needed to generate the cache.  On my laptop, the overhead is now about 4 seconds, compared to more than 20 seconds for v2 and v3.

--8<---------------cut here---------------start------------->8---
# default
time guix build /gnu/store/0nfpp82mqglpwvl1nbfpaphw5db2ivcp-guix-package-cache.drv --check
# v4
time guix build /gnu/store/y78gfh1n7m3kyrj8wsqj25qc2cbc1a4d-guix-package-cache.drv --check
--8<---------------cut here---------------end--------------->8---

|      | default  | v4        |
|------+----------+-----------|
| real | 0m6.012s | 0m10.244s |
| user | 0m0.541s | 0m0.542s  |
| sys  | 0m0.033s | 0m0.032s  |


In v3, the cache is built using 'cons' and 'fold-packages' (a wrapper around 'fold-module-public-variables').  v4 instead extends the procedure 'generate-package-cache', which uses 'vhash' and 'fold-module-public-variables*', so that it records additional information.

Therefore the cache '/lib/guix/package.cache' contains more information.  (The v4 structure of 'package.cache' is a quick draft, so the details should be discussed.  An interesting move would be to have a structured S-exp (binary and all strings), because it could become an entry point for exporting the package list to JSON.  WDYT?)


Now we are comparing apples to apples, and computing BM25 (v2) is not free at all.  Remember that BM25 is the state of the art in information retrieval (relevance ranking), and it is delegated to Xapian (v2).  I do not know whether there is a performance bottleneck between Guix, Guile-Xapian and Xapian itself, but the computation of BM25 is certainly not free.  More about that soon.

To be clear about BM25 and caching, what I have in mind is:
  1. "guix search --build-index", run optionally by the user if they want, for example, the BM25 ranking.
  2. Use BM25 metrics to detect poor package metadata (synopsis and description); if it is worth it, why not add another checker to "guix lint".

However, ranking is another story, and I am not yet convinced that BM25 fits Guix's needs.
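
For readers unfamiliar with BM25, here is a minimal sketch of the textbook Okapi BM25 score for a single query term.  It is not what Xapian runs (Xapian's weighting scheme has more parameters and may differ in detail); the values k1 = 1.2 and b = 0.75 are common defaults assumed here, and all procedure names are hypothetical.

```scheme
;; Textbook Okapi BM25 for one query term (illustration only).
(define k1 1.2)   ; term-frequency saturation (assumed common default)
(define b 0.75)   ; document-length normalization (assumed common default)

(define (idf n-docs n-docs-with-term)
  ;; Smoothed inverse document frequency: rare terms weigh more.
  (log (+ 1 (/ (+ (- n-docs n-docs-with-term) 0.5)
               (+ n-docs-with-term 0.5)))))

(define (bm25-term-score tf doc-length avg-doc-length
                         n-docs n-docs-with-term)
  ;; TF is the number of occurrences of the term in the document,
  ;; e.g. in a package's synopsis and description.
  (* (idf n-docs n-docs-with-term)
     (/ (* tf (+ k1 1))
        (+ tf (* k1 (+ (- 1 b)
                       (* b (/ doc-length avg-doc-length))))))))
```

The score of a document for a whole query is the sum of these per-term scores, which requires corpus-wide statistics (document counts and average length); that is part of why an index has to be built beforehand.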



* Details
~~~~~~~~~

The patches apply on top of commit a357849f5b (they are not yet rebased).

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env guix pull --branch=search-v4 --url=$PWD -p /tmp/v4
--8<---------------cut here---------------end--------------->8---


The same test as in the previous benchmark (cold cache):

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env /tmp/v4/bin/guix search crypto library \
     | recsel -P name | grep libb2
name: libb2

real    0m0.784s
user    0m0.810s
sys     0m0.037s
--8<---------------cut here---------------end--------------->8---

The option '--load-path' turns off the cache, and the search falls back to the usual 'fold-packages'.

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env /tmp/v4/bin/guix search -L /tmp/my-pkgs crypto library \
     | recsel -C -p name | grep libb2
name: libb2

real    0m2.446s
user    0m1.872s
sys     0m0.187s
--8<---------------cut here---------------end--------------->8---



* Still draft
~~~~~~~~~~~~~

 1. The name 'fold-packages*' may be misleading since the procedure does not return "true" packages.

--8<---------------cut here---------------start------------->8---
(define (get-hello p r)
  (if (string=? (package-name p) "hello")
      p
      r))
(define no-cache   (fold-packages  get-hello '()))
(define from-cache (fold-packages* get-hello '()))

(equal? no-cache from-cache)
;;; #f
--8<---------------cut here---------------end--------------->8---

    Another name for the procedure is welcome if it is an issue.

 2. The function 'package->recutils' in 'guix/ui.scm' is modified, but not in the best way.

--8<---------------cut here---------------start------------->8---
          (match (package-supported-systems p)
            (('cache supported-systems)
             (string-join supported-systems))
            (_
             (string-join (package-transitive-supported-systems p)))))
--8<---------------cut here---------------end--------------->8---

    However, it avoids duplicating code, as is done in v3.


 3. Deprecated packages are displayed (bug in v3 too).

 4. The impolite '@@' is used to access the private license constructor.

 5. Commit messages are incomplete, copyright headers too, etc.
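
To make points 1 and 2 concrete without loading Guix, here is a minimal sketch using a toy record.  '<toy-package>' and every procedure below are hypothetical stand-ins, not Guix's '<package>' API; the sketch only shows why 'equal?' fails on a record rebuilt from the cache and how a field tagged with 'cache can be dispatched on.

```scheme
(use-modules (srfi srfi-9)
             (ice-9 match))

;; Toy stand-in for Guix's <package> record (hypothetical, much smaller).
(define-record-type <toy-package>
  (make-toy-package name version source systems)
  toy-package?
  (name    toy-package-name)
  (version toy-package-version)
  (source  toy-package-source)
  (systems toy-package-systems))

;; A "true" package, as obtained by loading the package modules.
(define from-modules
  (make-toy-package "hello" "2.10" "hello-2.10.tar.gz"
                    '("x86_64-linux" "i686-linux")))

;; What a cache-backed fold could rebuild: no source, and 'systems'
;; tagged with the symbol 'cache to say the value is already final.
(define from-cache
  (make-toy-package "hello" "2.10" #f
                    '(cache ("x86_64-linux" "i686-linux"))))

(define (same-ui-fields? p1 p2)
  ;; Compare only the fields that a user interface would display.
  (and (string=? (toy-package-name p1) (toy-package-name p2))
       (string=? (toy-package-version p1) (toy-package-version p2))))

(define (systems->string p)
  ;; Dispatch on the tag, in the spirit of the 'match' form in point 2.
  (match (toy-package-systems p)
    (('cache systems) (string-join systems))
    (systems (string-join systems))))
```

With these definitions, (equal? from-modules from-cache) is false because the 'source' and 'systems' fields differ, while (same-ui-fields? from-modules from-cache) is true and 'systems->string' gives the same string for both.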



* Next?
~~~~~~~

IMHO, simply caching improves the current situation:

 - a bit of extra time at pull time (less than 5s on my machine)
 + speed up at search time (2x faster)
 * maintainable code?

Is this going in the right direction?
Could you advise on how to make the code more compliant?
Could you test on your machines to have another point of comparison?



Best regards,
simon


zimoun (3):
  DRAFT packages: Add fields to packages cache.
  DRAFT packages: Add new procedure 'fold-packages*'.
  DRAFT guix package: Use cache in 'find-packages-by-description'.

 gnu/packages.scm         | 98 ++++++++++++++++++++++++++++++++++++++--
 guix/scripts/package.scm |  2 +-
 guix/ui.scm              | 29 +++++++-----
 tests/packages.scm       | 31 +++++++++++++
 4 files changed, 143 insertions(+), 17 deletions(-)

-- 
2.26.1






* [bug#39258] [PATCH v4 1/3] DRAFT packages: Add fields to packages cache.
  2020-05-03 15:01 ` [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3) zimoun
@ 2020-05-03 15:01   ` zimoun
  2020-05-03 15:01   ` [bug#39258] [PATCH v4 2/3] DRAFT packages: Add new procedure 'fold-packages*' zimoun
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-05-03 15:01 UTC (permalink / raw)
  To: 39258; +Cc: arunisaac, mail, ludo, zimoun

---
 gnu/packages.scm | 51 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 46 insertions(+), 5 deletions(-)

diff --git a/gnu/packages.scm b/gnu/packages.scm
index d22c992bb1..fa18f81487 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -33,6 +33,8 @@
   #:use-module (guix profiles)
   #:use-module (guix describe)
   #:use-module (guix deprecation)
+  #:use-module (guix build-system)
+  #:use-module (guix licenses)
   #:use-module (ice-9 vlist)
   #:use-module (ice-9 match)
   #:use-module (ice-9 binary-ports)
@@ -212,7 +214,8 @@ package module."
                     (match vector
                       (#(name version module symbol outputs
                               supported? deprecated?
-                              file line column)
+                              file line column
+                              _ _ _ _ _ _ _ _ _ _)
                        (proc name version result
                              #:outputs outputs
                              #:location (and file
@@ -269,7 +272,11 @@ package names.  Return #f on failure."
                    (match item
                      (#(name version module symbol outputs
                              supported? deprecated?
-                             file line column)
+                             file line column
+                             synopsis description home-page
+                             build-system-name build-system-description
+                             supported-systems direct-inputs
+                             license-name license-uri license-comment)
                       (vhash-cons name item vhash))))
                  vlist-null
                  lst))
@@ -316,7 +323,8 @@ decreasing version order."
   (if (and (cache-is-authoritative?) cache)
       (match (cache-lookup cache name)
         (#f #f)
-        ((#(_ versions modules symbols _ _ _ _ _ _) ...)
+        ((#(_ versions modules symbols _ _ _ _ _ _
+              _ _ _ _ _ _ _ _ _ _) ...)
          (fold (lambda (version* module symbol result)
                  (if (or (not version)
                          (version-prefix? version version*))
@@ -339,7 +347,8 @@ matching NAME and VERSION."
         (#f '())
         ((#(name versions modules symbols outputs
                  supported? deprecated?
-                 files lines columns) ...)
+                 files lines columns
+                 _ _ _ _ _ _ _ _ _ _) ...)
          (fold (lambda (version* file line column result)
                  (if (and file
                           (or (not version)
@@ -401,7 +410,39 @@ reducing the memory footprint."
                                      `(,(location-file loc)
                                        ,(location-line loc)
                                        ,(location-column loc))
-                                     '(#f #f #f))))
+                                     '(#f #f #f)))
+
+                             ,(package-synopsis package)
+                             ,(package-description package)
+                             ,(package-home-page package)
+
+                             ,@(let ((build-system
+                                       (package-build-system package)))
+                                 `(,(symbol->string
+                                     (build-system-name build-system))
+                                   ,(build-system-description build-system)))
+
+                             ,(package-transitive-supported-systems package)
+
+                             ,(delete-duplicates
+                               (sort (map package-full-name
+                                          (match (package-direct-inputs package)
+                                            (((labels inputs . _) ...)
+                                             (filter package? inputs))))
+                                     string<?))
+
+                             ,@(match (package-license package)
+                                 (((? license? licenses) ...) ; multilicenses
+                                  `(,(string-join (map license-name licenses)
+                                                  ", ")
+                                    ,(license-uri (car licenses)) ;TODO: names>uris?
+                                    ;; see gpl1+ comment #f
+                                    ,(license-comment (car licenses))))
+                                 ((? license? license)
+                                  `(,(license-name license)
+                                    ,(license-uri license)
+                                    ,(license-comment license)))
+                                 (_ '(#f #f #f))))
                           result)
                     (vhash-consq package #t seen))))))
       (_
-- 
2.26.1






* [bug#39258] [PATCH v4 2/3] DRAFT packages: Add new procedure 'fold-packages*'.
  2020-05-03 15:01 ` [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3) zimoun
  2020-05-03 15:01   ` [bug#39258] [PATCH v4 1/3] DRAFT packages: Add fields to packages cache zimoun
@ 2020-05-03 15:01   ` zimoun
  2020-05-03 15:01   ` [bug#39258] [PATCH v4 3/3] DRAFT guix package: Use cache in 'find-packages-by-description' zimoun
  2020-05-03 16:43   ` [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3) Ludovic Courtès
  3 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-05-03 15:01 UTC (permalink / raw)
  To: 39258; +Cc: arunisaac, mail, ludo, zimoun

---
 gnu/packages.scm   | 47 ++++++++++++++++++++++++++++++++++++++++++++++
 guix/ui.scm        | 29 +++++++++++++++++-----------
 tests/packages.scm | 31 ++++++++++++++++++++++++++++++
 3 files changed, 96 insertions(+), 11 deletions(-)

diff --git a/gnu/packages.scm b/gnu/packages.scm
index fa18f81487..a0c5835b8b 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -55,6 +55,7 @@
 
             fold-packages
             fold-available-packages
+            fold-packages*
 
             find-newest-available-packages
             find-packages-by-name
@@ -253,6 +254,52 @@ is guaranteed to never traverse the same package twice."
                                 init
                                 modules))
 
+(define (fold-packages* proc init)
+  "Fold (PROC PACKAGE RESULT) over the list of available packages.  When a
+package cache is available, this procedure does not actually load any package
+module.  Moreover when package cache is available, this procedure
+re-constructs a new package skipping some package record field.  The usage of
+this procedure is User Interface (ui) only."
+  (define cache
+    (load-package-cache (current-profile)))
+
+  (define license  (@@ (guix licenses) license))
+
+  (if (and cache (cache-is-authoritative?))
+      (vhash-fold (lambda (name vector result)
+                    (match vector
+                      (#(name version module symbol outputs
+                              supported? deprecated?
+                              file line column
+                              synopsis description home-page
+                              build-system-name build-system-description
+                              supported-systems direct-inputs
+                              license-name license-uri license-comment)
+                       (proc (package
+                               (name name)
+                               (version version)
+                               (source #f)            ;TODO: ?
+                               (build-system
+                                 (build-system
+                                   (name (string->symbol build-system-name))
+                                   (description build-system-description)
+                                   (lower #f)))       ; never used by ui
+                               (inputs ; list of "full-name@version"
+                                (list 'cache direct-inputs))
+                               (outputs outputs)
+                               (synopsis synopsis)
+                               (description description)
+                               (license (license
+                                         license-name license-uri license-comment))
+                               (home-page home-page)
+                               (supported-systems (list 'cache supported-systems))
+                               (location (location
+                                          file line column)))
+                        result))))
+                  init
+                  cache)
+      (fold-packages proc init)))
+
 (define %package-cache-file
   ;; Location of the package cache.
   "/lib/guix/package.cache")
diff --git a/guix/ui.scm b/guix/ui.scm
index 1e24fe5dca..257d119798 100644
--- a/guix/ui.scm
+++ b/guix/ui.scm
@@ -1416,13 +1416,10 @@ HYPERLINKS? is true, emit hyperlink escape sequences when appropriate."
     ;; the initial "+ " prefix.
     (if (> width 2) (- width 2) width))
 
-  (define (dependencies->recutils packages)
-    (let ((list (string-join (delete-duplicates
-                              (map package-full-name
-                                   (sort packages package<?))) " ")))
-      (string->recutils
-       (fill-paragraph list width*
-                       (string-length "dependencies: ")))))
+  (define (dependencies->string packages)
+    (string-join (delete-duplicates
+                  (map package-full-name
+                       (sort packages package<?))) " "))
 
   (define (package<? p1 p2)
     (string<? (package-full-name p1) (package-full-name p2)))
@@ -1432,11 +1429,21 @@ HYPERLINKS? is true, emit hyperlink escape sequences when appropriate."
   (format port "version: ~a~%" (package-version p))
   (format port "outputs: ~a~%" (string-join (package-outputs p)))
   (format port "systems: ~a~%"
-          (string-join (package-transitive-supported-systems p)))
+          (match (package-supported-systems p)
+            (('cache supported-systems)
+             (string-join supported-systems))
+            (_
+             (string-join (package-transitive-supported-systems p)))))
   (format port "dependencies: ~a~%"
-          (match (package-direct-inputs p)
-            (((labels inputs . _) ...)
-             (dependencies->recutils (filter package? inputs)))))
+          (let ((dependencies
+                 (match (package-direct-inputs p)
+                    (('cache inputs)
+                     (string-join inputs))
+                    (((labels inputs . _) ...)
+                     (dependencies->string (filter package? inputs))))))
+            (string->recutils
+             (fill-paragraph dependencies width*
+                             (string-length "dependencies: ")))))
   (format port "location: ~a~%"
           (or (and=> (package-location p)
                      (if hyperlinks? location->hyperlink location->string))
diff --git a/tests/packages.scm b/tests/packages.scm
index 7a8b5e4a2d..4504f6cf33 100644
--- a/tests/packages.scm
+++ b/tests/packages.scm
@@ -1169,6 +1169,37 @@
     ((one)
      (eq? one guile-2.0))))
 
+(test-assert "fold-packages* hello with/without cache"
+  (let ()
+    (define (equal-package? p1 p2)
+      ;; fold-package* re-constructs a new package skipping 'source' and 'lower'
+      ;; so equal? does not apply
+      (and (equal? (package-full-name p1) (package-full-name p2))
+           (equal? (package-description p1) (package-description p2))))
+
+    (define no-cache
+      (fold-packages* (lambda (p r)
+                        (if (string=? (package-name p) "hello")
+                            p
+                            r))
+                      #f))
+
+    (define from-cache
+      (call-with-temporary-directory
+       (lambda (cache)
+         (generate-package-cache cache)
+         (mock ((guix describe) current-profile (const cache))
+               (mock ((gnu packages) cache-is-authoritative? (const #t))
+                     (fold-packages* (lambda (p r)
+                                      (if (string=? (package-name p) "hello")
+                                          p
+                                          r))
+                                    #f))))))
+
+    (and (equal? no-cache hello)
+         (equal-package? from-cache hello)
+         (equal-package? no-cache from-cache))))
+
 (test-assert "fold-available-packages with/without cache"
   (let ()
     (define no-cache
-- 
2.26.1






* [bug#39258] [PATCH v4 3/3] DRAFT guix package: Use cache in 'find-packages-by-description'.
  2020-05-03 15:01 ` [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3) zimoun
  2020-05-03 15:01   ` [bug#39258] [PATCH v4 1/3] DRAFT packages: Add fields to packages cache zimoun
  2020-05-03 15:01   ` [bug#39258] [PATCH v4 2/3] DRAFT packages: Add new procedure 'fold-packages*' zimoun
@ 2020-05-03 15:01   ` zimoun
  2020-05-03 16:43   ` [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3) Ludovic Courtès
  3 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-05-03 15:01 UTC (permalink / raw)
  To: 39258; +Cc: arunisaac, mail, ludo, zimoun

---
 guix/scripts/package.scm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/guix/scripts/package.scm b/guix/scripts/package.scm
index badb1dcd38..6b982eb172 100644
--- a/guix/scripts/package.scm
+++ b/guix/scripts/package.scm
@@ -174,7 +174,7 @@ hooks\" run when building the profile."
   "Return a list of pairs: packages whose name, synopsis, description,
 or output matches at least one of REGEXPS sorted by relevance, and its
 non-zero relevance score."
-  (let ((matches (fold-packages (lambda (package result)
+  (let ((matches (fold-packages* (lambda (package result)
                                   (if (package-superseded package)
                                       result
                                       (match (package-relevance package
-- 
2.26.1






* [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3)
  2020-05-03 15:01 ` [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3) zimoun
                     ` (2 preceding siblings ...)
  2020-05-03 15:01   ` [bug#39258] [PATCH v4 3/3] DRAFT guix package: Use cache in 'find-packages-by-description' zimoun
@ 2020-05-03 16:43   ` Ludovic Courtès
  2020-05-03 18:10     ` zimoun
  3 siblings, 1 reply; 126+ messages in thread
From: Ludovic Courtès @ 2020-05-03 16:43 UTC (permalink / raw)
  To: zimoun; +Cc: arunisaac, mail, 39258

Hello!

zimoun <zimon.toutoune@gmail.com> writes:

> The aim of v4 is to keep the same search performance as v3 while drastically reducing the time needed to generate the cache.  On my laptop, the overhead is now about 4 seconds, compared to more than 20 seconds for v2 and v3.
>
> # default
> time guix build /gnu/store/0nfpp82mqglpwvl1nbfpaphw5db2ivcp-guix-package-cache.drv --check
> # v4
> time guix build /gnu/store/y78gfh1n7m3kyrj8wsqj25qc2cbc1a4d-guix-package-cache.drv --check
>
> |      | default  | v4        |
> |------+----------+-----------|
> | real | 0m6.012s | 0m10.244s |
> | user | 0m0.541s | 0m0.542s  |
> | sys  | 0m0.033s | 0m0.032s  |

Not bad!

> In v3, the cache is built using 'cons' and 'fold-packages' (a wrapper around 'fold-module-public-variables').  v4 instead extends the procedure 'generate-package-cache', which uses 'vhash' and 'fold-module-public-variables*', so that it records additional information.
>
> Therefore the cache '/lib/guix/package.cache' contains more
> information.

This breaks the binary interface, so we’ll have to analyze the impact of
such a change and devise a strategy.

> (The v4 structure of 'package.cache' is a quick draft, so the details
> should be discussed.  An interesting move would be to have a
> structured S-exp (binary and all strings), because it could become an
> entry point for exporting the package list to JSON.  WDYT?)

It’s on purpose that this cache is an object file: it just needs to be
mmap’d, and that’s it.  It’s the cheapest possible way to do it.
Parsing sexps would be more costly, and since we’re talking about
startup time, this is sensitive.

> Now we are comparing apples to apples, and computing BM25 (v2) is not free at all.  Remember that BM25 is the state of the art in information retrieval (relevance ranking), and it is delegated to Xapian (v2).  I do not know whether there is a performance bottleneck between Guix, Guile-Xapian and Xapian itself, but the computation of BM25 is certainly not free.  More about that soon.
>
> To be clear about BM25 and caching, what I have in mind is:
>   1. "guix search --build-index" optionally done by the user if they want, for example, the BM25 ranking.

Something that must be done explicitly doesn’t seem great to me.  As a
user, I’d rather not think about search indexes and all.  But I don’t
know, maybe if it happened automatically on the first ‘guix search’
invocation that’d be fine.

>   2. Use BM25 metrics to detect poor package meta-data (synopsis and description); if it is worthwhile, why not add another checker to "guix lint"?

That’d be interesting!

>  1. The name of 'fold-packages*' may be misleading since it does not return "true" packages.

Did you see ‘fold-available-packages’?  It seems you could extend it
instead of introducing ‘fold-packages*’, no?

>  2. The function 'package->recutils' in 'guix/ui.scm' is modified, but not in the best way.
>
>           (match (package-supported-systems p)
>             (('cache supported-systems)
>              (string-join supported-systems))
>             (_
>              (string-join (package-transitive-supported-systems p)))))
>
>     However, it avoids duplicating code, as is done in version v3.

I made suggestions to Arun’s v3 about the API here.  Essentially, I
think I proposed having a procedure that takes the list of fields as
keyword parameters, and ‘package->recutils’ would just delegate to that.


>  3. Deprecated packages are displayed (bug in v3 too).
>
>  4. Impolite '@@' is used to access the private license construction.

(guix licenses) could provide a ‘string->license’ procedure.

Stopping here for now because I’m sorta drowning in patch review.  :-)

Thanks for exploring this design space, we’re making progress!

Ludo’.





* [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3)
  2020-05-03 16:43   ` [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3) Ludovic Courtès
@ 2020-05-03 18:10     ` zimoun
  2020-05-03 19:49       ` Ludovic Courtès
  0 siblings, 1 reply; 126+ messages in thread
From: zimoun @ 2020-05-03 18:10 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Arun Isaac, Pierre Neidhardt, 39258

Hi Ludo,

On Sun, 3 May 2020 at 18:43, Ludovic Courtès <ludo@gnu.org> wrote:

> > Therefore the cache '/lib/guix/package.cache' contains more
> > information.
>
> This breaks the binary interface, so we’ll have to analyze the impact of
> such a change and devise a strategy.

Interface between what and what?

Because from my understanding, this file is used by only one Guix.
What am I missing?


Note that I have read your comment in v3 2/3 but I did not understand it. Sorry.

--8<---------------cut here---------------start------------->8---
I realize the other cache also has that problem, but it would be nice to
add a version tag to the cache.  Basically emit something like:

  (package-metadata-cache (version 0) VECTOR …)

instead of just:

  (VECTOR …)
--8<---------------cut here---------------end--------------->8---



> > (The v4 structure of 'package.cache' is a quick draft, so details
> > should be discussed; an interesting move would be to have a
> > structured (binary and all strings) S-exp, because it should become an
> > entry point to export the packages list to JSON.  WDYT?)
>
> It’s on purpose that this cache is an object file: it just needs to be
> mmap’d, and that’s it.  It’s the cheapest possible way to do it.
> Parsing sexps would be more costly, and since we’re talking about
> startup time, this is sensitive.

I agree; either I worded it badly or I misunderstood something.
For example, 'supported-systems' is saved as a list of strings,
whereas 'license' is expanded as 3 strings without being packed in a
list of strings.  From my point of view, it is inconsistent and I do not
know which is best (readability, startup time, etc.).


> > To be clear about BM25 and caching, what I have in mind is:
> >   1. "guix search --build-index" optionally done by the user if they want, for example, the BM25 ranking.
>
> Something that must be done explicitly doesn’t seem great to me.  As a
> user, I’d rather not think about search indexes and all.  But I don’t
> know, maybe if it happened automatically on the first ‘guix search’
> invocation that’d be fine.

I do not think it is an option to build the BM25 index the first time
"guix search" is called.  Back-of-the-envelope estimate: Xapian* needs
~25 seconds to do so.

From my point of view, there are two options:
 a) "guix pull" spends this extra ~25 seconds (compared to 10 seconds to
build the v4 cache)
 b) the user manually builds the index (I agree it is awkward!)

Well, the first question is to evaluate whether it is worth it -- I am
using the v2 version based on Xapian to get an idea.  If you have
suggestions about queries (terms a user could type) and results
(packages a user could expect), they are welcome.


*Xapian: I do not think we could do better but I have not checked yet
whether there is a bottleneck between Guix, Guile-Xapian and Xapian.


> >  1. The name of 'fold-packages*' may be misleading since it does not return "true" packages.
>
> Did you see ‘fold-available-packages’?  It seems you could extend it
> instead of introducing ‘fold-packages*’, no?

Yes and no.

 a) 'fold-available-packages' requires modifying the 'lambda' in
'find-packages-by-description',
 b) having 'fold-packages*' return a 'package' requires fewer tweaks, IMHO.

Well, I agree that in the long term, what 'fold-packages*' does could
be done by 'fold-available-packages' with an adequate 'proc'.

Thank you for the suggestion; re-reading v3 2/3 properly, I see that
you had already mentioned it. :-)


> >  2. The function 'package->recutils' in 'guix/ui.scm' is modified, but not in the best way.
> >
> >           (match (package-supported-systems p)
> >             (('cache supported-systems)
> >              (string-join supported-systems))
> >             (_
> >              (string-join (package-transitive-supported-systems p)))))
> >
> >     However, it avoids duplicating code, as is done in version v3.
>
> I made suggestions to Arun’s v3 about the API here.  Essentially, I
> think I proposed having a procedure that takes the list of fields as
> keyword parameters, and ‘package->recutils’ would just delegate to that.

Yes, it was already your suggestion in v3 3/3.  Are you suggesting
refactoring 'package->recutils'?  For example,

--8<---------------cut here---------------start------------->8---
(define* (package->recutils name version
                            ... all-the-other-fields ...
                            port #:optional (width (%text-width))
                            #:key
                            (hyperlinks? (supports-hyperlinks? port))
                            (extra-fields '()))
--8<---------------cut here---------------end--------------->8---



> >  4. Impolite '@@' is used to access the private license construction.
>
> (guix licenses) could provide a ‘string->license’ procedure.

Well, do you suggest:

    (define (string->license name) (license name #f #f))

? Skipping 'uri' and 'comment'?  Naive question: what is the purpose
of these 2 fields?  They are not exposed at the CLI level,
AFAIK, and I do not think a user evaluates '(license-uri pkg)' in a
script.

Well, I think that the hyperlink feature could be used to display the
license URI too.  WDYT?



> Stopping here for now because I’m sorta drowning in patch review.  :-)

Thank you for all the comments.


> Thanks for exploring this design space, we’re making progress!

My pleasure. Scheme is designed to explore. ;-)


Cheers,
simon





* [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3)
  2020-05-03 18:10     ` zimoun
@ 2020-05-03 19:49       ` Ludovic Courtès
  0 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-05-03 19:49 UTC (permalink / raw)
  To: zimoun; +Cc: Arun Isaac, Pierre Neidhardt, 39258

Hi,

zimoun <zimon.toutoune@gmail.com> writes:

> On Sun, 3 May 2020 at 18:43, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> > Therefore the cache '/lib/guix/package.cache' contains more
>> > information.
>>
>> This breaks the binary interface, so we’ll have to analyze the impact of
>> such a change and devise a strategy.
>
> Interface between what and what?

Guix revision N creates a cache that will be read by revision N+1, upon
‘guix pull’ completion.

> Note that I have read your comment in v3 2/3 but I did not understand it. Sorry.
>
> I realize the other cache also has that problem, but it would be nice to
> add a version tag to the cache.  Basically emit something like:
>
>   (package-metadata-cache (version 0) VECTOR …)
>
> instead of just:
>
>   (VECTOR …)

Yes, it would be better.
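
To make the version-tag idea concrete, here is a minimal sketch of a reader
for the tagged shape shown above.  It only illustrates the shape; the names
'cache-entries' and '%supported-cache-version' are hypothetical and not part
of any patch in this thread.

```scheme
(use-modules (ice-9 match))

;; Hypothetical: the only cache format version this reader understands.
(define %supported-cache-version 0)

(define (cache-entries sexp)
  "Return the list of entries of SEXP when it is a cache tagged with the
supported version; return #f otherwise, so that the caller can ignore a
cache written in another format."
  (match sexp
    (('package-metadata-cache ('version (? integer? version)) entries ...)
     (and (= version %supported-cache-version) entries))
    (_ #f)))
```

With such a tag, a reader that meets an untagged cache or an unknown version
gets #f back and can fall back to the slow path instead of misreading the
data.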

> For example, 'supported-systems' is saved as a list of strings,
> whereas 'license' is expanded as 3 strings without being packed in a
> list of strings.  From my point of view, it is inconsistent and I do not
> know which is best (readability, startup time, etc.).

I guess both ‘license’ and ‘supported-systems’ should be lists of
strings.  It doesn’t really have an impact on startup time (I thought
you were suggesting storing the cache as an sexp instead of an object
file.)

>> Something that must be done explicitly doesn’t seem great to me.  As a
>> user, I’d rather not think about search indexes and all.  But I don’t
>> know, maybe if it happened automatically on the first ‘guix search’
>> invocation that’d be fine.
>
> I do not think it is an option to build the BM25 index the first time
> "guix search" is called.  Back-of-the-envelope estimate: Xapian* needs
> ~25 seconds to do so.
>
> From my point of view, there are two options:
>  a) "guix pull" spends this extra ~25 seconds (compared to 10 seconds to
> build the v4 cache)
>  b) the user manually builds the index (I agree it is awkward!)
>
> Well, the first question is to evaluate whether it is worth it -- I am
> using the v2 version based on Xapian to get an idea.  If you have
> suggestions about queries (terms a user could type) and results
> (packages a user could expect), they are welcome.

Yeah, dunno.  Maybe an option would be to create the index in such a way
that it is substitutable.


[...]

> Yes, it was already your suggestion in v3 3/3.  Are you suggesting
> refactoring 'package->recutils'?  For example,
>
> (define* (package->recutils name version
>                             ... all-the-other-fields ...
>                             port #:optional (width (%text-width))
>                             #:key
>                             (hyperlinks? (supports-hyperlinks? port))
>                             (extra-fields '()))

Yes.

>> >  4. Impolite '@@' is used to access the private license construction.
>>
>> (guix licenses) could provide a ‘string->license’ procedure.
>
> Well, do you suggest:
>
>     (define (string->license name) (license name #f #f))

No; rather, it would look up the license in a dictionary and return the
corresponding object or #f if it’s not a known license.
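
A rough sketch of what such a lookup could look like.  To stay
self-contained it defines a stand-in <license> record rather than importing
(guix licenses), and the table holds only two illustrative entries; the
names 'make-license', '%known-licenses' and '%license-table' are
assumptions made for this example.

```scheme
(use-modules (srfi srfi-9))

;; Stand-in for the <license> record of (guix licenses); hypothetical,
;; defined here only so that the sketch runs on its own.
(define-record-type <license>
  (make-license name uri comment)
  license?
  (name    license-name)
  (uri     license-uri)
  (comment license-comment))

;; Illustrative entries only; a real dictionary would list every license.
(define %known-licenses
  (list (make-license "GPL 3+" "https://www.gnu.org/licenses/gpl.html" #f)
        (make-license "Expat" "http://directory.fsf.org/wiki/License:Expat" #f)))

;; Hash table mapping license names to license objects.
(define %license-table
  (let ((table (make-hash-table)))
    (for-each (lambda (license)
                (hash-set! table (license-name license) license))
              %known-licenses)
    table))

(define (string->license name)
  "Return the known license whose name is NAME, or #f if there is none."
  (hash-ref %license-table name #f))
```

Unlike '(license name #f #f)', this returns the existing object, so 'uri'
and 'comment' are preserved and an unknown name is reported as #f.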

Thanks,
Ludo’.





* [bug#39258] [PATCH 0/4] Optimize guix search
  2020-01-23 19:51 [bug#39258] Faster guix search using an sqlite cache Arun Isaac
                   ` (5 preceding siblings ...)
  2020-05-03 15:01 ` [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3) zimoun
@ 2020-06-01  0:00 ` Arun Isaac
  2020-06-01  0:00   ` [bug#39258] [PATCH 1/4] ui: Cut off search early if any regexp does not match Arun Isaac
                     ` (4 more replies)
  2020-06-01 10:11 ` [bug#39258] KMP string search algorithm? zimoun
  2021-07-15  7:33 ` [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances zimoun
  8 siblings, 5 replies; 126+ messages in thread
From: Arun Isaac @ 2020-06-01  0:00 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac, ludo, zimon.toutoune

Hi,

Sorry for the long delay in replying to this thread.

I think Ludo is right in that we can improve guix search performance with only
simple code improvements rather than including xapian or improving our
existing cache. Here are a few patches along those lines.

In `relevance`, we set our score to 0 if any of the regexps don't match. Then,
we might as well not match the remaining regexps. Patch 1 does this early cut
off optimization.
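
The early cut-off can be shown in isolation with the following sketch.  It
is not the code of patch 1; 'total-score' is a hypothetical helper that
takes the scoring procedure as an argument so that it runs on its own.

```scheme
(use-modules (ice-9 match))

(define (total-score score patterns)
  "Return the sum of (SCORE pattern) over PATTERNS, but return 0 as soon as
one pattern scores 0, without scoring the remaining patterns."
  (let loop ((patterns patterns)
             (total 0))
    (match patterns
      (() total)
      ((head . tail)
       (let ((s (score head)))
         (if (zero? s)
             0                          ;cut off: skip the tail entirely
             (loop tail (+ total s))))))))
```

With four patterns where the second one does not match, SCORE is called
only twice; the 'map' plus 'any zero?' formulation would call it four
times.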

Often our search strings are only literal strings. So, we can save some time
by using string-contains instead of invoking the regexp engine. Patch 2 does
this. In addition, guile's string-contains uses a naive O(n^2) string search
algorithm. We should perhaps use the O(n) Knuth-Morris-Pratt algorithm[1]. In
fact, a comment on line 2006 of libguile/srfi-13.c in the guile source code
mentions this. If implemented, the KMP algorithm could speed up guix search
further.

[1]: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
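
For illustration, here is a small Scheme sketch of KMP.  It is not taken
from Guile; 'kmp-failure-table' and 'kmp-search' are names made up for this
example, and the search is case-sensitive, unlike string-contains-ci.  For
a text of length n and a pattern of length m it runs in O(n + m), whereas a
naive search is O(nm) in the worst case.

```scheme
(define (kmp-failure-table pattern)
  "Return a vector whose element I is the length of the longest proper
prefix of PATTERN that is also a suffix of the first I+1 characters."
  (let* ((m (string-length pattern))
         (table (make-vector m 0)))
    (let loop ((i 1) (k 0))
      (cond ((>= i m) table)
            ((char=? (string-ref pattern i) (string-ref pattern k))
             (vector-set! table i (+ k 1))
             (loop (+ i 1) (+ k 1)))
            ((> k 0)
             ;; Fall back to the next shorter candidate prefix.
             (loop i (vector-ref table (- k 1))))
            (else
             (vector-set! table i 0)
             (loop (+ i 1) 0))))))

(define (kmp-search text pattern)
  "Return the index of the first occurrence of PATTERN in TEXT, or #f."
  (let ((n (string-length text))
        (m (string-length pattern)))
    (if (zero? m)
        0
        (let ((table (kmp-failure-table pattern)))
          (let loop ((i 0) (k 0))       ;K characters of PATTERN matched
            (cond ((>= i n) #f)
                  ((char=? (string-ref text i) (string-ref pattern k))
                   (if (= (+ k 1) m)
                       (- i k)          ;the match ends at I
                       (loop (+ i 1) (+ k 1))))
                  ((> k 0)
                   ;; Mismatch: reuse the table, never move I backwards.
                   (loop i (vector-ref table (- k 1))))
                  (else
                   (loop (+ i 1) 0))))))))
```

The index I into TEXT never decreases, which is where the linear bound
comes from.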

Patches 3 and 4 are minor improvements.

Here's a rough performance comparison. The ./pre-inst-env runs use the
patched tree; the plain guix runs use the installed guix.

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env guix search game

real	0m2.261s
user	0m2.351s
sys	0m0.104s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
time guix search game

real	0m2.661s
user	0m2.843s
sys	0m0.080s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env guix search strategy game

real	0m1.613s
user	0m1.635s
sys	0m0.096s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
time guix search strategy game

real	0m2.520s
user	0m2.583s
sys	0m0.112s
--8<---------------cut here---------------end--------------->8---

Arun Isaac (4):
  ui: Cut off search early if any regexp does not match.
  ui: Use string matching with literal search strings.
  ui: Do not translate package synopsis a second time.
  ui: Use package-description-string.

 guix/scripts/package.scm | 12 +++++--
 guix/ui.scm              | 68 ++++++++++++++++++++++++----------------
 2 files changed, 50 insertions(+), 30 deletions(-)

-- 
2.26.2






* [bug#39258] [PATCH 1/4] ui: Cut off search early if any regexp does not match.
  2020-06-01  0:00 ` [bug#39258] [PATCH 0/4] Optimize guix search Arun Isaac
@ 2020-06-01  0:00   ` Arun Isaac
  2020-06-09  8:29     ` Ludovic Courtès
  2020-06-01  0:00   ` [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings Arun Isaac
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-06-01  0:00 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac

* guix/ui.scm (relevance): When one of the regexps does not match, cut off
early and return 0. Do not try to match the remaining regexps.
---
 guix/ui.scm | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/guix/ui.scm b/guix/ui.scm
index ea5f460865..4a22358963 100644
--- a/guix/ui.scm
+++ b/guix/ui.scm
@@ -1519,11 +1519,16 @@ score, the more relevant OBJ is to REGEXPS."
                     (+ relevance (* weight (apply + (map score-regexp lst)))))))))
             0 metrics)))
 
-  (let ((scores (map regexp->score regexps)))
-    ;; Return zero if one of REGEXPS doesn't match.
-    (if (any zero? scores)
-        0
-        (reduce + 0 scores))))
+  (let loop ((regexps regexps)
+             (total-score 0))
+    (match regexps
+      ((head . tail)
+       (let ((score (regexp->score head)))
+         ;; Return zero if one of PATTERNS doesn't match.
+         (cond
+          ((zero? score) 0)
+          (else (loop tail (+ total-score score))))))
+      (() total-score))))
 
 (define %package-metrics
   ;; Metrics used to compute the "relevance score" of a package against a set
-- 
2.26.2






* [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings.
  2020-06-01  0:00 ` [bug#39258] [PATCH 0/4] Optimize guix search Arun Isaac
  2020-06-01  0:00   ` [bug#39258] [PATCH 1/4] ui: Cut off search early if any regexp does not match Arun Isaac
@ 2020-06-01  0:00   ` Arun Isaac
  2020-06-09  8:33     ` Ludovic Courtès
  2020-06-01  0:00   ` [bug#39258] [PATCH 3/4] ui: Do not translate package synopsis a second time Arun Isaac
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-06-01  0:00 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac

* guix/scripts/package.scm (process-query): Make search query a regexp only if
it is not a literal search string.
* guix/ui.scm (relevance): Use string matching with literal search strings and
regexp matching with regexp search strings.
---
 guix/scripts/package.scm | 12 +++++++---
 guix/ui.scm              | 50 +++++++++++++++++++++++++---------------
 2 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/guix/scripts/package.scm b/guix/scripts/package.scm
index 1246147798..1b637f7802 100644
--- a/guix/scripts/package.scm
+++ b/guix/scripts/package.scm
@@ -675,6 +675,11 @@ doesn't need it."
 (define (process-query opts)
   "Process any query specified by OPTS.  Return #t when a query was actually
 processed, #f otherwise."
+  (define (regexp-pattern? str)
+    (string-any
+     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
+     str))
+
   (let* ((profiles (delete-duplicates
                     (match (filter-map (match-lambda
                                          (('profile . p) p)
@@ -781,11 +786,12 @@ processed, #f otherwise."
 
       (('search _)
        (let* ((patterns (filter-map (match-lambda
-                                      (('query 'search rx) rx)
+                                      (('query 'search (? regexp-pattern? rx))
+                                       (make-regexp* rx regexp/icase))
+                                      (('query 'search pattern) pattern)
                                       (_                   #f))
                                     opts))
-              (regexps  (map (cut make-regexp* <> regexp/icase) patterns))
-              (matches  (find-packages-by-description regexps)))
+              (matches  (find-packages-by-description patterns)))
          (leave-on-EPIPE
           (display-search-results matches (current-output-port)))
          #t))
diff --git a/guix/ui.scm b/guix/ui.scm
index 4a22358963..56754dba83 100644
--- a/guix/ui.scm
+++ b/guix/ui.scm
@@ -1489,41 +1489,53 @@ HYPERLINKS? is true, emit hyperlink escape sequences when appropriate."
 ;;; Searching.
 ;;;
 
-(define (relevance obj regexps metrics)
+(define (relevance obj patterns metrics)
   "Compute a \"relevance score\" for OBJ as a function of its number of
-matches of REGEXPS and accordingly to METRICS.  METRICS is list of
+matches of PATTERNS and accordingly to METRICS.  METRICS is list of
 field/weight pairs, where FIELD is a procedure that returns a string or list
 of strings describing OBJ, and WEIGHT is a positive integer denoting the
 weight of this field in the final score.
 
-A score of zero means that OBJ does not match any of REGEXPS.  The higher the
-score, the more relevant OBJ is to REGEXPS."
-  (define (score regexp str)
-    (fold-matches regexp str 0
-                  (lambda (m score)
-                    (+ score
-                       (if (string=? (match:substring m) str)
-                           5             ;exact match
-                           1)))))
-
-  (define (regexp->score regexp)
-    (let ((score-regexp (lambda (str) (score regexp str))))
+A score of zero means that OBJ does not match any of PATTERNS.  The higher the
+score, the more relevant OBJ is to PATTERNS."
+  (define (score pattern str)
+    (match pattern
+      ((? string? pattern)
+       (cond
+        ((string=? str pattern) 5)
+        (else
+         (let loop ((score 0) (start 0))
+           (cond
+            ((string-contains-ci str pattern start)
+             => (lambda (index)
+                  (loop (+ score 1) (+ index (string-length pattern)))))
+            (else score))))))
+      ((? regexp? regexp)
+       (fold-matches regexp str 0
+                     (lambda (m score)
+                       (+ score
+                          (if (string=? (match:substring m) str)
+                              5             ;exact match
+                              1)))))))
+
+  (define (pattern->score pattern)
+    (let ((score-pattern (lambda (str) (score pattern str))))
       (fold (lambda (metric relevance)
               (match metric
                 ((field . weight)
                  (match (field obj)
                    (#f  relevance)
                    ((? string? str)
-                    (+ relevance (* (score-regexp str) weight)))
+                    (+ relevance (* (score-pattern str) weight)))
                    ((lst ...)
-                    (+ relevance (* weight (apply + (map score-regexp lst)))))))))
+                    (+ relevance (* weight (apply + (map score-pattern lst)))))))))
             0 metrics)))
 
-  (let loop ((regexps regexps)
+  (let loop ((patterns patterns)
              (total-score 0))
-    (match regexps
+    (match patterns
       ((head . tail)
-       (let ((score (regexp->score head)))
+       (let ((score (pattern->score head)))
          ;; Return zero if one of PATTERNS doesn't match.
          (cond
           ((zero? score) 0)
-- 
2.26.2






* [bug#39258] [PATCH 3/4] ui: Do not translate package synopsis a second time.
  2020-06-01  0:00 ` [bug#39258] [PATCH 0/4] Optimize guix search Arun Isaac
  2020-06-01  0:00   ` [bug#39258] [PATCH 1/4] ui: Cut off search early if any regexp does not match Arun Isaac
  2020-06-01  0:00   ` [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings Arun Isaac
@ 2020-06-01  0:00   ` Arun Isaac
  2020-06-09  8:33     ` Ludovic Courtès
  2020-06-01  0:00   ` [bug#39258] [PATCH 4/4] ui: Use package-description-string Arun Isaac
  2020-06-01  1:25   ` [bug#39258] [PATCH v5 0/4] Optimize guix search zimoun
  4 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-06-01  0:00 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac

* guix/ui.scm (package->recutils): package-synopsis-string already returns a
translated string. Do not attempt to translate it again.
---
 guix/ui.scm | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/guix/ui.scm b/guix/ui.scm
index 56754dba83..744758d1f3 100644
--- a/guix/ui.scm
+++ b/guix/ui.scm
@@ -1463,8 +1463,7 @@ HYPERLINKS? is true, emit hyperlink escape sequences when appropriate."
           (string-map (match-lambda
                         (#\newline #\space)
                         (chr       chr))
-                      (or (and=> (package-synopsis-string p) P_)
-                          "")))
+                      (or (package-synopsis-string p) "")))
   (format port "~a~%"
           (string->recutils
            (string-trim-right
-- 
2.26.2






* [bug#39258] [PATCH 4/4] ui: Use package-description-string.
  2020-06-01  0:00 ` [bug#39258] [PATCH 0/4] Optimize guix search Arun Isaac
                     ` (2 preceding siblings ...)
  2020-06-01  0:00   ` [bug#39258] [PATCH 3/4] ui: Do not translate package synopsis a second time Arun Isaac
@ 2020-06-01  0:00   ` Arun Isaac
  2020-06-09  8:34     ` Ludovic Courtès
  2020-06-01  1:25   ` [bug#39258] [PATCH v5 0/4] Optimize guix search zimoun
  4 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-06-01  0:00 UTC (permalink / raw)
  To: 39258; +Cc: Arun Isaac

* guix/ui.scm (package->recutils): Use package-description-string instead of
package-description and P_.
---
 guix/ui.scm | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/guix/ui.scm b/guix/ui.scm
index 744758d1f3..caa506a645 100644
--- a/guix/ui.scm
+++ b/guix/ui.scm
@@ -1468,10 +1468,8 @@ HYPERLINKS? is true, emit hyperlink escape sequences when appropriate."
           (string->recutils
            (string-trim-right
             (parameterize ((%text-width width*))
-              (texi->plain-text
-               (string-append "description: "
-                              (or (and=> (package-description p) P_)
-                                  ""))))
+              (string-append "description: "
+                             (or (package-description-string p) "")))
             #\newline)))
   (for-each (match-lambda
               ((field . value)
-- 
2.26.2






* [bug#39258] [PATCH v5 0/4] Optimize guix search
  2020-06-01  0:00 ` [bug#39258] [PATCH 0/4] Optimize guix search Arun Isaac
                     ` (3 preceding siblings ...)
  2020-06-01  0:00   ` [bug#39258] [PATCH 4/4] ui: Use package-description-string Arun Isaac
@ 2020-06-01  1:25   ` zimoun
  2020-06-01  2:24     ` Arun Isaac
  2020-06-01 10:01     ` zimoun
  4 siblings, 2 replies; 126+ messages in thread
From: zimoun @ 2020-06-01  1:25 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, 39258

Hi Arun,

On Mon, 1 Jun 2020 at 02:00, Arun Isaac <arunisaac@systemreboot.net> wrote:

> Sorry for the long delay in replying to this thread.

Based on Ludo's comments [1] on v4, which is a simple rewrite of
your v3, I am finishing a vN+1, but time flies and I am late on the
topic too. :-)

Well, this still unsent vN+1 series has the same performance as v4 on
"guix pull", which is a key point compared to v3.  Obviously, the
performance of "guix search" is equivalent in both versions.  This
vN+1 builds two caches -- to avoid binary breakage -- in only one go;
the time-consuming 'fold-module-public-variables*' is applied only once.

[1] http://issues.guix.gnu.org/39258#93


> I think Ludo is right in that we can improve guix search performance with only
> simple code improvements rather than including xapian or improving our
> existing cache. Here are a few patches on those lines.

Well, improving the cache is easy; at least as you did in v3 by adding
another one.
The most annoying part is rewriting the arguments of 'package->recutils'
to be compliant.

However, after some comparisons, I am not convinced that BM25 will be
worth implementing...


> In `relevance`, we set our score to 0 if any of the regexps don't match. Then,
> we might as well not match the remaining regexps. Patch 1 does this early cut
> off optimization.

Interesting.


> Often our search strings are only literal strings. So, we can save some time
> by using string-contains instead of invoking the regexp engine. Patch 2 does
> this. In addition, guile's string-contains uses a naive O(n^2) string search
> algorithm. We should perhaps use the O(n) Knuth-Morris-Pratt algorithm[1]. In
> fact, a comment on line 2006 of libguile/srfi-13.c in the guile source code
> mentions this. If implemented, the KMP algorithm could speed up guix search
> further.
>
> [1]: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

Really interesting idea.


> Patch 3 and 4 are minor improvements.
>
> Here's a rough performance comparison.

On cold or warm cache?


> --8<---------------cut here---------------start------------->8---
> time ./pre-inst-env guix search game
>
> real    0m2.261s
> user    0m2.351s
> sys     0m0.104s
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> time guix search game
>
> real    0m2.661s
> user    0m2.843s
> sys     0m0.080s
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> time ./pre-inst-env guix search strategy game
>
> real    0m1.613s
> user    0m1.635s
> sys     0m0.096s
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> time guix search strategy game
>
> real    0m2.520s
> user    0m2.583s
> sys     0m0.112s
> --8<---------------cut here---------------end--------------->8---

So in the best case, the old/new ratio is about 1.5; this new
version is about 1.5 times faster.
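
For reference, the ratios can be recomputed from the "real" times quoted
above; 'speedup' is just a helper name made up for this snippet.

```scheme
(define (speedup old new)
  "Return the ratio OLD/NEW of two wall-clock times, in seconds."
  (/ old new))

;; "guix search game": installed guix versus patched tree, roughly 1.18.
(speedup 2.661 2.261)

;; "guix search strategy game": the best case, roughly 1.56.
(speedup 2.520 1.613)
```

So the single-term query gains less than 20%, and the two-term query, where
the early cut-off applies, gains about 56%.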

Well, with the extra cache approach (v3 or v4) the old/new ratio is
much higher: 3.1 times faster on a cold cache (which is the one I am
interested in) and 2.4 times faster on a warm cache.


I will take a look at this new series and report what happens on my
laptop.  But basically, I would like "guix search" to run in under 1.0
second on my machine. ;-)


Thank you for this new input.

Cheers,
simon





* [bug#39258] [PATCH v5 0/4] Optimize guix search
  2020-06-01  1:25   ` [bug#39258] [PATCH v5 0/4] Optimize guix search zimoun
@ 2020-06-01  2:24     ` Arun Isaac
  2020-06-01 10:01     ` zimoun
  1 sibling, 0 replies; 126+ messages in thread
From: Arun Isaac @ 2020-06-01  2:24 UTC (permalink / raw)
  To: zimoun; +Cc: Ludovic Courtès, 39258



> Based on Ludo's comments [1] on v4, which is a simple rewrite of
> your v3, I am finishing a vN+1, but time flies and I am late on the
> topic too. :-)
>
> Well, this still unsent vN+1 series has the same performance as v4 on
> "guix pull", which is a key point compared to v3.  Obviously, the
> performance of "guix search" is equivalent in both versions.  This
> vN+1 builds two caches -- to avoid binary breakage -- in only one go;
> the time-consuming 'fold-module-public-variables*' is applied only once.

Interesting, I'll be waiting for your patchset. :-)

> [1] http://issues.guix.gnu.org/39258#93

>> Here's a rough performance comparison.
>
> On cold or warm cache?

On a warm cache.

> So in the best case, the old/new ratio is about 1.5; this new
> version is about 1.5 times faster.
>
> Well, with the extra cache approach (v3 or v4) the old/new ratio is
> much higher: 3.1 times faster on a cold cache (which is the one I am
> interested in) and 2.4 times faster on a warm cache.

We could always have both my optimizations and your improved cache. So,
that's a win on both fronts.

> I will take a look at this new series and report what happens on my
> laptop.  But basically, I would like "guix search" to run in under 1.0
> second on my machine. ;-)

Indeed, I would love that too! :-)


* [bug#39258] [PATCH v5 0/4] Optimize guix search
  2020-06-01  1:25   ` [bug#39258] [PATCH v5 0/4] Optimize guix search zimoun
  2020-06-01  2:24     ` Arun Isaac
@ 2020-06-01 10:01     ` zimoun
  1 sibling, 0 replies; 126+ messages in thread
From: zimoun @ 2020-06-01 10:01 UTC (permalink / raw)
  To: Arun Isaac; +Cc: Ludovic Courtès, 39258

> > --8<---------------cut here---------------start------------->8---
> > time ./pre-inst-env guix search strategy game
> >
> > real    0m1.613s
> > user    0m1.635s
> > sys     0m0.096s
> > --8<---------------cut here---------------end--------------->8---
> >
> > --8<---------------cut here---------------start------------->8---
> > time guix search strategy game
> >
> > real    0m2.520s
> > user    0m2.583s
> > sys     0m0.112s
> > --8<---------------cut here---------------end--------------->8---

I do not see any improvement on my machine.  Well, I am
double-checking because I may have screwed something up...





* [bug#39258] KMP string search algorithm?
  2020-01-23 19:51 [bug#39258] Faster guix search using an sqlite cache Arun Isaac
                   ` (6 preceding siblings ...)
  2020-06-01  0:00 ` [bug#39258] [PATCH 0/4] Optimize guix search Arun Isaac
@ 2020-06-01 10:11 ` zimoun
  2020-06-01 22:24   ` Leo Famulari
  2021-07-15  7:33 ` [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances zimoun
  8 siblings, 1 reply; 126+ messages in thread
From: zimoun @ 2020-06-01 10:11 UTC (permalink / raw)
  To: Arun Isaac, 39258, Ludovic Courtès

Dear,

> > Often our search strings are only literal strings. So, we can save some time
> > by using string-contains instead of invoking the regexp engine. Patch 2 does
> > this. In addition, guile's string-contains uses a naive O(n^2) string search
> > algorithm. We should perhaps use the O(n) Knuth-Morris-Pratt algorithm[1]. In
> > fact, a comment on line 2006 of libguile/srfi-13.c in the guile source code
> > mentions this. If implemented, the KMP algorithm could speed up guix search
> > further.
> >
> > [1]: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

It could improve things.
Well, I will try to do some back-of-the-envelope computations because I am
not convinced that the mean value of 'n' (the length of the description,
isn't it?) is large enough to really see an improvement for the end-user;
the visible bottleneck is I/O.
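
For reference, the Knuth-Morris-Pratt search mentioned in the quoted text
can be sketched in Scheme as below.  This is only an illustration of the
algorithm, not code from Guile or Guix: the procedure names
'kmp-failure-table' and 'kmp-string-contains' are hypothetical, and the
return convention (index of the first match, or #f) is assumed to mirror
'string-contains'.

```scheme
;; Hypothetical helper: entry I is the length of the longest proper
;; prefix of PATTERN that is also a suffix of PATTERN[0..I].
(define (kmp-failure-table pattern)
  (let* ((m (string-length pattern))
         (table (make-vector m 0)))
    (let loop ((i 1) (k 0))
      (cond ((>= i m) table)
            ((char=? (string-ref pattern i) (string-ref pattern k))
             (vector-set! table i (+ k 1))
             (loop (+ i 1) (+ k 1)))
            ((> k 0)
             ;; Fall back to the next shorter candidate prefix.
             (loop i (vector-ref table (- k 1))))
            (else
             (vector-set! table i 0)
             (loop (+ i 1) 0))))))

;; Hypothetical name: return the index of the first occurrence of
;; PATTERN in TEXT, or #f.  The text index I never moves backwards.
(define (kmp-string-contains text pattern)
  (let ((n (string-length text))
        (m (string-length pattern)))
    (if (zero? m)
        0
        (let ((table (kmp-failure-table pattern)))
          (let loop ((i 0) (k 0))
            (cond ((= k m) (- i m))          ;full match ending at I
                  ((>= i n) #f)              ;text exhausted
                  ((char=? (string-ref text i) (string-ref pattern k))
                   (loop (+ i 1) (+ k 1)))
                  ((> k 0)
                   (loop i (vector-ref table (- k 1))))
                  (else
                   (loop (+ i 1) 0))))))))
```

Whether this beats the C implementation on strings as short as package
descriptions is exactly the open question above; the sketch only shows
the shape of the algorithm.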

All the best,
simon

ps;
To be honest, I thought this kind of algorithm was the default. :-)




^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] KMP string search algorithm?
  2020-06-01 10:11 ` [bug#39258] KMP string search algorithm? zimoun
@ 2020-06-01 22:24   ` Leo Famulari
  2020-06-01 23:48     ` Arun Isaac
  0 siblings, 1 reply; 126+ messages in thread
From: Leo Famulari @ 2020-06-01 22:24 UTC (permalink / raw)
  To: zimoun; +Cc: Arun Isaac, Ludovic Courtès, 39258

On Mon, Jun 01, 2020 at 12:11:52PM +0200, zimoun wrote:
> Dear,
> 
> > > Often our search strings are only literal strings. So, we can save some time
> > > by using string-contains instead of invoking the regexp engine. Patch 2 does
> > > this. In addition, guile's string-contains uses a naive O(n^2) string search
> > > algorithm. We should perhaps use the O(n) Knuth-Morris-Pratt algorithm[1]. In
> > > fact, a comment on line 2006 of libguile/srfi-13.c in the guile source code
> > > mentions this. If implemented, the KMP algorithm could speed up guix search
> > > further.
> > >
> > > [1]: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
> 
> It could improve things.
> Well, I will try to do some back-of-the-envelope computations because I am
> not convinced that the mean value of 'n' (the length of the description,
> isn't it?) is large enough to really see an improvement for the end-user;
> the visible bottleneck is I/O.
> 
> All the best,
> simon
> 
> ps;
> To be honest, I thought this kind of algorithm was the default. :-)

I also recommend taking a look at the Boyer Moore string search
implementation in (guix build grafts).

It would be great to generalize it and make it accessible to other parts
of Guix.




^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] KMP string search algorithm?
  2020-06-01 22:24   ` Leo Famulari
@ 2020-06-01 23:48     ` Arun Isaac
  2020-06-02  8:49       ` Ludovic Courtès
  0 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-06-01 23:48 UTC (permalink / raw)
  To: Leo Famulari, zimoun; +Cc: Ludovic Courtès, 39258

[-- Attachment #1: Type: text/plain, Size: 415 bytes --]


> I also recommend taking a look at the Boyer Moore string search
> implementation in (guix build grafts).

Nice, I didn't know Guix had an implementation of Boyer Moore. I'll take
a look at it. At the very least, I need something similar for
guile-email.

But, the current implementation of guile's string-contains is in C. So,
I assume a KMP or Boyer Moore implementation of string-contains should
also be in C.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] KMP string search algorithm?
  2020-06-01 23:48     ` Arun Isaac
@ 2020-06-02  8:49       ` Ludovic Courtès
  0 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-06-02  8:49 UTC (permalink / raw)
  To: Arun Isaac; +Cc: zimoun, 39258, Leo Famulari

Arun Isaac <arunisaac@systemreboot.net> skribis:

>> I also recommend taking a look at the Boyer Moore string search
>> implementation in (guix build grafts).
>
> Nice, I didn't know Guix had an implementation of Boyer Moore. I'll take
> a look at it. At the very least, I need something similar for
> guile-email.
>
> But, the current implementation of guile's string-contains is in C. So,
> I assume a KMP or Boyer Moore implementation of string-contains should
> also be in C.

Not necessarily.  But it’d be great to have it in Guile proper, for sure!

Ludo’.




^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH 1/4] ui: Cut off search early if any regexp does not match.
  2020-06-01  0:00   ` [bug#39258] [PATCH 1/4] ui: Cut off search early if any regexp does not match Arun Isaac
@ 2020-06-09  8:29     ` Ludovic Courtès
  0 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-06-09  8:29 UTC (permalink / raw)
  To: Arun Isaac; +Cc: 39258

Hi Arun,

Arun Isaac <arunisaac@systemreboot.net> skribis:

> * guix/ui.scm (relevance): When one of the regexps does not match, cut off
> early and return 0. Do not try to match the remaining regexps.

Good catch, LGTM!

> diff --git a/guix/ui.scm b/guix/ui.scm
> index ea5f460865..4a22358963 100644
> --- a/guix/ui.scm
> +++ b/guix/ui.scm
> @@ -1519,11 +1519,16 @@ score, the more relevant OBJ is to REGEXPS."
>                      (+ relevance (* weight (apply + (map score-regexp lst)))))))))
>              0 metrics)))
>  
> -  (let ((scores (map regexp->score regexps)))
> -    ;; Return zero if one of REGEXPS doesn't match.
> -    (if (any zero? scores)
> -        0
> -        (reduce + 0 scores))))
> +  (let loop ((regexps regexps)
> +             (total-score 0))
> +    (match regexps
> +      ((head . tail)
> +       (let ((score (regexp->score head)))
> +         ;; Return zero if one of PATTERNS doesn't match.
> +         (cond
> +          ((zero? score) 0)
> +          (else (loop tail (+ total-score score))))))

You can use ‘if’ since there are only two arms.
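
The early cut-off with ‘if’ can be sketched as follows.  This is a
stand-alone illustration, not the patch itself: 'total-relevance' and its
SCORE argument are hypothetical stand-ins for the loop and for the inner
'regexp->score' procedure of 'relevance', and the empty-list clause is an
assumption since the quoted hunk stops before it.

```scheme
(use-modules (ice-9 match))

;; Hypothetical stand-alone version of the loop: SCORE maps one search
;; term to a number, TERMS is the list of search terms.
(define (total-relevance score terms)
  (let loop ((terms terms)
             (total 0))
    (match terms
      (() total)                          ;assumed: all terms matched
      ((head . tail)
       (let ((s (score head)))
         (if (zero? s)
             0                            ;cut off: skip remaining terms
             (loop tail (+ total s))))))))
```

With this shape, a term that scores zero means SCORE is never called on
the terms after it.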

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings.
  2020-06-01  0:00   ` [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings Arun Isaac
@ 2020-06-09  8:33     ` Ludovic Courtès
  2020-06-09  9:55       ` zimoun
  2020-06-13 12:37       ` Arun Isaac
  0 siblings, 2 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-06-09  8:33 UTC (permalink / raw)
  To: Arun Isaac; +Cc: 39258

Arun Isaac <arunisaac@systemreboot.net> skribis:

> * guix/scripts/package.scm (process-query): Make search query a regexp only if
> it is not a literal search string.
> * guix/ui.scm (relevance): Use string matching with literal search strings and
> regexp matching with regexp search strings.

How does this affect performance?

I would expect the regexp engine in libc to do something similar
internally, so I wonder if the extra work in Scheme pays off.
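
For context, the idea behind this patch can be sketched as below.  It is
a minimal illustration under stated assumptions, not the patch: the names
'%regexp-special-chars', 'literal-search-string?' and 'make-matcher' are
hypothetical, and case-insensitive matching is assumed for both paths.

```scheme
(use-modules (ice-9 regex))

;; Characters with a special meaning in POSIX extended regexps
;; (assumed list, for illustration).
(define %regexp-special-chars
  (string->char-set ".*+?[](){}^$|\\"))

;; Hypothetical predicate: #t when STR has no regexp special character.
(define (literal-search-string? str)
  (not (string-index str %regexp-special-chars)))

;; Hypothetical constructor: return a predicate telling whether QUERY
;; matches a string, using plain substring search for literal queries
;; and the regexp engine otherwise.
(define (make-matcher query)
  (if (literal-search-string? query)
      (let ((query (string-downcase query)))
        (lambda (text)
          (and (string-contains (string-downcase text) query) #t)))
      (let ((rx (make-regexp query regexp/icase)))
        (lambda (text)
          (and (regexp-exec rx text) #t)))))
```

Whether the extra dispatch pays off depends on what the regexp engine
already does for literal patterns, which is the question raised here.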

Ludo’.




^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH 3/4] ui: Do not translate package synopsis a second time.
  2020-06-01  0:00   ` [bug#39258] [PATCH 3/4] ui: Do not translate package synopsis a second time Arun Isaac
@ 2020-06-09  8:33     ` Ludovic Courtès
  0 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-06-09  8:33 UTC (permalink / raw)
  To: Arun Isaac; +Cc: 39258

Arun Isaac <arunisaac@systemreboot.net> skribis:

> * guix/ui.scm (package->recutils): package-synopsis-string already returns a
> translated string. Do not attempt to translate it again.

LGTM!




^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH 4/4] ui: Use package-description-string.
  2020-06-01  0:00   ` [bug#39258] [PATCH 4/4] ui: Use package-description-string Arun Isaac
@ 2020-06-09  8:34     ` Ludovic Courtès
  0 siblings, 0 replies; 126+ messages in thread
From: Ludovic Courtès @ 2020-06-09  8:34 UTC (permalink / raw)
  To: Arun Isaac; +Cc: 39258

Arun Isaac <arunisaac@systemreboot.net> skribis:

> * guix/ui.scm (package->recutils): Use package-description-string instead of
> package-description and P_.

LGTM, thank you!




^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings.
  2020-06-09  8:33     ` Ludovic Courtès
@ 2020-06-09  9:55       ` zimoun
  2020-06-13 12:37       ` Arun Isaac
  1 sibling, 0 replies; 126+ messages in thread
From: zimoun @ 2020-06-09  9:55 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Arun Isaac, 39258

On Tue, 9 Jun 2020 at 10:34, Ludovic Courtès <ludo@gnu.org> wrote:
> Arun Isaac <arunisaac@systemreboot.net> skribis:
>
> > * guix/scripts/package.scm (process-query): Make search query a regexp only if
> > it is not a literal search string.
> > * guix/ui.scm (relevance): Use string matching with literal search strings and
> > regexp matching with regexp search strings.
>
> How does this affect performance?

On my machine, it changes nothing.

I have even applied the patches of the series one by one to see the
effect on timing and I do not see an improvement.  Below is an email that
I started but never completed. :-)
However, it seems to be The Right Thing to do. :-)

All the best,
simon

--

Here is a quick benchmark, because after reading the code, I was not
convinced by the improvement. :-)
About the cut-off, the optimization should be hard to see because the
bottleneck is elsewhere. And I was doubtful about the string literal
but who knows. :-)

And to compare apples to apples, the patch set is rebased onto
a357849f5b as all the others.

The cache is warmed by running "guix search foo".


* Cut-off [PATCH 1/4]

The first patch: cut off i.e., finer implementation of '(map
regexp->score regexps)'.

** Query: crypto library

The query used is:

   guix search crypto library | recsel -P name | grep libb2

| cache | default  | v5       |
|-------+----------+----------|
| cold  | 0m2.083s | 0m2.292s |
| warm  | 0m1.404s | 0m1.470s |

And for another data point on the same query, see [1]:

| cache | default (real) |
|-------+----------------|
| cold  | 0m2.216s       |
| warm  | 0m1.197s       |

[1] http://issues.guix.gnu.org/issue/39258#78


** Query: strategy game

Using the query:

   guix search strategy game | recsel -P name | grep julius

| cache | default  | v5       |
|-------+----------+----------|
| cold  | 0m2.006s | 0m2.165s |
| warm  | 0m1.253s | 0m1.081s |


* String literal [PATCH 2/4] (+cut-off)

| cache | strategy game | crypto library |
|-------+---------------+----------------|
| cold  | 0m2.110s      | 0m2.246s       |
| warm  | 0m1.058s      | 0m1.217s       |




^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings.
  2020-06-09  8:33     ` Ludovic Courtès
  2020-06-09  9:55       ` zimoun
@ 2020-06-13 12:37       ` Arun Isaac
  2020-06-13 13:36         ` zimoun
  2020-06-13 19:32         ` Ludovic Courtès
  1 sibling, 2 replies; 126+ messages in thread
From: Arun Isaac @ 2020-06-13 12:37 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 39258, zimoun

[-- Attachment #1: Type: text/plain, Size: 1094 bytes --]


>> * guix/scripts/package.scm (process-query): Make search query a regexp only if
>> it is not a literal search string.
>> * guix/ui.scm (relevance): Use string matching with literal search strings and
>> regexp matching with regexp search strings.
>
> How does this affect performance?

See my results from earlier.

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env guix search game

real	0m2.261s
user	0m2.351s
sys	0m0.104s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
time guix search game

real	0m2.661s
user	0m2.843s
sys	0m0.080s
--8<---------------cut here---------------end--------------->8---

> I would expect the regexp engine in libc to do something similar
> internally, so I wonder if the extra work in Scheme pays off.

I agree it would be better to do this optimization in the regexp engine, if
it doesn't do it already.

So, shall I push the remaining patches (patches 1, 3, 4) after applying
the change you suggested for patch 1 (use of if versus cond)?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings.
  2020-06-13 12:37       ` Arun Isaac
@ 2020-06-13 13:36         ` zimoun
  2020-06-13 17:21           ` Arun Isaac
  2020-06-13 19:32         ` Ludovic Courtès
  1 sibling, 1 reply; 126+ messages in thread
From: zimoun @ 2020-06-13 13:36 UTC (permalink / raw)
  To: Arun Isaac, Ludovic Courtès; +Cc: 39258

Dear Arun,

On Sat, 13 Jun 2020 at 18:07, Arun Isaac <arunisaac@systemreboot.net> wrote:

>> How does this affect performance?
>
> See my results from earlier.
>
> --8<---------------cut here---------------start------------->8---
> time ./pre-inst-env guix search game
>
> real	0m2.261s
> user	0m2.351s
> sys	0m0.104s
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> time guix search game
>
> real	0m2.661s
> user	0m2.843s
> sys	0m0.080s
> --8<---------------cut here---------------end--------------->8---

I confirm that it changes nothing.  See [1].

1: http://issues.guix.gnu.org/39258#112


On the other hand, on my machine I do not see any timing improvement with
this patch set.  Could you check that you are comparing apples to apples?


All the best,
simon





^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings.
  2020-06-13 13:36         ` zimoun
@ 2020-06-13 17:21           ` Arun Isaac
  2020-06-14 19:14             ` zimoun
  0 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2020-06-13 17:21 UTC (permalink / raw)
  To: zimoun, Ludovic Courtès; +Cc: 39258

[-- Attachment #1: Type: text/plain, Size: 931 bytes --]


> I confirm that it changes nothing.  See [1].
>
> 1: http://issues.guix.gnu.org/39258#112

Yes, I did read your earlier mail. And, I tried again, this time with
patch 1 alone. It certainly makes a difference on my machine. It is
clear from the code logic that it should make a difference on your
machine as well, at least for longer queries. But, somehow it isn't and
I do not understand why. :-(

Here are more fresh results. Could you try for longer queries like
"strategy game caesar" and without the output being piped to recsel,
grep, etc.? For simplicity, let's talk only about warm cache results.

|----------------------------------+--------+-------|
| query                            | before | after |
|----------------------------------+--------+-------|
| guix search strategy game        |   2.58 |  1.96 |
| guix search strategy game caesar |   2.95 |  1.76 |
|----------------------------------+--------+-------|

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings.
  2020-06-13 12:37       ` Arun Isaac
  2020-06-13 13:36         ` zimoun
@ 2020-06-13 19:32         ` Ludovic Courtès
  2020-06-15 20:18           ` Arun Isaac
  1 sibling, 1 reply; 126+ messages in thread
From: Ludovic Courtès @ 2020-06-13 19:32 UTC (permalink / raw)
  To: Arun Isaac; +Cc: 39258, zimoun

Hi,

Arun Isaac <arunisaac@systemreboot.net> skribis:

>>> * guix/scripts/package.scm (process-query): Make search query a regexp only if
>>> it is not a literal search string.
>>> * guix/ui.scm (relevance): Use string matching with literal search strings and
>>> regexp matching with regexp search strings.
>>
>> How does this affect performance?

(To be clear, I’m referring specifically to this patch.)

> See my results from earlier.
>
> time ./pre-inst-env guix search game
>
> real	0m2.261s
> user	0m2.351s
> sys	0m0.104s
>
> time guix search game
>
> real	0m2.661s
> user	0m2.843s
> sys	0m0.080s
>
>> I would expect the regexp engine in libc to do something similar
>> internally, so I wonder if the extra work in Scheme pays off.
>
> I agree it would be better to do this optimization in the regexp engine, if
> it doesn't do it already.

Yeah.  I feel like we shouldn’t have to do this, so I’d lean towards
excluding this patch from the series.  It’s too early to be confident
about it, but it might be something as discussed in
<https://lists.gnu.org/archive/html/guile-user/2020-06/msg00038.html>,
i.e., a problem to solve at the Guile level.

> So, shall I push the remaining patches (patches 1, 3, 4) after applying
> the change you suggested for patch 1 (use of if versus cond)?

Yes, definitely!

Thank you,
Ludo’.




^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings.
  2020-06-13 17:21           ` Arun Isaac
@ 2020-06-14 19:14             ` zimoun
  0 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2020-06-14 19:14 UTC (permalink / raw)
  To: Arun Isaac, Ludovic Courtès; +Cc: 39258

Dear Arun,

Here, I am speaking about only the first patch: the cut-off.

TL;DR:
 1. I was wrong about the bottleneck.
 2. The queries were not the right ones to see a clear effect
    -- on my machine.
    

On Sat, 13 Jun 2020 at 22:51, Arun Isaac <arunisaac@systemreboot.net> wrote:

> Yes, I did read your earlier mail. And, I tried again, this time with
> patch 1 alone. It certainly makes a difference on my machine. It is
> clear from the code logic that it should make a difference on your
> machine as well, at least for longer queries. But, somehow it isn't and
> I do not understand why. :-(

Well, I spent some hours* doing some stats (Student's t-test).  Roughly
speaking, on my machine, the standard deviation (stddev) hides the
effect -- depending on the query -- and that's why I am not always seeing
the improvement, I guess.

*ah all my Sunday in fact. ;-)


I compared different conditions for the query "game strategy":

 - cold    vs warm
 - xterm   vs shell in Emacs (my config vs -q)
 - no pipe vs pipe

And I ran each experiment 10 times in a row.  The conclusion is: on
average -- on my machine -- the cut-off improves the timing.  But sometimes,
considering only 3 repeats in a row, the improvement is not obvious (on
the mean), because both tails of the distributions overlap a bit on my
machine and so it is kind of bad luck.  And it is ``worse'' depending
on which commit your patch is rebased onto: a357849 (old) vs e782756.

The t-test captures this variation, even with only 3 repeats, but I did
not do it in my previous email and only compared the visible means.  Sorry
about that.
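
The comparison of repeated timings described above can be sketched with a
t statistic.  This is a minimal sketch assuming Welch's unequal-variance
form (the message does not say which variant was used); the procedure
names are hypothetical and the timings in the usage comment are made-up
placeholders, not measurements from this thread.

```scheme
;; Arithmetic mean of the list of numbers XS.
(define (mean xs)
  (/ (apply + xs) (length xs)))

;; Unbiased sample variance of XS; needs at least two elements.
(define (sample-variance xs)
  (let ((m (mean xs)))
    (/ (apply + (map (lambda (x) (expt (- x m) 2)) xs))
       (- (length xs) 1))))

;; Welch's t statistic for two samples of timings XS and YS.
;; Both samples must have non-zero variance, otherwise this divides
;; by zero.
(define (welch-t xs ys)
  (/ (- (mean xs) (mean ys))
     (sqrt (+ (/ (sample-variance xs) (length xs))
              (/ (sample-variance ys) (length ys))))))

;; Usage with placeholder timings in seconds (before vs after):
;; (welch-t '(1.25 1.31 1.28) '(1.08 1.12 1.10))
```

A large absolute value means the two means differ by much more than the
run-to-run noise; turning it into a confidence level additionally needs
the t distribution, which is left out here.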

Moreover, printing increases the stddev, so the results fluctuate more
inside Emacs than in xterm, and piping helps in this case.

Piping does not change the final result -- hopefully. :-)  It adds some
extra time but on average it is the same.

About cold vs warm cache, I notice that the improvement is not the same
(on average).  Considering the raw time, there is a difference of about 10%
(with "good" confidence); it could be worth understanding why.


Well, considering that, I did other stats with other queries and the
conclusion for my machine is that *the patch improves* on average by
reducing the timing for typical usages.  Which is really cool! :-)


I was definitely wrong about the bottleneck and this could be
one.  One way to get an idea is to use "statprof" but it is hard for me
to read the results (I believe Guile master has a fix improving the
'anon #addr', but I do not really know more).

--8<---------------cut here---------------start------------->8---
$ /tmp/v5-1/bin/guix repl
scheme@(guix-user)> ,use(guix scripts search)
scheme@(guix-user)> ,pr (guix-search "game" "strategy")
%     cumulative   self             
time   seconds     seconds  procedure
 17.81      0.29      0.27  anon #xe40178
 12.33      0.20      0.18  ice-9/boot-9.scm:2201:0:%load-announce
 12.33      0.18      0.18  anon #xe3c770
  5.48      0.08      0.08  ice-9/boot-9.scm:1396:0:symbol-append
  4.11      1.57      0.06  guix/memoization.scm:100:0
  4.11      0.06      0.06  ice-9/popen.scm:145:0:reap-pipes
  2.74      0.55      0.04  guix/ui.scm:1511:12
  2.74      0.33      0.04  ice-9/regex.scm:170:0:fold-matches
  2.74      0.04      0.04  ice-9/boot-9.scm:3540:0:autoload-done-or-in-progress?
  2.74      0.04      0.04  texinfo/string-utils.scm:98:5
  2.74      0.04      0.04  ice-9/vlist.scm:539:0:vhash-assq
  1.37     69.81      0.02  ice-9/threads.scm:388:4
[...]
---
Sample count: 73
Total time: 1.490955132 seconds (0.387756476 seconds in GC)
--8<---------------cut here---------------end--------------->8---

To compare with the default:

--8<---------------cut here---------------start------------->8---
time   seconds     seconds  procedure
 24.47      0.49      0.46  anon #x1d89178
 21.28      0.40      0.40  anon #x1d85770
  9.57      0.20      0.18  ice-9/boot-9.scm:2201:0:%load-announce
  3.19      4.71      0.06  ice-9/boot-9.scm:1673:4:with-exception-handler
  3.19      1.64      0.06  guix/memoization.scm:100:0
  3.19      0.06      0.06  ice-9/boot-9.scm:3540:0:autoload-done-or-in-progress?
  3.19      0.06      0.06  anon #x1d84c78
  3.19      0.06      0.06  ice-9/popen.scm:145:0:reap-pipes
  2.13      1.01      0.04  guix/ui.scm:1511:12
  2.13      0.08      0.04  ice-9/boot-9.scm:1396:0:symbol-append
  2.13      0.04      0.04  anon #x1d83248
  1.06      0.30      0.02  anon #x7f057e6c90e8
[...]
--8<---------------cut here---------------end--------------->8---

So clearly the patch has an effect!  If someone knows what is:

 - ice-9/boot-9.scm:2201:0:%load-announce
 - ice-9/boot-9.scm:1396:0:symbol-append
 
and where they come from, it could help. :-)

Well, I am interested to know which part is the Regex Engine and the
string search. :-) Linking to the discussion about KMP and others.


> Here are more fresh results. Could you try for longer queries like
> "strategy game caesar" and without the output being piped to recsel,
> grep, etc.? For simplicity, let's talk only about warm cache results.
>
> |----------------------------------+--------+-------|
> | query                            | before | after |
> |----------------------------------+--------+-------|
> | guix search strategy game        |   2.58 |  1.96 |
> | guix search strategy game caesar |   2.95 |  1.76 |
> |----------------------------------+--------+-------|

At first, I was confused about why one more term returns faster.  This is
because the query "caesar" returns only one package so the query
"strategy game caesar" cuts off all the packages when searching the
terms "game" and then "strategy".  I mean

   guix search julius

should take as long as

   guix search strategy game caesar

It does; on average on my machine.

And secondly, I was confused because the timing of the query "caesar
strategy game" is almost the same (2.8% +/- 2.5% with 99.0% of
confidence; 10 repeats).  Well, it is because in one case the term
"caesar" is applied to 15 packages and in another case the terms
"strategy" and "game" are applied to 1 package.  Adding some stddev
error and not enough repeats (nor good stats), the confusion was complete
and my conclusion was wrong.


That said, the effect of the cut-off is clear (on my machine even with
one shot) with the queries:

  - game strategy the
  - the game strategy


Thank you,
simon





^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings.
  2020-06-13 19:32         ` Ludovic Courtès
@ 2020-06-15 20:18           ` Arun Isaac
  0 siblings, 0 replies; 126+ messages in thread
From: Arun Isaac @ 2020-06-15 20:18 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 39258, zimoun

[-- Attachment #1: Type: text/plain, Size: 1072 bytes --]


>> So, shall I push the remaining patches (patches 1, 3, 4) after applying
>> the change you suggested for patch 1 (use of if versus cond)?
>
> Yes, definitely!

Done!

>>>> * guix/scripts/package.scm (process-query): Make search query a regexp only if
>>>> it is not a literal search string.
>>>> * guix/ui.scm (relevance): Use string matching with literal search strings and
>>>> regexp matching with regexp search strings.
>>>
>>> How does this affect performance?
>
> (To be clear, I’m referring specifically to this patch.)

Oh, I misunderstood. Here are the results specifically comparing patch 2
against the latest master (that includes the patches 1, 3 and 4 I just
pushed). All readings are on a warm cache.

|----------------------------------+--------+-------|
| query                            | before | after |
|----------------------------------+--------+-------|
| guix search strategy game        |    2.1 |   1.7 |
| guix search strategy game caesar |    1.8 |   1.5 |
|----------------------------------+--------+-------|

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances
  2020-01-23 19:51 [bug#39258] Faster guix search using an sqlite cache Arun Isaac
                   ` (7 preceding siblings ...)
  2020-06-01 10:11 ` [bug#39258] KMP string search algorithm? zimoun
@ 2021-07-15  7:33 ` zimoun
  2021-07-15  7:33   ` [bug#39258] [PATCH v6 1/2] DRAFT packages: Add fields to packages cache zimoun
                     ` (2 more replies)
  8 siblings, 3 replies; 126+ messages in thread
From: zimoun @ 2021-07-15  7:33 UTC (permalink / raw)
  To: 39258; +Cc: ludo, arunisaac, zimoun

Hi,

This is an attempt to improve the performance of "guix search".  It is still
half-baked but it allows us to discuss further the idea of expanding the
current '/lib/guix/package.cache' and keeps an IRL discussion from being
forgotten. ;-)

Let's start with what needs to be improved: the part when the cache is not
authoritative.  It is slower than the current approach because the package is
read twice, i.e., the module is indeed loaded twice, once by
'fold-available-packages' via 'fold-module-public-variables*' and then again
by 'find-packages-by-description' via 'read-package-from'.  The issue is to
have a common interface for both cases (cache and no-cache).  More thought
is required. ;-)

Then, using the cache is slower than expected.  Therefore, something is maybe
twisted -- quick implementation before holidays ;-) -- with the use of
'fold-available-packages' as proposed by Ludo [1].  Note that another
'fold-packages' (say 'fold-packages*') using the new cache should be used
instead, as was done with v4, where the performance was as expected:

   <http://issues.guix.gnu.org/39258#89>

1: <http://issues.guix.gnu.org/39258#93>

From my understanding, the issue is that 'package-relevance' accepts a 'package'
(and then all the chain until displaying) and 'fold-available-packages' does
not return a package.  Well, I do not know; especially where to put something
similar to 'read-package-from'.


To test, after applying the patches, the command is:

   ./pre-inst-env guix pull --allow-downgrades --disable-authentication \
          --url=$(pwd) --branch=search-v6 -p /tmp/new


Let's compare only for cold cache and time this cache building (Guix 7db8fd6):

  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time guix build --check $(guix gc --derivers $(readlink -f ~/.config/guix/current/lib/guix/package.cache))

  real	0m28,848s
  user	0m1,481s
  sys	0m0,252s

  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time guix build --check $(guix gc --derivers $(readlink -f /tmp/new/lib/guix/package.cache))

  real	0m40,279s
  user	0m1,582s
  sys	0m0,232s

It seems longer but compared to the time of "guix pull" completion, it seems
acceptable.  However, maybe it could become an issue when running a lot of
"guix time-machine"... Well, hard trade-off. ;-)

Let's compare for some queries:

--8<---------------cut here---------------start------------->8---
  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time guix search game | recsel -C -P name | wc -l
  371

  real	0m7,561s
  user	0m3,525s
  sys	0m0,391s

  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time /tmp/new/bin/guix search game | recsel -C -P name | wc -l
  371

  real	0m9,814s
  user	0m3,240s
  sys	0m0,363s
--8<---------------cut here---------------end--------------->8---


  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time guix search strategy game | recsel -C -P name | wc -l
  16

  real	0m8,565s
  user	0m2,803s
  sys	0m0,430s

  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time /tmp/new/bin/guix search strategy game | recsel -C -P name | wc -l
  16

  real	0m9,679s
  user	0m2,370s
  sys	0m0,334s


--8<---------------cut here---------------start------------->8---
  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time guix search strategy game caesar | recsel -C -P name | wc -l
  0

  real	0m8,307s
  user	0m2,388s
  sys	0m0,366s

  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time /tmp/new/bin/guix search strategy game caesar | recsel -C -P name | wc -l
  0

  real	0m3,626s
  user	0m0,948s
  sys	0m0,101s
--8<---------------cut here---------------end--------------->8---


  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time guix search game strategy the | recsel -C -P name | wc -l
  4

  real	0m8,776s
  user	0m2,903s
  sys	0m0,454s

  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time /tmp/new/bin/guix search game strategy the | recsel -C -P name | wc -l
  4

  real	0m9,495s
  user	0m2,546s
  sys	0m0,313s


--8<---------------cut here---------------start------------->8---
  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time guix search the game strategy | recsel -C -P name | wc -l
  4

  real	0m8,502s
  user	0m2,534s
  sys	0m0,388s

  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time /tmp/new/bin/guix search the game strategy | recsel -C -P name | wc -l
  4

  real	0m9,508s
  user	0m2,254s
  sys	0m0,363s
--8<---------------cut here---------------end--------------->8---


  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time guix search crypto library | recsel -C -P name | grep libb2
  libb2

  real	0m8,744s
  user	0m2,875s
  sys	0m0,374s

  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time /tmp/new/bin/guix search crypto library | recsel -C -P name | grep libb2
  libb2

  real	0m9,229s
  user	0m2,448s
  sys	0m0,397s


--8<---------------cut here---------------start------------->8---
  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time guix search cuirass integration | recsel -C -P name | wc -l
  1

  real	0m8,132s
  user	0m2,343s
  sys	0m0,407s

  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time /tmp/new/bin/guix search cuirass integration | recsel -C -P name | wc -l
  1

  real	0m8,940s
  user	0m2,036s
  sys	0m0,369s
--8<---------------cut here---------------end--------------->8---


  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time guix search cuirass | recsel -C -P name | wc -l
  2

  real	0m8,240s
  user	0m2,461s
  sys	0m0,367s

  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time /tmp/new/bin/guix search cuirass | recsel -C -P name | wc -l
  2

  real	0m8,863s
  user	0m2,019s
  sys	0m0,377s


--8<---------------cut here---------------start------------->8---
  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time guix search cuirass integration foo | recsel -C -P name | wc -l
  0

  real	0m8,258s
  user	0m2,418s
  sys	0m0,521s

  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time /tmp/new/bin/guix search cuirass integration foo | recsel -C -P name | wc -l
  0

  real	0m3,358s
  user	0m0,867s
  sys	0m0,139s
--8<---------------cut here---------------end--------------->8---

This last example, where nothing matches and 'read-package-from' is therefore
never called, suggests that 'read-package-from' is the bottleneck.


(On a side note, maybe I am doing something wrong, but I see no improvement
from the recent introduction of 'cut' for multi-term queries such as "the game
strategy" and "game strategy the".  Another story. :-))


When the cache is not authoritative, it is worse, as expected:

  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  time /tmp/new/bin/guix search -L /tmp/my-pkgs cuirass integration foo | recsel -C -P name | wc -l
  0

  real	0m12,503s
  user	0m7,807s
  sys	0m0,529s

and note that the performance of the current "guix search" is the same in both
cases (authoritative and not authoritative); see the previous timings.


Last, two minor remarks about previous comments.

1. Ludo commented:

        > Therefore the cache '/lib/guix/package.cache' contains more
        > information.

        This breaks the binary interface, so we’ll have to analyze the impact of
        such a change and devise a strategy.

        <http://issues.guix.gnu.org/39258#93>

and after some checking, this should be fine, IIUC.  The '--news' is ok
because of '#:allow-other-keys'.  And other parts are modified accordingly.
Guix revision N creates a cache that Guix revision N+1 will read but it should
not be an issue; see 'inferior-available-packages'.

2. And Ludo wrote:

        I realize the other cache also has that problem, but it would be nice to add a
        version tag to the cache.  Basically emit something like:

          (package-metadata-cache (version 0) VECTOR …)

        instead of just:

          (VECTOR …)

        <http://issues.guix.gnu.org/39258#93>

which is, after discussion, not necessary.  Versioning does not make sense
here because the cache is read by the Guix that generates it.  Therefore,
specifying a version is superfluous here.
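
For reference, reading a cache tagged as in the quoted suggestion could look
like the sketch below.  It is only an illustration in plain Guile: the name
'cache-entries' and the two-field toy vectors are hypothetical, and nothing
like this is part of the patch.

```scheme
(use-modules (ice-9 match))

;; Return the list of entries from a loaded cache value, accepting both the
;; tagged layout and the current untagged one.
(define (cache-entries cache)
  (match cache
    (('package-metadata-cache ('version 0) entries ...)
     entries)
    (('package-metadata-cache ('version n) _ ...)
     (error "unsupported package cache version:" n))
    ((entries ...)
     entries)))

;; Toy data; real entries carry many more fields.
(define tagged   '(package-metadata-cache (version 0) #("hello" "2.10")))
(define untagged '(#("hello" "2.10")))

;; Both calls return the same one-element list of entries.
(cache-entries tagged)
(cache-entries untagged)
```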


Comments are welcome for this work-in-progress. ;-)

Cheers,
simon


zimoun (2):
  DRAFT packages: Add fields to packages cache.
  DRAFT scripts: package: Use cache in 'find-packages-by-description'.

 gnu/packages.scm         | 52 +++++++++++++++++++++++++++-------------
 guix/scripts/package.scm | 46 +++++++++++++++++++++++++----------
 2 files changed, 70 insertions(+), 28 deletions(-)


base-commit: 4196087f3d6fc254db5b4c47658e5679c835516f
-- 
2.32.0






* [bug#39258] [PATCH v6 1/2] DRAFT packages: Add fields to packages cache.
  2021-07-15  7:33 ` [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances zimoun
@ 2021-07-15  7:33   ` zimoun
  2021-07-17  8:31     ` Arun Isaac
  2021-07-15  7:33   ` [bug#39258] [PATCH v6 2/2] DRAFT scripts: package: Use cache in 'find-packages-by-description' zimoun
  2021-07-23 15:43   ` [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances Ludovic Courtès
  2 siblings, 1 reply; 126+ messages in thread
From: zimoun @ 2021-07-15  7:33 UTC (permalink / raw)
  To: 39258; +Cc: ludo, arunisaac, zimoun

* gnu/packages.scm (generate-package-cache)[expand-cache]: Add synopsis and
description.
(load-package-cache, find-packages-by-names, find-packages-locations): Adapt
accordingly.
(fold-available-packages): Add synopsis, description, module and symbol when
cache is authoritative.  Replace 'fold-packages' by
'fold-module-public-variables*' when cache is not authoritative.
---
 gnu/packages.scm | 52 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 16 deletions(-)

diff --git a/gnu/packages.scm b/gnu/packages.scm
index ccfc83dd11..34c6d73b86 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -4,6 +4,7 @@
 ;;; Copyright © 2014 Eric Bavier <bavier@member.fsf.org>
 ;;; Copyright © 2016, 2017 Alex Kost <alezost@gmail.com>
 ;;; Copyright © 2016 Mathieu Lirzin <mthl@gnu.org>
+;;; Copyright © 2021 Simon Tournier <zimon.toutoune@gmail.com>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -211,28 +212,45 @@ package module."
       (vhash-fold (lambda (name vector result)
                     (match vector
                       (#(name version module symbol outputs
+                              synopsis description
                               supported? deprecated?
                               file line column)
                        (proc name version result
                              #:outputs outputs
+                             #:synopsis synopsis
+                             #:description description
                              #:location (and file
                                              (location file line column))
+                             #:module module
+                             #:symbol symbol
                              #:supported? supported?
                              #:deprecated? deprecated?))))
                   init
                   cache)
-      (fold-packages (lambda (package result)
-                       (proc (package-name package)
-                             (package-version package)
-                             result
-                             #:outputs (package-outputs package)
-                             #:location (package-location package)
-                             #:supported?
-                             (->bool (supported-package? package))
-                             #:deprecated?
-                             (->bool
-                              (package-superseded package))))
-                     init)))
+      (fold-module-public-variables*
+       (lambda (module symbol variable result)
+         (let ((package (false-if-exception
+                         (variable-ref variable))))
+           (if (package? package)
+               (proc (package-name package)
+                     (package-version package)
+                     result
+                     #:outputs (package-outputs package)
+                     #:synopsis (package-synopsis package)
+                     #:description (package-description package)
+                     #:location (package-location package)
+                     #:module (module-name module)
+                     #:symbol symbol
+                     #:supported?
+                     (->bool (supported-package? package))
+                     #:deprecated?
+                     (->bool
+                      (package-superseded package)))
+               result)))
+       init
+       (all-modules (%package-module-path)
+                    #:warn
+                    warn-about-load-error))))
 
 (define* (fold-packages proc init
                         #:optional
@@ -268,6 +286,7 @@ package names.  Return #f on failure."
            (fold (lambda (item vhash)
                    (match item
                      (#(name version module symbol outputs
+                             synopsis description
                              supported? deprecated?
                              file line column)
                       (vhash-cons name item vhash))))
@@ -316,7 +335,7 @@ decreasing version order."
   (if (and (cache-is-authoritative?) cache)
       (match (cache-lookup cache name)
         (#f #f)
-        ((#(_ versions modules symbols _ _ _ _ _ _) ...)
+        ((#(_ versions modules symbols _ _ _ _ _ _ _ _) ...)
          (fold (lambda (version* module symbol result)
                  (if (or (not version)
                          (version-prefix? version version*))
@@ -337,9 +356,8 @@ matching NAME and VERSION."
   (if (and cache (cache-is-authoritative?))
       (match (cache-lookup cache name)
         (#f '())
-        ((#(name versions modules symbols outputs
-                 supported? deprecated?
-                 files lines columns) ...)
+        ((#(_ versions _ _ _ _ _ _ _
+              files lines columns) ...)
          (fold (lambda (version* file line column result)
                  (if (and file
                           (or (not version)
@@ -393,6 +411,8 @@ reducing the memory footprint."
                             ,(module-name module)
                             ,symbol
                             ,(package-outputs package)
+                            ,(package-synopsis package)
+                            ,(package-description package)
                             ,(->bool (supported-package? package))
                             ,(->bool (package-superseded package))
                             ,@(let ((loc (package-location package)))
-- 
2.32.0






* [bug#39258] [PATCH v6 2/2] DRAFT scripts: package: Use cache in 'find-packages-by-description'.
  2021-07-15  7:33 ` [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances zimoun
  2021-07-15  7:33   ` [bug#39258] [PATCH v6 1/2] DRAFT packages: Add fields to packages cache zimoun
@ 2021-07-15  7:33   ` zimoun
  2021-07-23 15:43   ` [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances Ludovic Courtès
  2 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2021-07-15  7:33 UTC (permalink / raw)
  To: 39258; +Cc: ludo, arunisaac, zimoun

* guix/scripts/package.scm (find-packages-by-description): Replace
'fold-packages' by 'fold-available-packages'.
---
 guix/scripts/package.scm | 46 +++++++++++++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 12 deletions(-)

diff --git a/guix/scripts/package.scm b/guix/scripts/package.scm
index 694959d326..ff2aed6905 100644
--- a/guix/scripts/package.scm
+++ b/guix/scripts/package.scm
@@ -8,7 +8,7 @@
 ;;; Copyright © 2016 Chris Marusich <cmmarusich@gmail.com>
 ;;; Copyright © 2019 Tobias Geerinckx-Rice <me@tobias.gr>
 ;;; Copyright © 2020 Ricardo Wurmus <rekado@elephly.net>
-;;; Copyright © 2020 Simon Tournier <zimon.toutoune@gmail.com>
+;;; Copyright © 2020, 2021 Simon Tournier <zimon.toutoune@gmail.com>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -181,17 +181,39 @@ hooks\" run when building the profile."
   "Return a list of pairs: packages whose name, synopsis, description,
 or output matches at least one of REGEXPS sorted by relevance, and its
 non-zero relevance score."
-  (let ((matches (fold-packages (lambda (package result)
-                                  (if (package-superseded package)
-                                      result
-                                      (match (package-relevance package
-                                                                regexps)
-                                        ((? zero?)
-                                         result)
-                                        (score
-                                         (cons (cons package score)
-                                               result)))))
-                                '())))
+  (define (read-package-from module symbol)
+    (module-ref (resolve-interface module) symbol))
+
+  (let ((matches (fold-available-packages
+                  (lambda* (name version result
+                                 #:key outputs description synopsis location
+                                 module symbol
+                                 deprecated?
+                                 #:allow-other-keys)
+                    (if deprecated?
+                        result
+                        (let* ((package
+                                 (package
+                                   (name name)
+                                   (version version)
+                                   (source #f)
+                                   (build-system #f)
+                                   (outputs outputs)
+                                   (synopsis synopsis)
+                                   (description description)
+                                   (home-page #f)
+                                   (license #f)
+                                   (location location))))
+                          (match (package-relevance package
+                                                    regexps)
+                            ((? zero?)
+                             result)
+                            (score
+                             (cons (cons (read-package-from module symbol)
+                                         score)
+                                   result))))))
+                  '())))
+
     (sort matches
           (lambda (m1 m2)
             (match m1
-- 
2.32.0






* [bug#39258] [PATCH v6 1/2] DRAFT packages: Add fields to packages cache.
  2021-07-15  7:33   ` [bug#39258] [PATCH v6 1/2] DRAFT packages: Add fields to packages cache zimoun
@ 2021-07-17  8:31     ` Arun Isaac
  2021-07-23 15:30       ` Ludovic Courtès
  0 siblings, 1 reply; 126+ messages in thread
From: Arun Isaac @ 2021-07-17  8:31 UTC (permalink / raw)
  To: zimoun, 39258; +Cc: ludo, zimoun



Hi Simon,

I understand that one of the things you are trying to do is to have a
common interface for the cache and no-cache cases. To achieve this, I
think fold-available-packages and fold-packages should have the same
function signature. They should both pass a <package> object to
PROC. Currently, fold-packages is passing a <package> object whereas
fold-available-packages is passing the fields of the <package> object as
individual parameters. If fold-packages and fold-available-packages have
the same function signature, then the changes in your [PATCH v6 2/2]
would be way simpler.

Also, why do we need two separate functions---fold-available-packages
and fold-packages? Can't fold-available-packages do everything
fold-packages can and thus totally replace it?

> * gnu/packages.scm (generate-package-cache)[expand-cache]: Add synopsis and
> description.
> (load-package-cache, find-packages-by-names, find-packages-locations): Adapt
> accordingly.

A couple of typos here:

find-packages-by-names -> find-packages-by-name
find-packages-locations -> find-package-locations

Regards,
Arun



* [bug#39258] [PATCH v6 1/2] DRAFT packages: Add fields to packages cache.
  2021-07-17  8:31     ` Arun Isaac
@ 2021-07-23 15:30       ` Ludovic Courtès
  2021-08-17 14:03         ` zimoun
  0 siblings, 1 reply; 126+ messages in thread
From: Ludovic Courtès @ 2021-07-23 15:30 UTC (permalink / raw)
  To: Arun Isaac; +Cc: 39258, zimoun

Hi Arun!

Arun Isaac <arunisaac@systemreboot.net> skribis:

> Also, why do we need two separate functions---fold-available-packages
> and fold-packages? Can't fold-available-packages do everything
> fold-packages can and thus totally replace it?

The initial goal was for ‘fold-available-packages’ to be lightweight.
Currently, it doesn’t allocate anything; instead, it passes info as
keyword parameters, which the callee is free to ignore.  That’s why
these two procedures have different signatures.
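
A minimal sketch of this calling convention, in plain Guile.  The toy cache and
the names 'fold-toy-packages' and 'current-package-names' are hypothetical and
not part of Guix; only the use of keyword parameters mirrors what
‘fold-available-packages’ does.

```scheme
(use-modules (srfi srfi-1))             ;for 'fold'

;; Hypothetical stand-in for the package cache: #(name version deprecated?).
(define %toy-cache
  '(#("hello" "2.10" #f)
    #("old-hello" "1.0" #t)
    #("guile" "3.0.7" #f)))

;; Toy analogue of 'fold-available-packages': fields are handed to PROC as
;; keyword arguments rather than wrapped in a newly built record.
(define (fold-toy-packages proc init)
  (fold (lambda (entry result)
          (proc (vector-ref entry 0) (vector-ref entry 1) result
                #:outputs '("out")
                #:deprecated? (vector-ref entry 2)))
        init
        %toy-cache))

;; The callee names only the keys it cares about; #:allow-other-keys lets it
;; ignore the rest (here #:outputs).
(define (current-package-names)
  (reverse
   (fold-toy-packages
    (lambda* (name version result #:key deprecated? #:allow-other-keys)
      (if deprecated? result (cons name result)))
    '())))
```

Here (current-package-names) evaluates to ("hello" "guile"): the deprecated
entry is skipped and the unused #:outputs key is silently ignored.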

One benchmark is “guix package -A > /dev/null”.  This should take
ideally 0.5s at most because that’s what’s used by shell completion (the
first time); currently it takes 0.82s on my laptop, though.

Thanks,
Ludo’.





* [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances
  2021-07-15  7:33 ` [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances zimoun
  2021-07-15  7:33   ` [bug#39258] [PATCH v6 1/2] DRAFT packages: Add fields to packages cache zimoun
  2021-07-15  7:33   ` [bug#39258] [PATCH v6 2/2] DRAFT scripts: package: Use cache in 'find-packages-by-description' zimoun
@ 2021-07-23 15:43   ` Ludovic Courtès
  2021-08-20 15:42     ` zimoun
  2 siblings, 1 reply; 126+ messages in thread
From: Ludovic Courtès @ 2021-07-23 15:43 UTC (permalink / raw)
  To: zimoun; +Cc: arunisaac, 39258

Hi!

zimoun <zimon.toutoune@gmail.com> skribis:

> This is an attempt to improve the performance of "guix search".  It is still
> half baked but it allows to discuss further the idea about expanding the
> current '/lib/guix/package.cache' and avoids to forget an IRL discussion. ;-)

Thanks for resuming this discussion.  :-)

> From my understanding, the issue that 'package-relevance' accepts a 'package'
> (and then all the chain until displaying) and 'fold-avaibale-packages' does
> not return a package.  Well, I do not know; especially where to put something
> similiar to 'read-package-from'.

Yeah that’s annoying.  Perhaps we need <proto-package> or
<package-metadata>.  With some trickery we could have record type
inheritance or something, maybe.  Dunno.

It would be good if we could arrange so that ‘fold-available-packages’
doesn’t allocate anything though.

> Let compare only for cold cache and time this cache building (Guix 7db8fd6):
>
>   sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
>   time guix build --check $(guix gc --derivers $(readlink -f ~/.config/guix/current/lib/guix/package.cache))
>
>   real	0m28,848s
>   user	0m1,481s
>   sys	0m0,252s
>
>   sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
>   time guix build --check $(guix gc --derivers $(readlink -f /tmp/new/lib/guix/package.cache))
>
>   real	0m40,279s
>   user	0m1,582s
>   sys	0m0,232s
>
> It seems longer but compared to the time of "guix pull" completion, it seems
> acceptable.

Both the initial timing and the target are waaay too much.  :-/
On my i7 laptop I have:

--8<---------------cut here---------------start------------->8---
$ time ./pre-inst-env  guile -c '(use-modules (gnu packages)) (generate-package-cache "/tmp/t.cache")'

real    0m20.738s
user    0m44.413s
sys     0m0.341s
--8<---------------cut here---------------end--------------->8---

It’s CPU-bound; we should probably start by optimizing that.

In Guile 3.0.7 there was a change that improved this noticeably:

  https://git.savannah.gnu.org/cgit/guile.git/commit/?id=05614f792bfabbc33798863edd0bb67c744e9299

We should prolly look for similar optimization opportunities in the
assembler…

> Let compare for some queries:
>
>   sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
>   time guix search game | recsel -C -P name | wc -l
>   371
>
>   real	0m7,561s
>   user	0m3,525s
>   sys	0m0,391s

I think you should run:

  time guix search game > /dev/null

otherwise Bash’s built-in ‘time’ shows the wall-clock time of the whole
pipeline, and the processing time of ‘recsel’ is probably not negligible
here.

[...]

> Last, two minor remarks about previous comments.
>
> 1. Ludo commented:
>
>         > Therefore the cache '/lib/guix/package.cache' contains more
>         > information.
>
>         This breaks the binary interface, so we’ll have to analyze the impact of
>         such a change and devise a strategy.
>
>         <http://issues.guix.gnu.org/39258#93>
>
> and after some checking, this should be fine, IIUC.  The '--news' is ok
> because of '#:allow-other-keys'.  And other parts are modified accordingly.
> Guix revision N creates a cache that Guix revision N+1 will read but it should
> not be an issue; see 'inferior-available-packages'.
>
> 2. And Ludo wrote:
>
>         I realize the other cache also has that problem, but it would be nice to add a
>         version tag to the cache.  Basically emit something like:
>
>           (package-metadata-cache (version 0) VECTOR …)
>
>         instead of just:
>
>           (VECTOR …)
>
>         <http://issues.guix.gnu.org/39258#93>
>
> which is, after discussions, not necessary.  Versioning does not make sense
> here because the cache is read by the Guix which generates it.  Therefore,
> specify a version is extraneous here.

I confirm!  :-)

Thanks,
Ludo’.





* [bug#39258] [PATCH v6 1/2] DRAFT packages: Add fields to packages cache.
  2021-07-23 15:30       ` Ludovic Courtès
@ 2021-08-17 14:03         ` zimoun
  0 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2021-08-17 14:03 UTC (permalink / raw)
  To: Ludovic Courtès, Arun Isaac; +Cc: 39258

Hi Arun and Ludo,

Thanks for the review.

On Sat, 17 Jul 2021 at 14:01, Arun Isaac <arunisaac@systemreboot.net> wrote:

> I understand that one of the things you are trying to do is to have a
> common interface for the cache and no-cache cases. To achieve this, I
> think fold-available-packages and fold-packages should have the same
> function signature. They should both pass a <package> object to
> PROC. Currently, fold-packages is passing a <package> object whereas
> fold-available-packages is passing the fields of the <package> object as
> individual parameters. If fold-packages and fold-available-packages have
> the same function signature, then the changes in your [PATCH v6 2/2]
> would be way simpler.

I agree.  Previously [1], I created ’fold-packages*’ which was a cached
’fold-packages’ and Ludo answered [2]:

        Did you see ‘fold-available-packages’?  It seems you could
        extend it instead of introducing ‘fold-packages*’, no?

so this is, in a way, another attempt from the other direction. :-)

1: <http://issues.guix.gnu.org/39258#91>
2: <http://issues.guix.gnu.org/39258#93>

> Also, why do we need two separate functions---fold-available-packages
> and fold-packages? Can't fold-available-packages do everything
> fold-packages can and thus totally replace it?

To be honest, I have been lazy: unifying ’fold-available-packages’ and
’fold-packages’ means changing the signature, and hence a bit of
refactoring.  And as Ludo explained, ’fold-available-packages’ has to be
as light as possible because it is used by Emacs-Guix and maybe Nyxt for
completion. :-)

>> * gnu/packages.scm (generate-package-cache)[expand-cache]: Add synopsis and
>> description.
>> (load-package-cache, find-packages-by-names, find-packages-locations): Adapt
>> accordingly.
>
> A couple of typos here:
>
> find-packages-by-names -> find-packages-by-name
> find-packages-locations -> find-package-locations

Thanks for spotting these.


On Fri, 23 Jul 2021 at 17:30, Ludovic Courtès <ludo@gnu.org> wrote:

> One benchmark is “guix package -A > /dev/null”.  This should take
> ideally 0.5s at most because that’s what’s used by shell completion (the
> first time); currently it takes 0.82s on my laptop, though.

On cold cache, on my laptop:

--8<---------------cut here---------------start------------->8---
$ time guix package -A > /dev/null

real	0m1.717s
user	0m2.526s
sys	0m0.083s
--8<---------------cut here---------------end--------------->8---

and on my (slow) desktop:

real	0m6.196s
user	0m2.008s
sys	0m0.093s


Warm cache:
        laptop          desktop
real    0m1.425s        0m1.217s
user    0m2.505s        0m1.901s
sys     0m0.033s        0m0.051s


Well, another story I guess. :-)


Cheers,
simon





* [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances
  2021-07-23 15:43   ` [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances Ludovic Courtès
@ 2021-08-20 15:42     ` zimoun
  0 siblings, 0 replies; 126+ messages in thread
From: zimoun @ 2021-08-20 15:42 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Arun Isaac, 39258

Hi Ludo,

On Fri, 23 Jul 2021 at 17:43, Ludovic Courtès <ludo@gnu.org> wrote:

> > From my understanding, the issue that 'package-relevance' accepts a 'package'
> > (and then all the chain until displaying) and 'fold-avaibale-packages' does
> > not return a package.  Well, I do not know; especially where to put something
> > similiar to 'read-package-from'.
>
> Yeah that’s annoying.  Perhaps we need <proto-package> or
> <package-metadata>.  With some trickery we could have record type
> inheritance or something, maybe.  Dunno.
>
> It would be good if we could arrange so that ‘fold-available-packages’
> doesn’t allocate anything though.

It does not allocate; the allocation is done by 'find-packages-by-description'.

Well, I think v4/v3 [0] is the direction to follow.  Therefore, I
would revisit this version and try to address two of Ludo's comments
[1] and the other ones in v3.

BTW, I have not investigated where the slowness comes from:
 - allocation
 - garbage collection
 - '(module-ref (resolve-interface module) symbol)'
because I have been a bit disappointed by the performance of this v6.

0: <http://issues.guix.gnu.org/39258#89>
1: <http://issues.guix.gnu.org/39258#93>
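
To time the third candidate in isolation, the lookup can be reproduced outside
of Guix.  The sketch below uses the same body as 'read-package-from' from the
v6 patch; the name 'read-variable-from' and the choice of (srfi srfi-1) as the
module to resolve are only there so that it runs with a bare Guile.

```scheme
;; Same body as 'read-package-from' in the v6 patch: load MODULE if needed,
;; then fetch the value bound to SYMBOL in its public interface.
(define (read-variable-from module symbol)
  (module-ref (resolve-interface module) symbol))

;; (srfi srfi-1) ships with Guile, so this runs without Guix.
(define looked-up-fold
  (read-variable-from '(srfi srfi-1) 'fold))

(define total (looked-up-fold + 0 '(1 2 3)))   ;total is 6

;; With Guix on the load path, the equivalent call would be:
;;   (read-variable-from '(gnu packages base) 'hello)
```

The first call for a given module pays for loading that module; later calls
only pay for the variable lookup.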

> > Let compare only for cold cache and time this cache building (Guix 7db8fd6):
> >
> >   sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
> >   time guix build --check $(guix gc --derivers $(readlink -f ~/.config/guix/current/lib/guix/package.cache))
> >
> >   real        0m28,848s
> >   user        0m1,481s
> >   sys 0m0,252s
> >
> >   sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
> >   time guix build --check $(guix gc --derivers $(readlink -f /tmp/new/lib/guix/package.cache))
> >
> >   real        0m40,279s
> >   user        0m1,582s
> >   sys 0m0,232s
> >
> > It seems longer but compared to the time of "guix pull" completion, it seems
> > acceptable.
>
> Both the initial timing and the target are waaay too much.  :-/

Yes, but that's another story. :-)  We cannot fix everything at the same time, IMHO.

Here the point is to speed up "guix search" and to accept paying a
little more at "guix pull" time; then we could optimize the cache
generation.  Considering the overall time of "guix pull", this extra
time appears acceptable to me---if "guix search" is faster. :-)

> On my i7 laptop I have:
>
> --8<---------------cut here---------------start------------->8---
> $ time ./pre-inst-env  guile -c '(use-modules (gnu packages)) (generate-package-cache "/tmp/t.cache")'
>
> real    0m20.738s
> user    0m44.413s
> sys     0m0.341s
> --8<---------------cut here---------------end--------------->8---
>
> It’s CPU-bound; we should probably start by optimizing that.
>
> In Guile 3.0.7 there was a change that improved this noticeably:
>
>   https://git.savannah.gnu.org/cgit/guile.git/commit/?id=05614f792bfabbc33798863edd0bb67c744e9299
>
> We should prolly look for similar optimization opportunities in the
> assembler…

Yes, but this kind of optimization will provide a faster "guix pull",
not a faster "guix search".

--8<---------------cut here---------------start------------->8---
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time guile -c '(use-modules (gnu packages)) (generate-package-cache "/tmp/t1.cache")'

real    0m15,728s
user    0m23,940s
sys     0m0,826s

$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time guile -c '(load-compiled "/tmp/t1.cache/lib/guix/package.cache")'

real    0m1,026s
user    0m0,258s
sys     0m0,051s

$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time ./pre-inst-env  guile -c '(use-modules (gnu packages)) (generate-package-cache "/tmp/t2.cache")'

real    0m35,570s
user    3m12,951s
sys     0m3,807s

$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time guile -c '(load-compiled "/tmp/t2.cache/lib/guix/package.cache")'

real    0m1,055s
user    0m0,283s
sys     0m0,055s
--8<---------------cut here---------------end--------------->8---

(And speaking of performance, raw loading of the cache takes ~1s
but "guix package -A >/dev/null" takes ~8s.  It is a big gap for
parsing the CLI and sorting.  Worse, "guix --version >/dev/null" takes
~2s on cold cache.  We should probably start by reducing this overhead.)

> > Let compare for some queries:
> >
> >   sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
> >   time guix search game | recsel -C -P name | wc -l
> >   371
> >
> >   real        0m7,561s
> >   user        0m3,525s
> >   sys 0m0,391s
>
> I think you should run:
>
>   time guix search game > /dev/null
>
> otherwise Bash’s built-in ‘time’ shows the wall-clock time of the whole
> pipeline, and the processing time of ‘recsel’ is probably not negligible
> here.

First, I am confused... If the formatter (recsel) is not negligible,
then it should be dropped and replaced by an internal (fast)
formatter.  Well, I mean that as an end-user I am interested in the
time of the whole pipeline.

Second, on my machine the time is somehow negligible*. ;-)  On cold
cache, 10 runs using the pipe or using the redirection, keeping the
max and the min for each:

--8<---------------cut here---------------start------------->8---
real    0m9,598s
user    0m3,961s
sys     0m0,415s

real    0m8,744s
user    0m3,772s
sys     0m0,431s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
real    0m8,755s
user    0m3,869s
sys     0m0,540s

real    0m8,390s
user    0m3,767s
sys     0m0,416s
--8<---------------cut here---------------end--------------->8---

*negligible: or rather, it does not give a bad impression, even though
the difference is roughly 5%.


Cheers,
simon





end of thread (~2021-08-20 15:44 UTC)

Thread overview: 126+ messages
2020-01-23 19:51 [bug#39258] Faster guix search using an sqlite cache Arun Isaac
2020-01-29 23:33 ` zimoun
2020-01-30 13:48   ` Arun Isaac
2020-01-31 12:48     ` zimoun
2020-02-02 21:16       ` Arun Isaac
2020-02-04 10:19         ` zimoun
2020-02-06  1:58           ` Arun Isaac
2020-02-11 16:29             ` Ludovic Courtès
2020-02-11 18:21               ` zimoun
2020-02-11 18:39                 ` Ludovic Courtès
2020-02-11 19:07                   ` Arun Isaac
2020-02-11 20:20                     ` zimoun
2020-02-15 14:50                     ` Arun Isaac
2020-02-11 20:13                   ` zimoun
2020-02-27 20:41 ` [bug#39258] [PATCH 0/4] Xapian for Guix package search Arun Isaac
2020-02-27 20:41   ` [bug#39258] [PATCH 1/4] gnu: Add guile-xapian Arun Isaac
2020-03-03 16:29     ` zimoun
2020-02-27 20:41   ` [bug#39258] [PATCH 2/4] build-self: Add guile-xapian to Guix dependencies Arun Isaac
2020-02-27 20:41   ` [bug#39258] [PATCH 3/4] gnu: Generate xapian package search index Arun Isaac
2020-02-28  8:04     ` Pierre Neidhardt
2020-03-05 20:26       ` Arun Isaac
2020-03-03 18:29     ` zimoun
2020-02-27 20:41   ` [bug#39258] [PATCH 4/4] gnu: Use xapian index for package search Arun Isaac
2020-02-28  8:11     ` Pierre Neidhardt
2020-03-03 19:21     ` zimoun
2020-03-03 19:51       ` zimoun
2020-02-28  8:13   ` [bug#39258] [PATCH 0/4] Xapian for Guix " Pierre Neidhardt
2020-02-28 12:39     ` zimoun
2020-02-28 12:49       ` Pierre Neidhardt
2020-02-28 15:36     ` Arun Isaac
2020-02-28 16:04       ` Arun Isaac
2020-03-02 18:37         ` zimoun
2020-03-02 19:13           ` zimoun
2020-03-03 20:04             ` zimoun
2020-02-29  8:25       ` Arun Isaac
2020-03-02 18:27         ` zimoun
2020-02-28 12:36   ` zimoun
2020-03-05 16:46   ` Ludovic Courtès
2020-03-07 13:31 ` [bug#39258] [PATCH v2 0/3] " Arun Isaac
2020-03-07 13:31   ` [bug#39258] [PATCH v2 1/3] build-self: Add guile-xapian to Guix dependencies Arun Isaac
2020-03-09 18:14     ` zimoun
2020-03-09 23:40     ` Jonathan Brielmaier
2020-03-10  5:24       ` Arun Isaac
2020-03-07 13:31   ` [bug#39258] [PATCH v2 2/3] gnu: Generate Xapian package search index Arun Isaac
2020-03-09 18:19     ` zimoun
2020-03-07 13:31   ` [bug#39258] [PATCH v2 3/3] gnu: Use Xapian index for package search Arun Isaac
2020-03-07 20:33   ` [bug#39258] [PATCH v2 0/3] Xapian for Guix " Ludovic Courtès
2020-03-08  9:01     ` Arun Isaac
2020-03-08 11:33       ` Ludovic Courtès
2020-03-08 20:27         ` Arun Isaac
2020-03-09  7:42           ` Pierre Neidhardt
2020-03-09 12:50             ` zimoun
2020-03-09 10:35           ` Ludovic Courtès
2020-03-10 14:17             ` Arun Isaac
2020-03-10 14:33               ` zimoun
2020-03-11 13:50               ` Ludovic Courtès
2020-03-13  5:37                 ` Arun Isaac
2020-03-15 20:40                   ` Ludovic Courtès
2020-03-09  7:50         ` Pierre Neidhardt
2020-03-09 10:28           ` Ludovic Courtès
2020-03-09 13:03             ` zimoun
2020-03-09 12:53           ` zimoun
2020-03-09 12:47         ` zimoun
2020-03-09 12:40       ` zimoun
2020-03-09 12:34     ` zimoun
2020-03-08 20:27   ` zimoun
2020-03-08 20:40     ` Arun Isaac
2020-03-09 12:28   ` zimoun
2020-03-27 16:26 ` [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search Arun Isaac
2020-03-27 16:26   ` [bug#39258] [PATCH v3 1/3] guix: Generate package metadata cache Arun Isaac
2020-04-24 20:48     ` Ludovic Courtès
2020-04-26  9:48       ` zimoun
2020-04-26 14:35         ` Ludovic Courtès
2020-04-26 14:54           ` Pierre Neidhardt
2020-04-26 15:33             ` Ludovic Courtès
2020-04-26 15:05           ` zimoun
2020-03-27 16:26   ` [bug#39258] [PATCH v3 2/3] guix: Search " Arun Isaac
2020-04-24 20:58     ` Ludovic Courtès
2020-03-27 16:26   ` [bug#39258] [PATCH v3 3/3] guix: Use package metadata cache for package search Arun Isaac
2020-04-24 21:03     ` Ludovic Courtès
2020-04-05 14:08   ` [bug#39258] [PATCH v3 0/3] Package metadata cache for guix search Ludovic Courtès
2020-04-24 21:05   ` Ludovic Courtès
2020-04-26  3:54 ` [bug#39258] benchmark search: default vs v2 vs v3 zimoun
2020-04-26  7:29   ` Pierre Neidhardt
2020-04-26 15:49   ` Ludovic Courtès
2020-04-26 17:01     ` zimoun
2020-04-26 20:22       ` Ludovic Courtès
2020-04-30 13:10     ` zimoun
2020-05-03 15:01 ` [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3) zimoun
2020-05-03 15:01   ` [bug#39258] [PATCH v4 1/3] DRAFT packages: Add fields to packages cache zimoun
2020-05-03 15:01   ` [bug#39258] [PATCH v4 2/3] DRAFT packages: Add new procedure 'fold-packages*' zimoun
2020-05-03 15:01   ` [bug#39258] [PATCH v4 3/3] DRAFT guix package: Use cache in 'find-packages-by-description' zimoun
2020-05-03 16:43   ` [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3) Ludovic Courtès
2020-05-03 18:10     ` zimoun
2020-05-03 19:49       ` Ludovic Courtès
2020-06-01  0:00 ` [bug#39258] [PATCH 0/4] Optimize guix search Arun Isaac
2020-06-01  0:00   ` [bug#39258] [PATCH 1/4] ui: Cut off search early if any regexp does not match Arun Isaac
2020-06-09  8:29     ` Ludovic Courtès
2020-06-01  0:00   ` [bug#39258] [PATCH 2/4] ui: Use string matching with literal search strings Arun Isaac
2020-06-09  8:33     ` Ludovic Courtès
2020-06-09  9:55       ` zimoun
2020-06-13 12:37       ` Arun Isaac
2020-06-13 13:36         ` zimoun
2020-06-13 17:21           ` Arun Isaac
2020-06-14 19:14             ` zimoun
2020-06-13 19:32         ` Ludovic Courtès
2020-06-15 20:18           ` Arun Isaac
2020-06-01  0:00   ` [bug#39258] [PATCH 3/4] ui: Do not translate package synopsis a second time Arun Isaac
2020-06-09  8:33     ` Ludovic Courtès
2020-06-01  0:00   ` [bug#39258] [PATCH 4/4] ui: Use package-description-string Arun Isaac
2020-06-09  8:34     ` Ludovic Courtès
2020-06-01  1:25   ` [bug#39258] [PATCH v5 0/4] Optimize guix search zimoun
2020-06-01  2:24     ` Arun Isaac
2020-06-01 10:01     ` zimoun
2020-06-01 10:11 ` [bug#39258] KMP string search algorithm? zimoun
2020-06-01 22:24   ` Leo Famulari
2020-06-01 23:48     ` Arun Isaac
2020-06-02  8:49       ` Ludovic Courtès
2021-07-15  7:33 ` [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances zimoun
2021-07-15  7:33   ` [bug#39258] [PATCH v6 1/2] DRAFT packages: Add fields to packages cache zimoun
2021-07-17  8:31     ` Arun Isaac
2021-07-23 15:30       ` Ludovic Courtès
2021-08-17 14:03         ` zimoun
2021-07-15  7:33   ` [bug#39258] [PATCH v6 2/2] DRAFT scripts: package: Use cache in 'find-packages-by-description' zimoun
2021-07-23 15:43   ` [bug#39258] [PATCH v6 0/2] DRAFT "guix search" performances Ludovic Courtès
2021-08-20 15:42     ` zimoun
