all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
* File search progress: database review and question on triggers
@ 2020-08-10 14:32 Pierre Neidhardt
  2020-08-11  9:43 ` Mathieu Othacehe
                   ` (3 more replies)
  0 siblings, 4 replies; 73+ messages in thread
From: Pierre Neidhardt @ 2020-08-10 14:32 UTC (permalink / raw)
  To: guix-devel

[-- Attachment #1: Type: text/plain, Size: 5575 bytes --]

Hi!

After much delay I finally got down to work on file search support for Guix.
By "file search", I mean the ability to find which package contains files
matching the queried pattern.

If we want to be able to know which package to install, we need file search to
be able to work for packages that have not yet been installed nor built
on the system.

As we previously discussed, a good approach, mostly for performance reasons,
would be to store all files in a database that's populated on demand from the
substitute servers.

What I've done so far:

1. An SQLite database with the following schema:

--8<---------------cut here---------------start------------->8---
create table if not exists Packages (
    name        text not null,
    output      text default "out",
    system      text not null,
    path        text primary key not null, -- store path, e.g. /gnu/store/abcd...-foo
    version     text not null,
    guix        text not null  -- The Guix version in which the package
    can be found.
);

create table if not exists Files (
    subpath     text not null,
    package  text not null,
    primary key (subpath, package), -- Same subpath can occur in multiple packages.
    foreign key (package) references Packages(path) on delete cascade
);
--8<---------------cut here---------------end--------------->8---

   I'm not very good with SQL, so thanks in advance for reviewing this
   carefully; let me know if we can do better.

2. A procedure that persists the filepaths of a given package in the database.

3. Size of the database:
   I've persisted all locally-present store items for my current Guix version
   and it produced a database of 72 MiB.  It compresses down to 8 MiB in zstd.

   But since we can have multiple Guix versions, this means that the
   packages have one entry per store path, so we might end up with more
   entries than that as the number of Guix generations grows.

   The worse case is around (number of guix generations) x ~100 MiB.

   If we compress, it would be at least 10x less, maybe way less.

   To be sustainable, I suggest that when we remove a Guix generation we
   "garbage-collect" the corresponding database entries.

   Thoughts?

4. Indexing speed:
   The above items took some 20 minutes to complete (on my rather powerful machine).
   A single store path takes a fraction of a second to index (on an SSD).
   The storage device is the bottleneck here.  Not sure we can do better than
   the following procedure:

--8<---------------cut here---------------start------------->8---
(define (directory-files path)
  "Return a list of all files within PATH, recursively.
Each file is returned as the path relative to PATH, starting with a '/'.

It's important that the first character be the directory separator because it
gives more expressive power for search.  For instance, searching \"/bin\"
matches both \"/bin/foo\" and \"/usr/bin/foo\" but not \"barbin\"."
  ;; TODO: This does not include empty directories.  Should we?
  ;; REVIEW: Use vlist for performance?  Big packages take a fraction of a
  ;; second on a hot cache, so it's probably not worth it.
  (let ((file-list '()))
    (ftw path
         (lambda (filename statinfo flag)
           (when (eq? flag 'regular)
             (set! file-list (cons (string-drop filename (string-length path))
                                   file-list))) #t))
    file-list))
--8<---------------cut here---------------end--------------->8---

   Most of the indexing will be done by the substitute servers however, so this is
   of little concern for the end user.

   Question: Should we include empty directories in the database?  I'm tempted
   to answer no.

5. Search speed: It completes in a fraction of a second and supports
   SQLite patterns.  Example:

--8<---------------cut here---------------start------------->8---
> (format-search (search-file-package "%libio%"))
samba:out@4.12.3	/lib/libiov-buf-samba4.so
guix:out@1.1.0-18.218a67d	/share/guile/site/3.0/gnu/packages/patches/m4-gnulib-libio.patch
--8<---------------cut here---------------end--------------->8---

   Question: This bounds us to the SQLite syntax for pattern matching.  Is it a
   problem?
   It seems powerful enough in practice.  But maybe we can use regular
   expression in SQLite as well?


Next points I'd like to address:

6. Automatically persist the database entry when building a package.
   Any idea where I should plug that in?

7. Have substitute servers distribute database content.  When the user performs
   a file search, Guix asks the substitute server for a database update.  Only
   the diff should be sent over the network, not the whole thing since it might
   be very large.

   Question 1: If the substitute server does not have data corresponding to the
   Guix server of the user, shall we send data of the version that's the closest
   to that of the user?
   Locally, if there are not many entries for the current Guix version, but many
   for an older version, shall perform the search on the older version as well?
   Multiple older versions?

   Question 2: How do we send "SQLite diffs"?  I know xdelta for binary diffs,
   but I don't think that what we want to do here.  Or else can we export the right
   SELECT result as text and send it compressed over the network?

I'll send a patch soon, in the meantime let me know about your thoughts!
:)

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2020-10-21  9:59 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
2020-08-11  9:43 ` Mathieu Othacehe
2020-08-11 12:35   ` Pierre Neidhardt
2020-08-15 12:48     ` Hartmut Goebel
2020-08-11 15:43 ` Ricardo Wurmus
2020-08-11 17:54   ` Pierre Neidhardt
2020-08-11 17:58     ` Pierre Neidhardt
2020-08-11 20:08       ` Ricardo Wurmus
2020-08-12 19:10         ` Pierre Neidhardt
2020-08-12 20:13           ` Julien Lepiller
2020-08-12 20:43             ` Pierre Neidhardt
2020-08-12 21:29               ` Julien Lepiller
2020-08-12 22:29                 ` Ricardo Wurmus
2020-08-13  6:55                   ` Pierre Neidhardt
2020-08-13  6:52                 ` Pierre Neidhardt
2020-08-13  9:34                   ` Ricardo Wurmus
2020-08-13 10:04                     ` Pierre Neidhardt
2020-08-15 12:47                       ` Hartmut Goebel
2020-08-15 21:20                         ` Bengt Richter
2020-08-16  8:18                           ` Hartmut Goebel
2020-08-12 20:32           ` Pierre Neidhardt
2020-08-13  0:17           ` Arun Isaac
2020-08-13  6:58             ` Pierre Neidhardt
2020-08-13  9:40             ` Pierre Neidhardt
2020-08-13 10:08               ` Pierre Neidhardt
2020-08-13 11:47               ` Ricardo Wurmus
2020-08-13 13:44                 ` Pierre Neidhardt
2020-08-13 12:20               ` Arun Isaac
2020-08-13 13:53                 ` Pierre Neidhardt
2020-08-13 15:14                   ` Arun Isaac
2020-08-13 15:36                     ` Pierre Neidhardt
2020-08-13 15:56                       ` Pierre Neidhardt
2020-08-15 19:33                         ` Arun Isaac
2020-08-24  8:29                           ` Pierre Neidhardt
2020-08-24 10:53                             ` Pierre Neidhardt
2020-09-04 19:15                               ` Arun Isaac
2020-09-05  7:48                                 ` Pierre Neidhardt
2020-09-06  9:25                                   ` Arun Isaac
2020-09-06 10:05                                     ` Pierre Neidhardt
2020-09-06 10:33                                       ` Arun Isaac
2020-08-18 14:58 ` File search progress: database review and question on triggers OFF TOPIC PRAISE Joshua Branson
2020-08-27 10:00 ` File search progress: database review and question on triggers zimoun
2020-08-27 11:15   ` Pierre Neidhardt
2020-08-27 12:56     ` zimoun
2020-08-27 13:19       ` Pierre Neidhardt
2020-09-26 14:04         ` Pierre Neidhardt
2020-09-26 14:12           ` Pierre Neidhardt
2020-10-05 12:35           ` Ludovic Courtès
2020-10-05 18:53             ` Pierre Neidhardt
2020-10-09 21:16               ` zimoun
2020-10-10  8:57                 ` Pierre Neidhardt
2020-10-10 14:58                   ` zimoun
2020-10-12 10:16                   ` Ludovic Courtès
2020-10-12 11:18                     ` Pierre Neidhardt
2020-10-13 13:48                       ` Ludovic Courtès
2020-10-13 13:59                         ` Pierre Neidhardt
2020-10-10 16:03               ` zimoun
2020-10-11 11:19                 ` Pierre Neidhardt
2020-10-11 13:02                   ` zimoun
2020-10-11 14:25                     ` Pierre Neidhardt
2020-10-11 16:05                       ` zimoun
2020-10-12 10:20               ` Ludovic Courtès
2020-10-12 11:21                 ` Pierre Neidhardt
2020-10-13 13:45                   ` Ludovic Courtès
2020-10-13 13:56                     ` Pierre Neidhardt
2020-10-13 21:22                       ` Ludovic Courtès
2020-10-14  7:50                         ` Pierre Neidhardt
2020-10-16 10:30                           ` Ludovic Courtès
2020-10-17  9:14                             ` Pierre Neidhardt
2020-10-17 19:17                               ` Pierre Neidhardt
2020-10-21  9:53                               ` Ludovic Courtès
2020-10-21  9:58                                 ` Pierre Neidhardt
2020-10-12 11:23                 ` zimoun

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.