From: Pierre Neidhardt <mail@ambrevar.xyz>
To: guix-devel@gnu.org
Subject: File search progress: database review and question on triggers
Date: Mon, 10 Aug 2020 16:32:08 +0200 [thread overview]
Message-ID: <87sgcuh8rb.fsf@ambrevar.xyz> (raw)
[-- Attachment #1: Type: text/plain, Size: 5575 bytes --]
Hi!
After much delay I finally got down to work on file search support for Guix.
By "file search", I mean the ability to find which package contains files
matching the queried pattern.
If we want to be able to know which package to install, we need file search to
be able to work for packages that have not yet been installed nor built
on the system.
As we previously discussed, a good approach, mostly for performance reasons,
would be to store all files in a database that's populated on demand from the
substitute servers.
What I've done so far:
1. An SQLite database with the following schema:
--8<---------------cut here---------------start------------->8---
create table if not exists Packages (
name text not null,
output text default "out",
system text not null,
path text primary key not null, -- store path, e.g. /gnu/store/abcd...-foo
version text not null,
guix text not null -- The Guix version in which the package
can be found.
);
create table if not exists Files (
subpath text not null,
package text not null,
primary key (subpath, package), -- Same subpath can occur in multiple packages.
foreign key (package) references Packages(path) on delete cascade
);
--8<---------------cut here---------------end--------------->8---
I'm not very good with SQL, so thanks in advance for reviewing this
carefully; let me know if we can do better.
2. A procedure that persists the filepaths of a given package in the database.
3. Size of the database:
I've persisted all locally-present store items for my current Guix version
and it produced a database of 72 MiB. It compresses down to 8 MiB in zstd.
But since we can have multiple Guix versions, this means that the
packages have one entry per store path, so we might end up with more
entries than that as the number of Guix generations grows.
The worse case is around (number of guix generations) x ~100 MiB.
If we compress, it would be at least 10x less, maybe way less.
To be sustainable, I suggest that when we remove a Guix generation we
"garbage-collect" the corresponding database entries.
Thoughts?
4. Indexing speed:
The above items took some 20 minutes to complete (on my rather powerful machine).
A single store path takes a fraction of a second to index (on an SSD).
The storage device is the bottleneck here. Not sure we can do better than
the following procedure:
--8<---------------cut here---------------start------------->8---
(define (directory-files path)
"Return a list of all files within PATH, recursively.
Each file is returned as the path relative to PATH, starting with a '/'.
It's important that the first character be the directory separator because it
gives more expressive power for search. For instance, searching \"/bin\"
matches both \"/bin/foo\" and \"/usr/bin/foo\" but not \"barbin\"."
;; TODO: This does not include empty directories. Should we?
;; REVIEW: Use vlist for performance? Big packages take a fraction of a
;; second on a hot cache, so it's probably not worth it.
(let ((file-list '()))
(ftw path
(lambda (filename statinfo flag)
(when (eq? flag 'regular)
(set! file-list (cons (string-drop filename (string-length path))
file-list))) #t))
file-list))
--8<---------------cut here---------------end--------------->8---
Most of the indexing will be done by the substitute servers however, so this is
of little concern for the end user.
Question: Should we include empty directories in the database? I'm tempted
to answer no.
5. Search speed: It completes in a fraction of a second and supports
SQLite patterns. Example:
--8<---------------cut here---------------start------------->8---
> (format-search (search-file-package "%libio%"))
samba:out@4.12.3 /lib/libiov-buf-samba4.so
guix:out@1.1.0-18.218a67d /share/guile/site/3.0/gnu/packages/patches/m4-gnulib-libio.patch
--8<---------------cut here---------------end--------------->8---
Question: This bounds us to the SQLite syntax for pattern matching. Is it a
problem?
It seems powerful enough in practice. But maybe we can use regular
expression in SQLite as well?
Next points I'd like to address:
6. Automatically persist the database entry when building a package.
Any idea where I should plug that in?
7. Have substitute servers distribute database content. When the user performs
a file search, Guix asks the substitute server for a database update. Only
the diff should be sent over the network, not the whole thing since it might
be very large.
Question 1: If the substitute server does not have data corresponding to the
Guix server of the user, shall we send data of the version that's the closest
to that of the user?
Locally, if there are not many entries for the current Guix version, but many
for an older version, shall perform the search on the older version as well?
Multiple older versions?
Question 2: How do we send "SQLite diffs"? I know xdelta for binary diffs,
but I don't think that what we want to do here. Or else can we export the right
SELECT result as text and send it compressed over the network?
I'll send a patch soon, in the meantime let me know about your thoughts!
:)
Cheers!
--
Pierre Neidhardt
https://ambrevar.xyz/
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]
next reply other threads:[~2020-08-10 14:32 UTC|newest]
Thread overview: 73+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-08-10 14:32 Pierre Neidhardt [this message]
2020-08-11 9:43 ` File search progress: database review and question on triggers Mathieu Othacehe
2020-08-11 12:35 ` Pierre Neidhardt
2020-08-15 12:48 ` Hartmut Goebel
2020-08-11 15:43 ` Ricardo Wurmus
2020-08-11 17:54 ` Pierre Neidhardt
2020-08-11 17:58 ` Pierre Neidhardt
2020-08-11 20:08 ` Ricardo Wurmus
2020-08-12 19:10 ` Pierre Neidhardt
2020-08-12 20:13 ` Julien Lepiller
2020-08-12 20:43 ` Pierre Neidhardt
2020-08-12 21:29 ` Julien Lepiller
2020-08-12 22:29 ` Ricardo Wurmus
2020-08-13 6:55 ` Pierre Neidhardt
2020-08-13 6:52 ` Pierre Neidhardt
2020-08-13 9:34 ` Ricardo Wurmus
2020-08-13 10:04 ` Pierre Neidhardt
2020-08-15 12:47 ` Hartmut Goebel
2020-08-15 21:20 ` Bengt Richter
2020-08-16 8:18 ` Hartmut Goebel
2020-08-12 20:32 ` Pierre Neidhardt
2020-08-13 0:17 ` Arun Isaac
2020-08-13 6:58 ` Pierre Neidhardt
2020-08-13 9:40 ` Pierre Neidhardt
2020-08-13 10:08 ` Pierre Neidhardt
2020-08-13 11:47 ` Ricardo Wurmus
2020-08-13 13:44 ` Pierre Neidhardt
2020-08-13 12:20 ` Arun Isaac
2020-08-13 13:53 ` Pierre Neidhardt
2020-08-13 15:14 ` Arun Isaac
2020-08-13 15:36 ` Pierre Neidhardt
2020-08-13 15:56 ` Pierre Neidhardt
2020-08-15 19:33 ` Arun Isaac
2020-08-24 8:29 ` Pierre Neidhardt
2020-08-24 10:53 ` Pierre Neidhardt
2020-09-04 19:15 ` Arun Isaac
2020-09-05 7:48 ` Pierre Neidhardt
2020-09-06 9:25 ` Arun Isaac
2020-09-06 10:05 ` Pierre Neidhardt
2020-09-06 10:33 ` Arun Isaac
2020-08-18 14:58 ` File search progress: database review and question on triggers OFF TOPIC PRAISE Joshua Branson
2020-08-27 10:00 ` File search progress: database review and question on triggers zimoun
2020-08-27 11:15 ` Pierre Neidhardt
2020-08-27 12:56 ` zimoun
2020-08-27 13:19 ` Pierre Neidhardt
2020-09-26 14:04 ` Pierre Neidhardt
2020-09-26 14:12 ` Pierre Neidhardt
2020-10-05 12:35 ` Ludovic Courtès
2020-10-05 18:53 ` Pierre Neidhardt
2020-10-09 21:16 ` zimoun
2020-10-10 8:57 ` Pierre Neidhardt
2020-10-10 14:58 ` zimoun
2020-10-12 10:16 ` Ludovic Courtès
2020-10-12 11:18 ` Pierre Neidhardt
2020-10-13 13:48 ` Ludovic Courtès
2020-10-13 13:59 ` Pierre Neidhardt
2020-10-10 16:03 ` zimoun
2020-10-11 11:19 ` Pierre Neidhardt
2020-10-11 13:02 ` zimoun
2020-10-11 14:25 ` Pierre Neidhardt
2020-10-11 16:05 ` zimoun
2020-10-12 10:20 ` Ludovic Courtès
2020-10-12 11:21 ` Pierre Neidhardt
2020-10-13 13:45 ` Ludovic Courtès
2020-10-13 13:56 ` Pierre Neidhardt
2020-10-13 21:22 ` Ludovic Courtès
2020-10-14 7:50 ` Pierre Neidhardt
2020-10-16 10:30 ` Ludovic Courtès
2020-10-17 9:14 ` Pierre Neidhardt
2020-10-17 19:17 ` Pierre Neidhardt
2020-10-21 9:53 ` Ludovic Courtès
2020-10-21 9:58 ` Pierre Neidhardt
2020-10-12 11:23 ` zimoun
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://guix.gnu.org/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87sgcuh8rb.fsf@ambrevar.xyz \
--to=mail@ambrevar.xyz \
--cc=guix-devel@gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://git.savannah.gnu.org/cgit/guix.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).