Hi! After much delay I finally got down to work on file search support for Guix. By "file search", I mean the ability to find which package contains files matching the queried pattern. If we want to be able to know which package to install, we need file search to be able to work for packages that have not yet been installed nor built on the system. As we previously discussed, a good approach, mostly for performance reasons, would be to store all files in a database that's populated on demand from the substitute servers. What I've done so far: 1. An SQLite database with the following schema: --8<---------------cut here---------------start------------->8--- create table if not exists Packages ( name text not null, output text default "out", system text not null, path text primary key not null, -- store path, e.g. /gnu/store/abcd...-foo version text not null, guix text not null -- The Guix version in which the package can be found. ); create table if not exists Files ( subpath text not null, package text not null, primary key (subpath, package), -- Same subpath can occur in multiple packages. foreign key (package) references Packages(path) on delete cascade ); --8<---------------cut here---------------end--------------->8--- I'm not very good with SQL, so thanks in advance for reviewing this carefully; let me know if we can do better. 2. A procedure that persists the filepaths of a given package in the database. 3. Size of the database: I've persisted all locally-present store items for my current Guix version and it produced a database of 72 MiB. It compresses down to 8 MiB in zstd. But since we can have multiple Guix versions, this means that the packages have one entry per store path, so we might end up with more entries than that as the number of Guix generations grows. The worse case is around (number of guix generations) x ~100 MiB. If we compress, it would be at least 10x less, maybe way less. To be sustainable, I suggest that when we remove a Guix generation we "garbage-collect" the corresponding database entries. Thoughts? 4. Indexing speed: The above items took some 20 minutes to complete (on my rather powerful machine). A single store path takes a fraction of a second to index (on an SSD). The storage device is the bottleneck here. Not sure we can do better than the following procedure: --8<---------------cut here---------------start------------->8--- (define (directory-files path) "Return a list of all files within PATH, recursively. Each file is returned as the path relative to PATH, starting with a '/'. It's important that the first character be the directory separator because it gives more expressive power for search. For instance, searching \"/bin\" matches both \"/bin/foo\" and \"/usr/bin/foo\" but not \"barbin\"." ;; TODO: This does not include empty directories. Should we? ;; REVIEW: Use vlist for performance? Big packages take a fraction of a ;; second on a hot cache, so it's probably not worth it. (let ((file-list '())) (ftw path (lambda (filename statinfo flag) (when (eq? flag 'regular) (set! file-list (cons (string-drop filename (string-length path)) file-list))) #t)) file-list)) --8<---------------cut here---------------end--------------->8--- Most of the indexing will be done by the substitute servers however, so this is of little concern for the end user. Question: Should we include empty directories in the database? I'm tempted to answer no. 5. Search speed: It completes in a fraction of a second and supports SQLite patterns. Example: --8<---------------cut here---------------start------------->8--- > (format-search (search-file-package "%libio%")) samba:out@4.12.3 /lib/libiov-buf-samba4.so guix:out@1.1.0-18.218a67d /share/guile/site/3.0/gnu/packages/patches/m4-gnulib-libio.patch --8<---------------cut here---------------end--------------->8--- Question: This bounds us to the SQLite syntax for pattern matching. Is it a problem? It seems powerful enough in practice. But maybe we can use regular expression in SQLite as well? Next points I'd like to address: 6. Automatically persist the database entry when building a package. Any idea where I should plug that in? 7. Have substitute servers distribute database content. When the user performs a file search, Guix asks the substitute server for a database update. Only the diff should be sent over the network, not the whole thing since it might be very large. Question 1: If the substitute server does not have data corresponding to the Guix server of the user, shall we send data of the version that's the closest to that of the user? Locally, if there are not many entries for the current Guix version, but many for an older version, shall perform the search on the older version as well? Multiple older versions? Question 2: How do we send "SQLite diffs"? I know xdelta for binary diffs, but I don't think that what we want to do here. Or else can we export the right SELECT result as text and send it compressed over the network? I'll send a patch soon, in the meantime let me know about your thoughts! :) Cheers! -- Pierre Neidhardt https://ambrevar.xyz/