[-- Attachment #1: Type: text/plain, Size: 5575 bytes --]

Hi!

After much delay I finally got down to work on file search support for
Guix.  By "file search", I mean the ability to find which package
contains files matching the queried pattern.

If we want to be able to know which package to install, file search
needs to work for packages that have not yet been installed nor built
on the system.  As we previously discussed, a good approach, mostly for
performance reasons, would be to store all files in a database that's
populated on demand from the substitute servers.

What I've done so far:

1. An SQLite database with the following schema:

--8<---------------cut here---------------start------------->8---
create table if not exists Packages (
    name text not null,
    output text default "out",
    system text not null,
    path text primary key not null, -- store path, e.g. /gnu/store/abcd...-foo
    version text not null,
    guix text not null -- The Guix version in which the package can be found.
);

create table if not exists Files (
    subpath text not null,
    package text not null,
    primary key (subpath, package), -- Same subpath can occur in multiple packages.
    foreign key (package) references Packages(path) on delete cascade
);
--8<---------------cut here---------------end--------------->8---

I'm not very good with SQL, so thanks in advance for reviewing this
carefully; let me know if we can do better.

2. A procedure that persists the file paths of a given package in the
   database.

3. Size of the database:

   I've persisted all locally-present store items for my current Guix
   version and it produced a database of 72 MiB.  It compresses down to
   8 MiB in zstd.

   But since we can have multiple Guix versions, packages have one
   entry per store path, so we might end up with more entries than that
   as the number of Guix generations grows.  The worst case is around
   (number of Guix generations) x ~100 MiB.  If we compress, it would
   be at least 10x less, maybe way less.
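[Editorial note: as a quick sanity check, the schema above can be exercised with Python's sqlite3 module (all sample data here is made up; note that SQLite only honours `on delete cascade` when foreign keys are explicitly enabled with a pragma):]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # cascades are ignored by default
conn.executescript("""
create table if not exists Packages (
    name text not null,
    output text default 'out',
    system text not null,
    path text primary key not null,
    version text not null,
    guix text not null
);
create table if not exists Files (
    subpath text not null,
    package text not null,
    primary key (subpath, package),
    foreign key (package) references Packages(path) on delete cascade
);
""")
store_path = "/gnu/store/abcd-foo-1.0"  # made-up store path
conn.execute("insert into Packages values (?, ?, ?, ?, ?, ?)",
             ("foo", "out", "x86_64-linux", store_path, "1.0", "1.1.0"))
conn.execute("insert into Files values (?, ?)", ("/bin/foo", store_path))
# Deleting the package row cascades to its Files entries:
conn.execute("delete from Packages where path = ?", (store_path,))
remaining = conn.execute("select count(*) from Files").fetchone()[0]
```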
To be sustainable, I suggest that when we remove a Guix generation we
"garbage-collect" the corresponding database entries.  Thoughts?

4. Indexing speed:

   The above items took some 20 minutes to complete (on my rather
   powerful machine).  A single store path takes a fraction of a second
   to index (on an SSD).  The storage device is the bottleneck here.
   Not sure we can do better than the following procedure:

--8<---------------cut here---------------start------------->8---
(define (directory-files path)
  "Return a list of all files within PATH, recursively.
Each file is returned as the path relative to PATH, starting with a '/'.
It's important that the first character be the directory separator
because it gives more expressive power for search.  For instance,
searching \"/bin\" matches both \"/bin/foo\" and \"/usr/bin/foo\" but
not \"barbin\"."
  ;; TODO: This does not include empty directories.  Should we?
  ;; REVIEW: Use vlist for performance?  Big packages take a fraction of
  ;; a second on a hot cache, so it's probably not worth it.
  (let ((file-list '()))
    (ftw path
         (lambda (filename statinfo flag)
           (when (eq? flag 'regular)
             (set! file-list (cons (string-drop filename
                                                (string-length path))
                                   file-list)))
           #t))
    file-list))
--8<---------------cut here---------------end--------------->8---

   Most of the indexing will be done by the substitute servers however,
   so this is of little concern for the end user.

   Question: Should we include empty directories in the database?  I'm
   tempted to answer no.

5. Search speed:

   It completes in a fraction of a second and supports SQLite patterns.
   Example:

--8<---------------cut here---------------start------------->8---
> (format-search (search-file-package "%libio%"))
samba:out@4.12.3  /lib/libiov-buf-samba4.so
guix:out@1.1.0-18.218a67d  /share/guile/site/3.0/gnu/packages/patches/m4-gnulib-libio.patch
--8<---------------cut here---------------end--------------->8---

   Question: This bounds us to the SQLite syntax for pattern matching.
   Is it a problem?
   It seems powerful enough in practice.  But maybe we can use regular
   expressions in SQLite as well?

Next points I'd like to address:

6. Automatically persist the database entry when building a package.
   Any idea where I should plug that in?

7. Have substitute servers distribute database content.  When the user
   performs a file search, Guix asks the substitute server for a
   database update.  Only the diff should be sent over the network, not
   the whole thing, since it might be very large.

   Question 1: If the substitute server does not have data
   corresponding to the Guix version of the user, shall we send data of
   the version that's the closest to that of the user?  Locally, if
   there are not many entries for the current Guix version, but many
   for an older version, shall we perform the search on the older
   version as well?  Multiple older versions?

   Question 2: How do we send "SQLite diffs"?  I know xdelta for binary
   diffs, but I don't think that's what we want to do here.  Or else,
   can we export the right SELECT result as text and send it compressed
   over the network?

I'll send a patch soon; in the meantime let me know about your
thoughts! :)

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]
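[Editorial note: one hypothetical answer to Question 2, sketched in Python — export the rows for the relevant Guix versions as plain text and compress them (gzip stands in for zstd here; the `since_guix` filter is purely illustrative, a real diff would compare versions properly):]

```python
import gzip
import sqlite3

def export_diff(conn, since_guix):
    """Serialize the Files rows of packages whose Guix version differs
    from SINCE_GUIX as compressed tab-separated text."""
    rows = conn.execute(
        "select f.subpath, f.package from Files f "
        "join Packages p on p.path = f.package "
        "where p.guix != ?", (since_guix,))
    text = "\n".join("\t".join(row) for row in rows)
    return gzip.compress(text.encode("utf-8"))

def import_diff(conn, blob):
    """Merge a diff produced by export_diff into a local database."""
    for line in gzip.decompress(blob).decode("utf-8").splitlines():
        subpath, package = line.split("\t")
        conn.execute("insert or ignore into Files (subpath, package) "
                     "values (?, ?)", (subpath, package))
```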
Hello Pierre,

Thanks for sharing your progress.  A few remarks below.

> 3. Size of the database:
>
>    I've persisted all locally-present store items for my current Guix
>    version and it produced a database of 72 MiB.  It compresses down
>    to 8 MiB in zstd.
>
>    But since we can have multiple Guix versions, this means that the
>    packages have one entry per store path, so we might end up with
>    more entries than that as the number of Guix generations grows.

I'm not sure we actually need to save the full history.  I think we
could just store that package X produces the executable files [Y1, Y2,
...].  Then, on each update of package X, the executable file list
could be refreshed.

> Question: Should we include empty directories in the database?  I'm
> tempted to answer no.

I would also say no, and would also exclude non-executable files.

> Question: This bounds us to the SQLite syntax for pattern matching.
> Is it a problem?  It seems powerful enough in practice.  But maybe we
> can use regular expressions in SQLite as well?

From the UI perspective, we already have "guix search", which expects a
regexp.  If we were to include a "guix file-search" command, then I
think it would make sense for it to use the same regexp syntax.

> Next points I'd like to address:
>
> 6. Automatically persist the database entry when building a package.
>    Any idea where I should plug that in?
>
> 7. Have substitute servers distribute database content.  When the
>    user performs a file search, Guix asks the substitute server for a
>    database update.  Only the diff should be sent over the network,
>    not the whole thing since it might

If I understand correctly, you are proposing to create local databases
that would be kept in sync with a master database populated by the CI
server.  This seems a bit complex.

What about extending the Cuirass database to add the two tables you are
proposing?  Then, each time a package is built, if the version is
updated, the "Files" table would be updated.
Then we could add an HTTP interface such as "/search/file?query=libxxx"
to Cuirass, that would directly query the database.  In Guix itself, we
could add the counterpart in the (guix ci) module.

WDYT?

Thanks,

Mathieu
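[Editorial note: a minimal sketch of the query translation behind such an endpoint, in Python for illustration — neither the URL scheme nor the function is actual Cuirass code, and the table layout is the one proposed earlier in the thread:]

```python
import sqlite3
from urllib.parse import parse_qs, urlparse

def handle_file_search(conn, url):
    """Turn a request like "/search/file?query=libxxx" into an SQLite
    LIKE query over the Files table and return the matching rows."""
    params = parse_qs(urlparse(url).query)
    pattern = "%" + params["query"][0] + "%"
    return conn.execute(
        "select package, subpath from Files where subpath like ?",
        (pattern,)).fetchall()
```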
Hi Mathieu,

Thanks for your comments!  Answers below.

>> 3. Size of the database:
>>
>>    I've persisted all locally-present store items for my current
>>    Guix version and it produced a database of 72 MiB.  It compresses
>>    down to 8 MiB in zstd.
>>
>>    But since we can have multiple Guix versions, this means that the
>>    packages have one entry per store path, so we might end up with
>>    more entries than that as the number of Guix generations grows.
>
> I'm not sure we actually need to save the full history.  I think we
> could just store that package X produces the executable files [Y1,
> Y2, ...].  Then on package X update, the executable files list could
> be refreshed.

Maybe you are missing some context.  The original discussion is here:

  https://lists.gnu.org/archive/html/guix-devel/2020-01/msg00019.html

Unlike Nix, we would like to do more than just index executable files.
Indeed, it's very useful to know where to find, say, a C header, a .so
library, a TeXlive .sty file, etc.

However, as you hinted, maybe it's unnecessary to save the file
listings of the packages for every Guix version.  Maybe we could only
store the "diffs" between the Guix generations.  I don't know if SQLite
supports this; if not, it sounds like a rather complex thing to do.

But really, if the compressed database over multiple Guix generations
is <100 MiB, then size is not a big problem.

>> Question: Should we include empty directories in the database?  I'm
>> tempted to answer no.
>
> I would also say no, and also exclude non-executable files.

See above: I think we would lose a lot by not including non-executable
files.

>> Question: This bounds us to the SQLite syntax for pattern matching.
>> Is it a problem?  It seems powerful enough in practice.  But maybe
>> we can use regular expressions in SQLite as well?
>
> From the UI perspective, we already have "guix search" that expects a
> regex.  If we were to include a "guix file-search" command, then I
> think it would make sense that it uses the same regex syntax.

I found out that SQLite has a REGEXP operator; I'll see if it works
well enough.

>> 7. Have substitute servers distribute database content.  When the
>>    user performs a file search, Guix asks the substitute server for
>>    a database update.  Only the diff should be sent over the
>>    network, not the whole thing since it might
>
> If I understand correctly, you are proposing to create local
> databases that would be kept in sync with a master database populated
> by the CI server.  This seems a bit complex.
>
> What about extending the Cuirass database to add the two tables you
> are proposing?  Then, each time a package is built, if the version is
> updated, the "Files" table would be updated.
>
> Then we could add an HTTP interface such as
> "/search/file?query=libxxx" to Cuirass, that would directly query the
> database.  In Guix itself, we could add the counterpart in the (guix
> ci) module.

The problem with this approach is that it would not work offline.  In
my opinion, this is a big limitation.  I'd rather have a local
database.

Besides, we need a local database for non-official, locally-built
packages anyway (which Cuirass would not know about).  Since this is a
requirement, the only piece that'd be missing is database
synchronization.

Thoughts?

--
Pierre Neidhardt
https://ambrevar.xyz/
Pierre Neidhardt <mail@ambrevar.xyz> writes:

> 3. Size of the database:
>
>    I've persisted all locally-present store items for my current Guix
>    version and it produced a database of 72 MiB.  It compresses down
>    to 8 MiB in zstd.

For comparison, my laptop’s store contains 1,103,543 files; excluding
.links, 691,994.  The updatedb database for all of them is 86 MB and
takes ~6 seconds to generate:

  time updatedb \
    --localpaths=/gnu/store \
    --findoptions='( -path /gnu/store/.links -o -name *.drv -o -name *.chroot ) -prune -o -type f -print' \
    --output=/tmp/dbfile

  locate -d /tmp/dbfile ecxc0800

(This could be further tweaked to exclude links…)

> The worst case is around (number of Guix generations) x ~100 MiB.

This seems a little excessive.

> 4. Indexing speed:
>
>    The above items took some 20 minutes to complete (on my rather
>    powerful machine).

Oof.  The updatedb hack above takes 6 seconds on my i7-6500U CPU @
2.50GHz with SSD.

I’m not suggesting to use updatedb, but I think it can be instructive
to look at how the file database is implemented there.  We don’t have
to use SQLite if it is much slower and heavier than a custom inverted
index.

--
Ricardo
Ricardo Wurmus <rekado@elephly.net> writes:

> Oof.  The updatedb hack above takes 6 seconds on my i7-6500U CPU @
> 2.50GHz with SSD.
>
> I’m not suggesting to use updatedb, but I think it can be instructive
> to look at how the file database is implemented there.  We don’t have
> to use SQLite if it is much slower and heavier than a custom inverted
> index.

Good call, I'll benchmark against an inverted index.

Some cost may also be induced by the Guix store queries; not sure if we
can optimize these.

--
Pierre Neidhardt
https://ambrevar.xyz/
Pierre Neidhardt <mail@ambrevar.xyz> writes:

> Ricardo Wurmus <rekado@elephly.net> writes:
>
>> I’m not suggesting to use updatedb, but I think it can be
>> instructive to look at how the file database is implemented there.
>> We don’t have to use SQLite if it is much slower and heavier than a
>> custom inverted index.
>
> Good call, I'll benchmark against an inverted index.
>
> Some cost may also be induced by the Guix store queries; not sure if
> we can optimize these.

With an s-exp-based file, or a trivial text-based format, the downside
is that it takes a bit of extra work to load only select entries,
e.g. just the entries matching a specific Guix version.

Would you happen to know a serialization library that allows for
loading only a select portion of a file?

Otherwise a trivial workaround would be to persist one index file per
Guix generation.

--
Pierre Neidhardt
https://ambrevar.xyz/
Pierre Neidhardt <mail@ambrevar.xyz> writes:
> Pierre Neidhardt <mail@ambrevar.xyz> writes:
>
>> Ricardo Wurmus <rekado@elephly.net> writes:
>>
>>> I’m not suggesting to use updatedb, but I think it can be instructive to
>>> look at how the file database is implemented there. We don’t have to
>>> use SQlite if it is much slower and heavier than a custom inverted
>>> index.
>>
>> Good call, I'll benchmark against an inverted index.
>>
>> Some cost may also be induced by the Guix store queries, not sure if we
>> can optimize these.
>
> With an s-exp based file, or a trivial text-based format, the downside
> is that it needs a bit of extra work to only load select entries,
> e.g. just the entries matching a specific Guix version.
>
> Would you happen to know a serialization library that allows for loading
> only a select portion of a file?
I don’t know of any suitable file format, but a generated offset index
at the beginning of the file could be useful. You’d read the first
expression and then seek to the specified byte offset (after the
position of the index expression) where you then read the target
expression. This can easily be generated and it can be extended without
having to rewrite the whole file.
But perhaps that’s premature optimization.
--
Ricardo
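[Editorial note: the offset-index idea can be sketched as follows, in Python for illustration — a JSON header stands in for the leading s-expression, and the on-disk format is entirely made up:]

```python
import io
import json

def write_indexed(records):
    """Write RECORDS (a dict of key -> text blob) preceded by an offset
    index, so a reader can seek straight to one entry."""
    bodies = {k: v.encode("utf-8") for k, v in records.items()}
    index, pos = {}, 0
    for key, body in bodies.items():
        index[key] = (pos, len(body))  # (offset past header, size)
        pos += len(body)
    header = json.dumps(index).encode("utf-8")
    out = io.BytesIO()
    out.write(len(header).to_bytes(4, "big"))  # fixed-size header length
    out.write(header)
    out.write(b"".join(bodies.values()))
    return out.getvalue()

def read_entry(blob, key):
    """Read only the entry for KEY: parse the header, then seek."""
    f = io.BytesIO(blob)
    hlen = int.from_bytes(f.read(4), "big")
    index = json.loads(f.read(hlen).decode("utf-8"))
    offset, size = index[key]
    f.seek(4 + hlen + offset)  # offsets are relative to the header's end
    return f.read(size).decode("utf-8")
```

Appending a new entry only requires rewriting the header, not the existing bodies, which matches the extensibility point above.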
I've done some benchmarking.

1. I tried to fine-tune the SQL a bit:

   - Open/close the database only once for the whole indexing.
   - Use "insert" instead of "insert or replace".
   - Use a numeric ID as key instead of the path.

   Result: Still around 15-20 minutes to build.  Switching to numeric
   indices shrank the database by half.

2. I've tried with the following naive 1-file-per-line format:

--8<---------------cut here---------------start------------->8---
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man5/acl.5.gz"
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_add_perm.3.gz"
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_calc_mask.3.gz"
...
--8<---------------cut here---------------end--------------->8---

   Result: Takes between 20 and 2 minutes to complete and the result is
   32 MiB big.  (I don't know why the timing varies.)

   A string-contains filter takes less than 1 second.

   A string-match (regex) search takes some 3 seconds (Ryzen 5 @
   3.5 GHz).  I'm not sure if we can go faster.  I need to measure the
   time SQL takes for a regexp match.

Question: Any idea how to load the database as fast as possible?  I
tried the following, which takes 1.5s on my machine:

--8<---------------cut here---------------start------------->8---
(define (load-textual-database)
  (call-with-input-file %textual-db
    (lambda (port)
      (let loop ((line (get-line port))
                 (result '()))
        (if (string? line)
            (loop (get-line port) (cons line result))
            result)))))
--8<---------------cut here---------------end--------------->8---

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/
Have you tried something more structured?  I have some code for
creating a binary search tree and even compressing/decompressing
strings with Huffman coding, as well as code to serialize all that (my
deserialization is in Java though, so not very useful to you):

  https://framagit.org/nani-project/nani-website

See modules/nani/trie.scm for instance.

The idea is to have a binary search tree whose keys are filenames and
whose value is a pointer in the file to a structure that holds data for
this filename (packages and versions of Guix for instance).  It's super
fast to look up, because you don't load the whole file; you only seek
to the right position.  It's a lot longer to build, and not so easy to
update though.

On August 12, 2020 15:10:08 GMT-04:00, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>I've done some benchmarking.
>
>1. I tried to fine-tune the SQL a bit:
>   - Open/close the database only once for the whole indexing.
>   - Use "insert" instead of "insert or replace".
>   - Use numeric ID as key instead of path.
>
>   Result: Still around 15-20 minutes to build.  Switching to numeric
>   indices shrank the database by half.
>
>2. I've tried with the following naive 1-file-per-line format:
>
>--8<---------------cut here---------------start------------->8---
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man5/acl.5.gz"
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_add_perm.3.gz"
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_calc_mask.3.gz"
>...
>--8<---------------cut here---------------end--------------->8---
>
>   Result: Takes between 20 and 2 minutes to complete and the result
>   is 32 MiB big.  (I don't know why the timing varies.)
>
>   A string-contains filter takes less than 1 second.
>
>   A string-match (regex) search takes some 3 seconds (Ryzen 5 @
>   3.5 GHz).  I'm not sure if we can go faster.  I need to measure the
>   time SQL takes for a regexp match.
>
>Question: Any idea how to load the database as fast as possible?  I
>tried the following, it takes 1.5s on my machine:
>
>--8<---------------cut here---------------start------------->8---
>(define (load-textual-database)
>  (call-with-input-file %textual-db
>    (lambda (port)
>      (let loop ((line (get-line port))
>                 (result '()))
>        (if (string? line)
>            (loop (get-line port) (cons line result))
>            result)))))
>--8<---------------cut here---------------end--------------->8---
>
>Cheers!
>
>--
>Pierre Neidhardt
>https://ambrevar.xyz/
Pierre Neidhardt <mail@ambrevar.xyz> writes:

>    Result: Takes between 20 and 2 minutes to complete and the result
>    is 32 MiB big.  (I don't know why the timing varies.)

Typo: 20 _seconds_ to 2 minutes!  So it's faster than SQLite by 1 or 2
orders of magnitude.

--
Pierre Neidhardt
https://ambrevar.xyz/
Julien Lepiller <julien@lepiller.eu> writes:

> Have you tried something more structured?  I have some code for
> creating a binary search tree and even compressing/decompressing
> strings with Huffman coding, as well as code to serialize all that
> (my deserialization is in Java though, so not very useful to you):
> https://framagit.org/nani-project/nani-website
>
> See modules/nani/trie.scm for instance.
>
> The idea is to have a binary search tree whose keys are filenames and
> whose value is a pointer in the file to a structure that holds data
> for this filename (packages and versions of Guix for instance).  It's
> super fast to look up, because you don't load the whole file; you
> only seek to the right position.  It's a lot longer to build, and not
> so easy to update though.

Assuming we'd have only one Guix generation per file, I'm not sure a
trie would help, because we _always_ need to search _all_ file paths;
so in all cases we've got some 10,000+ entries to load in memory and
loop over.

The total number of entries is the bottleneck here, both for the
database load and the individual search queries.

An obvious optimization is to memoize the database load.  This has a
significant memory cost though.

The trie could be helpful for multiple Guix generations in the same
database, but I'm not sure it warrants the increased complexity.

Thoughts?

--
Pierre Neidhardt
https://ambrevar.xyz/
Why wouldn't it help?  Can't you make it a trie from basename ->
complete name?  If I'm looking for "libcord.so" (which is a key in the
trie), I don't think I need to look at every path.  I only need to
follow the trie until I find a pointer to some structure that contains
the data I look for (e.g. a list of complete filenames).

On August 12, 2020 16:43:37 GMT-04:00, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>Julien Lepiller <julien@lepiller.eu> writes:
>
>> Have you tried something more structured?  I have some code for
>> creating a binary search tree and even compressing/decompressing
>> strings with Huffman coding, as well as code to serialize all that
>> (my deserialization is in Java though, so not very useful to you):
>> https://framagit.org/nani-project/nani-website
>>
>> See modules/nani/trie.scm for instance.
>>
>> The idea is to have a binary search tree whose keys are filenames
>> and whose value is a pointer in the file to a structure that holds
>> data for this filename (packages and versions of Guix for instance).
>> It's super fast to look up, because you don't load the whole file;
>> you only seek to the right position.  It's a lot longer to build,
>> and not so easy to update though.
>
>Assuming we'd have only one Guix generation per file, I'm not sure a
>trie would help, because we _always_ need to search _all_ file paths;
>so in all cases we've got some 10,000+ entries to load in memory and
>loop over.
>
>The total number of entries is the bottleneck here, both for the
>database load and the individual search queries.
>
>An obvious optimization is to memoize the database load.  This has a
>significant memory cost though.
>
>The trie could be helpful for multiple Guix generations in the same
>database, but I'm not sure it warrants the increased complexity.
>
>Thoughts?
>
>--
>Pierre Neidhardt
>https://ambrevar.xyz/
Julien Lepiller <julien@lepiller.eu> writes:
> Why wouldn't it help?  Can't you make it a trie from basename ->
> complete name?  If I'm looking for "libcord.so" (which is a key in
> the trie), I don't think I need to look for every path.  I only need
> to follow the trie until I find a pointer to some structure that
> contains the data I look for (ex: a list of complete filenames).
Exactly! It’s somewhat less useful (without pre-processing) for base
names and more useful for absolute file names, where you can quickly
toss out mismatches while traversing the trie.
You could also optimize the data structure by swapping the base name
and the directory name, assuming that users would want to search either
for absolute file names or for base names (and not for a directory in
between).
--
Ricardo
Hi,

> 1. I tried to fine-tune the SQL a bit:
>
>    - Open/close the database only once for the whole indexing.
>    - Use "insert" instead of "insert or replace".
>    - Use numeric ID as key instead of path.
>
>    Result: Still around 15-20 minutes to build.  Switching to numeric
>    indices shrank the database by half.

SQLite insert statements can be very fast: sqlite.org claims 50,000 or
more insert statements per second.  But in order to achieve that speed,
all insert statements have to be grouped together in a single
transaction.  See https://www.sqlite.org/faq.html#q19

> A string-contains filter takes less than 1 second.

Guile's string-contains function uses a naive O(nk) implementation,
where 'n' is the length of string s1 and 'k' is the length of string
s2.  If it was implemented using the Knuth-Morris-Pratt algorithm, it
could cost only O(n+k).  So there is some scope for improvement here.
In fact, a comment on line 2007 of libguile/srfi-13.c in the Guile
source tree makes this very point.

> I need to measure the time SQL takes for a regexp match.

SQLite, by default, does not come with regexp support.  You might have
to load some external library.  See
https://www.sqlite.org/lang_expr.html#the_like_glob_regexp_and_match_operators

--8<---------------cut here---------------start------------->8---
The REGEXP operator is a special syntax for the regexp() user
function.  No regexp() user function is defined by default and so use
of the REGEXP operator will normally result in an error message.  If an
application-defined SQL function named "regexp" is added at run-time,
then the "X REGEXP Y" operator will be implemented as a call to
"regexp(Y,X)".
--8<---------------cut here---------------end--------------->8---

Regards,
Arun
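[Editorial note: the transaction point is easy to demonstrate with Python's sqlite3 binding (the Guix code would use guile-sqlite3 instead; the table and data here are stand-ins):]

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("create table Files (subpath text, package text)")
rows = [("/bin/tool%d" % i, "/gnu/store/pkg%d" % (i % 100))
        for i in range(50000)]

start = time.perf_counter()
with conn:  # one transaction wrapped around all the inserts
    conn.executemany("insert into Files values (?, ?)", rows)
elapsed = time.perf_counter() - start
print("%d inserts in %.2fs" % (len(rows), elapsed))
```

Without the single enclosing transaction, each insert would pay its own commit cost, which is where the minutes-long indexing times come from.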
Julien Lepiller <julien@lepiller.eu> writes:

> Why wouldn't it help?  Can't you make it a trie from basename ->
> complete name?  If I'm looking for "libcord.so" (which is a key in
> the trie), I don't think I need to look for every path.  I only need
> to follow the trie until I find a pointer to some structure that
> contains the data I look for (ex: a list of complete filenames).

Fair enough, but it's a more limited scope: here we assume the user
knows the exact basename.

It's a bit too limited in my opinion:

- It's only too common to have shared objects ending with a .X.Y
  extension (where X.Y is a version); the version-less file is not
  always present, which means a lot of trial and error on the user's
  end just to search for the right file.

- It does not cover the case where I don't know the basename, e.g. if
  I'm looking for a FOO header file my query would look like
  "/include/.*foo.*".

I believe it's important that the search be as general as possible.

Thoughts?

--
Pierre Neidhardt
https://ambrevar.xyz/
Hi Ricardo,

Ricardo Wurmus <rekado@elephly.net> writes:

>> Why wouldn't it help?  Can't you make it a trie from basename ->
>> complete name?  If I'm looking for "libcord.so" (which is a key in
>> the trie), I don't think I need to look for every path.  I only need
>> to follow the trie until I find a pointer to some structure that
>> contains the data I look for (ex: a list of complete filenames).
>
> Exactly!  It’s somewhat less useful (without pre-processing) for base
> names and more useful for absolute file names, where you can quickly
> toss out mismatches while traversing the trie.
>
> You could also optimize the data structure by swapping the base name
> and the directory name, assuming that users would want to search
> either for absolute file names or for base names (and not for a
> directory in between).

By absolute file name, do you mean relative to the Guix store item?

As I mentioned in my reply to Julien, this would be a specific use-case
optimization; as you said, it won't work when you don't know the exact
basename or full path, so I'm not sure it's worth going down that road.

Thoughts?

--
Pierre Neidhardt
https://ambrevar.xyz/
Hi Arun,

Arun Isaac <arunisaac@systemreboot.net> writes:

> SQLite insert statements can be very fast: sqlite.org claims 50,000
> or more insert statements per second.  But in order to achieve that
> speed, all insert statements have to be grouped together in a single
> transaction.  See https://www.sqlite.org/faq.html#q19

Very good point, thanks a lot!  I'll try again!

>> A string-contains filter takes less than 1 second.
>
> Guile's string-contains function uses a naive O(nk) implementation,
> where 'n' is the length of string s1 and 'k' is the length of string
> s2.  If it was implemented using the Knuth-Morris-Pratt algorithm, it
> could cost only O(n+k).  So there is some scope for improvement here.
> In fact, a comment on line 2007 of libguile/srfi-13.c in the Guile
> source tree makes this very point.

Thanks for this low-level insight, this is very useful!  If we can get
Knuth-Morris-Pratt into the next Guile version, then a textual database
could be an ideal option in terms of performance.

>> I need to measure the time SQL takes for a regexp match.
>
> SQLite, by default, does not come with regexp support.  You might
> have to load some external library.  See
> https://www.sqlite.org/lang_expr.html#the_like_glob_regexp_and_match_operators
>
> --8<---------------cut here---------------start------------->8---
> The REGEXP operator is a special syntax for the regexp() user
> function.  No regexp() user function is defined by default and so use
> of the REGEXP operator will normally result in an error message.  If
> an application-defined SQL function named "regexp" is added at
> run-time, then the "X REGEXP Y" operator will be implemented as a
> call to "regexp(Y,X)".
> --8<---------------cut here---------------end--------------->8---

I saw that.  Now I need to see how regexp search performs as opposed to
pattern search.

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/
Pierre Neidhardt <mail@ambrevar.xyz> writes:

> Julien Lepiller <julien@lepiller.eu> writes:
>
>> Why wouldn't it help?  Can't you make it a trie from basename ->
>> complete name?  If I'm looking for "libcord.so" (which is a key in the
>> trie), I don't think I need to look for every path.  I only need to
>> follow the trie until I find a pointer to some structure that contains
>> the data I look for (ex: a list of complete filenames).
>
> Fair enough, but it's a more limited scope: here we assume the user
> knows the exact basename.
>
> It's a bit too limited in my opinion:
>
> - It's all too common for shared objects to end with a .X.Y extension
>   (where X.Y is a version); the version-less file is not always
>   present, which means a lot of trial and error on the user's end just
>   to find the right file.

We can always abort the radix-trie traversal early and print all leaves
reachable from the node we arrived at.  This allows us to search for
"libc" and find "libcord.so.1.2", "libc.so", "libcaca.a", etc.

> - It does not cover the case where I don't know the basename, e.g. if I'm
>   looking for a FOO header file my query would look like "/include/.*foo.*".

I think this is a rather rare use case, which in my opinion doesn't
justify forgoing the use of a smart data structure.

It would *still* be possible to use the prefix tree for this kind of
search, but it would certainly not be optimal.  (E.g. by searching up to
the first wildcard, then searching each resulting sub-tree up to the
first non-wildcard, and resuming the search in all remaining sub-trees.)

> I believe it's important that the search be as general as possible.

Search should be cheap and fast.  Depending on how we arrange the data,
users can easily filter the results with grep, if necessary.

For example, we could have a radix tree of the base names and have the
directory prefixes as the leaf nodes.  The search would then involve two
steps: culling the set of directory prefixes by following the base-name
trie (that's fast), then doing a simple linear search over the results
(that's slow).

The trie can easily be serialized as a bunch of offsets that we only
need to repeatedly seek to, so we never need to hold it in memory.

--
Ricardo
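The trie idea sketched above (base names as keys, directory prefixes at the leaves, early abort collecting every reachable leaf) can be shown in a few lines. This is a hedged Python sketch of the data structure being discussed, not Guix code; the store paths are hypothetical:

```python
# Sketch of Ricardo's proposal: a trie keyed by base name whose leaves
# hold directory prefixes.  A search walks the trie along the query,
# then collects every leaf below the node it reached ("early abort").

class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.dirs = []      # directory prefixes of files with this base name

def insert(root, basename, directory):
    node = root
    for ch in basename:
        node = node.children.setdefault(ch, TrieNode())
    node.dirs.append(directory)

def search_prefix(root, prefix):
    """Return (basename, directory) pairs for base names starting with PREFIX."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    # Collect all leaves reachable from the node we stopped at.
    results, stack = [], [(prefix, node)]
    while stack:
        name, n = stack.pop()
        results.extend((name, d) for d in n.dirs)
        stack.extend((name + c, child) for c, child in n.children.items())
    return results

root = TrieNode()
insert(root, "libcord.so.1.2", "/gnu/store/...-gc/lib")      # hypothetical paths
insert(root, "libc.so", "/gnu/store/...-glibc/lib")
insert(root, "libcaca.a", "/gnu/store/...-libcaca/lib")
print(sorted(name for name, _ in search_prefix(root, "libc")))
```

Searching "libc" indeed returns "libc.so", "libcaca.a", and "libcord.so.1.2", matching the behavior described in the email; the second step (linear filtering of the returned directory prefixes) is left out.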
Arun Isaac <arunisaac@systemreboot.net> writes:

> sqlite insert statements can be very fast. sqlite.org claims 50000
> or...

I tried it, and got it down to... 30s!  Turns out that Guix already
provides `call-with-transaction', which I've used here.  So SQLite is
back on the table.

>> I need to measure the time SQL takes for a regexp match.
>
> sqlite, by default, does not come with regexp support. You might have to
> load some external library. See
> https://www.sqlite.org/lang_expr.html#the_like_glob_regexp_and_match_operators
>
> --8<---------------cut here---------------start------------->8---
> The REGEXP operator is a special syntax for the regexp() user
> function. No regexp() user function is defined by default and so use of
> the REGEXP operator will normally result in an error message. If an
> application-defined SQL function named "regexp" is added at run-time,
> then the "X REGEXP Y" operator will be implemented as a call to
> "regexp(Y,X)".
> --8<---------------cut here---------------end--------------->8---

After more reading, I see the following options:

- Either we use the sqlite-pcre extension (which is not packaged in
  Guix).
- Or we extend guile-sqlite3 to add a binding for create_function
  (https://www.sqlite.org/appfunc.html).
- Or do you think SQLite patterns (using "%") would do for now?  As
  Mathieu pointed out, it's an unfortunate inconsistency with the rest
  of Guix.  But maybe regexp support can be added in a second stage.

Thoughts?

--
Pierre Neidhardt
https://ambrevar.xyz/
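To make the quoted SQLite documentation concrete: registering an application-defined `regexp` function is exactly what a create_function binding would enable. Here is a sketch using Python's sqlite3 module (which already exposes `create_function`) purely to illustrate what the proposed guile-sqlite3 binding would do; the table contents are taken from the search example earlier in the thread:

```python
import re
import sqlite3

# Illustration of SQLite's application-defined regexp() function, using
# Python's sqlite3 binding (the thread's code would be Guile).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Files (subpath TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO Files VALUES (?)",
    [("/lib/libiov-buf-samba4.so",),
     ("/share/guile/site/3.0/gnu/packages/patches/m4-gnulib-libio.patch",),
     ("/bin/foo",)])

# SQLite rewrites "X REGEXP Y" into regexp(Y, X); return true on a match.
conn.create_function("regexp", 2,
                     lambda pattern, value: re.search(pattern, value) is not None)

rows = conn.execute("SELECT subpath FROM Files WHERE subpath REGEXP ?",
                    (r"libio",)).fetchall()
```

With the function registered, the `"%libio%"` LIKE query from the first message becomes the regexp query `libio`, matching the same two files.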
Hi Ricardo,

See my recent email with the new SQLite benchmark: I can now generate
the whole database in under 30 seconds, so about the same order of
magnitude as a naive text database (which holds less data).

SQLite pattern-search queries are extremely fast (<0.1s) and cover all
the examples named so far:

- exact basename match
- partial path match
- pattern match (e.g. "/include/%foo%")

The only thing that's missing is regexp support.  I'll see what we can
do.

Considering this, I find that leveraging SQLite is very attractive at
this point:

- Fast (fastest?).
- Low on memory.
- No need to come up with our own data format.
- Less work, no need to reinvent the wheel.

>> - It does not cover the case where I don't know the basename, e.g. if I'm
>>   looking for a FOO header file my query would look like "/include/.*foo.*".
>
> I think this is a rather rare use case, which in my opinion doesn't
> justify forgoing the use of a smart data structure.

I don't find it so rare, actually.  I've used file search (on other
OSes) to find:

- Emacs packages where I didn't know the exact name of the .el file;
- TeX Live packages (every time you get a font or .sty error);
- shared libraries (.so files whose exact name I didn't know);
- include files, as above;
- your favourite programming language's packages...

> It would *still* be possible to use the prefix tree for this kind of
> search, but it would certainly not be optimal.  (E.g. by searching up
> to the first wildcard, and then searching each resulting sub-tree up
> to the first non-wildcard, and resuming the search in all remaining
> sub-trees.)

But it's not a wildcard, it's a regexp.  Can we run full-path regexp
queries over a trie?

>> I believe it's important that the search be as general as possible.
>
> Search should be cheap and fast.

While very important, I would put more priority on exactness.  A search
that gives unreliable results (e.g. returns nothing when a result
exists) is a search that would deter many users, in my experience.

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/
An alternative to regexp would be "full-text search":
https://www.sqlite.org/fts5.html

I don't know how fast that is.  It might also induce more disk usage.
I'll test.

--
Pierre Neidhardt
https://ambrevar.xyz/
Pierre Neidhardt <mail@ambrevar.xyz> writes:
> - Or do you think SQLite patterns (using "%") would do for now? As
> Mathieu pointed out, it's an unfortunate inconsistency with the rest of
> Guix. But maybe regexp support can be added in a second stage.
These patterns could be generated from user input that looks more like
what we already use elsewhere. We don’t have to directly pass the
SQLite interface to users.
--
Ricardo
>> sqlite insert statements can be very fast. sqlite.org claims 50000
>> or...
>
> I tried it, and got it down to... 30s!

That's great! :-)

> - Or do you think SQLite patterns (using "%") would do for now?  As
>   Mathieu pointed out, it's an unfortunate inconsistency with the rest
>   of Guix.  But maybe regexp support can be added in a second stage.

The inconsistency is unfortunate.  Personally, I am in favor of dropping
regexp support everywhere in Guix and only having literal string search.
But that is backward incompatible, and may be controversial.

In this specific case of file search, we could use SQLite LIKE patterns
but not expose them to the user.  For example, if the search query is
"<query>", we search for the LIKE pattern "%<query>%".  I think this
addresses how users normally search for files.  I don't think regexps
add much value.

Full-text search may not be relevant to file search.  Full-text search
is more suited to natural-language search involving such things as
stemming algorithms.
Ricardo Wurmus <rekado@elephly.net> writes:

> Pierre Neidhardt <mail@ambrevar.xyz> writes:
>
>> - Or do you think SQLite patterns (using "%") would do for now?  As
>>   Mathieu pointed out, it's an unfortunate inconsistency with the rest
>>   of Guix.  But maybe regexp support can be added in a second stage.
>
> These patterns could be generated from user input that looks more like
> what we already use elsewhere.  We don't have to directly pass the
> SQLite interface to users.

True, an obvious fix for the syntax-familiarity problem would be to
replace "*" with "%" in the user query, so that it looks like shell
globbing.

--
Pierre Neidhardt
https://ambrevar.xyz/
Arun Isaac <arunisaac@systemreboot.net> writes:

>> - Or do you think SQLite patterns (using "%") would do for now?  As
>>   Mathieu pointed out, it's an unfortunate inconsistency with the rest
>>   of Guix.  But maybe regexp support can be added in a second stage.
>
> The inconsistency is unfortunate.  Personally, I am in favor of
> dropping regexp support everywhere in Guix, and only having literal
> string search.  But that is backward incompatible, and may be
> controversial.
>
> In this specific case of file search, we could use the sqlite like
> patterns, but not expose them to the user.  For example, if the search
> query is "<query>", we search for the LIKE pattern "%<query>%".  I
> think this addresses how users normally search for files.  I don't
> think regexps add much value.

I agree.

> Full text search may not be relevant to file search.  Full text search
> is more suited for natural language search involving such things as
> stemming algorithms.

Yes, but full-text search brings us a few niceties here:

- Wildcards using the `*` character.  This fixes the unfamiliarity
  of `%`.
- "Automatic word permutations" (maybe not the right term): "foo bar"
  and "bar foo" both match the same results!
- Case insensitivity and diacritic insensitivity (e.g. "e" matches "É").
- Logic: we can do "(OR foo bar) AND (OR qux quuz)".
- Relevance ranking: results can be sorted by relevance, another problem
  we don't have to solve ourselves ;)

All of the above is arguably more powerful and easier to use than
regexps.  But even if no user ever bothers with the logic operators, the
default behaviour "just works" in the fashion of a search engine.

The main thing I don't know how to do is suffix matching (like "%foo").
With FTS, looking up "foo*" won't match "libfoo", which is problematic.
Any clue how to fix that?

--
Pierre Neidhardt
https://ambrevar.xyz/
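The FTS behaviors under discussion (prefix wildcards, order-independent terms, and the "foo* won't match libfoo" limitation) can be reproduced with a small sketch. Python's sqlite3 is used here for illustration, assuming an SQLite build with FTS5 enabled (the default in most distributions); the file paths are made up:

```python
import sqlite3

# Sketch of SQLite FTS5 behaviour: terms match whole tokens (with optional
# "*" prefix wildcard) and may appear in any order in the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE files USING fts5(subpath)")
conn.executemany("INSERT INTO files VALUES (?)",
                 [("/share/doc/git/perl5/README",),
                  ("/lib/perl5/Tk.pm",),
                  ("/lib/libfoo.so",)])

def match(query):
    return [row[0] for row in
            conn.execute("SELECT subpath FROM files WHERE files MATCH ?",
                         (query,))]

match("git perl5")   # word order doesn't matter: "perl5 git" is equivalent
match("perl*")       # "*" prefix query matches the "perl5" token
match("foo*")        # does NOT match "libfoo": FTS only matches prefixes
```

The last call returning nothing is exactly the suffix-match problem raised above: the default tokenizer indexes "libfoo" as one token, and FTS5 offers prefix queries only.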
> Yes, but full text search brings us a few niceties here:

These are nice features, but I don't know if all of them are useful for
file search.  Normally, with Arch's pkgfile, I search for some missing
header file, shared library, etc.  Usually, I know the exact filename I
am looking for, or I know some prefix or suffix of it.

> - Case insensitive, diacritic insensitive (e.g. "e" matches "É").

Case insensitivity is quite useful.  Most filenames are in lower case,
but there is always that one odd filename out there.  Filenames usually
don't have diacritics, though, so I'm not sure diacritic insensitivity
is useful.

> All the above is arguably more powerful and easier to use than regexp.
> But even if no user ever bothers with the logic operators, the default
> behaviour "just works" in the fashion of a search engine.
>
> The main thing I don't know how to do is suffix matches (like "%foo").
> With FTS, looking up "foo*" won't match "libfoo", which is problematic.
> Any clue how to fix that?

This is handled by stemming.  We'd need a custom stemmer that normalizes
libfoo to foo.  Xapian has a nice page on stemming:
https://xapian.org/docs/stemming.html

Cheers!
Arun Isaac <arunisaac@systemreboot.net> writes:

> But filenames usually don't have diacritics.  So, I'm not sure if
> diacritic insensitivity is useful.

Probably not, but if there ever is an odd file name with an accent, we
won't have to worry about it: it will be handled.  Better too much than
too little!

> This is handled by stemming.  We'll need a custom stemmer that
> normalizes libfoo to foo.  Xapian has a nice page on stemming.  See
> https://xapian.org/docs/stemming.html

I think I gave a confusing example; stemming is too specific.  I meant
searching for a pattern that does not start the word: if I search "bar",
I'd like it to match "foobar".  The reverse is already possible ("foo"
matches "foobar").

Well, I'll share my FTS demo just now; we can always think about this
case later.

--
Pierre Neidhardt
https://ambrevar.xyz/
I've pushed my experiment to the `wip-filesearch' branch.  As of this
writing it is not automatically triggered by "guix build".  To test it:

- Load the module from a REPL.

- Run

--8<---------------cut here---------------start------------->8---
(test-index-git)
--8<---------------cut here---------------end--------------->8---

  You should get a database at ~/.cache/guix/files.db.

- Run

--8<---------------cut here---------------start------------->8---
(format-search (search-file-package "git perl5"))
--8<---------------cut here---------------end--------------->8---

  to get some search results.

- Run

--8<---------------cut here---------------start------------->8---
(persist-all-local-packages)
--8<---------------cut here---------------end--------------->8---

  to build the database for all your local files.  Warning: this takes
  some 30 seconds on my machine, and could take minutes if you have a
  spinning drive or an old processor.

I'll be away for a couple of days; feel free to experiment and send me
some feedback in the meantime.

What remains to be done:

- Automatically index "guix build" results in the database.  Any idea
  where I can trigger the indexing from a "guix build"?

- Sync the database data from a substitute server.  Any pointer would
  be appreciated :)

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/
On 13.08.20 at 12:04, Pierre Neidhardt wrote:

> SQLite pattern search queries are extremely fast (<0.1s) and cover all
> examples named so far:
>
> - exact basename match
> - partial path match
> - pattern match (e.g. "/include/%foo%")

For comparison, these are the options Debian's package search
<https://www.debian.org/distrib/packages.en#search_packages> supports:

- paths ending with the keyword -> e.g. file extensions
- packages that contain files named like this -> basename
- packages that contain files whose names contain the keyword ->
  LIKE %<keyword>%

--
Regards
Hartmut Goebel

| Hartmut Goebel | h.goebel@crazy-compilers.com |
| www.crazy-compilers.com | compilers which you thought are impossible |
On 11.08.20 at 14:35, Pierre Neidhardt wrote:
> Unlike Nix, we would like to do more than just index executable files.
> Indeed, it's very useful to know where to find, say, a C header, a .so
> library, a TeXlive .sty file, etc.
+1
Most of the time I'm searching for non-executable files.
--
Regards
Hartmut Goebel
| Hartmut Goebel | h.goebel@crazy-compilers.com |
| www.crazy-compilers.com | compilers which you thought are impossible |
Hi Pierre,

I tried the wip-filesearch branch.  Nice work! :-)

persist-all-local-packages takes around 350 seconds on my machine (a
slow machine with a spinning disk) and the database is 50 MB.  Some
other comments follow.

- Maybe we shouldn't index hidden files, particularly all the .xxx-real
  files created by our wrap phases.

- You should use SQL prepared statements with sqlite-prepare,
  sqlite-bind, etc.  That would correctly handle escaping special
  characters in the search string.  Currently, searching for
  "transmission-gtk", "libm.so", etc. errors out.

- Searching for "git perl5" works as expected, but searching for "git
  perl" returns no results.  I think this is due to the tokenizer used
  by the full-text search indexer.  The tokenizer sees the word "perl5"
  as one indivisible token and does not realize that "perl" is a prefix
  of "perl5".  Unfortunately, I think this is a fundamental problem with
  FTS -- one that can only be fixed by using simple LIKE patterns.  FTS
  is meant for natural-language search, where this kind of behaviour
  would be normal.

- I guess you are only indexing local packages now, but will include all
  packages later by some means.

Cheers!
Hi Hartmut, et al.,

On +2020-08-15 14:47:12 +0200, Hartmut Goebel wrote:

> On 13.08.20 at 12:04, Pierre Neidhardt wrote:
>> SQLite pattern search queries are extremely fast (<0.1s) and cover all
>> examples named so far:
>>
>> - exact basename match
>> - partial path match
>> - pattern match (e.g. "/include/%foo%")
>
> For comparison: These are the options Debian Package search
> <https://www.debian.org/distrib/packages.en#search_packages> supports:
>
> * paths ending with the keyword -> e.g. file-extensions
> * packages that contain files named like this -> basename
> * packages that contain files whose names contain the keyword ->
>   LIKE %<keyword>%

If you are on Debian, have you tried

    dpkg -l '*your*globbed*name*here*'

or

    dpkg -l '*' | egrep your-regex-here

or the former, preserving the header and excluding non-installed
packages, e.g.:

    dpkg -l '*your*globbed*name*here*' | grep -v ^un

--
Regards,
Bengt Richter
On 15.08.20 at 23:20, Bengt Richter wrote:
> If you are on debian, have you tried
> dpkg -l '*your*globbed*name*here*'
No, since for Debian I'm not using a command-line tool, but the
Web-Interface - which allows querying even packages I have not
installed. (And the latter is my specific use case: find which package I
need to install to get a specific file.)
--
Regards
Hartmut Goebel
| Hartmut Goebel | h.goebel@crazy-compilers.com |
| www.crazy-compilers.com | compilers which you thought are impossible |
Thanks for working on this!  This is a super awesome feature!  Best of
luck!

--
Joshua Branson
Sent from Emacs and Gnus
Hi Arun, thanks for your feedback!

350 seconds does not seem too bad.  Let's keep in mind that this
operation will mostly be run by the build farm / substitute server, not
by the user.

> - Maybe, we shouldn't index hidden files, particularly all the
>   .xxx-real files created by our wrap phases.

I thought about this, but for now I don't think it's a good idea:

- Hidden files are rather rare and won't take up much space in the
  database.
- What if the user actually searches for a hidden file?
- It's easy to exclude hidden files from the search results.

> - You should use SQL prepared statements with sqlite-prepare,
>   sqlite-bind, etc.  That would correctly handle escaping special
>   characters in the search string.  Currently, searching for
>   "transmission-gtk", "libm.so", etc. errors out.

Thanks for pointing this out, I'll look into it.

> - Searching for "git perl5" works as expected, but searching for "git
>   perl" returns no results.  I think this is due to the tokenizer used
>   by the full text search indexer.  The tokenizer sees the word "perl5"
>   as one indivisible token and does not realize that "perl" is a prefix
>   of "perl5".  Unfortunately, I think this is a fundamental problem
>   with FTS -- one that can only be fixed by using simple LIKE patterns.
>   FTS is meant for natural language search where this kind of thing
>   would be normal.

Indeed, but "git perl*" would have worked, I think.  We can always
pre-process the query to add stars everywhere.

At the moment, the only downside I see with FTS is that it seems
impossible to match words whose beginning we don't know.

Anyway, switching from FTS to LIKE patterns is rather easy.  In fact, I
could implement both behind a tiny abstraction so that we can choose
which one we want in the end.

> - I guess you are only indexing local packages now, but will include
>   all packages later by some means.

Indeed.  I want the substitute server / build farm to expose the
database for syncing.  I'd need some help to get started.  Anyone?

--
Pierre Neidhardt
https://ambrevar.xyz/
>> - You should use SQL prepared statements with sqlite-prepare,
>>   sqlite-bind, etc.  That would correctly handle escaping special
>>   characters in the search string.  Currently, searching for
>>   "transmission-gtk", "libm.so", etc. errors out.
>
> Thanks for pointing this out, I'll look into it.

I've looked into it.  The issue is actually not a lack of sqlite-bind
arguments, but "-" and "." triggering an error in FTS.  I've fixed it in
e08f913d20428a9a925cc46d177c7446f55e6443.

The downside is that we can't use any special characters, such as the
boolean operators.  I'm not sure how to get the best of both worlds.
Maybe add a command-line flag that would enable an "extended search
syntax" in which the user can use all of FTS but must take care to quote
"libm.so", for instance.

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/
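The "-" and "." problem is a property of the FTS5 query language itself, not of parameter binding: those characters are syntax, so a bare `libm.so` query is rejected even when bound as a parameter. Quoting each term turns it into a literal phrase query, which is presumably the shape of the fix. A hedged Python sqlite3 sketch (`fts_quote` is a made-up helper name; assumes an FTS5-enabled SQLite):

```python
import sqlite3

# Illustration of the FTS5 quoting issue: "." is special in the query
# syntax, so a bare "libm.so" errors out even as a bound parameter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE files USING fts5(subpath)")
conn.execute("INSERT INTO files VALUES ('/lib/libm.so')")

def fts_quote(query):
    """Double-quote each whitespace-separated term so FTS treats it literally."""
    return " ".join('"%s"' % term.replace('"', '""') for term in query.split())

error_raised = False
try:
    conn.execute("SELECT * FROM files WHERE files MATCH ?", ("libm.so",))
except sqlite3.OperationalError:
    error_raised = True  # fts5 syntax error near "."

# Quoted, the same input becomes the phrase query "libm.so" and matches.
rows = conn.execute("SELECT subpath FROM files WHERE files MATCH ?",
                    (fts_quote("libm.so"),)).fetchall()
```

This also shows why the quoted form loses the boolean operators: once every term is quoted, `AND`/`OR`/`NOT` are treated as ordinary tokens, hence the idea of an opt-in "extended search syntax" flag.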
Hi Pierre,

Cool to revive the topic. :-)  (Disclaimer: I have not yet processed the
whole thread.)

On Mon, 10 Aug 2020 at 16:32, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> 1. An SQLite database with the following schema:

If you are going with a local SQL database, my two questions are:

a) Which part would update it?  "guix pull"?  Something else?  Even
   using substitutes, channels and co. could add an extra cost, so what
   is acceptable and what is not?

b) Could you also include other fields, such as "synopsis" and
   "description"?  That could speed up "guix search" without adding (or
   modifying) the current cache
   (~/.config/guix/current/lib/package.cache); discussed at length in
   #39258 <http://issues.guix.gnu.org/issue/39258>.

Cheers,
simon
Hi Simon!

zimoun <zimon.toutoune@gmail.com> writes:

> If you are going to an local SQL database, my two questions are:
>
> a) Which part would update it?  "guix pull"?  Other?  Even using
> substitutes, the channels and co could lead to an extra cost and so
> what is acceptable and what is not?

I suggest fetching database updates when performing the file search,
just like `guix size' does.  The induced cost would be small in my
opinion: a compressed database for a whole generation is below 10 MiB,
and this operation would need to be done only once per Guix generation.
If we are a bit smart with timestamping and only send a diff, it could
be much, much smaller.

> b) Could you also include other fields such that "synopsis" and
> "description"?  Because it could speed up "guix search" without adding
> (or modifying) the current cache
> (~/.config/guix/current/lib/package.cache); discussed at length in
> #39258 <http://issues.guix.gnu.org/issue/39258>.

I think this is a bit beyond the scope of this patch set.  I'd rather
focus on files exclusively for now and proceed one step at a time :)

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/
Hi Pierre,

On Thu, 27 Aug 2020 at 13:15, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> zimoun <zimon.toutoune@gmail.com> writes:
>
>> a) Which part would update it?  "guix pull"?  Other?  Even using
>> substitutes, the channels and co could lead to an extra cost and so
>> what is acceptable and what is not?
>
> I suggest fetching database updates when performing the filesearch,
> just like `guix size' does.

I am not sure I see how.  One needs the whole database to search in, and
cannot know in advance which packages are involved, contrary to "guix
size", which manipulates the graph and then downloads the missing parts.

So your suggestion is to download the whole database (~10 MiB) the first
time the user runs "guix filesearch".  Then, each time the user runs
"guix pull" followed by "guix filesearch", there are two options: either
download the new database for the latest Guix generation, or download a
diff (not sure the complexity is worth it for ~10 MiB).  Right?

And what about channels?  If I read correctly, "guix size" fails when
there is "no available substitute information".

>> b) Could you also include other fields such that "synopsis" and
>> "description"?  Because it could speed up "guix search" without adding
>> (or modifying) the current cache; discussed at length in #39258
>> <http://issues.guix.gnu.org/issue/39258>.
>
> I think this is a bit beyond the scope of this patch set.  I'd rather
> focus on files exclusively for now and proceed one step at a time :)

I do not think it is beyond the scope, because Arun introduced an SQLite
database to improve "guix search", and that path was stopped because it
"introduced complexity" [1].  Therefore, if "guix filesearch" introduces
an SQL cache, it seems a good idea to make it also usable by "guix
search".

Well, if the table is extended with the "synopsis" and "description"
fields, what is the size of the database?  Does it kill the lookup
performance?

[1] http://issues.guix.gnu.org/issue/39258#7

Cheers,
simon
zimoun <zimon.toutoune@gmail.com> writes:

> I am not sure I see how.  One needs the whole database to search in,
> and cannot know in advance which packages are involved, contrary to
> "guix size", which manipulates the graph and then downloads the
> missing parts.
>
> So your suggestion is to download the whole database (~10 MiB) the
> first time the user runs "guix filesearch".

Yes.  If the user is not using substitutes, they can also compute the
database from their local store items (as I'm doing in the draft).  It's
less useful, but still a bit more convenient than running find/locate
over /gnu/store.

> Then, each time the user runs "guix pull" followed by "guix
> filesearch", there are two options: either download the new database
> for the latest Guix generation, or download a diff (not sure the
> complexity is worth it for ~10 MiB).
>
> Right?

Yes.

> And what about channels?  If I read correctly, "guix size" fails when
> there is "no available substitute information".

Just like `guix size', it would work with local items.  But if there is
no substitute server for a channel, then there is no way around it.

>> I think this is a bit beyond the scope of this patch set.  I'd rather
>> focus on files exclusively for now and proceed one step at a time :)
>
> I do not think it is beyond the scope, because Arun introduced an
> SQLite database to improve "guix search", and that path was stopped
> because it "introduced complexity" [1].  Therefore, if "guix
> filesearch" introduces an SQL cache, it seems a good idea to make it
> also usable by "guix search".
>
> Well, if the table is extended with the "synopsis" and "description"
> fields, what is the size of the database?  Does it kill the lookup
> performance?

Good point, I'll test and report my measurements.

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/
Hi Pierre!
Sorry for the very late response.
> I've fixed it in
> e08f913d20428a9a925cc46d177c7446f55e6443. The downside is that we can't
> use any special character like boolean operators. I'm not sure how we
> can get the best of both worlds. Maybe add a command line flag that
> would enable "extended search syntax" in which the user can use all of
> FTS but must make sure to quote "libm.so" for instance.
Where is this commit so I may test it? I couldn't find the commit in the
wip-filesearch branch. The HEAD of the wip-filesearch branch I see is
49b52c2c7be03caf3636632c31f4451d5bc88125.
Regards,
Arun
Oops!  Forgot to push.  It's actually commit
25147f983bdf432b03e8271abe0318f4812f94ba on wip-filesearch.

Thanks for the notice!

--
Pierre Neidhardt
https://ambrevar.xyz/
> Oops!  Forgot to push.
>
> It's actually commit 25147f983bdf432b03e8271abe0318f4812f94ba on
> wip-filesearch.

I checked, and it works now.  Just a tiny nitpick: the function
signature of search-file-package,

--8<---------------cut here---------------start------------->8---
(define (search-file-package pattern . more-patterns) ...)
--8<---------------cut here---------------end--------------->8---

can be rewritten as

--8<---------------cut here---------------start------------->8---
(define (search-file-package . patterns) ...)
--8<---------------cut here---------------end--------------->8---

No need to cons pattern onto more-patterns in the function body.

Thanks for working on this!
Arun Isaac <arunisaac@systemreboot.net> writes:

> The function signature of search-file-package,
>
> --8<---------------cut here---------------start------------->8---
> (define (search-file-package pattern . more-patterns) ...)
> --8<---------------cut here---------------end--------------->8---
>
> can be rewritten as
>
> --8<---------------cut here---------------start------------->8---
> (define (search-file-package . patterns) ...)
> --8<---------------cut here---------------end--------------->8---
>
> No need to cons pattern onto more-patterns in the function body.

There is a subtle difference: with the latter, (search-file-package) is
allowed and won't trigger an error, while with the former it does.  This
"foo . more-foo" pattern is a way to say "1 or more arguments" instead
of "0 or more".

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/
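The same "1 or more" vs. "0 or more" distinction exists in most languages with rest arguments. As an analogy to the Scheme signatures above, here is a Python sketch (the function names are made up for illustration):

```python
# Analogy to the Scheme rest-argument signatures discussed above.

def one_or_more(pattern, *more_patterns):
    """Like (define (f pattern . more-patterns) ...): requires >= 1 argument."""
    return [pattern, *more_patterns]

def zero_or_more(*patterns):
    """Like (define (f . patterns) ...): also accepts zero arguments."""
    return list(patterns)

zero_or_more()        # fine: empty search is silently accepted
one_or_more("git")    # fine
try:
    one_or_more()     # arity error: a pattern is required
except TypeError:
    pass
```

Requiring at least one argument turns a meaningless empty query into an immediate error instead of a silent no-op, which is the point Pierre is making.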
> There is a subtle difference: with the latter, (search-file-package)
> is allowed and won't trigger an error, while with the former it does.
> This "foo . more-foo" pattern is a way to say "1 or more arguments"
> instead of "0 or more".

Ah, OK, that makes sense!

Regards,
Arun
Sorry for the delay.  I've pushed a commit which adds the synopsis and
the description to the database:
0127cfa5d0898222257a716bf7b0a167f31cc6dd.

The quite surprising result is that these new details only cost 1 MiB
extra!  I can then search packages, and it's super fast:

--8<---------------cut here---------------start------------->8---
> ,time (search-package "org" "equival")
$51 = (("emacs-org-contrib"))
;; 0.002930s real time, 0.002925s run time.  0.000000s spent in GC.
--8<---------------cut here---------------end--------------->8---

My local database only includes the store items, that is to say some
1800 items, but even with all 15000 packages the search would not be
impacted by much.  The size might grow by some 10 MiB, but that's
reasonable, and it compresses very well (down to 6-7 MiB in zstd).

It seems to me that Simon had a very good hunch: the filesearch
implementation could fix the slow-search issue for free.

Thoughts?

--
Pierre Neidhardt
https://ambrevar.xyz/
The database with all package descriptions and synopses is 46 MiB and
compresses down to 11 MiB in zstd.

--
Pierre Neidhardt
https://ambrevar.xyz/
Hi,
Pierre Neidhardt <mail@ambrevar.xyz> skribis:
> I've pushed a commit which adds the synopsis and the description to the
> database.
>
> 0127cfa5d0898222257a716bf7b0a167f31cc6dd
>
> The quite surprising result is that these new details only cost 1 MiB extra!
>
> I can then search packages and it's super fast:
>
>> ,time (search-package "org" "equival")
> $51 = (("emacs-org-contrib"))
> ;; 0.002930s real time, 0.002925s run time. 0.000000s spent in GC.
>
> My local database only includes the store items, that is to say some
> 1800 items, but even with the 15000 packages the search would not be
> impacted by much. The size might grow by some 10 MiB though, but that's
> reasonable and it compresses very well (down to 6-7 MiB in zstd).
Nice!
Could you post a summary of what you have done, what’s left to do, and
how you’d like to integrate it? (If you’ve already done it, my
apologies, but you can resend a link. :-))
Thank you,
Ludo’.
Hi Ludo!

Ludovic Courtès <ludo@gnu.org> writes:

> Nice!

Thanks!

> Could you post a summary of what you have done, what’s left to do, and
> how you’d like to integrate it? (If you’ve already done it, my
> apologies, but you can resend a link. :-))

What I've done: mostly a database benchmark.

- Textual database: slow and not lighter than SQLite.  Not worth it, I
  believe.

- SQLite without full-text search: fast, supports classic patterns
  (e.g. "foo*bar") but does not support word permutations.

- SQLite with full-text search: fast, supports word permutations but
  does not support suffix matching (e.g. "bar" won't match "foobar").
  Size is about the same as without full-text search.

- Include synopses and descriptions.  Maybe we should include all
  fields that are searched by `guix search`.  This incurs a cost on the
  database size but it would fix the `guix search` speed issue.  Size
  increases by some 10 MiB.

I say we go with SQLite full-text search for now, with all package
details.  Switching back to non-full-text search is a minor adjustment,
which we can decide on later when merging the final patch.  Same if we
decide not to include the description, synopsis, etc.

What's left to do:

- Populate the database on demand, either after a `guix build` or from
  a `guix filesearch...`.  This is important so that `guix filesearch`
  works on packages built locally.  If we hook into `guix build`, I
  need help to know where to plug it in.

- Adapt Cuirass so that it builds its file database.  I need pointers
  to get started here.

- Sync the databases from the substitute server to the client when
  running `guix filesearch`.  For this I suggest we send the compressed
  database corresponding to a Guix generation over the network (around
  10 MiB).  Not sure sending just the delta is worth it.

- Find a way to garbage-collect the database(s).  My intuition is that
  we should have one database per Guix checkout, and when we `guix gc`
  a Guix checkout we collect the corresponding database.  I would store
  the databases in /var/guix/...

Comments and help welcome! :)

-- 
Pierre Neidhardt
https://ambrevar.xyz/
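The full-text-search tradeoffs listed above are easy to reproduce
directly against SQLite.  Here is a minimal sketch using Python's
built-in sqlite3 module (not the Guile bindings the patch uses),
assuming an FTS5-enabled SQLite build: LIKE patterns match substrings
and suffixes, while FTS5 MATCH handles word permutations but only
matches whole tokens.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE files USING fts5(subpath)")
db.execute("INSERT INTO files VALUES"
           " ('/share/emacs/site-lisp/magit-wip.el')")

# Classic patterns (LIKE) support substring/suffix matching:
suffix = db.execute(
    "SELECT * FROM files WHERE subpath LIKE '%agit%'").fetchall()

# FTS5 matches whole tokens in any order ("word permutations")...
perm = db.execute(
    "SELECT * FROM files WHERE files MATCH 'el magit'").fetchall()

# ...but not suffixes: 'agit' does not match the token 'magit'.
nosuffix = db.execute(
    "SELECT * FROM files WHERE files MATCH 'agit'").fetchall()

print(len(suffix), len(perm), len(nosuffix))  # → 1 1 0
```

(The default unicode61 tokenizer splits the path on non-alphanumeric
characters, so "magit", "wip" and "el" are separate tokens.)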
Hi Pierre,
On Mon, 5 Oct 2020 at 20:53, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
> Comments and help welcome! :)
I have just checked it out, and I am probably doing something wrong,
but "./pre-inst-env guix filesearch whatever" raises an error with this
backtrace:
--8<---------------cut here---------------start------------->8---
Backtrace:
In ice-9/boot-9.scm:
1736:10 5 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In unknown file:
4 (apply-smob/0 #<thunk 7f43aa9254a0>)
In ice-9/boot-9.scm:
718:2 3 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>)
In ice-9/eval.scm:
619:8 2 (_ #(#(#<directory (guile-user) 7f43aa549f00>)))
In guix/ui.scm:
2038:22 1 (run-guix-command filesearch "hello")
In ice-9/boot-9.scm:
2828:12 0 (module-ref _ _ . _)
ice-9/boot-9.scm:2828:12: In procedure module-ref:
No variable named guix-filesearch in #<interface (guix scripts
filesearch) 7f43a83f2000>
--8<---------------cut here---------------end--------------->8---
I will comment later on some specific points.
BTW, thank you for revivifying the topic, especially the database. :-)
Cheers,
simon
Of course, `guix filesearch` hasn't been implemented yet ;)

We still need to decide whether we want to make it part of `guix
search' or define a separate command.  Thoughts?

-- 
Pierre Neidhardt
https://ambrevar.xyz/
Hi,

On Sat, 10 Oct 2020 at 10:57, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> Of course, `guix filesearch` hasn't been implemented yet ;)

Sorry, I had overlooked the status. :-)

> We still need to decide whether we want to make it part of `guix search'
> or define a separate command.

From my point of view, it would be better to include it under “guix
search”.  Even the code about “search” in “guix package” should go into
“guix search”; that is, the opposite of the current situation: “guix
package --search” would become an alias of “guix search” in terms of
code.  Anyway. :-)

Cheers,
simon
Hi,

On Mon, 05 Oct 2020 at 20:53, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> - Textual database: slow and not lighter than SQLite.  Not worth it, I
>   believe.

Maybe I am out of scope here, but re-reading *all* the discussion about
“filesearch”, is it really possible to do better than “locate”, as
Ricardo mentioned?

--8<---------------cut here---------------start------------->8---
echo 3 > /proc/sys/vm/drop_caches
time updatedb --output=/tmp/store.db --database-root=/gnu/store/

real    0m19.903s
user    0m1.549s
sys     0m4.500s

du -sh /gnu/store /tmp/store.db
30G     /gnu/store
56M     /tmp/store.db

guix gc -F XXG

echo 3 > /proc/sys/vm/drop_caches
time updatedb --output=/tmp/store.db --database-root=/gnu/store/

real    0m10.105s
user    0m0.865s
sys     0m2.020s

du -sh /gnu/store /tmp/store.db
28G     /gnu/store
52M     /tmp/store.db
--8<---------------cut here---------------end--------------->8---

And “locate” supports regexps, and it is fast enough.

--8<---------------cut here---------------start------------->8---
echo 3 > /proc/sys/vm/drop_caches
time locate -d /tmp/store.db --regex "emacs-ma[a-z0-9\.\-]+\/[^.]+.el$" | tail -n5
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit-transient.el
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit-utils.el
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit-wip.el
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit-worktree.el
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit.el

real    0m3.601s
user    0m3.528s
sys     0m0.061s
--8<---------------cut here---------------end--------------->8---

The only point is that regexps are always cumbersome for me.  Well:
«Some people, when confronted with a problem, think "I know, I'll use
regular expressions."  Now they have two problems.» :-) [1]

[1] https://en.wikiquote.org/wiki/Jamie_Zawinski

> - Include synopses and descriptions.  Maybe we should include all
>   fields that are searched by `guix search`.  This incurs a cost on the
>   database size but it would fix the `guix search` speed issue.  Size
>   increases by some 10 MiB.

From my point of view, yes.  Somehow “filesearch” is a subpart of
“search”, so it should share the machinery.

> I say we go with SQLite full-text search for now, with all package
> details.  Switching back to non-full-text search is a minor adjustment,
> which we can decide on later when merging the final patch.  Same if we
> decide not to include the description, synopsis, etc.

[...]

> - Populate the database on demand, either after a `guix build` or from
>   a `guix filesearch...`.  This is important so that `guix filesearch`
>   works on packages built locally.  If we hook into `guix build`, I
>   need help to know where to plug it in.

[...]

> - Sync the databases from the substitute server to the client when
>   running `guix filesearch`.  For this I suggest we send the compressed
>   database corresponding to a Guix generation over the network (around
>   10 MiB).  Not sure sending just the delta is worth it.

From my point of view, how to transfer the database from substitutes to
users, and how to update it locally (custom channels or custom load
paths), are not easy; maybe these are the core issues.

For example, I just did “guix pull”, and “--list-generations” says,
from f6dfe42 (Sept. 15) to 4ec2190 (Oct. 10):

    39.9 MB will be downloaded

plus the tiny bits before “Computing Guix derivation”.  Say 50 MB max.

Well, the “locate” database for my /gnu/store (~30 GB) is already
~50 MB, and ~20 MB when compressed with gzip.  And Pierre said:

    The database with all package descriptions and synopses is 46 MiB
    and compresses down to 11 MiB in zstd.

which is better, but still something.  Well, it is not affordable to
fetch the database with “guix pull”, IMHO.

Therefore, the database would be fetched at the first “guix search”
(assuming the point above).  But then, how could “search” know what is
custom-built and what is not?  Somehow, “search” would have to scan the
whole store to be able to update the database.

And what happens each time I do a custom build and then “filesearch”?
The database should be updated, right?  Well, it seems almost unusable.

The “updatedb/locate” model seems better: the user updates “manually”
if required, and then locating is fast.  Most of the time I am
searching for files in packages that are not my custom packages, IMHO.

To me, each time I use “filesearch”:

 - the first time: fetch the database corresponding to the Guix commit,
   then update it with my local store;
 - otherwise: use this database;
 - optionally, update the database if the user wants to include new
   custom items.

We could imagine a hook or option to “guix pull” specifying to also
fetch the database and update it at pull time instead of at “search”
time.  Personally, I prefer a longer “guix pull” (it is already a bit
long) and then a fast “search”, rather than half/half (a not-so-long
pull and a longer search).  WDYT?

> - Find a way to garbage-collect the database(s).  My intuition is that
>   we should have one database per Guix checkout, and when we `guix gc`
>   a Guix checkout we collect the corresponding database.

Well, the exact same strategy as
~/.config/guix/current/lib/guix/package.cache can be used.

BTW, thanks Pierre for improving Guix discoverability. :-)

Cheers,
simon
Hi Zimoun,

Thanks for the feedback!

> --8<---------------cut here---------------start------------->8---
> echo 3 > /proc/sys/vm/drop_caches
> time updatedb --output=/tmp/store.db --database-root=/gnu/store/
>
> real    0m19.903s
> user    0m1.549s
> sys     0m4.500s

I don't know the size of your store, nor your hardware.  Could you
benchmark against my filesearch implementation?

> And “locate” supports regexps, and it is fast enough.

But locate does not support word permutations, which is a very
important feature for filesearch, in my opinion.

> The only point is that regexps are always cumbersome for me.  Well:
> «Some people, when confronted with a problem, think "I know, I'll use
> regular expressions."  Now they have two problems.» :-) [1]

Exactly.  Full-text search is a big step forward for usability, I
think.

> From my point of view, yes.  Somehow “filesearch” is a subpart of
> “search”, so it should share the machinery.

I'll work on it.  I'll try to make the code flexible enough so that it
can be moved to another command easily, should we decide that "search"
is not the right fit.

> From my point of view, how to transfer the database from substitutes to
> users, and how to update it locally (custom channels or custom load
> paths), are not easy; maybe these are the core issues.

Absolutely.

> For example, I just did “guix pull”, and “--list-generations” says,
> from f6dfe42 (Sept. 15) to 4ec2190 (Oct. 10):
>
>     39.9 MB will be downloaded
>
> plus the tiny bits before “Computing Guix derivation”.  Say 50 MB max.
>
> Well, the “locate” database for my /gnu/store (~30 GB) is already
> ~50 MB, and ~20 MB when compressed with gzip.  And Pierre said:
>
>     The database with all package descriptions and synopses is 46 MiB
>     and compresses down to 11 MiB in zstd.

I should have benchmarked with lzip, it would have been more useful.  I
think we can get it down to approximately 8 MiB in lzip.

> which is better, but still something.  Well, it is not affordable to
> fetch the database with “guix pull”, IMHO.

We could send a "diff" of the database.  For instance, if the user
already has a file database for Guix generation A and then pulls to B,
the substitute server can send the diff between A and B.  This would
probably amount to less than 1 MiB if the generations are not too far
apart.  (Warning: actual measures needed!)

> Therefore, the database would be fetched at the first “guix search”
> (assuming the point above).  But then, how could “search” know what is
> custom-built and what is not?  Somehow, “search” would have to scan the
> whole store to be able to update the database.
>
> And what happens each time I do a custom build and then “filesearch”?
> The database should be updated, right?  Well, it seems almost unusable.

I mentioned this previously: we need to update the database on "guix
build".  This is very fast and would be mostly transparent to the user.
This is essentially how "guix size" behaves.

> The “updatedb/locate” model seems better: the user updates “manually”
> if required, and then locating is fast.

"Manually" is not good, in my opinion.  The end user will inevitably
forget.  An out-of-sync database would return bad results, which is a
big no-no for search.  On-demand database updates are ideal, I think.

> To me, each time I use “filesearch”:
>
>  - the first time: fetch the database corresponding to the Guix commit,
>    then update it with my local store;

Possibly using a "diff" to shrink the download size.

>  - otherwise: use this database;
>  - optionally, update the database if the user wants to include new
>    custom items.

No need for the optional point, I believe.

> We could imagine a hook or option to “guix pull” specifying to also
> fetch the database and update it at pull time instead of at “search”
> time.  Personally, I prefer a longer “guix pull” (it is already a bit
> long) and then a fast “search”, rather than half/half (a not-so-long
> pull and a longer search).

I suggest we do it at pull time so that =guix search= does not need
network access.  =guix pull= requires networking anyway.

>> - Find a way to garbage-collect the database(s).  My intuition is that
>>   we should have one database per Guix checkout, and when we `guix gc`
>>   a Guix checkout we collect the corresponding database.
>
> Well, the exact same strategy as
> ~/.config/guix/current/lib/guix/package.cache can be used.

Oh!  I didn't know about this file!  What is it used for?

> BTW, thanks Pierre for improving Guix discoverability. :-)

Thank you! :)

-- 
Pierre Neidhardt
https://ambrevar.xyz/
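The per-generation "diff" idea above can be expressed almost entirely
in SQL: attach the old generation's database and take the set
difference.  A minimal sketch with Python's built-in sqlite3 module and
a simplified one-column Files table (the real schema carries more
columns, and a real implementation would also have to handle deleted
rows):

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
old_db = os.path.join(tmp, "generation-a.db")
new_db = os.path.join(tmp, "generation-b.db")

# Simplified stand-ins for two per-generation file databases.
for path, rows in ((old_db, ["/bin/foo"]),
                   (new_db, ["/bin/foo", "/bin/bar"])):
    with sqlite3.connect(path) as db:
        db.execute("CREATE TABLE Files (subpath TEXT PRIMARY KEY)")
        db.executemany("INSERT INTO Files VALUES (?)",
                       [(r,) for r in rows])

# Server side: the rows a client stuck at generation A is missing.
db = sqlite3.connect(new_db)
db.execute("ATTACH DATABASE ? AS old", (old_db,))
delta = db.execute("SELECT subpath FROM Files"
                   " EXCEPT SELECT subpath FROM old.Files").fetchall()
print(delta)  # → [('/bin/bar',)]
```

The delta would then be compressed and shipped to the client, which
applies it as plain INSERTs.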
Hi Pierre,

I am trying to resume the work on "guix search" to improve it (make it
faster).  That's why I am asking for the details. :-) With the
introduction of this database, as mentioned earlier, two annoyances
could be fixed at once.

On Sun, 11 Oct 2020 at 13:19, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> > --8<---------------cut here---------------start------------->8---
> > echo 3 > /proc/sys/vm/drop_caches
> > time updatedb --output=/tmp/store.db --database-root=/gnu/store/
> >
> > real    0m19.903s
> > user    0m1.549s
> > sys     0m4.500s
>
> I don't know the size of your store, nor your hardware.  Could you
> benchmark against my filesearch implementation?

30G, as I reported in my previous email. ;-)

> > From my point of view, yes.  Somehow “filesearch” is a subpart of
> > “search”, so it should share the machinery.
>
> I'll work on it.  I'll try to make the code flexible enough so that it
> can be moved to another command easily, should we decide that "search"
> is not the right fit.

The UI does not matter so much at this point, I guess.  But the nice
final UI should be:

    guix search --file=

> > For example, I just did “guix pull”, and “--list-generations” says,
> > from f6dfe42 (Sept. 15) to 4ec2190 (Oct. 10):
> >
> >     39.9 MB will be downloaded
> >
> > plus the tiny bits before “Computing Guix derivation”.  Say 50 MB max.
> >
> > Well, the “locate” database for my /gnu/store (~30 GB) is already
> > ~50 MB, and ~20 MB when compressed with gzip.  And Pierre said:
> >
> >     The database with all package descriptions and synopses is 46 MiB
> >     and compresses down to 11 MiB in zstd.
>
> I should have benchmarked with lzip, it would have been more useful.  I
> think we can get it down to approximately 8 MiB in lzip.

Well, I think it will be more with all the items of all the packages.
My point is: the database will be comparable in size to the bits
downloaded by "guix pull"; it is not much, but still something.

> > which is better, but still something.  Well, it is not affordable to
> > fetch the database with “guix pull”, IMHO.
>
> We could send a "diff" of the database.

This means setting things up server side, right?  So implementing the
"diff" in "guix publish", right?  Hum?  I feel it is overcomplicated.

> For instance, if the user already has a file database for Guix
> generation A and then pulls to B, the substitute server can send the
> diff between A and B.  This would probably amount to less than 1 MiB if
> the generations are not too far apart.  (Warning: actual measures
> needed!)

Well, what is the size for a full /gnu/store/ containing all the
packages of one specific revision?  Sorry if you already provided this
information; I may have missed it.

> > Therefore, the database would be fetched at the first “guix search”
> > (assuming the point above).  But then, how could “search” know what is
> > custom-built and what is not?  Somehow, “search” would have to scan the
> > whole store to be able to update the database.
> >
> > And what happens each time I do a custom build and then “filesearch”?
> > The database should be updated, right?  Well, it seems almost unusable.
>
> I mentioned this previously: we need to update the database on "guix
> build".  This is very fast and would be mostly transparent to the user.
> This is essentially how "guix size" behaves.

Ok.

> > The “updatedb/locate” model seems better: the user updates “manually”
> > if required, and then locating is fast.
>
> "Manually" is not good, in my opinion.  The end user will inevitably
> forget.  An out-of-sync database would return bad results, which is a
> big no-no for search.  On-demand database updates are ideal, I think.

The tradeoff is:

 - when is "on-demand"?  When does the database get updated?
 - searching should still be fast;
 - other guix subcommands should not be slowed down.

What you are proposing is:

 - when running "guix search --file":
   + if the database does not exist: fetch it;
   + otherwise: use it;
 - after each "guix build", update the database.

Right?  I am still missing the other update mechanism for updating the
database.

(Note that the "fetch it" could be done at "guix pull" time, which is
more meaningful since pull requires network access, as you said.  And
the real computation for updating could be done at the first "guix
search --file" after the pull.)

> Possibly using a "diff" to shrink the download size.
>
> > - otherwise: use this database;
> > - optionally, update the database if the user wants to include new
> >   custom items.
>
> No need for the optional point, I believe.

Note that since the same code is used on build farms, where the store
is several TB (see the recent discussion about "guix gc" on Berlin
taking hours), building and updating the database needs some care. :-)

> >> - Find a way to garbage-collect the database(s).  My intuition is that
> >>   we should have one database per Guix checkout, and when we `guix gc`
> >>   a Guix checkout we collect the corresponding database.
> >
> > Well, the exact same strategy as
> > ~/.config/guix/current/lib/guix/package.cache can be used.
>
> Oh!  I didn't know about this file!  What is it used for?

Basically, for "--news".  Otherwise, it is used by
"fold-available-packages", "find-packages-by-name" and
"find-packages-by-location".  It is used only if "--load-path" is not
provided (cache-is-authoritative?), and it is computed at the end of
"guix pull".

The discussions about improving "guix search" were first about
replacing it with an SQL database, then about adding another file
mimicking it, then about extending it (which leads to
backward-compatibility issues).

For example, compare:

--8<---------------cut here---------------start------------->8---
time guix package --list-available > /dev/null

real    0m1.025s
user    0m1.866s
sys     0m0.044s

time guix package --list-available -L /tmp/foo > /dev/null

real    0m4.436s
user    0m6.734s
sys     0m0.124s
--8<---------------cut here---------------end--------------->8---

The first uses the cache, the second does not.

Cheers,
simon
Hi Zimoun,

Maybe you misunderstood a point: the filesearch database is not a
database of _all store items_, but only of the items that correspond to
the packages of a given Guix generation.  This should answer many of
your comments.

>> I don't know the size of your store, nor your hardware.  Could you
>> benchmark against my filesearch implementation?
>
> 30G, as I reported in my previous email. ;-)

Sorry, I was unclear: I meant benchmarking the runtime of the Guile
code I wrote in the patch, i.e.

--8<---------------cut here---------------start------------->8---
,time (persist-all-local-packages)
--8<---------------cut here---------------end--------------->8---

>> I should have benchmarked with lzip, it would have been more useful.  I
>> think we can get it down to approximately 8 MiB in lzip.
>
> Well, I think it will be more with all the items of all the packages.

No, the 8 MiB include _all the packages_ of a Guix generation.  We
never include the complete store; it would not make sense for
filesearch.

> This means setting things up server side, right?  So implementing the
> "diff" in "guix publish", right?  Hum?  I feel it is overcomplicated.

I don't think it's too complicated: the client sends a request along
with its Guix generation commit and the closest Guix generation commit
for which it has a database; the server diffs the two SQLite databases,
compresses the result, and sends it back.

> Well, what is the size for a full /gnu/store/ containing all the
> packages of one specific revision?  Sorry if you already provided this
> information; I may have missed it.

The size of a /gnu/store does not matter; the size of the database
does, however.  From the email of September 26th:

--8<---------------cut here---------------start------------->8---
The database with all package descriptions and synopses is 46 MiB and
compresses down to 11 MiB in zstd.
--8<---------------cut here---------------end--------------->8---

>> "Manually" is not good, in my opinion.  The end user will inevitably
>> forget.  An out-of-sync database would return bad results, which is a
>> big no-no for search.  On-demand database updates are ideal, I think.
>
> The tradeoff is:
>  - when is "on-demand"?  When does the database get updated?

"guix build" and "guix pull".

>  - searching should still be fast;

Sorry, what is your question?

>  - other guix subcommands should not be slowed down.

"guix pull" is not called by other commands.  I don't think "guix
build" would be impacted much, because the database update for a single
store item is very fast.

> What you are proposing is:
>
>  - when running "guix search --file":
>    + if the database does not exist: fetch it;
>    + otherwise: use it;

No, do it in "guix pull", since it requires networking already.

>  - after each "guix build", update the database.

Yes.

> I am still missing the other update mechanism for updating the
> database.

Why?

> (Note that the "fetch it" could be done at "guix pull" time, which is
> more meaningful since pull requires network access, as you said.  And
> the real computation for updating could be done at the first "guix
> search --file" after the pull.)

Maybe this is the misunderstanding: "fetch it" and "update it" are the
same thing.  You fetch the diff from the substitute server and you
apply it onto your local database.

> Note that since the same code is used on build farms, where the store
> is several TB (see the recent discussion about "guix gc" on Berlin
> taking hours), building and updating the database needs some care. :-)

There is no difference between the build farm and my computer, since
I've generated the database over all 15000+ packages.  That the store
has several TB is irrelevant, since only the given 15000 items will be
browsed.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/
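The "very fast" per-item update mentioned above amounts to walking one
store directory and inserting its file list.  A rough Python sketch of
that step, mirroring the Guile `directory-files` procedure from the
first message (the actual hook point in `guix build` is still an open
question, and `index_store_item` is a hypothetical helper, not code
from the patch):

```python
import os
import sqlite3
import tempfile

def index_store_item(db, store_path):
    """Insert every regular file under STORE_PATH into the Files table,
    as a subpath starting with '/' (as in the Guile `directory-files`)."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(store_path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rows.append((full[len(store_path):], store_path))
    db.executemany("INSERT OR IGNORE INTO Files (subpath, package)"
                   " VALUES (?, ?)", rows)
    db.commit()

# Demo on a throwaway "store item".
item = tempfile.mkdtemp()
os.makedirs(os.path.join(item, "bin"))
open(os.path.join(item, "bin", "foo"), "w").close()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Files (subpath TEXT NOT NULL,"
           " package TEXT NOT NULL, PRIMARY KEY (subpath, package))")
index_store_item(db, item)
print(db.execute("SELECT subpath FROM Files").fetchall())
# → [('/bin/foo',)]
```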
On Sun, 11 Oct 2020 at 16:25, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> Maybe you misunderstood a point: the filesearch database is not a
> database of _all store items_, but only of the items that correspond to
> the packages of a given Guix generation.

Yes, that is clear to me.  I meant: “all the store items of one
specific Guix generation/revision”, so all the output files of ~15k
packages.

>> Well, I think it will be more with all the items of all the packages.
>
> No, the 8 MiB include _all the packages_ of a Guix generation.  We
> never include the complete store; it would not make sense for
> filesearch.

[...]

> --8<---------------cut here---------------start------------->8---
> The database with all package descriptions and synopses is 46 MiB and
> compresses down to 11 MiB in zstd.
> --8<---------------cut here---------------end--------------->8---

Sorry, I had overlooked the code.  All is fine. :-)

I will try to adapt “guix search” to use the database instead.

Cheers,
simon
Hi,
Pierre Neidhardt <mail@ambrevar.xyz> skribis:
> Of course, `guix filesearch` hasn't been implemented yet ;)
>
> We still need to decide whether we want to make it part of `guix search'
> or define a separate command.
I would lean towards keeping it separate, so that it’s an optional
feature (given that it relies on downloading an external database).
Thanks,
Ludo’.
Hi!

Pierre Neidhardt <mail@ambrevar.xyz> skribis:

>> Could you post a summary of what you have done, what’s left to do, and
>> how you’d like to integrate it? (If you’ve already done it, my
>> apologies, but you can resend a link. :-))
>
> What I've done: mostly a database benchmark.
>
> - Textual database: slow and not lighter than SQLite.  Not worth it, I
>   believe.
>
> - SQLite without full-text search: fast, supports classic patterns
>   (e.g. "foo*bar") but does not support word permutations.
>
> - SQLite with full-text search: fast, supports word permutations but
>   does not support suffix matching (e.g. "bar" won't match "foobar").
>   Size is about the same as without full-text search.
>
> - Include synopses and descriptions.  Maybe we should include all
>   fields that are searched by `guix search`.  This incurs a cost on the
>   database size but it would fix the `guix search` speed issue.  Size
>   increases by some 10 MiB.

Oh, so this is going beyond file search, right?

Perhaps it would make sense to focus on file search only as a first
step, and to see later, separately, what can be done with
synopses/descriptions (like Arun and zimoun did before)?

> What's left to do:
>
> - Populate the database on demand, either after a `guix build` or from
>   a `guix filesearch...`.  This is important so that `guix filesearch`
>   works on packages built locally.  If we hook into `guix build`, I
>   need help to know where to plug it in.
>
> - Adapt Cuirass so that it builds its file database.  I need pointers
>   to get started here.
>
> - Sync the databases from the substitute server to the client when
>   running `guix filesearch`.  For this I suggest we send the compressed
>   database corresponding to a Guix generation over the network (around
>   10 MiB).  Not sure sending just the delta is worth it.

It would be nice to see whether/how this could be integrated with
third-party channels.  Of course it’s not a priority, but while
designing this feature, we should keep in mind that we might want
third-party channel authors to be able to offer such a database for
their packages.

> - Find a way to garbage-collect the database(s).  My intuition is that
>   we should have one database per Guix checkout, and when we `guix gc`
>   a Guix checkout we collect the corresponding database.

If we download a fresh database every time, we might as well simply
overwrite the one we have?

Thanks,
Ludo’.
Ludovic Courtès <ludo@gnu.org> writes:

> I would lean towards keeping it separate, so that it’s an optional
> feature (given that it relies on downloading an external database).

I was leaning towards downloading the database with "guix pull", so
that the "filesearch" subcommand would not download anything.  In this
case, we can have "guix search --file...", no?

-- 
Pierre Neidhardt
https://ambrevar.xyz/
Ludovic Courtès <ludo@gnu.org> writes:

>> - Include synopses and descriptions.  Maybe we should include all
>>   fields that are searched by `guix search`.  This incurs a cost on the
>>   database size but it would fix the `guix search` speed issue.  Size
>>   increases by some 10 MiB.
>
> Oh, so this is going beyond file search, right?
>
> Perhaps it would make sense to focus on file search only as a first
> step, and to see later, separately, what can be done with
> synopses/descriptions (like Arun and zimoun did before)?

We can do this in a second step, no problem.  I only did the benchmark
to answer Simon's question of whether we could include both kinds of
data in the same database, and which size/speed cost that would incur.

> It would be nice to see whether/how this could be integrated with
> third-party channels.  Of course it’s not a priority, but while
> designing this feature, we should keep in mind that we might want
> third-party channel authors to be able to offer such a database for
> their packages.

Wouldn't it work automatically for any substitute server?

>> - Find a way to garbage-collect the database(s).  My intuition is that
>>   we should have one database per Guix checkout, and when we `guix gc`
>>   a Guix checkout we collect the corresponding database.
>
> If we download a fresh database every time, we might as well simply
> overwrite the one we have?

But then we would miss the database when switching Guix generations.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/
On Mon, 12 Oct 2020 at 12:20, Ludovic Courtès <ludo@gnu.org> wrote:

>> - Textual database: slow and not lighter than SQLite.  Not worth it, I
>>   believe.
>>
>> - SQLite without full-text search: fast, supports classic patterns
>>   (e.g. "foo*bar") but does not support word permutations.
>>
>> - SQLite with full-text search: fast, supports word permutations but
>>   does not support suffix matching (e.g. "bar" won't match "foobar").
>>   Size is about the same as without full-text search.
>>
>> - Include synopses and descriptions.  Maybe we should include all
>>   fields that are searched by `guix search`.  This incurs a cost on the
>>   database size but it would fix the `guix search` speed issue.  Size
>>   increases by some 10 MiB.
>
> Oh, so this is going beyond file search, right?
>
> Perhaps it would make sense to focus on file search only as a first
> step, and to see later, separately, what can be done with
> synopses/descriptions (like Arun and zimoun did before)?

Well, the first patch set that Arun sent for improving “guix search”
was the introduction of an SQLite database replacing the current
’package.cache’.  And I quote your wise advice:

    I would rather keep the current package cache as-is instead of
    inserting sqlite in here.  I don’t expect it to bring much
    performance-wise compared to the current simple cache (especially
    if we look at load time), and it does increase complexity quite a
    bit.

    However, using sqlite for keyword search as you initially proposed
    on guix-devel does sound like a great idea to me.

    Message-ID: <87sgjhx92g.fsf@gnu.org>

Therefore, if Pierre is going to introduce an SQL database where the
addition of the synopses/descriptions is cheap, it seems a good idea to
use it, doesn’t it?  Keeping the ’package.cache’ as-is.  And in
parallel, “we” can try to use this WIP branch to improve the speed of
“guix search” (by “we”, I mean that I plan to work on it).

BTW, it would be really easy to remove these two extra fields if they
turn out not to be conclusive for search, since they are only used in
the function ’add-files’:

--8<---------------cut here---------------start------------->8---
(with-statement db
    (string-append
     "insert into Info (name, synopsis, description, package)"
     " values (:name, :synopsis, :description, :id)")
    stmt
  (sqlite-bind-arguments stmt
                         #:name name
                         #:synopsis synopsis
                         #:description description
                         #:id id)
--8<---------------cut here---------------end--------------->8---

and used only once, by ’persist-package-files’.

> It would be nice to see whether/how this could be integrated with
> third-party channels.  Of course it’s not a priority, but while
> designing this feature, we should keep in mind that we might want
> third-party channel authors to be able to offer such a database for
> their packages.

If the third-party channel also provides substitutes, then the database
would be part of the substitutes, or easy to build from the substitute
metadata.

>> - Find a way to garbage-collect the database(s).  My intuition is that
>>   we should have one database per Guix checkout, and when we `guix gc`
>>   a Guix checkout we collect the corresponding database.
>
> If we download a fresh database every time, we might as well simply
> overwrite the one we have?

But you do not want to download it again if you roll back, for example.
From my point of view, it should use the same mechanism as
’package.cache’.

Cheers,
simon
Hi Pierre,

Pierre Neidhardt <mail@ambrevar.xyz> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:

[...]

>> It would be nice to see whether/how this could be integrated with
>> third-party channels.  Of course it’s not a priority, but while
>> designing this feature, we should keep in mind that we might want
>> third-party channel authors to be able to offer such a database for
>> their packages.
>
> Wouldn't it work automatically for any substitute server?

“Something” needs to build the file-to-package database (which is what
you’re working on), and then there needs to be a way for users to fetch
that database.  This is all orthogonal to substitutes, as I see it,
which is why I think we need to think about integrating it with ‘guix
publish’ on the server side and (guix channels) on the client side.

>>> - Find a way to garbage-collect the database(s).  My intuition is that
>>>   we should have one database per Guix checkout, and when we `guix gc`
>>>   a Guix checkout we collect the corresponding database.
>>
>> If we download a fresh database every time, we might as well simply
>> overwrite the one we have?
>
> But then we would miss the database when switching Guix generations.

Ah, got it; having a database known to correspond to a specific commit
is even better.

However, note that it could take time for the server to provide the
complete database for a commit (the time for as many packages as
possible to be built), so clients may want to refresh it anyway, or
perhaps even to use an older database.

Thanks,
Ludo’.
Pierre Neidhardt <mail@ambrevar.xyz> skribis:
> Ludovic Courtès <ludo@gnu.org> writes:
>
>> I would lean towards keeping it separate, so that it’s an optional
>> feature (given that it relies on downloading an external database).
>
> I was leaning towards downloading the database with "guix pull", so that
> the "filesearch" subcommand would not download anything.
Like substitutes, it should be an optional feature IMO, which means we
need to keep clear boundaries.
There’s also the question of whether we should have a system-wide
database so that each user doesn’t have to download it, especially if
it’s relatively big.
Apologies for throwing more questions than answers into the mix. :-)
Ludo’.
Hi Ludo!

> “Something” needs to build the file-to-package database (which is what
> you’re working on), and then there needs to be a way for users to fetch
> that database.  This is all orthogonal to substitutes, as I see it,
> which is why I think we need to think about integrating it maybe with
> ‘guix publish’ on the server side and (guix channels) on the client
> side.

If the database is filled on =guix build=, then the substitute server would automatically have a filesearch database.

Question: how do I hook onto =guix build=?

The only thing left to do after that is to expose the database in =guix publish= so that the client can fetch it with =guix pull=.

>>> If we download a fresh database every time, we might as well simply
>>> overwrite the one we have?
>>
>> But then we would miss the database when switching Guix generation.
>
> Ah got it; having a database known to correspond to a specific commit is
> even better.
>
> However, note that it could take time for the server to provide the
> complete database for a commit (the time for as many packages as
> possible to be built), so clients may want to refresh it anyway, or even
> perhaps to use an older database.

Indeed, and that's why I think we need to include a timestamp: if the client's database timestamp is older than that of the substitute server's database, then we re-fetch it.  Does that sound good?

--
Pierre Neidhardt
https://ambrevar.xyz/
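The timestamp-based refresh policy proposed here boils down to a small comparison. A minimal Python sketch, where the commit-to-timestamp mappings and the function name are hypothetical, not anything Guix currently provides:

```python
def should_refetch(local_meta, server_meta, commit):
    """Decide whether to re-download the per-commit filesearch database.

    local_meta and server_meta are hypothetical mappings from Guix commit
    to the POSIX timestamp at which the database for that commit was last
    rebuilt.  Re-fetch when we have no local copy, or when the server's
    copy is newer (i.e. more packages have been indexed since)."""
    local = local_meta.get(commit)
    server = server_meta.get(commit)
    if server is None:
        return False  # the server has no database for this commit
    return local is None or local < server
```

The `server is None` branch covers the case Ludo’ raises: a commit for which the server has not yet produced any database at all.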
Ludovic Courtès <ludo@gnu.org> writes:

> Pierre Neidhardt <mail@ambrevar.xyz> skribis:
>
>> Ludovic Courtès <ludo@gnu.org> writes:
>>
>>> I would lean towards keeping it separate, so that it’s an optional
>>> feature (given that it relies on downloading an external database).
>>
>> I was leaning towards downloading the database with "guix pull", so that
>> the "filesearch" subcommand would not download anything.
>
> Like substitutes, it should be an optional feature IMO, which means we
> need to keep clear boundaries.

We can still fetch the database with =guix pull= while making it optional.

> There’s also the question of whether we should have a system-wide
> database so that each user doesn’t have to download it, especially if
> it’s relatively big.

I'm leaning towards storing them in /var/guix/.  Something like this:

/var/guix/filesearch-databases/guix-$COMMIT.db
/var/guix/filesearch-databases/guix-$OTHER_COMMIT.db

> Apologies for throwing more questions than answers into the mix. :-)

No problem, it's a good sign that we are moving forward :)

--
Pierre Neidhardt
https://ambrevar.xyz/
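Combining the /var/guix/ layout above with the system-wide-database question, a client-side lookup might work as in this sketch. The fallback to a per-user cache, and all paths and names, are assumptions for illustration, not actual Guix behaviour:

```python
import os

def resolve_filesearch_db(commit,
                          system_dir="/var/guix/filesearch-databases",
                          user_dir=None):
    """Find the filesearch database for COMMIT.

    Prefer a system-wide copy (so each user need not download it),
    falling back to a hypothetical per-user cache.  Returns None when
    neither location has a database for this commit."""
    if user_dir is None:
        user_dir = os.path.join(
            os.environ.get("XDG_CACHE_HOME",
                           os.path.expanduser("~/.cache")),
            "guix", "filesearch")
    name = "guix-%s.db" % commit
    for directory in (system_dir, user_dir):
        candidate = os.path.join(directory, name)
        if os.path.exists(candidate):
            return candidate
    return None
```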
Pierre Neidhardt <mail@ambrevar.xyz> skribis:

>> “Something” needs to build the file-to-package database (which is what
>> you’re working on), and then there needs to be a way for users to fetch
>> that database.  This is all orthogonal to substitutes, as I see it,
>> which is why I think we need to think about integrating it maybe with
>> ‘guix publish’ on the server side and (guix channels) on the client
>> side.
>
> If the database is filled on =guix build=, then the substitute server
> would automatically have a filesearch database.
>
> Question: How do I hook onto =guix build=?

You would need a build-completion hook in the daemon, which doesn’t exist (yet!).  Note also that at this level we only see derivations, not packages.

>> Ah got it; having a database known to correspond to a specific commit is
>> even better.
>>
>> However, note that it could take time for the server to provide the
>> complete database for a commit (the time for as many packages as
>> possible to be built), so clients may want to refresh it anyway, or even
>> perhaps to use an older database.
>
> Indeed, and that's why I think we need to include a timestamp: if the
> client's database timestamp is older than that of the substitute server
> database, then we re-fetch it.

Yeah, maybe that can work.

Thanks,
Ludo’.
Ludovic Courtès <ludo@gnu.org> writes:

>> Question: How do I hook onto =guix build=?
>
> You would need a build-completion hook in the daemon, which doesn’t
> exist (yet!).  Note also that at this level we only see derivations, not
> packages.

Hmm... Can you explain how =guix size= works with local builds?  Does it compute the size on demand, or is it stored somewhere?

--
Pierre Neidhardt
https://ambrevar.xyz/
Pierre Neidhardt <mail@ambrevar.xyz> skribis:
> Ludovic Courtès <ludo@gnu.org> writes:
>
>>> Question: How do I hook onto =guix build=?
>>
>> You would need a build-completion hook in the daemon, which doesn’t
>> exist (yet!). Note also that at this level we only see derivations, not
>> packages.
>
> Hmm... Can you explain how =guix size= works with local builds? Does
> it compute the size on demand or is it stored somewhere?
It first tries ‘query-path-info’, which succeeds if the store item is
available and contains info about its size, references, and so on.
When ‘query-path-info’ fails, it falls back to
‘query-substitutable-path-info’, which allows it to obtain the same
information for substitutes. Thus, it doesn’t need to have the store
item available locally.
HTH!
Ludo’.
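The fallback described above boils down to a simple try-local-then-substitute pattern. A minimal Python sketch, in which the two callables merely stand in for the ‘query-path-info’ and ‘query-substitutable-path-info’ RPCs:

```python
def query_info(path, query_local, query_substitute):
    """Return metadata for a store PATH.

    QUERY_LOCAL stands in for 'query-path-info' and succeeds (returns
    non-None) only when the store item is available locally;
    QUERY_SUBSTITUTE stands in for 'query-substitutable-path-info' and
    obtains the same information from substitute metadata."""
    info = query_local(path)
    if info is not None:
        return info
    return query_substitute(path)
```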
Ludovic Courtès <ludo@gnu.org> writes:

> It first tries ‘query-path-info’, which succeeds if the store item is
> available and contains info about its size, references, and so on.
>
> When ‘query-path-info’ fails, it falls back to
> ‘query-substitutable-path-info’, which allows it to obtain the same
> information for substitutes.  Thus, it doesn’t need to have the store
> item available locally.

So we could do the same with `guix filesearch`:

- First, try the entry in the database.

- If it's not there, try query-path-info and, if it succeeds, populate the database.

- If query-path-info does not succeed, try our new query-substitutable-filesearch-info.

That would work, no?

--
Pierre Neidhardt
https://ambrevar.xyz/
Pierre Neidhardt <mail@ambrevar.xyz> writes:

> So we could do the same with `guix filesearch`:
>
> - First try the entry in the database.
>
> - If not there, try query-path-info and if it succeeds, populate the
> database.
>
> - If query-path-info does not succeed, try our new
> query-substitutable-filesearch-info.
>
> That would work, no?

Oops, I must have been distracted when writing this; of course this cannot work "just in time", because `guix filesearch' does not take the local package as an argument (unlike `guix size').  The local package must be indexed right after it is built; there is no way around it.

--
Pierre Neidhardt
https://ambrevar.xyz/
Pierre Neidhardt <mail@ambrevar.xyz> skribis:
> Ludovic Courtès <ludo@gnu.org> writes:
>
>> It first tries ‘query-path-info’, which succeeds if the store item is
>> available and contains info about its size, references, and so on.
>>
>> When ‘query-path-info’ fails, it falls back to
>> ‘query-substitutable-path-info’, which allows it to obtain the same
>> information for substitutes. Thus, it doesn’t need to have the store
>> item available locally.
>
> So we could do the same with `guix filesearch`:
>
> - First try the entry in the database.
>
> - If not there, try query-path-info and if it succeeds, populate the
> database.
>
> - If query-path-info does not succeed, try our new
> query-substitutable-filesearch-info.
>
> That would work, no?
Sure, but it’s a big change on the daemon side (you need this new RPC, a
new database, a way to update this database, presumably orthogonal to
substitutes (?), etc.)
A client-side approach (not involving guix-daemon) would be more readily
usable, though some of the questions above remain open.
Thanks,
Ludo’.
Ludovic Courtès <ludo@gnu.org> writes:

> A client-side approach (not involving guix-daemon) would be more readily
> usable, though some of the questions above remain open.

I'd also prefer to stick to the client side.  But how can I trigger an event when a package gets built?

Maybe we could hook into specific commands like `guix build': collect the list of builds that are going to be done and, upon completion, fill the database with the build results.

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/
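The post-build indexing step sketched above, recording a freshly built store item in the Files table from the schema earlier in the thread, could look roughly like this in Python. The function name and the way it gets invoked after `guix build' are hypothetical; only the table layout comes from the proposed schema:

```python
import os
import sqlite3

def index_store_item(db, store_path):
    """Record every regular file of STORE_PATH in the Files table.

    STORE_PATH is a built store item (no trailing slash assumed).
    Subpaths are stored relative to the item and start with '/', as in
    the 'directory-files' procedure, so that searching "/bin" matches
    "/bin/foo" but not "barbin".  Returns the number of files recorded."""
    rows = []
    for root, _dirs, files in os.walk(store_path):
        for f in files:
            full = os.path.join(root, f)
            rows.append((full[len(store_path):], store_path))
    db.executemany(
        "insert or ignore into Files (subpath, package) values (?, ?)", rows)
    db.commit()
    return len(rows)
```

A client command would call this once per build result it observes, which matches the "index right after it's built" conclusion from earlier in the thread.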