unofficial mirror of guix-devel@gnu.org 
* File search progress: database review and question on triggers
@ 2020-08-10 14:32 Pierre Neidhardt
  2020-08-11  9:43 ` Mathieu Othacehe
                   ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-10 14:32 UTC (permalink / raw)
  To: guix-devel

[-- Attachment #1: Type: text/plain, Size: 5575 bytes --]

Hi!

After much delay I finally got down to work on file search support for Guix.
By "file search", I mean the ability to find which package contains files
matching the queried pattern.

If we want to be able to know which package to install, we need file search to
work for packages that have been neither installed nor built on the system.

As we previously discussed, a good approach, mostly for performance reasons,
would be to store all files in a database that's populated on demand from the
substitute servers.

What I've done so far:

1. An SQLite database with the following schema:

--8<---------------cut here---------------start------------->8---
create table if not exists Packages (
    name        text not null,
    output      text default 'out',
    system      text not null,
    path        text primary key not null, -- store path, e.g. /gnu/store/abcd...-foo
    version     text not null,
    guix        text not null  -- The Guix version in which the package can be found.
);

create table if not exists Files (
    subpath     text not null,
    package     text not null,
    primary key (subpath, package), -- Same subpath can occur in multiple packages.
    foreign key (package) references Packages(path) on delete cascade
);
--8<---------------cut here---------------end--------------->8---

   I'm not very good with SQL, so thanks in advance for reviewing this
   carefully; let me know if we can do better.
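To make the intended queries concrete, here is a sketch of the schema in
action, using Python's sqlite3 module purely for illustration (the Guile
bindings would look different); the store path and sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # needed for the cascade to take effect
conn.executescript("""
create table if not exists Packages (
    name    text not null,
    output  text default 'out',
    system  text not null,
    path    text primary key not null,
    version text not null,
    guix    text not null
);
create table if not exists Files (
    subpath text not null,
    package text not null,
    primary key (subpath, package),
    foreign key (package) references Packages(path) on delete cascade
);
""")
conn.execute("insert into Packages values "
             "('samba', 'out', 'x86_64-linux', "
             "'/gnu/store/abcd-samba-4.12.3', '4.12.3', '1.1.0')")
conn.execute("insert into Files values "
             "('/lib/libiov-buf-samba4.so', '/gnu/store/abcd-samba-4.12.3')")

# The file search boils down to a join plus a LIKE pattern on the subpath.
rows = conn.execute("""
    select p.name, p.output, p.version, f.subpath
    from Files f join Packages p on f.package = p.path
    where f.subpath like '%libio%'
""").fetchall()
print(rows)  # → [('samba', 'out', '4.12.3', '/lib/libiov-buf-samba4.so')]
```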

2. A procedure that persists the filepaths of a given package in the database.

3. Size of the database:
   I've persisted all locally-present store items for my current Guix version
   and it produced a database of 72 MiB.  It compresses down to 8 MiB in zstd.

   But since we can have multiple Guix versions, this means that the
   packages have one entry per store path, so we might end up with more
   entries than that as the number of Guix generations grows.

   The worst case is around (number of Guix generations) x ~100 MiB.

   If we compress, it would be at least 10x smaller, maybe much smaller.

   To be sustainable, I suggest that when we remove a Guix generation we
   "garbage-collect" the corresponding database entries.

   Thoughts?

4. Indexing speed:
   The above items took some 20 minutes to complete (on my rather powerful machine).
   A single store path takes a fraction of a second to index (on an SSD).
   The storage device is the bottleneck here.  Not sure we can do better than
   the following procedure:

--8<---------------cut here---------------start------------->8---
(use-modules (ice-9 ftw))  ; for 'ftw'

(define (directory-files path)
  "Return a list of all files within PATH, recursively.
Each file is returned as the path relative to PATH, starting with a '/'.

It's important that the first character be the directory separator because it
gives more expressive power for search.  For instance, searching \"/bin\"
matches both \"/bin/foo\" and \"/usr/bin/foo\" but not \"barbin\"."
  ;; TODO: This does not include empty directories.  Should we?
  ;; REVIEW: Use vlist for performance?  Big packages take a fraction of a
  ;; second on a hot cache, so it's probably not worth it.
  (let ((file-list '()))
    (ftw path
         (lambda (filename statinfo flag)
           (when (eq? flag 'regular)
             (set! file-list (cons (string-drop filename (string-length path))
                                   file-list)))
           #t))  ; return #t so the traversal continues
    file-list))
--8<---------------cut here---------------end--------------->8---

   Most of the indexing will be done by the substitute servers, however, so this
   is of little concern to the end user.

   Question: Should we include empty directories in the database?  I'm tempted
   to answer no.

5. Search speed: It completes in a fraction of a second and supports
   SQLite patterns.  Example:

--8<---------------cut here---------------start------------->8---
> (format-search (search-file-package "%libio%"))
samba:out@4.12.3	/lib/libiov-buf-samba4.so
guix:out@1.1.0-18.218a67d	/share/guile/site/3.0/gnu/packages/patches/m4-gnulib-libio.patch
--8<---------------cut here---------------end--------------->8---

   Question: This binds us to the SQLite syntax for pattern matching.  Is that a
   problem?
   It seems powerful enough in practice.  But maybe we can use regular
   expressions in SQLite as well?


Next points I'd like to address:

6. Automatically persist the database entry when building a package.
   Any idea where I should plug that in?

7. Have substitute servers distribute database content.  When the user performs
   a file search, Guix asks the substitute server for a database update.  Only
   the diff should be sent over the network, not the whole thing since it might
   be very large.

   Question 1: If the substitute server does not have data corresponding to the
   user's Guix version, shall we send the data of the version closest to the
   user's?
   Locally, if there are few entries for the current Guix version but many
   for an older version, shall we perform the search on the older version as
   well?  On multiple older versions?

   Question 2: How do we send "SQLite diffs"?  I know xdelta for binary diffs,
   but I don't think that's what we want here.  Alternatively, could we export
   the right SELECT result as text and send it compressed over the network?

I'll send a patch soon, in the meantime let me know about your thoughts!
:)

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
@ 2020-08-11  9:43 ` Mathieu Othacehe
  2020-08-11 12:35   ` Pierre Neidhardt
  2020-08-11 15:43 ` Ricardo Wurmus
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: Mathieu Othacehe @ 2020-08-11  9:43 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel


Hello Pierre,

Thanks for sharing your progress. A few remarks below.

> 3. Size of the database:
>    I've persisted all locally-present store items for my current Guix version
>    and it produced a database of 72 MiB.  It compresses down to 8 MiB in zstd.
>
>    But since we can have multiple Guix versions, this means that the
>    packages have one entry per store path, so we might end up with more
>    entries than that as the number of Guix generations grows.

I'm not sure we actually need to save the full history. I think we could
just store that the package X produces [Y1, Y2, ...]  executable files. Then
on package X update, the executable files list could be refreshed.

>    Question: Should we include empty directories in the database?  I'm tempted
>    to answer no.

I would also say no, and also exclude non-executable files.

>    Question: This bounds us to the SQLite syntax for pattern matching.  Is it a
>    problem?
>    It seems powerful enough in practice.  But maybe we can use regular
>    expression in SQLite as well?

From the UI perspective, we already have "guix search" that expects a
regex. If we were to include a "guix file-search" command, then I think
it would make sense that it uses the same regex syntax.

> Next points I'd like to address:
>
> 6. Automatically persist the database entry when building a package.
>    Any idea where I should plug that in?
>
> 7. Have substitute servers distribute database content.  When the user performs
>    a file search, Guix asks the substitute server for a database update.  Only
>    the diff should be sent over the network, not the whole thing since it might

If I understand correctly, you are proposing to create local databases
that would be kept in sync with a master database populated by the CI
server. This seems a bit complex.

What about extending Cuirass database to add the two tables you are
proposing. Then, each time a package is built, if the version is
updated, the "Files" table would be updated.

Then we could add an HTTP interface such as "/search/file?query=libxxx"
to Cuirass, that would directly query the database. In Guix itself, we
could add the counterpart in the (guix ci) module.

WDYT?

Thanks,

Mathieu



* Re: File search progress: database review and question on triggers
  2020-08-11  9:43 ` Mathieu Othacehe
@ 2020-08-11 12:35   ` Pierre Neidhardt
  2020-08-15 12:48     ` Hartmut Goebel
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-11 12:35 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 3454 bytes --]

Hi Mathieu,

Thanks for your comments!
Answers below.

>> 3. Size of the database:
>>    I've persisted all locally-present store items for my current Guix version
>>    and it produced a database of 72 MiB.  It compresses down to 8 MiB in zstd.
>>
>>    But since we can have multiple Guix versions, this means that the
>>    packages have one entry per store path, so we might end up with more
>>    entries than that as the number of Guix generations grows.
>
> I'm not sure we actually need to save the full history. I think we could
> just store that the package X produces [Y1, Y2, ...]  executable files. Then
> on package X update, the executable files list could be refreshed.

Maybe you are missing some context.  The original discussion is there:
https://lists.gnu.org/archive/html/guix-devel/2020-01/msg00019.html.

Unlike Nix, we would like to do more than just index executable files.
Indeed, it's very useful to know where to find, say, a C header, a .so
library, a TeXlive .sty file, etc.

However, as you hinted, maybe it's unnecessary to save the file listings of
the packages for every Guix version.  Maybe we could store only the
"diffs" between Guix generations.  I don't know whether SQLite supports
this; if not, it sounds like a rather complex thing to do.

But really, if the compressed database over multiple Guix generations is
<100 MiB, then size is not a big problem.

>>    Question: Should we include empty directories in the database?  I'm tempted
>>    to answer no.
>
> I would also say no, and also exclude non-executable files.

See above; I think we would lose a lot by not including non-executable files.

>>    Question: This bounds us to the SQLite syntax for pattern matching.  Is it a
>>    problem?
>>    It seems powerful enough in practice.  But maybe we can use regular
>>    expression in SQLite as well?
>
> From the UI perspective, we already have "guix search" that expects a
> regex. If we were to include a "guix file-search" command, then I think
> it would make sense that it uses the same regex syntax.

I found out that SQLite has a REGEXP operator, I'll see if it works well enough.

>> 7. Have substitute servers distribute database content.  When the user performs
>>    a file search, Guix asks the substitute server for a database update.  Only
>>    the diff should be sent over the network, not the whole thing since it might
>
> If I understand correctly, you are proposing to create local databases
> that would be kept in sync with a master database populated by the CI
> server. This seems a bit complex.
>
> What about extending Cuirass database to add the two tables you are
> proposing. Then, each time a package is built, if the version is
> updated, the "Files" table would be updated.
>
> Then we could add an HTTP interface such as "/search/file?query=libxxx"
> to Cuirass, that would directly query the database. In Guix itself, we
> could add the counterpart in the (guix ci) module.

The problem with this approach is that it would not work offline.  In my
opinion, this is a big limitation.  I'd rather have a local database.

Besides, we need a local database anyway for non-official, locally-built
packages (which Cuirass would not know about).  Since this is a
requirement, the only missing piece is database synchronization.

Thoughts?

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
  2020-08-11  9:43 ` Mathieu Othacehe
@ 2020-08-11 15:43 ` Ricardo Wurmus
  2020-08-11 17:54   ` Pierre Neidhardt
  2020-08-18 14:58 ` File search progress: database review and question on triggers OFF TOPIC PRAISE Joshua Branson
  2020-08-27 10:00 ` File search progress: database review and question on triggers zimoun
  3 siblings, 1 reply; 45+ messages in thread
From: Ricardo Wurmus @ 2020-08-11 15:43 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel


Pierre Neidhardt <mail@ambrevar.xyz> writes:

> 3. Size of the database:
>    I've persisted all locally-present store items for my current Guix version
>    and it produced a database of 72 MiB.  It compresses down to 8 MiB
>    in zstd.

For comparison, my laptop’s store contains 1,103,543 files (691,994
excluding .links).  The updatedb database for all of them is 86 MB and
takes ~6 seconds to generate:

    time updatedb \
         --localpaths=/gnu/store \
         --findoptions='( -path /gnu/store/.links -o -name *.drv -o -name *.chroot ) -prune -o -type f -print' \
         --output=/tmp/dbfile

    locate -d /tmp/dbfile ecxc0800

(This could be further tweaked to exclude links…)

>    The worse case is around (number of guix generations) x ~100 MiB.

This seems a little excessive.

> 4. Indexing speed:
>    The above items took some 20 minutes to complete (on my rather
>    powerful machine).

Oof.  The updatedb hack above takes 6 seconds on my i7-6500U CPU @
2.50GHz with SSD.

I’m not suggesting to use updatedb, but I think it can be instructive to
look at how the file database is implemented there.  We don’t have to
use SQLite if it is much slower and heavier than a custom inverted
index.

-- 
Ricardo



* Re: File search progress: database review and question on triggers
  2020-08-11 15:43 ` Ricardo Wurmus
@ 2020-08-11 17:54   ` Pierre Neidhardt
  2020-08-11 17:58     ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-11 17:54 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 573 bytes --]

Ricardo Wurmus <rekado@elephly.net> writes:

> Oof.  The updatedb hack above takes 6 seconds on my i7-6500U CPU @
> 2.50GHz with SSD.
>
> I’m not suggesting to use updatedb, but I think it can be instructive to
> look at how the file database is implemented there.  We don’t have to
> use SQlite if it is much slower and heavier than a custom inverted
> index.

Good call, I'll benchmark against an inverted index.

Some cost may also be induced by the Guix store queries, not sure if we
can optimize these.

--
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-11 17:54   ` Pierre Neidhardt
@ 2020-08-11 17:58     ` Pierre Neidhardt
  2020-08-11 20:08       ` Ricardo Wurmus
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-11 17:58 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 942 bytes --]

Pierre Neidhardt <mail@ambrevar.xyz> writes:

> Ricardo Wurmus <rekado@elephly.net> writes:
>
>> I’m not suggesting to use updatedb, but I think it can be instructive to
>> look at how the file database is implemented there.  We don’t have to
>> use SQlite if it is much slower and heavier than a custom inverted
>> index.
>
> Good call, I'll benchmark against an inverted index.
>
> Some cost may also be induced by the Guix store queries, not sure if we
> can optimize these.

With an s-exp based file, or a trivial text-based format, the downside
is that it takes a bit of extra work to load only selected entries,
e.g. just the entries matching a specific Guix version.

Would you happen to know a serialization library that allows for loading
only a select portion of a file?

Otherwise a trivial workaround would be to persist one index file per
Guix generation.

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-11 17:58     ` Pierre Neidhardt
@ 2020-08-11 20:08       ` Ricardo Wurmus
  2020-08-12 19:10         ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Ricardo Wurmus @ 2020-08-11 20:08 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel


Pierre Neidhardt <mail@ambrevar.xyz> writes:

> Pierre Neidhardt <mail@ambrevar.xyz> writes:
>
>> Ricardo Wurmus <rekado@elephly.net> writes:
>>
>>> I’m not suggesting to use updatedb, but I think it can be instructive to
>>> look at how the file database is implemented there.  We don’t have to
>>> use SQlite if it is much slower and heavier than a custom inverted
>>> index.
>>
>> Good call, I'll benchmark against an inverted index.
>>
>> Some cost may also be induced by the Guix store queries, not sure if we
>> can optimize these.
>
> With an s-exp based file, or a trivial text-based format, the downside
> is that it needs a bit of extra work to only load select entries,
> e.g. just the entries matching a specific Guix version.
>
> Would you happen to know a serialization library that allows for loading
> only a select portion of a file?

I don’t know of any suitable file format, but a generated offset index
at the beginning of the file could be useful.  You’d read the first
expression and then seek to the specified byte offset (after the
position of the index expression) where you then read the target
expression.  This can easily be generated and it can be extended without
having to rewrite the whole file.
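A minimal sketch of that idea (in Python for illustration; the format and
helper names are hypothetical): a one-line index at the start of the file maps
each Guix generation to a byte offset and length, so a reader seeks straight
to one generation's file list without parsing the rest:

```python
import io
import json

def write_indexed(records):
    """records: dict mapping generation -> list of file names.
    Returns the serialized blob: an index line, then the concatenated bodies."""
    bodies = {gen: ("\n".join(files) + "\n").encode()
              for gen, files in records.items()}
    index, offset = {}, 0
    for gen, body in bodies.items():
        index[gen] = [offset, len(body)]  # offset relative to the header end
        offset += len(body)
    header = (json.dumps(index) + "\n").encode()
    return header + b"".join(bodies.values())

def read_generation(blob, gen):
    buf = io.BytesIO(blob)
    index = json.loads(buf.readline())  # read only the index expression
    start, size = index[gen]
    base = buf.tell()                   # position just after the header
    buf.seek(base + start)              # seek directly to the target entry
    return buf.read(size).decode().splitlines()

files = {"g1": ["/bin/a"], "g2": ["/bin/b", "/lib/c"]}
blob = write_indexed(files)
print(read_generation(blob, "g2"))  # → ['/bin/b', '/lib/c']
```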

But perhaps that’s premature optimization.

-- 
Ricardo



* Re: File search progress: database review and question on triggers
  2020-08-11 20:08       ` Ricardo Wurmus
@ 2020-08-12 19:10         ` Pierre Neidhardt
  2020-08-12 20:13           ` Julien Lepiller
                             ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-12 19:10 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1764 bytes --]

I've done some benchmarking.

1. I tried to fine-tune the SQL a bit:
  - Open/close the database only once for the whole indexing.
  - Use "insert" instead of "insert or replace".
  - Use numeric ID as key instead of path.

  Result: Still around 15-20 minutes to build.  Switching to numeric
  indices shrank the database by half.

2. I've tried with the following naive 1-file-per-line format:

--8<---------------cut here---------------start------------->8---
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man5/acl.5.gz"
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_add_perm.3.gz"
"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_calc_mask.3.gz"
...
--8<---------------cut here---------------end--------------->8---

  Result: Takes between 20 and 2 minutes to complete and the result is
  32 MiB big.  (I don't know why the timing varies.)

  A string-contains filter takes less than 1 second.

  A string-match (regex) search takes some 3 seconds (Ryzen 5 @ 3.5
  GHz).  I'm not sure if we can go faster.  I need to measure the time
  SQL takes for a regexp match.

Question: Any idea how to load the database as fast as possible?  I
tried the following, it takes 1.5s on my machine:

--8<---------------cut here---------------start------------->8---
(use-modules (ice-9 textual-ports))  ; for 'get-line'

(define (load-textual-database)
  (call-with-input-file %textual-db
    (lambda (port)
      (let loop ((line (get-line port))
                 (result '()))
        (if (string? line)
            (loop (get-line port) (cons line result))
            result)))))
--8<---------------cut here---------------end--------------->8---

Cheers!

--
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-12 19:10         ` Pierre Neidhardt
@ 2020-08-12 20:13           ` Julien Lepiller
  2020-08-12 20:43             ` Pierre Neidhardt
  2020-08-12 20:32           ` Pierre Neidhardt
  2020-08-13  0:17           ` Arun Isaac
  2 siblings, 1 reply; 45+ messages in thread
From: Julien Lepiller @ 2020-08-12 20:13 UTC (permalink / raw)
  To: guix-devel, Pierre Neidhardt, Ricardo Wurmus

[-- Attachment #1: Type: text/plain, Size: 2612 bytes --]

Have you tried something more structured? I have some code for creating a binary search tree and even compressing/decompressing strings with Huffman coding, as well as code to serialize all that (my deserialization is in Java though, so not very useful to you): https://framagit.org/nani-project/nani-website

See modules/nani/trie.scm for instance.

The idea is to have a binary search tree whose keys are filenames and whose values are pointers into the file to a structure that holds the data for this filename (packages and versions of Guix, for instance). It's super fast to look up, because you don't load the whole file; you only seek to the right position. It's a lot longer to build, and not so easy to update though.
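For illustration, a toy in-memory trie of the kind described, keyed on file
names (Python rather than the Scheme in modules/nani/trie.scm, and the store
path below is made up):

```python
# A lookup walks the trie one character at a time, so a miss is detected
# as soon as a prefix diverges -- no need to scan every path.
class Trie:
    def __init__(self):
        self.children = {}
        self.value = None  # payload stored at the end of a key

    def insert(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value = value

    def lookup(self, key):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value

index = Trie()
index.insert("libcord.so", ["/gnu/store/...-libgc-8.0.4/lib/libcord.so"])
print(index.lookup("libcord.so"))
```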

On 2020年8月12日 15:10:08 GMT-04:00, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>I've done some benchmarking.
>
>1. I tried to fine-tune the SQL a bit:
>  - Open/close the database only once for the whole indexing.
>  - Use "insert" instead of "insert or replace".
>  - Use numeric ID as key instead of path.
>
>  Result: Still around 15-20 minutes to build.  Switching to numeric
>  indices shrank the database by half.
>
>2. I've tried with the following naive 1-file-per-line format:
>
>--8<---------------cut here---------------start------------->8---
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man5/acl.5.gz"
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_add_perm.3.gz"
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_calc_mask.3.gz"
>...
>--8<---------------cut here---------------end--------------->8---
>
>  Result: Takes between 20 and 2 minutes to complete and the result is
>  32 MiB big.  (I don't know why the timing varies.)
>
>  A string-contains filter takes less than 1 second.
>
>  A string-match (regex) search takes some 3 seconds (Ryzen 5 @ 3.5
>  GHz).  I'm not sure if we can go faster.  I need to measure the time
>  SQL takes for a regexp match.
>
>Question: Any idea how to load the database as fast as possible?  I
>tried the following, it takes 1.5s on my machine:
>
>--8<---------------cut here---------------start------------->8---
>(define (load-textual-database)
>  (call-with-input-file %textual-db
>    (lambda (port)
>      (let loop ((line (get-line port))
>                 (result '()))
>        (if (string? line)
>            (loop (get-line port) (cons line result))
>            result)))))
>--8<---------------cut here---------------end--------------->8---
>
>Cheers!
>
>--
>Pierre Neidhardt
>https://ambrevar.xyz/

[-- Attachment #2: Type: text/html, Size: 3040 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-12 19:10         ` Pierre Neidhardt
  2020-08-12 20:13           ` Julien Lepiller
@ 2020-08-12 20:32           ` Pierre Neidhardt
  2020-08-13  0:17           ` Arun Isaac
  2 siblings, 0 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-12 20:32 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 319 bytes --]

Pierre Neidhardt <mail@ambrevar.xyz> writes:

>   Result: Takes between 20 and 2 minutes to complete and the result is
>   32 MiB big.  (I don't know why the timing varies.)

Typo: 20 _seconds_ to 2 minutes!  So it's faster than SQL by 1 or 2 orders of magnitude.

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-12 20:13           ` Julien Lepiller
@ 2020-08-12 20:43             ` Pierre Neidhardt
  2020-08-12 21:29               ` Julien Lepiller
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-12 20:43 UTC (permalink / raw)
  To: Julien Lepiller, guix-devel, Ricardo Wurmus

[-- Attachment #1: Type: text/plain, Size: 1417 bytes --]

Julien Lepiller <julien@lepiller.eu> writes:

> Have you tried something more structured? I have some code for creating a binary search tree and even compressing/decompressing strings with huffman, as well as code to serialize all that (my deserialization is in Java though, so not very useful to you): https://framagit.org/nani-project/nani-website
>
> See modules/nani/trie.scm for instance.
>
> The idea is to have a binary search tree whose keys are filenames and value is a pointer in the file to a structure that holds data for this filerame (packages and versions of guix for instance). It's super fast to look it up, because you don't load the whole file, you only seek to the right position. It's a lot longer to build, and not so easy to update though.

Assuming we'd have only 1 Guix generation per file, I'm not sure a trie
would help because we _always_ need to search _all_ file paths, so in
all cases we've got some 10,000+ entries to load in memory and loop over
them.

The total number of entries is the bottleneck here, both for the
database load and the individual search queries.

An obvious optimization is to memoize the database load.  This has a
significant memory cost though.

The trie could be helpful for multiple Guix generations in the same
database, but I'm not sure it warrants the increased complexity.

Thoughts?

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-12 20:43             ` Pierre Neidhardt
@ 2020-08-12 21:29               ` Julien Lepiller
  2020-08-12 22:29                 ` Ricardo Wurmus
  2020-08-13  6:52                 ` Pierre Neidhardt
  0 siblings, 2 replies; 45+ messages in thread
From: Julien Lepiller @ 2020-08-12 21:29 UTC (permalink / raw)
  To: Pierre Neidhardt, guix-devel, Ricardo Wurmus

[-- Attachment #1: Type: text/plain, Size: 1880 bytes --]

Why wouldn't it help? Can't you make it a trie from basename -> complete name? If I'm looking for "libcord.so" (which is a key in the trie), I don't think I need to look at every path. I only need to follow the trie until I find a pointer to some structure that contains the data I'm looking for (e.g. a list of complete filenames).

On 2020年8月12日 16:43:37 GMT-04:00, Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>Julien Lepiller <julien@lepiller.eu> writes:
>
>> Have you tried something more structured? I have some code for
>creating a binary search tree and even compressing/decompressing
>strings with huffman, as well as code to serialize all that (my
>deserialization is in Java though, so not very useful to you):
>https://framagit.org/nani-project/nani-website
>>
>> See modules/nani/trie.scm for instance.
>>
>> The idea is to have a binary search tree whose keys are filenames and
>value is a pointer in the file to a structure that holds data for this
>filerame (packages and versions of guix for instance). It's super fast
>to look it up, because you don't load the whole file, you only seek to
>the right position. It's a lot longer to build, and not so easy to
>update though.
>
>Assuming we'd have only 1 Guix generation per file, I'm not sure a trie
>would help because we _always_ need to search _all_ file paths, so in
>all cases we've got some 10,000+ entries to load in memory and loop
>over
>them.
>
>The total number of entries is the bottleneck here, both for the
>database load and the individual search queries.
>
>An obvious optimization is to memoize the database load.  This has a
>significant memory cost though.
>
>The trie could be helpful for multiple Guix generations in the same
>database, but I'm not sure it warrant the increased complexity.
>
>Thoughts?
>
>-- 
>Pierre Neidhardt
>https://ambrevar.xyz/

[-- Attachment #2: Type: text/html, Size: 2275 bytes --]


* Re: File search progress: database review and question on triggers
  2020-08-12 21:29               ` Julien Lepiller
@ 2020-08-12 22:29                 ` Ricardo Wurmus
  2020-08-13  6:55                   ` Pierre Neidhardt
  2020-08-13  6:52                 ` Pierre Neidhardt
  1 sibling, 1 reply; 45+ messages in thread
From: Ricardo Wurmus @ 2020-08-12 22:29 UTC (permalink / raw)
  To: Julien Lepiller; +Cc: guix-devel


Julien Lepiller <julien@lepiller.eu> writes:

> Why wouldn't it help? Can't you make it a trie from basename -> complete name? If I'm looking for "libcord.so" (which is a key in the trie), I don't think I need to look for every path. I only need to follow the trie until I find a pointer to some structure that contains the data I look for (ex: a list of complete filenames).

Exactly!  It’s somewhat less useful (without pre-processing) for base
names and more useful for absolute file names, where you can quickly
toss out mismatches while traversing the trie.

You could also optimize the data structure by swapping the base name and
the directory name, assuming that users would want to search for either
absolute file names or base names (and not for a directory in between).

-- 
Ricardo



* Re: File search progress: database review and question on triggers
  2020-08-12 19:10         ` Pierre Neidhardt
  2020-08-12 20:13           ` Julien Lepiller
  2020-08-12 20:32           ` Pierre Neidhardt
@ 2020-08-13  0:17           ` Arun Isaac
  2020-08-13  6:58             ` Pierre Neidhardt
  2020-08-13  9:40             ` Pierre Neidhardt
  2 siblings, 2 replies; 45+ messages in thread
From: Arun Isaac @ 2020-08-13  0:17 UTC (permalink / raw)
  To: Pierre Neidhardt, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1756 bytes --]


Hi,

> 1. I tried to fine-tune the SQL a bit:
>   - Open/close the database only once for the whole indexing.
>   - Use "insert" instead of "insert or replace".
>   - Use numeric ID as key instead of path.
>
>   Result: Still around 15-20 minutes to build.  Switching to numeric
>   indices shrank the database by half.

sqlite insert statements can be very fast. sqlite.org claims 50000 or
more insert statements per second. But in order to achieve that speed
all insert statements have to be grouped together in a single
transaction. See https://www.sqlite.org/faq.html#q19
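
For illustration, here is what that batching looks like; this sketch
uses Python's sqlite3 module rather than Guile's bindings (table and
sample rows invented), but the BEGIN/COMMIT mechanism is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Files (subpath TEXT, package INTEGER)")

rows = [("/bin/git", 1), ("/bin/gitk", 1), ("/bin/perl", 2)]

# One explicit transaction around all the inserts: without it, every
# INSERT becomes its own transaction and forces a sync to disk.
with conn:  # BEGIN ... COMMIT
    conn.executemany("INSERT INTO Files VALUES (?, ?)", rows)

print(conn.execute("SELECT count(*) FROM Files").fetchone()[0])  # 3
```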

>   A string-contains filter takes less than 1 second.

Guile's string-contains function uses a naive O(nk) implementation,
where 'n' is the length of string s1 and 'k' is the length of string
s2. If it was implemented using the Knuth-Morris-Pratt algorithm, it
could cost only O(n+k). So, there is some scope for improvement here. In
fact, a comment on line 2007 of libguile/srfi-13.c in the guile source
tree makes this very point.

> I need to measure the time SQL takes for a regexp match.

sqlite, by default, does not come with regexp support. You might have to
load some external library. See
https://www.sqlite.org/lang_expr.html#the_like_glob_regexp_and_match_operators

--8<---------------cut here---------------start------------->8---
The REGEXP operator is a special syntax for the regexp() user
function. No regexp() user function is defined by default and so use of
the REGEXP operator will normally result in an error message. If an
application-defined SQL function named "regexp" is added at run-time,
then the "X REGEXP Y" operator will be implemented as a call to
"regexp(Y,X)".
--8<---------------cut here---------------end--------------->8---
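
For what it's worth, the quoted mechanism is easy to try from a binding
that does expose create_function; a sketch using Python's sqlite3 module
(sample table and data invented):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# Register an application-defined regexp(pattern, string) function;
# SQLite then evaluates "X REGEXP Y" as regexp(Y, X).
conn.create_function("regexp", 2,
                     lambda pat, s: re.search(pat, s) is not None)

conn.execute("CREATE TABLE Files (subpath TEXT)")
conn.executemany("INSERT INTO Files VALUES (?)",
                 [("/include/foo.h",), ("/bin/foo",)])

rows = conn.execute(
    "SELECT subpath FROM Files WHERE subpath REGEXP ?",
    ["/include/.*foo.*"]).fetchall()
print(rows)  # [('/include/foo.h',)]
```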

Regards,
Arun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-12 21:29               ` Julien Lepiller
  2020-08-12 22:29                 ` Ricardo Wurmus
@ 2020-08-13  6:52                 ` Pierre Neidhardt
  2020-08-13  9:34                   ` Ricardo Wurmus
  1 sibling, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13  6:52 UTC (permalink / raw)
  To: Julien Lepiller, guix-devel, Ricardo Wurmus

[-- Attachment #1: Type: text/plain, Size: 1058 bytes --]

Julien Lepiller <julien@lepiller.eu> writes:

> Why wouldn't it help? Can't you make it a trie from basename ->
> complete name? If I'm looking for "libcord.so" (which is a key in the
> trie), I don't think I need to look for every path. I only need to
> follow the trie until I find a pointer to some structure that contains
> the data I look for (ex: a list of complete filenames).

Fair enough, but it's a more limited scope: here we assume the user
knows the exact basename.

It's a bit too limited in my opinion:

- It's all too common for shared objects to end with a .X.Y extension
  (where X.Y is a version); the version-less file is not always
  present, which means a lot of trial and error on the user's end just
  to find the right file.

- It does not cover the case where I don't know the basename, e.g. if I'm
  looking for a FOO header file my query would look like "/include/.*foo.*".

I believe it's important that the search be as general as possible.

Thoughts?

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-12 22:29                 ` Ricardo Wurmus
@ 2020-08-13  6:55                   ` Pierre Neidhardt
  0 siblings, 0 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13  6:55 UTC (permalink / raw)
  To: Ricardo Wurmus, Julien Lepiller; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1185 bytes --]

Hi Ricardo,

Ricardo Wurmus <rekado@elephly.net> writes:

>> Why wouldn't it help? Can't you make it a trie from basename -> complete name? If I'm looking for "libcord.so" (which is a key in the trie), I don't think I need to look for every path. I only need to follow the trie until I find a pointer to some structure that contains the data I look for (ex: a list of complete filenames).
>
> Exactly!  It’s somewhat less useful (without pre-processing) for base
> names and more useful for absolute file names, where you can quickly
> toss out mismatches while traversing the trie.
>
> You could also optimize the data structure by swapping the base name and
> the directory name, assuming that users would want to search for either
> absolute file names and also for base names (and not for a directory in
> between).

By absolute file name, you mean relative to the Guix store item?  As I
mentioned in my reply to Julien, this would be a specific use-case
optimization; as you said, it won't work when you don't know the exact
basename or full path, so I'm not sure it's worth going down that road.

Thoughts?

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  0:17           ` Arun Isaac
@ 2020-08-13  6:58             ` Pierre Neidhardt
  2020-08-13  9:40             ` Pierre Neidhardt
  1 sibling, 0 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13  6:58 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1950 bytes --]

Hi Arun,

Arun Isaac <arunisaac@systemreboot.net> writes:

> sqlite insert statements can be very fast. sqlite.org claims 50000 or
> more insert statements per second. But in order to achieve that speed
> all insert statements have to be grouped together in a single
> transaction. See https://www.sqlite.org/faq.html#q19

Very good point, thanks a lot!  I'll try again!

>>   A string-contains filter takes less than 1 second.
>
> Guile's string-contains function uses a naive O(nk) implementation,
> where 'n' is the length of string s1 and 'k' is the length of string
> s2. If it was implemented using the Knuth-Morris-Pratt algorithm, it
> could cost only O(n+k). So, there is some scope for improvement here. In
> fact, a comment on line 2007 of libguile/srfi-13.c in the guile source
> tree makes this very point.

Thanks for this low-level insight, this is very useful!
Then if we can get Knuth-Morris-Pratt in the next Guile version, a
textual database could be an ideal option in terms of performance.

>> I need to measure the time SQL takes for a regexp match.
>
> sqlite, by default, does not come with regexp support. You might have to
> load some external library. See
> https://www.sqlite.org/lang_expr.html#the_like_glob_regexp_and_match_operators
>
> --8<---------------cut here---------------start------------->8---
> The REGEXP operator is a special syntax for the regexp() user
> function. No regexp() user function is defined by default and so use of
> the REGEXP operator will normally result in an error message. If an
> application-defined SQL function named "regexp" is added at run-time,
> then the "X REGEXP Y" operator will be implemented as a call to
> "regexp(Y,X)".
> --8<---------------cut here---------------end--------------->8---

I saw that.  Now I need to see how regexp search performs compared to
pattern search.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  6:52                 ` Pierre Neidhardt
@ 2020-08-13  9:34                   ` Ricardo Wurmus
  2020-08-13 10:04                     ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Ricardo Wurmus @ 2020-08-13  9:34 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel


Pierre Neidhardt <mail@ambrevar.xyz> writes:

> Julien Lepiller <julien@lepiller.eu> writes:
>
>> Why wouldn't it help? Can't you make it a trie from basename ->
>> complete name? If I'm looking for "libcord.so" (which is a key in the
>> trie), I don't think I need to look for every path. I only need to
>> follow the trie until I find a pointer to some structure that contains
>> the data I look for (ex: a list of complete filenames).
>
> Fair enough, but it's a more limited scope: here we assume the user
> knows the exact basename.
>
> It's a bit too limited in my opinion:
>
> - It's only too common to have shared objects ending with a .X.Y extension
>   (where X.Y is a version), the version-less file is not always present
>   which means a lot of trial and error on the user end just to search
>   the right file.

We can always abort the radix trie traversal early and print all leaves
that can be reached from the node that we arrived at.  This allows us to
search for “libc” and find “libcord.so.1.2”, “libc.so”, “libcaca.a”,
etc.

> - It does not cover the case where I don't know the basename, e.g. if I'm
>   looking for a FOO header file my query would look like "/include/.*foo.*".

I think this is a rather rare use case, which in my opinion doesn’t
justify forgoing the use of a smart data structure.  It would *still* be
possible to use the prefix tree for this kind of search, but it would
certainly not be optimal. (E.g. by searching up to the first wildcard,
and then searching each resulting sub-tree up to the first non-wildcard,
and resuming the search in all remaining sub-trees.)

> I believe it's important that the search be as general as possible.

Search should be cheap and fast.  Depending on how we arrange the data,
users can easily filter the results with grep, if necessary.  For
example, we could have a radix tree of the base names and have the
directory prefix as the leaf nodes.  The search would then involve two
steps: culling the set of directory prefixes by following the base name
trie (that’s fast), then do a simple linear search over the results
(that’s slow).

The trie can easily be serialized as a bunch of offsets that we only
need to repeatedly seek to, so we never need to hold it in memory.
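
A minimal sketch of that two-step scheme, using plain nested dicts
instead of a compressed radix tree (sample store paths invented, hashes
elided):

```python
# Step 1: a trie over base names; each terminal node carries the list
# of directory prefixes in which that base name occurs.
def trie_insert(trie, basename, directory):
    node = trie
    for ch in basename:
        node = node.setdefault(ch, {})
    node.setdefault("$dirs", []).append(directory)

def trie_search(trie, prefix):
    """All (basename, dirs) pairs whose base name starts with prefix."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    # Abort the traversal early: emit every leaf reachable from here.
    results, stack = [], [(node, prefix)]
    while stack:
        node, name = stack.pop()
        if "$dirs" in node:
            results.append((name, node["$dirs"]))
        for ch, child in node.items():
            if ch != "$dirs":
                stack.append((child, name + ch))
    return results

trie = {}
trie_insert(trie, "libc.so", "/gnu/store/...-glibc/lib")
trie_insert(trie, "libcord.so.1.2", "/gnu/store/...-gc/lib")
print(sorted(name for name, _ in trie_search(trie, "libc")))
# ['libc.so', 'libcord.so.1.2']
```

Step 2 is then an ordinary linear filter over the directory lists
attached to the matching leaves.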

-- 
Ricardo


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  0:17           ` Arun Isaac
  2020-08-13  6:58             ` Pierre Neidhardt
@ 2020-08-13  9:40             ` Pierre Neidhardt
  2020-08-13 10:08               ` Pierre Neidhardt
                                 ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13  9:40 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1534 bytes --]

Arun Isaac <arunisaac@systemreboot.net> writes:

> sqlite insert statements can be very fast. sqlite.org claims 50000
> or...

I tried it, and got it down to... 30s!  Turns out that Guix already
provides `call-with-transaction' which I've used here.

So SQLite is back on the table.

>> I need to measure the time SQL takes for a regexp match.
>
> sqlite, by default, does not come with regexp support. You might have to
> load some external library. See
> https://www.sqlite.org/lang_expr.html#the_like_glob_regexp_and_match_operators
>
> --8<---------------cut here---------------start------------->8---
> The REGEXP operator is a special syntax for the regexp() user
> function. No regexp() user function is defined by default and so use of
> the REGEXP operator will normally result in an error message. If an
> application-defined SQL function named "regexp" is added at run-time,
> then the "X REGEXP Y" operator will be implemented as a call to
> "regexp(Y,X)".
> --8<---------------cut here---------------end--------------->8---

After more reading, I see the following options:

- Either we use the sqlite-pcre extension (which is not packaged in
  Guix).

- Or we extend guile-sqlite to add a binding to create_function
  (https://www.sqlite.org/appfunc.html).

- Or do you think SQLite patterns (using "%") would do for now?  As
  Mathieu pointed out, it's an unfortunate inconsistency with the rest of
  Guix.  But maybe regexp support can be added in a second stage.

Thoughts?

--
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  9:34                   ` Ricardo Wurmus
@ 2020-08-13 10:04                     ` Pierre Neidhardt
  2020-08-15 12:47                       ` Hartmut Goebel
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13 10:04 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 2149 bytes --]

Hi Ricardo,

See my recent email with the new SQLite benchmark: I can now generate
the whole database in under 30 seconds, about the same order of
magnitude as a naive text database (which holds less data).

SQLite pattern search queries are extremely fast (<0.1s) and cover all
examples named so far:

- exact basename match
- partial path match
- pattern match (e.g. "/include/%foo%")

The only thing that's missing is regexp support.  I'll see what we can do.

Considering this, I find that leveraging SQLite is very attractive at
this point:

- Fast (fastest?).
- Low on memory.
- No need to come up with our own data format.
- Less work, no need to reinvent the wheel.

>> - It does not cover the case where I don't know the basename, e.g. if I'm
>>   looking for a FOO header file my query would look like "/include/.*foo.*".
>
> I think this is a rather rare use case, which in my opinion doesn’t
> justify forgoing the use of a smart data structure.

I don't find it so rare, actually.  I've used filesearch (on other OSes)
to find

- Emacs packages where I didn't know the exact name of the .el.
- TeXlive packages (every time you get a font or a .sty error)
- Shared libraries (.so of which I didn't know the exact name)
- include files as above
- Your favourite programming language package...

> It would *still* be possible to use the prefix tree for this kind of
> search, but it would certainly not be optimal. (E.g. by searching up
> to the first wildcard, and then searching each resulting sub-tree up
> to the first non-wildcard, and resuming the search in all remaining
> sub-trees.)

But it's not a wildcard, it's a regexp.  Can we run full-path regexp queries
over a trie?

>> I believe it's important that the search be as general as possible.
>
> Search should be cheap and fast.

While very important, I would put more priority on exactness.  A search
which gives unreliable results (e.g. returns nothing while there exists
a result) is a search that would deter many users from using it, in my
experience.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  9:40             ` Pierre Neidhardt
@ 2020-08-13 10:08               ` Pierre Neidhardt
  2020-08-13 11:47               ` Ricardo Wurmus
  2020-08-13 12:20               ` Arun Isaac
  2 siblings, 0 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13 10:08 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 221 bytes --]

An alternative to regexp would be "full-text search":

https://www.sqlite.org/fts5.html

I don't know how fast that is.  Might also induce more disk usage.  I'll test.

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  9:40             ` Pierre Neidhardt
  2020-08-13 10:08               ` Pierre Neidhardt
@ 2020-08-13 11:47               ` Ricardo Wurmus
  2020-08-13 13:44                 ` Pierre Neidhardt
  2020-08-13 12:20               ` Arun Isaac
  2 siblings, 1 reply; 45+ messages in thread
From: Ricardo Wurmus @ 2020-08-13 11:47 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel


Pierre Neidhardt <mail@ambrevar.xyz> writes:

> - Or do you think SQLite patterns (using "%") would do for now?  As
>   Mathieu pointed out, it's an unfortunate inconsistency with the rest of
>   Guix.  But maybe regexp support can be added in a second stage.

These patterns could be generated from user input that looks more like
what we already use elsewhere.  We don’t have to directly pass the
SQLite interface to users.

-- 
Ricardo


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13  9:40             ` Pierre Neidhardt
  2020-08-13 10:08               ` Pierre Neidhardt
  2020-08-13 11:47               ` Ricardo Wurmus
@ 2020-08-13 12:20               ` Arun Isaac
  2020-08-13 13:53                 ` Pierre Neidhardt
  2 siblings, 1 reply; 45+ messages in thread
From: Arun Isaac @ 2020-08-13 12:20 UTC (permalink / raw)
  To: Pierre Neidhardt, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1025 bytes --]


>> sqlite insert statements can be very fast. sqlite.org claims 50000
>> or...
>
> I tried it, and got it down to... 30s!

That's great! :-)

> - Or do you think SQLite patterns (using "%") would do for now?  As
>   Mathieu pointed out, it's an unfortunate inconsistency with the rest of
>   Guix.  But maybe regexp support can be added in a second stage.

The inconsistency is unfortunate. Personally, I am in favor of dropping
regexp support everywhere in Guix, and only having literal string
search. But that is backward incompatible, and may be controversial.

In this specific case of file search, we could use the sqlite like
patterns, but not expose them to the user. For example, if the search
query is "<query>", we search for the LIKE pattern "%<query>%". I think
this addresses how users normally search for files. I don't think
regexps add much value.
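
A sketch of that wrapping, with the query passed as a bound parameter
and LIKE's metacharacters escaped so the user's string is matched
literally (Python's sqlite3 for illustration; table and helper name
invented):

```python
import sqlite3

def search(conn, query):
    # Escape LIKE's metacharacters in the literal query, then wrap it
    # in "%" so it matches anywhere in the subpath.
    escaped = (query.replace("\\", "\\\\")
                    .replace("%", "\\%")
                    .replace("_", "\\_"))
    return conn.execute(
        "SELECT subpath FROM Files WHERE subpath LIKE ? ESCAPE '\\'",
        ["%" + escaped + "%"]).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Files (subpath TEXT)")
conn.executemany("INSERT INTO Files VALUES (?)",
                 [("/bin/transmission-gtk",), ("/lib/libm.so",)])
print(search(conn, "libm.so"))  # [('/lib/libm.so',)]
```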

Full text search may not be relevant to file search. Full text search is
more suited for natural language search involving such things as
stemming algorithms.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 11:47               ` Ricardo Wurmus
@ 2020-08-13 13:44                 ` Pierre Neidhardt
  0 siblings, 0 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13 13:44 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 693 bytes --]

Ricardo Wurmus <rekado@elephly.net> writes:

> Pierre Neidhardt <mail@ambrevar.xyz> writes:
>
>> - Or do you think SQLite patterns (using "%") would do for now?  As
>>   Mathieu pointed out, it's an unfortunate inconsistency with the rest of
>>   Guix.  But maybe regexp support can be added in a second stage.
>
> These patterns could be generated from user input that looks more like
> what we already use elsewhere.  We don’t have to directly pass the
> SQLite interface to users.

True, an obvious fix to the problem of syntax familiarity would be to
replace "*" with "%" in the user query, so it looks like shell globbing.
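
That substitution, plus escaping LIKE's own metacharacters so they
match literally, is only a few lines; a sketch (helper name invented,
and the resulting pattern must be used with a matching ESCAPE clause):

```python
def glob_to_like(query: str) -> str:
    """Translate a shell-glob-style query into a SQLite LIKE pattern.
    "*" and "?" become wildcards; "%", "_" and "\\" are escaped so
    they match themselves (use with: LIKE ? ESCAPE '\\')."""
    out = []
    for ch in query:
        if ch == "*":
            out.append("%")
        elif ch == "?":
            out.append("_")
        elif ch in ("%", "_", "\\"):
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)

print(glob_to_like("/include/*foo*"))  # /include/%foo%
```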

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 12:20               ` Arun Isaac
@ 2020-08-13 13:53                 ` Pierre Neidhardt
  2020-08-13 15:14                   ` Arun Isaac
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13 13:53 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1915 bytes --]

Arun Isaac <arunisaac@systemreboot.net> writes:

>> - Or do you think SQLite patterns (using "%") would do for now?  As
>>   Mathieu pointed out, it's an unfortunate inconsistency with the rest of
>>   Guix.  But maybe regexp support can be added in a second stage.
>
> The inconsistency is unfortunate. Personally, I am in favor of dropping
> regexp support everywhere in Guix, and only having literal string
> search. But that is backward incompatible, and may be controversial.
>
> In this specific case of file search, we could use the sqlite like
> patterns, but not expose them to the user. For example, if the search
> query is "<query>", we search for the LIKE pattern "%<query>%". I think
> this addresses how users normally search for files. I don't think
> regexps add much value.

I agree.

> Full text search may not be relevant to file search. Full text search is
> more suited for natural language search involving such things as
> stemming algorithms.

Yes, but full text search brings us a few niceties here:

- Wildcards using the `*` character.  This fixes the unfamiliarity of `%`.

- "Automatic word permutations" (maybe not the right term).  "foo bar" and
  "bar foo" both match the same results!  

- Case insensitive, diacritic insensitive (e.g. "e" matches "É").

- Logic: we can do "(OR foo bar) AND (OR qux quuz)".

- Relevance ranking: results can be sorted by relevance, another problem
  we don't have to fix ourselves ;)

All the above is arguably more powerful and easier to use than regexp.
But even if no user ever bothers with the logic operators, the default
behaviour "just works" in the fashion of a search engine.

The main thing I don't know how to do is suffix matches (like "%foo").
With FTS, looking up "foo*" won't match "libfoo", which is problematic.
Any clue how to fix that?

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 13:53                 ` Pierre Neidhardt
@ 2020-08-13 15:14                   ` Arun Isaac
  2020-08-13 15:36                     ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Arun Isaac @ 2020-08-13 15:14 UTC (permalink / raw)
  To: Pierre Neidhardt, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1228 bytes --]


> Yes, but full text search brings us a few niceties here:

These are nice features, but I don't know if all of them are useful for
file search. Normally, with Arch's pkgfile, I search for some missing
header file, shared library, etc. Usually, I know the exact filename I
am looking for, or I know some prefix or suffix of the exact filename.

> - Case insensitive, diacritic insensitive (e.g. "e" matches "É").

Case insensitivity is quite useful. Most filenames are in lower case,
but there is always that one odd filename out there.

But filenames usually don't have diacritics. So, I'm not sure if
diacritic insensitivity is useful.

> All the above is arguably more powerful and easier to use than regexp.
> But even if no user ever bothers with the logic operators, the default
> behaviour "just works" in the fashion of a search engine.
>
> The main thing I don't know how to do is suffix matches (like "%foo").
> With FTS, looking up "foo*" won't match "libfoo", which is problematic.
> Any clue how to fix that?

This is handled by stemming. We'll need a custom stemmer that normalizes
libfoo to foo. Xapian has a nice page on stemming. See
https://xapian.org/docs/stemming.html

Cheers!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 15:14                   ` Arun Isaac
@ 2020-08-13 15:36                     ` Pierre Neidhardt
  2020-08-13 15:56                       ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13 15:36 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 868 bytes --]

Arun Isaac <arunisaac@systemreboot.net> writes:

> But filenames usually don't have diacritics. So, I'm not sure if
> diacritic insensitivity is useful.

Probably not, but if there ever is this odd file name with an accent,
then we won't have to worry about it, it will be handled.  Better too
much than too little!

> This is handled by stemming. We'll need a custom stemmer that normalizes
> libfoo to foo. Xapian has a nice page on stemming. See
> https://xapian.org/docs/stemming.html

I think I gave a confusing example; stemming is too specific.
I meant searching for a pattern that does not start at the beginning
of the word: if I search "bar", I'd like to match "foobar".

The reverse is possible ("foo" matches "foobar").

Well, I'll share my FTS demo right away; we can always think about this
case later.

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 15:36                     ` Pierre Neidhardt
@ 2020-08-13 15:56                       ` Pierre Neidhardt
  2020-08-15 19:33                         ` Arun Isaac
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-13 15:56 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1400 bytes --]

I've pushed my experiment to the `wip-filesearch' branch.

As of this writing it is not automatically triggered by "guix build".
To test it:

- Load the module from a REPL.

- Run

--8<---------------cut here---------------start------------->8---
  (test-index-git)
--8<---------------cut here---------------end--------------->8---

  You should get a database at ~/.cache/guix/files.db.

- Run

--8<---------------cut here---------------start------------->8---
  (format-search (search-file-package "git perl5"))
--8<---------------cut here---------------end--------------->8---

  to get some search results.

- Run

--8<---------------cut here---------------start------------->8---
  (persist-all-local-packages)
--8<---------------cut here---------------end--------------->8---

  to build the database for all your local files.  Warning: this takes
  some 30 seconds on my machine, and could take minutes if you have a
  spinning drive or an old processor.

I'll be away for a couple of days, feel free to experiment and send me
some feedback in the meantime.

What remains to be done:

- Automatically index "guix build" result in database.
  Any idea where I can trigger the indexing from a "guix build"?

- Sync the database data from a substitute server.  Any pointer would be
  appreciated :)

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 10:04                     ` Pierre Neidhardt
@ 2020-08-15 12:47                       ` Hartmut Goebel
  2020-08-15 21:20                         ` Bengt Richter
  0 siblings, 1 reply; 45+ messages in thread
From: Hartmut Goebel @ 2020-08-15 12:47 UTC (permalink / raw)
  To: guix-devel

Am 13.08.20 um 12:04 schrieb Pierre Neidhardt:
> SQLite pattern search queries are extremely fast (<0.1s) and cover all
> examples named so far:
>
> - exact basename match
> - partial path match
> - pattern match (e.g. "/include/%foo%")


For comparison: These are the options Debian Package search
<https://www.debian.org/distrib/packages.en#search_packages> supports:

  * paths ending with the keyword  -> e.g. file-extensions
  * packages that contain files named like this -> basename
  * packages that contain files whose names contain the keyword  ->
    LIKE  %<keyword>%

-- 
Regards
Hartmut Goebel

| Hartmut Goebel          | h.goebel@crazy-compilers.com               |
| www.crazy-compilers.com | compilers which you thought are impossible |



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-11 12:35   ` Pierre Neidhardt
@ 2020-08-15 12:48     ` Hartmut Goebel
  0 siblings, 0 replies; 45+ messages in thread
From: Hartmut Goebel @ 2020-08-15 12:48 UTC (permalink / raw)
  To: Pierre Neidhardt, Mathieu Othacehe; +Cc: guix-devel

Am 11.08.20 um 14:35 schrieb Pierre Neidhardt:
> Unlike Nix, we would like to do more than just index executable files.
> Indeed, it's very useful to know where to find, say, a C header, a .so
> library, a TeXlive .sty file, etc.

+1

Most of the time I'm searching for non-executable files.

-- 
Regards
Hartmut Goebel

| Hartmut Goebel          | h.goebel@crazy-compilers.com               |
| www.crazy-compilers.com | compilers which you thought are impossible |



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-13 15:56                       ` Pierre Neidhardt
@ 2020-08-15 19:33                         ` Arun Isaac
  2020-08-24  8:29                           ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Arun Isaac @ 2020-08-15 19:33 UTC (permalink / raw)
  To: Pierre Neidhardt, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 1191 bytes --]


Hi Pierre,

I tried the wip-filesearch branch. Nice work! :-)
persist-all-local-packages takes around 350 seconds on my machine (slow
machine with spinning disk) and the database is 50 MB. Some other
comments follow.

- Maybe, we shouldn't index hidden files, particularly all the .xxx-real
  files created by our wrap phases.

- You should use SQL prepared statements with sqlite-prepare,
  sqlite-bind, etc. That would correctly handle escaping special
  characters in the search string. Currently, searching for
  "transmission-gtk", "libm.so", etc. errors out.

- Searching for "git perl5" works as expected, but searching for "git
  perl" returns no results. I think this is due to the tokenizer used by
  the full text search indexer. The tokenizer sees the word "perl5" as
  one indivisible token and does not realize that "perl" is a prefix of
  "perl5". Unfortunately, I think this is a fundamental problem with FTS
  -- one that can only be fixed by using simple LIKE patterns. FTS is
  meant for natural language search where this kind of thing would be
  normal.

- I guess you are only indexing local packages now, but will include all
  packages later by some means.

Cheers!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: File search progress: database review and question on triggers
  2020-08-15 12:47                       ` Hartmut Goebel
@ 2020-08-15 21:20                         ` Bengt Richter
  2020-08-16  8:18                           ` Hartmut Goebel
  0 siblings, 1 reply; 45+ messages in thread
From: Bengt Richter @ 2020-08-15 21:20 UTC (permalink / raw)
  To: Hartmut Goebel; +Cc: guix-devel

Hi Hartmut et al.,

On +2020-08-15 14:47:12 +0200, Hartmut Goebel wrote:
> Am 13.08.20 um 12:04 schrieb Pierre Neidhardt:
> > SQLite pattern search queries are extremely fast (<0.1s) and cover all
> > examples named so far:
> >
> > - exact basename match
> > - partial path match
> > - pattern match (e.g. "/include/%foo%")
> 
> 
> For comparison: These are the options Debian Package search
> <https://www.debian.org/distrib/packages.en#search_packages> supports:
> 
>   * paths ending with the keyword  -> e.g. file-extensions
>   * packages that contain files named like this -> basename
>   * packages that contain files whose names contain the keyword  ->
>     LIKE  %<keyword>%
>

If you are on debian, have you tried
    dpkg -l '*your*globbed*name*here*'
or
    dpkg -l '*'|egrep your-regex-here
or the former, preserving the header and excluding non-installed packages, e.g.:
    dpkg -l '*your*globbed*name*here*'|grep -v ^un

> -- 
> Regards
> Hartmut Goebel
> 
> | Hartmut Goebel          | h.goebel@crazy-compilers.com               |
> | www.crazy-compilers.com | compilers which you thought are impossible |
> 
> 
-- 
Regards,
Bengt Richter



* Re: File search progress: database review and question on triggers
  2020-08-15 21:20                         ` Bengt Richter
@ 2020-08-16  8:18                           ` Hartmut Goebel
  0 siblings, 0 replies; 45+ messages in thread
From: Hartmut Goebel @ 2020-08-16  8:18 UTC (permalink / raw)
  To: Bengt Richter; +Cc: guix-devel

On 15.08.20 at 23:20, Bengt Richter wrote:
> If you are on debian, have you tried
>     dpkg -l '*your*globbed*name*here*'

No, since for Debian I'm not using a command-line tool, but the
Web-Interface - which allows querying even packages I have not
installed.  (And the latter is my specific use case: finding which
package I need to install to get a specific file.)

-- 
Regards
Hartmut Goebel

| Hartmut Goebel          | h.goebel@crazy-compilers.com               |
| www.crazy-compilers.com | compilers which you thought are impossible |




* Re: File search progress: database review and question on triggers OFF TOPIC PRAISE
  2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
  2020-08-11  9:43 ` Mathieu Othacehe
  2020-08-11 15:43 ` Ricardo Wurmus
@ 2020-08-18 14:58 ` Joshua Branson
  2020-08-27 10:00 ` File search progress: database review and question on triggers zimoun
  3 siblings, 0 replies; 45+ messages in thread
From: Joshua Branson @ 2020-08-18 14:58 UTC (permalink / raw)
  To: guix-devel


Thanks for working on this!  This is a super awesome feature!  Best of luck!

-- 
Joshua Branson
Sent from Emacs and Gnus



* Re: File search progress: database review and question on triggers
  2020-08-15 19:33                         ` Arun Isaac
@ 2020-08-24  8:29                           ` Pierre Neidhardt
  2020-08-24 10:53                             ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-24  8:29 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 2150 bytes --]

Hi Arun,

thanks for your feedback!

350 seconds does not seem too bad.  Let's keep in mind that this
operation will mostly be used by the build farm / substitute server, not
by the user.

> - Maybe we shouldn't index hidden files, particularly all the .xxx-real
>   files created by our wrap phases.

I thought about this, but for now I don't think it's a good idea:

- Hidden files are rather rare and won't take up much space in the
  database.
- What if the user actually searches for a hidden file?
- It's easy to exclude hidden files from the search results.

> - You should use SQL prepared statements with sqlite-prepare,
>   sqlite-bind, etc. That would correctly handle escaping special
>   characters in the search string. Currently, searching for
>   "transmission-gtk", "libm.so", etc. errors out.

Thanks for pointing this out, I'll look into it.

> - Searching for "git perl5" works as expected, but searching for "git
>   perl" returns no results. I think this is due to the tokenizer used by
>   the full text search indexer. The tokenizer sees the word "perl5" as
>   one indivisible token and does not realize that "perl" is a prefix of
>   "perl5". Unfortunately, I think this is a fundamental problem with FTS
>   -- one that can only be fixed by using simple LIKE patterns. FTS is
>   meant for natural language search where this kind of thing would be
>   normal.

Indeed, but I think "git perl*" would have worked.  We can always
pre-process the query to add stars everywhere.

At the moment, the only downside I see with FTS is that it seems to be
impossible to match words for which we don't know the beginning.

Anyway, switching from FTS to LIKE patterns is rather easy.  In fact, I
could implement both with a tiny abstraction so that we can choose which
one we want in the end.
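
The tokenizer behaviour and the LIKE fallback can be checked directly; a
quick sketch using Python's sqlite3 for brevity (the behaviour is
SQLite's own; assumes an FTS5-enabled build, which most are):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE files USING fts5(subpath)")
db.execute("INSERT INTO files VALUES ('/gnu/store/abc-perl5/bin/perl5')")

# The default tokenizer indexes "perl5" as a single token, so a plain
# MATCH on "perl" finds nothing...
exact = db.execute(
    "SELECT count(*) FROM files WHERE files MATCH 'perl'").fetchone()[0]
# ...whereas the prefix query "perl*" does match it,
prefix = db.execute(
    "SELECT count(*) FROM files WHERE files MATCH 'perl*'").fetchone()[0]
# as does a plain LIKE pattern.
like = db.execute(
    "SELECT count(*) FROM files WHERE subpath LIKE '%perl%'").fetchone()[0]
```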

> - I guess you are only indexing local packages now, but will include all
>   packages later by some means.

Indeed, I want the substitute server / build farm to expose the database
for syncing.  I'd need some help to get started.  Anyone?

-- 
Pierre Neidhardt
https://ambrevar.xyz/



* Re: File search progress: database review and question on triggers
  2020-08-24  8:29                           ` Pierre Neidhardt
@ 2020-08-24 10:53                             ` Pierre Neidhardt
  2020-09-04 19:15                               ` Arun Isaac
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-24 10:53 UTC (permalink / raw)
  To: Arun Isaac, Ricardo Wurmus; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 875 bytes --]

>> - You should use SQL prepared statements with sqlite-prepare,
>>   sqlite-bind, etc. That would correctly handle escaping special
>>   characters in the search string. Currently, searching for
>>   "transmission-gtk", "libm.so", etc. errors out.
>
> Thanks for pointing this out, I'll look into it.


I've looked into it.
Actually, the issue is not the lack of sqlite-bind-arguments, but
"-" and "." triggering an error in FTS.

I've fixed it in
e08f913d20428a9a925cc46d177c7446f55e6443.  The downside is that we can't
use any special character like boolean operators.  I'm not sure how we
can get the best of both worlds.  Maybe add a command line flag that
would enable "extended search syntax" in which the user can use all of
FTS but must make sure to quote "libm.so" for instance.
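
For reference, both the error and the quoting workaround can be
reproduced like this (a Python sqlite3 sketch, since the behaviour is
SQLite's own; assumes an FTS5-enabled build):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE files USING fts5(subpath)")
db.execute("INSERT INTO files VALUES ('/gnu/store/abc-glibc/lib/libm.so')")

# A bare "libm.so" is an FTS query syntax error: "." is not a valid
# character in an unquoted query term.
try:
    db.execute("SELECT * FROM files WHERE files MATCH 'libm.so'").fetchall()
    errored = False
except sqlite3.OperationalError:
    errored = True

# Double-quoting turns it into a phrase query (tokens "libm" then "so"),
# which works -- but inside the quotes, operators like OR lose their
# special meaning, hence the trade-off described above.
hits = db.execute("SELECT count(*) FROM files WHERE files MATCH ?",
                  ('"libm.so"',)).fetchone()[0]
```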

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/



* Re: File search progress: database review and question on triggers
  2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
                   ` (2 preceding siblings ...)
  2020-08-18 14:58 ` File search progress: database review and question on triggers OFF TOPIC PRAISE Joshua Branson
@ 2020-08-27 10:00 ` zimoun
  2020-08-27 11:15   ` Pierre Neidhardt
  3 siblings, 1 reply; 45+ messages in thread
From: zimoun @ 2020-08-27 10:00 UTC (permalink / raw)
  To: Pierre Neidhardt, guix-devel

Hi Pierre,

Cool to revive the topic. :-)

(disclaimer: I have not yet processed the whole thread)


On Mon, 10 Aug 2020 at 16:32, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> 1. An SQLite database with the following schema:

If you are going to use a local SQL database, my two questions are:

a)
Which part would update it?  “guix pull”?  Something else?  Even using
substitutes, channels and the like could add an extra cost, so what
is acceptable and what is not?

b)
Could you also include other fields, such as “synopsis” and
“description”?  Because it could speed up “guix search” without adding
(or modifying) the current cache
(~/.config/guix/current/lib/package.cache); discussed at length in
#39258 <http://issues.guix.gnu.org/issue/39258>.



Cheers,
simon




* Re: File search progress: database review and question on triggers
  2020-08-27 10:00 ` File search progress: database review and question on triggers zimoun
@ 2020-08-27 11:15   ` Pierre Neidhardt
  2020-08-27 12:56     ` zimoun
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-27 11:15 UTC (permalink / raw)
  To: zimoun, guix-devel

[-- Attachment #1: Type: text/plain, Size: 1217 bytes --]

Hi Simon!

zimoun <zimon.toutoune@gmail.com> writes:

> If you are going to use a local SQL database, my two questions are:
>
> a)
> Which part would update it?  “guix pull”?  Something else?  Even using
> substitutes, channels and the like could add an extra cost, so what
> is acceptable and what is not?

I suggest fetching database updates when performing the filesearch, just
like `guix size` does.

The induced cost would be little in my opinion.  A compressed database
for a whole generation is below 10 MiB.  This operation would need to be
done only once per Guix generation.  If we are a bit smart with timestamping
and only send a diff, it could even be much, much smaller.

> b)
> Could you also include other fields, such as “synopsis” and
> “description”?  Because it could speed up “guix search” without adding
> (or modifying) the current cache
> (~/.config/guix/current/lib/package.cache); discussed at length in
> #39258 <http://issues.guix.gnu.org/issue/39258>.

I think this is a bit beyond the scope of this patch set.  I'd rather
focus on files exclusively for now and proceed one step at a time :)

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/



* Re: File search progress: database review and question on triggers
  2020-08-27 11:15   ` Pierre Neidhardt
@ 2020-08-27 12:56     ` zimoun
  2020-08-27 13:19       ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: zimoun @ 2020-08-27 12:56 UTC (permalink / raw)
  To: Pierre Neidhardt, guix-devel

Hi Pierre,

On Thu, 27 Aug 2020 at 13:15, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> zimoun <zimon.toutoune@gmail.com> writes:
>
>> If you are going to use a local SQL database, my two questions are:
>>
>> a)
>> Which part would update it?  “guix pull”?  Something else?  Even using
>> substitutes, channels and the like could add an extra cost, so what
>> is acceptable and what is not?
>
> I suggest fetching database updates when performing the filesearch, just
> like `guix size` does.

I am not sure I see how.  One needs the whole database to search in,
and cannot know in advance which packages are relevant.  This is unlike
“guix size”, which manipulates the graph and then downloads the missing
parts.

Therefore, your suggestion is to download the whole database the first
time the user runs “guix filesearch”, i.e., download ~10 MiB.

Then each time the user runs “guix pull” and then “guix filesearch”,
there are two options: either download the new database for the latest
Guix generation, or download a diff (not sure the complexity is worth it
for ~10 MiB).

Right?


And what about the channels?  Because if I read correctly, “guix size”
fails when there is “no available substitute information”.


>> b)
>> Could you also include other fields, such as “synopsis” and
>> “description”?  Because it could speed up “guix search” without adding
>> (or modifying) the current cache
>> (~/.config/guix/current/lib/package.cache); discussed at length in
>> #39258 <http://issues.guix.gnu.org/issue/39258>.
>
> I think this is a bit beyond the scope of this patch set.  I'd rather
> focus on files exclusively for now and proceed one step at a time :)

I do not think it is beyond the scope, because Arun introduced an SQLite
database to improve “guix search”, but that path was abandoned for
“introducing complexity” [1].  Therefore, if “guix filesearch”
introduces an SQL cache, it seems a good idea to make it also usable by
“guix search”.

Well, if the table is extended with the fields “synopsis” and
“description”, then what is the size of the database?  Does it kill the
lookup performance?

[1] http://issues.guix.gnu.org/issue/39258#7


Cheers,
simon



* Re: File search progress: database review and question on triggers
  2020-08-27 12:56     ` zimoun
@ 2020-08-27 13:19       ` Pierre Neidhardt
  0 siblings, 0 replies; 45+ messages in thread
From: Pierre Neidhardt @ 2020-08-27 13:19 UTC (permalink / raw)
  To: zimoun, guix-devel

[-- Attachment #1: Type: text/plain, Size: 1945 bytes --]

zimoun <zimon.toutoune@gmail.com> writes:

> I am not sure I see how.  One needs the whole database to search in,
> and cannot know in advance which packages are relevant.  This is unlike
> “guix size”, which manipulates the graph and then downloads the missing
> parts.
>
> Therefore, your suggestion is to download the whole database the first
> time the user runs “guix filesearch”, i.e., download ~10 MiB.

Yes.
If the user is not using substitutes, they can also compute the database
from their local store items (like I'm doing in the graft).  It's less
useful but still a bit more convenient than running find/locate on
/gnu/store.

> Then each time the user runs “guix pull” and then “guix filesearch”,
> there are two options: either download the new database for the latest
> Guix generation, or download a diff (not sure the complexity is worth it
> for ~10 MiB).
>
> Right?

Yes.

> And what about the channels?  Because if I read correctly, “guix size”
> fails when there is “no available substitute information”.

Just like `guix size', it would work with local items.  But if there is
no substitute server for channels, then there is no way around it.

>> I think this is a bit beyond the scope of this patch set.  I'd rather
>> focus on files exclusively for now and proceed one step at a time :)
>
> I do not think it is beyond the scope, because Arun introduced an SQLite
> database to improve “guix search”, but that path was abandoned for
> “introducing complexity” [1].  Therefore, if “guix filesearch”
> introduces an SQL cache, it seems a good idea to make it also usable by
> “guix search”.
>
> Well, if the table is extended with the fields “synopsis” and
> “description”, then what is the size of the database?  Does it kill the
> lookup performance?

Good point, I'll test and report my measures.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/



* Re: File search progress: database review and question on triggers
  2020-08-24 10:53                             ` Pierre Neidhardt
@ 2020-09-04 19:15                               ` Arun Isaac
  2020-09-05  7:48                                 ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Arun Isaac @ 2020-09-04 19:15 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel


Hi Pierre!

Sorry for the very late response.

> I've fixed it in
> e08f913d20428a9a925cc46d177c7446f55e6443.  The downside is that we can't
> use any special character like boolean operators.  I'm not sure how we
> can get the best of both worlds.  Maybe add a command line flag that
> would enable "extended search syntax" in which the user can use all of
> FTS but must make sure to quote "libm.so" for instance.

Where is this commit so I may test it? I couldn't find the commit in the
wip-filesearch branch. The HEAD of the wip-filesearch branch I see is
49b52c2c7be03caf3636632c31f4451d5bc88125.

Regards,
Arun



* Re: File search progress: database review and question on triggers
  2020-09-04 19:15                               ` Arun Isaac
@ 2020-09-05  7:48                                 ` Pierre Neidhardt
  2020-09-06  9:25                                   ` Arun Isaac
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-09-05  7:48 UTC (permalink / raw)
  To: Arun Isaac; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 183 bytes --]

Oops!  Forgot to push.

It's actually commit 25147f983bdf432b03e8271abe0318f4812f94ba on
wip-filesearch.

Thanks for the notice!

-- 
Pierre Neidhardt
https://ambrevar.xyz/



* Re: File search progress: database review and question on triggers
  2020-09-05  7:48                                 ` Pierre Neidhardt
@ 2020-09-06  9:25                                   ` Arun Isaac
  2020-09-06 10:05                                     ` Pierre Neidhardt
  0 siblings, 1 reply; 45+ messages in thread
From: Arun Isaac @ 2020-09-06  9:25 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 698 bytes --]


> Oops!  Forgot to push.
>
> It's actually commit 25147f983bdf432b03e8271abe0318f4812f94ba on
> wip-filesearch.

I checked and it works now. Just a tiny nitpick:

The function signature of search-file-package,

--8<---------------cut here---------------start------------->8---
(define (search-file-package pattern . more-patterns) ...)
--8<---------------cut here---------------end--------------->8---

can be rewritten as

--8<---------------cut here---------------start------------->8---
(define (search-file-package . patterns) ...)
--8<---------------cut here---------------end--------------->8---

No need to cons pattern onto more-patterns in the function body.

Thanks for working on this!



* Re: File search progress: database review and question on triggers
  2020-09-06  9:25                                   ` Arun Isaac
@ 2020-09-06 10:05                                     ` Pierre Neidhardt
  2020-09-06 10:33                                       ` Arun Isaac
  0 siblings, 1 reply; 45+ messages in thread
From: Pierre Neidhardt @ 2020-09-06 10:05 UTC (permalink / raw)
  To: Arun Isaac; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 890 bytes --]

Arun Isaac <arunisaac@systemreboot.net> writes:

> The function signature of search-file-package,
>
> --8<---------------cut here---------------start------------->8---
> (define (search-file-package pattern . more-patterns) ...)
> --8<---------------cut here---------------end--------------->8---
>
> can be rewritten as
>
> --8<---------------cut here---------------start------------->8---
> (define (search-file-package . patterns) ...)
> --8<---------------cut here---------------end--------------->8---
>
> No need to cons pattern onto more-patterns in the function body.

There is a subtle difference: with the latter, (search-file-package) is
allowed and won't trigger a compile-time error, while with the former it
does.  This "foo . more-foo" pattern is a way to say "1 or more
arguments" instead of "0 or more".
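
For illustration, the same "1 or more" vs. "0 or more" distinction in
Python (hypothetical names; the branch itself is Guile, and in Python
the error shows up at call time rather than compile time):

```python
def search_file_package(pattern, *more_patterns):
    """'1 or more' patterns: a zero-argument call raises TypeError."""
    return [pattern, *more_patterns]

def search_any(*patterns):
    """'0 or more' patterns: a zero-argument call silently succeeds."""
    return list(patterns)

one_or_more = search_file_package("libm.so")
zero_ok = search_any()          # no error, though probably a user mistake
try:
    search_file_package()       # missing the required 'pattern' argument
    rejected = False
except TypeError:
    rejected = True
```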

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/



* Re: File search progress: database review and question on triggers
  2020-09-06 10:05                                     ` Pierre Neidhardt
@ 2020-09-06 10:33                                       ` Arun Isaac
  0 siblings, 0 replies; 45+ messages in thread
From: Arun Isaac @ 2020-09-06 10:33 UTC (permalink / raw)
  To: Pierre Neidhardt; +Cc: guix-devel

[-- Attachment #1: Type: text/plain, Size: 285 bytes --]


> There is a subtle difference: with the latter, (search-file-package) is
> allowed and won't trigger a compile-time error, while with the former it
> does.  This "foo . more-foo" pattern is a way to say "1 or more
> arguments" instead of "0 or more".

Ah, ok, that makes sense!

Regards,
Arun



end of thread, other threads:[~2020-09-06 10:33 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
2020-08-11  9:43 ` Mathieu Othacehe
2020-08-11 12:35   ` Pierre Neidhardt
2020-08-15 12:48     ` Hartmut Goebel
2020-08-11 15:43 ` Ricardo Wurmus
2020-08-11 17:54   ` Pierre Neidhardt
2020-08-11 17:58     ` Pierre Neidhardt
2020-08-11 20:08       ` Ricardo Wurmus
2020-08-12 19:10         ` Pierre Neidhardt
2020-08-12 20:13           ` Julien Lepiller
2020-08-12 20:43             ` Pierre Neidhardt
2020-08-12 21:29               ` Julien Lepiller
2020-08-12 22:29                 ` Ricardo Wurmus
2020-08-13  6:55                   ` Pierre Neidhardt
2020-08-13  6:52                 ` Pierre Neidhardt
2020-08-13  9:34                   ` Ricardo Wurmus
2020-08-13 10:04                     ` Pierre Neidhardt
2020-08-15 12:47                       ` Hartmut Goebel
2020-08-15 21:20                         ` Bengt Richter
2020-08-16  8:18                           ` Hartmut Goebel
2020-08-12 20:32           ` Pierre Neidhardt
2020-08-13  0:17           ` Arun Isaac
2020-08-13  6:58             ` Pierre Neidhardt
2020-08-13  9:40             ` Pierre Neidhardt
2020-08-13 10:08               ` Pierre Neidhardt
2020-08-13 11:47               ` Ricardo Wurmus
2020-08-13 13:44                 ` Pierre Neidhardt
2020-08-13 12:20               ` Arun Isaac
2020-08-13 13:53                 ` Pierre Neidhardt
2020-08-13 15:14                   ` Arun Isaac
2020-08-13 15:36                     ` Pierre Neidhardt
2020-08-13 15:56                       ` Pierre Neidhardt
2020-08-15 19:33                         ` Arun Isaac
2020-08-24  8:29                           ` Pierre Neidhardt
2020-08-24 10:53                             ` Pierre Neidhardt
2020-09-04 19:15                               ` Arun Isaac
2020-09-05  7:48                                 ` Pierre Neidhardt
2020-09-06  9:25                                   ` Arun Isaac
2020-09-06 10:05                                     ` Pierre Neidhardt
2020-09-06 10:33                                       ` Arun Isaac
2020-08-18 14:58 ` File search progress: database review and question on triggers OFF TOPIC PRAISE Joshua Branson
2020-08-27 10:00 ` File search progress: database review and question on triggers zimoun
2020-08-27 11:15   ` Pierre Neidhardt
2020-08-27 12:56     ` zimoun
2020-08-27 13:19       ` Pierre Neidhardt
