Hi Zimoun,

Thanks for the feedback!

> --8<---------------cut here---------------start------------->8---
> echo 3 > /proc/sys/vm/drop_caches
> time updatedb --output=/tmp/store.db --database-root=/gnu/store/
>
> real    0m19.903s
> user    0m1.549s
> sys     0m4.500s
> --8<---------------cut here---------------end--------------->8---

I don't know the size of your store nor your hardware.  Could you
benchmark against my filesearch implementation?

> And then “locate” supports regexps and it is fast enough.

But locate does not support word permutations, which is a very
important feature for filesearch in my opinion.

> The only point is that regexp is always cumbersome for me.  Well: «Some
> people, when confronted with a problem, think "I know, I'll use regular
> expressions."  Now they have two problems.» :-) [1]

Exactly.  Full-text search is a big step forward for usability, I think.

> From my point of view, yes.  Somehow “filesearch” is a subpart of
> “search”.  So it should be the machinery.

I'll work on it.  I'll try to make the code flexible enough that it can
be moved to another command easily, should we decide that "search" is
not the right fit.

> From my point of view, how to transfer the database from substitutes to
> users and how to update it locally (custom channels or custom load
> path) are not easy.  Maybe the core issues.

Absolutely.

> For example, I just did “guix pull” and “--list-generations” says from
> f6dfe42 (Sept. 15) to 4ec2190 (Oct. 10):
>
>   39.9 MB will be downloaded
>
> plus the tiny bits before “Computing Guix derivation”.  Say 50 MB max.
>
> Well, the “locate” database for my “/gnu/store” (~30 GB) is already
> ~50 MB, and ~20 MB when compressed with gzip.  And Pierre said:
>
>   The database with all package descriptions and synopses is 46 MiB
>   and compresses down to 11 MiB with zstd.

I should have benchmarked with lzip; it would have been more useful.  I
think we can get it down to approximately 8 MiB with lzip.

> which is better but still something.
> Well, it is not affordable to fetch the database with “guix pull”,
> In My Humble Opinion.

We could send a "diff" of the database.  For instance, if the user
already has a file database for Guix generation A, and then guix pulls
to B, the substitute server can send only the diff between A and B.
This would probably amount to less than 1 MiB if the generations are
not too far apart.  (Warning: actual measurements needed!)

> Therefore, the database would be fetched at the first “guix search”
> (assuming the point above).  But now, how could “search” know what is
> custom-built and what is not?  Somehow, “search” should scan the whole
> store to be able to update the database.
>
> And what happens each time I do a custom build and then “filesearch”?
> The database should be updated, right?  Well, it seems almost
> unusable.

I mentioned this previously: we need to update the database on "guix
build".  This is very fast and would be mostly transparent to the user.
This is essentially how "guix size" behaves.

> The model “updatedb/locate” seems better.  The user updates “manually”
> if required and then locating is fast.

"Manually" is not good in my opinion.  The end user will inevitably
forget.  An out-of-sync database would return bad results, which is a
big no-no for search.  On-demand database updates are ideal, I think.

> To me, each time I am using “filesearch”:
>
> - first time: fetch the database corresponding to the Guix commit and
>   then update it with my local store

Possibly using a "diff" to shrink the download size.

> - otherwise: use this database
> - optionally update the database if the user wants to include new
>   custom items.

No need for the optional point, I believe.

> We could imagine a hook or option to “guix pull” specifying to also
> fetch the database and update it at pull time instead of “search”
> time.  Personally, I prefer a longer “guix pull”, since it is already
> a bit long, followed by a fast “search”, rather than half/half (a
> not-so-long pull and a longer search).
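To make the "diff" idea more concrete, here is a rough sketch.  The
"databases" below are plain sorted "package<TAB>file" lists; the format,
file names, and contents are purely illustrative assumptions on my part,
not what the actual implementation does:

```shell
# Fake file databases for two hypothetical Guix generations, A and B.
printf 'emacs\t/bin/emacs\nguile\t/bin/guile\n'   | sort > db-A.txt
printf 'emacs\t/bin/emacs\nguile-3\t/bin/guile\n' | sort > db-B.txt

# Substitute-server side: compute the (small) diff between A and B.
comm -23 db-A.txt db-B.txt > removed.txt   # entries only in A
comm -13 db-A.txt db-B.txt > added.txt     # entries only in B

# Client side: reconstruct B's database from A's plus the diff,
# instead of downloading the whole database again.
comm -23 db-A.txt removed.txt | sort - added.txt > db-B-rebuilt.txt

cmp -s db-B.txt db-B-rebuilt.txt && echo "diff applied correctly"
```

Only `removed.txt` and `added.txt` would travel over the network, which
is why close generations should yield a very small download.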
I suggest we do it at pull time so that "guix search" does not need
network access.  "guix pull" requires networking anyway.

>> - Find a way to garbage-collect the database(s).  My intuition is
>>   that we should have one database per Guix checkout, and when we
>>   `guix gc` a Guix checkout we collect the corresponding database.
>
> Well, the exact same strategy as
> ~/.config/guix/current/lib/guix/package.cache can be used.

Oh!  I didn't know about this file!  What is it used for?

> BTW, thanks Pierre for improving the Guix discoverability. :-)

Thank you! :)

--
Pierre Neidhardt
https://ambrevar.xyz/