unofficial mirror of guix-devel@gnu.org 
 help / color / mirror / code / Atom feed
From: zimoun <zimon.toutoune@gmail.com>
To: Pierre Neidhardt <mail@ambrevar.xyz>
Cc: Guix Devel <guix-devel@gnu.org>, Mathieu Othacehe <othacehe@gnu.org>
Subject: Re: File search progress: database review and question on triggers
Date: Sun, 11 Oct 2020 15:02:52 +0200	[thread overview]
Message-ID: <CAJ3okZ2EbtvBF7O-m1sU0PT1VrJtyM5ezKzy8DFsNXCz+w2x6Q@mail.gmail.com> (raw)
In-Reply-To: <87wnzx6mdh.fsf@ambrevar.xyz>

Hi Pierre,

I am trying to resume the work on "guix search" to improve it (faster).
That's why I am asking the details.  :-)
Because with the introduction of this database, as mentioned earlier,
2 annoyances could be fixed at once.


On Sun, 11 Oct 2020 at 13:19, Pierre Neidhardt <mail@ambrevar.xyz> wrote:

> > --8<---------------cut here---------------start------------->8---
> > echo 3 > /proc/sys/vm/drop_caches
> > time updatedb --output=/tmp/store.db --database-root=/gnu/store/
> >
> > real    0m19.903s
> > user    0m1.549s
> > sys     0m4.500s
>
> I don't know the size of your store nor your hardware.  Could you
> benchmark against my filesearch implementation?

30G as I reported in my previous email. ;-)


> > From my point of view, yes.  Somehow “filesearch” is a subpart of
> > “search”.  So it should be the machinery.
>
> I'll work on it.  I'll try to make the code flexible enough so that it
> can be moved to another command easily, should we decide that "search"
> is not the right fit.

UI does not matter so much at this point, I guess.  But the nice final
UI should be:

   guix search --file=


> > For example, I just did “guix pull” and “–list-generation” says from
> > f6dfe42 (Sept. 15) to 4ec2190 (Oct. 10)::
> >
> >    39.9 MB will be download
> >
> > more the tiny bits before “Computing Guix derivation”.  Say 50MB max.
> >
> > Well, the “locate” database for my “/gnu/store” (~30GB) is already to
> > ~50MB, and ~20MB when compressed with gzip.  And Pierre said:
> >
> >       The database will all package descriptions and synopsis is 46 MiB
> >       and compresses down to 11 MiB in zstd.
>
> I should have benchmarked with Lzip, it would have been more useful.  I
> think we can get it down to approximately 8 MiB in Lzip.

Well, I think it will be more with all the items of all the packages.
My point is: the database will be comparable in size with the bits of
"guix pull"; it is not much but still something.


> > which is better but still something.  Well, it is not affordable to
> > fetch the database with “guix pull”, In My Humble Opinion.
>
> We could send a "diff" of the database.

This means to setup server side, right?  So implement the "diff" in
"guix publish", right?  Hum? I feel it is overcomplicated.


> For instance, if the user already has a file database for the Guix
> generation A, then guix pulls to B, the substitute server can send the
> diff between A and B.  This would probably amount to less than 1 MiB if
> the generations are not too far apart.  (Warning: actual measures needed!)

Well, what is the size of for a full /gnu/store/ containing all the
packages of one specific revision?  Sorry if you already provided this
information, I have missed it.


> > Therefore, the database would be fetched at the first “guix search”
> > (assuming point above).  But now, how “search” could know what is custom
> > build and what is not?  Somehow, “search” should scan all the store to
> > be able to update the database.
> >
> > And what happens each time I am doing a custom build then “filesearch”.
> > The database should be updated, right?  Well, it seems almost unusable.
>
> I mentioned this previously: we need to update the database on "guix
> build".  This is very fast and would be mostly transparent to the user.
> This is essentially how "guix size" behaves.

Ok.


> > The model “updatedb/locate” seems better.  The user updates “manually”
> > if required and then location is fast.
>
> "manually" is not good in my opinion.  The end-user will inevitably
> forget.  An out-of-sync database would return bad results which is a
> big no-no for search.  On-demand database updates are ideals I think.

The tradeoff is:
  - when is "on-demand"?  When updates the database?
  - still fast when I search
 - do not slow down other guix subcommands


What you are proposing is:

 - when "guix search --file":
     + if the database does not exist: fetch it
     + otherwise: use it
 - after each "guix build", update the database

Right?

I am still missing the other update mechanism for updating the database.

(Note that the "fetch it" could be done at "guix pull" time which is
more meaningful since pull requires network access as you said.  And
the real computations for updating could be done at the first "guix
search --file" after the pull.)


> Possibly using a "diff" to shrink the download size.
>
> >  - otherwise: use this database
> >  - optionally update the database if the user wants to include new
> >  custom items.
>
> No need for the optional point I believe.

Note that since the same code is used on build farms and their store
is several TB (see recent discussion about "guix gc" on Berlin that
takes hours), the build and update of the database need some care. :-)


> >> - Find a way to garbage-collect the database(s).  My intuition is that
> >>   we should have 1 database per Guix checkout and when we `guix gc` a
> >>   Guix checkout we collect the corresponding database.
> >
> > Well, the exact same strategy as
> > ~/.config/guix/current/lib/guix/package.cache can be used.
>
> Oh!  I didn't know about this file!  What is it used for?

Basically for "--news".  Otherwise, it is used by
"fold-available-packages", "find-packages-by-name" and
"find-packages-by-location".  It is used only if "--load-path" is not
provided (cache-is-authoritative?).  And it is computed at the end
"guix pull".  The discussions about improving "guix search" was first
to replace it by SQL database, then to add another file mimicking it,
then to extend it (which leads to backward compatibility issues).

For example, compare:

--8<---------------cut here---------------start------------->8---
time guix package --list-available > /dev/null

real    0m1.025s
user    0m1.866s
sys     0m0.044s

time guix package --list-available -L /tmp/foo > /dev/null

real    0m4.436s
user    0m6.734s
sys     0m0.124s
--8<---------------cut here---------------end--------------->8---

The first uses the case, the second not.


Cheers,
simon


  reply	other threads:[~2020-10-11 13:03 UTC|newest]

Thread overview: 73+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-10 14:32 File search progress: database review and question on triggers Pierre Neidhardt
2020-08-11  9:43 ` Mathieu Othacehe
2020-08-11 12:35   ` Pierre Neidhardt
2020-08-15 12:48     ` Hartmut Goebel
2020-08-11 15:43 ` Ricardo Wurmus
2020-08-11 17:54   ` Pierre Neidhardt
2020-08-11 17:58     ` Pierre Neidhardt
2020-08-11 20:08       ` Ricardo Wurmus
2020-08-12 19:10         ` Pierre Neidhardt
2020-08-12 20:13           ` Julien Lepiller
2020-08-12 20:43             ` Pierre Neidhardt
2020-08-12 21:29               ` Julien Lepiller
2020-08-12 22:29                 ` Ricardo Wurmus
2020-08-13  6:55                   ` Pierre Neidhardt
2020-08-13  6:52                 ` Pierre Neidhardt
2020-08-13  9:34                   ` Ricardo Wurmus
2020-08-13 10:04                     ` Pierre Neidhardt
2020-08-15 12:47                       ` Hartmut Goebel
2020-08-15 21:20                         ` Bengt Richter
2020-08-16  8:18                           ` Hartmut Goebel
2020-08-12 20:32           ` Pierre Neidhardt
2020-08-13  0:17           ` Arun Isaac
2020-08-13  6:58             ` Pierre Neidhardt
2020-08-13  9:40             ` Pierre Neidhardt
2020-08-13 10:08               ` Pierre Neidhardt
2020-08-13 11:47               ` Ricardo Wurmus
2020-08-13 13:44                 ` Pierre Neidhardt
2020-08-13 12:20               ` Arun Isaac
2020-08-13 13:53                 ` Pierre Neidhardt
2020-08-13 15:14                   ` Arun Isaac
2020-08-13 15:36                     ` Pierre Neidhardt
2020-08-13 15:56                       ` Pierre Neidhardt
2020-08-15 19:33                         ` Arun Isaac
2020-08-24  8:29                           ` Pierre Neidhardt
2020-08-24 10:53                             ` Pierre Neidhardt
2020-09-04 19:15                               ` Arun Isaac
2020-09-05  7:48                                 ` Pierre Neidhardt
2020-09-06  9:25                                   ` Arun Isaac
2020-09-06 10:05                                     ` Pierre Neidhardt
2020-09-06 10:33                                       ` Arun Isaac
2020-08-18 14:58 ` File search progress: database review and question on triggers OFF TOPIC PRAISE Joshua Branson
2020-08-27 10:00 ` File search progress: database review and question on triggers zimoun
2020-08-27 11:15   ` Pierre Neidhardt
2020-08-27 12:56     ` zimoun
2020-08-27 13:19       ` Pierre Neidhardt
2020-09-26 14:04         ` Pierre Neidhardt
2020-09-26 14:12           ` Pierre Neidhardt
2020-10-05 12:35           ` Ludovic Courtès
2020-10-05 18:53             ` Pierre Neidhardt
2020-10-09 21:16               ` zimoun
2020-10-10  8:57                 ` Pierre Neidhardt
2020-10-10 14:58                   ` zimoun
2020-10-12 10:16                   ` Ludovic Courtès
2020-10-12 11:18                     ` Pierre Neidhardt
2020-10-13 13:48                       ` Ludovic Courtès
2020-10-13 13:59                         ` Pierre Neidhardt
2020-10-10 16:03               ` zimoun
2020-10-11 11:19                 ` Pierre Neidhardt
2020-10-11 13:02                   ` zimoun [this message]
2020-10-11 14:25                     ` Pierre Neidhardt
2020-10-11 16:05                       ` zimoun
2020-10-12 10:20               ` Ludovic Courtès
2020-10-12 11:21                 ` Pierre Neidhardt
2020-10-13 13:45                   ` Ludovic Courtès
2020-10-13 13:56                     ` Pierre Neidhardt
2020-10-13 21:22                       ` Ludovic Courtès
2020-10-14  7:50                         ` Pierre Neidhardt
2020-10-16 10:30                           ` Ludovic Courtès
2020-10-17  9:14                             ` Pierre Neidhardt
2020-10-17 19:17                               ` Pierre Neidhardt
2020-10-21  9:53                               ` Ludovic Courtès
2020-10-21  9:58                                 ` Pierre Neidhardt
2020-10-12 11:23                 ` zimoun

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJ3okZ2EbtvBF7O-m1sU0PT1VrJtyM5ezKzy8DFsNXCz+w2x6Q@mail.gmail.com \
    --to=zimon.toutoune@gmail.com \
    --cc=guix-devel@gnu.org \
    --cc=mail@ambrevar.xyz \
    --cc=othacehe@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).