From: zimoun
Date: Sun, 11 Oct 2020 15:02:52 +0200
Subject: Re: File search progress: database review and question on triggers
To: Pierre Neidhardt
In-Reply-To: <87wnzx6mdh.fsf@ambrevar.xyz>
References: <87sgcuh8rb.fsf@ambrevar.xyz> <86imd4e7cr.fsf@gmail.com>
 <87eenspcf8.fsf@ambrevar.xyz> <865z94dz83.fsf@gmail.com>
 <87zh6gns4l.fsf@ambrevar.xyz> <87zh5c7hx6.fsf@ambrevar.xyz>
 <87k0w4zw8q.fsf@gnu.org> <875z7oijxu.fsf@ambrevar.xyz>
 <865z7iqd9f.fsf@gmail.com> <87wnzx6mdh.fsf@ambrevar.xyz>
Cc: Guix Devel, Mathieu Othacehe

Hi Pierre,

I am trying to resume the work on "guix search" to make it faster.
That's why I am asking for the details. :-)  With the introduction of
this database, as mentioned earlier, two annoyances could be fixed at
once.

On Sun, 11 Oct 2020 at 13:19, Pierre Neidhardt wrote:

> > --8<---------------cut here---------------start------------->8---
> > echo 3 > /proc/sys/vm/drop_caches
> > time updatedb --output=/tmp/store.db --database-root=/gnu/store/
> >
> > real    0m19.903s
> > user    0m1.549s
> > sys     0m4.500s
>
> I don't know the size of your store nor your hardware.  Could you
> benchmark against my filesearch implementation?

30G as I reported in my previous email. ;-)

> > From my point of view, yes.  Somehow “filesearch” is a subpart of
> > “search”.  So it should be the machinery.
>
> I'll work on it.
> I'll try to make the code flexible enough so that it
> can be moved to another command easily, should we decide that "search"
> is not the right fit.

UI does not matter so much at this point, I guess.  But the nice final
UI should be:

    guix search --file=

> > For example, I just did “guix pull” and “--list-generation” says from
> > f6dfe42 (Sept. 15) to 4ec2190 (Oct. 10)::
> >
> >     39.9 MB will be download
> >
> > more the tiny bits before “Computing Guix derivation”.  Say 50MB max.
> >
> > Well, the “locate” database for my “/gnu/store” (~30GB) is already to
> > ~50MB, and ~20MB when compressed with gzip.  And Pierre said:
> >
> >     The database will all package descriptions and synopsis is 46 MiB
> >     and compresses down to 11 MiB in zstd.
>
> I should have benchmarked with Lzip, it would have been more useful.  I
> think we can get it down to approximately 8 MiB in Lzip.

Well, I think it will be more with all the items of all the packages.
My point is: the database will be comparable in size with what "guix
pull" already downloads; it is not much but still something.

> > which is better but still something.  Well, it is not affordable to
> > fetch the database with “guix pull”, In My Humble Opinion.
>
> We could send a "diff" of the database.

This means setting something up on the server side, right?  So
implementing the "diff" in "guix publish", right?  Hum?  I feel it is
overcomplicated.

> For instance, if the user already has a file database for the Guix
> generation A, then guix pulls to B, the substitute server can send the
> diff between A and B.  This would probably amount to less than 1 MiB if
> the generations are not too far apart.  (Warning: actual measures needed!)

Well, what is the size for a full /gnu/store/ containing all the
packages of one specific revision?  Sorry if you already provided this
information, I have missed it.
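
To make the "diff between A and B" idea concrete, here is a rough
sketch.  It assumes the database for a generation is nothing more than
a sorted list of store file names, one per line; the real format is not
settled in this thread, and the store paths below are made up.

```shell
# Sketch only: assumes one sorted file name per line; paths are invented.
workdir=$(mktemp -d)

# Hypothetical listings for generations A and B.
printf '%s\n' \
    /gnu/store/aaa-hello-2.10/bin/hello \
    /gnu/store/bbb-sed-4.8/bin/sed \
    | LC_ALL=C sort > "$workdir/A.list"
printf '%s\n' \
    /gnu/store/bbb-sed-4.8/bin/sed \
    /gnu/store/ccc-grep-3.4/bin/grep \
    | LC_ALL=C sort > "$workdir/B.list"

# Lines only in B: entries the server would have to send.
added=$(LC_ALL=C comm -13 "$workdir/A.list" "$workdir/B.list")
# Lines only in A: entries the client could drop.
removed=$(LC_ALL=C comm -23 "$workdir/A.list" "$workdir/B.list")

printf 'added:   %s\nremoved: %s\n' "$added" "$removed"
```

The size of such a diff depends on how many store items change between
the two generations, which is exactly what still needs to be measured.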
> > Therefore, the database would be fetched at the first “guix search”
> > (assuming point above).  But now, how “search” could know what is custom
> > build and what is not?  Somehow, “search” should scan all the store to
> > be able to update the database.
> >
> > And what happens each time I am doing a custom build then “filesearch”.
> > The database should be updated, right?  Well, it seems almost unusable.
>
> I mentioned this previously: we need to update the database on "guix
> build".  This is very fast and would be mostly transparent to the user.
> This is essentially how "guix size" behaves.

Ok.

> > The model “updatedb/locate” seems better.  The user updates “manually”
> > if required and then location is fast.
>
> "manually" is not good in my opinion.  The end-user will inevitably
> forget.  An out-of-sync database would return bad results which is a
> big no-no for search.  On-demand database updates are ideals I think.

The tradeoff is:

 - when is "on-demand"?  When is the database updated?
 - still fast when I search
 - do not slow down other guix subcommands

What you are proposing is:

 - when "guix search --file":
    + if the database does not exist: fetch it
    + otherwise: use it
 - after each "guix build", update the database

Right?

I am still missing the other mechanism for updating the database.

(Note that the "fetch it" could be done at "guix pull" time, which is
more meaningful since pull requires network access, as you said.  And
the real computations for updating could be done at the first "guix
search --file" after the pull.)

> Possibly using a "diff" to shrink the download size.
>
> >   - otherwise: use this database
> >   - optionally update the database if the user wants to include new
> >     custom items.
>
> No need for the optional point I believe.
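
The proposal summarized above (fetch the database if it is missing, use
it otherwise, append to it after each build) could look roughly like
this.  Everything here is an assumption for illustration: the database
location, its one-path-per-line format and the three function names are
hypothetical and not existing Guix code.

```shell
# Sketch only: hypothetical database location and helper functions.
db="$(mktemp -d)/files.db"

# Stand-in for downloading a prebuilt database (e.g. at "guix pull" time).
fetch_db() {
    printf '%s\n' /gnu/store/aaa-hello-2.10/bin/hello > "$db"
}

# Stand-in for the hook run after each "guix build": index one output
# directory by appending its files to the database.
record_build() {
    [ -e "$db" ] || fetch_db
    find "$1" -type f >> "$db"
}

# Stand-in for "guix search --file": fetch the database only if it is
# missing, then do a fixed-string search in it.
search_file() {
    [ -e "$db" ] || fetch_db
    grep -F -- "$1" "$db"
}

search_file hello
```

Appending on each build keeps the search itself cheap, but this sketch
never removes entries, so garbage-collected items would linger until
the database is rebuilt or fetched again.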
Note that since the same code is used on build farms and their store is
several TB (see the recent discussion about "guix gc" on Berlin, which
takes hours), the build and update of the database need some care. :-)

> >> - Find a way to garbage-collect the database(s).  My intuition is that
> >>   we should have 1 database per Guix checkout and when we `guix gc` a
> >>   Guix checkout we collect the corresponding database.
> >
> > Well, the exact same strategy as
> > ~/.config/guix/current/lib/guix/package.cache can be used.
>
> Oh! I didn't know about this file!  What is it used for?

Basically for "--news".  Otherwise, it is used by
"fold-available-packages", "find-packages-by-name" and
"find-packages-by-location".  It is used only if "--load-path" is not
provided (cache-is-authoritative?).  And it is computed at the end of
"guix pull".  The discussions about improving "guix search" were first
about replacing it with an SQL database, then about adding another file
mimicking it, then about extending it (which leads to backward
compatibility issues).

For example, compare:

--8<---------------cut here---------------start------------->8---
time guix package --list-available > /dev/null

real    0m1.025s
user    0m1.866s
sys     0m0.044s

time guix package --list-available -L /tmp/foo > /dev/null

real    0m4.436s
user    0m6.734s
sys     0m0.124s
--8<---------------cut here---------------end--------------->8---

The first uses the cache, the second does not.

Cheers,
simon