unofficial mirror of guix-patches@gnu.org 
From: Timothy Sample <samplet@ngyro.com>
To: zimoun <zimon.toutoune@gmail.com>
Cc: Mathieu Othacehe <othacehe@gnu.org>, 47336@debbugs.gnu.org
Subject: [bug#47336] Disarchive as a fallback for downloads
Date: Tue, 23 Mar 2021 10:31:15 -0400
Message-ID: <87sg4mt00c.fsf@ngyro.com>
In-Reply-To: <86sg4mnreu.fsf@gmail.com> (zimoun's message of "Tue, 23 Mar 2021 10:35:53 +0100")

Hi zimoun,

You make a lot of good points here.  Let me at least provide some quick
answers even if I’m not ready to comment on some of the bigger picture
stuff.

zimoun <zimon.toutoune@gmail.com> writes:

> (CC Mathieu to advise whether this could be a feature of Cuirass.)

So far I have been using Cuirass with only a tiny patch.  I’m not sure
we need anything more than what Cuirass already provides.  (The tiny
patch is for allowing sorting the “latestbuilds” results by “stoptime”
and “id”.  This in turn allows paging through all the builds from the
API.)

> On Tue, 23 Mar 2021 at 00:42, Timothy Sample <samplet@ngyro.com> wrote:
>
>> Now you can ask Guix for a recent .tar.gz source package:
>>
>>     $ ./pre-inst-env guix build --no-substitutes -S python-httpretty
>
> Neat!  Now, there is a way to easily check the coverage, right?  Since
> SWH is ingesting the tarballs using <http://guix.gnu.org/sources.json>,
> there is now a means to report what Guix is able to rebuild.

I’m not sure I fully understand.  Disarchive covers about 4,300 Gzip’ed
tarballs (no XZ yet).  There are about 100 for which compression
parameters cannot be found, and a handful (about 5) that have a
particularly funny idea about what a tarball is.  The metadata builds
for my database started one week ago and have been continuously updating
since then.

Are you asking if we could check what SWH has?  Yes!  Each metadata
file contains the SWHID of the input directory.  You could use
Disarchive to get this value or a simple “grep swhid” would do it.  :)

    $ curl https://disarchive.ngyro.com/sha256/67989614004773db349791c37675efb914d084bdb221356a05e4369c35e7eb62 | grep swhid
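
For anyone who wants more than “grep”, here is a minimal sketch in Python
of the same lookup.  It assumes only that the metadata is plain text and
that the SWHID appears in the standard “swh:1:<type>:<40 hex digits>”
form.  The function name and the sample fragment are made up for
illustration; the fragment is not copied from a real metadata file.

```python
import re

# Core SWHID syntax: "swh:1:<type>:<40 lowercase hex digits>".
SWHID_RE = re.compile(r"swh:1:(?:cnt|dir|rev|rel|snp):[0-9a-f]{40}")


def find_swhids(metadata_text):
    """Return the SWHIDs found in the text, in order, without duplicates."""
    found = []
    for swhid in SWHID_RE.findall(metadata_text):
        if swhid not in found:
            found.append(swhid)
    return found


# Hypothetical fragment, only here to exercise the function.
sample = '(addresses (swhid "swh:1:dir:' + "ab" * 20 + '"))'
print(find_swhids(sample))
```

To use it on real data, feed it the text that the curl command above
prints.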

It would be neat to have a big database of archive coverage from Guix
1.0 through to the present.  It’s quite a big project though.

Of course, you know all about the SWH rate limit....

>>     Checking httpretty-1.0.5 digest... ok
>
> What happens if it is not ok?

For that particular digest, it means the source directory is wrong.
Since we get the source from SWH, it means that the SWH archive is
wrong.  You will have to look elsewhere, I guess (this seems pretty
unlikely).  (There is a vanishing possibility that Disarchive
miscomputed the SWHID and managed to come up with a different, but still
valid SWHID....)

The other digest checks are more likely to fail.  They would indicate
that Disarchive no longer knows how to interpret the metadata.  Maybe
there will be a subtle bug in Disarchive 0.3.0 that causes this.  Either
use an old version of Disarchive or try to fix the current version.  :)
I worry about this, because it would be annoying, but the metadata does
have all the information needed to recover the original archive, so
nothing is really lost (except the user’s time).
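
To make the sequence of checks concrete, here is a toy model in Python.
It is not Disarchive’s code, just a sketch of the idea: each stage
builds the next artifact, and its SHA-256 digest is compared with a
recorded value before moving on.  The names “run_stages” and
“DigestMismatch” are invented for this example, and the “tarball” is
just a stand-in byte string.

```python
import gzip
import hashlib


class DigestMismatch(Exception):
    """Raised with the label of the first stage whose digest is wrong."""


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def run_stages(initial, stages):
    """Run (label, build, expected_sha256) stages in order, checking each."""
    data = initial
    for label, build, expected in stages:
        data = build(data)
        if sha256_hex(data) != expected:
            raise DigestMismatch(label)
        print(f"Checking {label} digest... ok")
    return data


def compress(data):
    # mtime=0 keeps the Gzip header free of the current time.
    return gzip.compress(data, 9, mtime=0)


# Stand-in data; the digests play the role of the recorded metadata.
tar_bytes = b"stand-in for the tarball bytes"
stages = [
    ("example.tar", lambda d: d, sha256_hex(tar_bytes)),
    ("example.tar.gz", compress, sha256_hex(compress(tar_bytes))),
]
result = run_stages(tar_bytes, stages)
```

The exception carries the label of the stage that failed, which is what
tells you whether to suspect the input or the reassembly step.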

>>     Assembling the tarball httpretty-1.0.5.tar
>>     Checking httpretty-1.0.5.tar digest... ok
>>     Assembling the Gzip file httpretty-1.0.5.tar.gz
>>     Checking httpretty-1.0.5.tar.gz digest... ok
>>     Copying result to
>> /gnu/store/kbcnm57y2q1jvhvd8zw1g5vdiwlv19y9-httpretty-1.0.5.tar.gz
>
> Where is the assembly done?  In /tmp/, right?

Yes.

>>     successfully built
>> /gnu/store/k0b3c7kgzyn1nlyhx192pcbcgbfnhnwa-httpretty-1.0.5.tar.gz.drv
>
> Just to be sure, when does Guix check the integrity checksum?  I mean,
> does Guix check the checksum after ’disassemble’ re-assembled the source?

Disarchive checks the result against the metadata to make sure it didn’t
make a mistake.  Guix also checks the final result to make sure the
fixed-output derivation is correct.  A fixed-output derivation is
basically just a checksum with a hint about how the data can be
obtained.  Guix really only cares about the checksum, the hint can do
whatever as long as it produces the result Guix wants.  With this patch
series, Disarchive is part of the hint.
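
As a rough illustration of “a checksum with a hint”, here is a small
Python sketch.  It is a toy model, not how Guix implements fixed-output
derivations: it tries a list of download methods in order and accepts
the first result whose SHA-256 matches the expected value.  All the
names in it (“fetch_fixed_output” and the three stand-in methods) are
hypothetical.

```python
import hashlib


def fetch_fixed_output(expected_sha256, methods):
    """Try each (name, method) in order; accept the first matching result."""
    for name, method in methods:
        try:
            data = method()
        except Exception:
            continue  # This method failed; fall through to the next one.
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return name, data
    raise RuntimeError("no method produced the expected content")


# Stand-ins for an upstream URL, a bad mirror, and a Disarchive fallback.
wanted = b"original tarball bytes"


def upstream():
    raise IOError("upstream is gone")


def bad_mirror():
    return b"something else entirely"


def disarchive_fallback():
    return wanted


name, data = fetch_fixed_output(
    hashlib.sha256(wanted).hexdigest(),
    [("upstream", upstream),
     ("mirror", bad_mirror),
     ("disarchive", disarchive_fallback)],
)
print(name)
```

Only the hash decides whether a result is accepted; which method
produced it does not matter.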

>> First, it looks up the metadata on my server.  This is fine for a demo,
>> but not what we want forever.  The patch series supports adding
>> several
>
> As we discussed before, how does the database scale?  Do you have some
> numbers for the current demo?  They would help to extrapolate what it
> means for a server to «store the metadata».

With “gzip -9”, the average metadata file is 6.8 KiB.  It’s pretty
manageable.  There’s room for improvement on the Disarchive side, too.
It still stores some redundant information.  Uncompressed, it’s more
like 112 KiB per file.  This is still pretty okay, really.  It means we
might hit tens of GiB over a couple years.  (It would take just over
100 GiB to store a million uncompressed metadata files.)  The compression
ratio is what drove me to skip Git for now.
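
The arithmetic behind those figures, as a short Python sketch.  The
averages and the file counts come from this message; the helper name
“total_gib” is just for illustration.

```python
KIB_PER_GIB = 1024 ** 2


def total_gib(n_files, avg_kib):
    """Total size in GiB of n_files files averaging avg_kib KiB each."""
    return n_files * avg_kib / KIB_PER_GIB


# A million metadata files, uncompressed (112 KiB each on average).
print(round(total_gib(1_000_000, 112), 1))      # 106.8

# The same million files with "gzip -9" (6.8 KiB each on average).
print(round(total_gib(1_000_000, 6.8), 1))      # 6.5

# The roughly 4,300 tarballs covered today, compressed, in MiB.
print(round(total_gib(4_300, 6.8) * 1024, 1))   # 28.6
```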

>> mirrors for looking up the metadata.  In the past, we talked about
>> putting everything on one or a few of the big Git hosting platforms like
>> GitHub or Gitlab.  That way, it would be easily picked up by SWH and
>> archived “forever”.  Right now, I have Cuirass set up to build the
>> metadata, and a little script that moves it from the build server to my
>> Web server.  It would be simple enough to adjust that script to push it
>> to a remote Git repo.  (Of course, the next step is to move this setup
>> to Guix infrastructure.)  Thoughts?
>
> Maybe this database could be a package, say “guix-tarball-db”, updated
> in step with the package “guix”.  The source of “guix-tarball-db”
> would live on a big remote Git hosting platform like GitHub or
> whatever, rather than on Guix infrastructure, or maybe it would be
> stored on Guix infra after all.
>
> Regularly, i.e., whenever the package “guix” is updated, the package
> “guix-tarball-db” is updated at the same time.  “guix lint -c
> archival” sends the save request to SWH, even if this save request
> should be automated soon. :-)
>
> It would then help if Cuirass had a feature to disassemble and update
> the Git repo.
>
> Last, a service should run like your demo.  But in the long term, this
> service could disappear, assuming SWH does not :-).  Therefore, we could
> imagine installing “guix-tarball-db” and then tweaking some parameters of
> guix-daemon and “guix build <foo>”.  Both installing and building would
> fetch from SWH if both upstreams disappear.
>
> Or this “guix-tarball-db” should not be a plain package but only an
> input, as an origin, for the package “guix”.

This is an interesting idea, but one that I would have to think about
more.  :)


-- Tim



