From: zimoun <zimon.toutoune@gmail.com>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 42162@debbugs.gnu.org, "Maurice Brémond" <Maurice.Bremond@inria.fr>
Subject: bug#42162: Recovering source tarballs
Date: Wed, 15 Jul 2020 18:55:21 +0200
Message-ID: <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@mail.gmail.com>
In-Reply-To: <87r1tit5j6.fsf_-_@gnu.org>

Hi Ludo,

Well, you are widening the discussion beyond the issue of the 5
url-fetch packages on gforge.inria.fr :-)


First of all, you wrote [1] ``Migration away from tarballs is already
happening as more and more software is distributed straight from
content-addressed VCS repositories, though progress has been relatively
slow since we first discussed it in 2016.''  On the other hand, Guix
uses "url-fetch" more often than not [2], even when "git-fetch" is
available upstream.  In other words, I am not convinced the migration is
really happening...

The issue would be mitigated if Guix transitioned from "url-fetch" to
"git-fetch" whenever possible.

1: https://forge.softwareheritage.org/T2430#45800
2: https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html


Second, trying to do some stats about the SWH coverage, I note that a
non-negligible number of "url-fetch" sources are reachable by
"lookup-content".  Measuring the coverage is not straightforward because
of the rate limit of 120 requests per hour and unexpected server errors.
Another story.

Well, I would like to have numbers because I do not know what the issue
concretely is: how many "url-fetch" packages are reachable?  And if they
are unreachable, is it because they are not archived yet?  Or is it
because Guix does not have enough info to look them up?
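
To make the "reachable by lookup-content" part concrete, here is a
minimal sketch of the key such a query could use.  It assumes the SWH
content endpoint has the form /api/1/content/sha256:<hex>/ under
https://archive.softwareheritage.org; the function names are
hypothetical, and nothing is sent over the network, so the rate limit
is not touched.

```python
import hashlib

# Assumption: base URL and endpoint layout of the SWH Web API.
SWH_API = "https://archive.softwareheritage.org/api/1"


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 of DATA as lowercase hexadecimal."""
    return hashlib.sha256(data).hexdigest()


def content_lookup_url(tarball: bytes) -> str:
    """Build the URL one would query to ask whether SWH holds this exact file."""
    return f"{SWH_API}/content/sha256:{sha256_hex(tarball)}/"


if __name__ == "__main__":
    # Illustrative bytes only, not a real tarball.
    print(content_lookup_url(b"example tarball bytes"))
```

Note that Guix records the hash in nix-base32, so a real survey would
first have to convert it to hexadecimal; that step is left out here.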


On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:

> For the now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs.  There’s no way around it.

Yes, but for example all the packages in gnu/packages/bioconductor.scm
could use "git-fetch".  Today their source is fetched with url-fetch,
but it could be fetched with git-fetch from
https://git.bioconductor.org/packages/flowCore or
git@git.bioconductor.org:packages/flowCore.

Another example is gnu/packages/emacs-xyz.scm: the packages coming from
elpa.gnu.org use "url-fetch" and could use "git-fetch", for example
via
http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD

So I would be more reserved about the "no way around it". :-)  I mean
the 70% could be reduced somewhat.


> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time.  Cuirass
> jobset?  Mcron job to preserve GC roots?  Ideas?

Yes, preserving source tarballs for an indefinite amount of time will
help.  At least for all the packages where "lookup-content" returns #f,
which means they are not in SWH or they are unreachable -- both are
equivalent from Guix's side.

What about pushing to IPFS in addition?  Feasible?  Lookup issue?

> For the future, we could store nar hashes of unpacked tarballs instead
> of hashes over tarballs.  But that raises two questions:
>
>   • If we no longer deal with tarballs but upstreams keep signing
>     tarballs (not raw directory hashes), how can we authenticate our
>     code after the fact?

Does Guix automatically authenticate code using signed tarballs?


>   • SWH internally store Git-tree hashes, not nar hashes, so we still
>     wouldn’t be able to fetch our unpacked trees from SWH.
>
> (Both issues were previously discussed at
> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree

There are different issues at different levels:

 1. how to look up? what information do we need to keep/store to be able
    to query SWH?
 2. how to check the integrity? what information do we need to
    keep/store to be able to verify that SWH returns what Guix expects?
 3. how to authenticate? where does the tarball metadata have to be
    stored if SWH removes it?

Basically, the git-fetch source stores 3 identifiers:

 - upstream url
 - commit / tag
 - integrity (sha256)

Fetching from SWH requires the commit only (lookup-revision) or the
tag+url (lookup-origin-revision); then, from the returned revision, the
integrity of the downloaded data is checked using the sha256, right?
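
As a rough illustration of that choice (a sketch only, not the actual
Guix code): the GitSource record and swh_lookup_plan below are
hypothetical names, and treating any 40-character hexadecimal string
as a commit id is an assumption on my side.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GitSource:
    """The three identifiers a git-fetch source stores."""
    url: str      # upstream repository URL
    commit: str   # full commit id, or a tag name
    sha256: str   # integrity hash, checked after download


# Assumption: a full commit id is 40 lowercase hexadecimal characters.
_COMMIT_ID = re.compile(r"^[0-9a-f]{40}$")


def swh_lookup_plan(src: GitSource) -> Tuple[str, Tuple[str, ...]]:
    """Pick the lookup to try and the arguments it needs."""
    if _COMMIT_ID.match(src.commit):
        # A commit id is enough on its own; the URL is not needed.
        return ("lookup-revision", (src.commit,))
    # A tag only makes sense relative to the origin it was taken from.
    return ("lookup-origin-revision", (src.url, src.commit))
```

In both cases the sha256 plays no role in the lookup; it is only used
afterwards to check what came back.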

Therefore, one way to fix the lookup of url-fetch sources is to add an
extra field mimicking the role of the commit.

The easiest is to store a SWHID, or an identifier from which the SWHID
can be deduced.

I have not checked the code, but something like this:

  https://pypi.org/project/swh.model/
  https://forge.softwareheritage.org/source/swh-model/

and at packaging time this identifier would be added, just like the
integrity hash.
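
To give an idea of what "deducing the SWHID" means, here is a minimal
sketch for the simplest case, a single file ("content" object), whose
identifier is the Git blob hash prefixed by "swh:1:cnt:".  This is my
own illustration with hashlib, not the swh.model code, and the function
names are hypothetical.  Directory identifiers, which is what an
extracted tarball would need, are more involved and are not covered.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Hash DATA the way Git hashes a blob: SHA-1 over 'blob <size>\\0' + data."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a content object holding DATA."""
    return "swh:1:cnt:" + git_blob_sha1(data)


if __name__ == "__main__":
    # The empty file has a well-known Git blob hash.
    print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```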

As an aside, does Guix use the authentication metadata that tarballs
provide?


( BTW, I failed [3,4] to package swh.model, so if someone wants to give
it a try...
3: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00158.html
4: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00161.html )


> After all, tarballs are byproducts and should be no exception: we should
> build them from source.  :-)

[...]

> The code below can “disassemble” and “assemble” a tar.  When it
> disassembles it, it generates metadata like this:

[...]

> The ’assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…

Where do you plan to store the "disassembled" metadata?
And where do you plan to "assemble-archive"?

I mean,

 What is pushed to SWH? And how?
 What is fetched from SWH? And how?

(Well, answer below. :-))

> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:

Yes, it is 120 requests per hour and 10 save requests per hour.  Well, I
do not think they will raise these limits much in general.  However,
they seem open to doing so for specific machines.  So, I do not want to
speak for them, but we could ask for a higher rate limit for
ci.guix.gnu.org, for example.  Then we need to distinguish between
source substitutes and binary substitutes.  And basically, when a user
runs "guix build foo", if the source is available neither upstream nor
already on ci.guix.gnu.org, then ci.guix.gnu.org fetches the missing
source from SWH and delivers it to the user.


>   https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree—a Git-tree hash—since that would
> allow us to retrieve the whole thing in a single HTTP request.

Well, the limited resources of SWH are an issue, but SWH is an archive,
not a mirror. :-)

And as I wrote above, we could ask SWH to increase the rate limit for
specific machines such as ci.guix.gnu.org.
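
Just to put a number on it, a back-of-the-envelope sketch of the
per-file approach under the current limit.  It assumes the limit
behaves as a fixed window of 120 lookups per hour and one lookup per
file; the function names and the 1,000-file tarball are only
illustrative.

```python
import math

LOOKUPS_PER_HOUR = 120  # rate limit quoted in this thread


def windows_needed(requests: int, limit: int = LOOKUPS_PER_HOUR) -> int:
    """Number of hourly windows needed to issue REQUESTS lookups."""
    if requests < 0 or limit <= 0:
        raise ValueError("requests must be >= 0 and limit must be > 0")
    return math.ceil(requests / limit)


def minimum_wait_hours(requests: int, limit: int = LOOKUPS_PER_HOUR) -> int:
    """Hours spent waiting for new windows; the first window starts at once."""
    return max(windows_needed(requests, limit) - 1, 0)


if __name__ == "__main__":
    files = 1000  # hypothetical tarball with 1,000 files
    print(windows_needed(files), minimum_wait_hours(files))  # 9 8
```

So a single tarball of that size would already keep one machine waiting
for hours, which is why one request for the whole tree, or a higher
limit for ci.guix.gnu.org, matters.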


> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!).  A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above.  The nice thing is that the Git repo itself
> could be archived by SWH.  :-)

How should this database that maps tarball hashes to metadata be
maintained?  Git push hook?  Cron task?

What about foreign channels?  Should they maintain their own map?

To summarize, it would work like this, right?

at package time:
 - store an integrity identifier (today sha256-nix-base32)
 - disassemble the tarball
 - commit to another repo the metadata using the path (address)
   sha256/base32/<identifier>
 - push to packages-repo *and* metadata-database-repo

at future time: (upstream has disappeared, say!)
 - use the integrity identifier to query the database repo
 - lookup the SWHID from the database repo
 - fetch the data from SWH
 - or lookup the IPFS identifier from the database repo and fetch the
   data from IPFS, for another example
 - re-assemble the tarball using the metadata from the database repo
 - check integrity, authentication, etc.
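
The "future time" half can be sketched as follows.  Everything here is
a stand-in: the database is a plain dict keyed by the path above, and
the fetchers (SWH, IPFS, ...), "assemble" and "verify" are hypothetical
callables passed in by the caller, not existing Guix procedures.

```python
from typing import Callable, Dict, Iterable, Optional

# A fetcher takes the stored metadata and returns the tree data, or None.
Fetcher = Callable[[dict], Optional[bytes]]


def metadata_path(base32_hash: str) -> str:
    """Address of the metadata in the database repo."""
    return f"sha256/base32/{base32_hash}"


def recover_tarball(
    base32_hash: str,
    database: Dict[str, dict],
    fetchers: Iterable[Fetcher],
    assemble: Callable[[dict, bytes], bytes],
    verify: Callable[[bytes, str], bool],
) -> Optional[bytes]:
    """Try each fall-back in turn; return the first tarball that verifies."""
    metadata = database.get(metadata_path(base32_hash))
    if metadata is None:
        return None  # nothing recorded at package time
    for fetch in fetchers:
        tree = fetch(metadata)
        if tree is None:
            continue  # this archive does not have it, try the next one
        tarball = assemble(metadata, tree)
        if verify(tarball, base32_hash):
            return tarball
    return None
```

The point of the design is that the fall-backs are only a list: adding
IPFS or anything else does not change the package definition, only the
metadata stored in the database repo.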

Well, right, it is better than only adding a lookup identifier as I
described above, because it is more general and flexible than having
SWH as the only fall-back.

The format of metadata (disassemble) that you propose is schemish
(obviously! :-)) but we could propose something more JSON-like.


All the best,
simon



