From: "Ludovic Courtès" <ludo@gnu.org>
To: zimoun <zimon.toutoune@gmail.com>
Cc: 42162@debbugs.gnu.org, "Maurice Brémond" <Maurice.Bremond@inria.fr>
Subject: bug#42162: Recovering source tarballs
Date: Mon, 20 Jul 2020 10:39:06 +0200
Message-ID: <87365mzil1.fsf@gnu.org>
In-Reply-To: <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@mail.gmail.com> (zimoun's message of "Wed, 15 Jul 2020 18:55:21 +0200")
Hi!
There are many, many comments in your message, so I took the liberty of
replying only to the essence of it. :-)
zimoun <zimon.toutoune@gmail.com> skribis:
> On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> For now, since 70% of our packages use ‘url-fetch’, we need to be
>> able to fetch or to reconstruct tarballs. There’s no way around it.
>
> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could be "git-fetch". Today the source is over url-fetch but it could
> be over git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@git.bioconductor.org:packages/flowCore.
>
> Another example is the packages in gnu/packages/emacs-xyz.scm: the
> ones from elpa.gnu.org are "url-fetch" and could be "git-fetch", for
> example using
> http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD
>
> So I would be more reserved about the "no way around it". :-) I mean
> the 70% could be a bit mitigated.
The “no way around it” was about the situation today: it’s a fact that
70% of packages are built from tarballs, so we need to be able to fetch
them or reconstruct them.
However, the two examples above are good ideas as to the way forward: we
could start a url-fetch-to-git-fetch migration in these two cases, and
perhaps more.
>> In the short term, we should arrange so that the build farm keeps GC
>> roots on source tarballs for an indefinite amount of time. Cuirass
>> jobset? Mcron job to preserve GC roots? Ideas?
>
> Yes, preserving source tarballs for an indefinite amount of time will
> help. At least all the packages where "lookup-content" returns #f,
> which means they are not in SWH or they are unreachable -- the two
> cases are equivalent from Guix's side.
>
> What about pushing to IPFS in addition? Feasible? Lookup issue?
Lookup issue. :-) The hash in a CID is not just a raw blob hash.
Files are typically chunked beforehand, assembled as a Merkle tree, and
the CID is roughly the hash of the tree root. So it would seem we can’t
use IPFS as-is for tarballs.
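
To illustrate the lookup problem, here is a minimal sketch comparing a raw
blob hash with the root of a toy Merkle tree built over chunks of the same
data. It is not the actual IPFS format (which involves UnixFS, DAG-PB and
multihash encoding); the chunk size, the pairing scheme and the function
names are assumptions made for illustration only.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose; real systems use far larger chunks


def raw_hash(data: bytes) -> str:
    """Plain SHA-256 over the whole blob, as a tarball hash would be."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(data: bytes, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash each chunk, then hash pairs of hashes until one root remains."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    level = [hashlib.sha256(c).digest() for c in chunks]
    while len(level) > 1:
        paired = []
        for i in range(0, len(level), 2):
            # An odd node at the end of a level is hashed on its own.
            paired.append(hashlib.sha256(b"".join(level[i:i + 2])).digest())
        level = paired
    return level[0].hex()


blob = b"pretend this is a source tarball"
print("raw sha256 :", raw_hash(blob))
print("merkle root:", merkle_root(blob))
```

The two digests differ, and the root also depends on the chunking
parameters, so knowing only the tarball's sha256 is not enough to compute
the identifier to ask for.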
>> For the future, we could store nar hashes of unpacked tarballs instead
>> of hashes over tarballs. But that raises two questions:
>>
>> • If we no longer deal with tarballs but upstreams keep signing
>> tarballs (not raw directory hashes), how can we authenticate our
>> code after the fact?
>
> Does Guix automatically authenticate code using signed tarballs?
Not automatically; packagers are supposed to authenticate code when they
add a package (‘guix refresh -u’ does that automatically).
>> • SWH internally stores Git-tree hashes, not nar hashes, so we still
>> wouldn’t be able to fetch our unpacked trees from SWH.
>>
>> (Both issues were previously discussed at
>> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>>
>> So for the medium term, and perhaps for the future, a possible option
>> would be to preserve tarball metadata so we can reconstruct them:
>>
>> tarball = metadata + tree
>
> There are different issues at different levels:
>
> 1. how to look up? what information do we need to keep/store to be able
> to query SWH?
> 2. how to check the integrity? what information do we need to
> keep/store to be able to verify that SWH returns what Guix expects?
> 3. how to authenticate? where the tarball metadata has to be stored if
> SWH removes it?
>
> Basically, the git-fetch source stores 3 identifiers:
>
> - upstream url
> - commit / tag
> - integrity (sha256)
>
> Fetching from SWH requires the commit only (lookup-revision) or the
> tag+url (lookup-origin-revision) then from the returned revision, the
> integrity of the downloaded data is checked using the sha256, right?
Yes.
> Therefore, one way to fix lookup of the url-fetch source is to add an
> extra field mimicking the commit role.
But today, we store tarball hashes, not directory hashes.
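
The difference can be shown with a small sketch: two tarballs holding the
same files but different timestamps have different tarball hashes, while a
hash computed over the unpacked tree is unchanged. The `tree_hash` helper
below is a toy stand-in, neither a nar hash nor a Git-tree hash, and all
names are hypothetical.

```python
import hashlib
import io
import tarfile


def make_tarball(files: dict, mtime: int) -> bytes:
    """Build an uncompressed tar in memory with the same mtime on every member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def unpack(tarball: bytes) -> dict:
    """Return {name: contents} for the regular files in the tarball."""
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


def tree_hash(files: dict) -> str:
    """Toy directory hash: covers names and contents only, no tar metadata."""
    h = hashlib.sha256()
    for name in sorted(files):
        h.update(name.encode() + b"\0" + hashlib.sha256(files[name]).digest())
    return h.hexdigest()


files = {"hello/README": b"hi\n", "hello/main.c": b"int main(void) { return 0; }\n"}
tar_a = make_tarball(files, mtime=0)
tar_b = make_tarball(files, mtime=1)

same_tarball_hash = hashlib.sha256(tar_a).digest() == hashlib.sha256(tar_b).digest()
same_tree_hash = tree_hash(unpack(tar_a)) == tree_hash(unpack(tar_b))
print("tarball hashes equal:", same_tarball_hash)
print("tree hashes equal:   ", same_tree_hash)
```

This is why a stored tarball hash cannot be checked against a directory
fetched from an archive unless the tar metadata is also available.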
> The easiest is to store a SWHID or an identifier from which the SWHID
> can be deduced.
>
> I have not checked the code, but something like this:
>
> https://pypi.org/project/swh.model/
> https://forge.softwareheritage.org/source/swh-model/
>
> and at package time, this identifier is added, similarly to integrity.
I’m skeptical about adding a field that is practically never used.
[...]
>> The code below can “disassemble” and “assemble” a tar. When it
>> disassembles it, it generates metadata like this:
>
> [...]
>
>> The ’assemble-archive’ procedure consumes that, looks up file contents
>> by hash on SWH, and reconstructs the original tarball…
>
> Where do you plan to store the "disassembled" metadata?
> And where do you plan to "assemble-archive"?
We’d have a repo/database containing metadata indexed by tarball sha256.
> How should this database that maps tarball hashes to metadata be
> maintained? Git push hook? Cron task?
Yes, something like that. :-)
> What about foreign channels? Should they maintain their own map?
Yes, presumably.
> To summarize, it would work like this, right?
>
> at package time:
> - store an integrity identifier (today sha256-nix-base32)
> - disassemble the tarball
> - commit to another repo the metadata using the path (address)
> sha256/base32/<identifier>
> - push to packages-repo *and* metadata-database-repo
>
> at future time: (upstream has disappeared, say!)
> - use the integrity identifier to query the database repo
> - look up the SWHID from the database repo
> - fetch the data from SWH
> - or look up the IPFS identifier from the database repo and fetch the
> data from IPFS, for another example
> - re-assemble the tarball using the metadata from the database repo
> - check integrity, authentication, etc.
That’s the idea.
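
A rough sketch of the round trip summarized above, under strong
simplifying assumptions: an uncompressed ustar archive containing only
regular files, metadata kept as a list of Python dicts (not the sexp
format discussed in this thread), and a local dict standing in for the
content lookup that would go to SWH. The names `disassemble`, `assemble`
and `fetch` are hypothetical, as is the choice of sha256 for file
contents.

```python
import hashlib
import io
import tarfile

FIELDS = ("name", "mode", "uid", "gid", "mtime", "uname", "gname")


def disassemble(tarball: bytes):
    """Split a tarball into (metadata, store); store maps content hash to bytes."""
    metadata, store = [], {}
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:") as tar:
        for member in tar:
            if not member.isfile():
                raise ValueError("this sketch handles regular files only")
            data = tar.extractfile(member).read()
            digest = hashlib.sha256(data).hexdigest()
            store[digest] = data
            entry = {field: getattr(member, field) for field in FIELDS}
            entry["sha256"] = digest
            metadata.append(entry)
    return metadata, store


def assemble(metadata, fetch) -> bytes:
    """Rebuild the tarball from metadata, fetching file contents by hash."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for entry in metadata:
            data = fetch(entry["sha256"])
            info = tarfile.TarInfo(entry["name"])
            for field in FIELDS[1:]:
                setattr(info, field, entry[field])
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# "Package time": build a tarball, record its hash, disassemble it.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for name, data in [("pkg-1.0/README", b"hello\n"), ("pkg-1.0/configure", b"#!/bin/sh\n")]:
        info = tarfile.TarInfo(name)
        info.size, info.mtime, info.mode = len(data), 1594000000, 0o755
        tar.addfile(info, io.BytesIO(data))
original = buf.getvalue()
expected = hashlib.sha256(original).hexdigest()
metadata, store = disassemble(original)

# "Future time": upstream is gone; rebuild from metadata and check integrity.
rebuilt = assemble(metadata, store.__getitem__)
print("integrity check passed:", hashlib.sha256(rebuilt).hexdigest() == expected)
```

Real upstream tarballs are usually compressed and written by other tar
implementations, so bit-for-bit reconstruction would need more metadata
than this sketch records (header format details, padding, compression
parameters).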
> The format of metadata (disassemble) that you propose is schemish
> (obviously! :-)) but we could propose something more JSON-like.
Sure, if that helps get other people on board, why not (though sexps
have lived much longer than JSON and XML together :-)).
Thanks,
Ludo’.