all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: "Ludovic Courtès" <ludo@gnu.org>
To: zimoun <zimon.toutoune@gmail.com>
Cc: 42162@debbugs.gnu.org, "Maurice Brémond" <Maurice.Bremond@inria.fr>
Subject: bug#42162: Recovering source tarballs
Date: Mon, 20 Jul 2020 10:39:06 +0200	[thread overview]
Message-ID: <87365mzil1.fsf@gnu.org> (raw)
In-Reply-To: <CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@mail.gmail.com> (zimoun's message of "Wed, 15 Jul 2020 18:55:21 +0200")

Hi!

There are many many comments in your message, so I took the liberty to
reply only to the essence of it.  :-)

zimoun <zimon.toutoune@gmail.com> skribis:

> On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> For the now, since 70% of our packages use ‘url-fetch’, we need to be
>> able to fetch or to reconstruct tarballs.  There’s no way around it.
>
> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could be "git-fetch".  Today the source is over url-fetch but it could
> be over git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@git.bioconductor.org:packages/flowCore.
>
> Another example is the packages in gnu/packages/emacs-xyz.scm and the
> ones from elpa.gnu.org are "url-fetch" and could be "git-fetch", for
> example using
> http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD
>
> So I would be more reserved about the "no way around it". :-)  I mean
> the 70% could be a bit mitigated.

The “no way around it” was about the situation today: it’s a fact that
70% of packages are built from tarballs, so we need to be able to fetch
them or reconstruct them.

However, the two examples above are good ideas as to the way forward: we
could start a url-fetch-to-git-fetch migration in these two cases, and
perhaps more.

>> In the short term, we should arrange so that the build farm keeps GC
>> roots on source tarballs for an indefinite amount of time.  Cuirass
>> jobset?  Mcron job to preserve GC roots?  Ideas?
>
> Yes, preserving source tarballs for an indefinite amount of time will
> help.  At least all the packages where "lookup-content" returns #f,
> which means they are not in SWH or they are unreachable -- both is
> equivalent from Guix side.
>
> What about in addition push to IPFS?  Feasible?  Lookup issue?

Lookup issue.  :-)  The hash in a CID is not just a raw blob hash.
Files are typically chunked beforehand, assembled as a Merkle tree, and
the CID is roughly the hash to the tree root.  So it would seem we can’t
use IPFS as-is for tarballs.

>> For the future, we could store nar hashes of unpacked tarballs instead
>> of hashes over tarballs.  But that raises two questions:
>>
>>   • If we no longer deal with tarballs but upstreams keep signing
>>     tarballs (not raw directory hashes), how can we authenticate our
>>     code after the fact?
>
> Does Guix automatically authenticate code using signed tarballs?

Not automatically; packagers are supposed to authenticate code when they
add a package (‘guix refresh -u’ does that automatically).

>>   • SWH internally store Git-tree hashes, not nar hashes, so we still
>>     wouldn’t be able to fetch our unpacked trees from SWH.
>>
>> (Both issues were previously discussed at
>> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>>
>> So for the medium term, and perhaps for the future, a possible option
>> would be to preserve tarball metadata so we can reconstruct them:
>>
>>   tarball = metadata + tree
>
> There is different issues at different levels:
>
>  1. how to lookup? what information do we need to keep/store to be able
>     to query SWH?
>  2. how to check the integrity? what information do we need to
>     keep/store to be able to verify that SWH returns what Guix expects?
>  3. how to authenticate? where the tarball metadata has to be stored if
>     SWH removes it?
>
> Basically, the git-fetch source stores 3 identifiers:
>
>  - upstream url
>  - commit / tag
>  - integrity (sha256)
>
> Fetching from SWH requires the commit only (lookup-revision) or the
> tag+url (lookup-origin-revision) then from the returned revision, the
> integrity of the downloaded data is checked using the sha256, right?

Yes.

> Therefore, one way to fix lookup of the url-fetch source is to add an
> extra field mimicking the commit role.

But today, we store tarball hashes, not directory hashes.

> The easiest is to store a SWHID or an identifier allowing to deduce the
> SWHID.
>
> I have not checked the code, but something like this:
>
>   https://pypi.org/project/swh.model/
>   https://forge.softwareheritage.org/source/swh-model/
>
> and at package time, this identifier is added, similarly to integrity.

I’m skeptical about adding a field that is practically never used.

[...]

>> The code below can “disassemble” and “assemble” a tar.  When it
>> disassembles it, it generates metadata like this:
>
> [...]
>
>> The ’assemble-archive’ procedure consumes that, looks up file contents
>> by hash on SWH, and reconstructs the original tarball…
>
> Where do you plan to store the "disassembled" metadata?
> And where do you plan to "assemble-archive"?

We’d have a repo/database containing metadata indexed by tarball sha256.

> How this database that maps tarball hashes to metadata should be
> maintained?  Git push hook?  Cron task?

Yes, something like that.  :-)

> What about foreign channels?  Should they maintain their own map?

Yes, presumably.

> To summary, it would work like this, right?
>
> at package time:
>  - store an integrity identiter (today sha256-nix-base32)
>  - disassemble the tarball
>  - commit to another repo the metadata using the path (address)
>    sha256/base32/<identitier>
>  - push to packages-repo *and* metadata-database-repo
>
> at future time: (upstream has disappeared, say!)
>  - use the integrity identifier to query the database repo
>  - lookup the SWHID from the database repo
>  - fetch the data from SWH
>  - or lookup the IPFS identifier from the database repo and fetch the
>    data from IPFS, for another example
>  - re-assemble the tarball using the metadata from the database repo
>  - check integrity, authentication, etc.

That’s the idea.

> The format of metadata (disassemble) that you propose is schemish
> (obviously! :-)) but we could propose something more JSON-like.

Sure, if that helps get other people on-board, why not (though sexps
have lived much longer than JSON and XML together :-)).

Thanks,
Ludo’.




  reply	other threads:[~2020-07-20  8:40 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-02  7:29 bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Ludovic Courtès
2020-07-02  8:50 ` zimoun
2020-07-02 10:03   ` Ludovic Courtès
2020-07-11 15:50     ` bug#42162: Recovering source tarballs Ludovic Courtès
2020-07-13 19:20       ` Christopher Baines
2020-07-20 21:27         ` zimoun
2020-07-15 16:55       ` zimoun
2020-07-20  8:39         ` Ludovic Courtès [this message]
2020-07-20 15:52           ` zimoun
2020-07-20 17:05             ` Dr. Arne Babenhauserheide
2020-07-20 19:59               ` zimoun
2020-07-21 21:22             ` Ludovic Courtès
2020-07-22  0:27               ` zimoun
2020-07-22 10:28                 ` Ludovic Courtès
2020-08-03 21:10         ` Ricardo Wurmus
2020-07-30 17:36       ` Timothy Sample
2020-07-31 14:41         ` Ludovic Courtès
2020-08-03 16:59           ` Timothy Sample
2020-08-05 17:14             ` Ludovic Courtès
2020-08-05 18:57               ` Timothy Sample
2020-08-23 16:21                 ` Ludovic Courtès
2020-11-03 14:26                 ` Ludovic Courtès
2020-11-03 16:37                   ` zimoun
2020-11-03 19:20                   ` Timothy Sample
2020-11-04 16:49                     ` Ludovic Courtès
2022-09-29  0:32                       ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Maxim Cournoyer
2022-09-29 10:56                         ` zimoun
2022-09-29 15:00                           ` Ludovic Courtès
2022-09-30  3:10                             ` Maxim Cournoyer
2022-09-30 12:13                               ` zimoun
2022-10-01 22:04                                 ` Ludovic Courtès
2022-10-03 15:20                                 ` Maxim Cournoyer
2022-10-04 21:26                                   ` Ludovic Courtès
2022-09-30 18:17                               ` Maxime Devos
2020-08-26 10:04         ` bug#42162: Recovering source tarballs zimoun
2020-08-26 21:11           ` Timothy Sample
2020-08-27  9:41             ` zimoun
2020-08-27 12:49               ` Ludovic Courtès
2020-08-27 18:06               ` Bengt Richter
2021-01-10 19:32 ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Maxim Cournoyer
2021-01-13 10:39   ` Ludovic Courtès
2021-01-13 12:27     ` Andreas Enge
2021-01-13 15:07     ` Andreas Enge
     [not found] ` <handler.42162.D42162.16105343699609.notifdone@debbugs.gnu.org>
2021-01-13 14:28   ` Ludovic Courtès
2021-01-14 14:21     ` Maxim Cournoyer
2021-10-04 15:59     ` bug#42162: gforge.inria.fr is off-line Ludovic Courtès
2021-10-04 17:50       ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 zimoun
2021-10-07 16:07         ` Ludovic Courtès
2021-10-09 17:29           ` raingloom
2021-10-11  8:41           ` zimoun
2021-10-12  9:24             ` Ludovic Courtès
2021-10-12 10:50               ` zimoun
2021-10-12 16:04                 ` Substitute retention Ludovic Courtès
2021-10-12 18:06                   ` zimoun
2021-10-15  9:27                     ` Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87365mzil1.fsf@gnu.org \
    --to=ludo@gnu.org \
    --cc=42162@debbugs.gnu.org \
    --cc=Maurice.Bremond@inria.fr \
    --cc=zimon.toutoune@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.