Hi,

Ludovic Courtès skribis:

> There’s this other discussion you mentioned, which I hope will have a
> positive outcome:
>
>   https://forge.softwareheritage.org/T2430

This discussion, as well as discussions on #swh-devel, has made it
clear that SWH will not archive raw tarballs, at least not in the
foreseeable future.  Instead, it will keep archiving the contents of
tarballs, as it has always done—that’s already a huge service.

Not storing raw tarballs makes sense from an engineering perspective,
but it does mean that we cannot rely on SWH as a content-addressed
mirror for tarballs.  (In fact, some raw tarballs are available on SWH,
but that’s mostly “by chance”, for instance because they appear as-is
in a Git repo that was ingested.)  This is one of the challenges
mentioned in .

So we need a solution for now (and quite urgently), and a solution for
the future.

For the now, since 70% of our packages use ‘url-fetch’, we need to be
able to fetch or to reconstruct tarballs.  There’s no way around it.

In the short term, we should arrange so that the build farm keeps GC
roots on source tarballs for an indefinite amount of time.  Cuirass
jobset?  Mcron job to preserve GC roots (see the sketch below)?  Ideas?

For the future, we could store nar hashes of unpacked tarballs instead
of hashes over tarballs.  But that raises two questions:

  • If we no longer deal with tarballs but upstreams keep signing
    tarballs (not raw directory hashes), how can we authenticate our
    code after the fact?

  • SWH internally stores Git-tree hashes, not nar hashes, so we still
    wouldn’t be able to fetch our unpacked trees from SWH.

(Both issues were previously discussed at .)

So for the medium term, and perhaps for the future, a possible option
would be to preserve tarball metadata so we can reconstruct them:

  tarball = metadata + tree

After all, tarballs are byproducts and should be no exception: we
should build them from source.  :-)

In , Stefano mentioned pristine-tar, which does almost that, but not
quite: it stores a binary delta between a tarball and a tree:

  https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html

I think we should have something more transparent than a binary delta.

The code below can “disassemble” and “assemble” a tar.  When it
disassembles it, it generates metadata like this:

--8<---------------cut here---------------start------------->8---
(tar-source
  (version 0)
  (headers
    (("guile-3.0.4/" (mode 493) (size 0) (mtime 1593007723)
      (chksum 3979) (typeflag #\5))
     ("guile-3.0.4/m4/" (mode 493) (size 0) (mtime 1593007720)
      (chksum 4184) (typeflag #\5))
     ("guile-3.0.4/m4/pipe2.m4" (mode 420) (size 531) (mtime 1536050419)
      (chksum 4812)
      (hash (sha256 "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
     ("guile-3.0.4/m4/time_h.m4" (mode 420) (size 5471) (mtime 1536050419)
      (chksum 4974)
      (hash (sha256 "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
     […]
--8<---------------cut here---------------end--------------->8---

The ‘assemble-archive’ procedure consumes that, looks up file contents
by hash on SWH, and reconstructs the original tarball…

… at least in theory, because in practice we hit the SWH rate limit
after looking up a few files:

  https://archive.softwareheritage.org/api/#rate-limiting

So it’s a bit ridiculous, but we may have to store a SWH “dir”
identifier for the whole extracted tree—a Git-tree hash—since that
would allow us to retrieve the whole thing in a single HTTP request.

Besides, we’ll also have to handle compression: storing gzip/xz headers
and compression levels.
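For illustration, here is what the mcron option might look like: a
minimal sketch, not actual build-farm configuration.  The package names
and the GC-root location are made up, and it assumes that ‘guix build
--sources=transitive’ plus ‘--root’ does the heavy lifting:

--8<---------------cut here---------------start------------->8---
;; Hypothetical mcron job for the build farm: fetch the source tarballs
;; of the given packages (and of their transitive inputs) and register
;; them as GC roots so they are never collected.  The package list and
;; root location are examples only.
(job "0 3 * * 1"                        ;Mondays at 3AM
     (string-append "guix build --quiet --sources=transitive "
                    "hello guile "
                    "--root=/var/guix/gcroots/source-tarballs")
     "preserve-source-tarballs")
--8<---------------cut here---------------end--------------->8---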
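As for the per-file lookups that ‘assemble-archive’ performs, here is a
sketch of what fetching a single file’s content could look like, using
the documented /api/1/content/ endpoint.  The procedure name is made
up, and it is of course subject to the rate limit mentioned above:

--8<---------------cut here---------------start------------->8---
;; Hypothetical helper: fetch from SWH the content whose SHA256 is
;; given as a nix-base32 string, as in the metadata above.
(use-modules (web client)
             (web response)
             (web uri)
             (guix base16)
             (guix base32))

(define (swh-content sha256-base32)
  "Return the file content (a bytevector for octet-stream responses),
or #f if SWH does not have it or the rate limit kicked in."
  (let* ((hex (bytevector->base16-string
               (nix-base32-string->bytevector sha256-base32)))
         (uri (string->uri
               (string-append
                "https://archive.softwareheritage.org/api/1/content/"
                "sha256:" hex "/raw/"))))
    (call-with-values (lambda () (http-get uri))
      (lambda (response body)
        (and (= 200 (response-code response))
             body)))))
--8<---------------cut here---------------end--------------->8---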
How would we put that in practice?  Good question.  :-)

I think we’d have to maintain a database that maps tarball hashes to
metadata (!).  A simple version of it could be a Git repo where, say,
‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
contain the metadata above.  The nice thing is that the Git repo itself
could be archived by SWH.  :-)

Thus, if a tarball vanishes, we’d look it up in the database and
reconstruct it from its metadata plus the content stored in SWH (see
the lookup sketch below).

Thoughts?

Anyhow, we should team up with fellow NixOS and SWH hackers to address
this, and with developers of other distros as well—this problem is not
just that of the functional deployment geeks, is it?

Ludo’.
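To illustrate the lookup side of that database, here is a sketch with
made-up names, assuming a local checkout of such a Git repo laid out as
‘sha256/<nix-base32-hash>’ files, each holding a ‘tar-source’ sexp like
the one above:

--8<---------------cut here---------------start------------->8---
;; Hypothetical: DB-CHECKOUT is a local checkout of the metadata repo.
(define (lookup-tarball-metadata db-checkout sha256-base32)
  "Return the 'tar-source' sexp for the tarball whose SHA256 is
SHA256-BASE32, or #f if the database does not have it."
  (let ((file (string-append db-checkout "/sha256/" sha256-base32)))
    (and (file-exists? file)
         (call-with-input-file file read))))

;; For instance:
;; (lookup-tarball-metadata "/path/to/db"
;;   "0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk")
--8<---------------cut here---------------end--------------->8---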