From: Timothy Sample
To: Ludovic Courtès
Cc: 42162@debbugs.gnu.org, Maurice Brémond
Subject: bug#42162: Recovering source tarballs
Date: Thu, 30 Jul 2020 13:36:52 -0400
Message-ID: <875za4ykej.fsf@ngyro.com>
In-Reply-To: <87mu4iv0gc.fsf@inria.fr>
References: <87mu4iv0gc.fsf@inria.fr> <86h7uq8fmk.fsf@gmail.com>
 <87d05etero.fsf@gnu.org> <87r1tit5j6.fsf_-_@gnu.org>
List-Id: Bug reports for GNU Guix

Hi Ludovic,

Ludovic Courtès writes:

> Hi,
>
> Ludovic Courtès skribis:
>
> [...]
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree
>
> After all, tarballs are byproducts and should be no exception: we should
> build them from source.
> :-)
>
> In , Stefano mentioned
> pristine-tar, which does almost that, but not quite: it stores a binary
> delta between a tarball and a tree:
>
>   https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html
>
> I think we should have something more transparent than a binary delta.
>
> The code below can “disassemble” and “assemble” a tar. When it
> disassembles it, it generates metadata like this:
>
>   (tar-source
>     (version 0)
>     (headers
>       (("guile-3.0.4/"
>         (mode 493)
>         (size 0)
>         (mtime 1593007723)
>         (chksum 3979)
>         (typeflag #\5))
>        ("guile-3.0.4/m4/"
>         (mode 493)
>         (size 0)
>         (mtime 1593007720)
>         (chksum 4184)
>         (typeflag #\5))
>        ("guile-3.0.4/m4/pipe2.m4"
>         (mode 420)
>         (size 531)
>         (mtime 1536050419)
>         (chksum 4812)
>         (hash (sha256
>                "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
>        ("guile-3.0.4/m4/time_h.m4"
>         (mode 420)
>         (size 5471)
>         (mtime 1536050419)
>         (chksum 4974)
>         (hash (sha256
>                "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
>        […]
>
> The ‘assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…
>
> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:
>
>   https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree (a Git-tree hash), since that
> would allow us to retrieve the whole thing in a single HTTP request.
>
> Besides, we’ll also have to handle compression: storing gzip/xz headers
> and compression levels.

This jumped out at me because I have been working with compression and
tarballs for the bootstrapping effort.
I started pulling some threads and doing some research, and ended up
prototyping an end-to-end solution for decomposing a Gzip’d tarball into
Gzip metadata, tarball metadata, and an SWH directory ID. It can even
put them back together! :) There are a bunch of problems still, but I
think this project is doable in the short term. I’ve tested 100
arbitrary Gzip’d tarballs from Guix, and found and fixed a bunch of
little gaffes. There’s a ton of work to do, of course, but here’s
another small step.

I call the thing “Disarchive” as in “disassemble a source code archive”.
You can find it at . It has a simple command-line interface so you can
do

    $ disarchive save software-1.0.tar.gz

which serializes a disassembled version of “software-1.0.tar.gz” to the
database (which is just a directory) specified by the “DISARCHIVE_DB”
environment variable. Next, you can run

    $ disarchive load hash-of-something-in-the-db

which will recover an original file from its metadata (stored in the
database) and data retrieved from the SWH archive or taken from a cache
(again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Now some implementation details. The way I’ve set it up is that all of
the assembly happens through Guix. Each step in recreating a compressed
tarball is a fixed-output derivation: the download from SWH, the
creation of the tarball, and the compression. I wanted an easy way to
build and verify things according to a dependency graph without writing
any code. Hi Guix Daemon! I’m not sure if this is a good long-term
approach, though. It could work well for reproducibility, but it might
be easier to let some external service drive my code as a Guix package.
Either way, it was an easy way to get started.
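
To make the fixed-output idea concrete, here is a minimal Python sketch.
It is not Disarchive's code and not how Guix implements derivations; it
only illustrates the principle that each step's output is accepted only
if its hash matches a hash recorded in advance. The step functions
(download, make_tarball, compress), the file name, and the helpers
record_hashes and assemble are all hypothetical stand-ins, and the
"download" step simply returns fixed bytes instead of contacting SWH.

```python
import gzip
import hashlib
import io
import tarfile


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def run_fixed_output(name, step, data, expected_sha256):
    """Run one step and accept its output only if its hash is the expected one."""
    output = step(data)
    actual = sha256_hex(output)
    if actual != expected_sha256:
        raise ValueError(f"{name}: expected {expected_sha256}, got {actual}")
    return output


# Hypothetical stand-ins for the three steps named in the text.
def download(_):
    # Stand-in for fetching a directory from SWH: here, one file's bytes.
    return b"hello, world\n"


def make_tarball(content):
    # Pack the content into a tar archive with fixed metadata.
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w",
                      format=tarfile.USTAR_FORMAT) as tar:
        info = tarfile.TarInfo("software-1.0/README")
        info.size = len(content)
        info.mtime = 0
        tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def compress(tarball):
    # mtime=0 keeps the current time out of the Gzip header.
    return gzip.compress(tarball, compresslevel=9, mtime=0)


STEPS = [("download", download), ("tarball", make_tarball), ("gzip", compress)]


def record_hashes():
    """Play the role of "save": run the steps once, noting each output hash."""
    data, hashes = b"", []
    for _, step in STEPS:
        data = step(data)
        hashes.append(sha256_hex(data))
    return hashes


def assemble(hashes):
    """Play the role of "load": rerun the steps, checking every hash."""
    data = b""
    for (name, step), expected in zip(STEPS, hashes):
        data = run_fixed_output(name, step, data, expected)
    return data
```

Calling assemble(record_hashes()) returns the compressed archive, while
changing any one recorded hash makes the corresponding step fail instead
of silently producing a different file.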

For disassembly, it takes a Gzip file (containing a single member) and
breaks it down like this:

    (gzip-member
      (version 0)
      (name "hungrycat-0.4.1.tar.gz")
      (input (sha256
               "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
      (header
        (mtime 0)
        (extra-flags 2)
        (os 3))
      (footer
        (crc 3863610951)
        (isize 194560))
      (compressor gnu-best)
      (digest
        (sha256
          "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))

The header and footer are read directly from the file. Finding the
compressor is harder. I followed the approach taken by the pristine-tar
project. That is, try a bunch of compressors and hope for a match.
Currently, I have:

    • gnu-best
    • gnu-best-rsync
    • gnu
    • gnu-rsync
    • gnu-fast
    • gnu-fast-rsync
    • zlib-best
    • zlib
    • zlib-fast
    • zlib-best-perl
    • zlib-perl
    • zlib-fast-perl
    • gnu-best-rsync-1.4
    • gnu-rsync-1.4
    • gnu-fast-rsync-1.4

This list is inspired by pristine-tar. The first couple GNU compressors
use modern Gzip from Guix. The zlib and rsync-1.4 ones use the Gzip and
zlib wrapper from pristine-tar called “zgz”. The 100 Gzip files I
looked at use “gnu”, “gnu-best”, “gnu-best-rsync-1.4”, “zlib”,
“zlib-best”, and “zlib-fast-perl”.

(As an aside, I had a way to decompose multi-member Gzip files, but it
was much, much slower. Since I doubt they exist in the wild, I removed
that code.)

The “input” field likely points to a tarball, which looks like this:

    (tarball
      (version 0)
      (name "hungrycat-0.4.1.tar")
      (input (sha256
               "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
      (default-header)
      (headers
        ((name "hungrycat-0.4.1/")
         (mode 493)
         (mtime 1513360022)
         (chksum 5058)
         (typeflag 53))
        ((name "hungrycat-0.4.1/configure")
         (mode 493)
         (size 130263)
         (mtime 1513360022)
         (chksum 6043))
        ...)
      (padding 3584)
      (digest
        (sha256
          "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))

Originally, I used your code, but I ran into some problems. Namely,
real tarballs are not well-behaved. I wrote new code to keep track of
subtle things like the formatting of the octal values. Even though they
are not well-behaved, they are usually self-consistent, so I introduced
the “default-header” field to set default values for all headers. Any
omitted fields in the headers use the value from the default header, and
the default header takes defaults from a “default default header”
defined in the code. Here’s a default header from a different tarball:

    (default-header
      (uid 1199)
      (gid 30)
      (magic "ustar ")
      (version " \x00")
      (uname "cagordon")
      (gname "lhea")
      (devmajor-format (width 0))
      (devminor-format (width 0)))

These default values are computed to minimize the noise in the
serialized form. Here we see for example that each header should have
UID 1199 unless otherwise specified. We also see that the device fields
should be null strings instead of octal zeros. Another good example
here is that the magic field has a space after “ustar”, which is not
what modern POSIX says to do.

My tarball reader has minimal support for extended headers, but they are
not serialized cleanly (they survive the round-trip, but they are not
human-readable).

Finally, the “input” field here points to an “swh-directory” object. It
looks like this:

    (swh-directory
      (version 0)
      (name "hungrycat-0.4.1")
      (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
      (digest
        (sha256
          "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))

I have a little module for computing the directory hash like SWH does
(which is in turn like what Git does). I did not verify that the 100
packages were in the SWH archive. I did verify a couple of packages,
but I hit the rate limit and decided to avoid it for now.
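
For readers unfamiliar with the scheme, the following Python sketch
shows how a Git-style tree identifier can be computed for an in-memory
directory. It is an illustration, not the module mentioned above, and
it makes simplifying assumptions: a directory is a nested dict mapping
names to either bytes (file contents) or another dict (a subdirectory),
every file is treated as a regular non-executable file (mode 100644),
and symlinks and executable bits are ignored. The function names
git_object_id and tree_id are made up for this example.

```python
import hashlib


def git_object_id(kind: str, payload: bytes) -> bytes:
    """SHA-1 over the Git object header ("<kind> <length>" + NUL) and payload."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).digest()


def tree_id(tree: dict) -> bytes:
    """Compute the Git tree identifier of a nested dict of files and dirs."""
    entries = []
    for name, node in tree.items():
        raw_name = name.encode()
        if isinstance(node, dict):
            # Directories sort as if their name ended with a slash.
            sort_key = raw_name + b"/"
            entries.append((sort_key, b"40000", raw_name, tree_id(node)))
        else:
            blob = git_object_id("blob", node)
            entries.append((raw_name, b"100644", raw_name, blob))
    entries.sort(key=lambda entry: entry[0])
    # Each entry is "<mode> <name>" + NUL + the 20-byte binary object id.
    payload = b"".join(
        mode + b" " + name + b"\0" + object_id
        for _, mode, name, object_id in entries
    )
    return git_object_id("tree", payload)
```

With these definitions, tree_id({}).hex() gives Git's well-known empty
tree identifier, 4b825dc642cb6eb9a060e54bf8d69288fbee4904. An “id”
field like the one in the “swh-directory” record above would be the
hexadecimal form of such a hash, computed over the real extracted
directory with its actual file modes.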

To avoid hitting the SWH archive at all, I introduced a directory cache
so that I can store the directories locally. If the directory cache is
available, directories are stored in and retrieved from it.

> How would we put that in practice? Good question. :-)
>
> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!). A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above. The nice thing is that the Git repo itself
> could be archived by SWH. :-)

You mean like ? :)

This was generated by a little script built on top of “fold-packages”.
It downloads Gzip’d tarballs used by Guix packages and passes them on to
Disarchive for disassembly. I limited the number to 100 because it’s
slow and because I’m sure there is a long tail of weird software
archives that are going to be hard to process. The metadata directory
ended up being 13M and the directory cache 2G.

> Thus, if a tarball vanishes, we’d look it up in the database and
> reconstruct it from its metadata plus content store in SWH.
>
> Thoughts?

Obviously I like the idea. ;)

Even with the code I have so far, I have a lot of questions. Mainly I’m
worried about keeping everything working into the future. It would be
easy to make incompatible changes. A lot of care would have to be
taken. Of course, keeping a Guix commit and a Disarchive commit might
be enough to make any assembling reproducible, but there’s a
chicken-and-egg problem there. What if a tarball from the closure of
one of the derivations is missing? I guess you could work around it,
but it would be tricky.

> Anyhow, we should team up with fellow NixOS and SWH hackers to address
> this, and with developers of other distros as well; this problem is not
> just that of the functional deployment geeks, is it?

I could remove most of the Guix stuff so that it would be easy to
package in Guix, Nix, Debian, etc. Then, someone™ could write a service
that consumes a “sources.json” file, adds the sources to a Disarchive
database, and pushes everything to a Git repo. I guess everyone who
cares has to produce a “sources.json” file anyway, so it will be very
little extra work. Other stuff like changing the serialization format
to JSON would be pretty easy, too. I’m not well connected to these
other projects, mind you, so I’m not really sure how to reach out.

Sorry about the big mess of code and ideas – I realize I may have taken
the “do-ocracy” approach a little far here. :) Even if this is not
“the” solution, hopefully it’s useful for discussion!


-- Tim