From: Ludovic Courtès
To: Timothy Sample
Subject: bug#42162: Recovering source tarballs
Date: Fri, 31 Jul 2020 16:41:59 +0200
Message-ID: <87bljvu4p4.fsf@gnu.org>
In-Reply-To: <875za4ykej.fsf@ngyro.com> (Timothy Sample's message of "Thu, 30 Jul 2020 13:36:52 -0400")
References: <87mu4iv0gc.fsf@inria.fr> <86h7uq8fmk.fsf@gmail.com> <87d05etero.fsf@gnu.org> <87r1tit5j6.fsf_-_@gnu.org> <875za4ykej.fsf@ngyro.com>
List-Id: Bug reports for GNU Guix
Cc: 42162@debbugs.gnu.org, Maurice Brémond

Hi Timothy!

Timothy Sample wrote:

> This jumped out at me because I have been working with compression and
> tarballs for the bootstrapping effort.  I started pulling some threads
> and doing some research, and ended up prototyping an end-to-end solution
> for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata,
> and an SWH directory ID.  It can even put them back together!  :)  There
> are a bunch of problems still, but I think this project is doable in the
> short term.  I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and
> found and fixed a bunch of little gaffes.  There’s a ton of work to do,
> of course, but here’s another small step.
>
> I call the thing “Disarchive” as in “disassemble a source code archive”.
> You can find it at .  It has a simple
> command-line interface so you can do
>
>     $ disarchive save software-1.0.tar.gz
>
> which serializes a disassembled version of “software-1.0.tar.gz” to the
> database (which is just a directory) specified by the “DISARCHIVE_DB”
> environment variable.  Next, you can run
>
>     $ disarchive load hash-of-something-in-the-db
>
> which will recover an original file from its metadata (stored in the
> database) and data retrieved from the SWH archive or taken from a cache
> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Wooohoo!  Is it that time of the year when people give presents to one
another?  I can’t believe it.
:-)

> Now some implementation details.  The way I’ve set it up is that all of
> the assembly happens through Guix.  Each step in recreating a compressed
> tarball is a fixed-output derivation: the download from SWH, the
> creation of the tarball, and the compression.  I wanted an easy way to
> build and verify things according to a dependency graph without writing
> any code.  Hi Guix Daemon!  I’m not sure if this is a good long-term
> approach, though.  It could work well for reproducibility, but it might
> be easier to let some external service drive my code as a Guix package.
> Either way, it was an easy way to get started.
>
> For disassembly, it takes a Gzip file (containing a single member) and
> breaks it down like this:
>
>   (gzip-member
>     (version 0)
>     (name "hungrycat-0.4.1.tar.gz")
>     (input (sha256
>              "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
>     (header
>       (mtime 0)
>       (extra-flags 2)
>       (os 3))
>     (footer
>       (crc 3863610951)
>       (isize 194560))
>     (compressor gnu-best)
>     (digest
>       (sha256
>         "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))

Awesome.

> The header and footer are read directly from the file.  Finding the
> compressor is harder.  I followed the approach taken by the pristine-tar
> project.  That is, try a bunch of compressors and hope for a match.
> Currently, I have:
>
>     • gnu-best
>     • gnu-best-rsync
>     • gnu
>     • gnu-rsync
>     • gnu-fast
>     • gnu-fast-rsync
>     • zlib-best
>     • zlib
>     • zlib-fast
>     • zlib-best-perl
>     • zlib-perl
>     • zlib-fast-perl
>     • gnu-best-rsync-1.4
>     • gnu-rsync-1.4
>     • gnu-fast-rsync-1.4

I would have used the integers that zlib supports, but I guess that
doesn’t capture this whole gamut of compression setups.
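
To make the "try a bunch of compressors and hope for a match" idea concrete, here is a minimal Python sketch.  It is not Disarchive's code: the candidate table, the function names, and the use of plain zlib levels are assumptions for illustration, and it compares hex SHA-256 digests rather than the base32 digests shown in the sexps.

```python
import hashlib
import zlib

# Hypothetical candidate table: names and levels are illustrative only and
# do not reproduce the compressor list quoted above.
CANDIDATES = {
    "zlib-fast": 1,
    "zlib": 6,
    "zlib-best": 9,
}


def raw_deflate(data: bytes, level: int) -> bytes:
    """Compress data as a raw DEFLATE stream (no Gzip header or footer)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    return compressor.compress(data) + compressor.flush()


def guess_compressor(original: bytes, target_sha256: str):
    """Return the first candidate whose output hashes to target_sha256.

    Returns None when no candidate reproduces the compressed stream.
    """
    for name, level in CANDIDATES.items():
        attempt = raw_deflate(original, level)
        if hashlib.sha256(attempt).hexdigest() == target_sha256:
            return name
    return None
```

A caller would record the returned name next to the header and footer fields, and fall back to some other strategy when the result is None.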
And yeah, it’s not great that we actually have to try and find the right
compression levels, but there’s no way around it, it seems, and as you
write, we can expect a couple of variants to be the most commonly used
ones.

> The “input” field likely points to a tarball, which looks like this:
>
>   (tarball
>     (version 0)
>     (name "hungrycat-0.4.1.tar")
>     (input (sha256
>              "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
>     (default-header)
>     (headers
>       ((name "hungrycat-0.4.1/")
>        (mode 493)
>        (mtime 1513360022)
>        (chksum 5058)
>        (typeflag 53))
>       ((name "hungrycat-0.4.1/configure")
>        (mode 493)
>        (size 130263)
>        (mtime 1513360022)
>        (chksum 6043))
>       ...)
>     (padding 3584)
>     (digest
>       (sha256
>         "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))
>
> Originally, I used your code, but I ran into some problems.  Namely,
> real tarballs are not well-behaved.  I wrote new code to keep track of
> subtle things like the formatting of the octal values.

Yeah I guess I was too optimistic.  :-)  I wanted to have the
serialization/deserialization code automatically generated by that
macro, but yeah, it doesn’t capture enough details for real-world
tarballs.

Do you know how frequently you get “weird” tarballs?  I was thinking
about having something that works for plain GNU tar, but it’s even
better to have something that works with “unusual” tarballs!

(BTW the code I posted or the one in Disarchive could perhaps replace
the one in Gash-Utils.  I was frustrated to not see a ‘fold-archive’
procedure there, notably.)

> Even though they are not well-behaved, they are usually
> self-consistent, so I introduced the “default-header” field to set
> default values for all headers.  Any omitted fields in the headers use
> the value from the default header, and the default header takes
> defaults from a “default default header” defined in the code.
> Here’s a default header from a different tarball:
>
>   (default-header
>     (uid 1199)
>     (gid 30)
>     (magic "ustar ")
>     (version " \x00")
>     (uname "cagordon")
>     (gname "lhea")
>     (devmajor-format (width 0))
>     (devminor-format (width 0)))

Very nice.

> Finally, the “input” field here points to an “swh-directory” object.  It
> looks like this:
>
>   (swh-directory
>     (version 0)
>     (name "hungrycat-0.4.1")
>     (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
>     (digest
>       (sha256
>         "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))

Yay!

> I have a little module for computing the directory hash like SWH does
> (which is in turn like what Git does).  I did not verify that the 100
> packages were in the SWH archive.  I did verify a couple of packages,
> but I hit the rate limit and decided to avoid it for now.
>
> To avoid hitting the SWH archive at all, I introduced a directory cache
> so that I can store the directories locally.  If the directory cache is
> available, directories are stored and retrieved from it.

I guess we can get back to them eventually to estimate our coverage
ratio.

>> I think we’d have to maintain a database that maps tarball hashes to
>> metadata (!).  A simple version of it could be a Git repo where, say,
>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
>> contain the metadata above.  The nice thing is that the Git repo itself
>> could be archived by SWH.  :-)
>
> You mean like ?  :)

Woow.  :-)

We could actually have a CI job to create the database: it would
basically do ‘disarchive save’ for each tarball and store that using a
layout like the one you used.  Then we could have a job somewhere that
periodically fetches that and adds it to the database.  WDYT?

I think we should leave room for other hash algorithms (in the sexps
above too).

> This was generated by a little script built on top of “fold-packages”.
> It downloads Gzip’d tarballs used by Guix packages and passes them on to
> Disarchive for disassembly.  I limited the number to 100 because it’s
> slow and because I’m sure there is a long tail of weird software
> archives that are going to be hard to process.  The metadata directory
> ended up being 13M and the directory cache 2G.

Neat.

So it does mean that we could pretty much right away add a fall-back in
(guix download) that looks up tarballs in your database and uses
Disarchive to reconstruct them, right?  I love solved problems.  :-)

Of course we could improve Disarchive and the database, but it seems to
me that we already have enough to improve the situation.  WDYT?

> Even with the code I have so far, I have a lot of questions.  Mainly I’m
> worried about keeping everything working into the future.  It would be
> easy to make incompatible changes.  A lot of care would have to be
> taken.  Of course, keeping a Guix commit and a Disarchive commit might
> be enough to make any assembling reproducible, but there’s a
> chicken-and-egg problem there.

The way I see it, Guix would always look up tarballs in the HEAD of the
database (no need to pick a specific commit).  Worst that could happen
is we reconstruct a tarball that doesn’t match, and so the daemon errors
out.

Regarding future-proofness, I think we must be super careful about the
file formats (the sexps).  You did pay attention to not having implicit
defaults, which is perfect.  Perhaps one thing to change (or perhaps
it’s already there) is support for other hashes in those sexps: both
hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git
tree with different hash algorithm, IPFS CID, etc.).  Also the ability
to specify several hashes.

That way we could “refresh” the database anytime by adding the hash du
jour for already-present tarballs.

> What if a tarball from the closure of one of the derivations is missing?
> I guess you could work around it, but it would be tricky.

Well, more generally, we’ll have to monitor archive coverage.  But I
don’t think the issue is specific to this method.

>> Anyhow, we should team up with fellow NixOS and SWH hackers to address
>> this, and with developers of other distros as well; this problem is not
>> just that of the functional deployment geeks, is it?
>
> I could remove most of the Guix stuff so that it would be easy to
> package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
> that consumes a “sources.json” file, adds the sources to a Disarchive
> database, and pushes everything to a Git repo.  I guess everyone who
> cares has to produce a “sources.json” file anyway, so it will be very
> little extra work.  Other stuff like changing the serialization format
> to JSON would be pretty easy, too.  I’m not well connected to these
> other projects, mind you, so I’m not really sure how to reach out.

If you feel like it, you’re welcome to point them to your work in the
discussion at .  There’s one person from NixOS (lewo) participating in
the discussion and I’m sure they’d be interested.  Perhaps they’ll tell
whether they care about having it available as JSON.

> Sorry about the big mess of code and ideas – I realize I may have taken
> the “do-ocracy” approach a little far here.  :)  Even if this is not
> “the” solution, hopefully it’s useful for discussion!

You did great!  I had a very rough sketch and you did the real thing,
that’s just awesome.  :-)

Thanks a lot!

Ludo’.