From: Timothy Sample
To: Ludovic Courtès
Cc: 42162@debbugs.gnu.org, Maurice Brémond
Subject: bug#42162: Recovering source tarballs
Date: Mon, 03 Aug 2020 12:59:52 -0400
Message-ID: <87d047u0l3.fsf@ngyro.com>
In-Reply-To: <87mu4iv0gc.fsf@inria.fr>
References: <87mu4iv0gc.fsf@inria.fr> <86h7uq8fmk.fsf@gmail.com>
 <87d05etero.fsf@gnu.org> <87r1tit5j6.fsf_-_@gnu.org>
 <875za4ykej.fsf@ngyro.com> <87bljvu4p4.fsf@gnu.org>

Hi Ludovic,

Ludovic Courtès writes:

> Wooohoo! Is it that time of the year when people give presents to one
> another? I can’t believe it. :-)

Not to be too cynical, but I think it’s just the time of year that I
get frustrated with what I should be working on, and start fantasizing
about green-field projects.
:p

> Timothy Sample skribis:
>
>> The header and footer are read directly from the file. Finding the
>> compressor is harder. I followed the approach taken by the pristine-tar
>> project. That is, try a bunch of compressors and hope for a match.
>> Currently, I have:
>>
>>   • gnu-best
>>   • gnu-best-rsync
>>   • gnu
>>   • gnu-rsync
>>   • gnu-fast
>>   • gnu-fast-rsync
>>   • zlib-best
>>   • zlib
>>   • zlib-fast
>>   • zlib-best-perl
>>   • zlib-perl
>>   • zlib-fast-perl
>>   • gnu-best-rsync-1.4
>>   • gnu-rsync-1.4
>>   • gnu-fast-rsync-1.4
>
> I would have used the integers that zlib supports, but I guess that
> doesn’t capture this whole gamut of compression setups. And yeah, it’s
> not great that we actually have to try and find the right compression
> levels, but there’s no way around it, it seems, and as you write, we can
> expect a couple of variants to be the most commonly used ones.

My first instinct was “this is impossible – a DEFLATE compressor can do
just about whatever it wants!” Then I looked at pristine-tar and
realized that their hack probably works pretty well. If I had infinite
time, I would think about some kind of fully general, parameterized LZ77
algorithm that could describe any implementation. If I had a lot of
time I would peel back the curtain on Gzip and zlib and expose their
tuning parameters. That would be nicer, but keep in mind we will have
to cover XZ, bzip2, and ZIP, too! There’s a bit of balance between
quality and coverage. Any improvement to the representation of the
compression algorithm could be implemented easily: just replace the
names with their improved representation.

One thing pristine-tar does is reorder the compressor list based on the
input metadata.
A Gzip member usually stores its compression level, so it makes sense
to try everything at that level first before moving on.

>> Originally, I used your code, but I ran into some problems. Namely,
>> real tarballs are not well-behaved. I wrote new code to keep track of
>> subtle things like the formatting of the octal values.
>
> Yeah I guess I was too optimistic. :-) I wanted to have the
> serialization/deserialization code automatically generated by that
> macro, but yeah, it doesn’t capture enough details for real-world
> tarballs.

I enjoyed your implementation! I might even bring back its style. It
was a little stiff for trying to figure out exactly what I needed for
reproducing the tarballs.

> Do you know how frequently you get “weird” tarballs? I was thinking
> about having something that works for plain GNU tar, but it’s even
> better to have something that works with “unusual” tarballs!

I don’t have hard numbers, but I would say that a good handful (5–10%)
have “X-format” fields, meaning their octal formatting is unusual.
(I’m looking at “grep -A 10 default-header” over all the S-Exp files.)
The most charming thing is the “uname” and “gname” fields. For example,
“rtmidi-4.0.0” was made by “gary” from “staff”. :)

> (BTW the code I posted or the one in Disarchive could perhaps replace
> the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’
> procedure there, notably.)

I really like “fold-archive”. One of the reasons I started doing this
is to possibly share code with Gash-Utils. It’s not as easy as I was
hoping, but I’m planning on improving things there based on my
experience here.
I’ve now worked with four Scheme tar implementations; maybe if I write
a really good one I could cap that number at five!

>> To avoid hitting the SWH archive at all, I introduced a directory cache
>> so that I can store the directories locally. If the directory cache is
>> available, directories are stored and retrieved from it.
>
> I guess we can get back to them eventually to estimate our coverage ratio.

It would be nice to know, but pretty hard to find out with the rate
limit. I guess it will improve immensely when we set up a
“sources.json” file.

>> You mean like ? :)
>
> Woow. :-)
>
> We could actually have a CI job to create the database: it would
> basically do ‘disarchive save’ for each tarball and store that using a
> layout like the one you used. Then we could have a job somewhere that
> periodically fetches that and adds it to the database. WDYT?

Maybe.... I assume that Disarchive would fail for a few of them. We
would need a plan for monitoring those failures so that Disarchive can
be improved. Also, unless I’m misunderstanding something, this means
building the whole database at every commit, no? That would take a lot
of time and space. On the other hand, it would be easy enough to try.
If it works, it’s a lot easier than setting up a whole other service.

> I think we should leave room for other hash algorithms (in the sexps
> above too).

It works for different hash algorithms, but not for different directory
hashing methods (like you mention below).

>> This was generated by a little script built on top of “fold-packages”.
>> It downloads Gzip’d tarballs used by Guix packages and passes them on to
>> Disarchive for disassembly. I limited the number to 100 because it’s
>> slow and because I’m sure there is a long tail of weird software
>> archives that are going to be hard to process.
>> The metadata directory ended up being 13M and the directory cache 2G.
>
> Neat.
>
> So it does mean that we could pretty much right away add a fall-back in
> (guix download) that looks up tarballs in your database and uses
> Disarchive to reconstruct it, right? I love solved problems. :-)
>
> Of course we could improve Disarchive and the database, but it seems to
> me that we already have enough to improve the situation. WDYT?

I would say that we are darn close! In theory it would work. It would
be much more practical if we had better coverage in the SWH archive
(i.e., “sources.json”) and a way to get metadata for a source archive
without downloading the entire Disarchive database. It’s 13M now, but
it will likely be 500M with all the Gzip’d tarballs from a recent commit
of Guix. It will only grow after that, too.

Of course those are not hard blockers, so ‘(guix download)’ could start
using Disarchive as soon as we package it. I’ve started looking into it,
but I’m confused about getting access to Disarchive from the
“out-of-band” download system. Would it have to become a dependency of
Guix?

>> Even with the code I have so far, I have a lot of questions. Mainly I’m
>> worried about keeping everything working into the future. It would be
>> easy to make incompatible changes. A lot of care would have to be
>> taken. Of course, keeping a Guix commit and a Disarchive commit might
>> be enough to make any assembling reproducible, but there’s a
>> chicken-and-egg problem there.
>
> The way I see it, Guix would always look up tarballs in the HEAD of the
> database (no need to pick a specific commit). Worst that could happen
> is we reconstruct a tarball that doesn’t match, and so the daemon errors
> out.
I was imagining an escape hatch beyond this, where one could look up a
provenance record from when Disarchive ingested and verified a source
code archive. The provenance record would tell you which version of
Guix was used when saving the archive, so you could try your luck with
using “guix time-machine” to reproduce Disarchive’s original
computation. If we perform database migrations, you would need to
travel back in time in the database, too. The idea is that you could
work around breakages in Disarchive automatically using the Power of
Guix™. Just a stray thought, really.

> Regarding future-proofness, I think we must be super careful about the
> file formats (the sexps). You did pay attention to not having implicit
> defaults, which is perfect. Perhaps one thing to change (or perhaps
> it’s already there) is support for other hashes in those sexps: both
> hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git
> tree with different hash algorithm, IPFS CID, etc.). Also the ability
> to specify several hashes.
>
> That way we could “refresh” the database anytime by adding the hash du
> jour for already-present tarballs.

The hash algorithm is already configurable, but the directory hash
method is not. You’re right that it should be, and that there should be
support for multiple digests.

>> What if a tarball from the closure of one of the derivations is missing?
>> I guess you could work around it, but it would be tricky.
>
> Well, more generally, we’ll have to monitor archive coverage. But I
> don’t think the issue is specific to this method.

Again, I’m thinking about the case where I want to travel back in time
to reproduce a Disarchive computation. It’s really an unlikely
scenario, I’m just trying to think of everything that could go wrong.
>>> Anyhow, we should team up with fellow NixOS and SWH hackers to address
>>> this, and with developers of other distros as well -- this problem is not
>>> just that of the functional deployment geeks, is it?
>>
>> I could remove most of the Guix stuff so that it would be easy to
>> package in Guix, Nix, Debian, etc. Then, someone™ could write a service
>> that consumes a “sources.json” file, adds the sources to a Disarchive
>> database, and pushes everything to a Git repo. I guess everyone who
>> cares has to produce a “sources.json” file anyway, so it will be very
>> little extra work. Other stuff like changing the serialization format
>> to JSON would be pretty easy, too. I’m not well connected to these
>> other projects, mind you, so I’m not really sure how to reach out.
>
> If you feel like it, you’re welcome to point them to your work in the
> discussion at . There’s one
> person from NixOS (lewo) participating in the discussion and I’m sure
> they’d be interested. Perhaps they’ll tell whether they care about
> having it available as JSON.

Good idea. I will work out a few more kinks and then bring it up there.
I’ve already rewritten the parts that used the Guix daemon. Disarchive
now only needs a handful of Guix modules ('base32', 'serialization', and
'swh' are the ones that would be hard to remove).

>> Sorry about the big mess of code and ideas – I realize I may have taken
>> the “do-ocracy” approach a little far here. :) Even if this is not
>> “the” solution, hopefully it’s useful for discussion!
>
> You did great! I had a very rough sketch and you did the real thing,
> that’s just awesome. :-)
>
> Thanks a lot!

My pleasure! Thanks for the feedback so far.

-- Tim
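
For readers following the thread, the "try a bunch of compressors and
hope for a match" idea discussed earlier in this message can be sketched
roughly as follows. This is Python rather than Scheme and is neither
Disarchive's nor pristine-tar's code. The candidate list is hypothetical
and only varies the zlib level, so it cannot reproduce GNU Gzip or
rsyncable output. It assumes a single-member Gzip file held in memory.
The header and footer are kept verbatim, and the XFL header byte (2 for
maximum compression, 4 for fastest, per RFC 1952) is used only to decide
which candidates to try first.

```python
import zlib

# Hypothetical candidates: (name, level, mem_level, strategy).
CANDIDATES = [
    ("zlib-best", 9, 8, zlib.Z_DEFAULT_STRATEGY),
    ("zlib", 6, 8, zlib.Z_DEFAULT_STRATEGY),
    ("zlib-fast", 1, 8, zlib.Z_DEFAULT_STRATEGY),
]

def header_length(data):
    """Return the length of the Gzip member header (RFC 1952)."""
    if data[:3] != b"\x1f\x8b\x08":
        raise ValueError("not a DEFLATE-compressed Gzip member")
    flags = data[3]
    pos = 10  # fixed part: magic, method, flags, mtime, XFL, OS
    if flags & 0x04:  # FEXTRA: 2-byte little-endian length, then payload
        pos += 2 + int.from_bytes(data[pos:pos + 2], "little")
    if flags & 0x08:  # FNAME: NUL-terminated file name
        pos = data.index(b"\x00", pos) + 1
    if flags & 0x10:  # FCOMMENT: NUL-terminated comment
        pos = data.index(b"\x00", pos) + 1
    if flags & 0x02:  # FHCRC: 2-byte header checksum
        pos += 2
    return pos

def ordered_candidates(xfl):
    """Put candidates matching the XFL hint first, keeping the rest in order."""
    preferred = {2: 9, 4: 1}.get(xfl)
    return sorted(CANDIDATES, key=lambda c: c[1] != preferred)

def find_compressor(data):
    """Return (name, header, footer, plain) for a matching candidate, or None."""
    start = header_length(data)
    header, body, footer = data[:start], data[start:-8], data[-8:]
    plain = zlib.decompressobj(-15).decompress(body)  # -15: raw DEFLATE
    for name, level, mem_level, strategy in ordered_candidates(header[8]):
        comp = zlib.compressobj(level, zlib.DEFLATED, -15, mem_level, strategy)
        if comp.compress(plain) + comp.flush() == body:
            return name, header, footer, plain
    return None
```

If a candidate matches, storing its name together with the header and
footer bytes is enough to rebuild the original file from the
uncompressed data; if none matches, the function returns None and the
archive would need a richer description of its compressor.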