From: Ludovic Courtès
To: Timothy Sample
Cc: 42162@debbugs.gnu.org, Maurice Brémond
Subject: bug#42162: Recovering source tarballs
Date: Wed, 05 Aug 2020 19:14:12 +0200
Message-ID: <87wo2dnhgb.fsf@gnu.org>
In-Reply-To: <87d047u0l3.fsf@ngyro.com> (Timothy Sample's message of "Mon, 03 Aug 2020 12:59:52 -0400")
References: <87mu4iv0gc.fsf@inria.fr> <86h7uq8fmk.fsf@gmail.com> <87d05etero.fsf@gnu.org> <87r1tit5j6.fsf_-_@gnu.org> <875za4ykej.fsf@ngyro.com> <87bljvu4p4.fsf@gnu.org> <87d047u0l3.fsf@ngyro.com>

Hello!

Timothy Sample wrote:

> Ludovic Courtès writes:
>
>> Wooohoo! Is it that time of the year when people give presents to one
>> another? I can’t believe it. :-)
>
> Not to be too cynical, but I think it’s just the time of year that I get
> frustrated with what I should be working on, and start fantasizing about
> green-field projects. :p

:-)

>> Timothy Sample wrote:
>>
>>> The header and footer are read directly from the file. Finding the
>>> compressor is harder. I followed the approach taken by the pristine-tar
>>> project. That is, try a bunch of compressors and hope for a match.
>>> Currently, I have:
>>>
>>>     • gnu-best
>>>     • gnu-best-rsync
>>>     • gnu
>>>     • gnu-rsync
>>>     • gnu-fast
>>>     • gnu-fast-rsync
>>>     • zlib-best
>>>     • zlib
>>>     • zlib-fast
>>>     • zlib-best-perl
>>>     • zlib-perl
>>>     • zlib-fast-perl
>>>     • gnu-best-rsync-1.4
>>>     • gnu-rsync-1.4
>>>     • gnu-fast-rsync-1.4
>>
>> I would have used the integers that zlib supports, but I guess that
>> doesn’t capture this whole gamut of compression setups. And yeah, it’s
>> not great that we actually have to try and find the right compression
>> levels, but there’s no way around it, it seems, and as you write, we can
>> expect a couple of variants to be the most commonly used ones.
>
> My first instinct was “this is impossible – a DEFLATE compressor can do
> just about whatever it wants!” Then I looked at pristine-tar and
> realized that their hack probably works pretty well. If I had infinite
> time, I would think about some kind of fully general, parameterized LZ77
> algorithm that could describe any implementation. If I had a lot of
> time I would peel back the curtain on Gzip and zlib and expose their
> tuning parameters. That would be nicer, but keep in mind we will have
> to cover XZ, bzip2, and ZIP, too! There’s a bit of balance between
> quality and coverage. Any improvement to the representation of the
> compression algorithm could be implemented easily: just replace the
> names with their improved representation.

Yup, it makes sense to not spend too much time on this bit. I guess
we’d already have good coverage with gzip and xz.

>> (BTW the code I posted or the one in Disarchive could perhaps replace
>> the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’
>> procedure there, notably.)
>
> I really like “fold-archive”. One of the reasons I started doing this
> is to possibly share code with Gash-Utils. It’s not as easy as I was
> hoping, but I’m planning on improving things there based on my
> experience here. I’ve now worked with four Scheme tar implementations,
> maybe if I write a really good one I could cap that number at five!

Heh. :-) The needs are different anyway. In Gash-Utils the focus is
probably on simplicity/maintainability, whereas here you really want to
cover all the details of the wire representation.

>>> To avoid hitting the SWH archive at all, I introduced a directory cache
>>> so that I can store the directories locally. If the directory cache is
>>> available, directories are stored and retrieved from it.
>>
>> I guess we can get back to them eventually to estimate our coverage ratio.
>
> It would be nice to know, but pretty hard to find out with the rate
> limit. I guess it will improve immensely when we set up a
> “sources.json” file.

Note that we have . Last I checked, SWH was ingesting it in its
“qualification” instance, so it should be ingesting it for good real
soon if it’s not doing it already.

>>> You mean like ? :)
>>
>> Woow. :-)
>>
>> We could actually have a CI job to create the database: it would
>> basically do ‘disarchive save’ for each tarball and store that using a
>> layout like the one you used. Then we could have a job somewhere that
>> periodically fetches that and adds it to the database. WDYT?
>
> Maybe.... I assume that Disarchive would fail for a few of them. We
> would need a plan for monitoring those failures so that Disarchive can
> be improved. Also, unless I’m misunderstanding something, this means
> building the whole database at every commit, no? That would take a lot
> of time and space. On the other hand, it would be easy enough to try.
> If it works, it’s a lot easier than setting up a whole other service.

One can easily write a procedure that takes a tarball and returns a
that builds its database entry. So at each commit, we’d just rebuild
things that have changed.

>> I think we should leave room for other hash algorithms (in the sexps
>> above too).
>
> It works for different hash algorithms, but not for different directory
> hashing methods (like you mention below).

OK.

[...]

>> So it does mean that we could pretty much right away add a fall-back in
>> (guix download) that looks up tarballs in your database and uses
>> Disarchive to reconstruct it, right? I love solved problems. :-)
>>
>> Of course we could improve Disarchive and the database, but it seems to
>> me that we already have enough to improve the situation. WDYT?
>
> I would say that we are darn close! In theory it would work.
> It would be much more practical if we had better coverage in the SWH
> archive (i.e., “sources.json”) and a way to get metadata for a source
> archive without downloading the entire Disarchive database. It’s 13M
> now, but it will likely be 500M with all the Gzip’d tarballs from a
> recent commit of Guix. It will only grow after that, too.

If we expose the database over HTTP (like over cgit), we can arrange so
that (guix download) simply GETs db.example.org/sha256/xyz. No need to
fetch the whole database.

It might be more reasonable to have a real database and a real service
around it, I’m sure Chris Baines would agree ;-), but we can choose URLs
that could easily be implemented by a “real” service instead of cgit in
the future.

> Of course those are not hard blockers, so ‘(guix download)’ could start
> using Disarchive as soon as we package it. I’ve started looking into
> it, but I’m confused about getting access to Disarchive from the
> “out-of-band” download system. Would it have to become a dependency of
> Guix?

Yes. It could be a behind-the-scenes dependency of “builtin:download”;
it doesn’t have to be a dependency of each and every fixed-output
derivation.

> I was imagining an escape hatch beyond this, where one could look up a
> provenance record from when Disarchive ingested and verified a source
> code archive. The provenance record would tell you which version of
> Guix was used when saving the archive, so you could try your luck with
> using “guix time-machine” to reproduce Disarchive’s original
> computation. If we perform database migrations, you would need to
> travel back in time in the database, too. The idea is that you could
> work around breakages in Disarchive automatically using the Power of
> Guix™. Just a stray thought, really.

Seems to me it Shouldn’t Be Necessary? :-)

I mean, as long as the format is extensible and “future-proof”, we’ll
always be able to rebuild tarballs and then re-disassemble them if we
need to compute new hashes or whatever.

>> If you feel like it, you’re welcome to point them to your work in the
>> discussion at . There’s one
>> person from NixOS (lewo) participating in the discussion and I’m sure
>> they’d be interested. Perhaps they’ll tell whether they care about
>> having it available as JSON.
>
> Good idea. I will work out a few more kinks and then bring it up there.
> I’ve already rewritten the parts that used the Guix daemon. Disarchive
> now only needs a handful of Guix modules ('base32', 'serialization', and
> 'swh' are the ones that would be hard to remove).

An option would be to use (gcrypt base64); another one would be to
bundle (guix base32).

I was thinking that it might be best to not use Guix for computations.
For example, have “disarchive save” not build derivations and instead do
everything “here and now”. That would make it easier for others to
adopt. Wait, looking at the Git history, it looks like you already
addressed that point, neat. :-)

Thank you!

Ludo’.
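
As an illustration of the "try a bunch of compressors and hope for a
match" approach quoted in this message, here is a minimal Python sketch.
It is not Disarchive's or pristine-tar's code: the names split_gzip and
matching_levels are hypothetical, and it only tries zlib's integer
levels 1 to 9 with default settings. It would not recognize GNU gzip
output or the rsyncable variants in the list, which need those exact
implementations.

```python
import zlib

# Bits of the gzip FLG header byte (RFC 1952).
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10


def split_gzip(blob: bytes):
    """Split a single-member gzip file into (header, DEFLATE stream, footer)."""
    if blob[:3] != b"\x1f\x8b\x08":
        raise ValueError("not a gzip file using DEFLATE")
    flags = blob[3]
    pos = 10  # fixed part: magic, method, flags, mtime, xfl, os
    if flags & FEXTRA:
        xlen = int.from_bytes(blob[pos:pos + 2], "little")
        pos += 2 + xlen
    if flags & FNAME:
        pos = blob.index(b"\x00", pos) + 1  # zero-terminated file name
    if flags & FCOMMENT:
        pos = blob.index(b"\x00", pos) + 1  # zero-terminated comment
    if flags & FHCRC:
        pos += 2
    # The footer is CRC32 plus uncompressed size, 4 bytes each.
    return blob[:pos], blob[pos:-8], blob[-8:]


def matching_levels(blob: bytes):
    """Return the zlib levels whose raw DEFLATE output equals the original."""
    _header, payload, _footer = split_gzip(blob)
    data = zlib.decompress(payload, wbits=-15)  # -15 selects raw DEFLATE
    matches = []
    for level in range(1, 10):
        compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
        if compressor.compress(data) + compressor.flush() == payload:
            matches.append(level)
    return matches
```

Several levels can produce identical output for the same input, so the
sketch returns every match instead of a single answer. An empty list
means none of the candidates reproduces the file bit for bit, which is
the case where a longer candidate list like the one above is needed.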