From mboxrd@z Thu Jan 1 00:00:00 1970
From: zimoun
To: Timothy Sample, Ludovic Courtès
Cc: 42162@debbugs.gnu.org, Maurice Brémond
Subject: bug#42162: Recovering source tarballs
Date: Wed, 26 Aug 2020 12:04:55 +0200
Message-ID: <86blixyb7c.fsf@gmail.com>
In-Reply-To: <875za4ykej.fsf@ngyro.com>
References: <87mu4iv0gc.fsf@inria.fr> <86h7uq8fmk.fsf@gmail.com>
 <87d05etero.fsf@gnu.org> <87r1tit5j6.fsf_-_@gnu.org>
 <875za4ykej.fsf@ngyro.com>
List-Id: Bug reports for GNU Guix
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

Dear Timothy,

On Thu, 30 Jul 2020 at 13:36, Timothy Sample wrote:

> I call the thing “Disarchive” as in “disassemble a source code archive”.
> You can find it at .
> It has a simple command-line interface so you can do
>
>   $ disarchive save software-1.0.tar.gz
>
> which serializes a disassembled version of “software-1.0.tar.gz” to the
> database (which is just a directory) specified by the “DISARCHIVE_DB”
> environment variable. Next, you can run
>
>   $ disarchive load hash-of-something-in-the-db
>
> which will recover an original file from its metadata (stored in the
> database) and data retrieved from the SWH archive or taken from a cache
> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Really nice! Thank you!

>> I think we’d have to maintain a database that maps tarball hashes to
>> metadata (!). A simple version of it could be a Git repo where, say,
>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
>> contain the metadata above. The nice thing is that the Git repo itself
>> could be archived by SWH. :-)
>
> You mean like ? :)

[...]

> This was generated by a little script built on top of “fold-packages”.
> It downloads Gzip’d tarballs used by Guix packages and passes them on to
> Disarchive for disassembly. I limited the number to 100 because it’s
> slow and because I’m sure there is a long tail of weird software
> archives that are going to be hard to process. The metadata directory
> ended up being 13M and the directory cache 2G.

One question is how this database scales. For example, a quick
back-of-the-envelope estimate gives ~1.2GB of metadata for ~14k packages,
and then an increase of ~700MB per year, both with Ludo’s code [1].

[1]

> I could remove most of the Guix stuff so that it would be easy to
> package in Guix, Nix, Debian, etc. Then, someone™ could write a service
> that consumes a “sources.json” file, adds the sources to a Disarchive
> database, and pushes everything to a Git repo.
> I guess everyone who cares has to produce a “sources.json” file anyway,
> so it will be very little extra work. Other stuff like changing the
> serialization format to JSON would be pretty easy, too. I’m not well
> connected to these other projects, mind you, so I’m not really sure how
> to reach out.

This service could be really useful. Yes, it could be easy to update the
database each time Guix produces a new “sources.json”.

As mentioned [2], should this service be part of SWH (download cooking
task)? Or should it live on the project side?

[2]

Thank you again for this piece of work.

All the best,
simon
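
P.S. The scaling question above comes down to simple arithmetic, so here
is a minimal Python sketch of it. It only reuses the rounded figures quoted
in this thread (13M of metadata and a 2G cache for 100 tarballs; ~1.2GB for
~14k packages; ~700MB per year). The function names, the decimal units
(1 GB = 1000 MB) and the assumption that size grows linearly with the
number of sources are mine, for illustration only. The two estimates also
come from different code and metadata formats, so the comparison is an
order-of-magnitude check at best.

```python
# Back-of-the-envelope check of the sizes quoted in this thread.
# Assumptions (not from the original messages): decimal units and
# linear growth with the number of sources.

SAMPLE_TARBALLS = 100          # tarballs processed in the Disarchive test run
SAMPLE_METADATA_MB = 13.0      # "13M" metadata directory
SAMPLE_CACHE_MB = 2000.0       # "2G" directory cache
PACKAGES = 14_000              # "~14k packages"
ESTIMATE_MB = 1200.0           # "~1.2GB" metadata estimate
GROWTH_MB_PER_YEAR = 700.0     # "~700MB per year"


def per_item_kb(total_mb: float, items: int) -> float:
    """Average size per item in KB, given a total in MB."""
    return total_mb * 1000.0 / items


def extrapolate_mb(sample_mb: float, sample_items: int, target_items: int) -> float:
    """Scale a measured sample linearly to a larger number of items."""
    return sample_mb / sample_items * target_items


def projected_mb(initial_mb: float, growth_mb_per_year: float, years: float) -> float:
    """Total size after some years of constant yearly growth."""
    return initial_mb + growth_mb_per_year * years


if __name__ == "__main__":
    print("metadata per tarball (sample), KB:",
          round(per_item_kb(SAMPLE_METADATA_MB, SAMPLE_TARBALLS)))
    print("metadata per package (estimate), KB:",
          round(per_item_kb(ESTIMATE_MB, PACKAGES)))
    print("sample extrapolated to all packages, MB:",
          round(extrapolate_mb(SAMPLE_METADATA_MB, SAMPLE_TARBALLS, PACKAGES)))
    print("cache extrapolated to all packages, MB:",
          round(extrapolate_mb(SAMPLE_CACHE_MB, SAMPLE_TARBALLS, PACKAGES)))
    print("estimate after 5 years, MB:",
          round(projected_mb(ESTIMATE_MB, GROWTH_MB_PER_YEAR, 5)))
```

Under these assumptions the 100-tarball sample works out to about 130 KB
of metadata per tarball, against about 86 KB per package for the ~1.2GB
estimate, so the two figures are at least in the same ballpark.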