From: zimoun
To: Timothy Sample
Cc: 42162@debbugs.gnu.org, Maurice Brémond
Subject: bug#42162: Recovering source tarballs
Date: Thu, 27 Aug 2020 11:41:24 +0200
Message-ID: <86lfi0e88r.fsf@gmail.com>
In-Reply-To: <87k0xlaz8p.fsf@ngyro.com>

Hi,

On Wed, 26 Aug 2020 at 17:11, Timothy Sample wrote:
> zimoun writes:
>
>> One question is how this database scales?
>>
>> For example, a quick back-of-the-envelope estimation leads to ~1.2GB
>> metadata for ~14k packages and then an increase of ~700MB per year,
>> both with Ludo’s code [1].
>>
>> [1]
>
> It’s a good question.  A good part of the size comes from the
> representation rather than the data.  Compression helps a lot here.  I
> have a database of 3,912 packages.  It’s 295M uncompressed (which is a
> little better than your estimation).  If I pass each file through Lzip,
> it shrinks down to 60M.  That’s more like 15.5K per package, which is
> almost an order of magnitude smaller than the estimation you used
> (120K).  I think that makes the numbers rather pleasant, but it comes at
> the expense of easy storing in Git.

Thank you for these numbers.  Really interesting!

First, I do not know if the database needs to be stored with Git.  What
would be the advantage?  (naive question :-))

On SWH T2430 [1], you explain the “default-header” trick to cut down the
size.  Nice!

Moreover, the format is a long list, e.g.,

--8<---------------cut here---------------start------------->8---
(headers
 ((name "raptor2-2.0.15/")
  (mode 493)
  (mtime 1414909500)
  (chksum 4225)
  (typeflag 53))
 ((name "raptor2-2.0.15/build/")
  (mode 493)
  (mtime 1414909497)
  (chksum 4797)
  (typeflag 53))
 ((name "raptor2-2.0.15/build/ltversion.m4")
  (size 690)
  (mtime 1414908273)
  (chksum 5958))
 […])
--8<---------------cut here---------------end--------------->8---

which is human-readable.  Is that useful?  Instead, one could imagine
shorter keywords:

   ((na "raptor2-2.0.15/") (mo 493) (mt 1414909500) (ch 4225) (ty 53))

which, using your database (commit fc50927), reduces the size from 295MB
to 279MB.  Or even a plain list:

   (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
   (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)

where the first element provides the “type” of the list, to ease the
reader’s job.

Well, the 2 naive questions are: does it make sense to
 - have the database stored under Git?
 - have a human-readable format?

Thank you again for pushing forward this topic. :-)

All the best,
simon

[1] https://forge.softwareheritage.org/T2430#47522
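
A minimal sketch of how the three encodings discussed above could be
compared on the quoted sample.  It is only an illustration, not the code
behind the database: the abbreviation "si" for "size", the integer tags 0
and 1 (standing in for \x00 and \x01) and all the helper names are
assumptions, and Python's `lzma` module is used as a stand-in for Lzip
(same algorithm family, different container format).

```python
import lzma

# Sample entries copied from the excerpt quoted in the message.
ENTRIES = [
    {"name": "raptor2-2.0.15/", "mode": 493, "mtime": 1414909500,
     "chksum": 4225, "typeflag": 53},
    {"name": "raptor2-2.0.15/build/", "mode": 493, "mtime": 1414909497,
     "chksum": 4797, "typeflag": 53},
    {"name": "raptor2-2.0.15/build/ltversion.m4", "size": 690,
     "mtime": 1414908273, "chksum": 5958},
]

# Two-letter keywords; "si" is an assumption, the others are from the message.
SHORT = {"name": "na", "mode": "mo", "mtime": "mt",
         "chksum": "ch", "typeflag": "ty", "size": "si"}

# Hypothetical positional layouts: the tag tells the reader which fields follow.
LAYOUTS = {
    0: ("name", "mode", "mtime", "chksum", "typeflag"),
    1: ("name", "size", "mtime", "chksum"),
}


def atom(value):
    """Render a string with double quotes and a number as-is."""
    return '"%s"' % value if isinstance(value, str) else str(value)


def keyed(entry, keys=None):
    """Render an entry as ((key value) ...), optionally renaming the keys."""
    keys = keys or {}
    fields = " ".join("(%s %s)" % (keys.get(k, k), atom(v))
                      for k, v in entry.items())
    return "(%s)" % fields


def positional(entry):
    """Render an entry as (tag value ...) using the first matching layout."""
    for tag, layout in LAYOUTS.items():
        if tuple(entry) == layout:
            values = " ".join(atom(entry[k]) for k in layout)
            return "(%d %s)" % (tag, values)
    raise ValueError("no layout for keys %r" % (tuple(entry),))


def sizes(lines):
    """Return (raw bytes, LZMA-compressed bytes) for the joined lines."""
    text = "\n".join(lines).encode("utf-8")
    return len(text), len(lzma.compress(text))


long_form = [keyed(e) for e in ENTRIES]
short_form = [keyed(e, SHORT) for e in ENTRIES]
plain_form = [positional(e) for e in ENTRIES]

for label, form in (("long", long_form), ("short", short_form),
                    ("plain", plain_form)):
    raw, packed = sizes(form)
    print("%-5s raw=%d bytes, compressed=%d bytes" % (label, raw, packed))
```

On three entries the compressed sizes are dominated by the container
overhead, so they say little.  The comparison only becomes meaningful when
run over a real database, where compression may well absorb most of the
gain from shorter keywords.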