Subject: bug#42162: Recovering source tarballs
To: zimoun
Date: Thu, 27 Aug 2020 20:06:51 +0200
From: Bengt Richter
Message-ID: <20200827180651.GA3255@LionPure>
In-Reply-To: <86lfi0e88r.fsf@gmail.com>
Reply-To: Bengt Richter
Cc: 42162@debbugs.gnu.org, Maurice Brémond

Hi,

On +2020-08-27 11:41:24 +0200, zimoun wrote:
> Hi,
>
> On Wed, 26 Aug 2020 at 17:11, Timothy Sample wrote:
> > zimoun writes:
> >
> >> One question is how this database scales?
> >>
> >> For example, a quick back-to-envelop estimation leads to ~1.2GB metadata
> >> for ~14k packages and then an increase of ~700MB per year, both with the
> >> Ludo’s code [1].
> >>
> >> [1]
> >
> > It’s a good question. A good part of the size comes from the
> > representation rather than the data. Compression helps a lot here. I
> > have a database of 3,912 packages. It’s 295M uncompressed (which is a
> > little better than your estimation). If I pass each file through Lzip,
> > it shrinks down to 60M. That’s more like 15.5K per package, which is
> > almost an order of magnitude smaller than the estimation you used
> > (120K). I think that makes the numbers rather pleasant, but it comes at
> > the expense of easy storing in Git.
>
> Thank you for these numbers. Really interesting!
>
> First, I do not know if the database needs to be stored with Git. What
> should be the advantage? (naive question :-))
>
>
> On SWH T2430 [1], you explain the “default-header” trick to cut down the
> size. Nice!
>
> Moreover, the format is a long list, e.g.,
>
> --8<---------------cut here---------------start------------->8---
> (headers

How about
          (X-v1-headers
(borrowing from RFC 2045 MIME usage indicating as-yet-not-a-formal-standard)

The idea is to make it easy to script the change to "(headers"
once there is consensus for declaring a new standard.

The "v1-" part could allow a simultaneous "(X-v2-headers" alternative
for zimoun's concise suggestion, or even a base64 of a compressed format.

There's lots that could be borrowed from the MIME RFCs :)

--8<---------------cut here---------------start------------->8---
6.3.  New Content-Transfer-Encodings

   Implementors may, if necessary, define private Content-Transfer-
   Encoding values, but must use an x-token, which is a name prefixed by
   "X-", to indicate its non-standard status, e.g., "Content-Transfer-
   Encoding: x-my-new-encoding".  Additional standardized Content-
   Transfer-Encoding values must be specified by a standards-track RFC.
   The requirements such specifications must meet are given in RFC 2048.
   As such, all content-transfer-encoding namespace except that
   beginning with "X-" is explicitly reserved to the IETF for future
   use.

   Unlike media types and subtypes, the creation of new Content-
   Transfer-Encoding values is STRONGLY discouraged, as it seems likely
   to hinder interoperability with little potential benefit
--8<---------------cut here---------------end--------------->8---

>  ((name "raptor2-2.0.15/")
>   (mode 493)

If you want to be more human-readable with mode, I would put a chmod
argument in place of 493 :)

--8<---------------cut here---------------start------------->8---
$ printf "%o\n" 493
755
$
--8<---------------cut here---------------end--------------->8---

Hm, could this be a security risk?? I mean, could a mode typo here
inadvertently open a door for a nasty mod by opportunistic code buried
in a later-executed, apparently unrelated app?
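
To make the mode question concrete, here is a minimal sketch in POSIX sh. The helper names (mode_octal, mode_decimal, mode_is_risky) are hypothetical, invented for illustration, and are not part of Guix or of the database tooling discussed in this thread. It converts between the decimal value stored in the metadata and the octal form chmod takes, and flags the kind of value a typo might produce: world-writable, setuid, or setgid.

```shell
#!/bin/sh
# Hypothetical helpers (assumptions, not from any existing tool).

# mode_octal DECIMAL: print the octal form chmod expects, e.g. 493 -> 755.
mode_octal() {
    printf '%o\n' "$1"
}

# mode_decimal OCTAL: print the decimal form stored in the metadata,
# e.g. 755 -> 493.  A leading 0 makes printf read the number as octal.
mode_decimal() {
    printf '%d\n' "0$1"
}

# mode_is_risky DECIMAL: succeed if the mode is world-writable (0002)
# or carries the setuid (04000) or setgid (02000) bit.
mode_is_risky() {
    [ $(( $1 & 06002 )) -ne 0 ]
}

mode_octal 493       # prints 755
mode_decimal 755     # prints 493
if mode_is_risky 493; then echo "493: risky"; else echo "493: ok"; fi
if mode_is_risky 511; then echo "511: risky"; else echo "511: ok"; fi
```

With these definitions, 493 (octal 755) is reported as ok and 511 (octal 777) as risky. Whether such a check belongs in the reader of the metadata or in a separate lint step is a design choice this sketch does not settle.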
>   (mtime 1414909500)

One of these might be more human-recognizable :)

--8<---------------cut here---------------start------------->8---
$ date --date='@1414909497' -Is
2014-11-02T07:24:57+01:00
$ date --date='@1414909497' -uIs
2014-11-02T06:24:57+00:00
$ TZ=America/Buenos_Aires date --date='@1414909497' -Is
2014-11-02T03:24:57-03:00
$
$ date --date='@1414909497' -u '+%Y%m%d_%H%M%S'
20141102_062457
# vs 1414909497, which, yes, costs 5 chars less
$
--8<---------------cut here---------------end--------------->8---

>   (chksum 4225)
>   (typeflag 53))
>  ((name "raptor2-2.0.15/build/")
>   (mode 493)
>   (mtime 1414909497)
>   (chksum 4797)
>   (typeflag 53))
>  ((name "raptor2-2.0.15/build/ltversion.m4")
>   (size 690)
>   (mtime 1414908273)
>   (chksum 5958))
>
>  […])
> --8<---------------cut here---------------end--------------->8---
>
> which is human-readable. Is it useful?
>
>
> Instead, one could imagine shorter keywords:
>
>  (X-v2-headers ;; ;-)
>   ((na "raptor2-2.0.15/")
>    (mo 493)
>    (mt 1414909500)
>    (ch 4225)
>    (ty 53))
>
> which using your database (commit fc50927) reduces from 295MB to 279MB.
>
> Or even plain list:
>
>  (X-v3-headers
>   (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
>   (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)
>
> where the first element provides the “type” of list to ease the reader.
>
>
> Well, the 2 naive questions are: does it make sense to
>  - have the database stored under Git?
>  - have an human-readable format?
>
>
> Thank you again for pushing forward this topic. :-)
>
> All the best,
> simon
>
> [1] https://forge.softwareheritage.org/T2430#47522
>

Prefixing "X-" can obviously be used with any tentative name for anything.

I am suggesting it as a counter to premature (and likely clashing)
bindings of valuable names, which IMO is as bad as premature
optimization :)

Naming is too important to be defined by first-user flag-planting, ISTM.

-- 
Regards,
Bengt Richter