From: zimoun <zimon.toutoune@gmail.com>
To: Timothy Sample <samplet@ngyro.com>
Cc: 42162@debbugs.gnu.org, "Maurice Brémond" <Maurice.Bremond@inria.fr>
Subject: bug#42162: Recovering source tarballs
Date: Thu, 27 Aug 2020 11:41:24 +0200
Message-ID: <86lfi0e88r.fsf@gmail.com>
In-Reply-To: <87k0xlaz8p.fsf@ngyro.com>
Hi,
On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:
>
>> One question is how this database scales?
>>
>> For example, a quick back-of-the-envelope estimate leads to ~1.2GB of
>> metadata for ~14k packages and then an increase of ~700MB per year,
>> both with Ludo’s code [1].
>>
>> [1] <http://issues.guix.gnu.org/issue/42162#11>
>
> It’s a good question. A good part of the size comes from the
> representation rather than the data. Compression helps a lot here. I
> have a database of 3,912 packages. It’s 295M uncompressed (which is a
> little better than your estimation). If I pass each file through Lzip,
> it shrinks down to 60M. That’s more like 15.5K per package, which is
> almost an order of magnitude smaller than the estimation you used
> (120K). I think that makes the numbers rather pleasant, but it comes at
> the expense of easy storing in Git.
Thank you for these numbers. Really interesting!
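To make the comparison concrete, here is a minimal sketch of the arithmetic, using only the figures quoted above (3,912 packages, 295M uncompressed, 60M after Lzip, ~14k packages in total). The function names are mine, the extrapolation assumes size grows linearly with the number of packages, and 1 MB is taken as 1000 KB, so the results are rough.

```python
# Figures quoted above: 3,912 packages, 295M uncompressed, 60M after Lzip.
PACKAGES_MEASURED = 3912
UNCOMPRESSED_MB = 295
LZIP_MB = 60
PACKAGES_TARGET = 14_000  # the "~14k packages" of the earlier estimate


def per_package_kb(total_mb: float, packages: int) -> float:
    """Average metadata size per package, in KB (1 MB taken as 1000 KB)."""
    return total_mb * 1000 / packages


def extrapolate_mb(total_mb: float, measured: int, target: int) -> float:
    """Scale a measured total linearly to a larger package count."""
    return total_mb * target / measured


if __name__ == "__main__":
    for label, total in (("uncompressed", UNCOMPRESSED_MB), ("lzip", LZIP_MB)):
        kb = per_package_kb(total, PACKAGES_MEASURED)
        mb = extrapolate_mb(total, PACKAGES_MEASURED, PACKAGES_TARGET)
        print(f"{label}: {kb:.1f} KB/package, "
              f"about {mb:.0f} MB for {PACKAGES_TARGET} packages")
```

Under these assumptions the uncompressed database lands a little above 1GB for ~14k packages, and the Lzip-compressed one a little above 200MB.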
First, I do not know whether the database needs to be stored with Git. What
would be the advantage? (naive question :-))
On SWH T2430 [1], you explain the “default-header” trick to cut down the
size. Nice!
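As I understand the trick (this is my reading, not a description of the actual implementation), each header stores only the fields that differ from a shared default header. A minimal sketch follows. The default values are hypothetical, picked so that the output has the same shape as the listing below: mode 420 is 0o644, mode 493 is 0o755, and typeflags 48 and 53 are the ASCII codes of '0' (regular file) and '5' (directory).

```python
# Assumed per-archive defaults (hypothetical values, not taken from the
# actual database): a regular file, mode 0o644, size 0.
DEFAULT_HEADER = {"mode": 0o644, "size": 0, "typeflag": ord("0")}


def compact(header, defaults=DEFAULT_HEADER):
    """Keep only the fields that differ from the defaults."""
    return {k: v for k, v in header.items()
            if k not in defaults or defaults[k] != v}


def expand(compacted, defaults=DEFAULT_HEADER):
    """Restore the full header by layering stored fields over the defaults."""
    return {**defaults, **compacted}


# Full headers, before compaction.
directory = {"name": "raptor2-2.0.15/", "mode": 0o755, "size": 0,
             "mtime": 1414909500, "chksum": 4225, "typeflag": ord("5")}
regular = {"name": "raptor2-2.0.15/build/ltversion.m4", "mode": 0o644,
           "size": 690, "mtime": 1414908273, "chksum": 5958,
           "typeflag": ord("0")}

if __name__ == "__main__":
    for full in (directory, regular):
        stored = compact(full)
        assert expand(stored) == full  # nothing is lost
        print(sorted(stored))
```

With these assumed defaults, the directory keeps its mode and typeflag while the regular file keeps only its size, which is the pattern visible in the entries below.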
Moreover, the format is a long list, e.g.,
--8<---------------cut here---------------start------------->8---
(headers
 ((name "raptor2-2.0.15/")
  (mode 493)
  (mtime 1414909500)
  (chksum 4225)
  (typeflag 53))
 ((name "raptor2-2.0.15/build/")
  (mode 493)
  (mtime 1414909497)
  (chksum 4797)
  (typeflag 53))
 ((name "raptor2-2.0.15/build/ltversion.m4")
  (size 690)
  (mtime 1414908273)
  (chksum 5958))
 […])
--8<---------------cut here---------------end--------------->8---
which is human-readable. Is it useful?
Instead, one could imagine shorter keywords:
((na "raptor2-2.0.15/")
 (mo 493)
 (mt 1414909500)
 (ch 4225)
 (ty 53))
which, using your database (commit fc50927), reduces the size from 295MB to 279MB.
Or even a plain list:
(\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
(\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)
where the first element gives the “type” of the list, to make parsing easier for the reader.
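A rough sketch of the three encodings, to compare their sizes on two entries from the listing above. Everything here is illustrative: the function names are mine, the short keyword "si" for size is an assumption (it does not appear in the example), and the layout table is just one way the leading tag could tell the reader which field each position holds.

```python
# Two entries from the listing above, as ordered field/value pairs.
ENTRIES = [
    {"name": "raptor2-2.0.15/", "mode": 493, "mtime": 1414909500,
     "chksum": 4225, "typeflag": 53},
    {"name": "raptor2-2.0.15/build/ltversion.m4", "size": 690,
     "mtime": 1414908273, "chksum": 5958},
]

# Two-letter keywords; "si" for size is an assumption (not in the example).
SHORT = {"name": "na", "mode": "mo", "mtime": "mt", "chksum": "ch",
         "typeflag": "ty", "size": "si"}

# Each known field layout gets a tag, so the reader knows how to
# interpret the positions.  The tag values mirror the example above.
LAYOUTS = {
    ("name", "mode", "mtime", "chksum", "typeflag"): "\\x00",
    ("name", "size", "mtime", "chksum"): "\\x01",
}


def atom(value):
    """Render a string with double quotes and a number as-is."""
    return f'"{value}"' if isinstance(value, str) else str(value)


def keyed(entry, rename=None):
    """Association-list form, optionally with renamed (shorter) keywords."""
    rename = rename or {}
    fields = (f"({rename.get(k, k)} {atom(v)})" for k, v in entry.items())
    return "(" + "\n ".join(fields) + ")"


def positional(entry):
    """Plain-list form: a layout tag followed by the bare values."""
    tag = LAYOUTS[tuple(entry)]
    return "(" + " ".join([tag] + [atom(v) for v in entry.values()]) + ")"


def decode(tag, values):
    """Recover field names for a plain list from its layout tag."""
    fields = next(f for f, t in LAYOUTS.items() if t == tag)
    return dict(zip(fields, values))


def total_bytes(render):
    """Size in bytes of all entries rendered with the given function."""
    return len("\n".join(render(e) for e in ENTRIES).encode("utf-8"))


if __name__ == "__main__":
    print("verbose keywords:", total_bytes(keyed))
    print("short keywords:  ", total_bytes(lambda e: keyed(e, SHORT)))
    print("positional:      ", total_bytes(positional))
```

The price of the plain list is that the layout table has to be kept somewhere and agreed upon by writer and reader, whereas the keyed forms describe themselves.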
Well, the two naive questions are: does it make sense to
- have the database stored under Git?
- have a human-readable format?
Thank you again for pushing forward this topic. :-)
All the best,
simon
[1] https://forge.softwareheritage.org/T2430#47522