From: zimoun <zimon.toutoune@gmail.com>
To: Timothy Sample <samplet@ngyro.com>
Cc: 42162@debbugs.gnu.org, "Maurice Brémond" <Maurice.Bremond@inria.fr>
Subject: bug#42162: Recovering source tarballs
Date: Thu, 27 Aug 2020 11:41:24 +0200
Message-ID: <86lfi0e88r.fsf@gmail.com>
In-Reply-To: <87k0xlaz8p.fsf@ngyro.com>

Hi,

On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:
>
>> One question is how this database scales?
>>
>> For example, a quick back-of-the-envelope estimate gives ~1.2GB of
>> metadata for ~14k packages, then an increase of ~700MB per year, both
>> with Ludo’s code [1].
>>
>> [1] <http://issues.guix.gnu.org/issue/42162#11>
>
> It’s a good question.  A good part of the size comes from the
> representation rather than the data.  Compression helps a lot here.  I
> have a database of 3,912 packages.  It’s 295M uncompressed (which is a
> little better than your estimation).  If I pass each file through Lzip,
> it shrinks down to 60M.  That’s more like 15.5K per package, which is
> almost an order of magnitude smaller than the estimation you used
> (120K).  I think that makes the numbers rather pleasant, but it comes at
> the expense of easy storing in Git.

Thank you for these numbers.  Really interesting!
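
As a rough check of the figures quoted above, the per-package sizes and
a naive scaling to ~14k packages can be worked out as below.  This is
only a sketch: it assumes “M” means 10^6 bytes and “K” means 10^3 bytes
(the units are not stated precisely), and that size grows linearly with
the number of packages.

```python
# Figures quoted in the thread; units assumed decimal (1M = 10**6 bytes).
PACKAGES = 3_912
UNCOMPRESSED = 295e6   # bytes, whole database
LZIPPED = 60e6         # bytes, each file passed through Lzip


def per_package(total_bytes, packages=PACKAGES):
    """Average metadata size per package, in bytes."""
    return total_bytes / packages


def extrapolate(total_bytes, target_packages, packages=PACKAGES):
    """Naive linear scaling to a larger package set (an assumption)."""
    return per_package(total_bytes, packages) * target_packages


raw_kb = per_package(UNCOMPRESSED) / 1e3
lz_kb = per_package(LZIPPED) / 1e3
print(f"uncompressed: {raw_kb:.1f}K per package")
print(f"lzipped:      {lz_kb:.1f}K per package")
print(f"14k packages: {extrapolate(UNCOMPRESSED, 14_000) / 1e9:.2f}GB raw, "
      f"{extrapolate(LZIPPED, 14_000) / 1e6:.0f}MB lzipped")
```

Under those assumptions this gives about 75K per package uncompressed
and about 15K lzipped, so roughly 1GB raw or a bit over 200MB lzipped
for 14k packages.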

First, I do not know whether the database needs to be stored in Git.
What would be the advantage? (naive question :-))


On SWH T2430 [1], you explain the “default-header” trick to cut down the
size.  Nice!

Moreover, the format is a long list, e.g.,

--8<---------------cut here---------------start------------->8---
(headers
    ((name "raptor2-2.0.15/")
     (mode 493)
     (mtime 1414909500)
     (chksum 4225)
     (typeflag 53))
    ((name "raptor2-2.0.15/build/")
     (mode 493)
     (mtime 1414909497)
     (chksum 4797)
     (typeflag 53))
    ((name "raptor2-2.0.15/build/ltversion.m4")
     (size 690)
     (mtime 1414908273)
     (chksum 5958))

     […])
--8<---------------cut here---------------end--------------->8---

which is human-readable.  Is it useful?


Instead, one could imagine shorter keywords:

    ((na "raptor2-2.0.15/")
     (mo 493)
     (mt 1414909500)
     (ch 4225)
     (ty 53))

which, using your database (commit fc50927), reduces the size from 295MB
to 279MB.
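
To illustrate the effect on a tiny sample, here is a sketch that renders
two entries from the excerpt above with long and short keywords and
compares the lengths.  The serialiser is a hypothetical stand-in, not
the code used for the database, and the short keyword “si” for “size”
is a guess since it does not appear above.

```python
# Sample entries copied from the excerpt above.
HEADERS = [
    [("name", "raptor2-2.0.15/"), ("mode", 493), ("mtime", 1414909500),
     ("chksum", 4225), ("typeflag", 53)],
    [("name", "raptor2-2.0.15/build/ltversion.m4"), ("size", 690),
     ("mtime", 1414908273), ("chksum", 5958)],
]

# Two-letter keywords; "si" for size is an assumption, not from the mail.
SHORT = {"name": "na", "mode": "mo", "mtime": "mt",
         "chksum": "ch", "typeflag": "ty", "size": "si"}


def atom(value):
    """Render a string with double quotes, anything else via str()."""
    if isinstance(value, str):
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return str(value)


def render(headers, rename=None):
    """Serialise headers as one S-expression per entry."""
    rename = rename or {}
    lines = []
    for entry in headers:
        fields = " ".join(f"({rename.get(k, k)} {atom(v)})" for k, v in entry)
        lines.append(f"({fields})")
    return "\n".join(lines)


long_form = render(HEADERS)
short_form = render(HEADERS, SHORT)
saved = len(long_form) - len(short_form)
print(short_form)
print(f"{len(long_form)} -> {len(short_form)} characters ({saved} saved)")
```

The saving per entry is fixed by the keywords, so the relative gain
shrinks as file names get longer, which is consistent with the modest
reduction measured on the real database.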

Or even a plain list:

   (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
   (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)

where the first element gives the “type” of the list, so the reader
knows which fields to expect and in which order.
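
A minimal sketch of how such tagged lists could be written and read
back.  The layout table is hypothetical: tags 0 and 1 simply mirror the
two example lists above, and a real format would need one tag per
combination of fields that actually occurs.

```python
# Each tag names a fixed field order.  Tags 0 and 1 mirror the two example
# lists in the mail; the table itself is a hypothetical illustration.
LAYOUTS = {
    0: ("name", "mode", "mtime", "chksum", "typeflag"),
    1: ("name", "size", "mtime", "chksum"),
}
TAG_FOR = {frozenset(fields): tag for tag, fields in LAYOUTS.items()}


def encode(header):
    """Turn a keyword header (dict) into a tagged positional list."""
    tag = TAG_FOR[frozenset(header)]  # KeyError if no layout matches
    return [tag] + [header[field] for field in LAYOUTS[tag]]


def decode(record):
    """Rebuild the keyword header from a tagged positional list."""
    tag, *values = record
    fields = LAYOUTS[tag]
    if len(values) != len(fields):
        raise ValueError(f"tag {tag} expects {len(fields)} values")
    return dict(zip(fields, values))


directory = {"name": "raptor2-2.0.15/", "mode": 493, "mtime": 1414909500,
             "chksum": 4225, "typeflag": 53}
regular = {"name": "raptor2-2.0.15/build/ltversion.m4", "size": 690,
           "mtime": 1414908273, "chksum": 5958}

for header in (directory, regular):
    record = encode(header)
    assert decode(record) == header  # round trip loses nothing
    print(record)
```

The trade-off is that the records are no longer self-describing: anyone
reading the file needs the layout table to interpret them.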


Well, the two naive questions are: does it make sense to
 - have the database stored in Git?
 - have a human-readable format?


Thank you again for pushing forward this topic. :-)

All the best,
simon

[1] https://forge.softwareheritage.org/T2430#47522




