From: Bengt Richter <bokr@bokr.com>
To: zimoun <zimon.toutoune@gmail.com>
Cc: 42162@debbugs.gnu.org, "Maurice Brémond" <Maurice.Bremond@inria.fr>
Subject: bug#42162: Recovering source tarballs
Date: Thu, 27 Aug 2020 20:06:51 +0200
Message-ID: <20200827180651.GA3255@LionPure>
In-Reply-To: <86lfi0e88r.fsf@gmail.com>
Hi,
On +2020-08-27 11:41:24 +0200, zimoun wrote:
> Hi,
>
> On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote:
> > zimoun <zimon.toutoune@gmail.com> writes:
> >
> >> One question is how this database scales?
> >>
> >> For example, a quick back-of-the-envelope estimation leads to ~1.2GB metadata
> >> for ~14k packages and then an increase of ~700MB per year, both with the
> >> Ludo’s code [1].
> >>
> >> [1] <http://issues.guix.gnu.org/issue/42162#11>
> >
> > It’s a good question. A good part of the size comes from the
> > representation rather than the data. Compression helps a lot here. I
> > have a database of 3,912 packages. It’s 295M uncompressed (which is a
> > little better than your estimation). If I pass each file through Lzip,
> > it shrinks down to 60M. That’s more like 15.5K per package, which is
> > almost an order of magnitude smaller than the estimation you used
> > (120K). I think that makes the numbers rather pleasant, but it comes at
> > the expense of easy storing in Git.
>
> Thank you for these numbers. Really interesting!
>
> First, I do not know if the database needs to be stored with Git. What
> should be the advantage? (naive question :-))
>
>
> On SWH T2430 [1], you explain the “default-header” trick to cut down the
> size. Nice!
>
> Moreover, the format is a long list, e.g.,
>
> --8<---------------cut here---------------start------------->8---
> (headers
How about
(X-v1-headers
(borrowing the RFC 2045 MIME convention of an "X-" prefix for names that are
not yet a formal standard)?
The idea is to make it easy to script the change to "(headers" once
there is consensus for declaring a new standard. The "v1-" part could allow
a simultaneous "(X-v2-headers" alternative for zimoun's concise suggestion,
or even a base64 of a compressed format. There's lots that could be borrowed
from the MIME RFCs :)
--8<---------------cut here---------------start------------->8---
6.3. New Content-Transfer-Encodings
Implementors may, if necessary, define private Content-Transfer-
Encoding values, but must use an x-token, which is a name prefixed by
"X-", to indicate its non-standard status, e.g., "Content-Transfer-
Encoding: x-my-new-encoding". Additional standardized Content-
Transfer-Encoding values must be specified by a standards-track RFC.
The requirements such specifications must meet are given in RFC 2048.
As such, all content-transfer-encoding namespace except that
beginning with "X-" is explicitly reserved to the IETF for future
use.
Unlike media types and subtypes, the creation of new Content-
Transfer-Encoding values is STRONGLY discouraged, as it seems likely
to hinder interoperability with little potential benefit.
--8<---------------cut here---------------end--------------->8---
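Coming back to scripting the change to "(headers": a minimal Python sketch of what I have in mind, assuming each metadata file opens with the provisional tag. The function name promote_tag and the sample text are hypothetical, just for illustration, and this is a plain text substitution, not a real s-expression parser.

```python
import re

def promote_tag(text, provisional="X-v1-headers", final="headers"):
    """Rename the provisional tag to its final name.

    Only an opening parenthesis followed by the exact provisional
    tag is rewritten, so "(X-v2-headers" is left untouched.
    """
    # The lookahead requires whitespace or a parenthesis after the tag,
    # so a longer tag that merely starts the same way is not matched.
    pattern = r"\(" + re.escape(provisional) + r"(?=[\s()])"
    return re.sub(pattern, "(" + final, text)

# Hypothetical sample in the shape quoted in this thread.
sample = '(X-v1-headers\n  ((name "raptor2-2.0.15/")\n   (mode 493)))'
print(promote_tag(sample))
```

Since it works on raw text, it would also rewrite the tag if it ever showed up inside a quoted file name; a real tool would probably want to read the s-expression properly.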
> ((name "raptor2-2.0.15/")
> (mode 493)
If you want to be more human-readable with mode, I would put
a chmod argument in place of 493 :)
--8<---------------cut here---------------start------------->8---
$ printf "%o\n" 493
755
$
--8<---------------cut here---------------end--------------->8---
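The same conversion can be sketched in Python, which also gives the symbolic form that ls shows. This assumes the mode field holds only the permission bits as a decimal integer; the helper name describe_mode and its is_dir flag are made up for the example.

```python
import stat

def describe_mode(mode, is_dir=False):
    """Return (octal string, symbolic string) for a decimal mode value."""
    # stat.filemode() wants the file-type bits too, so add them here.
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    return format(mode, "o"), stat.filemode(file_type | mode)

print(describe_mode(493, is_dir=True))  # ('755', 'drwxr-xr-x')
```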
Hm, could this be a security risk??
I mean, could a mode typo here inadvertently open a door for a nasty mod
by opportunistic code buried in a later-executed, apparently unrelated app?
> (mtime 1414909500)
One of these might be more human-recognizable :)
--8<---------------cut here---------------start------------->8---
$ date --date='@1414909497' -Is
2014-11-02T07:24:57+01:00
$ date --date='@1414909497' -uIs
2014-11-02T06:24:57+00:00
$ TZ=America/Buenos_Aires date --date='@1414909497' -Is
2014-11-02T03:24:57-03:00
$
$ date --date='@1414909497' -u '+%Y%m%d_%H%M%S'
20141102_062457
# vs 1414909497, which, yes, costs 5 chars less
$
--8<---------------cut here---------------end--------------->8---
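If a readable timestamp were stored, it would have to convert back to the integer mtime without loss. A minimal Python sketch of that round trip, assuming mtime is whole seconds since the Unix epoch; the two function names are hypothetical.

```python
from datetime import datetime, timezone

def mtime_to_iso(mtime):
    """Format an epoch mtime as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()

def iso_to_mtime(text):
    """Parse an ISO 8601 string with a UTC offset back to an epoch mtime."""
    return int(datetime.fromisoformat(text).timestamp())

print(mtime_to_iso(1414909497))                    # 2014-11-02T06:24:57+00:00
print(iso_to_mtime("2014-11-02T07:24:57+01:00"))   # 1414909497
```

Whatever the local offset of the string, it maps back to the same integer, so the readable form loses nothing as long as the offset is written out.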
> (chksum 4225)
> (typeflag 53))
> ((name "raptor2-2.0.15/build/")
> (mode 493)
> (mtime 1414909497)
> (chksum 4797)
> (typeflag 53))
> ((name "raptor2-2.0.15/build/ltversion.m4")
> (size 690)
> (mtime 1414908273)
> (chksum 5958))
>
> […])
> --8<---------------cut here---------------end--------------->8---
>
> which is human-readable. Is it useful?
>
>
> Instead, one could imagine shorter keywords:
>
(X-v2-headers ;; ;-)
> ((na "raptor2-2.0.15/")
> (mo 493)
> (mt 1414909500)
> (ch 4225)
> (ty 53))
>
> which using your database (commit fc50927) reduces from 295MB to 279MB.
>
> Or even plain list:
>
(X-v3-headers
> (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
> (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)
>
> where the first element provides the “type” of list to ease the reader.
>
>
> Well, the 2 naive questions are: does it make sense to
> - have the database stored under Git?
> - have a human-readable format?
>
>
> Thank you again for pushing forward this topic. :-)
>
> All the best,
> simon
>
> [1] https://forge.softwareheritage.org/T2430#47522
>
>
>
Prefixing "X-" can obviously be used with any tentative name for anything.
I am suggesting it as a counter to premature (and likely clashing) bindings
of valuable names, which IMO is as bad as premature optimization :)
Naming is too important to be defined by first-user flag-planting, ISTM.
--
Regards,
Bengt Richter