unofficial mirror of bug-guix@gnu.org 
From: Bengt Richter <bokr@bokr.com>
To: zimoun <zimon.toutoune@gmail.com>
Cc: 42162@debbugs.gnu.org, "Maurice Brémond" <Maurice.Bremond@inria.fr>
Subject: bug#42162: Recovering source tarballs
Date: Thu, 27 Aug 2020 20:06:51 +0200
Message-ID: <20200827180651.GA3255@LionPure>
In-Reply-To: <86lfi0e88r.fsf@gmail.com>

Hi,

On +2020-08-27 11:41:24 +0200, zimoun wrote:
> Hi,
> 
> On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote:
> > zimoun <zimon.toutoune@gmail.com> writes:
> >
> >> One question is how this database scales?
> >>
> >> For example, a quick back-of-the-envelope estimate leads to ~1.2GB of
> >> metadata for ~14k packages and then an increase of ~700MB per year, both
> >> with Ludo’s code [1].
> >>
> >> [1] <http://issues.guix.gnu.org/issue/42162#11>
> >
> > It’s a good question.  A good part of the size comes from the
> > representation rather than the data.  Compression helps a lot here.  I
> > have a database of 3,912 packages.  It’s 295M uncompressed (which is a
> > little better than your estimation).  If I pass each file through Lzip,
> > it shrinks down to 60M.  That’s more like 15.5K per package, which is
> > almost an order of magnitude smaller than the estimation you used
> > (120K).  I think that makes the numbers rather pleasant, but it comes at
> > the expense of easy storing in Git.
> 
> Thank you for these numbers.  Really interesting!
> 
> First, I do not know if the database needs to be stored with Git.  What
> should be the advantage? (naive question :-))
> 
> 
> On SWH T2430 [1], you explain the “default-header” trick to cut down the
> size.  Nice!
> 
> Moreover, the format is a long list, e.g.,
> 
> --8<---------------cut here---------------start------------->8---
> (headers

How about
    (X-v1-headers
(borrowing the "X-" prefix that RFC 2045 uses in MIME to mark a name as
not yet a formal standard)?
The idea is to make it easy to script the change to "(headers" once
there is consensus for declaring a new standard. The "v1-" part could allow
a simultaneous "(X-v2-headers" alternative for zimoun's concise suggestion,
or even a base64 of a compressed format. There's lots that could be borrowed
from the MIME RFCs :)

--8<---------------cut here---------------start------------->8---
6.3.  New Content-Transfer-Encodings

   Implementors may, if necessary, define private Content-Transfer-
   Encoding values, but must use an x-token, which is a name prefixed by
   "X-", to indicate its non-standard status, e.g., "Content-Transfer-
   Encoding: x-my-new-encoding".  Additional standardized Content-
   Transfer-Encoding values must be specified by a standards-track RFC.
   The requirements such specifications must meet are given in RFC 2048.
   As such, all content-transfer-encoding namespace except that
   beginning with "X-" is explicitly reserved to the IETF for future
   use.

   Unlike media types and subtypes, the creation of new Content-
   Transfer-Encoding values is STRONGLY discouraged, as it seems likely
   to hinder interoperability with little potential benefit
--8<---------------cut here---------------end--------------->8---
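
To illustrate how cheaply such a provisional tag could be promoted later,
here is a minimal Python sketch. The helper name promote_headers and the
sample text are made up for illustration; nothing like it exists in the
database tooling as far as I know. Note that a plain regex would also
rewrite a match inside a quoted file name, so a real tool would want a
proper S-expression reader.

```python
import re

# Hypothetical helper: matches a provisional "(X-vN-headers" tag.
PROVISIONAL_TAG = re.compile(r"\(X-v(\d+)-headers\b")

def promote_headers(text: str, version: int = 1) -> str:
    """Rewrite "(X-v<version>-headers" to "(headers"; leave other versions alone."""
    def swap(match: re.Match) -> str:
        return "(headers" if int(match.group(1)) == version else match.group(0)
    return PROVISIONAL_TAG.sub(swap, text)

# Sample input modelled on the quoted listing.
sample = '(X-v1-headers\n  ((name "raptor2-2.0.15/")\n   (mode 493)))'
print(promote_headers(sample))
```

Because the version number is part of the tag, a v2 or v3 file is left
untouched when only v1 is promoted.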

>     ((name "raptor2-2.0.15/")
>      (mode 493)
If you want mode to be more human-readable, I would put the octal
chmod argument (755) in place of 493 :)

--8<---------------cut here---------------start------------->8---
$ printf "%o\n" 493
755
$ 
--8<---------------cut here---------------end--------------->8---

Hm, could this be a security risk??
I mean, could a mode typo here inadvertently open a door for a nasty mod
by opportunistic code buried in a later-executed, apparently unrelated app?
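
Both points (a readable spelling of the mode, and catching a suspicious
one) can be sketched in a few lines of Python. The function names
describe_mode and risky_bits are hypothetical, and the list of bits worth
flagging is only my assumption about what a reviewer would care about.

```python
import stat

def describe_mode(mode: int, is_dir: bool = False) -> str:
    """Return the octal and symbolic spellings of a decimal mode such as 493."""
    kind = stat.S_IFDIR if is_dir else stat.S_IFREG
    return f"{mode:o} {stat.filemode(kind | mode)}"

def risky_bits(mode: int) -> list:
    """Name permission bits that probably deserve a second look (assumed list)."""
    checks = [
        (stat.S_ISUID, "setuid"),
        (stat.S_ISGID, "setgid"),
        (stat.S_IWOTH, "world-writable"),
    ]
    return [label for bit, label in checks if mode & bit]

print(describe_mode(493, is_dir=True))  # mode of the quoted directory entry
print(risky_bits(493))                  # nothing flagged for 755
print(risky_bits(0o4757))               # a made-up typo that sets extra bits
```

A check like this does not answer the security question, but it shows
that a stray bit is much easier to spot in octal or symbolic form than
in decimal.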

>      (mtime 1414909500)
One of these might be more human-recognizable :)
--8<---------------cut here---------------start------------->8---
$ date --date='@1414909497' -Is
2014-11-02T07:24:57+01:00
$ date --date='@1414909497' -uIs
2014-11-02T06:24:57+00:00
$ TZ=America/Buenos_Aires date --date='@1414909497' -Is
2014-11-02T03:24:57-03:00
$
$ date --date='@1414909497' -u '+%Y%m%d_%H%M%S'
20141102_062457
# vs 1414909497, which, yes, is 5 chars shorter
$ 
--8<---------------cut here---------------end--------------->8---
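
The same conversion, and the way back to the integer, in a short Python
sketch using only the standard library. The compact format string is the
one from the last date(1) call above; treating it as UTC on the way back
is an assumption a real format would have to state explicitly.

```python
from datetime import datetime, timezone

mtime = 1414909497  # the mtime used in the date(1) examples above

stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
iso = stamp.isoformat()                     # ISO 8601 with explicit UTC offset
compact = stamp.strftime("%Y%m%d_%H%M%S")   # same layout as the last example

# Round trip: parse the compact form as UTC and recover the integer.
restored = int(
    datetime.strptime(compact, "%Y%m%d_%H%M%S")
    .replace(tzinfo=timezone.utc)
    .timestamp()
)

print(iso)
print(compact, restored, len(compact) - len(str(mtime)))
```

The round trip loses nothing at one-second resolution, so the choice
between the two spellings is purely about readability versus size.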

>      (chksum 4225)
>      (typeflag 53))
>     ((name "raptor2-2.0.15/build/")
>      (mode 493)
>      (mtime 1414909497)
>      (chksum 4797)
>      (typeflag 53))
>     ((name "raptor2-2.0.15/build/ltversion.m4")
>      (size 690)
>      (mtime 1414908273)
>      (chksum 5958))
> 
>      […])
> --8<---------------cut here---------------end--------------->8---
> 
> which is human-readable.  Is it useful?
> 
> 
> Instead, one could imagine shorter keywords:
>
(X-v2-headers  ;; ;-)
>     ((na "raptor2-2.0.15/")
>      (mo 493)
>      (mt 1414909500)
>      (ch 4225)
>      (ty 53))
> 
> which using your database (commit fc50927) reduces from 295MB to 279MB.
> 
> Or even plain list:
>
(X-v3-headers
>    (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
>    (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)
> 
> where the first element provides the “type” of list to ease the reader.
> 
> 
> Well, the 2 naive questions are: does it make sense to
>  - have the database stored under Git?
>  - have a human-readable format?
> 
> 
> Thank you again for pushing forward this topic. :-)
> 
> All the best,
> simon
> 
> [1] https://forge.softwareheritage.org/T2430#47522
> 
> 
> 
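
To make the three spellings easy to compare side by side, here is a
minimal Python sketch that renders one entry from the quoted listing in
each form. The function names are hypothetical, strings are assumed to
need no escaping, and everything is put on one line, so the lengths only
hint at the trend and do not reproduce the 295MB versus 279MB figures.

```python
# One entry from the quoted listing, as (field, value) pairs.
entry = [("name", "raptor2-2.0.15/"), ("mode", 493),
         ("mtime", 1414909500), ("chksum", 4225), ("typeflag", 53)]

def atom(value):
    """Render a value as an S-expression atom (quoted string or integer)."""
    return f'"{value}"' if isinstance(value, str) else str(value)

def verbose(fields):
    # Current form: full keyword for every field.
    return "(" + " ".join(f"({k} {atom(v)})" for k, v in fields) + ")"

def short_keys(fields):
    # "v2" idea: keep only the first two letters of each keyword.
    return "(" + " ".join(f"({k[:2]} {atom(v)})" for k, v in fields) + ")"

def positional(fields, tag="\\x00"):
    # "v3" idea: a leading tag names the record shape, values follow in order.
    return "(" + " ".join([tag] + [atom(v) for _, v in fields]) + ")"

for render in (verbose, short_keys, positional):
    line = render(entry)
    print(f"{len(line):3d}  {line}")
```

The positional form is the shortest, but a reader then has to know the
field order for each tag, which is exactly the kind of convention a
version prefix would help to pin down.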

Prefixing "X-" can obviously be used with any tentative name for anything.

I am suggesting it as a counter to premature (and likely clashing) bindings
of valuable names, which IMO is as bad as premature optimization :)

Naming is too important to be defined by first-user flag-planting, ISTM.
-- 
Regards,
Bengt Richter




