all messages for Guix-related lists mirrored at yhetil.org
From: zimoun <zimon.toutoune@gmail.com>
To: "Timothy Sample" <samplet@ngyro.com>, "Ludovic Courtès" <ludo@gnu.org>
Cc: 42162@debbugs.gnu.org, "Maurice Brémond" <Maurice.Bremond@inria.fr>
Subject: bug#42162: Recovering source tarballs
Date: Wed, 26 Aug 2020 12:04:55 +0200
Message-ID: <86blixyb7c.fsf@gmail.com>
In-Reply-To: <875za4ykej.fsf@ngyro.com>

Dear Timothy,

On Thu, 30 Jul 2020 at 13:36, Timothy Sample <samplet@ngyro.com> wrote:

> I call the thing “Disarchive” as in “disassemble a source code archive”.
> You can find it at <https://git.ngyro.com/disarchive/>.  It has a simple
> command-line interface so you can do
>
>     $ disarchive save software-1.0.tar.gz
>
> which serializes a disassembled version of “software-1.0.tar.gz” to the
> database (which is just a directory) specified by the “DISARCHIVE_DB”
> environment variable.  Next, you can run
>
>     $ disarchive load hash-of-something-in-the-db
>
> which will recover an original file from its metadata (stored in the
> database) and data retrieved from the SWH archive or taken from a cache
> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Really nice!  Thank you!


>> I think we’d have to maintain a database that maps tarball hashes to
>> metadata (!).  A simple version of it could be a Git repo where, say,
>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
>> contain the metadata above.  The nice thing is that the Git repo itself
>> could be archived by SWH.  :-)
>
> You mean like <https://git.ngyro.com/disarchive-db/>?  :)

[...]
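
To make the layout discussed above concrete, here is a minimal sketch
of a content-addressed metadata directory keyed by the tarball's
SHA-256.  It is not Disarchive's actual code.  The function names are
hypothetical, and the digest is written in hexadecimal for simplicity,
whereas the path quoted above uses Guix's base32 encoding.

```python
import hashlib
from pathlib import Path


def metadata_path(db_root, tarball_bytes):
    """Return where the metadata for this tarball would live in the database."""
    # Assumption: hex digest; the quoted example path uses Guix's base32 instead.
    digest = hashlib.sha256(tarball_bytes).hexdigest()
    return Path(db_root) / "sha256" / digest


def store_metadata(db_root, tarball_bytes, metadata):
    """Write the metadata text under its content-addressed path."""
    path = metadata_path(db_root, tarball_bytes)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(metadata, encoding="utf-8")
    return path


def lookup_metadata(db_root, tarball_bytes):
    """Return the stored metadata, or None if this tarball is unknown."""
    path = metadata_path(db_root, tarball_bytes)
    return path.read_text(encoding="utf-8") if path.exists() else None
```

Because the database is just a directory of small text files, it can
be committed to a Git repository as-is, which is what makes archiving
it in SWH straightforward.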

> This was generated by a little script built on top of “fold-packages”.
> It downloads Gzip’d tarballs used by Guix packages and passes them on to
> Disarchive for disassembly.  I limited the number to 100 because it’s
> slow and because I’m sure there is a long tail of weird software
> archives that are going to be hard to process.  The metadata directory
> ended up being 13M and the directory cache 2G.

One open question is how this database scales.

For example, a quick back-of-the-envelope estimate gives ~1.2GB of
metadata for ~14k packages, then growth of ~700MB per year, both based
on Ludo’s code [1].

[1] <http://issues.guix.gnu.org/issue/42162#11>
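
The arithmetic behind these figures can be sketched as follows.  This
is only a rough illustration: it assumes decimal units (1GB = 1000MB),
reads the quoted "13M" as 13MB, and assumes linear scaling.  The two
sets of numbers come from different metadata formats (Ludo’s code
versus Disarchive), so the comparison is indicative at best.  The
function names are hypothetical.

```python
def per_item_kb(total_mb, items):
    """Average metadata size per item, in kB (decimal units assumed)."""
    return total_mb * 1000 / items


def extrapolate_mb(kb_per_item, items):
    """Total size in MB if every item costs kb_per_item (linear scaling assumed)."""
    return kb_per_item * items / 1000


def projected_mb(initial_mb, growth_mb_per_year, years):
    """Size after a number of years of constant growth."""
    return initial_mb + growth_mb_per_year * years


# Figures from this message: ~1.2GB for ~14k packages, ~700MB per year.
estimate_kb = per_item_kb(1200, 14_000)

# Figures from the quoted Disarchive sample: 13M of metadata for 100 tarballs.
disarchive_kb = per_item_kb(13, 100)

if __name__ == "__main__":
    print(f"estimate: {estimate_kb:.1f} kB per package")
    print(f"Disarchive sample: {disarchive_kb:.1f} kB per tarball")
    print(f"Disarchive at 14k: {extrapolate_mb(disarchive_kb, 14_000):.0f} MB")
    print(f"estimate after 5 years: {projected_mb(1200, 700, 5)} MB")
```

Under these assumptions both approaches land in the same order of
magnitude, roughly 0.1MB of metadata per source archive.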



> I could remove most of the Guix stuff so that it would be easy to
> package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
> that consumes a “sources.json” file, adds the sources to a Disarchive
> database, and pushes everything to a Git repo.  I guess everyone who
> cares has to produce a “sources.json” file anyway, so it will be very
> little extra work.  Other stuff like changing the serialization format
> to JSON would be pretty easy, too.  I’m not well connected to these
> other projects, mind you, so I’m not really sure how to reach out.

This service could be really useful.  Yes, it could be easy to update
the database each time Guix produces a new “sources.json”.
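
As a rough sketch of the incremental part of such a service, the code
below picks out the Gzip’d tarball URLs that have not been processed
yet and builds the corresponding “disarchive save” invocation.  The
“sources.json” schema used here (a “sources” list whose entries have a
“type” and a “urls” list) is an assumption, and the function names are
hypothetical.  Downloading and pushing to Git are left out.

```python
import json


def new_gzip_tarball_urls(sources_json, seen):
    """Return Gzip'd tarball URLs from sources.json not already in `seen`."""
    doc = json.loads(sources_json)
    urls = []
    for entry in doc.get("sources", []):
        # Assumed schema: {"type": "url", "urls": [...]}; skip VCS checkouts.
        if entry.get("type") != "url":
            continue
        candidates = entry.get("urls", [])
        if not candidates:
            continue
        url = candidates[0]  # only consider the first URL of each entry
        if url.endswith((".tar.gz", ".tgz")) and url not in seen:
            urls.append(url)
    return urls


def save_command(local_path):
    """Build the command line that disassembles one downloaded tarball."""
    return ["disarchive", "save", local_path]
```

Keeping the set of already-processed URLs between runs means each new
“sources.json” only costs the work for the tarballs added since the
previous one.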

As mentioned in [2], should this service be part of SWH (as a download
cooking task), or should it live on the project side?

[2] <https://forge.softwareheritage.org/T2430#47486>


Thank you again for this piece of work.

All the best,
simon




