unofficial mirror of bug-guix@gnu.org 
From: Timothy Sample <samplet@ngyro.com>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 42162@debbugs.gnu.org, "Maurice Brémond" <Maurice.Bremond@inria.fr>
Subject: bug#42162: Recovering source tarballs
Date: Thu, 30 Jul 2020 13:36:52 -0400
Message-ID: <875za4ykej.fsf@ngyro.com>
In-Reply-To: <87mu4iv0gc.fsf@inria.fr>

Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Ludovic Courtès <ludo@gnu.org> skribis:
>
> [...]
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree
>
> After all, tarballs are byproducts and should be no exception: we should
> build them from source.  :-)
>
> In <https://forge.softwareheritage.org/T2430>, Stefano mentioned
> pristine-tar, which does almost that, but not quite: it stores a binary
> delta between a tarball and a tree:
>
>   https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html
>
> I think we should have something more transparent than a binary delta.
>
> The code below can “disassemble” and “assemble” a tar.  When it
> disassembles it, it generates metadata like this:
>
> (tar-source
>   (version 0)
>   (headers
>     (("guile-3.0.4/"
>       (mode 493)
>       (size 0)
>       (mtime 1593007723)
>       (chksum 3979)
>       (typeflag #\5))
>      ("guile-3.0.4/m4/"
>       (mode 493)
>       (size 0)
>       (mtime 1593007720)
>       (chksum 4184)
>       (typeflag #\5))
>      ("guile-3.0.4/m4/pipe2.m4"
>       (mode 420)
>       (size 531)
>       (mtime 1536050419)
>       (chksum 4812)
>       (hash (sha256
>               "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
>      ("guile-3.0.4/m4/time_h.m4"
>       (mode 420)
>       (size 5471)
>       (mtime 1536050419)
>       (chksum 4974)
>       (hash (sha256
>               "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
> […]
>
> The ’assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…
>
> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:
>
>   https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree—a Git-tree hash—since that would
> allow us to retrieve the whole thing in a single HTTP request.
>
> Besides, we’ll also have to handle compression: storing gzip/xz headers
> and compression levels.

This jumped out at me because I have been working with compression and
tarballs for the bootstrapping effort.  I started pulling some threads
and doing some research, and ended up prototyping an end-to-end solution
for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata,
and an SWH directory ID.  It can even put them back together!  :)  There
are a bunch of problems still, but I think this project is doable in the
short term.  I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and
found and fixed a bunch of little gaffes.  There’s a ton of work to do,
of course, but here’s another small step.

I call the thing “Disarchive” as in “disassemble a source code archive”.
You can find it at <https://git.ngyro.com/disarchive/>.  It has a simple
command-line interface so you can do

    $ disarchive save software-1.0.tar.gz

which serializes a disassembled version of “software-1.0.tar.gz” to the
database (which is just a directory) specified by the “DISARCHIVE_DB”
environment variable.  Next, you can run

    $ disarchive load hash-of-something-in-the-db

which will recover an original file from its metadata (stored in the
database) and data retrieved from the SWH archive or taken from a cache
(again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Now some implementation details.  The way I’ve set it up is that all of
the assembly happens through Guix.  Each step in recreating a compressed
tarball is a fixed-output derivation: the download from SWH, the
creation of the tarball, and the compression.  I wanted an easy way to
build and verify things according to a dependency graph without writing
any code.  Hi Guix Daemon!  I’m not sure if this is a good long-term
approach, though.  It could work well for reproducibility, but it might
be easier to let some external service drive my code as a Guix package.
Either way, it was an easy way to get started.

For disassembly, it takes a Gzip file (containing a single member) and
breaks it down like this:

    (gzip-member
      (version 0)
      (name "hungrycat-0.4.1.tar.gz")
      (input (sha256
               "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
      (header
        (mtime 0)
        (extra-flags 2)
        (os 3))
      (footer
        (crc 3863610951)
        (isize 194560))
      (compressor gnu-best)
      (digest
        (sha256
          "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))
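The “header” and “footer” fields in a record like this sit at fixed
offsets in a Gzip file (RFC 1952): ten header bytes at the start and
eight footer bytes at the end.  Here is a rough Python sketch of reading
them.  It is not Disarchive's actual code, the function name is made up
for illustration, and it ignores the optional header parts (file name,
comment, extra field).

```python
import gzip
import struct
import zlib


def read_gzip_metadata(data: bytes) -> dict:
    """Read the fixed header fields and the footer of a one-member Gzip file."""
    # RFC 1952 header: ID1 ID2 CM FLG MTIME (4 bytes, little-endian) XFL OS
    id1, id2, method, _flags, mtime, extra_flags, os_code = struct.unpack(
        "<BBBBIBB", data[:10])
    if (id1, id2) != (0x1F, 0x8B) or method != 8:
        raise ValueError("not a deflate-compressed Gzip file")
    # Footer: CRC-32 of the uncompressed data, then its size modulo 2**32.
    crc, isize = struct.unpack("<II", data[-8:])
    return {
        "header": {"mtime": mtime, "extra-flags": extra_flags, "os": os_code},
        "footer": {"crc": crc, "isize": isize},
    }


# Build a small Gzip file in memory so the sketch runs without any input file.
payload = b"hello, disarchive\n" * 100
blob = gzip.compress(payload, compresslevel=9, mtime=0)
meta = read_gzip_metadata(blob)
print(meta)
```

In the record above, “extra-flags 2” is the RFC 1952 marker for maximum
compression and “os 3” means Unix.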

The header and footer are read directly from the file.  Finding the
compressor is harder.  I followed the approach taken by the pristine-tar
project.  That is, try a bunch of compressors and hope for a match.
Currently, I have:

    • gnu-best
    • gnu-best-rsync
    • gnu
    • gnu-rsync
    • gnu-fast
    • gnu-fast-rsync
    • zlib-best
    • zlib
    • zlib-fast
    • zlib-best-perl
    • zlib-perl
    • zlib-fast-perl
    • gnu-best-rsync-1.4
    • gnu-rsync-1.4
    • gnu-fast-rsync-1.4

This list is inspired by pristine-tar.  The first six (the plain GNU
compressors) use modern Gzip from Guix.  The zlib and rsync-1.4 ones use
“zgz”, the Gzip and zlib wrapper from pristine-tar.  The 100 Gzip files I
looked at use “gnu”, “gnu-best”, “gnu-best-rsync-1.4”, “zlib”,
“zlib-best”, and “zlib-fast-perl”.
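The “try a bunch of compressors and hope for a match” idea can be
sketched in a few lines of Python.  This is only an illustration: it
uses Python's zlib binding with three compression levels as stand-ins
for the real compressors (GNU Gzip and zgz are not involved), and the
mapping of names to levels is an assumption.

```python
import zlib

# Hypothetical candidates: the names echo the list above, but the
# settings are just zlib levels chosen for illustration.
CANDIDATES = {
    "zlib-fast": 1,
    "zlib": 6,
    "zlib-best": 9,
}


def raw_deflate(data: bytes, level: int) -> bytes:
    """Compress to a raw deflate stream (no Gzip or zlib wrapper)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def guess_compressor(plain: bytes, deflate_stream: bytes):
    """Return the first candidate that reproduces the stream byte for byte."""
    for name, level in CANDIDATES.items():
        if raw_deflate(plain, level) == deflate_stream:
            return name
    return None  # no candidate matched; the file cannot be reproduced


plain = b"".join(b"line %d of some source file\n" % i for i in range(2000))
target = raw_deflate(plain, 9)
guess = guess_compressor(plain, target)
print(guess)
```

Several settings can produce identical output for a given input, so the
name that comes back is only “a compressor that works”, which is all
that is needed to rebuild the file.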

(As an aside, I had a way to decompose multi-member Gzip files, but it
was much, much slower.  Since I doubt they exist in the wild, I removed
that code.)

The “input” field likely points to a tarball, which looks like this:

    (tarball
      (version 0)
      (name "hungrycat-0.4.1.tar")
      (input (sha256
               "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
      (default-header)
      (headers
        ((name "hungrycat-0.4.1/")
         (mode 493)
         (mtime 1513360022)
         (chksum 5058)
         (typeflag 53))
        ((name "hungrycat-0.4.1/configure")
         (mode 493)
         (size 130263)
         (mtime 1513360022)
         (chksum 6043))
        ...)
      (padding 3584)
      (digest
        (sha256
          "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))
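The numbers in these headers are decimal renderings of raw tar header
fields: mode 493 is octal 755, and typeflag 53 is the ASCII code for
“5”, which marks a directory.  The chksum is the sum of the 512 header
bytes with the checksum field itself counted as eight spaces.  A small
Python sketch, using the standard tarfile module to produce a header
block (the helper names are made up, and the resulting checksum will not
match the one above because the other fields differ):

```python
import tarfile

# Build a ustar header block for a directory entry.
info = tarfile.TarInfo("hungrycat-0.4.1/")
info.type = tarfile.DIRTYPE
info.mode = 0o755
info.mtime = 1513360022
block = info.tobuf(format=tarfile.USTAR_FORMAT)


def octal_field(block: bytes, start: int, length: int) -> int:
    """Parse an octal header field, ignoring NUL and space terminators."""
    raw = block[start:start + length]
    return int(raw.rstrip(b"\0 ").decode("ascii"), 8)


def header_checksum(block: bytes) -> int:
    """Sum all header bytes, treating bytes 148..155 (chksum) as spaces."""
    return sum(block[:148]) + 8 * 0x20 + sum(block[156:512])


mode = octal_field(block, 100, 8)      # mode lives at offset 100
stored = octal_field(block, 148, 8)    # chksum lives at offset 148
typeflag = block[156]                  # one byte at offset 156
print(mode, stored, typeflag)
```

Note that octal_field throws away exactly the details (padding, field
width, terminators) that have to be recorded for a byte-exact
round-trip.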

Originally, I used your code, but I ran into some problems.  Namely,
real tarballs are not well-behaved.  I wrote new code to keep track of
subtle things like the formatting of the octal values.  Even though they
are not well-behaved, they are usually self-consistent, so I introduced
the “default-header” field to set default values for all headers.  Any
omitted fields in the headers use the value from the default header, and
the default header takes defaults from a “default default header”
defined in the code.  Here’s a default header from a different tarball:

    (default-header
      (uid 1199)
      (gid 30)
      (magic "ustar ")
      (version " \x00")
      (uname "cagordon")
      (gname "lhea")
      (devmajor-format (width 0))
      (devminor-format (width 0)))

These default values are computed to minimize the noise in the
serialized form.  Here we see for example that each header should have
UID 1199 unless otherwise specified.  We also see that the device fields
should be null strings instead of octal zeros.  Another good example
here is that the magic field has a space after “ustar”, which is not
what modern POSIX says to do.
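One way to compute such defaults is to take the most common value of
each field and then drop every per-file value that equals it.  The
Python sketch below assumes that this is what “minimize the noise”
amounts to, which is not confirmed above; the function names and the
sample file names are hypothetical.

```python
from collections import Counter


def split_defaults(headers, fields):
    """Pick the most common value per field and strip it from each header."""
    defaults = {}
    for field in fields:
        counts = Counter(h[field] for h in headers if field in h)
        if counts:
            defaults[field] = counts.most_common(1)[0][0]
    compact = [
        {k: v for k, v in h.items() if k not in defaults or defaults[k] != v}
        for h in headers
    ]
    return defaults, compact


def expand(defaults, compact):
    """Inverse of split_defaults: fill omitted fields from the defaults."""
    return [{**defaults, **h} for h in compact]


# Assumes every header carries every field listed in `fields`.
headers = [
    {"name": "pkg-1.0/", "uid": 1199, "gid": 30, "uname": "cagordon"},
    {"name": "pkg-1.0/configure", "uid": 1199, "gid": 30, "uname": "cagordon"},
    {"name": "pkg-1.0/README", "uid": 0, "gid": 30, "uname": "root"},
]
defaults, compact = split_defaults(headers, ["uid", "gid", "uname"])
print(defaults)
print(compact)
```

Only the odd one out (the README entry here) keeps its own uid and
uname; the other entries shrink to just a name.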

My tarball reader has minimal support for extended headers, but they are
not serialized cleanly (they survive the round-trip, but they are not
human-readable).

Finally, the “input” field here points to an “swh-directory” object.  It
looks like this:

    (swh-directory
      (version 0)
      (name "hungrycat-0.4.1")
      (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
      (digest
        (sha256
          "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))
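Taken together, the three records form a chain: each “input” names the
“digest” of the record one level down.  The Python sketch below follows
that chain to get the assembly order.  Keying an in-memory dictionary by
digest is an assumption made for the example (the real database is a
directory), and the function name is made up.

```python
# The three example records from this message, reduced to kind and input.
RECORDS = {
    "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh": {
        "kind": "gzip-member",
        "input": "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1",
    },
    "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1": {
        "kind": "tarball",
        "input": "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r",
    },
    "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r": {
        "kind": "swh-directory",  # leaf: fetched by its SWH ID, no input
    },
}


def assembly_order(db, digest):
    """Follow `input` links down to the leaf, then list the steps bottom-up."""
    chain = []
    while digest is not None:
        record = db[digest]
        chain.append(record["kind"])
        digest = record.get("input")
    return list(reversed(chain))


order = assembly_order(
    RECORDS, "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")
print(order)
```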

I have a little module for computing the directory hash like SWH does
(which is in turn like what Git does).  I did not verify that the 100
packages were in the SWH archive.  I did verify a couple of packages,
but I hit the rate limit and decided to avoid it for now.
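For reference, the Git side of that computation is small enough to
sketch in Python.  This follows Git's object format (SHA-1 over a
“type size NUL” prefix plus the body) and is not the module mentioned
above.  SWH handles some cases differently (empty directories, for
one), and those are not covered here.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> bytes:
    """Hash a Git object: SHA-1 of '<kind> <size>' + NUL + body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).digest()


def tree_id(entries) -> bytes:
    """Hash a tree from (mode, name, object_id) triples, all as bytes.

    Modes are written as Git does: b"100644", b"100755", b"40000" (tree).
    """
    def sort_key(entry):
        mode, name, _ = entry
        # Git sorts directory names as if they ended with a slash.
        return name + b"/" if mode == b"40000" else name

    body = b"".join(
        mode + b" " + name + b"\0" + oid
        for mode, name, oid in sorted(entries, key=sort_key)
    )
    return git_object_id("tree", body)


readme = git_object_id("blob", b"hello\n")
script = git_object_id("blob", b"#!/bin/sh\n")
doc = tree_id([(b"100644", b"README", readme)])
root = tree_id([(b"40000", b"doc", doc), (b"100755", b"configure", script)])
print(root.hex())
```

The hex string printed at the end has the same shape as the “id” field
in the swh-directory record.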

To avoid hitting the SWH archive at all, I introduced a directory cache
so that I can store the directories locally.  If the directory cache is
available, directories are stored in and retrieved from it.

> How would we put that in practice?  Good question.  :-)
>
> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!).  A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above.  The nice thing is that the Git repo itself
> could be archived by SWH.  :-)

You mean like <https://git.ngyro.com/disarchive-db/>?  :)

This was generated by a little script built on top of “fold-packages”.
It downloads Gzip’d tarballs used by Guix packages and passes them on to
Disarchive for disassembly.  I limited the number to 100 because it’s
slow and because I’m sure there is a long tail of weird software
archives that are going to be hard to process.  The metadata directory
ended up being 13M and the directory cache 2G.

> Thus, if a tarball vanishes, we’d look it up in the database and
> reconstruct it from its metadata plus content store in SWH.
>
> Thoughts?

Obviously I like the idea.  ;)

Even with the code I have so far, I have a lot of questions.  Mainly I’m
worried about keeping everything working into the future.  It would be
easy to make incompatible changes.  A lot of care would have to be
taken.  Of course, keeping a Guix commit and a Disarchive commit might
be enough to make any assembling reproducible, but there’s a
chicken-and-egg problem there.  What if a tarball from the closure of
one of the derivations is missing?  I guess you could work around it, but
it would be tricky.

> Anyhow, we should team up with fellow NixOS and SWH hackers to address
> this, and with developers of other distros as well—this problem is not
> just that of the functional deployment geeks, is it?

I could remove most of the Guix stuff so that it would be easy to
package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
that consumes a “sources.json” file, adds the sources to a Disarchive
database, and pushes everything to a Git repo.  I guess everyone who
cares has to produce a “sources.json” file anyway, so it will be very
little extra work.  Other stuff like changing the serialization format
to JSON would be pretty easy, too.  I’m not well connected to these
other projects, mind you, so I’m not really sure how to reach out.

Sorry about the big mess of code and ideas – I realize I may have taken
the “do-ocracy” approach a little far here.  :)  Even if this is not
“the” solution, hopefully it’s useful for discussion!


-- Tim





Thread overview: 20+ messages
2020-07-02  7:29 bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Ludovic Courtès
2020-07-02  8:50 ` zimoun
2020-07-02 10:03   ` Ludovic Courtès
2020-07-11 15:50     ` bug#42162: Recovering source tarballs Ludovic Courtès
2020-07-13 19:20       ` Christopher Baines
2020-07-20 21:27         ` zimoun
2020-07-15 16:55       ` zimoun
2020-07-20  8:39         ` Ludovic Courtès
2020-07-20 15:52           ` zimoun
2020-07-20 17:05             ` Dr. Arne Babenhauserheide
2020-07-20 19:59               ` zimoun
2020-07-21 21:22             ` Ludovic Courtès
2020-07-22  0:27               ` zimoun
2020-07-22 10:28                 ` Ludovic Courtès
2020-08-03 21:10         ` Ricardo Wurmus
2020-07-30 17:36       ` Timothy Sample [this message]
2020-07-31 14:41         ` Ludovic Courtès
2020-08-03 16:59           ` Timothy Sample
2020-08-05 17:14             ` Ludovic Courtès
2020-08-05 18:57               ` Timothy Sample
