From: Timothy Sample <samplet@ngyro.com>
To: "Ludovic Courtès" <ludovic.courtes@inria.fr>
Cc: guix-devel@gnu.org, guix-science@gnu.org
Subject: Re: Software Heritage fifth anniversary event
Date: Wed, 01 Dec 2021 13:04:18 -0500
Message-ID: <87tufsgq1p.fsf@ngyro.com>
In-Reply-To: 87sfvc4q8j.fsf@inria.fr

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> I gave a 10–15mn talk on how Guix uses SWH, what Disarchive is, what
> the current status of the “preservation of Guix” is, and what remains
> to be done:
>
>   https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/swh-unesco-2021/talk.20211130.pdf

Wow – great work!

> I chatted with the SWH tech team; they’re obviously very busy solving
> all sorts of scalability challenges :-) but they’re also truly
> interested in what we’re doing and in supporting our use case.  Off the
> top of my head, here are some of the topics discussed:
>
>   • ingesting past revisions: if we can give them ‘sources.json’ for
>     past revisions, they’re happy to ingest them;

This is something I can probably coax out of the Preservation of Guix
database.  That might be the cheapest way to do it.  Alternatively, when
we get “sources.json” built with Cuirass, we could tell Cuirass to build
out a sample of previous commits to get pretty good coverage.  (Side
note: eventually we could verify the coverage of the sampling approach
using the Data Service, which has processed a very exhaustive list of
commits.)
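
To make the sampling idea concrete, here is a minimal sketch in Python.
It is only an illustration under stated assumptions: the function names,
the "every Nth commit" strategy, and the shape of the data (a mapping
from commit to the set of source identifiers it references) are all
hypothetical, and none of this is how Cuirass or the Data Service
actually expose things.

```python
from typing import Dict, List, Set


def sample_commits(commits: List[str], step: int) -> List[str]:
    """Pick every `step`-th commit, always keeping the most recent one.

    `commits` is assumed to be ordered oldest to newest.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    sampled = commits[::step]
    if commits and sampled[-1] != commits[-1]:
        sampled.append(commits[-1])
    return sampled


def coverage(sources_by_commit: Dict[str, Set[str]],
             sampled: List[str]) -> float:
    """Fraction of all sources that appear in at least one sampled commit.

    `sources_by_commit` stands in for an exhaustive record of which
    sources each commit references (hypothetical data shape).
    """
    every: Set[str] = set()
    for sources in sources_by_commit.values():
        every |= sources
    if not every:
        return 1.0  # nothing to cover
    seen: Set[str] = set()
    for commit in sampled:
        seen |= sources_by_commit.get(commit, set())
    return len(seen) / len(every)
```

A source that only existed between two sampled commits is what lowers
the coverage figure, which is the case the exhaustive list would let us
measure.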

>   • rate limit: we can find an arrangement to raise it for the purposes
>     of statistics gathering like Simon and Timothy have been doing (we
>     can discuss the details off-list);

Cool!  So far it hasn’t been a concern for me, but it would help in the
future if we want to try and track down Git repositories that have gone
missing.

>   • Disarchive: they’d like to better understand the “unknowns” in the
>     PoG plots (I wasn’t sure if it was non-tar.gz tarballs or what) and
>     to work on the definitely-missing origins that show up there;

Many of the unknowns are there for me to track Disarchive progress.
It’s not really the clearest reporting, but it tracks more what Guix can
handle automatically than what we could theoretically know about.
Basically something is “known” if it can be downloaded from upstream,
and either: it’s a non-recursive Git reference; or it’s something
Disarchive can handle.  Hence, we know nothing about other version
control systems and, say, “.tar.bz2” archives.  Also, all these things
are based on heuristics.  :)  As we get closer to 100% known, we can
start analyzing everything more closely.
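
The rule above can be sketched as a small predicate.  This is a rough
Python illustration, not the actual PoG code: the `Source` record, its
field names, and especially the list of suffixes are assumptions made
up for the example, and the suffix list is a placeholder rather than a
statement of what Disarchive supports.

```python
from dataclasses import dataclass
from typing import Tuple

# Placeholder list of archive suffixes treated as "Disarchive can handle
# this" (an assumption for the example; ".tar.bz2" is left out on purpose).
HANDLED_SUFFIXES: Tuple[str, ...] = (".tar.gz", ".tar.xz")


@dataclass
class Source:
    """Hypothetical record describing one package source."""
    downloadable: bool       # can it still be fetched from upstream?
    kind: str                # e.g. "git", "tarball", "hg", "svn"
    recursive: bool = False  # only meaningful for Git references
    file_name: str = ""      # only meaningful for tarballs


def is_known(src: Source,
             handled: Tuple[str, ...] = HANDLED_SUFFIXES) -> bool:
    """Apply the heuristic: downloadable from upstream, and either a
    non-recursive Git reference or an archive Disarchive can handle."""
    if not src.downloadable:
        return False
    if src.kind == "git":
        return not src.recursive
    if src.kind == "tarball":
        return src.file_name.endswith(handled)
    # Other version control systems are always "unknown" here.
    return False
```

Anything for which `is_known` returns false lands in the "unknown"
bucket, whether or not it is actually archived somewhere.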

>     they’re not opposed to the idea of eventually hosting or maintaining
>     the Disarchive database (in fact one of the developers thought we
>     were hosting it in Git and that as such they were already archiving
>     it—maybe we could go back to Git?);

It’s a possibility, but right now I’m hopeful that the database will be
in the care of SWH directly before too long.  I’d rather wait and see at
this point.  I’m sure we could manage it, but the uncompressed size of
the Disarchive specification of a Chromium tarball is 366M.  Storing all
the XZ specifications uncompressed is over 20G.  It would be a big Git
repo!

>   • bit-for-bit archival: there’s a tension between making SWH a
>     “canonical” representation of VCS repos and making it a faithful,
>     bit-for-bit identical copy of the original, and there are different
>     opinions in the team here; our use case pretty much requires
>     bit-for-bit copies, and fortunately this is what SWH is giving us in
>     practice for Git repos, so checkout authentication (for example)
>     should work even when fetching Guix from SWH.

That’s interesting.  I’m sure most of us in the Guix camp are on team
bit-for-bit, but I’m sure we can all agree that it’s not easy to get
there.

> There were other discussions about Guix and Nix and I was pleased to see
> people were enthusiastic about functional package management and about
> our whole endeavor.
>
> Anyway I think we can take this as an opportunity to increase bandwidth
> with the SWH developers!

Good idea.  It’s nice when our efforts and experience produce something
useful to the broader free software community.  :)


-- Tim

