From: zimoun <zimon.toutoune@gmail.com>
To: Timothy Sample <samplet@ngyro.com>
Cc: "Ludovic Courtès" <ludovic.courtes@inria.fr>,
	"Guix Devel" <guix-devel@gnu.org>,
	guix-science@gnu.org
Subject: Re: Software Heritage fifth anniversary event
Date: Thu, 2 Dec 2021 14:17:39 +0100	[thread overview]
Message-ID: <CAJ3okZ3RoK3_ziWVE5TqF+uRvbpsY-Dgqy4zawjoyv3L9N-Dfw@mail.gmail.com> (raw)
In-Reply-To: <87tufsgq1p.fsf@ngyro.com>

Hi,

On Wed, 1 Dec 2021 at 19:04, Timothy Sample <samplet@ngyro.com> wrote:
> Ludovic Courtès <ludovic.courtes@inria.fr> writes:
>
> > I gave a 10–15mn talk on how Guix uses SWH, what Disarchive is, what
> > the current status of the “preservation of Guix” is, and what remains
> > to be done:
> >
> >   https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/swh-unesco-2021/talk.20211130.pdf

Thank you, Ludo, for this nice write-up!  I hope the stream was
recorded and will soon be available to all. :-)


> > I chatted with the SWH tech team; they’re obviously very busy solving
> > all sorts of scalability challenges :-) but they’re also truly
> > interested in what we’re doing and in supporting our use case.  Off the
> > top of my head, here are some of the topics discussed:
> >
> >   • ingesting past revisions: if we can give them ‘sources.json’ for
> >     past revisions, they’re happy to ingest them;
>
> This is something I can probably coax out of the Preservation of Guix
> database.  That might be the cheapest way to do it.  Alternatively, when
> we get “sources.json” built with Cuirass, we could tell Cuirass to build
> out a sample of previous commits to get pretty good coverage.  (Side
> note: eventually we could verify the coverage of the sampling approach
> using the Data Service, which has a processed a very exhaustive list of
> commits.)

Let's avoid "quirks", because right now the ingestion already requires
too many manual checks. :-)

For instance, "guix lint -c archival" works well, but it is not run
systematically by contributors or pushers, especially for frequently
updated packages.  This is mostly what we see: 35 vs. 24 missing
type:git entries in the PoG reports [1,2].

On the other hand, 'sources.json' is built as part of the Guix
website, but SWH only ingests the tarball items from it.

It is not clear to me how to add both to CI: sending save requests
for git-fetch packages and building 'sources.json'.
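
For the git-fetch half, a CI job could do something like the sketch
below, in Python for illustration.  It assumes the layout of
'sources.json' ("sources" entries with a "type" and a "git_url" field)
and the public "save code now" endpoint of the SWH REST API; both are
assumptions on my side, to be checked.

    # Sketch: ask SWH to archive every git-fetch origin listed in
    # sources.json.  The sources.json layout and the "save code now"
    # endpoint are assumed, not verified.
    import json
    import urllib.request

    with open("sources.json") as port:
        sources = json.load(port)["sources"]

    for source in sources:
        if source.get("type") == "git":
            url = source["git_url"]
            endpoint = ("https://archive.softwareheritage.org"
                        f"/api/1/origin/save/git/url/{url}/")
            request = urllib.request.Request(endpoint, method="POST")
            with urllib.request.urlopen(request) as response:
                print(url, response.status)

The tarball items would still be covered by SWH reading
'sources.json' itself, as today.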

Last, not all packages are equal.  We could reach 99.99% coverage,
but if the missing 0.01% of packages are deep in the graph, the whole
house of cards falls down.  Somehow, we need to work on the graph and
spot the "important" packages, or at least rank them.  Argh, that is
something I have wanted to do for a long time (it would also help
when a release is coming), but days only have 24 hours. ;-)
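
As a very rough first cut at "spotting the important packages", one
could count direct dependents from the output of 'guix graph'.  A
Python sketch, where the dot-parsing patterns are only assumptions
about the output format:

    # Sketch: rank packages by number of direct dependents, from a file
    # produced by something like "guix graph --type=package ... > graph.dot".
    # The node and edge patterns below are assumptions about the dot output.
    import re
    from collections import Counter

    labels = {}                # node id -> package label
    dependents = Counter()     # node id -> number of direct dependents

    node = re.compile(r'"([^"]+)" \[label = "([^"]+)"')
    edge = re.compile(r'"([^"]+)" -> "([^"]+)"')

    with open("graph.dot") as port:
        for line in port:
            if (m := node.search(line)):
                labels[m.group(1)] = m.group(2)
            elif (m := edge.search(line)):
                dependents[m.group(2)] += 1   # edge: dependent -> dependency

    for node_id, count in dependents.most_common(20):
        print(f"{count:6d}  {labels.get(node_id, node_id)}")

('guix refresh --list-dependent' already answers this per package; the
idea would be to get a ranking over the whole graph at once.)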


1: https://ngyro.com/pog-reports/2021-10-31/
2: https://ngyro.com/pog-reports/2021-11-30/


> >   • rate limit: we can find an arrangement to raise it for the purposes
> >     of statistics gathering like Simon and Timothy have been doing (we
> >     can discuss the details off-list);
>
> Cool!  So far it hasn’t been a concern for me, but it would help in the
> future if want to try and track down Git repositories that have gone
> missing.

Timothy, could you remind us which entry point you use?
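
The kind of per-origin lookup I have in mind is roughly the following
Python sketch, assuming the public SWH REST API and a placeholder
origin URL; maybe you are using another endpoint.

    # Sketch: check whether a given Git origin is known to SWH.
    # Assumes the public SWH REST API; the origin URL is a placeholder.
    import json
    import urllib.error
    import urllib.request

    origin = "https://example.org/some/repository.git"   # placeholder
    endpoint = f"https://archive.softwareheritage.org/api/1/origin/{origin}/get/"

    try:
        with urllib.request.urlopen(endpoint) as response:
            print(json.load(response))    # known origin: metadata, visits URL
    except urllib.error.HTTPError as error:
        if error.code == 404:
            print("origin not archived")  # candidate "gone missing" repository
        else:
            raise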


> >     they’re not opposed to the idea of eventually hosting or maintaining
> >     the Disarchive database (in fact one of the developers thought we
> >     were hosting it in Git and that as such they were already archiving
> >     it—maybe we could go back to Git?);
>
> It’s a possibility, but right now I’m hopeful that the database will be
> in the care of SWH directly before too long.  I’d rather wait and see at
> this point.  I’m sure we could manage it, but the uncompressed size of
> the Disarchive specification of a Chromium tarball is 366M.  Storing all
> the XZ specifications uncompressed is over 20G.  It would be a big Git
> repo!

Hehe!  That's something we discussed at the very beginning of Disarchive. :-)

If the Disarchive database is managed by SWH, some people might worry
about security.  I mean: today, SWH ingests an archive, and this
archive is checksummed with a robust algorithm, say Foo.  Using the
content from SWH and the metadata from the Disarchive database, the
archive is rebuilt, and because Foo is robust, one can check that the
rebuild matches the expected checksum.  Later, Foo becomes weak and a
preimage attack becomes possible.  All one has is the expectation
expressed with Foo.  Therefore, SWH could cheat and introduce
something into the content and/or the metadata that still matches the
Foo checksum.  If the two databases are independent, that is
harder. :-)
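
Concretely, the verification step is nothing more than the following
sketch, with SHA-256 standing in for Foo and hypothetical file names:

    # Sketch: verify a rebuilt archive against the checksum recorded at
    # ingestion time.  SHA-256 stands in for "Foo"; names are hypothetical.
    import hashlib

    def sha256sum(filename):
        with open(filename, "rb") as port:
            return hashlib.sha256(port.read()).hexdigest()

    expected = "0123abc..."        # recorded when the archive was ingested
    rebuilt = "hello-2.10.tar.gz"  # rebuilt from SWH content + Disarchive metadata

    assert sha256sum(rebuilt) == expected   # only as strong as Foo itself

Once Foo is broken, that comparison is the only thing standing between
us and a tampered rebuild, hence the interest of keeping the two
databases in independent hands.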

Well, the assumption is that SWH will still be there when Foo is
broken.  Currently Foo is SHA-256, so who knows. :-)

From a scientific point of view, this scenario (a corrupted SWH) is
really low on the list of issues. ;-)


> >   • bit-for-bit archival: there’s a tension between making SWH a
> >     “canonical” representation of VCS repos and making it a faithful,
> >     bit-for-bit identical copy of the original, and there are different
> >     opinions in the team here; our use case pretty much requires
> >     bit-for-bit copies, and fortunately this is what SWH is giving us in
> >     practice for Git repos, so checkout authentication (for example)
> >     should work even when fetching Guix from SWH.

The main issue is the lookup.  Non-bit-for-bit archival implies that
people store an SWH lookup key (an SWHID, I guess) at ingestion time;
otherwise it becomes nearly impossible to find the content back.  To
me, the tension is in the meaning of "preservation of source code",
i.e., between archiving for reading and archiving for compiling.  In
the compilation case, all lookups must be automated, so
non-bit-for-bit archival means making the SWHID THE standard for
serialization, somehow replacing all the other checksums.
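
To make the lookup point concrete: with an SWHID stored at ingestion
time, finding the content back is a single query, roughly as in this
sketch (assuming the public "resolve" endpoint of the SWH REST API and
a placeholder SWHID):

    # Sketch: resolve an SWHID back to its location in the archive.
    # Assumes the public SWH "resolve" endpoint; the SWHID is a placeholder.
    import json
    import urllib.request

    swhid = "swh:1:dir:0000000000000000000000000000000000000000"  # placeholder
    endpoint = f"https://archive.softwareheritage.org/api/1/resolve/{swhid}/"

    with urllib.request.urlopen(endpoint) as response:
        print(json.load(response))   # object type, hashes, browse/API URLs

Without that key, or without bit-for-bit content that lets us
recompute our own checksums, there is no such direct query.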


> > Anyway I think we can take this as an opportunity to increase bandwidth
> > with the SWH developers!

Yeah, let's have a good story! :-)


Cheers,
simon

