* Preservation of Guix report for 2024-01-26
@ 2024-01-28  0:47 Timothy Sample
  2024-01-29 17:16 ` Ludovic Courtès
  0 siblings, 1 reply; 4+ messages in thread
From: Timothy Sample @ 2024-01-28  0:47 UTC (permalink / raw)
  To: guix-devel

Hello all,

For a while now, I’ve been tracking coverage of Guix sources in the
Software Heritage (SWH) archive.  I maintain a dataset of sources that
goes back (almost five years) to Guix 1.0.0.  Every once in a while, I
update this dataset and check it against SWH to see how much is missing.
I just put together a new report.

The permalink is https://ngyro.com/pog-reports/2024-01-26, but you can
link to the latest report, too: https://ngyro.com/pog-reports/latest/.

New in this edition is checking for Subversion sources and
bzip2-compressed tarballs.  Subversion is well covered (98.5%), since it
is basically asking, “is TeX Live in SWH?”.  The bzip2 sources are
similar to other compressed tarballs.

One of the benefits of this report is that it catches issues with our
integration with SWH.  This is the second time that preparing this
report has led me to discover that SWH had stopped loading sources from
us.  When
that happens, the number of missing sources starts climbing steeply for
recent commits.  Before publishing this, I reached out to SWH and they
restarted the loader.  It was able to bring in most of the sources but
you can see a slight increase in missing sources about halfway between
September (when it stopped) and now.  That’s likely due to sources that
came and went from our “sources.json” listing while they weren’t
looking.

Speaking of which, another benefit of this dataset is that we have a
list of ~6K historical sources that we would like to see added to SWH.
We are currently coordinating with them to load these sources.  I plan
to update the report when we get results from that.

However, there remain a handful of missing sources that are current, and
should be getting loaded.  This suggests areas where we could improve.
Here’s a not-quite-random sample of some of the current missing sources
(from commit 25bcf4e), and my thoughts as to why they are missing.

mirror://gnupg/gpgme/gpgme-1.18.0.tar.bz2
https://download.enlightenment.org/rel/apps/econnman/econnman-1.1.tar.gz
https://ftp.heanet.ie/mirrors/ftp.xemacs.org/aux/compface-1.5.2.tar.gz
mirror://cpan/authors/id/E/ET/ETHER/MooseX-Types-0.45.tar.gz
mirror://apache/commons/daemon/source/commons-daemon-1.1.0-src.tar.gz

  Some of these (I didn’t check them all) are in SWH as content rather
  than directories.  That’s kinda good, because Guix knows how to get
  them, but also kinda mysterious.  I’ve asked swh-devel about it.
  Depending on the answer, I might have to adapt the checks to deal with
  the possibility of SWH having the tarball rather than its contents.
  In fact, that might be an improvement either way, but it muddies the
  data model quite a bit.

https://rubygems.org/downloads/rjb-1.6.7.gem
https://rubygems.org/downloads/mspec-1.9.1.gem
https://rubygems.org/downloads/cztop-0.12.2.gem
https://rubygems.org/downloads/morecane-0.2.0.gem

  This is an error on my side.  I’ve been treating gems as regular
  files, but they are (and SWH treats them as) tarballs.
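A minimal sketch of the kind of fix this implies, assuming the check
decides by file name whether a source is a tarball.  The suffix list
and the function name are hypothetical and are not taken from the PoG
code:

```python
from urllib.parse import urlparse

# Hypothetical suffix list (an assumption, not the PoG implementation).
# ".gem" is listed because a Ruby gem is itself a tar archive.
TARBALL_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".gem")


def is_tarball(url: str) -> bool:
    """Return True if the URL's path ends in a known tarball suffix."""
    path = urlparse(url).path.lower()
    return path.endswith(TARBALL_SUFFIXES)


if __name__ == "__main__":
    for url in (
        "https://rubygems.org/downloads/rjb-1.6.7.gem",
        "mirror://gnupg/gpgme/gpgme-1.18.0.tar.bz2",
        "https://git.sr.ht/~abcdw/guile-ares-rs",
    ):
        print(url, is_tarball(url))
```

With a rule like this, gems would be checked as directories (their
unpacked contents) rather than as single files.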

https://git.sr.ht/~abcdw/guile-ares-rs

  This one was in SWH, but not up-to-date enough to have the tag we use.
  I don’t think they regularly crawl git.sr.ht yet.  Also, it looks like
  they tried to visit this origin while SourceHut was down (around a
  week ago).  I used “Save code now” to fix this and now this source is
  in SWH.  This kind of thing should be improved soon, as they are
  working on new code that will pick up Git repositories from our
  “sources.json” file.

Given that some of those tarballs and Ruby gems are in fact in SWH and
I’m just missing them, we are probably doing better than the report
suggests!

The short-term road map for this is to send the historical sources to
SWH and fix the Ruby gems, and then make a new report.  So expect a
minor update with much better numbers soon-ish.

The long-term road map is to make it work like an archive.  It will run
continuously and store *all* Guix sources.  To make this easy data-wise,
it will only store what’s not covered by SWH.  I avoided this earlier
out of fear of creating another point of failure.  I’m still afraid of
this, but as it stands every source that is just out there on the
Internet and not in SWH is a point of failure.  Surely having them all
in one place would be better, right?


-- Tim



* Re: Preservation of Guix report for 2024-01-26
  2024-01-28  0:47 Preservation of Guix report for 2024-01-26 Timothy Sample
@ 2024-01-29 17:16 ` Ludovic Courtès
  2024-01-30 18:01   ` Timothy Sample
  0 siblings, 1 reply; 4+ messages in thread
From: Ludovic Courtès @ 2024-01-29 17:16 UTC (permalink / raw)
  To: Timothy Sample; +Cc: guix-devel

Hi Timothy!

Timothy Sample <samplet@ngyro.com> skribis:

> The permalink is https://ngyro.com/pog-reports/2024-01-26, but you can
> link to the latest report, too: https://ngyro.com/pog-reports/latest/.

Yay!

> New in this edition is checking for Subversion sources and
> bzip2-compressed tarballs.  Subversion is well covered (98.5%), since it
> is basically asking, “is TeX Live in SWH?”.  The bzip2 sources are
> similar to other compressed tarballs.

Thumbs up on bzip2 support!  We should update Disarchive in Guix but
perhaps that’s already in your pipeline?  We’ll also have to sync the
disarchive.guix.gnu.org with ngyro.com.

How did you implement the Subversion check?

Until the recent addition of ‘nar-sha256’ ExtIDs¹, my understanding is
that there was no way to check whether a Subversion revision (and
actually, a specific sub-directory checkout) was in SWH.

¹ https://issues.guix.gnu.org/68741

>   Some of these (I didn’t check them all) are in SWH as content rather
>   than directories.  That’s kinda good, because Guix knows how to get
>   them, but also kinda mysterious.  I’ve asked swh-devel about it.
>   Depending on the answer, I might have to adapt the checks to deal with
>   the possibility of SWH having the tarball rather than its contents.
>   In fact, that might be an improvement either way, but it muddies the
>   data model quite a bit.

Back in the day, they told me that tarballs can sometimes be ingested,
for instance if they are committed to a VCS repo (that’s why our
fallback code tries that as well).  Maybe that’s what happened?

> The short-term road map for this is to send the historical sources to
> SWH and fix the Ruby gems, and then make a new report.  So expect a
> minor update with much better numbers soon-ish.

Awesome.

> The long-term road map is to make it work like an archive.  It will run
> continuously and store *all* Guix sources.  To make this easy data-wise,
> it will only store what’s not covered by SWH.

By “it”, you mean the Disarchive DB?

Thanks for the exciting news!

Ludo’.



* Re: Preservation of Guix report for 2024-01-26
  2024-01-29 17:16 ` Ludovic Courtès
@ 2024-01-30 18:01   ` Timothy Sample
  2024-02-06 14:44     ` Ludovic Courtès
  0 siblings, 1 reply; 4+ messages in thread
From: Timothy Sample @ 2024-01-30 18:01 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> Thumbs up on bzip2 support!  We should update Disarchive in Guix but
> perhaps that’s already in your pipeline?

I sent https://issues.guix.gnu.org/68769.  Now I see that I didn’t have
the newest Git hooks installed, so no change ID and no email to the
relevant team.  Sorry!  (I use worktrees so the Makefile didn’t fix this
for me automatically – I should have double checked.)

> We’ll also have to sync the disarchive.guix.gnu.org with ngyro.com.

Hopefully our old system will work again, but I will have to consolidate
my collection of Disarchive specifications, first.

> How did you implement the Subversion check?

The hard way: download, verify Guix hash, compute SWHID, check existence
in SWH.  (That’s how everything works to date with PoG, but hopefully
the hard way will become obsolete with all the recent support SWH has
been providing us.)
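The four steps can be sketched roughly as follows.  This is an
illustration under stated assumptions, not the PoG code: it handles
only a single file (a “content” SWHID, which is the Git blob hash),
whereas a Subversion checkout needs a directory SWHID, which hashes a
Git-style tree and is omitted here.  The hash is compared in
hexadecimal for simplicity, although Guix usually prints hashes in
base32.  The existence check assumes the archive’s public
“/api/1/known/” endpoint, which as far as I understand accepts a JSON
list of SWHIDs; all function names are made up for the example.

```python
import hashlib
import json
import urllib.request

# Assumed endpoint; check the SWH API documentation before relying on it.
KNOWN_URL = "https://archive.softwareheritage.org/api/1/known/"


def verify_sha256(data: bytes, expected_hex: str) -> bool:
    """Step 2: compare the download against the expected SHA-256 (hex)."""
    return hashlib.sha256(data).hexdigest() == expected_hex


def content_swhid(data: bytes) -> str:
    """Step 3: compute the content SWHID, i.e. the Git blob SHA-1."""
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def known_in_swh(swhids):
    """Step 4: ask SWH which SWHIDs it knows (performs a network request)."""
    body = json.dumps(list(swhids)).encode("utf-8")
    request = urllib.request.Request(
        KNOWN_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    # Assumed response shape: {swhid: {"known": true|false}, ...}
    return {swhid: info["known"] for swhid, info in result.items()}
```

Step 1 (the download) is left out; “data” stands for the bytes that
were fetched.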

>>   Some of these (I didn’t check them all) are in SWH as content rather
>>   than directories.
>
> Back in the day, they told me that tarballs can sometimes be ingested,
> for instance if they are committed to a VCS repo (that’s why our
> fallback code tries that as well).  Maybe that’s what happened?

Probably, but I don’t quite understand the mechanism.  The “nixguix”
loader uses ExtIDs for deduplication.  My assumption is that it will
only skip unpacking a tarball if there’s an existing ExtID.  It doesn’t
look like there are ExtIDs for these tarballs, so I’m not sure.  (I’ve
been fumbling a bit trying to use the ExtID API, so maybe it’s a mistake
on my end.)

>> The long-term road map is to make it work like an archive.  It will run
>> continuously and store *all* Guix sources.  To make this easy data-wise,
>> it will only store what’s not covered by SWH.
>
> By “it”, you mean the Disarchive DB?

I mean the PoG “project”.  Instead of just testing and reporting, it
will preserve.  For instance, right now if it encounters a tarball that
Disarchive can’t unpack, it shrugs and moves on.  I want it to store
those so that we are guaranteed to be able to revisit it in the future.
Same thing for sources not (yet) in SWH.  I want to store those so that
SWH can ingest them later.  Because we are doing so well working with
SWH, the storage requirements for this will be manageable (10s of
gigabytes).  With that in place, the PoG report will simply explain what
the archive needs to store and why.  Our goal, then, will be for it to
store nothing.
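As a rough sketch of that policy (the record type, field names, and
example URLs below are hypothetical, invented only for illustration),
the archive would keep a source only when there is a reason to, and
record that reason for the report:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical record for one Guix source."""
    url: str
    in_swh: bool          # already preserved by SWH?
    disarchive_ok: bool   # could Disarchive unpack it?


def storage_plan(sources):
    """Map each source that must be stored to the reason for storing it."""
    plan = {}
    for source in sources:
        if not source.in_swh:
            plan[source.url] = "not (yet) in SWH"
        elif not source.disarchive_ok:
            plan[source.url] = "Disarchive cannot unpack it"
    return plan


if __name__ == "__main__":
    sources = [
        Source("https://example.org/a.tar.gz", in_swh=True, disarchive_ok=True),
        Source("https://example.org/b.tar.gz", in_swh=False, disarchive_ok=True),
        Source("https://example.org/c.tar.gz", in_swh=True, disarchive_ok=False),
    ]
    for url, reason in storage_plan(sources).items():
        print(url, "->", reason)
```

An empty plan corresponds to the stated goal of storing nothing.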


-- Tim



* Re: Preservation of Guix report for 2024-01-26
  2024-01-30 18:01   ` Timothy Sample
@ 2024-02-06 14:44     ` Ludovic Courtès
  0 siblings, 0 replies; 4+ messages in thread
From: Ludovic Courtès @ 2024-02-06 14:44 UTC (permalink / raw)
  To: Timothy Sample; +Cc: guix-devel

Hello!

Timothy Sample <samplet@ngyro.com> skribis:

> I sent https://issues.guix.gnu.org/68769.  Now I see that I didn’t have
> the newest Git hooks installed, so no change ID and no email to the
> relevant team.  Sorry!  (I use worktrees so the Makefile didn’t fix this
> for me automatically – I should have double checked.)

Impressive stuff.

>> We’ll also have to sync the disarchive.guix.gnu.org with ngyro.com.
>
> Hopefully our old system will work again, but I will have to consolidate
> my collection of Disarchive specifications, first.

Please let me/sysadmins know when and how we should run rsync to grab
new stuff from your database.

(We could also set up an infrequent rsync job if that makes sense.)

> I mean the PoG “project”.  Instead of just testing and reporting, it
> will preserve.  For instance, right now if it encounters a tarball that
> Disarchive can’t unpack, it shrugs and moves on.  I want it to store
> those so that we are guaranteed to be able to revisit it in the future.
> Same thing for sources not (yet) in SWH.  I want to store those so that
> SWH can ingest them later.  Because we are doing so well working with
> SWH, the storage requirements for this will be manageable (10s of
> gigabytes).  With that in place, the PoG report will simply explain what
> the archive needs to store and why.  Our goal, then, will be for it to
> store nothing.

That makes a lot of sense to me.  I’m not sure how to implement it but
you certainly have ideas.

Thanks,
Ludo’.

