unofficial mirror of guix-devel@gnu.org 
From: Timothy Sample <samplet@ngyro.com>
To: Simon Tournier <zimon.toutoune@gmail.com>
Cc: guix-devel@gnu.org
Subject: Re: Preservation of Guix (PoG) report 2023-03-13
Date: Sat, 18 Mar 2023 14:35:40 -0600
Message-ID: <87o7oplrvn.fsf@ngyro.com>
In-Reply-To: <86356739hb.fsf@gmail.com> (Simon Tournier's message of "Tue, 14 Mar 2023 11:36:48 +0100")

Hey,

Simon Tournier <zimon.toutoune@gmail.com> writes:

> Well, I do not remember if you consider also the ’origin’
> (fixed-outputs) as ’inputs’ or ’patches’.  Do you?

I’m quite confident I’m getting everything.  I’ll describe my approach,
because I’m happy with it.  :)

The Guix package graph exists twice, essentially.  There’s the
high-level representation made up of packages, origins, gexps, etc.
Then, there is the low-level representation which is just derivations.
The high-level representation has nice metadata and makes sense to
humans, while the low-level representation is easy to traverse.

AFAICT, there’s no generic way to traverse the high-level
representation.  Every lowerable object has complete control over how it
references other lowerable objects, and is not obliged to provide any
means of listing those references.  That is, there’s no
‘lowerable-inputs’ procedure or anything like that.  (We have
‘bag-node-edges’ in ‘(guix scripts graph)’, but it doesn’t cover
everything.)

What I do for the report is traverse (as best I can) the high-level
representation and construct a map from derivations to origin objects.
Then, I traverse the low-level representation to find all the
fixed-output derivations.  Finally, I use the map to look up origin
objects for each fixed-output derivation.  If I miss an origin object,
the fixed-output derivation still gets recorded.  It will show up in the
report as “unknown” until I investigate why it’s missing and correct it.
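
The two-pass idea above can be sketched as a toy model.  This is an
illustration in Python, not the actual report code and not Guix’s API:
the ‘Derivation’ class, the ‘origin_map’ dictionary, and the sample
names are all hypothetical stand-ins.  It assumes the high-level
traversal has already produced a (possibly incomplete) map keyed by
derivation name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """Hypothetical stand-in for a low-level derivation."""
    name: str
    fixed_output: bool = False
    inputs: tuple = ()  # other Derivation objects this one depends on


def fixed_output_derivations(roots):
    """Walk the derivation graph and collect every fixed-output node."""
    seen, found, stack = set(), [], list(roots)
    while stack:
        drv = stack.pop()
        if drv.name in seen:
            continue
        seen.add(drv.name)
        if drv.fixed_output:
            found.append(drv)
        stack.extend(drv.inputs)
    return found


def attach_metadata(roots, origin_map):
    """Pair each fixed-output derivation with its origin, or "unknown"."""
    return {drv.name: origin_map.get(drv.name, "unknown")
            for drv in fixed_output_derivations(roots)}


# Made-up sample graph: one source the high-level pass found, one it missed.
hello_src = Derivation("hello-2.12.tar.gz.drv", fixed_output=True)
sneaky_src = Derivation("sneaky-addon.xpi.drv", fixed_output=True)
hello = Derivation("hello-2.12.drv", inputs=(hello_src,))
browser = Derivation("browser.drv", inputs=(sneaky_src, hello))

# Map built from the (incomplete) high-level traversal.
origin_map = {"hello-2.12.tar.gz.drv": "origin: hello 2.12 (url-fetch)"}

report = attach_metadata([browser], origin_map)
```

The point of the design is that the low-level walk is the source of
truth: a missed origin only costs metadata, never a record.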

There are currently 56 (out of 54K) fixed-output derivations that are
missing metadata in my database.  A fair few of them have to do with
Telegram, Thunderbird, and uBlock Origin.  All it means is that those
packages have sneaky ways of referencing origins that my code can’t
handle.  It’s harmless and easy to fix as time permits.

>> Over the whole set, 77.1% are known to be safely tucked away in the
>> Software Heritage archive.  But it’s actually much better than that.  If
>> we only look at the most recent sampled commit (from Sunday the 5th),
>> that number becomes 87.4%, which is starting to look pretty good!
>
> Just to point out: the new nixguix loader [1] is still in SWH staging
> and not yet deployed, IIRC.  It will not change the coverage much on
> our side, but it should fix some corner cases.
>
> 1: <https://gitlab.softwareheritage.org/swh/meta/-/issues/4662>

Good to know!

>>      This is kinda like an automated version of Simon’s recent
>> investigation.
>
> Neat!  Note that I also wanted to check the SWH capacity for cooking,
> not only checking the end points.  For instance, it allowed me to
> discover a mismatch due to unhandled CR/LF normalization; now fixed
> with: 58f20fa8181bdcd4269671e1d3cef1268947af3a.

Maybe we need a “chaos monkey mode” for Guix.  It could randomly select
packages to build, randomly pick source code fallback methods, and also
test reproducibility (like “--check”).  You could have a blocklist for
browsers, etc., but otherwise it could pick the odd package to test
thoroughly.  Those of us with the time and inclination could crank up
that knob and get interesting feedback about reproducibility at the cost
of doing a few package builds here and there.
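
To make the idea concrete, here is a minimal sketch of how such a mode
might plan its work.  Everything in it is an assumption for
illustration: no such feature exists in Guix as far as this message
says, and the ‘chaos_plan’ function, the ‘rate’ knob, and the fallback
names are hypothetical.  It is written in Python as a toy model rather
than against Guix’s API.

```python
import random

# Hypothetical labels for source fallback methods; not real Guix options.
FALLBACKS = ["upstream", "substitute-server", "software-heritage"]


def chaos_plan(packages, blocklist, rate, rng):
    """Select each non-blocked package with probability `rate`.

    Every selected package gets a randomly chosen source fallback and is
    flagged for a reproducibility check (in the spirit of "--check").
    """
    plan = []
    for pkg in packages:
        if pkg in blocklist:
            continue  # e.g. browsers that are too expensive to build
        if rng.random() < rate:
            plan.append({"package": pkg,
                         "source": rng.choice(FALLBACKS),
                         "check": True})
    return plan


# Made-up package names; a rate of 1.0 selects everything not blocked.
everything = chaos_plan(["hello", "icecat", "sed"], {"icecat"}, 1.0,
                        random.Random(0))
nothing = chaos_plan(["hello", "icecat", "sed"], {"icecat"}, 0.0,
                     random.Random(0))
```

Here ‘rate’ plays the role of the knob: 0.0 turns the mode off, and
higher values trade more local builds for more feedback.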

>> Here’s a rough road map for that based on a glance at the script’s
>> output:
>>
>>     • Subversion support (for TeX-based documentation stuff, I guess)
>
> For the interested reader, details for helping in the implementation:
>
>     https://issues.guix.gnu.org/issue/43442#9
>     https://issues.guix.gnu.org/issue/43442#11

Fantastic.  That looks very promising!

> However, it would ease the whole dance if SWH would consider storing
> and exposing NAR hashes on their side.  As discussed here:
>
>     https://gitlab.softwareheritage.org/swh/meta/-/issues/4538

This would be nice, yes.

>>              However, 42% of them are old Bioconductor packages.  They
>> seem to be lost.  It looks like Bioconductor now stores multiple package
>> versions per Bioconductor version [2], but before version 3.15 that was
>> not the case.  As an example, take “ggcyto” from Bioconductor 3.10 [3].
>> We packaged version 1.14.0, and then at some point Bioconductor 3.10
>> switched to version 1.14.1.  We packaged that, too, but now 1.14.0 is
>> gone.
>
> Well, I have not investigated much because it is between December 2019
> and March 2020, so “guix time-machine” is not smooth that far back.
>
> First question: do we have the source tarball in Berlin or Bordeaux or
> somewhere else?  If yes, there is hope. :-) Else, it is probably gone
> forever.

Like I wrote, I picked up a handful from Bordeaux, but not much.

> The hope is: https://git.bioconductor.org/packages/ggcyto
>
> If we have the tarball with the correct checksum from commit
> f5f440312d848e12463f0c6f7510a86b623a9e27
>
> +    (version "1.14.0")
> +    (source
> +     (origin
> +       (method url-fetch)
> +       (uri (bioconductor-uri "ggcyto" version))
> +       (sha256
> +        (base32
> +         "165qszvy5z176h1l3dnjb5dcm279b6bjl5n5gzz8wfn4xpn8anc8"))))
>
> then we can disassemble it and then, using the Git repository, we can
> try to assemble the content from SWH and the metadata from the
> Disarchive DB.

I played around with this approach a bit, but it’s extremely tedious,
and I’m not hopeful it will work.  Even if it does, it will be hard to
automate.  I never fully tested the idea, just decided the effort was
too high for such a low probability of success.  I’m putting these in
the “low priority” bin for now.


-- Tim


