Hi Guix,

It’s been a while! :) Allow me to present to you a long-overdue update to the Preservation of Guix (PoG) report: . 🎉

Note that you can link to the most recent version of the report using .

What is this? Well, I added a description to the report itself, but here’s a brief teaser. The PoG report shows what we know about the archival status of the approximately 54K sources (and counting) Guix has linked to since around the time of the 1.0 release.

For this edition, I took a bit of time to fix the contrast and colours to be a bit more accessible. They’re about half as garish as they used to be, too.

Over the whole set, 77.1% are known to be safely tucked away in the Software Heritage archive. But it’s actually much better than that. If we only look at the most recent sampled commit (from Sunday the 5th), that number becomes 87.4%, which is starting to look pretty good!

I have a few more notes on the report, but I want to put this near the top of the message so that people will see it. :) I wrote a script (see attached) that uses the PoG database to find missing sources on a package-by-package basis. That is, you can run

    guix repl specification-to-swhids.scm pog.db bash

and it will print a table of all of the transitive sources needed to build Bash, along with their preservation status. Here’s a (heavily edited and snipped to fit an email message) sample of its output:

    [... many “stored” inputs]
    sha256 0r5p.  swh:1:dir:02f7.  stored   /gnu/store/.-gmp-6.0.0a.tar.xz
    sha256 0c3k.  swh:1:dir:6027.  stored   /gnu/store/.-mescc-....tar.xz
    sha256 1r1z.  swh:1:dir:6087.  stored   /gnu/store/.-bash-2.05b.tar.gz
    sha256 14l0.  unknown          unknown  /gnu/store/.-gcc-4.9.4.tar.bz2
    sha256 0m2y.  unknown          unknown  /gnu/store/.-ed-1.17.tar.lz
    [... more “unknown” inputs]

(I had to pipe the output to “sort -k 4” to have it sorted by status.)

The first two columns are the Guix hash. The next two columns are the SWHID (if known) and whether SWH has it (if known).
The last column is the store filename (which is nice because it usually tells you what it is we are looking at). In this sample, you can see that GMP, MesCC Tools, and Bash are all safe. However, we don’t know about GCC 4 and ed. This is kinda like an automated version of Simon’s recent investigation [1]. The two “unknown” entries are due to Disarchive’s lack of support for those compression formats.

I just wrote this script today (mind the rough edges), and I’ve learned a lot from trying it on a few packages. It’s a little like a terrifying robotic TODO list, since it shows a lot of problems, but it’s also exciting because solving all the problems for the Guix package, say, would be a massive leap forward. Here’s a rough road map for that based on a glance at the script’s output:

  • Subversion support (for TeX-based documentation stuff, I guess)
  • bzip2 support for Disarchive (there are 45 bzip2 tarballs)
  • ZIP support for Disarchive (for the 8 ZIP files)
  • lzip support for Disarchive (or a workaround for ed)
  • Fix some issues (gettext is .tar.gz, but something went wrong)
  • Do something with the static bootstrap binaries

[1] https://lists.gnu.org/archive/html/guix-devel/2023-02/msg00398.html

If you want to try it out for yourself, you’ll need to download the database . Heads up: it’s just over 200M, and my server can be pretty slow.

One other stray thought: the script should work with the time machine, so you can check on packages from the past. I didn’t test it, but I bet it’s fine.

Okay. Here are the rest of my notes about the report itself.

One thing that jumps out at me is the 189 Git sources that SWH does not have. Usually they have basically all of the non-recursive Git sources. It’s something to look into.

I also took a quick peek at the 1.9K “unknown” tar-gz sources. About 39% of them are old Rust crates. It’s a known problem with Disarchive. However, 42% of them are old Bioconductor packages. They seem to be lost.
It looks like Bioconductor now stores multiple package versions per Bioconductor version [2], but before version 3.15 that was not the case. As an example, take “ggcyto” from Bioconductor 3.10 [3]. We packaged version 1.14.0, and then at some point Bioconductor 3.10 switched to version 1.14.1. We packaged that, too, but now 1.14.0 is gone. I know it’s been discussed before, but I can’t remember what the conclusion was. Are these just gone forever? I’m doing another pass through all of them and recovering a few from the bordeaux substitute server, but only a handful.

[2] https://bioconductor.org/packages/3.15/bioc/src/contrib/Archive/DiffBind/
[3] https://bioconductor.org/packages/3.10/bioc/html/ggcyto.html

That’s all for now. Enjoy the update and the script!

-- Tim
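
P.S. For anyone who would rather tally the script’s table than eyeball it after “sort -k 4”, here is a minimal sketch in Python. It is not part of the attached script. It assumes each row has exactly the five whitespace-separated columns described above (hash algorithm, hash, SWHID, status, store filename), which may not hold for the real, unedited output. The names `parse_row` and `summarize` are made up for illustration, and the sample rows are the trimmed ones from this message.

```python
from collections import Counter


def parse_row(line):
    """Split one row into its five columns, or return None if it does not fit.

    Assumption: columns are separated by whitespace and the store filename
    comes last, as in the trimmed sample from the message.
    """
    fields = line.split(None, 4)
    if len(fields) != 5:
        return None  # e.g. blank lines or "[... snipped ...]" markers
    algo, digest, swhid, status, store_path = fields
    return {
        "algo": algo,
        "hash": digest,
        "swhid": swhid,
        "status": status,
        "path": store_path.strip(),
    }


def summarize(lines):
    """Return (count of rows per status, store paths that are not 'stored')."""
    rows = [row for row in map(parse_row, lines) if row is not None]
    counts = Counter(row["status"] for row in rows)
    missing = [row["path"] for row in rows if row["status"] != "stored"]
    return counts, missing


# Trimmed sample rows copied from the message (hashes and SWHIDs are cut short).
SAMPLE = """\
sha256 0r5p. swh:1:dir:02f7. stored /gnu/store/.-gmp-6.0.0a.tar.xz
sha256 0c3k. swh:1:dir:6027. stored /gnu/store/.-mescc-....tar.xz
sha256 1r1z. swh:1:dir:6087. stored /gnu/store/.-bash-2.05b.tar.gz
sha256 14l0. unknown unknown /gnu/store/.-gcc-4.9.4.tar.bz2
sha256 0m2y. unknown unknown /gnu/store/.-ed-1.17.tar.lz
"""

if __name__ == "__main__":
    counts, missing = summarize(SAMPLE.splitlines())
    for status, n in sorted(counts.items()):
        print(f"{status}: {n}")
    print("not known to be stored:")
    for path in missing:
        print(f"  {path}")
```

To use it on real output, one could pipe the script’s table into a small wrapper that passes `sys.stdin` to `summarize` instead of `SAMPLE`. Rows that do not split into five fields are skipped rather than guessed at.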