unofficial mirror of guix-devel@gnu.org 
* Preservation of Guix (PoG) report 2023-03-13
@ 2023-03-14  1:37 Timothy Sample
  2023-03-14 10:36 ` Simon Tournier
  2023-03-16 16:41 ` Ludovic Courtès
  0 siblings, 2 replies; 6+ messages in thread
From: Timothy Sample @ 2023-03-14  1:37 UTC
  To: guix-devel

Hi Guix,

It’s been a while!  :)

Allow me to present to you a long-overdue update to the Preservation of
Guix (PoG) report: <https://ngyro.com/pog-reports/2023-03-13/>.  🎉

Note that you can link to the most recent version of the report using
<https://ngyro.com/pog-reports/latest/>.

What is this?  Well, I added a description to the report itself, but
here’s a brief teaser.  The PoG report shows what we know about the
archival status of the approximately 54K sources (and counting) Guix has
linked to since around the time of the 1.0 release.

For this edition, I took a bit of time to fix the contrast and colours
to be a bit more accessible.  They’re about half as garish as they used
to be, too.

Over the whole set, 77.1% are known to be safely tucked away in the
Software Heritage archive.  But it’s actually much better than that.  If
we only look at the most recent sampled commit (from Sunday the 5th),
that number becomes 87.4%, which is starting to look pretty good!

I have a few more notes on the report, but I want to put this near the
top of the message so that people will see it.  :)  I wrote a script
(see attached) that uses the PoG database to find missing sources on a
package-by-package basis.  That is, you can run

    guix repl specification-to-swhids.scm pog.db bash

and it will print a table of all of the transitive sources needed to
build Bash, along with their preservation status.  Here’s a (heavily
edited and snipped to fit an email message) sample of its output:

[... many “stored” inputs]
sha256 0r5p. swh:1:dir:02f7. stored  /gnu/store/.-gmp-6.0.0a.tar.xz
sha256 0c3k. swh:1:dir:6027. stored  /gnu/store/.-mescc-....tar.xz
sha256 1r1z. swh:1:dir:6087. stored  /gnu/store/.-bash-2.05b.tar.gz
sha256 14l0. unknown         unknown /gnu/store/.-gcc-4.9.4.tar.bz2
sha256 0m2y. unknown         unknown /gnu/store/.-ed-1.17.tar.lz
[... more “unknown” inputs]

(I had to pipe the output to “sort -k 4” to have it sorted by status.)
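
Because the script prints one tab-separated line per source, the status
column can be tallied as well as sorted.  What follows is a minimal
sketch, assuming the five-field layout produced by “print-source” in the
attached script (algorithm, hash, SWHID, status, store filename).  The
helper names “count_by_status” and “sample_rows” and the sample rows
themselves are made up for illustration; they are not part of the
script or the database.

```shell
# Tally the fourth tab-separated field (the status) of the script's output.
# count_by_status is a hypothetical helper, not part of the attached script.
count_by_status() {
    awk -F '\t' '
        { count[$4]++ }
        END { for (s in count) printf "%s\t%d\n", s, count[s] }
    ' | sort
}

# Made-up rows in the same layout.  Real usage would look something like:
#   guix repl specification-to-swhids.scm pog.db bash | count_by_status
sample_rows() {
    printf 'sha256\t0aaa\tswh:1:dir:1111\tstored\t/gnu/store/x-a.tar.xz\n'
    printf 'sha256\t0bbb\tswh:1:dir:2222\tstored\t/gnu/store/x-b.tar.gz\n'
    printf 'sha256\t0ccc\tunknown\tunknown\t/gnu/store/x-c.tar.lz\n'
}

sample_rows | count_by_status
```

Splitting on tabs rather than on runs of blanks keeps the padded SWHID
column from shifting the field numbers.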

The first two columns are the Guix hash.  The next two columns are the
SWHID (if known) and whether SWH has it (if known).  That last column is
the store filename (which is nice because it usually tells you what it
is we are looking at).  In this sample, you can see that GMP, MesCC
Tools, and Bash are all safe.  However, we don’t know about GCC 4 and
ed.  This is kinda like an automated version of Simon’s recent
investigation [1].  The two “unknown” entries are due to Disarchive’s lack of
support for those compression formats.  I just wrote this script today
(mind the rough edges), and I’ve learned a lot from trying it on a few
packages.  It’s a little like a terrifying robotic TODO list, since it
shows a lot of problems, but it’s also exciting because solving all the
problems for the Guix package, say, would be a massive leap forward.
Here’s a rough road map for that based on a glance at the script’s
output:

    • Subversion support (for TeX-based documentation stuff, I guess)
    • bzip2 support for Disarchive (there are 45 bzip2 tarballs)
    • ZIP support for Disarchive (for the 8 ZIP files)
    • lzip support for Disarchive (or a workaround for ed)
    • Fix some issues (gettext is .tar.gz, but something went wrong)
    • Do something with the static bootstrap binaries

[1] https://lists.gnu.org/archive/html/guix-devel/2023-02/msg00398.html
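
Counts like the ones in the road map above can be approximated from the
script’s output by grouping the “unknown” rows by filename suffix.  This
is only a sketch under the same assumption about the tab-separated
layout (status in field 4, store filename in field 5).  The helper names
and the sample rows are hypothetical, and the suffix after the last dot
is a rough stand-in for the compression format.

```shell
# unknown_by_suffix: hypothetical helper that reads the script's
# tab-separated output and counts "unknown" sources per filename suffix.
unknown_by_suffix() {
    awk -F '\t' '
        $4 == "unknown" {
            n = split($5, parts, ".")
            suffix = (n > 1) ? parts[n] : "(none)"
            count[suffix]++
        }
        END { for (s in count) printf "%s\t%d\n", s, count[s] }
    ' | sort
}

# Made-up rows for illustration only.
sample_unknown_rows() {
    printf 'sha256\t0aaa\tunknown\tunknown\t/gnu/store/x-one-1.0.tar.bz2\n'
    printf 'sha256\t0bbb\tunknown\tunknown\t/gnu/store/x-two-2.0.tar.bz2\n'
    printf 'sha256\t0ccc\tunknown\tunknown\t/gnu/store/x-three-3.0.tar.lz\n'
    printf 'sha256\t0ddd\tswh:1:dir:4444\tstored\t/gnu/store/x-four-4.0.tar.gz\n'
}

sample_unknown_rows | unknown_by_suffix
```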

If you want to try it out for yourself, you’ll need to download the
database <https://ngyro.com/pog-reports/2023-03-13/pog.db>.  Heads up:
it’s just over 200M, and my server can be pretty slow.

One other stray thought: the script should work with the time machine,
so you can check on packages from the past.  I didn’t test it, but I bet
it’s fine.
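
A sketch of what such an invocation might look like, combining “guix
time-machine” with the “guix repl” command shown earlier.  Like the
idea itself, this is untested against a real Guix checkout.  The helper
“pog_at_commit”, the “DRY_RUN” variable, and the commit “0123abc” are
placeholders invented for the example.

```shell
# pog_at_commit COMMIT DB-FILENAME SPECIFICATION
# Hypothetical wrapper: run the script with Guix as of COMMIT.
pog_at_commit() {
    commit=$1
    shift
    # With DRY_RUN set, "echo" is prepended, so the command is only printed.
    ${DRY_RUN:+echo} guix time-machine --commit="$commit" -- \
        repl specification-to-swhids.scm "$@"
}

# 0123abc is a placeholder, not a real Guix commit.
DRY_RUN=1 pog_at_commit 0123abc pog.db bash
```

Leaving “DRY_RUN” unset runs the command instead of printing it.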

Okay.  Here are the rest of my notes about the report itself.

One thing that jumps out at me is the 189 Git sources that SWH does not
have.  Usually they have basically all of the non-recursive Git sources.
It’s something to look into.

I also took a quick peek at the 1.9K “unknown” tar-gz sources.  About
39% of them are old Rust crates.  It’s a known problem with
Disarchive.  However, 42% of them are old Bioconductor packages.  They
seem to be lost.  It looks like Bioconductor now stores multiple package
versions per Bioconductor version [2], but before version 3.15 that was
not the case.  As an example, take “ggcyto” from Bioconductor 3.10 [3].
We packaged version 1.14.0, and then at some point Bioconductor 3.10
switched to version 1.14.1.  We packaged that, too, but now 1.14.0 is
gone.  I know it’s been discussed before, but I can’t remember what the
conclusion was.  Are these just gone forever?  I’m doing another pass
through all of them and recovering a few from the bordeaux substitute
server, but only a handful.

[2] https://bioconductor.org/packages/3.15/bioc/src/contrib/Archive/DiffBind/
[3] https://bioconductor.org/packages/3.10/bioc/html/ggcyto.html

That’s all for now.  Enjoy the update and the script!


-- Tim


[-- Attachment #2: specification-to-swhids.scm --]
[-- Type: text/plain, Size: 6548 bytes --]

;;; specification-to-swhids.scm
;;; Copyright © 2023 Timothy Sample <samplet@ngyro.com>
;;;
;;; This program is free software: you can redistribute it and/or modify
;;; it under the terms of the GNU General Public License as published by
;;; the Free Software Foundation, either version 3 of the License, or (at
;;; your option) any later version.
;;;
;;; This program is distributed in the hope that it will be useful, but
;;; WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
;;; General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with this program. If not, see <https://www.gnu.org/licenses/>.

(use-modules (gnu packages)
             (guix base32)
             (guix derivations)
             (guix gexp)
             (guix monads)
             (guix store)
             (ice-9 format)
             (ice-9 getopt-long)
             (ice-9 match)
             (sqlite3)
             (srfi srfi-9 gnu))

;;; Database stuff

(define (call-with-sqlite-db filename proc)
  "Open the SQLite database at FILENAME and pass the resulting
connection to PROC.  The connection will only be open during the dynamic
extent of PROC.  If that dynamic extent is re-entered (using a
continuation, say), the database connection will be re-established."
  (let ((db #f))
    (dynamic-wind
      (lambda ()
        (set! db (sqlite-open filename)))
      (lambda ()
        (proc db))
      (lambda ()
        (sqlite-close db)
        (set! db #f)))))

(define (database-lookup db query params converter)
  "Using the SQLite database connection DB, run QUERY with PARAMS, and
map CONVERTER over the resulting rows."
  (let* ((stmt (sqlite-prepare db query))
         (_ (unless (null? params)
              (apply sqlite-bind-arguments stmt params)))
         (result (sqlite-fold (lambda (x acc)
                                (cons (converter x) acc))
                              '()
                              stmt)))
    (sqlite-finalize stmt)
    result))

(define (lookup-swh-status db algorithm hash)
  "Using the SQLite database connection DB, look up the SWHID of the
fixed-output derivation with the ALGORITHM-computed checksum HASH.
Here, both ALGORITHM and HASH are strings, the latter being the Nix
base-32 representation of the hash value."
  (define query "\
SELECT swhid, is_in_swh
FROM fods
WHERE algorithm = ?
    AND hash = ?")
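  (define (converter row) row)
  ;; The lookup yields a (possibly empty) list of rows.  The empty list
  ;; is true in Scheme, so 'and=>' would call 'car' on it and raise an
  ;; error; match on the list and fall back to an all-unknown row.
  (match (database-lookup db query (list algorithm hash) converter)
    ((row . _) row)
    (() (vector #f #f))))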

;;; Guix stuff

(define (derivation-transitive-fixed-output-inputs drv)
  "Compute the list of all fixed-output derivations in the transitive
inputs of the derivation DRV."
  (define seen (make-hash-table))
  (define fod-hashes (make-hash-table))

  (define (seen? drv)
    (hashq-ref seen drv))

  (let loop ((queue (list drv)))
    (match queue
      (() (hash-map->list cons fod-hashes))
      ((drv . rest)
       (hashq-set! seen drv #t)
       (when (fixed-output-derivation? drv)
         (let* ((out (assoc-ref (derivation-outputs drv) "out"))
                (algo (derivation-output-hash-algo out))
                (hash (derivation-output-hash out))
                (filename (derivation-output-path out)))
           (hash-set! fod-hashes (cons algo hash) filename)))
       (loop (append (filter (negate seen?)
                             (map derivation-input-derivation
                                  (derivation-inputs drv)))
                     rest))))))

(define (lookup-object-hashes obj)
  "Get the list of Guix hashes needed for the lowerable object OBJ."
  (let ((drv (run-with-store (open-connection)
               (lower-object obj))))
    (derivation-transitive-fixed-output-inputs drv)))

;;; Glue

(define-immutable-record-type <source>
  (make-source algorithm hash filename swhid in-swh?)
  source?
  (algorithm source-algorithm)
  (hash source-hash)
  (filename source-filename)
  (swhid source-swhid)
  (in-swh? source-in-swh?))

(define (guix-hash->source db hash-obj)
  "Using the SQLite database connection DB, convert HASH-OBJ to a source
record.  HASH-OBJ should be a result from 'lookup-object-hashes'."
  (match-let* ((((algorithm . hash) . filename) hash-obj)
               (algorithm (symbol->string algorithm))
               (hash (bytevector->nix-base32-string hash))
               (#(swhid in-swh?) (lookup-swh-status db algorithm hash)))
    (make-source algorithm hash filename swhid in-swh?)))

(define (object-sources db obj)
  "Using the SQLite database connection DB, get the list of source
records for the lowerable object OBJ."
  (let ((hashes (lookup-object-hashes obj)))
    (map (lambda (hash) (guix-hash->source db hash)) hashes)))

;;; Shell interface

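(define (print-source src)
  "Print the source record SRC as one tab-separated line: algorithm, hash,
SWHID, preservation status, and store filename."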
  (format #t "~a\t~a\t~50a\t~a\t~a~%"
          (source-algorithm src)
          (source-hash src)
          (or (source-swhid src) "unknown")
          (cond
           ((source-in-swh? src) "stored")
           ((source-swhid src) "missing")
           (else "unknown"))
          (source-filename src)))

(define version "2023-03-13-0")

(define version-message (string-append "\
specification-to-swhids.scm " version "
"))

(define help-message "\
Usage: guix repl specification-to-swhids.scm DB-FILENAME SPECIFICATION

Print a table of Guix hashes, SWHIDs, and store filenames for the Guix
package SPECIFICATION using the Preservation of Guix database at
DB-FILENAME.  See <https://ngyro.com/pog-reports/latest>.
")

(define options-grammar
  `((help (single-char #\h))
    (version (single-char #\V))))

(define (main args)
  (let ((options (getopt-long args options-grammar)))
    (when (option-ref options 'help #f)
      (display help-message)
      (exit EXIT_SUCCESS))
    (when (option-ref options 'version #f)
      (display version-message)
      (exit EXIT_SUCCESS))
    (match (option-ref options '() #f)
      ((db-filename specification)
       (for-each print-source
                 (let ((obj (specification->package specification)))
                   (call-with-sqlite-db db-filename
                     (lambda (db)
                       (object-sources db obj)))))
       (exit EXIT_SUCCESS))
      (_
       (display help-message)
       (exit EXIT_FAILURE)))))

(main (command-line))

