* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
@ 2020-07-02  7:29 Ludovic Courtès
  2020-07-02  8:50 ` zimoun
                   ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Ludovic Courtès @ 2020-07-02  7:29 UTC
  To: 42162; +Cc: Maurice Brémond

[-- Attachment #1: Type: text/plain, Size: 2952 bytes --]

Hello!

The hosting site gforge.inria.fr will be taken off-line in December
2020. This GForge instance hosts source code as tarballs, Subversion
repos, and Git repos. Users have been invited to migrate to
gitlab.inria.fr, which is Git only. It seems that Software Heritage
hasn’t archived (yet) all of gforge.inria.fr. Let’s keep track of the
situation in this issue.

The following packages have their source on gforge.inria.fr:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,pp packages-on-gforge
$7 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
 #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
 #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>
 #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
 #<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640>
 #<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780>
 #<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0>
 #<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0>
 #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
 #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
 #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
--8<---------------cut here---------------end--------------->8---

‘isl’ (a dependency of GCC) has its source on gforge.inria.fr but it’s
also mirrored at gcc.gnu.org apparently.

Of these, the following are available on Software Heritage:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,pp archived-source
$8 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
 #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>
 #<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640>
 #<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780>
 #<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0>
 #<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0>
 #<package isl@0.18 gnu/packages/gcc.scm:925 7f632dc82320>
 #<package isl@0.11.1 gnu/packages/gcc.scm:939 7f632dc82280>)
--8<---------------cut here---------------end--------------->8---

So we’ll be missing these:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
$11 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
 #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
 #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
 #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
 #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
--8<---------------cut here---------------end--------------->8---

Attached the code I used for this.

Thanks,
Ludo’.

[-- Attachment #2: the code --]
[-- Type: text/plain, Size: 1284 bytes --]

(use-modules (guix) (gnu)
             (guix svn-download)
             (guix git-download)
             (guix swh)
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-26))

(define (gforge? package)
  (define (gforge-string? str)
    (string-contains str "gforge.inria.fr"))

  (match (package-source package)
    ((? origin? o)
     (match (origin-uri o)
       ((? string? url)
        (gforge-string? url))
       (((? string? urls) ...)
        (any gforge-string? urls))                ;or 'find'
       ((? git-reference? ref)
        (gforge-string? (git-reference-url ref)))
       ((? svn-reference? ref)
        (gforge-string? (svn-reference-url ref)))
       (_ #f)))
    (_ #f)))

(define packages-on-gforge
  (fold-packages (lambda (package result)
                   (if (gforge? package)
                       (cons package result)
                       result))
                 '()))

(define archived-source
  (filter (lambda (package)
            (let* ((origin (package-source package))
                   (hash   (origin-hash origin)))
              (lookup-content (content-hash-value hash)
                              (symbol->string
                               (content-hash-algorithm hash)))))
          packages-on-gforge))
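
The `lset-difference' step used at the REPL above can be tried without
Guix. The following is a minimal sketch, assuming plain symbols as
stand-ins for the package records; the names `on-gforge', `archived'
and `missing' are illustrative and do not appear in the attached code.
With `eq?' as the equality predicate, it works in the original session
because both lists hold the very same package objects.

```scheme
(use-modules (srfi srfi-1))

;; Hypothetical stand-ins for the package records printed at the REPL;
;; symbols keep the sketch independent of Guix.
(define on-gforge
  '(r-spams ocaml-cudf ocaml-dose3 mpfi pt-scotch scotch
    pt-scotch32 scotch32 gf2x gmp-ecm cmh))

(define archived
  '(ocaml-cudf ocaml-dose3 pt-scotch scotch pt-scotch32 scotch32 isl))

;; Elements of ON-GFORGE that have no 'eq?' counterpart in ARCHIVED.
(define missing
  (lset-difference eq? on-gforge archived))

(display missing)
(newline)
```

Entries that appear only in the second list (here `isl') are simply
ignored, which matches the session above, where `isl' shows up among
the archived sources but not in the gforge list.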
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2020-07-02  7:29 bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Ludovic Courtès
@ 2020-07-02  8:50 ` zimoun
  2020-07-02 10:03   ` Ludovic Courtès
  2021-01-10 19:32 ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Maxim Cournoyer
       [not found] ` <handler.42162.D42162.16105343699609.notifdone@debbugs.gnu.org>
  2 siblings, 1 reply; 55+ messages in thread
From: zimoun @ 2020-07-02  8:50 UTC
  To: Ludovic Courtès, 42162; +Cc: Maurice Brémond

Hi Ludo,

On Thu, 02 Jul 2020 at 09:29, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> The hosting site gforge.inria.fr will be taken off-line in December
> 2020. This GForge instance hosts source code as tarballs, Subversion
> repos, and Git repos. Users have been invited to migrate to
> gitlab.inria.fr, which is Git only. It seems that Software Heritage
> hasn’t archived (yet) all of gforge.inria.fr. Let’s keep track of the
> situation in this issue.

[...]

> --8<---------------cut here---------------start------------->8---
> scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
> $11 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
>  #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
>  #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
>  #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
>  #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
> --8<---------------cut here---------------end--------------->8---

All 5 are 'url-fetch', so we can expect that sources.json will be up
before the shutdown in December. :-)

Then all the 14 packages we have from gforge.inria.fr will be
git-fetch, right? So should we contact upstream and ask them to inform
us when they switch? Then we can adapt the origin.

> (use-modules (guix) (gnu)
>              (guix svn-download)
>              (guix git-download)
>              (guix swh)

It does not work properly unless I replace this with

        ((guix swh) #:hide (origin?))

Well, I have not investigated further.

>              (ice-9 match)
>              (srfi srfi-1)
>              (srfi srfi-26))

[...]

> (define archived-source
>   (filter (lambda (package)
>             (let* ((origin (package-source package))
>                    (hash   (origin-hash origin)))
>               (lookup-content (content-hash-value hash)
>                               (symbol->string
>                                (content-hash-algorithm hash)))))
>           packages-on-gforge))

I am a bit lost about the other discussion on falling back for tarballs.
But that's another story. :-)

Cheers,
simon
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2020-07-02  8:50 ` zimoun
@ 2020-07-02 10:03   ` Ludovic Courtès
  2020-07-11 15:50     ` bug#42162: Recovering source tarballs Ludovic Courtès
  0 siblings, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2020-07-02 10:03 UTC
  To: zimoun; +Cc: 42162, Maurice Brémond

zimoun <zimon.toutoune@gmail.com> skribis:

> On Thu, 02 Jul 2020 at 09:29, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
>
>> The hosting site gforge.inria.fr will be taken off-line in December
>> 2020. This GForge instance hosts source code as tarballs, Subversion
>> repos, and Git repos. Users have been invited to migrate to
>> gitlab.inria.fr, which is Git only. It seems that Software Heritage
>> hasn’t archived (yet) all of gforge.inria.fr. Let’s keep track of the
>> situation in this issue.
>
> [...]
>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
>> $11 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
>>  #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
>>  #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
>>  #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
>>  #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
>> --8<---------------cut here---------------end--------------->8---
>
> All 5 are 'url-fetch', so we can expect that sources.json will be up
> before the shutdown in December. :-)

Unfortunately, it won’t help for tarballs:

  https://sympa.inria.fr/sympa/arc/swh-devel/2020-07/msg00001.html

There’s this other discussion you mentioned, which I hope will have a
positive outcome:

  https://forge.softwareheritage.org/T2430

>> (use-modules (guix) (gnu)
>>              (guix svn-download)
>>              (guix git-download)
>>              (guix swh)
>
> It does not work properly unless I replace this with
>
>         ((guix swh) #:hide (origin?))

Oh right, I had overlooked this as I played at the REPL.

Thanks,
Ludo’.
* bug#42162: Recovering source tarballs
  2020-07-02 10:03 ` Ludovic Courtès
@ 2020-07-11 15:50   ` Ludovic Courtès
  2020-07-13 19:20     ` Christopher Baines
                       ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Ludovic Courtès @ 2020-07-11 15:50 UTC
  To: zimoun; +Cc: 42162, Maurice Brémond

[-- Attachment #1: Type: text/plain, Size: 4874 bytes --]

Hi,

Ludovic Courtès <ludo@gnu.org> skribis:

> There’s this other discussion you mentioned, which I hope will have a
> positive outcome:
>
>   https://forge.softwareheritage.org/T2430

This discussion as well as discussions on #swh-devel have made it clear
that SWH will not archive raw tarballs, at least not in the foreseeable
future. Instead, it will keep archiving the contents of tarballs, as it
has always done—that’s already a huge service.

Not storing raw tarballs makes sense from an engineering perspective,
but it does mean that we cannot rely on SWH as a content-addressed
mirror for tarballs. (In fact, some raw tarballs are available on SWH,
but that’s mostly “by chance”, for instance because they appear as-is in
a Git repo that was ingested.) In fact this is one of the challenges
mentioned in
<https://guix.gnu.org/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/>.

So we need a solution for now (and quite urgently), and a solution for
the future.

For now, since 70% of our packages use ‘url-fetch’, we need to be
able to fetch or to reconstruct tarballs. There’s no way around it.

In the short term, we should arrange so that the build farm keeps GC
roots on source tarballs for an indefinite amount of time. Cuirass
jobset? Mcron job to preserve GC roots? Ideas?

For the future, we could store nar hashes of unpacked tarballs instead
of hashes over tarballs. But that raises two questions:

  • If we no longer deal with tarballs but upstreams keep signing
    tarballs (not raw directory hashes), how can we authenticate our
    code after the fact?

  • SWH internally stores Git-tree hashes, not nar hashes, so we still
    wouldn’t be able to fetch our unpacked trees from SWH.

(Both issues were previously discussed at
<https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)

So for the medium term, and perhaps for the future, a possible option
would be to preserve tarball metadata so we can reconstruct them:

  tarball = metadata + tree

After all, tarballs are byproducts and should be no exception: we should
build them from source. :-)

In <https://forge.softwareheritage.org/T2430>, Stefano mentioned
pristine-tar, which does almost that, but not quite: it stores a binary
delta between a tarball and a tree:

  https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html

I think we should have something more transparent than a binary delta.

The code below can “disassemble” and “assemble” a tar. When it
disassembles it, it generates metadata like this:

--8<---------------cut here---------------start------------->8---
(tar-source
  (version 0)
  (headers
    (("guile-3.0.4/"
      (mode 493)
      (size 0)
      (mtime 1593007723)
      (chksum 3979)
      (typeflag #\5))
     ("guile-3.0.4/m4/"
      (mode 493)
      (size 0)
      (mtime 1593007720)
      (chksum 4184)
      (typeflag #\5))
     ("guile-3.0.4/m4/pipe2.m4"
      (mode 420)
      (size 531)
      (mtime 1536050419)
      (chksum 4812)
      (hash (sha256
              "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
     ("guile-3.0.4/m4/time_h.m4"
      (mode 420)
      (size 5471)
      (mtime 1536050419)
      (chksum 4974)
      (hash (sha256
              "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
     […]
--8<---------------cut here---------------end--------------->8---

The ’assemble-archive’ procedure consumes that, looks up file contents
by hash on SWH, and reconstructs the original tarball…

… at least in theory, because in practice we hit the SWH rate limit
after looking up a few files:

  https://archive.softwareheritage.org/api/#rate-limiting

So it’s a bit ridiculous, but we may have to store a SWH “dir”
identifier for the whole extracted tree—a Git-tree hash—since that would
allow us to retrieve the whole thing in a single HTTP request.

Besides, we’ll also have to handle compression: storing gzip/xz headers
and compression levels.

How would we put that in practice? Good question. :-)

I think we’d have to maintain a database that maps tarball hashes to
metadata (!). A simple version of it could be a Git repo where, say,
‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
contain the metadata above. The nice thing is that the Git repo itself
could be archived by SWH. :-)

Thus, if a tarball vanishes, we’d look it up in the database and
reconstruct it from its metadata plus the content stored in SWH.

Thoughts?

Anyhow, we should team up with fellow NixOS and SWH hackers to address
this, and with developers of other distros as well—this problem is not
just that of the functional deployment geeks, is it?

Ludo’.

[-- Attachment #2: the tar assembler/disassembler --]
[-- Type: text/plain, Size: 15660 bytes --]

;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2020 Ludovic Courtès <ludo@gnu.org>
;;;
;;; This file is part of GNU Guix.
;;;
;;; GNU Guix is free software; you can redistribute it and/or modify it
;;; under the terms of the GNU General Public License as published by
;;; the Free Software Foundation; either version 3 of the License, or (at
;;; your option) any later version.
;;;
;;; GNU Guix is distributed in the hope that it will be useful, but
;;; WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with GNU Guix. If not, see <http://www.gnu.org/licenses/>.

(define-module (tar)
  #:use-module (ice-9 match)
  #:use-module (ice-9 binary-ports)
  #:use-module (rnrs bytevectors)
  #:use-module (srfi srfi-1)
  #:use-module (srfi srfi-9)
  #:use-module (srfi srfi-26)
  #:use-module (gcrypt hash)
  #:use-module (guix base16)
  #:use-module (guix base32)
  ;; 'dump-port' is used by 'assemble-archive' below.
  #:use-module ((guix build utils) #:select (dump-port))
  #:use-module ((ice-9 rdelim) #:select ((read-string . get-string-all)))
  #:use-module (web client)
  #:use-module (web response)
  #:export (disassemble-archive
            assemble-archive))


;;;
;;; Tar.
;;;

(define %TMAGIC "ustar\0")
(define %TVERSION "00")

(define-syntax-rule (define-field-type type type-size read-proc write-proc)
  "Define TYPE as a ustar header field type of TYPE-SIZE bytes. READ-PROC is
the procedure to obtain the value of an object of this type from a bytevector,
and WRITE-PROC writes it to a bytevector."
  (define-syntax type
    (syntax-rules (read write size)
      ((_ size)  type-size)
      ((_ read)  read-proc)
      ((_ write) write-proc))))

(define (sub-bytevector bv offset size)
  (let ((sub (make-bytevector size)))
    (bytevector-copy! bv offset sub 0 size)
    sub))

(define (read-integer bv offset len)
  (string->number (read-string bv offset len) 8))

(define read-integer12 (cut read-integer <> <> 12))
(define read-integer8 (cut read-integer <> <> 8))

(define (read-string bv offset max-len)
  (define len
    (let loop ((len 0))
      (cond ((= len max-len)
             len)
            ((zero? (bytevector-u8-ref bv (+ offset len)))
             len)
            (else
             (loop (+ 1 len))))))

  (utf8->string (sub-bytevector bv offset len)))

(define read-string155 (cut read-string <> <> 155))
(define read-string100 (cut read-string <> <> 100))
(define read-string32 (cut read-string <> <> 32))
(define read-string6 (cut read-string <> <> 6))
(define read-string2 (cut read-string <> <> 2))

(define (read-character bv offset)
  (integer->char (bytevector-u8-ref bv offset)))

(define (read-padding12 bv offset)
  (bytevector-uint-ref bv offset (endianness big) 12))

(define (write-integer! bv offset value len)
  (let ((str (string-pad (number->string value 8) (- len 1) #\0)))
    (write-string! bv offset str len)))

(define write-integer12! (cut write-integer! <> <> <> 12))
(define write-integer8! (cut write-integer! <> <> <> 8))

(define (write-string! bv offset str len)
  (let* ((str (string-pad-right str len #\nul))
         (buf (string->utf8 str)))
    (bytevector-copy! buf 0 bv offset (bytevector-length buf))))

(define write-string155! (cut write-string! <> <> <> 155))
(define write-string100! (cut write-string! <> <> <> 100))
(define write-string32! (cut write-string! <> <> <> 32))
(define write-string6! (cut write-string! <> <> <> 6))
(define write-string2! (cut write-string! <> <> <> 2))

(define (write-character! bv offset value)
  (bytevector-u8-set! bv offset (char->integer value)))

(define (write-padding12! bv offset value)
  (bytevector-uint-set! bv offset value (endianness big) 12))

(define-field-type integer12 12 read-integer12 write-integer12!)
(define-field-type integer8 8 read-integer8 write-integer8!)
(define-field-type character 1 read-character write-character!)
(define-field-type string155 155 read-string155 write-string155!)
(define-field-type string100 100 read-string100 write-string100!)
(define-field-type string32 32 read-string32 write-string32!)
(define-field-type string6 6 read-string6 write-string6!)
(define-field-type string2 2 read-string2 write-string2!)
(define-field-type padding12 12 read-padding12 write-padding12!)

(define-syntax define-pack
  (syntax-rules ()
    ((_ type ctor pred write-header read-header
        (field-names field-types field-getters) ...)
     (begin
       (define-record-type type
         (ctor field-names ...)
         pred
         (field-names field-getters) ...)

       (define (read-header port)
         "Return the ustar header read from PORT."
         (set-port-encoding! port "ISO-8859-1")
         (let ((bv (get-bytevector-n port (+ (field-types size) ...))))
           (letrec-syntax ((build
                            (syntax-rules ()
                              ((_ bv () offset (fields (... ...)))
                               (ctor fields (... ...)))
                              ((_ bv (type0 types (... ...))
                                  offset (fields (... ...)))
                               (build bv
                                      (types (... ...))
                                      (+ offset (type0 size))
                                      (fields (... ...)
                                              ((type0 read) bv offset)))))))
             (build bv (field-types ...) 0 ()))))

       (define (write-header header port)
         "Serialize HEADER, a <ustar-header> record, to PORT."
         (let* ((len (+ (field-types size) ...))
                (bv  (make-bytevector len)))
           (match header
             (($ type field-names ...)
              (letrec-syntax ((write!
                               (syntax-rules ()
                                 ((_ () offset)
                                  #t)
                                 ((_ ((type value) rest (... ...)) offset)
                                  (begin
                                    ((type write) bv offset value)
                                    (write! (rest (... ...))
                                            (+ offset (type size))))))))
                (write! ((field-types field-names) ...) 0)
                (put-bytevector port bv)))))))))

;; The ustar header. See <tar.h>.
(define-pack <ustar-header>
  %make-ustar-header ustar-header?
  write-ustar-header read-ustar-header
  (name      string100 ustar-header-name)         ;NUL-terminated if NUL fits
  (mode      integer8  ustar-header-mode)
  (uid       integer8  ustar-header-uid)
  (gid       integer8  ustar-header-gid)
  (size      integer12 ustar-header-size)
  (mtime     integer12 ustar-header-mtime)
  (chksum    integer8  ustar-header-checksum)
  (typeflag  character ustar-header-type-flag)
  (linkname  string100 ustar-header-link-name)
  (magic     string6   ustar-header-magic)        ;must be TMAGIC
  (version   string2   ustar-header-version)      ;must be TVERSION
  (uname     string32  ustar-header-uname)        ;NUL-terminated
  (gname     string32  ustar-header-gname)        ;NUL-terminated
  (devmajor  integer8  ustar-header-device-major)
  (devminor  integer8  ustar-header-device-minor)
  (prefix    string155 ustar-header-prefix)       ;NUL-terminated if NUL fits
  (padding   padding12 ustar-header-padding))

(define* (make-ustar-header name
                            #:key
                            (mode 0) (uid 0) (gid 0) (size 0)
                            (mtime 0) (checksum 0) (type-flag 0)
                            (link-name "")
                            (magic %TMAGIC) (version %TVERSION)
                            (uname "") (gname "")
                            (device-major 0) (device-minor 0)
                            (prefix "") (padding 0))
  (%make-ustar-header name mode uid gid size mtime checksum
                      type-flag link-name magic version uname gname
                      device-major device-minor prefix padding))

(define %zero-header
  ;; The all-zeros header, which marks the end of stream.
  (read-ustar-header (open-bytevector-input-port
                      (make-bytevector 512 0))))

(define (consumer port)
  "Return a procedure that consumes or skips the given number of bytes from
PORT."
  (if (false-if-exception (seek port 0 SEEK_CUR))
      (lambda (len)
        (seek port len SEEK_CUR))
      (lambda (len)
        (define bv (make-bytevector 8192))
        (let loop ((len len))
          (define block (min len (bytevector-length bv)))
          (unless (or (zero? block)
                      (eof-object? (get-bytevector-n! port bv 0 block)))
            (loop (- len block)))))))

(define (fold-archive proc seed port)
  "Read ustar headers from PORT; for each header, call PROC."
  (define skip
    (consumer port))

  (let loop ((result seed))
    (define header
      (read-ustar-header port))

    (if (equal? header %zero-header)
        result
        (let* ((result    (proc header port result))
               (size      (ustar-header-size header))
               (remainder (modulo size 512)))
          ;; It's up to PROC to consume the SIZE bytes of data corresponding
          ;; to HEADER. Here we consume padding.
          (unless (zero? remainder)
            (skip (- 512 remainder)))
          (loop result)))))


;;;
;;; Disassembling/assembling an archive.
;;;

(define (dump in out size)
  "Copy SIZE bytes from IN to OUT."
  (define buf-size 65536)
  (define buf (make-bytevector buf-size))

  (let loop ((left size))
    (if (<= left 0)
        0
        (let ((read (get-bytevector-n! in buf 0 (min left buf-size))))
          (if (eof-object? read)
              left
              (begin
                (put-bytevector out buf 0 read)
                (loop (- left read))))))))

(define* (disassemble-archive port #:optional
                              (algorithm (hash-algorithm sha256)))
  "Read tar archive from PORT and return an sexp representing its metadata,
including individual file hashes with ALGORITHM."
  (define headers+hashes
    (fold-archive (lambda (header port result)
                    (if (zero? (ustar-header-size header))
                        (alist-cons header #f result)
                        (let ()
                          (define-values (hash-port get-hash)
                            (open-hash-port algorithm))

                          (dump port hash-port
                                (ustar-header-size header))
                          (close-port hash-port)
                          (alist-cons header (get-hash) result))))
                  '()
                  port))

  (define header+hash->sexp
    (match-lambda
      ((header . hash)
       (letrec-syntax ((serialize (syntax-rules ()
                                    ((_) '())
                                    ((_ (tag get default) rest ...)
                                     (let ((value (get header)))
                                       (append (if (equal? default value)
                                                   '()
                                                   `((tag ,value)))
                                               (serialize rest ...))))
                                    ((_ (tag get) rest ...)
                                     (append `((tag ,(get header)))
                                             (serialize rest ...))))))
         `(,(ustar-header-name header)
           ,@(serialize (mode ustar-header-mode)
                        (uid ustar-header-uid 0)
                        (gid ustar-header-gid 0)
                        (size ustar-header-size)
                        (mtime ustar-header-mtime)
                        (chksum ustar-header-checksum)
                        (typeflag ustar-header-type-flag #\nul)
                        (linkname ustar-header-link-name "")
                        (magic ustar-header-magic "")
                        (version ustar-header-version "")
                        (uname ustar-header-uname "")
                        (gname ustar-header-gname "")
                        (devmajor ustar-header-device-major 0)
                        (devminor ustar-header-device-minor 0)
                        (prefix ustar-header-prefix "")
                        (padding ustar-header-padding 0)
                        (hash (lambda (_)
                                (and hash
                                     `(,(hash-algorithm-name algorithm)
                                       ,(bytevector->base32-string hash))))
                              #f)))))))

  `(tar-source
    (version 0)
    (headers ,(map header+hash->sexp (reverse headers+hashes)))))

(define (fetch-from-swh algorithm hash)
  (define url
    (string-append "https://archive.softwareheritage.org/api/1/content/"
                   (symbol->string algorithm) ":"
                   (bytevector->base16-string hash) "/raw/"))

  (define-values (response port)
    (http-get url #:streaming? #t #:verify-certificate? #f))

  (if (= 200 (response-code response))
      port
      (throw 'swh-fetch-error url (get-string-all port))))

(define* (assemble-archive source port
                           #:optional (fetch-data fetch-from-swh))
  "Assemble archive from SOURCE, an sexp as returned by
'disassemble-archive'."
  (define sexp->header
    (match-lambda
      ((name . properties)
       (let ((ref (lambda (field)
                    (and=> (assq-ref properties field) car))))
         (make-ustar-header name
                            #:mode (ref 'mode)
                            #:uid (or (ref 'uid) 0)
                            #:gid (or (ref 'gid) 0)
                            #:size (ref 'size)
                            #:mtime (ref 'mtime)
                            #:checksum (ref 'chksum)
                            #:type-flag (or (ref 'typeflag) #\nul)
                            #:link-name (or (ref 'linkname) "")
                            #:magic (or (ref 'magic) "")
                            #:version (or (ref 'version) "")
                            #:uname (or (ref 'uname) "")
                            #:gname (or (ref 'gname) "")
                            #:device-major (or (ref 'devmajor) 0)
                            #:device-minor (or (ref 'devminor) 0)
                            #:prefix (or (ref 'prefix) "")
                            #:padding (or (ref 'padding) 0))))))

  (define sexp->data
    (match-lambda
      ((name . properties)
       (match (assq-ref properties 'hash)
         (((algorithm (= base32-string->bytevector hash)) _ ...)
          (fetch-data algorithm hash))
         (#f
          (open-input-string ""))))))

  (match source
    (('tar-source ('version 0) ('headers headers) _ ...)
     (for-each (lambda (sexp)
                 (let ((header (sexp->header sexp))
                       (data   (sexp->data sexp)))
                   (write-ustar-header header port)
                   (dump-port data port)
                   (close-port data)))
               headers))))
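
The metadata above records a `chksum' value for every header, and the
reader skips padding after each file body. The following is a minimal
sketch of both rules as the ustar format defines them, assuming only
core Guile; the names `ustar-checksum' and `padding-size' are
illustrative and are not part of the attached module. The field offset
used (148) follows from the field sizes listed in the header
definition above: 100 + 8 + 8 + 8 + 12 + 12.

```scheme
(use-modules (rnrs bytevectors))

(define %header-size 512)     ;a ustar header is one 512-byte block
(define %chksum-offset 148)   ;offset of the 8-byte 'chksum' field
(define %chksum-size 8)

(define (ustar-checksum header)
  "Return the checksum of HEADER, a 512-byte bytevector: the sum of all
its bytes, with the 'chksum' field itself counted as eight spaces."
  (let loop ((i 0) (sum 0))
    (cond ((= i %header-size) sum)
          ((and (>= i %chksum-offset)
                (< i (+ %chksum-offset %chksum-size)))
           (loop (+ i 1) (+ sum 32)))             ;32 = ASCII space
          (else
           (loop (+ i 1) (+ sum (bytevector-u8-ref header i)))))))

(define (padding-size size)
  "Return the number of NUL bytes that follow SIZE bytes of file data so
that the next header starts on a 512-byte boundary."
  (modulo (- size) 512))
```

For instance, the 531-byte `pipe2.m4' entry shown earlier would be
followed by (padding-size 531) bytes of padding, that is 493, so that
body plus padding occupy two full blocks. A reassembler that aims for
byte-identical output would presumably have to write this padding as
well as the trailing all-zeros blocks.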
* bug#42162: Recovering source tarballs
  2020-07-11 15:50 ` bug#42162: Recovering source tarballs Ludovic Courtès
@ 2020-07-13 19:20   ` Christopher Baines
  2020-07-20 21:27     ` zimoun
  2020-07-15 16:55   ` zimoun
  2020-07-30 17:36   ` Timothy Sample
  2 siblings, 1 reply; 55+ messages in thread
From: Christopher Baines @ 2020-07-13 19:20 UTC
  To: Ludovic Courtès; +Cc: 42162, Maurice Brémond

[-- Attachment #1: Type: text/plain, Size: 2120 bytes --]

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> There’s this other discussion you mentioned, which I hope will have a
>> positive outcome:
>>
>>   https://forge.softwareheritage.org/T2430
>
> This discussion as well as discussions on #swh-devel have made it clear
> that SWH will not archive raw tarballs, at least not in the foreseeable
> future. Instead, it will keep archiving the contents of tarballs, as it
> has always done—that’s already a huge service.
>
> Not storing raw tarballs makes sense from an engineering perspective,
> but it does mean that we cannot rely on SWH as a content-addressed
> mirror for tarballs. (In fact, some raw tarballs are available on SWH,
> but that’s mostly “by chance”, for instance because they appear as-is in
> a Git repo that was ingested.) In fact this is one of the challenges
> mentioned in
> <https://guix.gnu.org/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/>.
>
> So we need a solution for now (and quite urgently), and a solution for
> the future.
>
> For now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs. There’s no way around it.
>
> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time. Cuirass
> jobset? Mcron job to preserve GC roots? Ideas?

Going forward, being methodical as a project about storing the tarballs
and source material for the packages is probably the way to ensure it's
available for the future. I'm not sure the data storage cost is
significant; the cost of doing this is probably in working out what to
store, doing so in a redundant manner, and making the data available.

The Guix Data Service knows about fixed output derivations, so it might
be possible to backfill such a store by just attempting to build those
derivations. It might also be possible to use the Guix Data Service to
work out what's available, and what tarballs are missing.

Chris

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 962 bytes --]
* bug#42162: Recovering source tarballs
  2020-07-13 19:20 ` Christopher Baines
@ 2020-07-20 21:27   ` zimoun
  0 siblings, 0 replies; 55+ messages in thread
From: zimoun @ 2020-07-20 21:27 UTC
  To: Christopher Baines, Ludovic Courtès; +Cc: 42162, Maurice Brémond

Hi Chris,

On Mon, 13 Jul 2020 at 20:20, Christopher Baines <mail@cbaines.net> wrote:

> Going forward, being methodical as a project about storing the tarballs
> and source material for the packages is probably the way to ensure it's
> available for the future. I'm not sure the data storage cost is
> significant; the cost of doing this is probably in working out what to
> store, doing so in a redundant manner, and making the data available.

A really rough estimate is 120KB of metadata on average* per raw
tarball. So if we consider 14000 packages and 70% of them are
url-fetch, then it leads to 14k*0.7*120K = 1.2GB, which is not
significant. Moreover, if we extrapolate the numbers: between v1.0.0
and now, there are 23 commits per day modifying gnu/packages/, so
0.7*23*120K*365 = 700MB per year. However, the 120KB of metadata needed
to re-assemble the tarball has to be compared to the 712KB of the raw
compressed tarball; both figures are for the hello package.

*Based on the hello package; it depends on the number of files in the
tarball. The metadata is stored uncompressed, as a plain sexp.

Therefore, in addition to what to store, redundancy and availability,
one question is how to store it: Git repo? SQL database? etc.

> The Guix Data Service knows about fixed output derivations, so it might
> be possible to backfill such a store by just attempting to build those
> derivations. It might also be possible to use the Guix Data Service to
> work out what's available, and what tarballs are missing.

Missing from where? The substitutes farm or SWH?

Cheers,
simon
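
The back-of-the-envelope figures in this message can be reproduced
with exact arithmetic. This is a minimal sketch in plain Guile; the
variable names are illustrative, and the inputs are the rough numbers
quoted in the message (14000 packages, 70% using url-fetch, 120KB of
metadata per tarball, 23 commits per day), not measurements.

```scheme
(define packages 14000)          ;packages considered in the estimate
(define url-fetch-ratio 7/10)    ;share of packages using 'url-fetch'
(define metadata-kb 120)         ;metadata per tarball, from 'hello'
(define commits-per-day 23)      ;commits touching gnu/packages/

;; Metadata for all current url-fetch packages, in KB.
(define total-kb
  (* packages url-fetch-ratio metadata-kb))

;; Yearly growth, in KB, assuming every such commit adds one tarball.
(define yearly-kb
  (* url-fetch-ratio commits-per-day metadata-kb 365))

(format #t "total:  ~a KB~%" total-kb)
(format #t "yearly: ~a KB~%" yearly-kb)
```

The totals come to 1,176,000 KB and 705,180 KB, which is where the
rounded 1.2GB and 700MB per year come from. Using the exact rational
7/10 instead of 0.7 keeps the results as integers.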
* bug#42162: Recovering source tarballs
  2020-07-11 15:50 ` bug#42162: Recovering source tarballs Ludovic Courtès
  2020-07-13 19:20   ` Christopher Baines
@ 2020-07-15 16:55   ` zimoun
  2020-07-20  8:39     ` Ludovic Courtès
  2020-08-03 21:10     ` Ricardo Wurmus
  2020-07-30 17:36   ` Timothy Sample
  2 siblings, 2 replies; 55+ messages in thread
From: zimoun @ 2020-07-15 16:55 UTC
  To: Ludovic Courtès; +Cc: 42162, Maurice Brémond

Hi Ludo,

Well, you enlarge the discussion to more than the issue of the 5
url-fetch packages on gforge.inria.fr. :-)

First of all, you wrote [1] ``Migration away from tarballs is already
happening as more and more software is distributed straight from
content-addressed VCS repositories, though progress has been relatively
slow since we first discussed it in 2016.'' but on the other hand Guix
uses "url-fetch" more often than not [2], even if "git-fetch" is
available upstream. In other words, I am not convinced the migration is
really happening...

The issue would be mitigated if Guix transitioned from "url-fetch" to
"git-fetch" when possible.

1: https://forge.softwareheritage.org/T2430#45800
2: https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html

Second, trying to do some stats about the SWH coverage, I note that a
non-negligible number of "url-fetch" sources are reachable by
"lookup-content". Measuring the coverage is not straightforward because
of the 120 requests per hour rate limit and unexpected server errors.
Another story.

Well, I would like to have numbers because I do not know what the issue
is concretely: how many "url-fetch" packages are reachable? And if they
are unreachable, is it because they are not in SWH yet? Or is it
because Guix does not have enough info to look them up?

On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:

> For now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs. There’s no way around it.

Yes, but for example all the packages in gnu/packages/bioconductor.scm
could be "git-fetch". Today the source is over url-fetch but it could
be over git-fetch with https://git.bioconductor.org/packages/flowCore or
git@git.bioconductor.org:packages/flowCore.

Another example is the packages in gnu/packages/emacs-xyz.scm: the ones
from elpa.gnu.org are "url-fetch" and could be "git-fetch", for example
using
http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD

So I would be more reserved about the "no way around it". :-) I mean
the 70% could be a bit mitigated.

> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time. Cuirass
> jobset? Mcron job to preserve GC roots? Ideas?

Yes, preserving source tarballs for an indefinite amount of time will
help. At least all the packages where "lookup-content" returns #f,
which means they are not in SWH or they are unreachable -- both are
equivalent from the Guix side.

What about pushing to IPFS in addition? Feasible? Lookup issue?

> For the future, we could store nar hashes of unpacked tarballs instead
> of hashes over tarballs. But that raises two questions:
>
>   • If we no longer deal with tarballs but upstreams keep signing
>     tarballs (not raw directory hashes), how can we authenticate our
>     code after the fact?

Does Guix automatically authenticate code using signed tarballs?

>   • SWH internally stores Git-tree hashes, not nar hashes, so we still
>     wouldn’t be able to fetch our unpacked trees from SWH.
>
> (Both issues were previously discussed at
> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree

There are different issues at different levels:

 1. how to look up? what information do we need to keep/store to be
    able to query SWH?
 2. how to check the integrity? what information do we need to
    keep/store to be able to verify that SWH returns what Guix expects?
 3. how to authenticate? where does the tarball metadata have to be
    stored if SWH removes it?

Basically, the git-fetch source stores 3 identifiers:

 - upstream url
 - commit / tag
 - integrity (sha256)

Fetching from SWH requires the commit only (lookup-revision) or the
tag+url (lookup-origin-revision); then, from the returned revision, the
integrity of the downloaded data is checked using the sha256, right?

Therefore, one way to fix the lookup of url-fetch sources is to add an
extra field mimicking the commit role.

The easiest is to store a SWHID or an identifier allowing to deduce the
SWHID.

I have not checked the code, but something like this:

  https://pypi.org/project/swh.model/
  https://forge.softwareheritage.org/source/swh-model/

and at package time, this identifier is added, similarly to integrity.

Aside, does Guix use the authentication metadata that tarballs provide?

(
BTW, I failed [3,4] to package swh.model so if someone wants to give it
a try.

3: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00158.html
4: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00161.html
)

> After all, tarballs are byproducts and should be no exception: we should
> build them from source. :-)

[...]

> The code below can “disassemble” and “assemble” a tar. When it
> disassembles it, it generates metadata like this:

[...]

> The ’assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…

Where do you plan to store the "disassembled" metadata?
And where do you plan to "assemble-archive"?

I mean,

   What is pushed to SWH? And how?
   What is fetched from SWH? And how?

(Well, answer below. :-))

> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:

Yes, it is 120 requests per hour and 10 saves per hour. Well, I do not
think they will increase these numbers much in general. However, they
seem open to doing so for specific machines. So, I do not want to speak
for them, but we could ask for a higher rate limit for ci.guix.gnu.org,
for example. Then we need to distinguish between source substitutes and
binary substitutes. And basically, when a user runs "guix build foo",
if the source is not available upstream nor already on ci.guix.gnu.org,
then ci.guix.gnu.org fetches the missing sources from SWH and delivers
them to the user.

>   https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree—a Git-tree hash—since that would
> allow us to retrieve the whole thing in a single HTTP request.

Well, the limited resources of SWH are an issue, but SWH is not a
mirror but an archive. :-)

And as I wrote above, we could ask SWH to increase the rate limit for
specific machines such as ci.guix.gnu.org.

> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!). A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above. The nice thing is that the Git repo itself
> could be archived by SWH. :-)

How should this database that maps tarball hashes to metadata be
maintained? Git push hook? Cron task?

What about foreign channels? Should they maintain their own map?

To summarize, it would work like this, right?

at package time:
 - store an integrity identifier (today sha256-nix-base32)
 - disassemble the tarball
 - commit to another repo the metadata using the path (address)
   sha256/base32/<identifier>
 - push to packages-repo *and* metadata-database-repo

at future time: (upstream has disappeared, say!)
- use the integrity identifier to query the database repo - lookup the SWHID from the database repo - fetch the data from SWH - or lookup the IPFS identifier from the database repo and fetch the data from IPFS, for another example - re-assemble the tarball using the metadata from the database repo - check integrity, authentication, etc. Well, right it is better than only adding an identifier for looking up as I described above; because it is more general and flexible than only SWH as fall-back. The format of metadata (disassemble) that you propose is schemish (obviously! :-)) but we could propose something more JSON-like. All the best, simon ^ permalink raw reply [flat|nested] 55+ messages in thread
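On the idea above of storing "a SWHID or an identifier from which the SWHID can be deduced": for a single file's content, the SWHID is built from the Git blob hash of its bytes, so it can be computed offline. Here is a minimal Python sketch of that computation. It is not code from the thread and does not use swh.model; the helper name `content_swhid` is made up for illustration.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file content.

    The hash is the Git blob hash: SHA-1 over "blob <length>\\0" followed
    by the raw bytes.
    """
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# The identifier depends only on the bytes, not on any file name or URL.
hello_id = content_swhid(b"hello\n")
empty_id = content_swhid(b"")
print(hello_id)
print(empty_id)
```

Note that this identifies one blob of bytes (for instance the tarball file itself), not the unpacked tree, which is a separate kind of object in SWH.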
* bug#42162: Recovering source tarballs
From: Ludovic Courtès @ 2020-07-20 8:39 UTC
To: zimoun; +Cc: 42162, Maurice Brémond

Hi!

There are many many comments in your message, so I took the liberty to
reply only to the essence of it.  :-)

zimoun <zimon.toutoune@gmail.com> writes:

> On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> For now, since 70% of our packages use ‘url-fetch’, we need to be
>> able to fetch or to reconstruct tarballs.  There’s no way around it.
>
> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could be "git-fetch".  Today the source is over url-fetch but it could
> be over git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@git.bioconductor.org:packages/flowCore.
>
> Another example is the packages in gnu/packages/emacs-xyz.scm: the
> ones from elpa.gnu.org are "url-fetch" and could be "git-fetch", for
> example using
> http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD
>
> So I would be more reserved about the "no way around it". :-)  I mean
> the 70% could be a bit mitigated.

The “no way around it” was about the situation today: it’s a fact that
70% of packages are built from tarballs, so we need to be able to fetch
them or reconstruct them.

However, the two examples above are good ideas as to the way forward: we
could start a url-fetch-to-git-fetch migration in these two cases, and
perhaps more.

>> In the short term, we should arrange so that the build farm keeps GC
>> roots on source tarballs for an indefinite amount of time.  Cuirass
>> jobset?  Mcron job to preserve GC roots?  Ideas?
>
> Yes, preserving source tarballs for an indefinite amount of time will
> help.  At least all the packages where "lookup-content" returns #f,
> which means they are not in SWH or they are unreachable -- both are
> equivalent from the Guix side.
>
> What about pushing to IPFS in addition?  Feasible?  Lookup issue?

Lookup issue.  :-)  The hash in a CID is not just a raw blob hash.
Files are typically chunked beforehand, assembled as a Merkle tree, and
the CID is roughly the hash of the tree root.  So it would seem we can’t
use IPFS as-is for tarballs.

>> For the future, we could store nar hashes of unpacked tarballs instead
>> of hashes over tarballs.  But that raises two questions:
>>
>>   • If we no longer deal with tarballs but upstreams keep signing
>>     tarballs (not raw directory hashes), how can we authenticate our
>>     code after the fact?
>
> Does Guix automatically authenticate code using signed tarballs?

Not automatically; packagers are supposed to authenticate code when they
add a package (‘guix refresh -u’ does that automatically).

>>   • SWH internally stores Git-tree hashes, not nar hashes, so we still
>>     wouldn’t be able to fetch our unpacked trees from SWH.
>>
>> (Both issues were previously discussed at
>> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>>
>> So for the medium term, and perhaps for the future, a possible option
>> would be to preserve tarball metadata so we can reconstruct them:
>>
>>   tarball = metadata + tree
>
> There are different issues at different levels:
>
>  1. how to look up?  what information do we need to keep/store to be
>     able to query SWH?
>  2. how to check the integrity?  what information do we need to
>     keep/store to be able to verify that SWH returns what Guix expects?
>  3. how to authenticate?  where does the tarball metadata have to be
>     stored if SWH removes it?
>
> Basically, the git-fetch source stores 3 identifiers:
>
>  - upstream url
>  - commit / tag
>  - integrity (sha256)
>
> Fetching from SWH requires the commit only (lookup-revision) or the
> tag+url (lookup-origin-revision); then, from the returned revision, the
> integrity of the downloaded data is checked using the sha256, right?

Yes.

> Therefore, one way to fix lookup of the url-fetch source is to add an
> extra field mimicking the commit role.

But today, we store tarball hashes, not directory hashes.

> The easiest is to store a SWHID or an identifier from which the SWHID
> can be deduced.
>
> I have not checked the code, but something like this:
>
>   https://pypi.org/project/swh.model/
>   https://forge.softwareheritage.org/source/swh-model/
>
> and at package time, this identifier is added, similarly to integrity.

I’m skeptical about adding a field that is practically never used.

[...]

>> The code below can “disassemble” and “assemble” a tar.  When it
>> disassembles it, it generates metadata like this:
>
> [...]
>
>> The ’assemble-archive’ procedure consumes that, looks up file contents
>> by hash on SWH, and reconstructs the original tarball…
>
> Where do you plan to store the "disassembled" metadata?
> And where do you plan to "assemble-archive"?

We’d have a repo/database containing metadata indexed by tarball sha256.

> How should this database that maps tarball hashes to metadata be
> maintained?  Git push hook?  Cron task?

Yes, something like that.  :-)

> What about foreign channels?  Should they maintain their own map?

Yes, presumably.

> To summarize, it would work like this, right?
>
> at package time:
>  - store an integrity identifier (today sha256-nix-base32)
>  - disassemble the tarball
>  - commit the metadata to another repo using the path (address)
>    sha256/base32/<identifier>
>  - push to packages-repo *and* metadata-database-repo
>
> at future time: (upstream has disappeared, say!)
>  - use the integrity identifier to query the database repo
>  - look up the SWHID from the database repo
>  - fetch the data from SWH
>  - or look up the IPFS identifier from the database repo and fetch the
>    data from IPFS, for another example
>  - re-assemble the tarball using the metadata from the database repo
>  - check integrity, authentication, etc.

That’s the idea.

> The format of metadata (disassemble) that you propose is schemish
> (obviously! :-)) but we could propose something more JSON-like.

Sure, if that helps get other people on-board, why not (though sexps
have lived much longer than JSON and XML together :-)).

Thanks,
Ludo’.
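The "repo/database containing metadata indexed by tarball sha256" discussed above can be pictured as a plain directory tree with one file per tarball hash. The Python sketch below is only an illustration under stated assumptions: it uses hexadecimal digests instead of the nix-base32 encoding Guix uses, stores JSON rather than the s-expression format from the thread, and the function names `save_metadata` and `lookup_metadata` are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def save_metadata(db: Path, tarball: bytes, metadata: dict) -> str:
    """Store METADATA under sha256/<digest of TARBALL>; return the digest."""
    digest = hashlib.sha256(tarball).hexdigest()
    path = db / "sha256" / digest
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metadata, sort_keys=True))
    return digest


def lookup_metadata(db: Path, digest: str):
    """Return the metadata stored for DIGEST, or None if it is unknown."""
    path = db / "sha256" / digest
    return json.loads(path.read_text()) if path.exists() else None


with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp)
    # Placeholder bytes and a placeholder directory identifier.
    key = save_metadata(db, b"fake tarball bytes", {"swh-directory": "<id>"})
    found = lookup_metadata(db, key)
    missing = lookup_metadata(db, "0" * 64)

print(found, missing)
```

Because the directory is just files, it could be committed to a Git repository as suggested in the thread; how it gets updated (push hook, cron task) is left open here, as it is in the discussion.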
* bug#42162: Recovering source tarballs
From: zimoun @ 2020-07-20 15:52 UTC
To: Ludovic Courtès; +Cc: 42162, Maurice Brémond

Hi,

On Mon, 20 Jul 2020 at 10:39, Ludovic Courtès <ludo@gnu.org> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:
> > On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:

> There are many many comments in your message, so I took the liberty to
> reply only to the essence of it.  :-)

Many comments because many open topics. ;-)

> However, the two examples above are good ideas as to the way forward: we
> could start a url-fetch-to-git-fetch migration in these two cases, and
> perhaps more.

Well, to be honest, I tried to probe such a migration when I opened
this thread:

https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html

and I tried to summarize the pros/cons arguments here:

https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00448.html

> > What about pushing to IPFS in addition?  Feasible?  Lookup issue?
>
> Lookup issue.  :-)  The hash in a CID is not just a raw blob hash.
> Files are typically chunked beforehand, assembled as a Merkle tree, and
> the CID is roughly the hash of the tree root.  So it would seem we can’t
> use IPFS as-is for tarballs.

Using the Git-repo map/table, then it becomes an option, right?
Well, SWH would be a backend and IPFS could be another one.  Or any
"cloudy" storage system that could appear in the future, right?

> >>   • If we no longer deal with tarballs but upstreams keep signing
> >>     tarballs (not raw directory hashes), how can we authenticate our
> >>     code after the fact?
> >
> > Does Guix automatically authenticate code using signed tarballs?
>
> Not automatically; packagers are supposed to authenticate code when they
> add a package (‘guix refresh -u’ does that automatically).

So I miss the point of having this authentication information in the
future where upstream has disappeared.  The authentication is done at
packaging time.  So once it is done, merged into master and then pushed
to SWH, being able to authenticate again does not really matter.

And if it matters, all should be updated each time vulnerabilities are
discovered and so I am not sure SWH makes sense for this use-case.

> But today, we store tarball hashes, not directory hashes.

We store what "guix hash" returns. ;-)
So it is easy to migrate from tarball hashes to whatever else. :-)
I mean, it is "(sha256 (base32" and it is easy to have also
"(sha256-tree (base32" or something like that.

In the case where the integrity is also used as lookup key.

> > The format of metadata (disassemble) that you propose is schemish
> > (obviously! :-)) but we could propose something more JSON-like.
>
> Sure, if that helps get other people on-board, why not (though sexps
> have lived much longer than JSON and XML together :-)).

Lived much longer and still less less less used than JSON or XML alone. ;-)

I have not yet done clean back-of-the-envelope computations.  Roughly,
there are ~23 commits on average per day updating packages, so say 70%
of them are url-fetch, it is ~16 new tarballs per day, on average.
How will the model using a Git repo scale?  Because, naively, the
output of "disassemble-archive" in full text (pretty-print format) for
the hello-2.10.tar is 120KB and so 16*365*120K = ~700MB per year
without considering all the Git internals.  Obviously, it depends on
the number of files and I do not know if hello is a representative
example.

And I do not know how Git operates on binary files if the disassembled
tarball is stored as a .go file, or any other.

All the best,
simon

PS: Just in case someone wants to check where my estimates come from.

--8<---------------cut here---------------start------------->8---
for ci in $(git log --after=v1.0.0 --oneline \
            | grep "gnu:" | grep -E "(Add|Update)" \
            | cut -f1 -d' ')
do
  git --no-pager log -1 $ci --format="%cs"
done | uniq -c > /tmp/commits

guix environment --ad-hoc r-minimal \
     -- R -e 'summary(read.table("/tmp/commits"))'

gzip -dc < $(guix build -S hello) > /tmp/hello.tar

guix repl -L /tmp/tar/
scheme@(guix-user)> (call-with-input-file "hello.tar"
                      (lambda (port)
                        (disassemble-archive port)))
--8<---------------cut here---------------end--------------->8---
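The back-of-the-envelope estimate in the message above can be replayed in a few lines of Python. The inputs are the rough figures quoted in the message (23 package commits per day, a 70% url-fetch share, 120 KB of metadata per tarball, taken from the single hello-2.10 example); the variable names are only for illustration, and Git's own storage overhead or compression is ignored.

```python
COMMITS_PER_DAY = 23     # average package add/update commits per day
URL_FETCH_SHARE = 0.70   # share of packages using url-fetch
METADATA_KB = 120        # pretty-printed metadata for hello-2.10.tar

# 23 * 0.7 is about 16.1, rounded to 16 tarballs per day as in the message.
tarballs_per_day = round(COMMITS_PER_DAY * URL_FETCH_SHARE)

# 16 tarballs/day * 365 days * 120 KB, expressed in MB (1 MB = 1000 KB).
growth_mb_per_year = tarballs_per_day * 365 * METADATA_KB / 1000

print(tarballs_per_day, "tarballs/day,", growth_mb_per_year, "MB/year")
```

The result is sensitive to the 120 KB figure, which scales with the number of files in a tarball, so it should be read as an order of magnitude only.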
* bug#42162: Recovering source tarballs
From: Dr. Arne Babenhauserheide @ 2020-07-20 17:05 UTC
To: zimoun; +Cc: 42162, Maurice.Bremond

zimoun <zimon.toutoune@gmail.com> writes:

>> > The format of metadata (disassemble) that you propose is schemish
>> > (obviously! :-)) but we could propose something more JSON-like.
>>
>> Sure, if that helps get other people on-board, why not (though sexps
>> have lived much longer than JSON and XML together :-)).
>
> Lived much longer and still less less less used than JSON or XML alone. ;-)

Though this is likely not a function of the format, but of the
popularity of both Javascript and Java.

JSON isn’t a well defined format for arbitrary data (try to store
numbers as keys and reason about what you get as return-values), and XML
is a monster of complexity.

Best wishes,
Arne
-- 
Being apolitical means being political without noticing it
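The remark above about numbers as JSON keys is easy to reproduce. JSON object keys are always strings, so a mapping with integer keys does not survive a round trip unchanged. A small Python illustration using only the standard `json` module:

```python
import json

original = {1: "one", 2: "two"}

# json.dumps turns the integer keys into strings; json.loads cannot know
# they were integers, so it returns string keys.
encoded = json.dumps(original)
round_tripped = json.loads(encoded)

print(encoded)
print(round_tripped == original)
```

Any metadata format built on JSON would therefore need to keep numeric data in values rather than in keys, or document how keys are to be converted back.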
* bug#42162: Recovering source tarballs
From: zimoun @ 2020-07-20 19:59 UTC
To: Dr. Arne Babenhauserheide; +Cc: 42162, Maurice Brémond

On Mon, 20 Jul 2020 at 19:05, Dr. Arne Babenhauserheide <arne_bab@web.de> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:

> >> > The format of metadata (disassemble) that you propose is schemish
> >> > (obviously! :-)) but we could propose something more JSON-like.
> >>
> >> Sure, if that helps get other people on-board, why not (though sexps
> >> have lived much longer than JSON and XML together :-)).
> >
> > Lived much longer and still less less less used than JSON or XML alone. ;-)
>
> Though this is likely not a function of the format, but of the
> popularity of both Javascript and Java.

Well, the popularity matters to attract a broad audience and maybe get
other people on-board; if it is the aim.  It seems the de-facto format;
even if JSON has flaws.  And zillions of parsers for all the languages
are floating around, which is not the case for Sexp, even if it is
easier to parse.

And JSON is already used in Guix, see [1] for an example.

1: https://guix.gnu.org/manual/devel/en/guix.html#Additional-Build-Options

However, I am not convinced that JSON or similarly Sexp will scale
well for a Tarball Heritage perspective.

All the best,
simon
* bug#42162: Recovering source tarballs
From: Ludovic Courtès @ 2020-07-21 21:22 UTC
To: zimoun; +Cc: 42162, Maurice Brémond

Hi!

zimoun <zimon.toutoune@gmail.com> writes:

> On Mon, 20 Jul 2020 at 10:39, Ludovic Courtès <ludo@gnu.org> wrote:
>> zimoun <zimon.toutoune@gmail.com> writes:
>> > On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> There are many many comments in your message, so I took the liberty to
>> reply only to the essence of it.  :-)
>
> Many comments because many open topics. ;-)

Understood, and they’re very valuable but (1) I choose not to just do
email :-), and (2) I like to separate issues in reasonable chunks rather
than long threads addressing all the problems we’ll have to deal with.

I think it really helps keep things tractable!

>> Lookup issue.  :-)  The hash in a CID is not just a raw blob hash.
>> Files are typically chunked beforehand, assembled as a Merkle tree, and
>> the CID is roughly the hash of the tree root.  So it would seem we can’t
>> use IPFS as-is for tarballs.
>
> Using the Git-repo map/table, then it becomes an option, right?
> Well, SWH would be a backend and IPFS could be another one.  Or any
> "cloudy" storage system that could appear in the future, right?

Sure, why not.

>> >>   • If we no longer deal with tarballs but upstreams keep signing
>> >>     tarballs (not raw directory hashes), how can we authenticate our
>> >>     code after the fact?
>> >
>> > Does Guix automatically authenticate code using signed tarballs?
>>
>> Not automatically; packagers are supposed to authenticate code when they
>> add a package (‘guix refresh -u’ does that automatically).
>
> So I miss the point of having this authentication information in the
> future where upstream has disappeared.

What I meant above, is that often, what we have is things like detached
signatures of raw tarballs, or documents referring to a tarball hash:

  https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html

>> But today, we store tarball hashes, not directory hashes.
>
> We store what "guix hash" returns. ;-)
> So it is easy to migrate from tarball hashes to whatever else. :-)

True, but that other thing, as it stands, would be a nar hash (like for
‘git-fetch’), not a Git-tree hash (what SWH uses).

> I mean, it is "(sha256 (base32" and it is easy to have also
> "(sha256-tree (base32" or something like that.

Right, but that first and foremost requires daemon support.

It’s doable, but migration would have to take a long time, since this is
touching core parts of the “protocol”.

> I have not yet done clean back-of-the-envelope computations.  Roughly,
> there are ~23 commits on average per day updating packages, so say 70%
> of them are url-fetch, it is ~16 new tarballs per day, on average.
> How will the model using a Git repo scale?  Because, naively, the
> output of "disassemble-archive" in full text (pretty-print format) for
> the hello-2.10.tar is 120KB and so 16*365*120K = ~700MB per year
> without considering all the Git internals.  Obviously, it depends on
> the number of files and I do not know if hello is a representative
> example.

Interesting, thanks for making that calculation!  We could make the
format more compact if needed.

Thanks,
Ludo’.
* bug#42162: Recovering source tarballs
From: zimoun @ 2020-07-22 0:27 UTC
To: Ludovic Courtès; +Cc: 42162, Maurice Brémond

Hi!

On Tue, 21 Jul 2020 at 23:22, Ludovic Courtès <ludo@gnu.org> wrote:

>>> >>   • If we no longer deal with tarballs but upstreams keep signing
>>> >>     tarballs (not raw directory hashes), how can we authenticate our
>>> >>     code after the fact?
>>> >
>>> > Does Guix automatically authenticate code using signed tarballs?
>>>
>>> Not automatically; packagers are supposed to authenticate code when they
>>> add a package (‘guix refresh -u’ does that automatically).
>>
>> So I miss the point of having this authentication information in the
>> future where upstream has disappeared.
>
> What I meant above, is that often, what we have is things like detached
> signatures of raw tarballs, or documents referring to a tarball hash:
>
>   https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html

I still miss why it matters to store detached signatures of raw
tarballs.

The authentication is done now (at package time and/or inclusion in the
lookup table proposal).  I miss why we would have to re-authenticate
again later.

IMHO, having a lookup table that returns the signatures from a tarball
hash or an archive of all the OpenPGP keys ever published is another
topic.

>>> But today, we store tarball hashes, not directory hashes.
>>
>> We store what "guix hash" returns. ;-)
>> So it is easy to migrate from tarball hashes to whatever else. :-)
>
> True, but that other thing, as it stands, would be a nar hash (like for
> ‘git-fetch’), not a Git-tree hash (what SWH uses).

Ok, now I am totally convinced that a lookup table is The Right Thing™. :-)

>> I mean, it is "(sha256 (base32" and it is easy to have also
>> "(sha256-tree (base32" or something like that.
>
> Right, but that first and foremost requires daemon support.
>
> It’s doable, but migration would have to take a long time, since this is
> touching core parts of the “protocol”.

Doable but not necessarily tractable. :-)

>> I have not yet done clean back-of-the-envelope computations.  Roughly,
>> there are ~23 commits on average per day updating packages, so say 70%
>> of them are url-fetch, it is ~16 new tarballs per day, on average.
>> How will the model using a Git repo scale?  Because, naively, the
>> output of "disassemble-archive" in full text (pretty-print format) for
>> the hello-2.10.tar is 120KB and so 16*365*120K = ~700MB per year
>> without considering all the Git internals.  Obviously, it depends on
>> the number of files and I do not know if hello is a representative
>> example.
>
> Interesting, thanks for making that calculation!  We could make the
> format more compact if needed.

Compressing should help.

Considering 14000 packages, based on this 120KB estimation, it leads to:
0.7*14k*120K = ~1.2GB for the Git repo of the current Guix.

Cheers,
simon
* bug#42162: Recovering source tarballs
From: Ludovic Courtès @ 2020-07-22 10:28 UTC
To: zimoun; +Cc: 42162, Maurice Brémond

Hello!

zimoun <zimon.toutoune@gmail.com> writes:

> On Tue, 21 Jul 2020 at 23:22, Ludovic Courtès <ludo@gnu.org> wrote:
>
>>>> >>   • If we no longer deal with tarballs but upstreams keep signing
>>>> >>     tarballs (not raw directory hashes), how can we authenticate our
>>>> >>     code after the fact?
>>>> >
>>>> > Does Guix automatically authenticate code using signed tarballs?
>>>>
>>>> Not automatically; packagers are supposed to authenticate code when they
>>>> add a package (‘guix refresh -u’ does that automatically).
>>>
>>> So I miss the point of having this authentication information in the
>>> future where upstream has disappeared.
>>
>> What I meant above, is that often, what we have is things like detached
>> signatures of raw tarballs, or documents referring to a tarball hash:
>>
>>   https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html
>
> I still miss why it matters to store detached signature of raw tarballs.

I’m not saying we (Guix) should store signatures; I’m just saying that
developers typically sign raw tarballs.  It’s a general statement to
explain why storing or being able to reconstruct tarballs matters.

Thanks,
Ludo’.
* bug#42162: Recovering source tarballs
From: Ricardo Wurmus @ 2020-08-03 21:10 UTC
To: zimoun; +Cc: 42162

zimoun <zimon.toutoune@gmail.com> writes:

> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could be "git-fetch".  Today the source is over url-fetch but it could
> be over git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@git.bioconductor.org:packages/flowCore.

We should do that (and soon), especially because Bioconductor does not
keep an archive of old releases.  We can discuss this on a separate
issue lest we derail the discussion at hand.

-- 
Ricardo
* bug#42162: Recovering source tarballs
From: Timothy Sample @ 2020-07-30 17:36 UTC
To: Ludovic Courtès; +Cc: 42162, Maurice Brémond

Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Ludovic Courtès <ludo@gnu.org> writes:
>
> [...]
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree
>
> After all, tarballs are byproducts and should be no exception: we should
> build them from source.  :-)
>
> In <https://forge.softwareheritage.org/T2430>, Stefano mentioned
> pristine-tar, which does almost that, but not quite: it stores a binary
> delta between a tarball and a tree:
>
>   https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html
>
> I think we should have something more transparent than a binary delta.
>
> The code below can “disassemble” and “assemble” a tar.  When it
> disassembles it, it generates metadata like this:
>
>   (tar-source
>     (version 0)
>     (headers
>       (("guile-3.0.4/"
>         (mode 493)
>         (size 0)
>         (mtime 1593007723)
>         (chksum 3979)
>         (typeflag #\5))
>        ("guile-3.0.4/m4/"
>         (mode 493)
>         (size 0)
>         (mtime 1593007720)
>         (chksum 4184)
>         (typeflag #\5))
>        ("guile-3.0.4/m4/pipe2.m4"
>         (mode 420)
>         (size 531)
>         (mtime 1536050419)
>         (chksum 4812)
>         (hash (sha256
>                "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
>        ("guile-3.0.4/m4/time_h.m4"
>         (mode 420)
>         (size 5471)
>         (mtime 1536050419)
>         (chksum 4974)
>         (hash (sha256
>                "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
>        […]
>
> The ’assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…
>
> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:
>
>   https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree (a Git-tree hash), since that
> would allow us to retrieve the whole thing in a single HTTP request.
>
> Besides, we’ll also have to handle compression: storing gzip/xz headers
> and compression levels.

This jumped out at me because I have been working with compression and
tarballs for the bootstrapping effort.  I started pulling some threads
and doing some research, and ended up prototyping an end-to-end solution
for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata,
and an SWH directory ID.  It can even put them back together!  :)  There
are a bunch of problems still, but I think this project is doable in the
short-term.  I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and
found and fixed a bunch of little gaffes.  There’s a ton of work to do,
of course, but here’s another small step.

I call the thing “Disarchive” as in “disassemble a source code archive”.
You can find it at <https://git.ngyro.com/disarchive/>.  It has a simple
command-line interface so you can do

    $ disarchive save software-1.0.tar.gz

which serializes a disassembled version of “software-1.0.tar.gz” to the
database (which is just a directory) specified by the “DISARCHIVE_DB”
environment variable.  Next, you can run

    $ disarchive load hash-of-something-in-the-db

which will recover an original file from its metadata (stored in the
database) and data retrieved from the SWH archive or taken from a cache
(again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Now some implementation details.  The way I’ve set it up is that all of
the assembly happens through Guix.  Each step in recreating a compressed
tarball is a fixed-output derivation: the download from SWH, the
creation of the tarball, and the compression.  I wanted an easy way to
build and verify things according to a dependency graph without writing
any code.  Hi Guix Daemon!  I’m not sure if this is a good long-term
approach, though.  It could work well for reproducibility, but it might
be easier to let some external service drive my code as a Guix package.
Either way, it was an easy way to get started.

For disassembly, it takes a Gzip file (containing a single member) and
breaks it down like this:

    (gzip-member
      (version 0)
      (name "hungrycat-0.4.1.tar.gz")
      (input (sha256
               "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
      (header
        (mtime 0)
        (extra-flags 2)
        (os 3))
      (footer
        (crc 3863610951)
        (isize 194560))
      (compressor gnu-best)
      (digest
        (sha256
          "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))

The header and footer are read directly from the file.  Finding the
compressor is harder.  I followed the approach taken by the pristine-tar
project.  That is, try a bunch of compressors and hope for a match.
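The two steps just described (read the Gzip header and footer directly, then try compressors until one reproduces the file byte for byte) can be sketched in Python. This is not Disarchive's code: it only varies the zlib compression level, whereas the real tool tries distinct compressor implementations, and the names `gzip_fields`, `recompress` and `guess_level` are hypothetical. It also assumes a single-member file whose header has no optional fields (no stored file name or comment).

```python
import gzip
import io
import struct
import zlib


def gzip_fields(blob: bytes) -> dict:
    """Read the fixed 10-byte header and the 8-byte footer of a gzip file."""
    magic, method, flags, mtime, xfl, os_code = struct.unpack("<HBBIBB", blob[:10])
    if magic != 0x8B1F or method != 8:
        raise ValueError("not a gzip/deflate stream")
    crc, isize = struct.unpack("<II", blob[-8:])
    return {"flags": flags, "mtime": mtime, "extra-flags": xfl,
            "os": os_code, "crc": crc, "isize": isize}


def recompress(data: bytes, level: int, header: bytes) -> bytes:
    """Deflate DATA at LEVEL and wrap it with HEADER and a fresh footer."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = comp.compress(data) + comp.flush()
    footer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + body + footer


def guess_level(blob: bytes):
    """Return the first zlib level that reproduces BLOB exactly, else None."""
    if gzip_fields(blob)["flags"] != 0:
        return None  # optional header fields are not handled in this sketch
    data = gzip.decompress(blob)
    for level in range(1, 10):
        if recompress(data, level, blob[:10]) == blob:
            return level
    return None


# Build a sample archive in memory so the example is self-contained.
sample_data = b"Guix " * 2000
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=6, mtime=0) as gz:
    gz.write(sample_data)
blob = buf.getvalue()

fields = gzip_fields(blob)
level = guess_level(blob)
print(fields, level)
```

Several levels can produce identical output on small or repetitive input, so the level found is "a level that reproduces the file", not necessarily the one originally used; for reconstruction purposes that is sufficient.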
Currently, I have: • gnu-best • gnu-best-rsync • gnu • gnu-rsync • gnu-fast • gnu-fast-rsync • zlib-best • zlib • zlib-fast • zlib-best-perl • zlib-perl • zlib-fast-perl • gnu-best-rsync-1.4 • gnu-rsync-1.4 • gnu-fast-rsync-1.4 This list is inspired by pristine-tar. The first couple GNU compressors use modern Gzip from Guix. The zlib and rsync-1.4 ones use the Gzip and zlib wrapper from pristine-tar called “zgz”. The 100 Gzip files I looked at use “gnu”, “gnu-best”, “gnu-best-rsync-1.4”, “zlib”, “zlib-best”, and “zlib-fast-perl”. (As an aside, I had a way to decompose multi-member Gzip files, but it was much, much slower. Since I doubt they exist in the wild, I removed that code.) The “input” field likely points to a tarball, which looks like this: (tarball (version 0) (name "hungrycat-0.4.1.tar") (input (sha256 "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")) (default-header) (headers ((name "hungrycat-0.4.1/") (mode 493) (mtime 1513360022) (chksum 5058) (typeflag 53)) ((name "hungrycat-0.4.1/configure") (mode 493) (size 130263) (mtime 1513360022) (chksum 6043)) ...) (padding 3584) (digest (sha256 "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))) Originally, I used your code, but I ran into some problems. Namely, real tarballs are not well-behaved. I wrote new code to keep track of subtle things like the formatting of the octal values. Even though they are not well-behaved, they are usually self-consistent, so I introduced the “default-header” field to set default values for all headers. Any omitted fields in the headers use the value from the default header, and the default header takes defaults from a “default default header” defined in the code. Here’s a default header from a different tarball: (default-header (uid 1199) (gid 30) (magic "ustar ") (version " \x00") (uname "cagordon") (gname "lhea") (devmajor-format (width 0)) (devminor-format (width 0))) These default values are computed to minimize the noise in the serialized form. 
Here we see for example that each header should have UID 1199 unless otherwise specified. We also see that the device fields should be null strings instead of octal zeros. Another good example here is that the magic field has a space after “ustar”, which is not what modern POSIX says to do. My tarball reader has minimal support for extended headers, but they are not serialized cleanly (they survive the round-trip, but they are not human-readable). Finally, the “input” field here points to an “swh-directory” object. It looks like this: (swh-directory (version 0) (name "hungrycat-0.4.1") (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a") (digest (sha256 "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))) I have a little module for computing the directory hash like SWH does (which is in turn like what Git does). I did not verify that the 100 packages were in the SWH archive. I did verify a couple of packages, but I hit the rate limit and decided to avoid it for now. To avoid hitting the SWH archive at all, I introduced a directory cache so that I can store the directories locally. If the directory cache is available, directories are stored and retrieved from it. > How would we put that in practice? Good question. :-) > > I think we’d have to maintain a database that maps tarball hashes to > metadata (!). A simple version of it could be a Git repo where, say, > ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would > contain the metadata above. The nice thing is that the Git repo itself > could be archived by SWH. :-) You mean like <https://git.ngyro.com/disarchive-db/>? :) This was generated by a little script built on top of “fold-packages”. It downloads Gzip’d tarballs used by Guix packages and passes them on to Disarchive for disassembly. I limited the number to 100 because it’s slow and because I’m sure there is a long tail of weird software archives that are going to be hard to process.
The metadata directory ended up being 13M and the directory cache 2G. > Thus, if a tarball vanishes, we’d look it up in the database and > reconstruct it from its metadata plus content store in SWH. > > Thoughts? Obviously I like the idea. ;) Even with the code I have so far, I have a lot of questions. Mainly I’m worried about keeping everything working into the future. It would be easy to make incompatible changes. A lot of care would have to be taken. Of course, keeping a Guix commit and a Disarchive commit might be enough to make any assembling reproducible, but there’s a chicken-and-egg problem there. What if a tarball from the closure of one of the derivations is missing? I guess you could work around it, but it would be tricky. > Anyhow, we should team up with fellow NixOS and SWH hackers to address > this, and with developers of other distros as well—this problem is not > just that of the functional deployment geeks, is it? I could remove most of the Guix stuff so that it would be easy to package in Guix, Nix, Debian, etc. Then, someone™ could write a service that consumes a “sources.json” file, adds the sources to a Disarchive database, and pushes everything to a Git repo. I guess everyone who cares has to produce a “sources.json” file anyway, so it will be very little extra work. Other stuff like changing the serialization format to JSON would be pretty easy, too. I’m not well connected to these other projects, mind you, so I’m not really sure how to reach out. Sorry about the big mess of code and ideas – I realize I may have taken the “do-ocracy” approach a little far here. :) Even if this is not “the” solution, hopefully it’s useful for discussion! -- Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
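The “header” and “footer” parts of the “gzip-member” record in the message above sit at fixed offsets in a Gzip file, which is why they can be read directly. Here is a minimal Guile sketch of that step. It assumes the whole file is already in a bytevector, that it holds a single member, and that nothing follows the trailer. The name “gzip-metadata” is hypothetical and is not taken from Disarchive.

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical helper, not Disarchive's actual API.
;; Per RFC 1952, a Gzip member starts with a 10-byte fixed header
;; (ID1 ID2 CM FLG MTIME[4] XFL OS) and ends with an 8-byte trailer
;; (CRC32[4] ISIZE[4]).  Multi-byte fields are little-endian.
(define (gzip-metadata bv)
  "Return the header and footer fields of BV, a bytevector holding a
single-member Gzip file, as an S-expression."
  (unless (and (>= (bytevector-length bv) 18)
               (= (bytevector-u8-ref bv 0) #x1f)
               (= (bytevector-u8-ref bv 1) #x8b))
    (error "not a Gzip file"))
  (let ((end (bytevector-length bv)))
    `((header
       (mtime ,(bytevector-u32-ref bv 4 (endianness little)))
       (extra-flags ,(bytevector-u8-ref bv 8))
       (os ,(bytevector-u8-ref bv 9)))
      (footer
       (crc ,(bytevector-u32-ref bv (- end 8) (endianness little)))
       (isize ,(bytevector-u32-ref bv (- end 4) (endianness little)))))))
```

The compressor itself is not recorded anywhere in these fields, which is why it has to be found by trial, as the message describes.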
* bug#42162: Recovering source tarballs 2020-07-30 17:36 ` Timothy Sample @ 2020-07-31 14:41 ` Ludovic Courtès 2020-08-03 16:59 ` Timothy Sample 2020-08-26 10:04 ` bug#42162: Recovering source tarballs zimoun 1 sibling, 1 reply; 55+ messages in thread From: Ludovic Courtès @ 2020-07-31 14:41 UTC (permalink / raw) To: Timothy Sample; +Cc: 42162, Maurice Brémond Hi Timothy! Timothy Sample <samplet@ngyro.com> skribis: > This jumped out at me because I have been working with compression and > tarballs for the bootstrapping effort. I started pulling some threads > and doing some research, and ended up prototyping an end-to-end solution > for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata, > and an SWH directory ID. It can even put them back together! :) There > are a bunch of problems still, but I think this project is doable in the > short-term. I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and > found and fixed a bunch of little gaffes. There’s a ton of work to do, > of course, but here’s another small step. > > I call the thing “Disarchive” as in “disassemble a source code archive”. > You can find it at <https://git.ngyro.com/disarchive/>. It has a simple > command-line interface so you can do > > $ disarchive save software-1.0.tar.gz > > which serializes a disassembled version of “software-1.0.tar.gz” to the > database (which is just a directory) specified by the “DISARCHIVE_DB” > environment variable. Next, you can run > > $ disarchive load hash-of-something-in-the-db > > which will recover an original file from its metadata (stored in the > database) and data retrieved from the SWH archive or taken from a cache > (again, just a directory) specified by “DISARCHIVE_DIRCACHE”. Wooohoo! Is it that time of the year when people give presents to one another? I can’t believe it. :-) > Now some implementation details. The way I’ve set it up is that all of > the assembly happens through Guix. 
Each step in recreating a compressed > tarball is a fixed-output derivation: the download from SWH, the > creation of the tarball, and the compression. I wanted an easy way to > build and verify things according to a dependency graph without writing > any code. Hi Guix Daemon! I’m not sure if this is a good long-term > approach, though. It could work well for reproducibility, but it might > be easier to let some external service drive my code as a Guix package. > Either way, it was an easy way to get started. > > For disassembly, it takes a Gzip file (containing a single member) and > breaks it down like this: > > (gzip-member > (version 0) > (name "hungrycat-0.4.1.tar.gz") > (input (sha256 > "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")) > (header > (mtime 0) > (extra-flags 2) > (os 3)) > (footer > (crc 3863610951) > (isize 194560)) > (compressor gnu-best) > (digest > (sha256 > "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh"))) Awesome. > The header and footer are read directly from the file. Finding the > compressor is harder. I followed the approach taken by the pristine-tar > project. That is, try a bunch of compressors and hope for a match. > Currently, I have: > > • gnu-best > • gnu-best-rsync > • gnu > • gnu-rsync > • gnu-fast > • gnu-fast-rsync > • zlib-best > • zlib > • zlib-fast > • zlib-best-perl > • zlib-perl > • zlib-fast-perl > • gnu-best-rsync-1.4 > • gnu-rsync-1.4 > • gnu-fast-rsync-1.4 I would have used the integers that zlib supports, but I guess that doesn’t capture this whole gamut of compression setups. And yeah, it’s not great that we actually have to try and find the right compression levels, but there’s no way around it, it seems, and as you write, we can expect a couple of variants to be the most commonly used ones.
> The “input” field likely points to a tarball, which looks like this: > > (tarball > (version 0) > (name "hungrycat-0.4.1.tar") > (input (sha256 > "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")) > (default-header) > (headers > ((name "hungrycat-0.4.1/") > (mode 493) > (mtime 1513360022) > (chksum 5058) > (typeflag 53)) > ((name "hungrycat-0.4.1/configure") > (mode 493) > (size 130263) > (mtime 1513360022) > (chksum 6043)) > ...) > (padding 3584) > (digest > (sha256 > "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))) > > Originally, I used your code, but I ran into some problems. Namely, > real tarballs are not well-behaved. I wrote new code to keep track of > subtle things like the formatting of the octal values. Yeah I guess I was too optimistic. :-) I wanted to have the serialization/deserialization code automatically generated by that macro, but yeah, it doesn’t capture enough details for real-world tarballs. Do you know how frequently you get “weird” tarballs? I was thinking about having something that works for plain GNU tar, but it’s even better to have something that works with “unusual” tarballs! (BTW the code I posted or the one in Disarchive could perhaps replace the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’ procedure there, notably.) > Even though they are not well-behaved, they are usually > self-consistent, so I introduced the “default-header” field to set > default values for all headers. Any omitted fields in the headers use > the value from the default header, and the default header takes > defaults from a “default default header” defined in the code. Here’s > a default header from a different tarball: > > (default-header > (uid 1199) > (gid 30) > (magic "ustar ") > (version " \x00") > (uname "cagordon") > (gname "lhea") > (devmajor-format (width 0)) > (devminor-format (width 0))) Very nice. > Finally, the “input” field here points to an “swh-directory” object. 
It > looks like this: > > (swh-directory > (version 0) > (name "hungrycat-0.4.1") > (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a") > (digest > (sha256 > "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))) Yay! > I have a little module for computing the directory hash like SWH does > (which is in-turn like what Git does). I did not verify that the 100 > packages where in the SWH archive. I did verify a couple of packages, > but I hit the rate limit and decided to avoid it for now. > > To avoid hitting the SWH archive at all, I introduced a directory cache > so that I can store the directories locally. If the directory cache is > available, directories are stored and retrieved from it. I guess we can get back to them eventually to estimate our coverage ratio. >> I think we’d have to maintain a database that maps tarball hashes to >> metadata (!). A simple version of it could be a Git repo where, say, >> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would >> contain the metadata above. The nice thing is that the Git repo itself >> could be archived by SWH. :-) > > You mean like <https://git.ngyro.com/disarchive-db/>? :) Woow. :-) We could actually have a CI job to create the database: it would basically do ‘disarchive save’ for each tarball and store that using a layout like the one you used. Then we could have a job somewhere that periodically fetches that and adds it to the database. WDYT? I think we should leave room for other hash algorithms (in the sexps above too). > This was generated by a little script built on top of “fold-packages”. > It downloads Gzip’d tarballs used by Guix packages and passes them on to > Disarchive for disassembly. I limited the number to 100 because it’s > slow and because I’m sure there is a long tail of weird software > archives that are going to be hard to process. The metadata directory > ended up being 13M and the directory cache 2G. Neat. 
So it does mean that we could pretty much right away add a fall-back in (guix download) that looks up tarballs in your database and uses Disarchive to reconstruct it, right? I love solved problems. :-) Of course we could improve Disarchive and the database, but it seems to me that we already have enough to improve the situation. WDYT? > Even with the code I have so far, I have a lot of questions. Mainly I’m > worried about keeping everything working into the future. It would be > easy to make incompatible changes. A lot of care would have to be > taken. Of course, keeping a Guix commit and a Disarchive commit might > be enough to make any assembling reproducible, but there’s a > chicken-and-egg problem there. The way I see it, Guix would always look up tarballs in the HEAD of the database (no need to pick a specific commit). Worst that could happen is we reconstruct a tarball that doesn’t match, and so the daemon errors out. Regarding future-proofness, I think we must be super careful about the file formats (the sexps). You did pay attention to not having implicit defaults, which is perfect. Perhaps one thing to change (or perhaps it’s already there) is support for other hashes in those sexps: both hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git tree with different hash algorithm, IPFS CID, etc.). Also the ability to specify several hashes. That way we could “refresh” the database anytime by adding the hash du jour for already-present tarballs. > What if a tarball from the closure of one the derivations is missing? > I guess you could work around it, but it would be tricky. Well, more generally, we’ll have to monitor archive coverage. But I don’t think the issue is specific to this method. >> Anyhow, we should team up with fellow NixOS and SWH hackers to address >> this, and with developers of other distros as well—this problem is not >> just that of the functional deployment geeks, is it?
> > I could remove most of the Guix stuff so that it would be easy to > package in Guix, Nix, Debian, etc. Then, someone™ could write a service > that consumes a “sources.json” file, adds the sources to a Disarchive > database, and pushes everything to a Git repo. I guess everyone who > cares has to produce a “sources.json” file anyway, so it will be very > little extra work. Other stuff like changing the serialization format > to JSON would be pretty easy, too. I’m not well connected to these > other projects, mind you, so I’m not really sure how to reach out. If you feel like it, you’re welcome to point them to your work in the discussion at <https://forge.softwareheritage.org/T2430>. There’s one person from NixOS (lewo) participating in the discussion and I’m sure they’d be interested. Perhaps they’ll tell whether they care about having it available as JSON. > Sorry about the big mess of code and ideas – I realize I may have taken > the “do-ocracy” approach a little far here. :) Even if this is not > “the” solution, hopefully it’s useful for discussion! You did great! I had a very rough sketch and you did the real thing, that’s just awesome. :-) Thanks a lot! Ludo’. ^ permalink raw reply [flat|nested] 55+ messages in thread
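The “default-header” mechanism quoted in the message above is a three-level lookup: the header itself, then the tarball's default header, then defaults built into the code. The following is only a sketch of that lookup in Guile. The name “header-field” and the values in “%builtin-defaults” are assumptions made for illustration. Disarchive's real defaults may differ.

```scheme
;; Hypothetical stand-in for the "default default header" defined in
;; the code; these values are assumptions, not Disarchive's.
(define %builtin-defaults
  '((uid 0) (gid 0) (magic "ustar") (uname "") (gname "")
    (typeflag 48)))                     ;48 is ASCII "0", a regular file

(define (header-field name header default-header)
  "Return the value of field NAME, looking in HEADER first, then in
DEFAULT-HEADER, then in the built-in defaults."
  (let ((entry (or (assq name header)
                   (assq name default-header)
                   (assq name %builtin-defaults))))
    (if entry
        (cadr entry)
        (error "unknown header field:" name))))

;; A default header like the one quoted above, and a made-up entry.
(define example-default-header
  '((uid 1199) (gid 30) (magic "ustar ")
    (uname "cagordon") (gname "lhea")))

(define example-header
  '((name "example-1.0/README") (mode 493) (size 42)))

(header-field 'mode example-header example-default-header)     ;from the header
(header-field 'uid example-header example-default-header)      ;from the default header
(header-field 'typeflag example-header example-default-header) ;from the built-in defaults
```

Only the fields that differ from the defaults need to be written out per header, which is what keeps the serialized form small.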
* bug#42162: Recovering source tarballs 2020-07-31 14:41 ` Ludovic Courtès @ 2020-08-03 16:59 ` Timothy Sample 2020-08-05 17:14 ` Ludovic Courtès 0 siblings, 1 reply; 55+ messages in thread From: Timothy Sample @ 2020-08-03 16:59 UTC (permalink / raw) To: Ludovic Courtès; +Cc: 42162, Maurice Brémond Hi Ludovic, Ludovic Courtès <ludo@gnu.org> writes: > Wooohoo! Is it that time of the year when people give presents to one > another? I can’t believe it. :-) Not to be too cynical, but I think it’s just the time of year that I get frustrated with what I should be working on, and start fantasizing about green-field projects. :p > Timothy Sample <samplet@ngyro.com> skribis: > >> The header and footer are read directly from the file. Finding the >> compressor is harder. I followed the approach taken by the pristine-tar >> project. That is, try a bunch of compressors and hope for a match. >> Currently, I have: >> >> • gnu-best >> • gnu-best-rsync >> • gnu >> • gnu-rsync >> • gnu-fast >> • gnu-fast-rsync >> • zlib-best >> • zlib >> • zlib-fast >> • zlib-best-perl >> • zlib-perl >> • zlib-fast-perl >> • gnu-best-rsync-1.4 >> • gnu-rsync-1.4 >> • gnu-fast-rsync-1.4 > > I would have used the integers that zlib supports, but I guess that > doesn’t capture this whole gamut of compression setups. And yeah, it’s > not great that we actually have to try and find the right compression > levels, but there’s no way around it it seems, and as you write, we can > expect a couple of variants to be the most commonly used ones. My first instinct was “this is impossible – a DEFLATE compressor can do just about whatever it wants!” Then I looked at pristine-tar and realized that their hack probably works pretty well. If I had infinite time, I would think about some kind of fully general, parameterized LZ77 algorithm that could describe any implementation. If I had a lot of time I would peel back the curtain on Gzip and zlib and expose their tuning parameters. 
That would be nicer, but keep in mind we will have to cover XZ, bzip2, and ZIP, too! There’s a bit of balance between quality and coverage. Any improvement to the representation of the compression algorithm could be implemented easily: just replace the names with their improved representation. One thing pristine-tar does is reorder the compressor list based on the input metadata. A Gzip member usually stores its compression level, so it makes sense to try everything at that level first before moving on. >> Originally, I used your code, but I ran into some problems. Namely, >> real tarballs are not well-behaved. I wrote new code to keep track of >> subtle things like the formatting of the octal values. > > Yeah I guess I was too optimistic. :-) I wanted to have the > serialization/deserialization code automatically generated by that > macro, but yeah, it doesn’t capture enough details for real-world > tarballs. I enjoyed your implementation! I might even bring back its style. It was a little stiff for trying to figure out exactly what I needed for reproducing the tarballs. > Do you know how frequently you get “weird” tarballs? I was thinking > about having something that works for plain GNU tar, but it’s even > better to have something that works with “unusual” tarballs! I don’t have hard numbers, but I would say that a good handful (5–10%) have “X-format” fields, meaning their octal formatting is unusual. (I’m looking at “grep -A 10 default-header” over all the S-Exp files.) The most charming thing is the “uname” and “gname” fields. For example, “rtmidi-4.0.0” was made by “gary” from “staff”. :) > (BTW the code I posted or the one in Disarchive could perhaps replace > the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’ > procedure there, notably.) I really like “fold-archive”. One of the reasons I started doing this is to possibly share code with Gash-Utils.
It’s not as easy as I was hoping, but I’m planning on improving things there based on my experience here. I’ve now worked with four Scheme tar implementations, maybe if I write a really good one I could cap that number at five! >> To avoid hitting the SWH archive at all, I introduced a directory cache >> so that I can store the directories locally. If the directory cache is >> available, directories are stored and retrieved from it. > > I guess we can get back to them eventually to estimate our coverage ratio. It would be nice to know, but pretty hard to find out with the rate limit. I guess it will improve immensely when we set up a “sources.json” file. >> You mean like <https://git.ngyro.com/disarchive-db/>? :) > > Woow. :-) > > We could actually have a CI job to create the database: it would > basically do ‘disarchive save’ for each tarball and store that using a > layout like the one you used. Then we could have a job somewhere that > periodically fetches that and adds it to the database. WDYT? Maybe.... I assume that Disarchive would fail for a few of them. We would need a plan for monitoring those failures so that Disarchive can be improved. Also, unless I’m misunderstanding something, this means building the whole database at every commit, no? That would take a lot of time and space. On the other hand, it would be easy enough to try. If it works, it’s a lot easier than setting up a whole other service. > I think we should leave room for other hash algorithms (in the sexps > above too). It works for different hash algorithms, but not for different directory hashing methods (like you mention below). >> This was generated by a little script built on top of “fold-packages”. >> It downloads Gzip’d tarballs used by Guix packages and passes them on to >> Disarchive for disassembly. I limited the number to 100 because it’s >> slow and because I’m sure there is a long tail of weird software >> archives that are going to be hard to process. 
The metadata directory >> ended up being 13M and the directory cache 2G. > > Neat. > > So it does mean that we could pretty much right away add a fall-back in > (guix download) that looks up tarballs in your database and uses > Disarchive to recontruct it, right? I love solved problems. :-) > > Of course we could improve Disarchive and the database, but it seems to > me that we already have enough to improve the situation. WDYT? I would say that we are darn close! In theory it would work. It would be much more practical if we had better coverage in the SWH archive (i.e., “sources.json”) and a way to get metadata for a source archive without downloading the entire Disarchive database. It’s 13M now, but it will likely be 500M with all the Gzip’d tarballs from a recent commit of Guix. It will only grow after that, too. Of course those are not hard blockers, so ‘(guix download)’ could start using Disarchive as soon as we package it. I’ve started looking into it, but I’m confused about getting access to Disarchive from the “out-of-band” download system. Would it have to become a dependency of Guix? >> Even with the code I have so far, I have a lot of questions. Mainly I’m >> worried about keeping everything working into the future. It would be >> easy to make incompatible changes. A lot of care would have to be >> taken. Of course, keeping a Guix commit and a Disarchive commit might >> be enough to make any assembling reproducible, but there’s a >> chicken-and-egg problem there. > > The way I see it, Guix would always look up tarballs in the HEAD of the > database (no need to pick a specific commit). Worst that could happen > is we reconstruct a tarball that doesn’t match, and so the daemon errors > out. I was imagining an escape hatch beyond this, where one could look up a provenance record from when Disarchive ingested and verified a source code archive.
The provenance record would tell you which version of Guix was used when saving the archive, so you could try your luck with using “guix time-machine” to reproduce Disarchive’s original computation. If we perform database migrations, you would need to travel back in time in the database, too. The idea is that you could work around breakages in Disarchive automatically using the Power of Guix™. Just a stray thought, really. > Regarding future-proofness, I think we must be super careful about the > file formats (the sexps). You did pay attention to not having implicit > defaults, which is perfect. Perhaps one thing to change (or perhaps > it’s already there) is support for other hashes in those sexps: both > hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git > tree with different hash algorithm, IPFS CID, etc.). Also the ability > to specify several hashes. > > That way we could “refresh” the database anytime by adding the hash du > jour for already-present tarballs. The hash algorithm is already configurable, but the directory hash method is not. You’re right that it should be, and that there should be support for multiple digests. >> What if a tarball from the closure of one the derivations is missing? >> I guess you could work around it, but it would be tricky. > > Well, more generally, we’ll have to monitor archive coverage. But I > don’t think the issue is specific to this method. Again, I’m thinking about the case where I want to travel back in time to reproduce a Disarchive computation. It’s really an unlikely scenario, I’m just trying to think of everything that could go wrong. >>> Anyhow, we should team up with fellow NixOS and SWH hackers to address >>> this, and with developers of other distros as well—this problem is not >>> just that of the functional deployment geeks, is it? >> >> I could remove most of the Guix stuff so that it would be easy to >> package in Guix, Nix, Debian, etc. 
Then, someone™ could write a service >> that consumes a “sources.json” file, adds the sources to a Disarchive >> database, and pushes everything to a Git repo. I guess everyone who >> cares has to produce a “sources.json” file anyway, so it will be very >> little extra work. Other stuff like changing the serialization format >> to JSON would be pretty easy, too. I’m not well connected to these >> other projects, mind you, so I’m not really sure how to reach out. > > If you feel like it, you’re welcome to point them to your work in the > discussion at <https://forge.softwareheritage.org/T2430>. There’s one > person from NixOS (lewo) participating in the discussion and I’m sure > they’d be interested. Perhaps they’ll tell whether they care about > having it available as JSON. Good idea. I will work out a few more kinks and then bring it up there. I’ve already rewritten the parts that used the Guix daemon. Disarchive now only needs a handful of Guix modules ('base32', 'serialization', and 'swh' are the ones that would be hard to remove). >> Sorry about the big mess of code and ideas – I realize I may have taken >> the “do-ocracy” approach a little far here. :) Even if this is not >> “the” solution, hopefully it’s useful for discussion! > > You did great! I had a very rough sketch and you did the real thing, > that’s just awesome. :-) > > Thanks a lot! My pleasure! Thanks for the feedback so far. -- Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
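The message above mentions reordering the compressor list using the Gzip metadata, so that candidates at the recorded level are tried first. A rough Guile sketch of that idea follows. It relies on RFC 1952, where an “extra-flags” value of 2 means maximum compression and 4 means fastest. Two things are assumptions made here: that a compressor's level can be read off its name (“-best”, “-fast”, otherwise default), and the procedure names themselves, which are hypothetical. This is not how pristine-tar or Disarchive necessarily does it.

```scheme
(use-modules (srfi srfi-1))

;; The candidate list from the thread, in its original order.
(define %compressors
  '(gnu-best gnu-best-rsync gnu gnu-rsync gnu-fast gnu-fast-rsync
    zlib-best zlib zlib-fast zlib-best-perl zlib-perl zlib-fast-perl
    gnu-best-rsync-1.4 gnu-rsync-1.4 gnu-fast-rsync-1.4))

;; Assumption: the level is encoded in the compressor's name.
(define (compressor-level name)
  "Return 'best, 'fast, or 'default for the compressor called NAME."
  (let ((str (symbol->string name)))
    (cond ((string-contains str "-best") 'best)
          ((string-contains str "-fast") 'fast)
          (else 'default))))

;; RFC 1952: XFL = 2 is maximum compression, XFL = 4 is fastest.
(define (extra-flags->level extra-flags)
  (case extra-flags
    ((2) 'best)
    ((4) 'fast)
    (else 'default)))

(define (order-compressors extra-flags)
  "Return %compressors with the ones matching EXTRA-FLAGS moved to the
front.  The relative order within each group is preserved."
  (let ((level (extra-flags->level extra-flags)))
    (call-with-values
        (lambda ()
          (partition (lambda (name)
                       (eq? (compressor-level name) level))
                     %compressors))
      append)))
```

For the “hungrycat” example, whose header has (extra-flags 2), this puts “gnu-best” first, which is the compressor that was eventually recorded. No candidate is dropped, so a misleading flag only costs time.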
* bug#42162: Recovering source tarballs 2020-08-03 16:59 ` Timothy Sample @ 2020-08-05 17:14 ` Ludovic Courtès 2020-08-05 18:57 ` Timothy Sample 0 siblings, 1 reply; 55+ messages in thread From: Ludovic Courtès @ 2020-08-05 17:14 UTC (permalink / raw) To: Timothy Sample; +Cc: 42162, Maurice Brémond Hello! Timothy Sample <samplet@ngyro.com> skribis: > Ludovic Courtès <ludo@gnu.org> writes: > >> Wooohoo! Is it that time of the year when people give presents to one >> another? I can’t believe it. :-) > > Not to be too cynical, but I think it’s just the time of year that I get > frustrated with what I should be working on, and start fantasizing about > green-field projects. :p :-) >> Timothy Sample <samplet@ngyro.com> skribis: >> >>> The header and footer are read directly from the file. Finding the >>> compressor is harder. I followed the approach taken by the pristine-tar >>> project. That is, try a bunch of compressors and hope for a match. >>> Currently, I have: >>> >>> • gnu-best >>> • gnu-best-rsync >>> • gnu >>> • gnu-rsync >>> • gnu-fast >>> • gnu-fast-rsync >>> • zlib-best >>> • zlib >>> • zlib-fast >>> • zlib-best-perl >>> • zlib-perl >>> • zlib-fast-perl >>> • gnu-best-rsync-1.4 >>> • gnu-rsync-1.4 >>> • gnu-fast-rsync-1.4 >> >> I would have used the integers that zlib supports, but I guess that >> doesn’t capture this whole gamut of compression setups. And yeah, it’s >> not great that we actually have to try and find the right compression >> levels, but there’s no way around it it seems, and as you write, we can >> expect a couple of variants to be the most commonly used ones. > > My first instinct was “this is impossible – a DEFLATE compressor can do > just about whatever it wants!” Then I looked at pristine-tar and > realized that their hack probably works pretty well. If I had infinite > time, I would think about some kind of fully general, parameterized LZ77 > algorithm that could describe any implementation. 
If I had a lot of > time I would peel back the curtain on Gzip and zlib and expose their > tuning parameters. That would be nicer, but keep in mind we will have > to cover XZ, bzip2, and ZIP, too! There’s a bit of balance between > quality and coverage. Any improvement to the representation of the > compression algorithm could be implemented easily: just replace the > names with their improved representation. Yup, it makes sense to not spend too much time on this bit. I guess we’d already have good coverage with gzip and xz. >> (BTW the code I posted or the one in Disarchive could perhaps replace >> the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’ >> procedure there, notably.) > > I really like “fold-archive”. One of the reasons I started doing this > is to possibly share code with Gash-Utils. It’s not as easy as I was > hoping, but I’m planning on improving things there based on my > experience here. I’ve now worked with four Scheme tar implementations, > maybe if I write a really good one I could cap that number at five! Heh. :-) The needs are different anyway. In Gash-Utils the focus is probably on simplicity/maintainability, whereas here you really want to cover all the details of the wire representation. >>> To avoid hitting the SWH archive at all, I introduced a directory cache >>> so that I can store the directories locally. If the directory cache is >>> available, directories are stored and retrieved from it. >> >> I guess we can get back to them eventually to estimate our coverage ratio. > > It would be nice to know, but pretty hard to find out with the rate > limit. I guess it will improve immensely when we set up a > “sources.json” file. Note that we have <https://guix.gnu.org/sources.json>. Last I checked, SWH was ingesting it in its “qualification” instance, so it should be ingesting it for good real soon if it’s not doing it already. >>> You mean like <https://git.ngyro.com/disarchive-db/>? :) >> >> Woow. 
:-) >> >> We could actually have a CI job to create the database: it would >> basically do ‘disarchive save’ for each tarball and store that using a >> layout like the one you used. Then we could have a job somewhere that >> periodically fetches that and adds it to the database. WDYT? > > Maybe.... I assume that Disarchive would fail for a few of them. We > would need a plan for monitoring those failures so that Disarchive can > be improved. Also, unless I’m misunderstanding something, this means > building the whole database at every commit, no? That would take a lot > of time and space. On the other hand, it would be easy enough to try. > If it works, it’s a lot easier than setting up a whole other service. One can easily write a procedure that takes a tarball and returns a <computed-file> that builds its database entry. So at each commit, we’d just rebuild things that have changed. >> I think we should leave room for other hash algorithms (in the sexps >> above too). > > It works for different hash algorithms, but not for different directory > hashing methods (like you mention below). OK. [...] >> So it does mean that we could pretty much right away add a fall-back in >> (guix download) that looks up tarballs in your database and uses >> Disarchive to recontruct it, right? I love solved problems. :-) >> >> Of course we could improve Disarchive and the database, but it seems to >> me that we already have enough to improve the situation. WDYT? > > I would say that we are darn close! In theory it would work. It would > be much more practical if we had better coverage in the SWH archive > (i.e., “sources.json”) and a way to get metadata for a source archive > without downloading the entire Disarchive database. It’s 13M now, but > it will likely be 500M with all the Gzip’d tarballs from a recent commit > of Guix. It will only grow after that, too. 
If we expose the database over HTTP (like over cgit), we can arrange so that (guix download) simply GETs db.example.org/sha256/xyz. No need to fetch the whole database. It might be more reasonable to have a real database and a real service around it, I’m sure Chris Baines would agree ;-), but we can choose URLs that could easily be implemented by a “real” service instead of cgit in the future. > Of course those are not hard blockers, so ‘(guix download)’ could start > using Disarchive as soon as we package it. I’ve started looking into > it, but I’m confused about getting access to Disarchive from the > “out-of-band” download system. Would it have to become a dependency of > Guix? Yes. It could be a behind-the-scenes dependency of “builtin:download”; it doesn’t have to be a dependency of each and every fixed-output derivation. > I was imagining an escape hatch beyond this, where one could look up a > provenance record from when Disarchive ingested and verified a source > code archive. The provenance record would tell you which version of > Guix was used when saving the archive, so you could try your luck with > using “guix time-machine” to reproduce Disarchive’s original > computation. If we perform database migrations, you would need to > travel back in time in the database, too. The idea is that you could > work around breakages in Disarchive automatically using the Power of > Guix™. Just a stray thought, really. Seems to me it Shouldn’t Be Necessary? :-) I mean, as long as the format is extensible and “future-proof”, we’ll always be able to rebuild tarballs and then re-disassemble them if we need to compute new hashes or whatever. >> If you feel like it, you’re welcome to point them to your work in the >> discussion at <https://forge.softwareheritage.org/T2430>. There’s one >> person from NixOS (lewo) participating in the discussion and I’m sure >> they’d be interested. Perhaps they’ll tell whether they care about >> having it available as JSON. > > Good idea. 
I will work out a few more kinks and then bring it up there. > I’ve already rewritten the parts that used the Guix daemon. Disarchive > now only needs a handful of Guix modules ('base32', 'serialization', and > 'swh' are the ones that would be hard to remove). An option would be to use (gcrypt base64); another one would be to bundle (guix base32). I was thinking that it might be best to not use Guix for computations. For example, have “disarchive save” not build derivations and instead do everything “here and now”. That would make it easier for others to adopt. Wait, looking at the Git history, it looks like you already addressed that point, neat. :-) Thank you! Ludo’. ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: Recovering source tarballs 2020-08-05 17:14 ` Ludovic Courtès @ 2020-08-05 18:57 ` Timothy Sample 2020-08-23 16:21 ` Ludovic Courtès 2020-11-03 14:26 ` Ludovic Courtès 0 siblings, 2 replies; 55+ messages in thread From: Timothy Sample @ 2020-08-05 18:57 UTC (permalink / raw) To: Ludovic Courtès; +Cc: 42162, Maurice Brémond Hey, Ludovic Courtès <ludo@gnu.org> writes: > Note that we have <https://guix.gnu.org/sources.json>. Last I checked, > SWH was ingesting it in its “qualification” instance, so it should be > ingesting it for good real soon if it’s not doing it already. Oh fantastic! I was going to volunteer to do it, so that’s one thing off my list. > One can easily write a procedure that takes a tarball and returns a > <computed-file> that builds its database entry. So at each commit, we’d > just rebuild things that have changed. That makes more sense. I will give this a shot soon. > If we expose the database over HTTP (like over cgit), we can arrange so > that (guix download) simply GETs db.example.org/sha256/xyz. No need to > fetch the whole database. > > It might be more reasonable to have a real database and a real service > around it, I’m sure Chris Baines would agree ;-), but we can choose URLs > that could easily be implemented by a “real” service instead of cgit in > the future. I got it working over cgit shortly after sending my last message. :) So far, I am very much on team “good enough for now”. > Timothy Sample <samplet@ngyro.com> skribis: > >> I was imagining an escape hatch beyond this, where one could look up a >> provenance record from when Disarchive ingested and verified a source >> code archive. The provenance record would tell you which version of >> Guix was used when saving the archive, so you could try your luck with >> using “guix time-machine” to reproduce Disarchive’s original >> computation. If we perform database migrations, you would need to >> travel back in time in the database, too. 
The idea is that you could >> work around breakages in Disarchive automatically using the Power of >> Guix™. Just a stray thought, really. > > Seems to me it Shouldn’t Be Necessary? :-) > > I mean, as long as the format is extensible and “future-proof”, we’ll > always be able to rebuild tarballs and then re-disassemble them if we > need to compute new hashes or whatever. If Disarchive relies on external compressors, there’s an outside chance that those compressors could change under our feet. In that case, one would want to be able to track down exactly which version of XZ was used when Disarchive verified that it could reassemble a given source archive. Maybe I’m being paranoid, but if the database entries are being computed by the CI infrastructure it would be pretty easy to note the Guix commit just in case. > I was thinking that it might be best to not use Guix for computations. > For example, have “disarchive save” not build derivations and instead do > everything “here and now”. That would make it easier for others to > adopt. Wait, looking at the Git history, it looks like you already > addressed that point, neat. :-) Since my last message I managed to remove Guix as dependency completely. Right now it loads ‘(guix swh)’ opportunistically, but I might just copy the code in. Directory references now support multiple “addresses” so that you could have Nix-style, SWH-style, IPFS-style, etc. Hopefully my next message will have a WIP patch enabling Guix to use Disarchive! -- Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
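The per-hash lookup discussed above (a plain GET on db.example.org/sha256/xyz) could be sketched roughly as follows. This is only an illustration: the helper name, the "https" scheme, and the idea of keying the path on the algorithm name are assumptions, not an existing Guix or Disarchive interface. The sample hash is the one used earlier in the thread.

```scheme
;; Hypothetical helper, for illustration only.  The host is the placeholder
;; used in the thread; the "https" scheme is an assumption.
(define %db-url "https://db.example.org")

(define (disarchive-db-url algorithm hash)
  "Return the URL of the database entry for HASH, a string, computed with
ALGORITHM, a symbol such as 'sha256."
  (string-append %db-url "/" (symbol->string algorithm) "/" hash))

;; A client would then fetch just this one entry instead of the whole
;; database.
(display (disarchive-db-url
          'sha256
          "0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk"))
(newline)
```

Because the URL is a pure function of the hash, the same layout could be served by cgit today and by a dedicated service later without changing clients.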
* bug#42162: Recovering source tarballs 2020-08-05 18:57 ` Timothy Sample @ 2020-08-23 16:21 ` Ludovic Courtès 2020-11-03 14:26 ` Ludovic Courtès 1 sibling, 0 replies; 55+ messages in thread From: Ludovic Courtès @ 2020-08-23 16:21 UTC (permalink / raw) To: Timothy Sample; +Cc: 42162, Maurice Brémond Hello! Timothy Sample <samplet@ngyro.com> skribis: >> If we expose the database over HTTP (like over cgit), we can arrange so >> that (guix download) simply GETs db.example.org/sha256/xyz. No need to >> fetch the whole database. >> >> It might be more reasonable to have a real database and a real service >> around it, I’m sure Chris Baines would agree ;-), but we can choose URLs >> that could easily be implemented by a “real” service instead of cgit in >> the future. > > I got it working over cgit shortly after sending my last message. :) So > far, I am very much on team “good enough for now”. Wonderful. :-) >> Timothy Sample <samplet@ngyro.com> skribis: >> >>> I was imagining an escape hatch beyond this, where one could look up a >>> provenance record from when Disarchive ingested and verified a source >>> code archive. The provenance record would tell you which version of >>> Guix was used when saving the archive, so you could try your luck with >>> using “guix time-machine” to reproduce Disarchive’s original >>> computation. If we perform database migrations, you would need to >>> travel back in time in the database, too. The idea is that you could >>> work around breakages in Disarchive automatically using the Power of >>> Guix™. Just a stray thought, really. >> >> Seems to me it Shouldn’t Be Necessary? :-) >> >> I mean, as long as the format is extensible and “future-proof”, we’ll >> always be able to rebuild tarballs and then re-disassemble them if we >> need to compute new hashes or whatever. > > If Disarchive relies on external compressors, there’s an outside chance > that those compressors could change under our feet. 
In that case, one > would want to be able to track down exactly which version of XZ was used > when Disarchive verified that it could reassemble a given source > archive. Oh, true. Gzip and bzip2 are more-or-less “set in stone”, but xz, lzip, or zstd could change. Recording the exact version of the implementation would be a good stopgap. > Maybe I’m being paranoid, but if the database entries are being > computed by the CI infrastructure it would be pretty easy to note the > Guix commit just in case. Yeah, that makes sense. At least we could have “notes” in the file format to store that kind of info. Using CI is also a good idea. >> I was thinking that it might be best to not use Guix for computations. >> For example, have “disarchive save” not build derivations and instead do >> everything “here and now”. That would make it easier for others to >> adopt. Wait, looking at the Git history, it looks like you already >> addressed that point, neat. :-) > > Since my last message I managed to remove Guix as dependency completely. > Right now it loads ‘(guix swh)’ opportunistically, but I might just copy > the code in. Directory references now support multiple “addresses” so > that you could have Nix-style, SWH-style, IPFS-style, etc. Hopefully my > next message will have a WIP patch enabling Guix to use Disarchive! Neat, looking forward to it! Thank you, Ludo’. ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: Recovering source tarballs 2020-08-05 18:57 ` Timothy Sample 2020-08-23 16:21 ` Ludovic Courtès @ 2020-11-03 14:26 ` Ludovic Courtès 2020-11-03 16:37 ` zimoun 2020-11-03 19:20 ` Timothy Sample 1 sibling, 2 replies; 55+ messages in thread From: Ludovic Courtès @ 2020-11-03 14:26 UTC (permalink / raw) To: Timothy Sample; +Cc: 42162 Hi Timothy, I hope you’re well. I was wondering if you’ve had the chance to fiddle with Disarchive since the summer? I’m thinking there are small steps we could take to move forward: 1. Have a Disarchive package in Guix (and one for guile-quickcheck, kudos on that one!). 2. Have a Cuirass job running on ci.guix.gnu.org to build and publish the disarchive-db. 3. Integrate Disarchive in (guix download) to reconstruct tarballs. WDYT? Thanks, Ludo’, who’s still very much excited about these perspectives! ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: Recovering source tarballs 2020-11-03 14:26 ` Ludovic Courtès @ 2020-11-03 16:37 ` zimoun 2020-11-03 19:20 ` Timothy Sample 1 sibling, 0 replies; 55+ messages in thread From: zimoun @ 2020-11-03 16:37 UTC (permalink / raw) To: Ludovic Courtès; +Cc: 42162 Hi, On Tue, 3 Nov 2020 at 15:26, Ludovic Courtès <ludo@gnu.org> wrote: > 2. Have a Cuirass job running on ci.guix.gnu.org to build and publish > the disarchive-db. One question is: how does the database scale? And only the real world can show it. ;-) > Ludo’, who’s still very much excited about these perspectives! Sounds awesome! On my side, I asked twice on #swh-devel if it is possible to set up a higher rate limit for one specific machine. I have in mind one machine located at my place (Univ. Paris, ex Paris 7 Diderot) because of proximity and because I want to generate (via a script) a report on how much of Guix is in SWH. Whatever! Instead, we could ask for Berlin or for one machine of INRIA Bordeaux, maybe the machine running guix.gnu.org or the one running hpc.guix.info. WDYT? BTW, unrelated to tarballs, and something I have not worked on much (running out of time): I would like to integrate hg-fetch and svn-fetch with SWH, first in "guix lint -c archival" and then in sources.json. The save part does not seem hard, but the lookup needs some experiments with the SWH API. The big picture is to have all the ingestion of the Guix packages done via the automatically generated sources.json file and not via occasional runs of "guix lint -c archival" (which should still be recommended for custom channels). All the best, simon ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: Recovering source tarballs 2020-11-03 14:26 ` Ludovic Courtès 2020-11-03 16:37 ` zimoun @ 2020-11-03 19:20 ` Timothy Sample 2020-11-04 16:49 ` Ludovic Courtès 1 sibling, 1 reply; 55+ messages in thread From: Timothy Sample @ 2020-11-03 19:20 UTC (permalink / raw) To: Ludovic Courtès; +Cc: 42162 Hi Ludo, Ludovic Courtès <ludo@gnu.org> writes: > I hope you’re well. I was wondering if you’ve had the chance to fiddle > with Disarchive since the summer? Sort of! I managed to get the entire corpus of tarballs that I started with to work (about 4000 archives). After that, I started writing some documentation. The goal there was to be more careful with serialization format. Starting to think clearly about the format and how to ensure long-term compatibility gave me a bit of vertigo, so I took a break. :) I was kind of hoping the initial excitement at SWH would push the project along, but that seems to have died down (for now). Going back to making sure it works for Guix is probably the best way to develop it until I hear more from SWH. > I’m thinking there are small steps we could take to move forward: > > 1. Have a Disarchive package in Guix (and one for guile-quickcheck, > kudos on that one!). This will be easy. The hang-up I had earlier was that I vendored the pristine-tar Gzip utility (“zgz”). Since then I don’t think it’s such a big deal. (I wrote Guile-QuickCheck ages ago! It was rotting away on my disk because I couldn’t figure out a good way to use it with, say, Gash. It has exposed several Disarchive bugs already.) > 2. Have a Cuirass job running on ci.guix.gnu.org to build and publish > the disarchive-db. I’m interested in running Cuirass locally for other reasons, so I should have a good test environment to figure this out. To be honest, I’ve had trouble figuring out Cuirass in the past, so I was dragging my feet a bit. > 3. Integrate Disarchive in (guix download) to reconstruct tarballs. I had a very simple patch that did this! 
It was less exciting when it sounded like SWH was going to use Disarchive directly. However, like I wrote, making Disarchive work for Guix is probably the best way to make it work for SWH if they want it in the future. > WDYT? This all will have to wait in the queue for a bit longer, but I should be able to return to it soon. I think the steps listed above are good, along with some changes I want to make to Disarchive itself. --Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: Recovering source tarballs 2020-11-03 19:20 ` Timothy Sample @ 2020-11-04 16:49 ` Ludovic Courtès 2022-09-29 0:32 ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Maxim Cournoyer 0 siblings, 1 reply; 55+ messages in thread From: Ludovic Courtès @ 2020-11-04 16:49 UTC (permalink / raw) To: Timothy Sample; +Cc: 42162 Hello! Timothy Sample <samplet@ngyro.com> skribis: > Ludovic Courtès <ludo@gnu.org> writes: > >> I hope you’re well. I was wondering if you’ve had the chance to fiddle >> with Disarchive since the summer? > > Sort of! I managed to get the entire corpus of tarballs that I started > with to work (about 4000 archives). After that, I started writing some > documentation. The goal there was to be more careful with serialization > format. Starting to think clearly about the format and how to ensure > long-term compatibility gave me a bit of vertigo, so I took a break. :) > > I was kind of hoping the initial excitement at SWH would push the > project along, but that seems to have died down (for now). Going back > to making sure it works for Guix is probably the best way to develop it > until I hear more from SWH. Yeah, I suppose they have enough on their plate and won’t add it to their agenda until we have shown that it works for us. >> I’m thinking there are small steps we could take to move forward: >> >> 1. Have a Disarchive package in Guix (and one for guile-quickcheck, >> kudos on that one!). > > This will be easy. The hang-up I had earlier was that I vendored the > pristine-tar Gzip utility (“zgz”). Since then I don’t think it’s such a > big deal. Yeah. > (I wrote Guile-QuickCheck ages ago! It was rotting away on my disk > because I couldn’t figure out a good way to use it with, say, Gash. It > has exposed several Disarchive bugs already.) Neat! I’m sure many of us would love to use it. :-) > This all will have to wait in the queue for a bit longer, but I should > be able to return to it soon. 
I think the steps listed above are good, > along with some changes I want to make to Disarchive itself. Alright! Let us know if you think there are tasks that people should just pick and work on in the meantime. Thanks for the prompt reply! Ludo’. ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2020-11-04 16:49 ` Ludovic Courtès @ 2022-09-29 0:32 ` Maxim Cournoyer 2022-09-29 10:56 ` zimoun 0 siblings, 1 reply; 55+ messages in thread From: Maxim Cournoyer @ 2022-09-29 0:32 UTC (permalink / raw) To: Ludovic Courtès; +Cc: 42162, Timothy Sample, zimoun Hi, Can this issue be closed? Otherwise, what remains to be acted upon? Thanks, Maxim ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2022-09-29 0:32 ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Maxim Cournoyer @ 2022-09-29 10:56 ` zimoun 2022-09-29 15:00 ` Ludovic Courtès 0 siblings, 1 reply; 55+ messages in thread From: zimoun @ 2022-09-29 10:56 UTC (permalink / raw) To: Maxim Cournoyer, Ludovic Courtès; +Cc: 42162, Timothy Sample Hi Maxim, On Wed, 28 Sep 2022 at 20:32, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote: > Can this issue be closed? This “meta” bug raises 2 levels of issues for long-term: 1. save the source code, 2. save the current binary substitutes. For #1, we now have a roadmap to tackle this via Disarchive or sources.json and SWH [1,2]. Therefore, this point of the “meta” bug can be closed. However, about #2, we do not have a roadmap, AFAIK. For instance «Substitute retention» [3] is still an issue. The recent outage of Berlin exemplifies the potential problems. (Note that because of energy troubles in Europe, I would not rule out a short blackout or power outage affecting the Berlin server.) Why keep binary substitutes? For instance, it is not possible to rebuild some packages on modern CPUs; see OpenBLAS [4]. Therefore, pending a better solution, it seems pragmatic to me to keep as many substitutes as we are able to. We do not have a clear policy between Berlin and Bordeaux. 1: <https://forge.softwareheritage.org/T3781> 2: <https://forge.softwareheritage.org/T4538> 3: <https://issues.guix.gnu.org/issue/42162#42> 4: <https://yhetil.org/guix/86o83oywza.fsf@gmail.com> > Otherwise, what remains to be acted upon? The next action for this #2 is to address what I sent on guix-sysadmin about duplicating the storage of binary substitutes on a machine offered by INRAe / Univ. of Montpellier. And we need to draw up a roadmap to tackle this #2. Then, this “meta” bug could be closed, IMHO. Cheers, simon ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2022-09-29 10:56 ` zimoun @ 2022-09-29 15:00 ` Ludovic Courtès 2022-09-30 3:10 ` Maxim Cournoyer 0 siblings, 1 reply; 55+ messages in thread From: Ludovic Courtès @ 2022-09-29 15:00 UTC (permalink / raw) To: zimoun; +Cc: 42162, Timothy Sample, Maxim Cournoyer Hi, zimoun <zimon.toutoune@gmail.com> skribis: > This “meta” bug raises 2 levels of issues for long-term: > > 1. save the source code, > 2. save the current binary substitutes. Maybe we can close this bug and open an issue for each of these, or discuss them on guix-devel until there are actionable items that come out of it? Or better yet: we could file more specific bugs such as “Disarchive lacks bzip2 support” or “SWH integration does not support Subversion”. Thoughts? Ludo’. ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2022-09-29 15:00 ` Ludovic Courtès @ 2022-09-30 3:10 ` Maxim Cournoyer 2022-09-30 12:13 ` zimoun 2022-09-30 18:17 ` Maxime Devos 0 siblings, 2 replies; 55+ messages in thread From: Maxim Cournoyer @ 2022-09-30 3:10 UTC (permalink / raw) To: Ludovic Courtès; +Cc: 42162-done, Timothy Sample, zimoun Hi, Ludovic Courtès <ludo@gnu.org> writes: > Hi, > > zimoun <zimon.toutoune@gmail.com> skribis: > >> This “meta” bug raises 2 levels of issues for long-term: >> >> 1. save the source code, >> 2. save the current binary substitutes. > > Maybe we can close this bug and open an issue for each of these, or > discuss them on guix-devel until there are actionable items that come > out of it? > > Or better yet: we could file more specific bugs such as “Disarchive > lacks bzip2 support” or “SWH integration does not support Subversion”. > > Thoughts? Wholly agreed, this thread is already too long and the original problem was fixed. Closing. Thanks, -- Maxim ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2022-09-30 3:10 ` Maxim Cournoyer @ 2022-09-30 12:13 ` zimoun 2022-10-01 22:04 ` Ludovic Courtès 2022-10-03 15:20 ` Maxim Cournoyer 2022-09-30 18:17 ` Maxime Devos 1 sibling, 2 replies; 55+ messages in thread From: zimoun @ 2022-09-30 12:13 UTC (permalink / raw) To: Maxim Cournoyer, Ludovic Courtès; +Cc: 42162-done, Timothy Sample Hi, On Thu, 29 Sep 2022 at 23:10, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote: > Wholly agreed, this thread is already too long and the original problem > was fixed. Closing. I disagree, the original problem is not fixed; as I explained. Well, since you consider it is, please also close the related patch#43442. Cheers, simon ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2022-09-30 12:13 ` zimoun @ 2022-10-01 22:04 ` Ludovic Courtès 2022-10-03 15:20 ` Maxim Cournoyer 1 sibling, 0 replies; 55+ messages in thread From: Ludovic Courtès @ 2022-10-01 22:04 UTC (permalink / raw) To: zimoun; +Cc: 42162-done, Timothy Sample, Maxim Cournoyer Hi, zimoun <zimon.toutoune@gmail.com> skribis: > On Thu, 29 Sep 2022 at 23:10, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote: > >> Wholly agreed, this thread is already too long and the original problem >> was fixed. Closing. > > I disagree, the original problem is not fixed; as I explained. What would you think of opening specific issues as I proposed, including (I forgot that one) an issue on substitute preservation? I was thinking that this could make it clearer what actions remain to be taken. WDYT? Thanks, Ludo’. ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2022-09-30 12:13 ` zimoun 2022-10-01 22:04 ` Ludovic Courtès @ 2022-10-03 15:20 ` Maxim Cournoyer 2022-10-04 21:26 ` Ludovic Courtès 1 sibling, 1 reply; 55+ messages in thread From: Maxim Cournoyer @ 2022-10-03 15:20 UTC (permalink / raw) To: zimoun; +Cc: 42162-done, Timothy Sample, Ludovic Courtès Hi, zimoun <zimon.toutoune@gmail.com> writes: > Hi, > > On Thu, 29 Sep 2022 at 23:10, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote: > >> Wholly agreed, this thread is already too long and the original problem >> was fixed. Closing. > > I disagree, the original problem is not fixed; as I explained. Well, > since you consider it is, please also close the related patch#43442. Oh, from Ludovic's answer, I thought that the original issue had drifted to tangential problems; apologies for drawing the wrong conclusion. [time passes] I've now migrated the remaining packages off gforge.inria.fr; see the commits ending with 06201b76e5ef811245d627706c90117a0e9813d4. I tried to update ocaml-dose3, but aborted that effort, since it'd require packaging ocaml-parmap:
--8<---------------cut here---------------start------------->8---
modified gnu/packages/ocaml.scm
@@ -646,7 +646,7 @@ (define-public ocaml-mccs
 (define-public ocaml-dose3
   (package
     (name "ocaml-dose3")
-    (version "5.0.1")
+    (version "7.0.0")
     (source (origin
               (method git-fetch)
               (uri (git-reference
@@ -655,29 +655,19 @@ (define-public ocaml-dose3
               (file-name (git-file-name name version))
               (sha256
                (base32
-                "0dxkw37gj8z45kd0dnrlfgpj8yycq0dphs8kjm9kvq9xc8rikxp3"))
-              (patches
-               (search-patches
-                "ocaml-dose3-add-unix-dependency.patch"
-                "ocaml-dose3-Fix-for-ocaml-4.06.patch"
-                "ocaml-dose3-dont-make-printconf.patch"
-                "ocaml-dose3-Install-mli-cmx-etc.patch"))))
-    (build-system ocaml-build-system)
-    (arguments
-     `(#:tests? #f                      ;the test suite requires python 2
-       #:configure-flags
-       ,#~(list (string-append "SHELL="
-                               #+(file-append (canonical-package bash-minimal)
-                                              "/bin/sh")))
-       #:make-flags
-       ,#~(list (string-append "LIBDIR=" #$output "/lib/ocaml/site-lib"))))
-    (propagated-inputs
-     (list ocaml-graph ocaml-cudf ocaml-extlib ocaml-re))
+                "0hcjh68svicap7j9bghgkp49xa12qhxa1pygmrgc9qwm0m4dhirb"))))
+    (build-system dune-build-system)
     (native-inputs
      (list perl
+           python-wrapper
            ocaml-extlib
            ocamlbuild
            ocaml-cppo))
+    (propagated-inputs
+     (list ocaml-cudf
+           ocaml-extlib
+           ocaml-graph
+           ocaml-re))
     (home-page "https://www.mancoosi.org/software/")
     (synopsis "Package distribution management framework")
     (description "Dose3 is a framework made of several OCaml libraries for
--8<---------------cut here---------------end--------------->8---
Thanks, -- Maxim ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2022-10-03 15:20 ` Maxim Cournoyer @ 2022-10-04 21:26 ` Ludovic Courtès 0 siblings, 0 replies; 55+ messages in thread From: Ludovic Courtès @ 2022-10-04 21:26 UTC (permalink / raw) To: Maxim Cournoyer; +Cc: 42162-done, Timothy Sample, zimoun Hello, Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis: > zimoun <zimon.toutoune@gmail.com> writes: > >> Hi, >> >> On Thu, 29 Sep 2022 at 23:10, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote: >> >>> Wholly agreed, this thread is already too long and the original problem >>> was fixed. Closing. >> >> I disagree, the original problem is not fixed; as I explained. Well, >> since you consider it is, please also close the related patch#43442. > > Oh, from Ludovic's answer, I thought that the original issue had drifted > to tangential problems; apologies for drawing the wrong conclusion. Sorry for the misleading comment. > I've now migrated the remaining packages off gforge.inria.fr; see the > commits ending with 06201b76e5ef811245d627706c90117a0e9813d4. Woow, you rock!! (Would be nice to check Disarchive coverage of the previously-used tarballs, to see how good we’re doing.) > I tried to update ocaml-dose3, but aborted that effort, since it'd > require packaging ocaml-parmap: Something for the OCaml team! :-) Thanks, Ludo’. ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2022-09-30 3:10 ` Maxim Cournoyer 2022-09-30 12:13 ` zimoun @ 2022-09-30 18:17 ` Maxime Devos 1 sibling, 0 replies; 55+ messages in thread From: Maxime Devos @ 2022-09-30 18:17 UTC (permalink / raw) To: 42162, maxim.cournoyer, ludovic.courtes On 30-09-2022 05:10, Maxim Cournoyer wrote: > Hi, > > Ludovic Courtès <ludo@gnu.org> writes: > >> Hi, >> >> zimoun <zimon.toutoune@gmail.com> skribis: >> >>> This “meta” bug raises 2 levels of issues for long-term: >>> >>> 1. save the source code, >>> 2. save the current binary substitutes. >> >> Maybe we can close this bug and open an issue for each of these, or >> discuss them on guix-devel until there are actionable items that come >> out of it? >> >> Or better yet: we could file more specific bugs such as “Disarchive >> lacks bzip2 support” or “SWH integration does not support Subversion”. >> >> Thoughts? > > Wholly agreed, this thread is already too long and the original problem > was fixed. Closing. I don't follow, our gf2x package (and a few others) still download from gforge.inria.fr? Which to me seems to be what zimoun is referring to (in the response)? Greetings, Maxime. ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: Recovering source tarballs 2020-07-30 17:36 ` Timothy Sample 2020-07-31 14:41 ` Ludovic Courtès @ 2020-08-26 10:04 ` zimoun 2020-08-26 21:11 ` Timothy Sample 1 sibling, 1 reply; 55+ messages in thread From: zimoun @ 2020-08-26 10:04 UTC (permalink / raw) To: Timothy Sample, Ludovic Courtès; +Cc: 42162, Maurice Brémond Dear Timothy, On Thu, 30 Jul 2020 at 13:36, Timothy Sample <samplet@ngyro.com> wrote: > I call the thing “Disarchive” as in “disassemble a source code archive”. > You can find it at <https://git.ngyro.com/disarchive/>. It has a simple > command-line interface so you can do > > $ disarchive save software-1.0.tar.gz > > which serializes a disassembled version of “software-1.0.tar.gz” to the > database (which is just a directory) specified by the “DISARCHIVE_DB” > environment variable. Next, you can run > > $ disarchive load hash-of-something-in-the-db > > which will recover an original file from its metadata (stored in the > database) and data retrieved from the SWH archive or taken from a cache > (again, just a directory) specified by “DISARCHIVE_DIRCACHE”. Really nice! Thank you! >> I think we’d have to maintain a database that maps tarball hashes to >> metadata (!). A simple version of it could be a Git repo where, say, >> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would >> contain the metadata above. The nice thing is that the Git repo itself >> could be archived by SWH. :-) > > You mean like <https://git.ngyro.com/disarchive-db/>? :) [...] > This was generated by a little script built on top of “fold-packages”. > It downloads Gzip’d tarballs used by Guix packages and passes them on to > Disarchive for disassembly. I limited the number to 100 because it’s > slow and because I’m sure there is a long tail of weird software > archives that are going to be hard to process. The metadata directory > ended up being 13M and the directory cache 2G. One question is how this database scales? 
For example, a quick back-of-the-envelope estimate leads to ~1.2GB of metadata for ~14k packages and then an increase of ~700MB per year, both computed with Ludo’s code [1]. [1] <http://issues.guix.gnu.org/issue/42162#11> > I could remove most of the Guix stuff so that it would be easy to > package in Guix, Nix, Debian, etc. Then, someone™ could write a service > that consumes a “sources.json” file, adds the sources to a Disarchive > database, and pushes everything to a Git repo. I guess everyone who > cares has to produce a “sources.json” file anyway, so it will be very > little extra work. Other stuff like changing the serialization format > to JSON would be pretty easy, too. I’m not well connected to these > other projects, mind you, so I’m not really sure how to reach out. This service could be really useful. Yes, it could be easy to update the database each time Guix produces a new “sources.json”. As mentioned [2], should this service be part of SWH (download cooking task)? Or live on the project side? [2] <https://forge.softwareheritage.org/T2430#47486> Thank you again for this piece of work. All the best, simon ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: Recovering source tarballs 2020-08-26 10:04 ` bug#42162: Recovering source tarballs zimoun @ 2020-08-26 21:11 ` Timothy Sample 2020-08-27 9:41 ` zimoun 0 siblings, 1 reply; 55+ messages in thread From: Timothy Sample @ 2020-08-26 21:11 UTC (permalink / raw) To: zimoun; +Cc: 42162, Maurice Brémond Hi zimoun, zimoun <zimon.toutoune@gmail.com> writes: > One question is how this database scales? > > For example, a quick back-of-the-envelope estimate leads to ~1.2GB of metadata > for ~14k packages and then an increase of ~700MB per year, both computed > with Ludo’s code [1]. > > [1] <http://issues.guix.gnu.org/issue/42162#11> It’s a good question. A good part of the size comes from the representation rather than the data. Compression helps a lot here. I have a database of 3,912 packages. It’s 295M uncompressed (which is a little better than your estimation). If I pass each file through Lzip, it shrinks down to 60M. That’s more like 15.5K per package, which is almost an order of magnitude smaller than the estimation you used (120K). I think that makes the numbers rather pleasant, but it comes at the expense of easy storage in Git. > As mentioned [2], should this service be part of SWH (download cooking > task)? Or project side? > > [2] <https://forge.softwareheritage.org/T2430#47486> It would be interesting to just have SWH absorb the project. Since other distros already know how to produce a “sources.json” and how to query the SWH archive, it would mean that they benefit for free (and so would Guix, for that matter). I’m open to that, but right now having the freedom to experiment is important. -- Tim ^ permalink raw reply [flat|nested] 55+ messages in thread
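The per-package figures in the message above can be re-derived with a few lines of arithmetic. A minimal sketch, assuming that “M” means MiB and “K” means KiB (the message does not say); the procedure names are made up for the example, and the 14k package count is the one used in the earlier estimate.

```scheme
;; Figures quoted in the thread: 3,912 packages, 295M uncompressed,
;; 60M once each file is passed through Lzip.
;; Assumption: "M" is MiB and "K" is KiB.
(define %packages 3912)

(define (kib-per-package mib)
  "Return the average entry size in KiB for a database of MIB mebibytes."
  (/ (* mib 1024.0) %packages))

(define (projected-mib package-count kib)
  "Return the size in MiB of PACKAGE-COUNT entries of KIB kibibytes each."
  (/ (* package-count kib) 1024.0))

(define uncompressed (kib-per-package 295))   ;roughly 77 KiB per package
(define compressed (kib-per-package 60))      ;roughly 15.7 KiB per package

(format #t "uncompressed: ~a KiB/package~%" uncompressed)
(format #t "compressed:   ~a KiB/package~%" compressed)

;; Extrapolate the compressed figure to ~14k packages.
(format #t "14k packages, compressed: ~a MiB~%"
        (projected-mib 14000 compressed))
```

Under these assumptions the compressed average lands close to the 15.5K quoted above, and 14k packages would need on the order of 200 MiB rather than ~1.2GB.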
* bug#42162: Recovering source tarballs 2020-08-26 21:11 ` Timothy Sample @ 2020-08-27 9:41 ` zimoun 2020-08-27 12:49 ` Ludovic Courtès 2020-08-27 18:06 ` Bengt Richter 0 siblings, 2 replies; 55+ messages in thread From: zimoun @ 2020-08-27 9:41 UTC (permalink / raw) To: Timothy Sample; +Cc: 42162, Maurice Brémond Hi, On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote: > zimoun <zimon.toutoune@gmail.com> writes: > >> One question is how this database scales? >> >> For example, a quick back-of-the-envelope estimate leads to ~1.2GB of metadata >> for ~14k packages and then an increase of ~700MB per year, both computed >> with Ludo’s code [1]. >> >> [1] <http://issues.guix.gnu.org/issue/42162#11> > > It’s a good question. A good part of the size comes from the > representation rather than the data. Compression helps a lot here. I > have a database of 3,912 packages. It’s 295M uncompressed (which is a > little better than your estimation). If I pass each file through Lzip, > it shrinks down to 60M. That’s more like 15.5K per package, which is > almost an order of magnitude smaller than the estimation you used > (120K). I think that makes the numbers rather pleasant, but it comes at > the expense of easy storage in Git. Thank you for these numbers. Really interesting! First, I do not know if the database needs to be stored with Git. What would be the advantage? (naive question :-)) On SWH T2430 [1], you explain the “default-header” trick to cut down the size. Nice! Moreover, the format is a long list, e.g.,
--8<---------------cut here---------------start------------->8---
(headers
 ((name "raptor2-2.0.15/")
  (mode 493)
  (mtime 1414909500)
  (chksum 4225)
  (typeflag 53))
 ((name "raptor2-2.0.15/build/")
  (mode 493)
  (mtime 1414909497)
  (chksum 4797)
  (typeflag 53))
 ((name "raptor2-2.0.15/build/ltversion.m4")
  (size 690)
  (mtime 1414908273)
  (chksum 5958))
 […])
--8<---------------cut here---------------end--------------->8---
which is human-readable. Is it useful?
Instead, one could imagine shorter keywords:
 ((na "raptor2-2.0.15/")
  (mo 493)
  (mt 1414909500)
  (ch 4225)
  (ty 53))
which, using your database (commit fc50927), reduces the size from 295MB to 279MB. Or even a plain list:
 (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
 (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)
where the first element gives the “type” of the list, to ease the job of the reader. Well, the 2 naive questions are: does it make sense to - have the database stored under Git? - have a human-readable format? Thank you again for pushing this topic forward. :-) All the best, simon [1] https://forge.softwareheritage.org/T2430#47522 ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: Recovering source tarballs 2020-08-27 9:41 ` zimoun @ 2020-08-27 12:49 ` Ludovic Courtès 2020-08-27 18:06 ` Bengt Richter 1 sibling, 0 replies; 55+ messages in thread From: Ludovic Courtès @ 2020-08-27 12:49 UTC (permalink / raw) To: zimoun; +Cc: 42162, Maurice Brémond Hi! zimoun <zimon.toutoune@gmail.com> skribis: > Moreover, the format is a long list, e.g., > > (headers > ((name "raptor2-2.0.15/") > (mode 493) > (mtime 1414909500) > (chksum 4225) > (typeflag 53)) > ((name "raptor2-2.0.15/build/") > (mode 493) > (mtime 1414909497) > (chksum 4797) > (typeflag 53)) > ((name "raptor2-2.0.15/build/ltversion.m4") > (size 690) > (mtime 1414908273) > (chksum 5958)) > > […]) > > which is human-readable. Is it useful? > > > Instead, one could imagine shorter keywords: > > ((na "raptor2-2.0.15/") > (mo 493) > (mt 1414909500) > (ch 4225) > (ty 53)) > > which using your database (commit fc50927) reduces from 295MB to 279MB. I think it’s nice, at least at this stage, that it’s human-readable—“premature optimization is the root of all evil”. :-) I guess it won’t be difficult to make the format more dense eventually if that is deemed necessary, using ‘write’ instead of ‘pretty-print’, using tricks like you write, or even going binary as a last resort. Ludo’. ^ permalink raw reply [flat|nested] 55+ messages in thread
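The difference between ‘write’ and ‘pretty-print’ mentioned above can be illustrated with a small Guile sketch. It serializes two header entries copied from the quoted example both ways and compares the character counts. The helper name `serialized-length` is illustrative, and the sketch makes no claim about how the actual database files are produced.

```scheme
(use-modules (ice-9 pretty-print))

;; Two header entries copied from the example quoted above.
(define headers
  '(headers
    ((name "raptor2-2.0.15/")
     (mode 493) (mtime 1414909500) (chksum 4225) (typeflag 53))
    ((name "raptor2-2.0.15/build/ltversion.m4")
     (size 690) (mtime 1414908273) (chksum 5958))))

(define (serialize printer obj)
  "Return the string that PRINTER emits for OBJ."
  (call-with-output-string
    (lambda (port) (printer obj port))))

(define (serialized-length printer obj)
  "Return the number of characters PRINTER emits for OBJ."
  (string-length (serialize printer obj)))

(define pretty-length (serialized-length pretty-print headers))
(define dense-length  (serialized-length write headers))

(format #t "pretty-print: ~a chars, write: ~a chars~%"
        pretty-length dense-length)
```

Both outputs read back to the same data with ‘read’; ‘write’ only drops the line breaks and indentation, so it loses nothing except legibility.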
* bug#42162: Recovering source tarballs
  2020-08-27  9:41 ` zimoun
  2020-08-27 12:49 ` Ludovic Courtès
@ 2020-08-27 18:06 ` Bengt Richter
  1 sibling, 0 replies; 55+ messages in thread
From: Bengt Richter @ 2020-08-27 18:06 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Maurice Brémond

Hi,

On +2020-08-27 11:41:24 +0200, zimoun wrote:
> Hi,
>
> On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote:
> > zimoun <zimon.toutoune@gmail.com> writes:
> >
> >> One question is how this database scales?
> >>
> >> For example, a quick back-to-envelop estimation leads to ~1.2GB metadata
> >> for ~14k packages and then an increase of ~700MB per year, both with the
> >> Ludo’s code [1].
> >>
> >> [1] <http://issues.guix.gnu.org/issue/42162#11>
> >
> > It’s a good question. A good part of the size comes from the
> > representation rather than the data. Compression helps a lot here. I
> > have a database of 3,912 packages. It’s 295M uncompressed (which is a
> > little better than your estimation). If I pass each file through Lzip,
> > it shrinks down to 60M. That’s more like 15.5K per package, which is
> > almost an order of magnitude smaller than the estimation you used
> > (120K). I think that makes the numbers rather pleasant, but it comes at
> > the expense of easy storing in Git.
>
> Thank you for these numbers. Really interesting!
>
> First, I do not know if the database needs to be stored with Git. What
> should be the advantage? (naive question :-))
>
>
> On SWH T2430 [1], you explain the “default-header” trick to cut down the
> size. Nice!
>
> Moreover, the format is a long list, e.g.,
>
> --8<---------------cut here---------------start------------->8---
> (headers

How about
(X-v1-headers
(borrowing from RFC 2045 MIME usage indicating as-yet-not-a-formal-standard)

The idea is to make it easy to script the change to "(headers" once
there is consensus for declaring a new standard.

The "v1-" part could allow a simultaneous "(X-v2-headers" alternative
for zimoun's concise suggestion, or even a base64 of a compressed format.

There's lots that could be borrowed from the MIME RFCs :)
--8<---------------cut here---------------start------------->8---
6.3.  New Content-Transfer-Encodings

   Implementors may, if necessary, define private Content-Transfer-
   Encoding values, but must use an x-token, which is a name prefixed by
   "X-", to indicate its non-standard status, e.g., "Content-Transfer-
   Encoding: x-my-new-encoding".  Additional standardized Content-
   Transfer-Encoding values must be specified by a standards-track RFC.
   The requirements such specifications must meet are given in RFC 2048.
   As such, all content-transfer-encoding namespace except that
   beginning with "X-" is explicitly reserved to the IETF for future
   use.

   Unlike media types and subtypes, the creation of new Content-
   Transfer-Encoding values is STRONGLY discouraged, as it seems likely
   to hinder interoperability with little potential benefit
--8<---------------cut here---------------end--------------->8---

>  ((name "raptor2-2.0.15/")
>   (mode 493)

If you want to be more human-readable with mode, I would put a chmod
argument in place of 493 :)
--8<---------------cut here---------------start------------->8---
$ printf "%o\n" 493
755
$
--8<---------------cut here---------------end--------------->8---

Hm, could this be a security risk?? I mean, could a mode typo here
inadvertently open a door for a nasty mod by opportunistic code buried
in a later-executed, apparently unrelated app?

>   (mtime 1414909500)

One of these might be more human-recognizable :)
--8<---------------cut here---------------start------------->8---
$ date --date='@1414909497' -Is
2014-11-02T07:24:57+01:00
$ date --date='@1414909497' -uIs
2014-11-02T06:24:57+00:00
$ TZ=America/Buenos_Aires date --date='@1414909497' -Is
2014-11-02T03:24:57-03:00
$
$ date --date='@1414909497' -u '+%Y%m%d_%H%M%S'
20141102_062457
# vs 1414909497, which, yes, costs 5 chars less
$
--8<---------------cut here---------------end--------------->8---

>   (chksum 4225)
>   (typeflag 53))
>  ((name "raptor2-2.0.15/build/")
>   (mode 493)
>   (mtime 1414909497)
>   (chksum 4797)
>   (typeflag 53))
>  ((name "raptor2-2.0.15/build/ltversion.m4")
>   (size 690)
>   (mtime 1414908273)
>   (chksum 5958))
>
>  […])
> --8<---------------cut here---------------end--------------->8---
>
> which is human-readable. Is it useful?
>
>
> Instead, one could imagine shorter keywords:
>
(X-v2-headers ;; ;-)
>  ((na "raptor2-2.0.15/")
>   (mo 493)
>   (mt 1414909500)
>   (ch 4225)
>   (ty 53))
>
> which using your database (commit fc50927) reduces from 295MB to 279MB.
>
> Or even plain list:
>
(X-v3-headers
>  (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
>  (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)
>
> where the first element provides the “type” of list to ease the reader.
>
>
> Well, the 2 naive questions are: does it make sense to
> - have the database stored under Git?
> - have an human-readable format?
>
>
> Thank you again for pushing forward this topic. :-)
>
> All the best,
> simon
>
> [1] https://forge.softwareheritage.org/T2430#47522
>
>
>

Prefixing "X-" can obviously be used with any tentative name for
anything. I am suggesting it as a counter to premature (and likely
clashing) bindings of valuable names, which IMO is as bad as premature
optimization :)

Naming is too important to be defined by first-user flag-planting, ISTM.

-- 
Regards,
Bengt Richter

^ permalink raw reply [flat|nested] 55+ messages in thread
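The same conversions can be done from Guile when reading such a header entry. The sketch below is illustrative only: the variable names are made up, and it assumes that ‘mtime’ is a count of seconds since the Unix epoch and that ‘typeflag’ is a character code, as in the tar header format.

```scheme
(use-modules (srfi srfi-19))

;; Field values copied from the quoted header entries.
(define mode 493)
(define mtime 1414909497)
(define typeflag 53)

;; 493 written in base 8 is the familiar chmod notation.
(define mode-octal (number->string mode 8))

;; Render mtime in UTC (time zone offset 0).
(define mtime-utc
  (date->string (time-utc->date (make-time time-utc 0 mtime) 0)
                "~Y-~m-~dT~H:~M:~S"))

;; 53 is the character code of the digit 5, which tar uses for
;; directories.
(define typeflag-char (integer->char typeflag))

(format #t "~a ~a ~a~%" mode-octal mtime-utc typeflag-char)
```

This prints 755, the UTC timestamp shown by the ‘date -u’ call above, and the character 5.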
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2020-07-02 7:29 bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Ludovic Courtès 2020-07-02 8:50 ` zimoun @ 2021-01-10 19:32 ` Maxim Cournoyer 2021-01-13 10:39 ` Ludovic Courtès [not found] ` <handler.42162.D42162.16105343699609.notifdone@debbugs.gnu.org> 2 siblings, 1 reply; 55+ messages in thread From: Maxim Cournoyer @ 2021-01-10 19:32 UTC (permalink / raw) To: Ludovic Courtès; +Cc: 42162-done, Maurice Brémond Hello Ludovic, Ludovic Courtès <ludovic.courtes@inria.fr> writes: > Hello! > > The hosting site gforge.inria.fr will be taken off-line in December > 2020. This GForge instance hosts source code as tarballs, Subversion > repos, and Git repos. Users have been invited to migrate to > gitlab.inria.fr, which is Git only. It seems that Software Heritage > hasn’t archived (yet) all of gforge.inria.fr. Let’s keep track of the > situation in this issue. > > The following packages have their source on gforge.inria.fr: > > scheme@(guile-user)> ,pp packages-on-gforge > $7 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640> > #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0> > #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280> > #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0> > #<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640> > #<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780> > #<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0> > #<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0> > #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280> > #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960> > #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>) > > > ‘isl’ (a dependency of GCC) has its source on gforge.inria.fr but it’s > also mirrored at gcc.gnu.org apparently. 
> > Of these, the following are available on Software Heritage: > > scheme@(guile-user)> ,pp archived-source > $8 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0> > #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280> > #<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640> > #<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780> > #<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0> > #<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0> > #<package isl@0.18 gnu/packages/gcc.scm:925 7f632dc82320> > #<package isl@0.11.1 gnu/packages/gcc.scm:939 7f632dc82280>) I ran the code you had attached to the original message and got: ,pp packages-on-gforge $2 = () scheme@(guile-user)> ,pp archived-source $3 = () Closing, Thank you. Maxim ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2021-01-10 19:32 ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Maxim Cournoyer @ 2021-01-13 10:39 ` Ludovic Courtès 2021-01-13 12:27 ` Andreas Enge 2021-01-13 15:07 ` Andreas Enge 0 siblings, 2 replies; 55+ messages in thread From: Ludovic Courtès @ 2021-01-13 10:39 UTC (permalink / raw) To: Maxim Cournoyer; +Cc: 42162-done, Maurice Brémond, andreas.enge [-- Attachment #1: Type: text/plain, Size: 3547 bytes --] Hi Maxim, Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis: >> The following packages have their source on gforge.inria.fr: >> >> scheme@(guile-user)> ,pp packages-on-gforge >> $7 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640> >> #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0> >> #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280> [...] > I ran the code you had attached to the original message and got: > > ,pp packages-on-gforge > $2 = () > scheme@(guile-user)> ,pp archived-source > $3 = () Oh, it’s due to a bug, where the wrong ‘origin?’ predicate was taken. 
After hiding the “wrong” one: #:use-module ((guix swh) #:hide (origin?)) I get: --8<---------------cut here---------------start------------->8--- scheme@(guix-user)> ,pp packages-on-gforge $1 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3964 7fa8a522b280> #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:281 7fa8a4f44dc0> #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:343 7fa8a4f44c80> #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7fa8afd8aa00> #<package scotch@6.1.0 gnu/packages/maths.scm:3083 7fa8a69c8d20> #<package pt-scotch@6.1.0 gnu/packages/maths.scm:3229 7fa8a69c8be0> #<package scotch32@6.1.0 gnu/packages/maths.scm:3182 7fa8a69c8c80> #<package pt-scotch32@6.1.0 gnu/packages/maths.scm:3253 7fa8a69c8b40> #<package isl@0.22.1 gnu/packages/gcc.scm:932 7fa8a64cbdc0> #<package isl@0.11.1 gnu/packages/gcc.scm:997 7fa8a64cbc80> #<package isl@0.18 gnu/packages/gcc.scm:983 7fa8a64cbd20> #<package gf2x@1.2 gnu/packages/algebra.scm:104 7fa8a4f66500> #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:672 7fa8a4f70be0> #<package cmh@1.0 gnu/packages/algebra.scm:325 7fa8a4f660a0>) scheme@(guix-user)> ,pp archived-source $2 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:281 7fa8a4f44dc0> #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:343 7fa8a4f44c80> #<package scotch@6.1.0 gnu/packages/maths.scm:3083 7fa8a69c8d20> #<package pt-scotch@6.1.0 gnu/packages/maths.scm:3229 7fa8a69c8be0> #<package scotch32@6.1.0 gnu/packages/maths.scm:3182 7fa8a69c8c80> #<package pt-scotch32@6.1.0 gnu/packages/maths.scm:3253 7fa8a69c8b40> #<package isl@0.11.1 gnu/packages/gcc.scm:997 7fa8a64cbc80> #<package isl@0.18 gnu/packages/gcc.scm:983 7fa8a64cbd20>) --8<---------------cut here---------------end--------------->8--- Attaching the fixed script for clarity. 
BTW, gforge.inria.fr shutdown has been delayed a bit, but most active projects have started migrating to gitlab.inria.fr or elsewhere, so hopefully we should be able to start updating our package recipes accordingly. It’s likely, though, that tarballs were lost in the migration. For example, Scotch is now at <https://gitlab.inria.fr/scotch/scotch>. <https://gitlab.inria.fr/scotch/scotch/-/releases> shows “assets” for the 6.1.0 release, but these are auto-generated tarballs instead of the handcrafted one found on gforge.inria.fr (but this one is fine since its tarball is archived as-is on SWH.) ISL, MPFI, and GMP-ECM haven’t migrated, it seems. CMH is now at <https://gitlab.inria.fr/cmh/cmh> but without its tarballs. Andreas, do you happen to know about the status of these? We can already change Scotch and CMH to ‘git-fetch’ I think. That doesn’t solve the problem for earlier Guix revisions though, and I hope Disarchive will save us! Thanks, Ludo’. [-- Attachment #2: gforge.scm --] [-- Type: text/plain, Size: 1304 bytes --] (use-modules (guix) (gnu) (guix svn-download) (guix git-download) ((guix swh) #:hide (origin?)) (ice-9 match) (srfi srfi-1) (srfi srfi-26)) (define (gforge? package) (define (gforge-string? str) (string-contains str "gforge.inria.fr")) (match (package-source package) ((? origin? o) (match (origin-uri o) ((? string? url) (gforge-string? url)) (((? string? urls) ...) (any gforge-string? urls)) ;or 'find' ((? git-reference? ref) (gforge-string? (git-reference-url ref))) ((? svn-reference? ref) (gforge-string? (svn-reference-url ref))) (_ #f))) (_ #f))) (define packages-on-gforge (fold-packages (lambda (package result) (if (gforge? 
package) (cons package result) result)) '())) (define archived-source (filter (lambda (package) (let* ((origin (package-source package)) (hash (origin-hash origin))) (lookup-content (content-hash-value hash) (symbol->string (content-hash-algorithm hash))))) packages-on-gforge)) ^ permalink raw reply [flat|nested] 55+ messages in thread
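The empty results reported earlier follow from how ‘match’ treats a predicate that never succeeds: the clause is skipped and the catch-all returns #f without any error. The self-contained sketch below mimics that situation with two made-up predicates; `package-origin?`, `swh-origin?`, and `gforge-source?` are hypothetical stand-ins, not the real (guix packages) or (guix swh) bindings.

```scheme
(use-modules (ice-9 match))

;; Hypothetical stand-ins for two unrelated record types whose
;; predicates share the name 'origin?' in different modules.
(define (package-origin? x)
  (and (pair? x) (eq? (car x) 'package-origin)))
(define (swh-origin? x)
  (and (pair? x) (eq? (car x) 'swh-origin)))

(define (gforge-source? source origin?)
  "Return #t if SOURCE satisfies ORIGIN? and its URL mentions gforge."
  (match source
    ((? origin? o)
     (if (string-contains (cdr o) "gforge.inria.fr") #t #f))
    (_ #f)))                            ;silent fall-through

;; URL taken from the Scotch recipe in this thread.
(define source
  (cons 'package-origin
        "https://gforge.inria.fr/frs/download.php/latestfile/298/scotch_6.1.0.tar.gz"))

(display (list (gforge-source? source package-origin?)  ;right predicate
               (gforge-source? source swh-origin?)))    ;wrong predicate
(newline)
```

With the wrong predicate every package is rejected, which is consistent with ‘packages-on-gforge’ coming back as the empty list until the clashing binding is hidden with #:hide.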
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2021-01-13 10:39 ` Ludovic Courtès @ 2021-01-13 12:27 ` Andreas Enge 2021-01-13 15:07 ` Andreas Enge 1 sibling, 0 replies; 55+ messages in thread From: Andreas Enge @ 2021-01-13 12:27 UTC (permalink / raw) To: Ludovic Courtès; +Cc: 42162, Maurice Brémond, Maxim Cournoyer Hello, Am Wed, Jan 13, 2021 at 11:39:19AM +0100 schrieb Ludovic Courtès: > ISL, MPFI, and GMP-ECM haven’t migrated, it seems. CMH is now at > <https://gitlab.inria.fr/cmh/cmh> but without its tarballs. > > Andreas, do you happen to know about the status of these? For CMH, the tarballs are available from its (new) homepage: http://www.multiprecision.org/cmh/home.html I can update the location at the next release, which I should prepare some time soon (TM). Concerning MPFI and GMP-ECM, I can ask their respective authors to keep me updated; I have no doubts they are going to migrate their projects. For ISL, I do not know. Andreas ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2021-01-13 10:39 ` Ludovic Courtès 2021-01-13 12:27 ` Andreas Enge @ 2021-01-13 15:07 ` Andreas Enge 1 sibling, 0 replies; 55+ messages in thread From: Andreas Enge @ 2021-01-13 15:07 UTC (permalink / raw) To: Ludovic Courtès Cc: 42162, Maurice Brémond, Maxim Cournoyer, andreas.enge Am Wed, Jan 13, 2021 at 11:39:19AM +0100 schrieb Ludovic Courtès: > ISL, MPFI, and GMP-ECM haven’t migrated, it seems. gmp-ecm has migrated to gitlab.inria.fr; I just pushed a commit with an updated URI. Besides the automatically created gitlab releases with git snapshots, the maintainer also uploads a release tarball. I chose to use the latter, which requires to manually update a hash together with the version number upon a new release. Andreas ^ permalink raw reply [flat|nested] 55+ messages in thread
[parent not found: <handler.42162.D42162.16105343699609.notifdone@debbugs.gnu.org>]
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 [not found] ` <handler.42162.D42162.16105343699609.notifdone@debbugs.gnu.org> @ 2021-01-13 14:28 ` Ludovic Courtès 2021-01-14 14:21 ` Maxim Cournoyer 2021-10-04 15:59 ` bug#42162: gforge.inria.fr is off-line Ludovic Courtès 0 siblings, 2 replies; 55+ messages in thread From: Ludovic Courtès @ 2021-01-13 14:28 UTC (permalink / raw) To: 42162; +Cc: Maurice [-- Attachment #1: Type: text/plain, Size: 240 bytes --] help-debbugs@gnu.org (GNU bug Tracking System) skribis: > We can already change Scotch and CMH to ‘git-fetch’ I think. For Scotch, the ‘v6.1.0’ tag at gitlab.inria.fr provides different content than the tarball on gforge: [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: the diff --] [-- Type: text/x-patch, Size: 3000 bytes --] Nur en /tmp/scotch_6.1.0/: bin Nur en /tmp/scotch_6.1.0/doc/src/ptscotch: p.ps Nur en /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout: .gitignore Nur en /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout: .gitlab-ci.yml Nur en /tmp/scotch_6.1.0/: include Nur en /tmp/scotch_6.1.0/: lib diff -ru /tmp/scotch_6.1.0/src/libscotch/library.h /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotch/library.h --- /tmp/scotch_6.1.0/src/libscotch/library.h 1970-01-01 01:00:01.000000000 +0100 +++ /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotch/library.h 1970-01-01 01:00:01.000000000 +0100 @@ -67,8 +67,6 @@ /*+ Integer type. 
+*/ -#include <stdint.h> - typedef DUMMYIDX SCOTCH_Idx; typedef DUMMYINT SCOTCH_Num; diff -ru /tmp/scotch_6.1.0/src/libscotch/Makefile /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotch/Makefile --- /tmp/scotch_6.1.0/src/libscotch/Makefile 1970-01-01 01:00:01.000000000 +0100 +++ /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotch/Makefile 1970-01-01 01:00:01.000000000 +0100 @@ -2320,28 +2320,6 @@ common.h \ scotch.h -library_graph_diam$(OBJ) : library_graph_diam.c \ - module.h \ - common.h \ - graph.h \ - scotch.h - -library_graph_diam_f$(OBJ) : library_graph_diam.c \ - module.h \ - common.h \ - scotch.h - -library_graph_induce$(OBJ) : library_graph_diam.c \ - module.h \ - common.h \ - graph.h \ - scotch.h - -library_graph_induce_f$(OBJ) : library_graph_diam.c \ - module.h \ - common.h \ - scotch.h - library_graph_io_chac$(OBJ) : library_graph_io_chac.c \ module.h \ common.h \ diff -ru /tmp/scotch_6.1.0/src/libscotchmetis/library_metis.h /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotchmetis/library_metis.h --- /tmp/scotch_6.1.0/src/libscotchmetis/library_metis.h 1970-01-01 01:00:01.000000000 +0100 +++ /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotchmetis/library_metis.h 1970-01-01 01:00:01.000000000 +0100 @@ -106,7 +106,6 @@ */ #ifndef SCOTCH_H /* In case "scotch.h" not included before */ -#include <stdint.h> typedef DUMMYINT SCOTCH_Num; #endif /* SCOTCH_H */ diff -ru /tmp/scotch_6.1.0/src/libscotchmetis/library_parmetis.h /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotchmetis/library_parmetis.h --- /tmp/scotch_6.1.0/src/libscotchmetis/library_parmetis.h 1970-01-01 01:00:01.000000000 +0100 +++ /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotchmetis/library_parmetis.h 1970-01-01 01:00:01.000000000 +0100 @@ -106,7 +106,6 @@ */ #ifndef SCOTCH_H /* In case "scotch.h" not included before */ 
-#include <stdint.h> typedef DUMMYINT SCOTCH_Num; #endif /* SCOTCH_H */ [-- Attachment #3: Type: text/plain, Size: 214 bytes --] There’s not much we can do if upstream isn’t more cautious though. Perhaps we can still update to the “new” 6.1.0, maybe labeling it “6.1.0b”? Attached a tentative patch. Thanks, Ludo’. [-- Attachment #4: Type: text/x-patch, Size: 1753 bytes --] diff --git a/gnu/packages/maths.scm b/gnu/packages/maths.scm index 7866bcc6eb..4f8f79052d 100644 --- a/gnu/packages/maths.scm +++ b/gnu/packages/maths.scm @@ -12,7 +12,7 @@ ;;; Copyright © 2015 Fabian Harfert <fhmgufs@web.de> ;;; Copyright © 2016 Roel Janssen <roel@gnu.org> ;;; Copyright © 2016, 2018, 2020 Kei Kebreau <kkebreau@posteo.net> -;;; Copyright © 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org> +;;; Copyright © 2016, 2017, 2018, 2019, 2020, 2021 Ludovic Courtès <ludo@gnu.org> ;;; Copyright © 2016 Leo Famulari <leo@famulari.name> ;;; Copyright © 2016, 2017 Thomas Danckaert <post@thomasdanckaert.be> ;;; Copyright © 2017, 2018, 2019, 2020 Paul Garlick <pgarlick@tourbillion-technology.com> @@ -3083,13 +3083,15 @@ implemented in ANSI C, and MPI for communications.") (package (name "scotch") (version "6.1.0") - (source - (origin - (method url-fetch) - (uri (string-append "https://gforge.inria.fr/frs/download.php/" - "latestfile/298/scotch_" version ".tar.gz")) + (source (origin + (method git-fetch) + (uri (git-reference + (url "https://gitlab.inria.fr/scotch/scotch") + (commit (string-append "v" version)))) + (file-name (git-file-name name version)) (sha256 - (base32 "1184fcv4wa2df8szb5lan6pjh0raarr45pk8ilpvbz23naikzg53")) + (base32 + "164jqsy75j7zfnwngj10jc4060shhxni3z8ykklhqjykdrinir55")) (patches (search-patches "scotch-build-parallelism.patch" "scotch-integer-declarations.patch")))) (build-system gnu-build-system) ^ permalink raw reply related [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2021-01-13 14:28 ` Ludovic Courtès @ 2021-01-14 14:21 ` Maxim Cournoyer 2021-10-04 15:59 ` bug#42162: gforge.inria.fr is off-line Ludovic Courtès 1 sibling, 0 replies; 55+ messages in thread From: Maxim Cournoyer @ 2021-01-14 14:21 UTC (permalink / raw) To: Ludovic Courtès; +Cc: 42162, Maurice Hi Ludovic, Ludovic Courtès <ludovic.courtes@inria.fr> writes: [...] > There’s not much we can do if upstream isn’t more cautious though. > Perhaps we can still update to the “new” 6.1.0, maybe labeling it > “6.1.0b”? I'd prefer to append a '-1' revision rather than changing the version string itself; as that is IMO the business of upstream. Thanks, Maxim ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr is off-line 2021-01-13 14:28 ` Ludovic Courtès 2021-01-14 14:21 ` Maxim Cournoyer @ 2021-10-04 15:59 ` Ludovic Courtès 2021-10-04 17:50 ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 zimoun 1 sibling, 1 reply; 55+ messages in thread From: Ludovic Courtès @ 2021-10-04 15:59 UTC (permalink / raw) To: 42162; +Cc: Maurice Brémond, andreas.enge, Maxim Cournoyer Hi! Ludovic Courtès <ludovic.courtes@inria.fr> skribis: > help-debbugs@gnu.org (GNU bug Tracking System) skribis: > >> We can already change Scotch and CMH to ‘git-fetch’ I think. > > For Scotch, the ‘v6.1.0’ tag at gitlab.inria.fr provides different > content than the tarball on gforge: [...] > There’s not much we can do if upstream isn’t more cautious though. > Perhaps we can still update to the “new” 6.1.0, maybe labeling it > “6.1.0b”? > > Attached a tentative patch. Believe it or not, gforge.inria.fr was finally phased out on Sept. 30th. And believe it or not, despite all the work and all the chat :-), we lost the source tarball of Scotch 6.1.1 for a short period of time (I found a copy and uploaded it to berlin a couple of hours ago). 
Going back to the script at the beginning of this bug report, we get (on 688a4db071736a772e6b5515d7c03fe501c3c15a): --8<---------------cut here---------------start------------->8--- scheme@(guile-user)> ,pp packages-on-gforge $2 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:4357 7f08823d8630> #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:566 7f088675c630> #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:628 7f088675c4d0> #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f0881609160> #<package scotch-shared@6.1.1 gnu/packages/maths.scm:3732 7f0882964c60> #<package pt-scotch32@6.1.1 gnu/packages/maths.scm:3814 7f0882964b00> #<package pt-scotch-shared@6.1.1 gnu/packages/maths.scm:3837 7f0882964a50> #<package scotch32@6.1.1 gnu/packages/maths.scm:3684 7f0882964d10> #<package why3@1.3.3 gnu/packages/maths.scm:6904 7f0882357e70> #<package pt-scotch@6.1.1 gnu/packages/maths.scm:3790 7f0882964bb0> #<package scotch@6.1.1 gnu/packages/maths.scm:3581 7f0882964dc0> #<package isl@0.18 gnu/packages/gcc.scm:1103 7f088161cbb0> #<package isl@0.22.1 gnu/packages/gcc.scm:1052 7f088161cc60> #<package isl@0.11.1 gnu/packages/gcc.scm:1117 7f088161cb00> #<package gf2x@1.2 gnu/packages/algebra.scm:107 7f0880397b00> #<package gappa@1.3.5 gnu/packages/algebra.scm:1273 7f08803a76e0>) scheme@(guile-user)> ,pp archived-source $3 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:566 7f088675c630> #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:628 7f088675c4d0> #<package isl@0.18 gnu/packages/gcc.scm:1103 7f088161cbb0> #<package isl@0.11.1 gnu/packages/gcc.scm:1117 7f088161cb00>) scheme@(guile-user)> ,pp (lset-difference eq? 
packages-on-gforge archived-source) $4 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:4357 7f08823d8630> #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f0881609160> #<package scotch-shared@6.1.1 gnu/packages/maths.scm:3732 7f0882964c60> #<package pt-scotch32@6.1.1 gnu/packages/maths.scm:3814 7f0882964b00> #<package pt-scotch-shared@6.1.1 gnu/packages/maths.scm:3837 7f0882964a50> #<package scotch32@6.1.1 gnu/packages/maths.scm:3684 7f0882964d10> #<package why3@1.3.3 gnu/packages/maths.scm:6904 7f0882357e70> #<package pt-scotch@6.1.1 gnu/packages/maths.scm:3790 7f0882964bb0> #<package scotch@6.1.1 gnu/packages/maths.scm:3581 7f0882964dc0> #<package isl@0.22.1 gnu/packages/gcc.scm:1052 7f088161cc60> #<package gf2x@1.2 gnu/packages/algebra.scm:107 7f0880397b00> #<package gappa@1.3.5 gnu/packages/algebra.scm:1273 7f08803a76e0>) --8<---------------cut here---------------end--------------->8--- All this to say that we must really get our act together with Disarchive :-), and salvage all these tarballs until then. There are redirects in place for some of these, but probably not all. Ludo’. ^ permalink raw reply [flat|nested] 55+ messages in thread
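For readers unfamiliar with SRFI-1, the ‘lset-difference’ call used above can be tried on plain symbols. This is a reduced sketch: the symbols stand in for package objects, and the two lists are abbreviated, not the full transcript.

```scheme
(use-modules (srfi srfi-1))

;; Symbols standing in for package objects (abbreviated lists).
(define packages-on-gforge
  '(r-spams ocaml-cudf ocaml-dose3 mpfi gf2x))

;; As in the script, the archived list is obtained by filtering the
;; first one, so both lists share the very same objects.
(define archived-source
  (filter (lambda (package)
            (memq package '(ocaml-cudf ocaml-dose3)))
          packages-on-gforge))

;; Elements of the first list that are absent from the second.
(define missing
  (lset-difference eq? packages-on-gforge archived-source))

(display missing)
(newline)
```

Comparing with ‘eq?’ is enough here precisely because ‘archived-source’ is a filtered view of ‘packages-on-gforge’; two lists of package objects built independently might need a different equality test.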
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2021-10-04 15:59 ` bug#42162: gforge.inria.fr is off-line Ludovic Courtès
@ 2021-10-04 17:50 ` zimoun
  2021-10-07 16:07 ` Ludovic Courtès
  0 siblings, 1 reply; 55+ messages in thread
From: zimoun @ 2021-10-04 17:50 UTC (permalink / raw)
  To: Ludovic Courtès
  Cc: 42162, Maurice Brémond, Maxim Cournoyer, andreas.enge

Hi Ludo,

On Mon, 04 Oct 2021 at 17:59, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> Believe it or not, gforge.inria.fr was finally phased out on
> Sept. 30th. And believe it or not, despite all the work and all the
> chat :-), we lost the source tarball of Scotch 6.1.1 for a short period
> of time (I found a copy and uploaded it to berlin a couple of hours
> ago).

Uh, I do not understand. According to bug#43442 [1], on Wed, 16 Sep 2020,
Scotch was not missing. Nor was it according to [2].

Nah, the hole is the (double) update (from 6.0.6 to 6.1.0, then 6.1.1)
made without manually taking this bug report into account, for instance
by switching from url-fetch to git-fetch. Somehow, it was bound to happen
because we lack automatic tools, despite the fact that they are there.

Indeed, hard to believe. :-)

As I asked in this thread [3], the Guix project has the resources,
storage-wise, to archive these tarballs while waiting for a robust
long-term automatic system. But we (the Guix project) cannot, because we
duplicate effort by keeping all the build outputs twice. Somehow,
coherent preservation policies between Berlin and Bordeaux are missing.
IMHO.

1: <http://issues.guix.gnu.org/issue/43442>
2: <http://issues.guix.gnu.org/issue/42162#0>
3: <https://lists.gnu.org/archive/html/guix-devel/2021-09/msg00174.html>

> All this to say that we must really get our act together with Disarchive
> :-), and salvage all these tarballs until then.

Definitely! We are witnessing missing tarballs here. But many more could
be missing from Berlin or Bordeaux, and the upstream copies may have
disappeared as well.

Cheers,
simon

^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2021-10-04 17:50 ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 zimoun @ 2021-10-07 16:07 ` Ludovic Courtès 2021-10-09 17:29 ` raingloom 2021-10-11 8:41 ` zimoun 0 siblings, 2 replies; 55+ messages in thread From: Ludovic Courtès @ 2021-10-07 16:07 UTC (permalink / raw) To: zimoun; +Cc: 42162, Maurice Brémond, Maxim Cournoyer, andreas.enge Hi! zimoun <zimon.toutoune@gmail.com> skribis: > Euh, I do not understand. From bug#43442 [1] on Wed, 16 Sep 2020, > Scotch was not missing. And from [2] neither. > > Nah, the hole is the (double) update (from 6.0.6 to 6.1.0 then 6.1.1) > without manually taking care of this bug report; by switching from > url-fetch to git-fetch for instance. Somehow, it was bounded to happen > because we lack automatic tools despite the fact they are there. > > Indeed, hard to believe. :-) I guess, in our mind, the problem was fixed long ago. :-) > As I am asking in this thread [3], the Guix project has the ressource, > storage speaking, to archive these tarballs -- waiting a robust > long-term automatic system. But we (the Guix projet) cannot because we > duplicate the effort on keeping twice all the build outputs. Somehow, > between Berlin and Bordeaux, coherent policies for conservancy are > missing. IMHO. So I think we’re lucky that we can try different solutions at once. The best solution is the one that won’t rely solely on the Guix project: SWH + Disarchive. We’re getting there! The second-best solution is to improve our tooling so we can actually keep source code in a more controlled way. That’s what I had in mind with <https://ci.guix.gnu.org/jobset/source>. We have storage space for that on berlin, but it’s not infinite. Another approach is to use ‘git-fetch’ more, at least for non-Autotools packages (that’s the case for Scotch, for instance.) 
So we can do all these things, and we’ll have to push hard to get the Disarchive option past the finish line because it’s the most promising long-term. Thanks, Ludo’. ^ permalink raw reply [flat|nested] 55+ messages in thread
* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 2021-10-07 16:07 ` Ludovic Courtès @ 2021-10-09 17:29 ` raingloom 2021-10-11 8:41 ` zimoun 1 sibling, 0 replies; 55+ messages in thread From: raingloom @ 2021-10-09 17:29 UTC (permalink / raw) To: 42162 On Thu, 07 Oct 2021 18:07:16 +0200 Ludovic Courtès <ludovic.courtes@inria.fr> wrote: > Hi! > > zimoun <zimon.toutoune@gmail.com> skribis: > > > Euh, I do not understand. From bug#43442 [1] on Wed, 16 Sep 2020, > > Scotch was not missing. And from [2] neither. > > > > Nah, the hole is the (double) update (from 6.0.6 to 6.1.0 then > > 6.1.1) without manually taking care of this bug report; by > > switching from url-fetch to git-fetch for instance. Somehow, it > > was bounded to happen because we lack automatic tools despite the > > fact they are there. > > > > Indeed, hard to believe. :-) > > I guess, in our mind, the problem was fixed long ago. :-) > > > As I am asking in this thread [3], the Guix project has the > > ressource, storage speaking, to archive these tarballs -- waiting a > > robust long-term automatic system. But we (the Guix projet) cannot > > because we duplicate the effort on keeping twice all the build > > outputs. Somehow, between Berlin and Bordeaux, coherent policies > > for conservancy are missing. IMHO. > > So I think we’re lucky that we can try different solutions at once. > > The best solution is the one that won’t rely solely on the Guix > project: SWH + Disarchive. We’re getting there! > > The second-best solution is to improve our tooling so we can actually > keep source code in a more controlled way. That’s what I had in mind > with <https://ci.guix.gnu.org/jobset/source>. We have storage space > for that on berlin, but it’s not infinite. > > Another approach is to use ‘git-fetch’ more, at least for > non-Autotools packages (that’s the case for Scotch, for instance.) Out of curiosity, why only non-autotools? ^ permalink raw reply [flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
From: zimoun @ 2021-10-11 8:41 UTC
To: Ludovic Courtès
Cc: 42162, Maurice Brémond, Maxim Cournoyer, andreas.enge

Hi,

On Thu, 7 Oct 2021 at 18:07, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> I guess, in our mind, the problem was fixed long ago.  :-)

Yes, to me the 2 remaining packages were the ones from
<http://issues.guix.gnu.org/43442#0>, but they had already moved to
GitLab.  Whatever. :-)

> > As I am asking in this thread [3], the Guix project has the resources,
> > storage-wise, to archive these tarballs -- while waiting for a robust
> > long-term automatic system.  But we (the Guix project) cannot because
> > we duplicate the effort by keeping all the build outputs twice.
> > Somehow, between Berlin and Bordeaux, coherent policies for
> > conservancy are missing.  IMHO.
>
> So I think we’re lucky that we can try different solutions at once.

Well, it is not what I am observing.  Anyway. :-)

> The best solution is the one that won’t rely solely on the Guix project:
> SWH + Disarchive.  We’re getting there!

Yes.  Although, it is hard to define "the Guix project". :-)  Well, the
remaining question is where to host the Disarchive database... but
hardware could be floating around once it is ready. ;-)

> The second-best solution is to improve our tooling so we can actually
> keep source code in a more controlled way.  That’s what I had in mind
> with <https://ci.guix.gnu.org/jobset/source>.  We have storage space for
> that on berlin, but it’s not infinite.

If Berlin has space, why are so many derivations missing when running
time-machine?  Well, aside from the implementation details: that
ci.guix.gnu.org fetches from the repo every X minutes, i.e., drops all
the commits (and the associated derivations) pushed in the meantime,
and that bordeaux.guix.gnu.org fetches the commit batch from
guix-commits, i.e., builds only one commit of this batch.

> Another approach is to use ‘git-fetch’ more, at least for non-Autotools
> packages (that’s the case for Scotch, for instance.)

This is what I suggested when opening this thread [1] more than one
year ago.  Reading the discussion and keeping in mind the inertia, I
do not think it is a viable path.  For instance, you know all the
pitfalls and you updated Scotch without switching to git-fetch -- no
criticism :-) just a realistic matter of fact about getting good
coverage.

<https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html>

> So we can do all these things, and we’ll have to push hard to get the
> Disarchive option past the finish line because it’s the most promising
> long-term.

Agreed.  I even think it is the only long-term option. :-)

All the best,
simon

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
From: Ludovic Courtès @ 2021-10-12 9:24 UTC
To: zimoun
Cc: 42162, Maurice Brémond, Maxim Cournoyer, andreas.enge

Hello!

I sense a lot of impatience in your message :-), and I also see many
questions.  It is up to us all to answer them, I’ll just reply
selectively here.

zimoun <zimon.toutoune@gmail.com> skribis:

> On Thu, 7 Oct 2021 at 18:07, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

[...]

>> The second-best solution is to improve our tooling so we can actually
>> keep source code in a more controlled way.  That’s what I had in mind
>> with <https://ci.guix.gnu.org/jobset/source>.  We have storage space for
>> that on berlin, but it’s not infinite.
>
> If Berlin has space, why are so many derivations missing when running
> time-machine?

That’s not related to the question at hand, but it would be worth
investigating, first by trying to quantify that.

For the record, the ‘guix publish’ config on berlin is here:

  https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/modules/sysadmin/services.scm#n485

If I read that correctly, nars have a TTL of 180 days (this is the time
a nar is retained after the last time it has been requested, so it’s a
lower bound.)

>> Another approach is to use ‘git-fetch’ more, at least for non-Autotools
>> packages (that’s the case for Scotch, for instance.)
>
> This is what I suggested when opening this thread [1] more than one
> year ago.  Reading the discussion and keeping in mind the inertia, I
> do not think it is a viable path.  For instance, you know all the
> pitfalls and you updated Scotch without switching to git-fetch -- no
> criticism :-) just a realistic matter of fact about getting good
> coverage.
>
> <https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html>

Right, and I agree Scotch is a package that can definitely use
‘git-fetch’ (there are bootstrapping considerations for packages low in
the stack, for instance you wouldn’t want to have Git fetched over
‘git-fetch’, but for packages like this there’s no reason not to use
‘git-fetch’.)

Thanks,
Ludo’.

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
From: zimoun @ 2021-10-12 10:50 UTC
To: Ludovic Courtès
Cc: 42162, Maurice Brémond, Maxim Cournoyer, andreas.enge

Hi Ludo,

On Tue, 12 Oct 2021 at 11:24, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> I sense a lot of impatience in your message :-), and I also see many
> questions.  It is up to us all to answer them, I’ll just reply
> selectively here.

Impatience?  Probably. :-)

> zimoun <zimon.toutoune@gmail.com> skribis:
>> On Thu, 7 Oct 2021 at 18:07, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
>
> [...]
>
>>> The second-best solution is to improve our tooling so we can actually
>>> keep source code in a more controlled way.  That’s what I had in mind
>>> with <https://ci.guix.gnu.org/jobset/source>.  We have storage space for
>>> that on berlin, but it’s not infinite.
>>
>> If Berlin has space, why are so many derivations missing when running
>> time-machine?
>
> That’s not related to the question at hand, but it would be worth
> investigating, first by trying to quantify that.

The question seems related. :-)  Because you are saying “we have storage
space for that on Berlin”…

> For the record, the ‘guix publish’ config on berlin is here:
>
>   https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/modules/sysadmin/services.scm#n485
>
> If I read that correctly, nars have a TTL of 180 days (this is the time
> a nar is retained after the last time it has been requested, so it’s a
> lower bound.)

…and the NARs are more or less removed after 180 days if no one asked
for them during these 180 days, IIUC.  This policy seems to keep the
size of the storage under control, I guess.  And I provide an annoying
example of such a policy. :-)  Anyway, I agree it is not, for now, the
core of the question at hand. :-)

About quantifying, it is clearly not related to the question at
hand. ;-)  Just for the record, here is a back-of-the-envelope
computation.  180 days before today was April 15th (M-x calendar C-u 180
C-b).  It means 6996 commits (35aaf1fe10 is my current last commit).

    git log --format="%cd" --after=2021-04-15 | wc -l
    6996

However, these commits are pushed in batches.  Roughly, it reads:

    git log --format="%cd" --after=2021-04-15 --date=unix \
       | awk 'NR == 1{old= $1; next}{print old - $1; old = $1}' \
       | sort -n | uniq -c | grep -e "0$" | head
          1 -1542620
       3388 0
         14 10
          6 20
          5 30
          2 40
          4 50
          1 60
          2 70
          2 80

(Take the ’awk’ with care, I am not sure of what I am doing. :-)  And
it is rough because of timezones, etc.)

In other words, 3388/6996 = ~50% of commits are pushed at the same
time, i.e., missed by both build farms, which use 2 different
strategies to collect the things to build (fetch every 5 minutes or
fetch from guix-commits).  It is a quick back-of-the-envelope estimate,
so take it with a grain of salt. :-)

With that number, after 180 days (6 months), it is hard to evaluate the
rate of the time-machine queries.  And from my experience (no numbers to
back it up), running time-machine on a commit older than these 180 days
implies building derivations.  Or it is a lucky day. :-)

Drifting, right?  Let’s focus on the question at hand.  However, this
question of long-term policy asked at:

<https://lists.gnu.org/archive/html/guix-devel/2021-09/msg00215.html>

appears to me worth asking. :-)

Cheers,
simon
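
The back-of-the-envelope estimate above can be reproduced without ‘awk’. The following is only a sketch in Python, not something used in the thread: it assumes the timestamps come from something like `git log --format=%ct` (committer date as Unix seconds, newest first), and the sample values are made up for illustration. It counts every gap between consecutive commits, so unlike the `grep -e "0$"` filter above it does not hide gaps that are not multiples of ten.

```python
from collections import Counter


def delta_histogram(timestamps):
    """Count the gaps, in seconds, between consecutive commit timestamps.

    `timestamps` is expected in `git log` order (newest first).
    """
    deltas = [newer - older for newer, older in zip(timestamps, timestamps[1:])]
    return Counter(deltas)


def same_second_share(timestamps):
    """Share of commits whose committer date equals that of the next commit."""
    if not timestamps:
        return 0.0
    return delta_histogram(timestamps)[0] / len(timestamps)


# Made-up sample: five commits, two pairs sharing the same second.
sample = [1615746215, 1615746215, 1615746214, 1615746214, 1615746200]
histogram = delta_histogram(sample)
share = same_second_share(sample)

# The ratio quoted in the message above, computed exactly.
quoted_ratio = 3388 / 6996

print(sorted(histogram.items()))
print(f"sample share: {share:.0%}")
print(f"quoted ratio: {quoted_ratio:.1%}")
```

A zero gap only says that two commits carry the same committer date; treating that as "pushed in the same batch" is the heuristic from the message, not something Git records.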

* Substitute retention
From: Ludovic Courtès @ 2021-10-12 16:04 UTC
To: zimoun
Cc: guix-devel

Hi!

(Moving to guix-devel from <https://issues.guix.gnu.org/42162#43>.)

zimoun <zimon.toutoune@gmail.com> skribis:

>> For the record, the ‘guix publish’ config on berlin is here:
>>
>>   https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/modules/sysadmin/services.scm#n485
>>
>> If I read that correctly, nars have a TTL of 180 days (this is the time
>> a nar is retained after the last time it has been requested, so it’s a
>> lower bound.)

[...]

> Just for the record, here is a back-of-the-envelope computation.  180
> days before today was April 15th (M-x calendar C-u 180 C-b).  It means
> 6996 commits (35aaf1fe10 is my current last commit).
>
>     git log --format="%cd" --after=2021-04-15 | wc -l
>     6996
>
> However, these commits are pushed in batches.  Roughly, it reads:
>
>     git log --format="%cd" --after=2021-04-15 --date=unix \
>        | awk 'NR == 1{old= $1; next}{print old - $1; old = $1}' \
>        | sort -n | uniq -c | grep -e "0$" | head
>           1 -1542620
>        3388 0
>          14 10
>           6 20
>           5 30
>           2 40
>           4 50
>           1 60
>           2 70
>           2 80
>
> (Take the ’awk’ with care, I am not sure of what I am doing. :-)  And
> it is rough because of timezones, etc.)
>
> In other words, 3388/6996 = ~50% of commits are pushed at the same
> time, i.e., missed by both build farms, which use 2 different
> strategies to collect the things to build (fetch every 5 minutes or
> fetch from guix-commits).  It is a quick back-of-the-envelope estimate,
> so take it with a grain of salt. :-)

OK.

> With that number, after 180 days (6 months), it is hard to evaluate the
> rate of the time-machine queries.  And from my experience (no numbers to
> back it up), running time-machine on a commit older than these 180 days
> implies building derivations.  Or it is a lucky day. :-)

Right.

So what can we do to address this issue?  I *think* we could use a
higher TTL on berlin, and we can try that right away (9 months to begin
with?).

However, there is an upper bound anyway.  To make informed decisions on
the retention policy, we should monitor storage space on berlin/bayfront
to better estimate what can be done.  We have Zabbix but it’s not
accessible from the outside; maybe we could graph storage space
somewhere so people can grab the data and work on those estimates?

What if we decide that we need to provide substitutes for 2y old
commits?  In that case, we need a plan to scale up.  That could be
renting storage space somewhere.  That’s largely non-technical work that
needs attention.

There are also technical tweaks that could help: distinguishing between
“important” substitutes that we want to keep, and less important
substitutes (how?); identifying “equivalence classes” for builds of a
given package; etc.  The outcome is unclear and it’ll take time.

Thoughts?

Ludo’.
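
To make the TTL semantics discussed above concrete, here is a small sketch in Python. It only models the policy as described in the thread (a nar becomes eligible for removal once the TTL has elapsed since it was last requested); it is not how ‘guix publish’ is implemented, the function names are hypothetical, and the 270-day figure is merely an assumed stand-in for “9 months”.

```python
from datetime import date, timedelta


def eligible_for_removal_on(last_requested, ttl_days):
    """First day on which a nar may be removed under the described policy."""
    return last_requested + timedelta(days=ttl_days)


def is_retained(last_requested, today, ttl_days=180):
    """True while the nar is still guaranteed to be kept.

    The TTL is a lower bound: actual removal may happen later.
    """
    return today < eligible_for_removal_on(last_requested, ttl_days)


# Dates taken from the thread: 180 days before 2021-10-12 is 2021-04-15.
last_requested = date(2021, 4, 15)
today = date(2021, 10, 12)

print(is_retained(last_requested, today, ttl_days=180))
print(is_retained(last_requested, today, ttl_days=270))  # assumed "9 months"
```

Under this model, a substitute last requested on April 15th is no longer guaranteed on October 12th with the 180-day TTL, but would still be kept with the longer one.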

* Re: Substitute retention
From: zimoun @ 2021-10-12 18:06 UTC
To: Ludovic Courtès
Cc: guix-devel

Hey,

On Tue, 12 Oct 2021 at 18:04, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> (Moving to guix-devel from <https://issues.guix.gnu.org/42162#43>.)

I was preparing a report.  You have been faster than me. :-)

Two questions arise, IIUC:

 1. the “modular” derivations for all the commits
 2. long-term support for substitutes

>> In other words, 3388/6996 = ~50% of commits are pushed at the same
>> time, i.e., missed by both build farms, which use 2 different
>> strategies to collect the things to build (fetch every 5 minutes or
>> fetch from guix-commits).  It is a quick back-of-the-envelope estimate,
>> so take it with a grain of salt. :-)
>
> OK.

To make #1 explicit, I was talking about the “modular” Guix, i.e.,
running “guix pull” or “guix time-machine” leads to building the
derivations module-import.drv, guix-<hash>.drv, guix-command.drv,
guix-module-union.drv, guix-<hash>-modules.drv,
guix-packages-modules.drv, guix-system-tests-modules.drv,
guix-packages-base-modules.drv, etc.  On slow machines, it can be
unpleasant, not to say impractical.  Even for recent commits.

The recent addition of ’channel-with-substitutes-available’ helps when
going forward (guix pull) if the build farm does not have these yet.
The issue is going backward (guix time-machine).

Basically, commit 59d10c3112 is from March 14, 2020 and it takes ~29min
on my slow laptop.  And to compare apples to apples, let’s take another
commit one year later, from March 14, 2021, e.g., commit 7327295462.  It
takes ~5min on the same machine.

--8<---------------cut here---------------start------------->8---
$ time guix time-machine --commit=59d10c3112 -- --version
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0% The following derivations will be built: /gnu/store/zvy89f9xb53fbqvfrm7lql8mbfrsfk1b-compute-guix-derivation.drv /gnu/store/7y80kn1bypnbm869hvcq8841mr6nqvfm-module-import-compiled.drv /gnu/store/amwvgaf45722k6jn4r39983zsgmbyp2g-module-import.drv /gnu/store/h3h0qfiw5100zkwfb919r7vn0q06ksqy-config.scm.drv /gnu/store/jkwhdilsbxb18hx6gi4i2rj0v06mfbab-module-import.drv /gnu/store/sixfy4sazai667n99pxa5h7wzzaabw79-module-import-compiled.drv [...] Computing Guix derivation for 'x86_64-linux'... WARNING: (guix build emacs-build-system): imported module (guix build utils) overrides core binding `delete' substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0% [...] substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0% The following derivations will be built: /gnu/store/50ymxym19h8whzg3ajcl6kdjmq3p7qrg-profile.drv /gnu/store/l0znp7g83lbylbv97nd3ahz8rnrvxfrf-guix-59d10c311.drv /gnu/store/bvxrnp8bydl3zsbcg7j8j7m0qfygdhfs-guix-command.drv /gnu/store/xmsciylxx1j7nbry5cv1lm7595a8rilr-guix-module-union.drv /gnu/store/yvkw65kv9bhvx41750dchxw98qqxv64b-guix-59d10c311-modules.drv /gnu/store/6hrn3bpvcg8571mckzcj519xf2kqn2sl-guix-packages-base-modules.drv /gnu/store/nwl8z8cx9pdwn0sx5i5j5mp0bkdi64mm-guix-packages-base.drv /gnu/store/0j0271nbm2l526m5xs7zpd686qqrjz7w-guix-core.drv /gnu/store/2134jzhhckh871vlscw6dmwqqhny9zxg-guix-core-source.drv /gnu/store/mhik7ggrf4z5f38nsg7g0gbijm916b98-config.scm.drv /gnu/store/53zlyv96nrrm4h5ns7nmmndj8jys38f7-guix-extra.drv /gnu/store/7dn3f27i48jp6zvlwanzk8mfy828k6cm-guix-config-modules.drv /gnu/store/6xy3yyvjqvjyrlrwk1lzs12knbvilqpy-guix-config-source.drv /gnu/store/15ihwkqzyz4r4b4rppb92qcawha6a7p7-config.scm.drv /gnu/store/cacawv4yib8pa2ajzw0kyaihgym72mww-guix-config.drv /gnu/store/baivfv20hzm799v0wvdrcfaimh4aw22a-guix-extra-modules.drv /gnu/store/cxpkd0jkxapzbmg5vfmn4fy30yd7vlhm-guix-core-modules.drv 
/gnu/store/g3nqbybggh7dc2qd9gkj7swfmgmiigpp-guix-packages-modules.drv /gnu/store/nq9mzr00ny3nrsldvcq9r4va4fhb26sq-guix-packages.drv /gnu/store/i9dfjh5mf3r83447x7fa75hv1hnp9myv-guix-cli-modules.drv /gnu/store/xdsld8rhrawgngv93qx6lk9cgmql908c-guix-cli.drv /gnu/store/qnzm0a1gcwnfvkii7vk93rimzzp3mcf9-guix-system.drv /gnu/store/x2g2s3rhw0bv9qdp21yghgl2sij15dkr-guix-system-modules.drv /gnu/store/zandkjnylznvdj8jfsfgssvmvs6jfyph-guix-system-tests-modules.drv /gnu/store/45bs7y1h3gx2m7qry6r621klgkmv47wl-guix-system-tests.drv /gnu/store/caj7qjxvhrksk3jrkpsxqnx4kg7mlj9d-guix-daemon.drv /gnu/store/mah24wyy6bd51c27ww45hsqnjxhcn0yx-guix-manual.drv /gnu/store/002i672yl1192x7wvhkdbih94qffmcdk-guix-translated-texinfo.drv /gnu/store/5sf9aalvic81qlm06lq3a1pwdb2b3bm0-inferior-script.scm.drv /gnu/store/k612gnziiy3hn2dnrj33w4mw84kcnynm-profile.drv 11.7 MB will be downloaded [...] real 29m14.588s user 0m56.126s sys 0m1.032s --8<---------------cut here---------------end--------------->8--- --8<---------------cut here---------------start------------->8--- $ time guix time-machine --commit=7327295462 -- --version Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'... substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0% [...] building /gnu/store/6xq8vxpl51l8b3fz6sxpyspa1w5chbk9-module-import.drv... module-import 2KiB 43KiB/s 00:00 [##################] 100.0% module-import-compiled 1.5MiB 607KiB/s 00:03 [##################] 100.0% module-import-compiled 1.5MiB 606KiB/s 00:03 [##################] 100.0% building /gnu/store/pbv19dhrlqr2lnzphmydn4zrrdccghf2-compute-guix-derivation.drv... substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0% substitute: updating substitutes from 'https://ci.guix.gnu.org'... 
100.0% @ substituter-started /gnu/store/fbn395nfpbp4d4fr6jsbmwcx6n10kg16-python-minimal-3.8.2 substitute @ download-started /gnu/store/fbn395nfpbp4d4fr6jsbmwcx6n10kg16-python-minimal-3.8.2 https://ci.guix.gnu.org/nar/lzip/fbn395nfpbp4d4fr6jsbmwcx6n10kg16-python-minimal-3 [...] @ substituter-succeeded /gnu/store/fx0cdzzppd8jc09sianbq6gl1h7mxx3x-zziplib-0.13.72 substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0% The following derivations will be built: /gnu/store/8kmb40b2r3cx5zxpcrwa73a4lkaxjd9l-profile.drv /gnu/store/la6c7m2b6izy22vv8xpyvpz1ajyq72br-profile.drv /gnu/store/nljv2wnw0wqkyk0am8722gdwah3b0cx2-guix-732729546.drv /gnu/store/bmrx03y52d8dhhcpyf9i8j4zn2fg7pip-guix-command.drv /gnu/store/pg26n125c9bmvk4lxxp9ssd9havk89wc-guix-module-union.drv /gnu/store/mp0c2ad09axrq8zwhh3ycfk4a2mrgvm2-guix-732729546-modules.drv /gnu/store/0gx5jr1vgkdm5ajfcscladhzjx2gz5l2-guix-system-modules.drv /gnu/store/1sqwx35rn2qinlzib74zfjanxzzgmza3-guix-packages-base-modules.drv /gnu/store/6d64jyviixsjvfjgpxv3lyzq7l35y0f3-guix-config-modules.drv /gnu/store/5v24gh1pbqy6jzyl9x52wzxa6qprwl6v-guix-config-source.drv /gnu/store/pv6rskbpg8rzq3wi92m3iw7a2524r994-config.scm.drv /gnu/store/i6ncmiw6agrlq290drkg94scn12sv4v8-guix-config.drv /gnu/store/6l3l0qsq280pwmrnb0z2dfiyc7g5ff5i-guix-extra-modules.drv /gnu/store/hdy7q3xm8jalssc6jim2fk1wqgxqbfm9-guix-system-tests-modules.drv /gnu/store/ki86v53288fnx8sih3zlfi9qnjx2lzay-guix-packages-modules.drv /gnu/store/l63wxbqaq98nfjkq2cyshb6q8rjxjd6h-guix-core-modules.drv /gnu/store/b1qky9s98cfgp88xan9pqg0k9k0rlzrm-guix-core-source.drv /gnu/store/wmfssg7yyz2hrwanash7yk8f86faghlf-guix-cli-modules.drv /gnu/store/z1xxvp4hmffrpmbl0ll5y87w5pyfma9l-guix-daemon.drv /gnu/store/ms6pkrkggd0rl4fh5hfh20gcva7ryip5-inferior-script.scm.drv 28.7 MB will be downloaded [...] 
real	5m56.451s
user	3m52.055s
sys	0m1.738s
--8<---------------cut here---------------end--------------->8---

To be on the same wavelength,

--8<---------------cut here---------------start------------->8---
$ git log --format="%h %cd" --after=2021-03-14 --reverse | head -n16
[...]
2babf7d831 Sun Mar 14 19:16:55 2021 +0100
b15720182e Sun Mar 14 13:24:21 2021 -0500
207aa62e6b Sun Mar 14 13:24:21 2021 -0500
30f5381487 Sun Mar 14 13:24:21 2021 -0500
af25357b7d Sun Mar 14 13:24:21 2021 -0500
7164d2105a Sun Mar 14 13:24:21 2021 -0500
078f3288e2 Sun Mar 14 13:24:21 2021 -0500
5a31eb7d35 Sun Mar 14 13:24:21 2021 -0500
620206b680 Sun Mar 14 13:24:22 2021 -0500
b76762a9b7 Sun Mar 14 13:24:22 2021 -0500
cbfcbb79df Sun Mar 14 19:43:35 2021 +0100
--8<---------------cut here---------------end--------------->8---

and Cuirass builds only one of b15720182e, 207aa62e6b, 30f5381487,
af25357b7d, 7164d2105a, 078f3288e2, 5a31eb7d35, 620206b680 or
b76762a9b7.

Considering the Build Coordinator, it uses guix-commits and from my
understanding it reads:

<https://lists.gnu.org/archive/html/guix-commits/2021-03/msg01201.html>

therefore, b15720182e would be missed but not b76762a9b7, which would be
missed by Cuirass.

Neither Cuirass nor the Build Coordinator can build both commits
b15720182e and b76762a9b7.

Cuirass checks every 5 minutes and the Build Coordinator reads “state”
from guix-commits.  In other words, neither of them builds all these
“modular” derivations for all the commits, even for recent commits.

The rough estimate is that half of the commits are missed by both build
farms.  Therefore, using “guix time-machine” with a random commit, one
gets a 1/2 probability of building something just to get the inferior,
aside from the TTL policy.

(It is mitigated because both build farms use different strategies and
thus they do not miss the same commits. \o/)

>> With that number, after 180 days (6 months), it is hard to evaluate the
>> rate of the time-machine queries.  And from my experience (no numbers to
>> back it up), running time-machine on a commit older than these 180 days
>> implies building derivations.  Or it is a lucky day. :-)
>
> Right.
>
> So what can we do to address this issue?  I *think* we could use a
> higher TTL on berlin, and we can try that right away (9 months to begin
> with?).

I *think* the issue is not the TTL for question #1. :-)  The issue is
that neither build farm builds these “modular” derivations for all the
commits.  Here, I am focused on x86_64-linux, which is the case of
interest for such a topic (scientific context), IMHO.

Building every commit for all architectures is not affordable.

I agree that increasing the TTL will help for question #2 about
long-term support of substitutes.

> However, there is an upper bound anyway.  To make informed decisions on
> the retention policy, we should monitor storage space on berlin/bayfront
> to better estimate what can be done.  We have Zabbix but it’s not
> accessible from the outside; maybe we could graph storage space
> somewhere so people can grab the data and work on those estimates?

Based on the size of these derivations for one commit, we could do a
back-of-the-envelope extrapolation.  Well, question #1 seems doable
storage-wise.

The issue with #1 is building these derivations for all the commits.
IMHO.

About #2, yeah, if some data are available, I can try to make some
estimates.

Well, #1 seems actionable.  However, #2 raises…

> What if we decide that we need to provide substitutes for 2y old
> commits?  In that case, we need a plan to scale up.  That could be
> renting storage space somewhere.  That’s largely non-technical work that
> needs attention.

…a strong question. :-)  What do “we” do for what “we” build?

Indeed, numbers are missing to make informed decisions on long-term
storage of substitutes.  What is Nix doing?

> There are also technical tweaks that could help: distinguishing between
> “important” substitutes that we want to keep, and less important
> substitutes (how?); identifying “equivalence classes” for builds of a
> given package; etc.  The outcome is unclear and it’ll take time.

I agree it will take time. :-)

I think that having 2 build farms building in parallel is a strength.
So let’s exploit it. :-)  What one could have in mind is to challenge
the outputs; if they are identical, let’s keep only one version
“somewhere” and remove the other from the “elsewhere”.

For instance, we (I? with help) could resume this discussion:

<https://lists.gnu.org/archive/html/guix-devel/2020-10/msg00181.html>

Or maybe, for the identical outputs, one could imagine (dream of?) a
cooking service for missing outputs.  Well, I do not know how
actionable this is. :-)

Cheers,
simon
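
The “challenge the outputs” idea above could look roughly like the following. This is only a sketch in Python under loud assumptions: each farm is represented by a hypothetical mapping from store item name to content hash, and the item names and hashes below are invented. It does not call ‘guix challenge’ or any real build-farm API; it only shows the set logic for deciding which outputs are duplicated and could be kept in a single place.

```python
def compare_farms(berlin, bordeaux):
    """Classify store items by comparing two {item: content_hash} mappings."""
    common = berlin.keys() & bordeaux.keys()
    identical = {item for item in common if berlin[item] == bordeaux[item]}
    return {
        # Same hash on both farms: one copy would be enough.
        "identical": identical,
        # Built on both farms with different results: worth investigating.
        "differing": common - identical,
        "only_berlin": berlin.keys() - bordeaux.keys(),
        "only_bordeaux": bordeaux.keys() - berlin.keys(),
    }


# Invented data, for illustration only.
berlin = {"hello-2.10": "aaa", "scotch-6.1.1": "bbb", "guix-manual": "ccc"}
bordeaux = {"hello-2.10": "aaa", "scotch-6.1.1": "zzz", "gf2x-1.2": "ddd"}

report = compare_farms(berlin, bordeaux)
for category, items in sorted(report.items()):
    print(category, sorted(items))
```

Only the "identical" set is a candidate for deduplication; items that differ are exactly the ones where keeping both copies remains informative.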

* Re: Substitute retention
From: Ludovic Courtès @ 2021-10-15 9:27 UTC
To: zimoun
Cc: guix-devel

Hi!

zimoun <zimon.toutoune@gmail.com> skribis:

>>> missed by both build farms, which use 2 different strategies to
>>> collect the things to build (fetch every 5 minutes or fetch from
>>> guix-commits).  It is a quick back-of-the-envelope estimate, so take
>>> it with a grain of salt. :-)
>>
>> OK.
>
> To make #1 explicit, I was talking about the “modular” Guix, i.e.,
> running “guix pull” or “guix time-machine” leads to building the
> derivations module-import.drv, guix-<hash>.drv, guix-command.drv,
> guix-module-union.drv, guix-<hash>-modules.drv,
> guix-packages-modules.drv, guix-system-tests-modules.drv,
> guix-packages-base-modules.drv, etc.  On slow machines, it can be
> unpleasant, not to say impractical.  Even for recent commits.

Ah I see.  Yeah, this can be kinda annoying, and amplified by the fact
that CI only builds at each push, not at each commit.

That said, this is mitigated by the fact that one typically travels to a
previously-fetched commit, which is a commit that has been built by CI
rather than a commit in between two pushes.

> Basically, commit 59d10c3112 is from March 14, 2020 and it takes ~29min
> on my slow laptop.  And to compare apples to apples, let’s take another
> commit one year later, from March 14, 2021, e.g., commit 7327295462.  It
> takes ~5min on the same machine.

Yeah, OK.

> To be on the same wavelength,
>
> $ git log --format="%h %cd" --after=2021-03-14 --reverse | head -n16
> [...]
> 2babf7d831 Sun Mar 14 19:16:55 2021 +0100
> b15720182e Sun Mar 14 13:24:21 2021 -0500
> 207aa62e6b Sun Mar 14 13:24:21 2021 -0500
> 30f5381487 Sun Mar 14 13:24:21 2021 -0500
> af25357b7d Sun Mar 14 13:24:21 2021 -0500
> 7164d2105a Sun Mar 14 13:24:21 2021 -0500
> 078f3288e2 Sun Mar 14 13:24:21 2021 -0500
> 5a31eb7d35 Sun Mar 14 13:24:21 2021 -0500
> 620206b680 Sun Mar 14 13:24:22 2021 -0500
> b76762a9b7 Sun Mar 14 13:24:22 2021 -0500
> cbfcbb79df Sun Mar 14 19:43:35 2021 +0100
>
> and Cuirass builds only one of b15720182e, 207aa62e6b, 30f5381487,
> af25357b7d, 7164d2105a, 078f3288e2, 5a31eb7d35, 620206b680 or
> b76762a9b7.
>
> Considering the Build Coordinator, it uses guix-commits and from my
> understanding it reads:
>
> <https://lists.gnu.org/archive/html/guix-commits/2021-03/msg01201.html>
>
> therefore, b15720182e would be missed but not b76762a9b7, which would be
> missed by Cuirass.
>
> Neither Cuirass nor the Build Coordinator can build both commits
> b15720182e and b76762a9b7.
>
> Cuirass checks every 5 minutes and the Build Coordinator reads “state”
> from guix-commits.  In other words, neither of them builds all these
> “modular” derivations for all the commits, even for recent commits.
>
> The rough estimate is that half of the commits are missed by both build
> farms.  Therefore, using “guix time-machine” with a random commit, one
> gets a 1/2 probability of building something just to get the inferior,
> aside from the TTL policy.

Right.  Not every derivation produced by (guix self) needs to be rebuilt
in between two commits, but anything that depends on *package-modules*
typically has to be rebuilt.

We can reduce the amount of rebuilding like I did in commit
abd38dcee16f0ac71191527c38dcd3659111e2ba, but you’ll always have the big
(gnu packages …) derivation.

>> So what can we do to address this issue?  I *think* we could use a
>> higher TTL on berlin, and we can try that right away (9 months to begin
>> with?).
>
> I *think* the issue is not the TTL for question #1. :-)  The issue is
> that neither build farm builds these “modular” derivations for all the
> commits.  Here, I am focused on x86_64-linux, which is the case of
> interest for such a topic (scientific context), IMHO.
>
> Building every commit for all architectures is not affordable.
>
> I agree that increasing the TTL will help for question #2 about
> long-term support of substitutes.

Understood!

>> However, there is an upper bound anyway.  To make informed decisions on
>> the retention policy, we should monitor storage space on berlin/bayfront
>> to better estimate what can be done.  We have Zabbix but it’s not
>> accessible from the outside; maybe we could graph storage space
>> somewhere so people can grab the data and work on those estimates?
>
> Based on the size of these derivations for one commit, we could do a
> back-of-the-envelope extrapolation.  Well, question #1 seems doable
> storage-wise.
>
> The issue with #1 is building these derivations for all the commits.
> IMHO.
>
> About #2, yeah, if some data are available, I can try to make some
> estimates.
>
> Well, #1 seems actionable.  However, #2 raises…
>
>> What if we decide that we need to provide substitutes for 2y old
>> commits?  In that case, we need a plan to scale up.  That could be
>> renting storage space somewhere.  That’s largely non-technical work that
>> needs attention.
>
> …a strong question. :-)  What do “we” do for what “we” build?
>
> Indeed, numbers are missing to make informed decisions on long-term
> storage of substitutes.  What is Nix doing?

Nix, AFAIK, is doing like everyone else: pouring money on Amazon.  Last
I heard they’d retain substitutes basically indefinitely on Amazon S3
(incidentally, one motivation for them to work with Software Heritage,
AIUI, is that it would allow them to store less data on the storage they
pay for :-)).

For the record, berlin (aka ci.guix.gnu.org; it was donated by the Max
Delbrück Center, MDC, and is generously hosted by them) has a 37 TiB
disk for /gnu/store and “baked” substitutes.  That’s a lot.
Technically though, a lot of it is used by less important substitutes
such as disk images or intermediate ‘core-updates’ substitutes.  In the
end we seem to be filling it more quickly than you’d think!

Perhaps we need a better strategy with a low TTL for, say, intermediate
‘core-updates’ substitutes (no need to keep them more than a few weeks
if we know we’re doing a world rebuild right after).  It cannot be done
as things are though because ‘guix publish’ doesn’t distinguish between
store items.

Or we could restart the Amazon front-end that Chris Marusich had set up
right before 1.0 was released.

Or we could build our own front-end for substitute delivery as a proxy
to berlin, thereby distributing the burden.

Thoughts?

> I think that having 2 build farms building in parallel is a strength.
> So let’s exploit it. :-)  What one could have in mind is to challenge
> the outputs; if they are identical, let’s keep only one version
> “somewhere” and remove the other from the “elsewhere”.
>
> For instance, we (I? with help) could resume this discussion:
>
> <https://lists.gnu.org/archive/html/guix-devel/2020-10/msg00181.html>

I hadn’t seen this message, interesting!  Note however that
bordeaux.guix has a tenth of the storage space of berlin (3.6 TiB), so
right now we probably can’t count on it for long-term substitute
storage.

> Or maybe, for the identical outputs, one could imagine (dream of?) a
> cooking service for missing outputs.  Well, I do not know how
> actionable this is. :-)

Well, if we keep .drv around, we could arrange so that ‘guix publish’
rebuilds on-demand, after all.  I’m not sure how practical that would
be, though.

Ludo’.