bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020

all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
@ 2020-07-02  7:29 Ludovic Courtès
  2020-07-02  8:50 ` zimoun
                   ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Ludovic Courtès @ 2020-07-02  7:29 UTC (permalink / raw)
  To: 42162; +Cc: Maurice Brémond

[-- Attachment #1: Type: text/plain, Size: 2952 bytes --]

Hello!

The hosting site gforge.inria.fr will be taken off-line in December
2020.  This GForge instance hosts source code as tarballs, Subversion
repos, and Git repos.  Users have been invited to migrate to
gitlab.inria.fr, which is Git only.  It seems that Software Heritage
hasn’t archived (yet) all of gforge.inria.fr.  Let’s keep track of the
situation in this issue.

The following packages have their source on gforge.inria.fr:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,pp packages-on-gforge
$7 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
 #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
 #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>
 #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
 #<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640>
 #<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780>
 #<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0>
 #<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0>
 #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
 #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
 #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
--8<---------------cut here---------------end--------------->8---

‘isl’ (a dependency of GCC) has its source on gforge.inria.fr but it’s
also mirrored at gcc.gnu.org apparently.

Of these, the following are available on Software Heritage:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,pp archived-source
$8 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
 #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>
 #<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640>
 #<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780>
 #<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0>
 #<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0>
 #<package isl@0.18 gnu/packages/gcc.scm:925 7f632dc82320>
 #<package isl@0.11.1 gnu/packages/gcc.scm:939 7f632dc82280>)
--8<---------------cut here---------------end--------------->8---

So we’ll be missing these:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
$11 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
 #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
 #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
 #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
 #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
--8<---------------cut here---------------end--------------->8---

Attached the code I used for this.

Thanks,
Ludo’.


[-- Attachment #2: the code --]
[-- Type: text/plain, Size: 1284 bytes --]

(use-modules (guix) (gnu)
             (guix svn-download)
             (guix git-download)
             (guix swh)
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-26))

(define (gforge? package)
  (define (gforge-string? str)
    (string-contains str "gforge.inria.fr"))

  (match (package-source package)
    ((? origin? o)
     (match (origin-uri o)
       ((? string? url)
        (gforge-string? url))
       (((? string? urls) ...)
        (any gforge-string? urls))                ;or 'find'
       ((? git-reference? ref)
        (gforge-string? (git-reference-url ref)))
       ((? svn-reference? ref)
        (gforge-string? (svn-reference-url ref)))
       (_ #f)))
    (_ #f)))

(define packages-on-gforge
  (fold-packages (lambda (package result)
                   (if (gforge? package)
                       (cons package result)
                       result))
                 '()))

(define archived-source
  (filter (lambda (package)
            (let* ((origin (package-source package))
                   (hash  (origin-hash origin)))
              (lookup-content (content-hash-value hash)
                              (symbol->string
                               (content-hash-algorithm hash)))))
          packages-on-gforge))

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2020-07-02  7:29 bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Ludovic Courtès
@ 2020-07-02  8:50 ` zimoun
  2020-07-02 10:03   ` Ludovic Courtès
  2021-01-10 19:32 ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Maxim Cournoyer
       [not found] ` <handler.42162.D42162.16105343699609.notifdone@debbugs.gnu.org>
  2 siblings, 1 reply; 55+ messages in thread
From: zimoun @ 2020-07-02  8:50 UTC (permalink / raw)
  To: Ludovic Courtès, 42162; +Cc: Maurice Brémond

Hi Ludo,

On Thu, 02 Jul 2020 at 09:29, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> The hosting site gforge.inria.fr will be taken off-line in December
> 2020.  This GForge instance hosts source code as tarballs, Subversion
> repos, and Git repos.  Users have been invited to migrate to
> gitlab.inria.fr, which is Git only.  It seems that Software Heritage
> hasn’t archived (yet) all of gforge.inria.fr.  Let’s keep track of the
> situation in this issue.

[...]

> --8<---------------cut here---------------start------------->8---
> scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
> $11 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
>  #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
>  #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
>  #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
>  #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
> --8<---------------cut here---------------end--------------->8---

All the 5 are 'url-fetch' so we can expect that sources.json will be up
before the shutdown on December. :-)

Then, all the 14 packages we have from gforge.inria.fr will be
git-fetch, right?  So should we contact upstream to inform us when they
switch?  Then we can adapt the origin.

> (use-modules (guix) (gnu)
>              (guix svn-download)
>              (guix git-download)
>              (guix swh)

It does not work properly if I do not replace by

               ((guix swh) #:hide (origin?))

Well, I have no investigate further.

>              (ice-9 match)
>              (srfi srfi-1)
>              (srfi srfi-26))

[...]

> (define archived-source
>   (filter (lambda (package)
>             (let* ((origin (package-source package))
>                    (hash  (origin-hash origin)))
>               (lookup-content (content-hash-value hash)
>                               (symbol->string
>                                (content-hash-algorithm hash)))))
>           packages-on-gforge))

I am a bit lost about the other discussion on falling back for tarball.
But that's another story. :-)


Cheers,
simon




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2020-07-02  8:50 ` zimoun
@ 2020-07-02 10:03   ` Ludovic Courtès
  2020-07-11 15:50     ` bug#42162: Recovering source tarballs Ludovic Courtès
  0 siblings, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2020-07-02 10:03 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Maurice Brémond

zimoun <zimon.toutoune@gmail.com> skribis:

> On Thu, 02 Jul 2020 at 09:29, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
>
>> The hosting site gforge.inria.fr will be taken off-line in December
>> 2020.  This GForge instance hosts source code as tarballs, Subversion
>> repos, and Git repos.  Users have been invited to migrate to
>> gitlab.inria.fr, which is Git only.  It seems that Software Heritage
>> hasn’t archived (yet) all of gforge.inria.fr.  Let’s keep track of the
>> situation in this issue.
>
> [...]
>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
>> $11 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
>>  #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
>>  #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
>>  #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
>>  #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
>> --8<---------------cut here---------------end--------------->8---
>
> All the 5 are 'url-fetch' so we can expect that sources.json will be up
> before the shutdown on December. :-)

Unfortunately, it won’t help for tarballs:

  https://sympa.inria.fr/sympa/arc/swh-devel/2020-07/msg00001.html

There’s this other discussion you mentioned, which I hope will have a
positive outcome:

  https://forge.softwareheritage.org/T2430

>> (use-modules (guix) (gnu)
>>              (guix svn-download)
>>              (guix git-download)
>>              (guix swh)
>
> It does not work properly if I do not replace by
>
>                ((guix swh) #:hide (origin?))

Oh right, I had overlooked this as I played at the REPL.

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-02 10:03   ` Ludovic Courtès
@ 2020-07-11 15:50     ` Ludovic Courtès
  2020-07-13 19:20       ` Christopher Baines
                         ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Ludovic Courtès @ 2020-07-11 15:50 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Maurice Brémond

[-- Attachment #1: Type: text/plain, Size: 4874 bytes --]

Hi,

Ludovic Courtès <ludo@gnu.org> skribis:

> There’s this other discussion you mentioned, which I hope will have a
> positive outcome:
>
>   https://forge.softwareheritage.org/T2430

This discussion as well as discussions on #swh-devel have made it clear
that SWH will not archive raw tarballs, at least not in the foreseeable
future.  Instead, it will keep archiving the contents of tarballs, as it
has always done—that’s already a huge service.

Not storing raw tarballs makes sense from an engineering perspective,
but it does mean that we cannot rely on SWH as a content-addressed
mirror for tarballs.  (In fact, some raw tarballs are available on SWH,
but that’s mostly “by chance”, for instance because they appear as-is in
a Git repo that was ingested.)  In fact this is one of the challenges
mentioned in
<https://guix.gnu.org/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/>.

So we need a solution for now (and quite urgently), and a solution for
the future.

For the now, since 70% of our packages use ‘url-fetch’, we need to be
able to fetch or to reconstruct tarballs.  There’s no way around it.

In the short term, we should arrange so that the build farm keeps GC
roots on source tarballs for an indefinite amount of time.  Cuirass
jobset?  Mcron job to preserve GC roots?  Ideas?

For the future, we could store nar hashes of unpacked tarballs instead
of hashes over tarballs.  But that raises two questions:

  • If we no longer deal with tarballs but upstreams keep signing
    tarballs (not raw directory hashes), how can we authenticate our
    code after the fact?

  • SWH internally store Git-tree hashes, not nar hashes, so we still
    wouldn’t be able to fetch our unpacked trees from SWH.

(Both issues were previously discussed at
<https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)

So for the medium term, and perhaps for the future, a possible option
would be to preserve tarball metadata so we can reconstruct them:

  tarball = metadata + tree

After all, tarballs are byproducts and should be no exception: we should
build them from source.  :-)

In <https://forge.softwareheritage.org/T2430>, Stefano mentioned
pristine-tar, which does almost that, but not quite: it stores a binary
delta between a tarball and a tree:

  https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html

I think we should have something more transparent than a binary delta.

The code below can “disassemble” and “assemble” a tar.  When it
disassembles it, it generates metadata like this:

--8<---------------cut here---------------start------------->8---
(tar-source
  (version 0)
  (headers
    (("guile-3.0.4/"
      (mode 493)
      (size 0)
      (mtime 1593007723)
      (chksum 3979)
      (typeflag #\5))
     ("guile-3.0.4/m4/"
      (mode 493)
      (size 0)
      (mtime 1593007720)
      (chksum 4184)
      (typeflag #\5))
     ("guile-3.0.4/m4/pipe2.m4"
      (mode 420)
      (size 531)
      (mtime 1536050419)
      (chksum 4812)
      (hash (sha256
              "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
     ("guile-3.0.4/m4/time_h.m4"
      (mode 420)
      (size 5471)
      (mtime 1536050419)
      (chksum 4974)
      (hash (sha256
              "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
[…]
--8<---------------cut here---------------end--------------->8---

The ’assemble-archive’ procedure consumes that, looks up file contents
by hash on SWH, and reconstructs the original tarball…

… at least in theory, because in practice we hit the SWH rate limit
after looking up a few files:

  https://archive.softwareheritage.org/api/#rate-limiting

So it’s a bit ridiculous, but we may have to store a SWH “dir”
identifier for the whole extracted tree—a Git-tree hash—since that would
allow us to retrieve the whole thing in a single HTTP request.

Besides, we’ll also have to handle compression: storing gzip/xz headers
and compression levels.


How would we put that in practice?  Good question.  :-)

I think we’d have to maintain a database that maps tarball hashes to
metadata (!).  A simple version of it could be a Git repo where, say,
‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
contain the metadata above.  The nice thing is that the Git repo itself
could be archived by SWH.  :-)

Thus, if a tarball vanishes, we’d look it up in the database and
reconstruct it from its metadata plus content store in SWH.

Thoughts?

Anyhow, we should team up with fellow NixOS and SWH hackers to address
this, and with developers of other distros as well—this problem is not
just that of the functional deployment geeks, is it?

Ludo’.


[-- Attachment #2: the tar assembler/disassembler --]
[-- Type: text/plain, Size: 15660 bytes --]

;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2020 Ludovic Courtès <ludo@gnu.org>
;;;
;;; This file is part of GNU Guix.
;;;
;;; GNU Guix is free software; you can redistribute it and/or modify it
;;; under the terms of the GNU General Public License as published by
;;; the Free Software Foundation; either version 3 of the License, or (at
;;; your option) any later version.
;;;
;;; GNU Guix is distributed in the hope that it will be useful, but
;;; WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with GNU Guix.  If not, see <http://www.gnu.org/licenses/>.

(define-module (tar)
  #:use-module (ice-9 match)
  #:use-module (ice-9 binary-ports)
  #:use-module (rnrs bytevectors)
  #:use-module (srfi srfi-1)
  #:use-module (srfi srfi-9)
  #:use-module (srfi srfi-26)

  #:use-module (gcrypt hash)
  #:use-module (guix base16)
  #:use-module (guix base32)
  #:use-module ((ice-9 rdelim) #:select ((read-string . get-string-all)))
  #:use-module (web client)
  #:use-module (web response)
  #:export (disassemble-archive
            assemble-archive))

\f
;;;
;;; Tar.
;;;

(define %TMAGIC "ustar\0")
(define %TVERSION "00")

(define-syntax-rule (define-field-type type type-size read-proc write-proc)
  "Define TYPE as a ustar header field type of TYPE-SIZE bytes.  READ-PROC is
the procedure to obtain the value of an object of this type froma bytevector,
and WRITE-PROC writes it to a bytevector."
  (define-syntax type
    (syntax-rules (read write size)
      ((_ size)  type-size)
      ((_ read)  read-proc)
      ((_ write) write-proc))))

(define (sub-bytevector bv offset size)
  (let ((sub (make-bytevector size)))
    (bytevector-copy! bv offset sub 0 size)
    sub))

(define (read-integer bv offset len)
  (string->number (read-string bv offset len) 8))
(define read-integer12 (cut read-integer <> <> 12))
(define read-integer8  (cut read-integer <> <> 8))

(define (read-string bv offset max-len)
  (define len
    (let loop ((len 0))
      (cond ((= len max-len)
             len)
            ((zero? (bytevector-u8-ref bv (+ offset len)))
             len)
            (else
             (loop (+ 1 len))))))

  (utf8->string (sub-bytevector bv offset len)))
(define read-string155 (cut read-string <> <> 155))
(define read-string100 (cut read-string <> <> 100))
(define read-string32 (cut read-string <> <> 32))
(define read-string6 (cut read-string <> <> 6))
(define read-string2 (cut read-string <> <> 2))

(define (read-character bv offset)
  (integer->char (bytevector-u8-ref bv offset)))

(define (read-padding12 bv offset)
  (bytevector-uint-ref bv offset (endianness big) 12))

(define (write-integer! bv offset value len)
  (let ((str (string-pad (number->string value 8) (- len 1) #\0)))
    (write-string! bv offset str len)))
(define write-integer12! (cut write-integer! <> <> <> 12))
(define write-integer8!  (cut write-integer! <> <> <> 8))

(define (write-string! bv offset str len)
  (let* ((str (string-pad-right str len #\nul))
         (buf (string->utf8 str)))
    (bytevector-copy! buf 0 bv offset (bytevector-length buf))))

(define write-string155! (cut write-string! <> <> <> 155))
(define write-string100! (cut write-string! <> <> <> 100))
(define write-string32! (cut write-string! <> <> <> 32))
(define write-string6! (cut write-string! <> <> <> 6))
(define write-string2! (cut write-string! <> <> <> 2))

(define (write-character! bv offset value)
  (bytevector-u8-set! bv offset (char->integer value)))

(define (write-padding12! bv offset value)
  (bytevector-uint-set! bv offset value (endianness big) 12))

(define-field-type integer12     12 read-integer12    write-integer12!)
(define-field-type integer8       8 read-integer8     write-integer8!)
(define-field-type character      1 read-character    write-character!)
(define-field-type string155    155 read-string155    write-string155!)
(define-field-type string100    100 read-string100    write-string100!)
(define-field-type string32      32 read-string32     write-string32!)
(define-field-type string6        6 read-string6      write-string6!)
(define-field-type string2        2 read-string2      write-string2!)
(define-field-type padding12     12 read-padding12    write-padding12!)

(define-syntax define-pack
  (syntax-rules ()
    ((_ type ctor pred
        write-header read-header
        (field-names field-types field-getters) ...)
     (begin
       (define-record-type type
         (ctor field-names ...)
         pred
         (field-names field-getters) ...)

       (define (read-header port)
         "Return the ustar header read from PORT."
         (set-port-encoding! port "ISO-8859-1")
         (let ((bv (get-bytevector-n port (+ (field-types size) ...))))
           (letrec-syntax ((build
                            (syntax-rules ()
                              ((_ bv () offset (fields (... ...)))
                               (ctor fields (... ...)))
                              ((_ bv (type0 types (... ...))
                                  offset (fields (... ...)))
                               (build bv
                                      (types (... ...))
                                      (+ offset (type0 size))
                                      (fields (... ...)
                                              ((type0 read) bv offset)))))))
             (build bv (field-types ...) 0 ()))))

       (define (write-header header port)
         "Serialize HEADER, a <ustar-header> record, to PORT."
         (let* ((len (+ (field-types size) ...))
                (bv  (make-bytevector len)))
           (match header
             (($ type field-names ...)
              (letrec-syntax ((write!
                               (syntax-rules ()
                                 ((_ () offset)
                                  #t)
                                 ((_ ((type value) rest (... ...)) offset)
                                  (begin
                                    ((type write) bv offset value)
                                    (write! (rest (... ...))
                                            (+ offset (type size))))))))
                (write! ((field-types field-names) ...) 0)
                (put-bytevector port bv))))))))))

;; The ustar header.  See <tar.h>.
(define-pack <ustar-header>
  %make-ustar-header ustar-header?
  write-ustar-header read-ustar-header
  (name         string100 ustar-header-name)      ;NUL-terminated if NUL fits
  (mode		 integer8 ustar-header-mode)
  (uid		 integer8 ustar-header-uid)
  (gid		 integer8 ustar-header-gid)
  (size		integer12 ustar-header-size)
  (mtime	integer12 ustar-header-mtime)
  (chksum	 integer8 ustar-header-checksum)
  (typeflag	character ustar-header-type-flag)
  (linkname	string100 ustar-header-link-name)
  (magic	  string6 ustar-header-magic)     ;must be TMAGIC
  (version	  string2 ustar-header-version)   ;must be TVERSION
  (uname	 string32 ustar-header-uname)     ;NUL-terminated
  (gname	 string32 ustar-header-gname)     ;NUL-terminated
  (devmajor	 integer8 ustar-header-device-major)
  (devminor	 integer8 ustar-header-device-minor)
  (prefix	string155 ustar-header-prefix)    ;NUL-terminated if NUL fits
  (padding      padding12 ustar-header-padding))

(define* (make-ustar-header name
                            #:key
                            (mode 0) (uid 0) (gid 0) (size 0)
                            (mtime 0) (checksum 0) (type-flag 0)
                            (link-name "")
                            (magic %TMAGIC) (version %TVERSION)
                            (uname "") (gname "")
                            (device-major 0) (device-minor 0)
                            (prefix "") (padding 0))
  (%make-ustar-header name mode uid gid size mtime checksum
                      type-flag link-name magic version uname gname
                      device-major device-minor prefix padding))

(define %zero-header
  ;; The all-zeros header, which marks the end of stream.
  (read-ustar-header (open-bytevector-input-port
                      (make-bytevector 512 0))))

(define (consumer port)
  "Return a procedure that consumes or skips the given number of bytes from
PORT."
  (if (false-if-exception (seek port 0 SEEK_CUR))
      (lambda (len)
        (seek port len SEEK_CUR))
      (lambda (len)
        (define bv (make-bytevector 8192))
        (let loop ((len len))
          (define block (min len (bytevector-length bv)))
          (unless (or (zero? block)
                      (eof-object? (get-bytevector-n! port bv 0 block)))
            (loop (- len block)))))))

(define (fold-archive proc seed port)
  "Read ustar headers from PORT; for each header, call PROC."
  (define skip
    (consumer port))

  (let loop ((result seed))
    (define header
      (read-ustar-header port))

    (if (equal? header %zero-header)
        result
        (let* ((result    (proc header port result))
               (size      (ustar-header-size header))
               (remainder (modulo size 512)))
          ;; It's up to PROC to consume the SIZE bytes of data corresponding
          ;; to HEADER.  Here we consume padding.
          (unless (zero? remainder)
            (skip (- 512 remainder)))
          (loop result)))))

\f
;;;
;;; Disassembling/assembling an archive.
;;;

(define (dump in out size)
  "Copy SIZE bytes from IN to OUT."
  (define buf-size 65536)
  (define buf (make-bytevector buf-size))

  (let loop ((left size))
    (if (<= left 0)
        0
        (let ((read (get-bytevector-n! in buf 0 (min left buf-size))))
          (if (eof-object? read)
              left
              (begin
                (put-bytevector out buf 0 read)
                (loop (- left read))))))))

(define* (disassemble-archive port #:optional
                              (algorithm (hash-algorithm sha256)))
  "Read tar archive from PORT and return an sexp representing its metadata,
including individual file hashes with ALGORITHM."
  (define headers+hashes
    (fold-archive (lambda (header port result)
                    (if (zero? (ustar-header-size header))
                        (alist-cons header #f result)
                        (let ()
                          (define-values (hash-port get-hash)
                            (open-hash-port algorithm))

                          (dump port hash-port
                                (ustar-header-size header))
                          (close-port hash-port)
                          (alist-cons header (get-hash) result))))
                  '()
                  port))

  (define header+hash->sexp
    (match-lambda
      ((header . hash)
       (letrec-syntax ((serialize (syntax-rules ()
                                    ((_)
                                     '())
                                    ((_ (tag get default) rest ...)
                                     (let ((value (get header)))
                                       (append (if (equal? default value)
                                                   '()
                                                   `((tag ,value)))
                                               (serialize rest ...))))
                                    ((_ (tag get) rest ...)
                                     (append `((tag ,(get header)))
                                             (serialize rest ...))))))
         `(,(ustar-header-name header)
           ,@(serialize (mode ustar-header-mode)
                        (uid ustar-header-uid 0)
                        (gid ustar-header-gid 0)
                        (size ustar-header-size)
                        (mtime ustar-header-mtime)
                        (chksum ustar-header-checksum)
                        (typeflag ustar-header-type-flag #\nul)
                        (linkname ustar-header-link-name "")
                        (magic ustar-header-magic "")
                        (version ustar-header-version "")
                        (uname ustar-header-uname "")
                        (gname ustar-header-gname "")
                        (devmajor ustar-header-device-major 0)
                        (devminor ustar-header-device-minor 0)
                        (prefix ustar-header-prefix "")
                        (padding ustar-header-padding 0)

                        (hash (lambda (_)
                                (and
                                 hash
                                 `(,(hash-algorithm-name algorithm)
                                   ,(bytevector->base32-string hash))))
                              #f)))))))

  `(tar-source
    (version 0)
    (headers ,(map header+hash->sexp (reverse headers+hashes)))))

(define (fetch-from-swh algorithm hash)
  (define url
    (string-append "https://archive.softwareheritage.org/api/1/content/"
                   (symbol->string algorithm) ":"
                   (bytevector->base16-string hash) "/raw/"))

  (define-values (response port)
    (http-get url #:streaming? #t #:verify-certificate? #f))

  (if (= 200 (response-code response))
      port
      (throw 'swh-fetch-error url (get-string-all port))))

(define* (assemble-archive source port
                           #:optional (fetch-data fetch-from-swh))
  "Assemble archive from SOURCE, an sexp as returned by
'disassemble-archive'."
  (define sexp->header
    (match-lambda
      ((name . properties)
       (let ((ref (lambda (field)
                    (and=> (assq-ref properties field) car))))
         (make-ustar-header name
                            #:mode (ref 'mode)
                            #:uid (or (ref 'uid) 0)
                            #:gid (or (ref 'gid) 0)
                            #:size (ref 'size)
                            #:mtime (ref 'mtime)
                            #:checksum (ref 'chksum)
                            #:type-flag (or (ref 'typeflag) #\nul)
                            #:link-name (or (ref 'linkname) "")
                            #:magic (or (ref 'magic) "")
                            #:version (or (ref 'version) "")
                            #:uname (or (ref 'uname) "")
                            #:gname (or (ref 'gname) "")
                            #:device-major (or (ref 'devmajor) 0)
                            #:device-minor (or (ref 'devminor) 0)
                            #:prefix (or (ref 'prefix) "")
                            #:padding (or (ref 'padding) 0))))))

  (define sexp->data
    (match-lambda
      ((name . properties)
       (match (assq-ref properties 'hash)
         (((algorithm (= base32-string->bytevector hash)) _ ...)
          (fetch-data algorithm hash))
         (#f
          (open-input-string ""))))))

  (match source
    (('tar-source ('version 0) ('headers headers) _ ...)
     (for-each (lambda (sexp)
                 (let ((header (sexp->header sexp))
                       (data   (sexp->data sexp)))
                   (write-ustar-header header port)
                   (dump-port data port)
                   (close-port data)))
               headers))))

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-11 15:50     ` bug#42162: Recovering source tarballs Ludovic Courtès
@ 2020-07-13 19:20       ` Christopher Baines
  2020-07-20 21:27         ` zimoun
  2020-07-15 16:55       ` zimoun
  2020-07-30 17:36       ` Timothy Sample
  2 siblings, 1 reply; 55+ messages in thread
From: Christopher Baines @ 2020-07-13 19:20 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162, Maurice Brémond

[-- Attachment #1: Type: text/plain, Size: 2120 bytes --]


Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> There’s this other discussion you mentioned, which I hope will have a
>> positive outcome:
>>
>>   https://forge.softwareheritage.org/T2430
>
> This discussion as well as discussions on #swh-devel have made it clear
> that SWH will not archive raw tarballs, at least not in the foreseeable
> future.  Instead, it will keep archiving the contents of tarballs, as it
> has always done—that’s already a huge service.
>
> Not storing raw tarballs makes sense from an engineering perspective,
> but it does mean that we cannot rely on SWH as a content-addressed
> mirror for tarballs.  (In fact, some raw tarballs are available on SWH,
> but that’s mostly “by chance”, for instance because they appear as-is in
> a Git repo that was ingested.)  In fact this is one of the challenges
> mentioned in
> <https://guix.gnu.org/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/>.
>
> So we need a solution for now (and quite urgently), and a solution for
> the future.
>
> For the now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs.  There’s no way around it.
>
> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time.  Cuirass
> jobset?  Mcron job to preserve GC roots?  Ideas?

Going forward, being methodical as a project about storing the tarballs
and source material for the packages is probalby the way to ensure it's
available for the future. I'm not sure the data storage cost is
significant, the cost of doing this is probably in working out what to
store, doing so in a redundant manor, and making the data available.

The Guix Data Service knows about fixed output derivations, so it might
be possible to backfill such a store by just attempting to build those
derivations. It might also be possible to use the Guix Data Service to
work out what's available, and what tarballs are missing.

Chris

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 962 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-11 15:50     ` bug#42162: Recovering source tarballs Ludovic Courtès
  2020-07-13 19:20       ` Christopher Baines
@ 2020-07-15 16:55       ` zimoun
  2020-07-20  8:39         ` Ludovic Courtès
  2020-08-03 21:10         ` Ricardo Wurmus
  2020-07-30 17:36       ` Timothy Sample
  2 siblings, 2 replies; 55+ messages in thread
From: zimoun @ 2020-07-15 16:55 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162, Maurice Brémond

Hi Ludo,

Well, you enlarge the discussion to more than the issue of the 5
url-fetch packages on gforge.inria.fr :-)

First of all, you wrote [1] ``Migration away from tarballs is already
happening as more and more software is distributed straight from
content-addressed VCS repositories, though progress has been relatively
slow since we first discussed it in 2016.'' but on the other hand Guix
uses more than often [2] "url-fetch" even if "git-fetch" is available
upstream.  Other said, I am not convinced the migration is really
happening...

The issue would be mitigated if Guix transitions from "url-fetch" to
"git-fetch" when possible.

1: https://forge.softwareheritage.org/T2430#45800
2: https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html

Second, trying to do some stats about the SWH coverage, I note that
non-neglectible "url-fetch" are reachable by "lookup-content".  The
coverage is not straightforward because of the 120 request per hour rate
limit or unexpected server error.  Another story.

Well, I would like having numbers because I do not know what is
concretely the issue: how many "url-fetch" packages are reachable?  And
if they are unreachable, is it because they are not in yet? or is it
because Guix does not have enough info to lookup them?

On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:

> For the now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs.  There’s no way around it.

Yes, but for example all the packages in gnu/packages/bioconductor.scm
could be "git-fetch".  Today the source is over url-fetch but it could
be over git-fetch with https://git.bioconductor.org/packages/flowCore or
git@git.bioconductor.org:packages/flowCore.

Another example is the packages in gnu/packages/emacs-xyz.scm and the
ones from elpa.gnu.org are "url-fetch" and could be "git-fetch", for
example using
http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD

So I would be more reserved about the "no way around it". :-)  I mean
the 70% could be a bit mitigated.

> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time.  Cuirass
> jobset?  Mcron job to preserve GC roots?  Ideas?

Yes, preserving source tarballs for an indefinite amount of time will
help.  At least all the packages where "lookup-content" returns #f,
which means they are not in SWH or they are unreachable -- both is
equivalent from Guix side.

What about in addition push to IPFS?  Feasible?  Lookup issue?

> For the future, we could store nar hashes of unpacked tarballs instead
> of hashes over tarballs.  But that raises two questions:
>
>   • If we no longer deal with tarballs but upstreams keep signing
>     tarballs (not raw directory hashes), how can we authenticate our
>     code after the fact?

Does Guix automatically authenticate code using signed tarballs?

>   • SWH internally store Git-tree hashes, not nar hashes, so we still
>     wouldn’t be able to fetch our unpacked trees from SWH.
>
> (Both issues were previously discussed at
> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree

There is different issues at different levels:

 1. how to lookup? what information do we need to keep/store to be able
    to query SWH?
 2. how to check the integrity? what information do we need to
    keep/store to be able to verify that SWH returns what Guix expects?
 3. how to authenticate? where the tarball metadata has to be stored if
    SWH removes it?

Basically, the git-fetch source stores 3 identifiers:

 - upstream url
 - commit / tag
 - integrity (sha256)

Fetching from SWH requires the commit only (lookup-revision) or the
tag+url (lookup-origin-revision) then from the returned revision, the
integrity of the downloaded data is checked using the sha256, right?

Therefore, one way to fix lookup of the url-fetch source is to add an
extra field mimicking the commit role.

The easiest is to store a SWHID or an identifier allowing to deduce the
SWHID.

I have not checked the code, but something like this:

  https://pypi.org/project/swh.model/
  https://forge.softwareheritage.org/source/swh-model/

and at package time, this identifier is added, similarly to integrity.

Aside, does Guix use the authentication metadata that tarballs provide?

( BTW, I failed [3,4] to package swh.model so if someone wants to give a
try.
3: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00158.html
4: https://lists.gnu.org/archive/html/help-guix/2020-06/msg00161.html )

> After all, tarballs are byproducts and should be no exception: we should
> build them from source.  :-)

[...]

> The code below can “disassemble” and “assemble” a tar.  When it
> disassembles it, it generates metadata like this:

[...]

> The ’assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…

Where do you plan to store the "disassembled" metadata?
And where do you plan to "assemble-archive"?

I mean,

 What is pushed to SWH? And how?
 What is fetched from SWH? And how?

(Well, answer below. :-))

> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:

Yes, it is 120 request per hour and 10 save per hour.  Well, I do not
think they will increase much these numbers in general.  However,
they seem open for specific machines.  So, I do not want to speak for
them, but we could ask an higher rate limit for ci.guix.gnu.org for
example.  Then we need to distinguish between source substitutes and
binary substitutes.  And basically, when an user runs "guix build foo",
if the source is not available upstream nor already on ci.guix.gnu.org,
then ci.guix.gnu.org fetch the missing sources from SWH and delivers it
to the user.

>   https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree—a Git-tree hash—since that would
> allow us to retrieve the whole thing in a single HTTP request.

Well, the limited resources of SWH is an issue but SWH is not a mirror
but an archive. :-)

And as I wrote above, we could ask to SWH to increase the rate limit for
specific machine such as ci.guix.gnu.org

> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!).  A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above.  The nice thing is that the Git repo itself
> could be archived by SWH.  :-)

How this database that maps tarball hashes to metadata should be
maintained?  Git push hook?  Cron task?

What about foreign channels?  Should they maintain their own map?

To summary, it would work like this, right?

at package time:
 - store an integrity identiter (today sha256-nix-base32)
 - disassemble the tarball
 - commit to another repo the metadata using the path (address)
   sha256/base32/<identitier>
 - push to packages-repo *and* metadata-database-repo

at future time: (upstream has disappeared, say!)
 - use the integrity identifier to query the database repo
 - lookup the SWHID from the database repo
 - fetch the data from SWH
 - or lookup the IPFS identifier from the database repo and fetch the
   data from IPFS, for another example
 - re-assemble the tarball using the metadata from the database repo
 - check integrity, authentication, etc.

Well, right it is better than only adding an identifier for looking up
as I described above; because it is more general and flexible than only
SWH as fall-back.

The format of metadata (disassemble) that you propose is schemish
(obviously! :-)) but we could propose something more JSON-like.

All the best,
simon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-15 16:55       ` zimoun
@ 2020-07-20  8:39         ` Ludovic Courtès
  2020-07-20 15:52           ` zimoun
  2020-08-03 21:10         ` Ricardo Wurmus
  1 sibling, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2020-07-20  8:39 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Maurice Brémond

Hi!

There are many many comments in your message, so I took the liberty to
reply only to the essence of it.  :-)

zimoun <zimon.toutoune@gmail.com> skribis:

> On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> For the now, since 70% of our packages use ‘url-fetch’, we need to be
>> able to fetch or to reconstruct tarballs.  There’s no way around it.
>
> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could be "git-fetch".  Today the source is over url-fetch but it could
> be over git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@git.bioconductor.org:packages/flowCore.
>
> Another example is the packages in gnu/packages/emacs-xyz.scm and the
> ones from elpa.gnu.org are "url-fetch" and could be "git-fetch", for
> example using
> http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD
>
> So I would be more reserved about the "no way around it". :-)  I mean
> the 70% could be a bit mitigated.

The “no way around it” was about the situation today: it’s a fact that
70% of packages are built from tarballs, so we need to be able to fetch
them or reconstruct them.

However, the two examples above are good ideas as to the way forward: we
could start a url-fetch-to-git-fetch migration in these two cases, and
perhaps more.

>> In the short term, we should arrange so that the build farm keeps GC
>> roots on source tarballs for an indefinite amount of time.  Cuirass
>> jobset?  Mcron job to preserve GC roots?  Ideas?
>
> Yes, preserving source tarballs for an indefinite amount of time will
> help.  At least all the packages where "lookup-content" returns #f,
> which means they are not in SWH or they are unreachable -- both is
> equivalent from Guix side.
>
> What about in addition push to IPFS?  Feasible?  Lookup issue?

Lookup issue.  :-)  The hash in a CID is not just a raw blob hash.
Files are typically chunked beforehand, assembled as a Merkle tree, and
the CID is roughly the hash to the tree root.  So it would seem we can’t
use IPFS as-is for tarballs.

>> For the future, we could store nar hashes of unpacked tarballs instead
>> of hashes over tarballs.  But that raises two questions:
>>
>>   • If we no longer deal with tarballs but upstreams keep signing
>>     tarballs (not raw directory hashes), how can we authenticate our
>>     code after the fact?
>
> Does Guix automatically authenticate code using signed tarballs?

Not automatically; packagers are supposed to authenticate code when they
add a package (‘guix refresh -u’ does that automatically).

>>   • SWH internally store Git-tree hashes, not nar hashes, so we still
>>     wouldn’t be able to fetch our unpacked trees from SWH.
>>
>> (Both issues were previously discussed at
>> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>>
>> So for the medium term, and perhaps for the future, a possible option
>> would be to preserve tarball metadata so we can reconstruct them:
>>
>>   tarball = metadata + tree
>
> There is different issues at different levels:
>
>  1. how to lookup? what information do we need to keep/store to be able
>     to query SWH?
>  2. how to check the integrity? what information do we need to
>     keep/store to be able to verify that SWH returns what Guix expects?
>  3. how to authenticate? where the tarball metadata has to be stored if
>     SWH removes it?
>
> Basically, the git-fetch source stores 3 identifiers:
>
>  - upstream url
>  - commit / tag
>  - integrity (sha256)
>
> Fetching from SWH requires the commit only (lookup-revision) or the
> tag+url (lookup-origin-revision) then from the returned revision, the
> integrity of the downloaded data is checked using the sha256, right?

Yes.

> Therefore, one way to fix lookup of the url-fetch source is to add an
> extra field mimicking the commit role.

But today, we store tarball hashes, not directory hashes.

> The easiest is to store a SWHID or an identifier allowing to deduce the
> SWHID.
>
> I have not checked the code, but something like this:
>
>   https://pypi.org/project/swh.model/
>   https://forge.softwareheritage.org/source/swh-model/
>
> and at package time, this identifier is added, similarly to integrity.

I’m skeptical about adding a field that is practically never used.

[...]

>> The code below can “disassemble” and “assemble” a tar.  When it
>> disassembles it, it generates metadata like this:
>
> [...]
>
>> The ’assemble-archive’ procedure consumes that, looks up file contents
>> by hash on SWH, and reconstructs the original tarball…
>
> Where do you plan to store the "disassembled" metadata?
> And where do you plan to "assemble-archive"?

We’d have a repo/database containing metadata indexed by tarball sha256.

> How this database that maps tarball hashes to metadata should be
> maintained?  Git push hook?  Cron task?

Yes, something like that.  :-)

> What about foreign channels?  Should they maintain their own map?

Yes, presumably.

> To summary, it would work like this, right?
>
> at package time:
>  - store an integrity identiter (today sha256-nix-base32)
>  - disassemble the tarball
>  - commit to another repo the metadata using the path (address)
>    sha256/base32/<identitier>
>  - push to packages-repo *and* metadata-database-repo
>
> at future time: (upstream has disappeared, say!)
>  - use the integrity identifier to query the database repo
>  - lookup the SWHID from the database repo
>  - fetch the data from SWH
>  - or lookup the IPFS identifier from the database repo and fetch the
>    data from IPFS, for another example
>  - re-assemble the tarball using the metadata from the database repo
>  - check integrity, authentication, etc.

That’s the idea.

> The format of metadata (disassemble) that you propose is schemish
> (obviously! :-)) but we could propose something more JSON-like.

Sure, if that helps get other people on-board, why not (though sexps
have lived much longer than JSON and XML together :-)).

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-20  8:39         ` Ludovic Courtès
@ 2020-07-20 15:52           ` zimoun
  2020-07-20 17:05             ` Dr. Arne Babenhauserheide
  2020-07-21 21:22             ` Ludovic Courtès
  0 siblings, 2 replies; 55+ messages in thread
From: zimoun @ 2020-07-20 15:52 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162, Maurice Brémond

Hi,

On Mon, 20 Jul 2020 at 10:39, Ludovic Courtès <ludo@gnu.org> wrote:
> zimoun <zimon.toutoune@gmail.com> skribis:
> > On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:

> There are many many comments in your message, so I took the liberty to
> reply only to the essence of it.  :-)

Many comments because many open topics. ;-)

> However, the two examples above are good ideas as to the way forward: we
> could start a url-fetch-to-git-fetch migration in these two cases, and
> perhaps more.

Well, to be honest, I have tried to probe such migration when I opened
this thread:

https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html

and I have tried to summarized the pros/cons arguments here:

https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00448.html

> > What about in addition push to IPFS?  Feasible?  Lookup issue?
>
> Lookup issue.  :-)  The hash in a CID is not just a raw blob hash.
> Files are typically chunked beforehand, assembled as a Merkle tree, and
> the CID is roughly the hash to the tree root.  So it would seem we can’t
> use IPFS as-is for tarballs.

Using the Git-repo map/table, then it becomes an option, right?
Well, SWH would be a backend and IPFS could be another one.  Or any
"cloudy" storage system that could appear in the future, right?

> >>   • If we no longer deal with tarballs but upstreams keep signing
> >>     tarballs (not raw directory hashes), how can we authenticate our
> >>     code after the fact?
> >
> > Does Guix automatically authenticate code using signed tarballs?
>
> Not automatically; packagers are supposed to authenticate code when they
> add a package (‘guix refresh -u’ does that automatically).

So I miss the point of having this authentication information in the
future where upstream has disappeared.
The authentication is done at packaging time.  So once it is done,
merged into master and then pushed to SWH, being able to authenticate
again does not really matter.

And if it matters, all should be updated each time vulnerabilities are
discovered and so I am not sure SWH makes sense for this use-case.

> But today, we store tarball hashes, not directory hashes.

We store what "guix hash" returns. ;-)
So it is easy to migrate from tarball hashes to whatever else. :-)
I mean, it is "(sha256 (base32" and it is easy to have also
"(sha256-tree (base32" or something like that.

In the case where the integrity is also used as lookup key.

> > The format of metadata (disassemble) that you propose is schemish
> > (obviously! :-)) but we could propose something more JSON-like.
>
> Sure, if that helps get other people on-board, why not (though sexps
> have lived much longer than JSON and XML together :-)).

Lived much longer and still less less less used than JSON or XML alone. ;-)

I have not done yet the clear back-to-envelop computations.  Roughly,
there are ~23 commits on average per day updating packages, so say 70%
of them are url-fetch, it is ~16 new tarballs per day, on average.
How the model using a Git-repo will scale?  Because, naively the
output of "disassemble-archive" in full text (pretty-print format) for
the hello-2.10.tar is 120KB and so 16*365*120K = ~700Mb per year
without considering all the Git internals.  Obviously, it depends on
the number of files and I do not know if hello is a representative
example.

And I do not know how Git operates on binary files if the disassembled
tarball is stored as .go file, or any other.

All the best,
simon

ps:
Just if someone wants to check from where I estimate the numbers.

--8<---------------cut here---------------start------------->8---
for ci in $(git log --after=v1.0.0 --oneline \
                | grep "gnu:" | grep -E "(Add|Update)" \
                | cut -f1 -d' ')
do
    git --no-pager log -1 $ci --format="%cs"
done | uniq -c > /tmp/commits

guix environment --ad-hoc r-minimal \
     -- R -e 'summary(read.table("/tmp/commits"))'

gzip -dc < $(guix build -S hello) > /tmp/hello.tar
guix repl -L /tmp/tar/

scheme@(guix-user)> (call-with-input-file "hello.tar"
          (lambda (port)
                 (disassemble-archive port)))
--8<---------------cut here---------------end--------------->8---

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-20 15:52           ` zimoun
@ 2020-07-20 17:05             ` Dr. Arne Babenhauserheide
  2020-07-20 19:59               ` zimoun
  2020-07-21 21:22             ` Ludovic Courtès
  1 sibling, 1 reply; 55+ messages in thread
From: Dr. Arne Babenhauserheide @ 2020-07-20 17:05 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Maurice.Bremond

[-- Attachment #1: Type: text/plain, Size: 774 bytes --]


zimoun <zimon.toutoune@gmail.com> writes:
>> > The format of metadata (disassemble) that you propose is schemish
>> > (obviously! :-)) but we could propose something more JSON-like.
>>
>> Sure, if that helps get other people on-board, why not (though sexps
>> have lived much longer than JSON and XML together :-)).
>
> Lived much longer and still less less less used than JSON or XML alone. ;-)

Though this is likely not a function of the format, but of the
popularity of both Javascript and Java.

JSON isn’t a well defined format for arbitrary data (try to store
numbers as keys and reason about what you get as return-values), and
XML is a monster of complexity.

Best wishes,
Arne
-- 
Unpolitisch sein
heißt politisch sein
ohne es zu merken

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 1076 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-20 17:05             ` Dr. Arne Babenhauserheide
@ 2020-07-20 19:59               ` zimoun
  0 siblings, 0 replies; 55+ messages in thread
From: zimoun @ 2020-07-20 19:59 UTC (permalink / raw)
  To: Dr. Arne Babenhauserheide; +Cc: 42162, Maurice Brémond

On Mon, 20 Jul 2020 at 19:05, Dr. Arne Babenhauserheide <arne_bab@web.de> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:
> >> > The format of metadata (disassemble) that you propose is schemish
> >> > (obviously! :-)) but we could propose something more JSON-like.
> >>
> >> Sure, if that helps get other people on-board, why not (though sexps
> >> have lived much longer than JSON and XML together :-)).
> >
> > Lived much longer and still less less less used than JSON or XML alone. ;-)
>
> Though this is likely not a function of the format, but of the
> popularity of both Javascript and Java.

Well, the popularity matters to attract a broad audience and maybe get
other people on-board; if it is the aim.
It seems the de-facto format; even if JSON has flaws.  And zillions of
parsers for all the languages are floating around, which is not the
case for Sexp, even if it is easier to parse.

And JSON is already used in Guix, see [1] for an example.

1: https://guix.gnu.org/manual/devel/en/guix.html#Additional-Build-Options

However, I am not convinced that JSON or similarly Sexp will scale
well for a Tarball Heritage perspective.

All the best,
simon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-13 19:20       ` Christopher Baines
@ 2020-07-20 21:27         ` zimoun
  0 siblings, 0 replies; 55+ messages in thread
From: zimoun @ 2020-07-20 21:27 UTC (permalink / raw)
  To: Christopher Baines, Ludovic Courtès; +Cc: 42162, Maurice Brémond

Hi Chris,

On Mon, 13 Jul 2020 at 20:20, Christopher Baines <mail@cbaines.net> wrote:

> Going forward, being methodical as a project about storing the tarballs
> and source material for the packages is probalby the way to ensure it's
> available for the future. I'm not sure the data storage cost is
> significant, the cost of doing this is probably in working out what to
> store, doing so in a redundant manor, and making the data available.

A really rough estimate is 120KB on average* per raw tarball.  So if we
consider 14000 packages and 70% of them are url-fetch, then it leads to
14k*0.7*120K= 1.2GB; which is not significant.  Moreover, if we
extrapolate the numbers, between v1.0.0 and now it is 23 commits per day
modifying gnu/packages/ so 0.7*23*120K*365= 700MB per year.  However,
the 120KB of metadata to re-assemble the tarball have to be compared to
the 712KB of raw compressed tarball; both about the hello package.

*based on the hello package.  And it depends on the number of files in
 the tarball.  File stored not compressed: plain sexp.

Therefore, in addition to what to store, redundancy and availability,
one question is how to store?  Git-repo? SQL database? etc.

> The Guix Data Service knows about fixed output derivations, so it might
> be possible to backfill such a store by just attempting to build those
> derivations. It might also be possible to use the Guix Data Service to
> work out what's available, and what tarballs are missing.

Missing from where?  The substitutes farm or SWH?

Cheers,
simon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-20 15:52           ` zimoun
  2020-07-20 17:05             ` Dr. Arne Babenhauserheide
@ 2020-07-21 21:22             ` Ludovic Courtès
  2020-07-22  0:27               ` zimoun
  1 sibling, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2020-07-21 21:22 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Maurice Brémond

Hi!

zimoun <zimon.toutoune@gmail.com> skribis:

> On Mon, 20 Jul 2020 at 10:39, Ludovic Courtès <ludo@gnu.org> wrote:
>> zimoun <zimon.toutoune@gmail.com> skribis:
>> > On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> There are many many comments in your message, so I took the liberty to
>> reply only to the essence of it.  :-)
>
> Many comments because many open topics. ;-)

Understood, and they’re very valuable but (1) I choose not to just do
email :-), and (2) I like to separate issues in reasonable chunks rather
than long threads addressing all the problems we’ll have to deal with.

I think it really helps keep things tractable!

>> Lookup issue.  :-)  The hash in a CID is not just a raw blob hash.
>> Files are typically chunked beforehand, assembled as a Merkle tree, and
>> the CID is roughly the hash to the tree root.  So it would seem we can’t
>> use IPFS as-is for tarballs.
>
> Using the Git-repo map/table, then it becomes an option, right?
> Well, SWH would be a backend and IPFS could be another one.  Or any
> "cloudy" storage system that could appear in the future, right?

Sure, why not.

>> >>   • If we no longer deal with tarballs but upstreams keep signing
>> >>     tarballs (not raw directory hashes), how can we authenticate our
>> >>     code after the fact?
>> >
>> > Does Guix automatically authenticate code using signed tarballs?
>>
>> Not automatically; packagers are supposed to authenticate code when they
>> add a package (‘guix refresh -u’ does that automatically).
>
> So I miss the point of having this authentication information in the
> future where upstream has disappeared.

What I meant above, is that often, what we have is things like detached
signatures of raw tarballs, or documents referring to a tarball hash:

  https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html

>> But today, we store tarball hashes, not directory hashes.
>
> We store what "guix hash" returns. ;-)
> So it is easy to migrate from tarball hashes to whatever else. :-)

True, but that other thing, as it stands, would be a nar hash (like for
‘git-fetch’), not a Git-tree hash (what SWH uses).

> I mean, it is "(sha256 (base32" and it is easy to have also
> "(sha256-tree (base32" or something like that.

Right, but that first and foremost requires daemon support.

It’s doable, but migration would have to take a long time, since this is
touching core parts of the “protocol”.

> I have not done yet the clear back-to-envelop computations.  Roughly,
> there are ~23 commits on average per day updating packages, so say 70%
> of them are url-fetch, it is ~16 new tarballs per day, on average.
> How the model using a Git-repo will scale?  Because, naively the
> output of "disassemble-archive" in full text (pretty-print format) for
> the hello-2.10.tar is 120KB and so 16*365*120K = ~700Mb per year
> without considering all the Git internals.  Obviously, it depends on
> the number of files and I do not know if hello is a representative
> example.

Interesting, thanks for making that calculation!  We could make the
format more compact if needed.

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-21 21:22             ` Ludovic Courtès
@ 2020-07-22  0:27               ` zimoun
  2020-07-22 10:28                 ` Ludovic Courtès
  0 siblings, 1 reply; 55+ messages in thread
From: zimoun @ 2020-07-22  0:27 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162, Maurice Brémond

Hi!

On Tue, 21 Jul 2020 at 23:22, Ludovic Courtès <ludo@gnu.org> wrote:

>>> >>   • If we no longer deal with tarballs but upstreams keep signing
>>> >>     tarballs (not raw directory hashes), how can we authenticate our
>>> >>     code after the fact?
>>> >
>>> > Does Guix automatically authenticate code using signed tarballs?
>>>
>>> Not automatically; packagers are supposed to authenticate code when they
>>> add a package (‘guix refresh -u’ does that automatically).
>>
>> So I miss the point of having this authentication information in the
>> future where upstream has disappeared.
>
> What I meant above, is that often, what we have is things like detached
> signatures of raw tarballs, or documents referring to a tarball hash:
>
>   https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html

I still miss why it matters to store detached signature of raw tarballs.

The authentication is done now (at package time and/or inclusion in the
lookup table proposal).  I miss why we would have to re-authenticate
again later.

IMHO, having a lookup table that returns the signatures from a tarball
hash or an archive of all the OpenGPG keys ever published is another
topic.


>>> But today, we store tarball hashes, not directory hashes.
>>
>> We store what "guix hash" returns. ;-)
>> So it is easy to migrate from tarball hashes to whatever else. :-)
>
> True, but that other thing, as it stands, would be a nar hash (like for
> ‘git-fetch’), not a Git-tree hash (what SWH uses).

Ok, now I am totally convinced that a lookup table is The Right Thing™. :-)

>> I mean, it is "(sha256 (base32" and it is easy to have also
>> "(sha256-tree (base32" or something like that.
>
> Right, but that first and foremost requires daemon support.
>
> It’s doable, but migration would have to take a long time, since this is
> touching core parts of the “protocol”.

Doable but not necessary tractable. :-)


>> I have not done yet the clear back-to-envelop computations.  Roughly,
>> there are ~23 commits on average per day updating packages, so say 70%
>> of them are url-fetch, it is ~16 new tarballs per day, on average.
>> How the model using a Git-repo will scale?  Because, naively the
>> output of "disassemble-archive" in full text (pretty-print format) for
>> the hello-2.10.tar is 120KB and so 16*365*120K = ~700Mb per year
>> without considering all the Git internals.  Obviously, it depends on
>> the number of files and I do not know if hello is a representative
>> example.
>
> Interesting, thanks for making that calculation!  We could make the
> format more compact if needed.

Compressing should help.

Considering 14000 packages, based on this 120KB estimation, it leads to:
0.7*14k*120K= ~1.2GB for the Git-repo of the current Guix.

Cheers,
simon





^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-22  0:27               ` zimoun
@ 2020-07-22 10:28                 ` Ludovic Courtès
  0 siblings, 0 replies; 55+ messages in thread
From: Ludovic Courtès @ 2020-07-22 10:28 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Maurice Brémond

Hello!

zimoun <zimon.toutoune@gmail.com> skribis:

> On Tue, 21 Jul 2020 at 23:22, Ludovic Courtès <ludo@gnu.org> wrote:
>
>>>> >>   • If we no longer deal with tarballs but upstreams keep signing
>>>> >>     tarballs (not raw directory hashes), how can we authenticate our
>>>> >>     code after the fact?
>>>> >
>>>> > Does Guix automatically authenticate code using signed tarballs?
>>>>
>>>> Not automatically; packagers are supposed to authenticate code when they
>>>> add a package (‘guix refresh -u’ does that automatically).
>>>
>>> So I miss the point of having this authentication information in the
>>> future where upstream has disappeared.
>>
>> What I meant above, is that often, what we have is things like detached
>> signatures of raw tarballs, or documents referring to a tarball hash:
>>
>>   https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html
>
> I still miss why it matters to store detached signature of raw tarballs.

I’m not saying we (Guix) should store signatures; I’m just saying that
developers typically sign raw tarballs.  It’s a general statement to
explain why storing or being able to reconstruct tarballs matters.

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-11 15:50     ` bug#42162: Recovering source tarballs Ludovic Courtès
  2020-07-13 19:20       ` Christopher Baines
  2020-07-15 16:55       ` zimoun
@ 2020-07-30 17:36       ` Timothy Sample
  2020-07-31 14:41         ` Ludovic Courtès
  2020-08-26 10:04         ` bug#42162: Recovering source tarballs zimoun
  2 siblings, 2 replies; 55+ messages in thread
From: Timothy Sample @ 2020-07-30 17:36 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162, Maurice Brémond

Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Ludovic Courtès <ludo@gnu.org> skribis:
>
> [...]
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree
>
> After all, tarballs are byproducts and should be no exception: we should
> build them from source.  :-)
>
> In <https://forge.softwareheritage.org/T2430>, Stefano mentioned
> pristine-tar, which does almost that, but not quite: it stores a binary
> delta between a tarball and a tree:
>
>   https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html
>
> I think we should have something more transparent than a binary delta.
>
> The code below can “disassemble” and “assemble” a tar.  When it
> disassembles it, it generates metadata like this:
>
> (tar-source
>   (version 0)
>   (headers
>     (("guile-3.0.4/"
>       (mode 493)
>       (size 0)
>       (mtime 1593007723)
>       (chksum 3979)
>       (typeflag #\5))
>      ("guile-3.0.4/m4/"
>       (mode 493)
>       (size 0)
>       (mtime 1593007720)
>       (chksum 4184)
>       (typeflag #\5))
>      ("guile-3.0.4/m4/pipe2.m4"
>       (mode 420)
>       (size 531)
>       (mtime 1536050419)
>       (chksum 4812)
>       (hash (sha256
>               "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
>      ("guile-3.0.4/m4/time_h.m4"
>       (mode 420)
>       (size 5471)
>       (mtime 1536050419)
>       (chksum 4974)
>       (hash (sha256
>               "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
> […]
>
> The ’assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…
>
> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:
>
>   https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree—a Git-tree hash—since that would
> allow us to retrieve the whole thing in a single HTTP request.
>
> Besides, we’ll also have to handle compression: storing gzip/xz headers
> and compression levels.

This jumped out at me because I have been working with compression and
tarballs for the bootstrapping effort.  I started pulling some threads
and doing some research, and ended up prototyping an end-to-end solution
for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata,
and an SWH directory ID.  It can even put them back together!  :)  There
are a bunch of problems still, but I think this project is doable in the
short-term.  I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and
found and fixed a bunch of little gaffes.  There’s a ton of work to do,
of course, but here’s another small step.

I call the thing “Disarchive” as in “disassemble a source code archive”.
You can find it at <https://git.ngyro.com/disarchive/>.  It has a simple
command-line interface so you can do

    $ disarchive save software-1.0.tar.gz

which serializes a disassembled version of “software-1.0.tar.gz” to the
database (which is just a directory) specified by the “DISARCHIVE_DB”
environment variable.  Next, you can run

    $ disarchive load hash-of-something-in-the-db

which will recover an original file from its metadata (stored in the
database) and data retrieved from the SWH archive or taken from a cache
(again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Now some implementation details.  The way I’ve set it up is that all of
the assembly happens through Guix.  Each step in recreating a compressed
tarball is a fixed-output derivation: the download from SWH, the
creation of the tarball, and the compression.  I wanted an easy way to
build and verify things according to a dependency graph without writing
any code.  Hi Guix Daemon!  I’m not sure if this is a good long-term
approach, though.  It could work well for reproducibility, but it might
be easier to let some external service drive my code as a Guix package.
Either way, it was an easy way to get started.

For disassembly, it takes a Gzip file (containing a single member) and
breaks it down like this:

    (gzip-member
      (version 0)
      (name "hungrycat-0.4.1.tar.gz")
      (input (sha256
               "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
      (header
        (mtime 0)
        (extra-flags 2)
        (os 3))
      (footer
        (crc 3863610951)
        (isize 194560))
      (compressor gnu-best)
      (digest
        (sha256
          "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))

The header and footer are read directly from the file.  Finding the
compressor is harder.  I followed the approach taken by the pristine-tar
project.  That is, try a bunch of compressors and hope for a match.
Currently, I have:

    • gnu-best
    • gnu-best-rsync
    • gnu
    • gnu-rsync
    • gnu-fast
    • gnu-fast-rsync
    • zlib-best
    • zlib
    • zlib-fast
    • zlib-best-perl
    • zlib-perl
    • zlib-fast-perl
    • gnu-best-rsync-1.4
    • gnu-rsync-1.4
    • gnu-fast-rsync-1.4

This list is inspired by pristine-tar.  The first couple GNU compressors
use modern Gzip from Guix.  The zlib and rsync-1.4 ones use the Gzip and
zlib wrapper from pristine-tar called “zgz”.  The 100 Gzip files I
looked at use “gnu”, “gnu-best”, “gnu-best-rsync-1.4”, “zlib”,
“zlib-best”, and “zlib-fast-perl”.

(As an aside, I had a way to decompose multi-member Gzip files, but it
was much, much slower.  Since I doubt they exist in the wild, I removed
that code.)

The “input” field likely points to a tarball, which looks like this:

    (tarball
      (version 0)
      (name "hungrycat-0.4.1.tar")
      (input (sha256
               "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
      (default-header)
      (headers
        ((name "hungrycat-0.4.1/")
         (mode 493)
         (mtime 1513360022)
         (chksum 5058)
         (typeflag 53))
        ((name "hungrycat-0.4.1/configure")
         (mode 493)
         (size 130263)
         (mtime 1513360022)
         (chksum 6043))
        ...)
      (padding 3584)
      (digest
        (sha256
          "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))

Originally, I used your code, but I ran into some problems.  Namely,
real tarballs are not well-behaved.  I wrote new code to keep track of
subtle things like the formatting of the octal values.  Even though they
are not well-behaved, they are usually self-consistent, so I introduced
the “default-header” field to set default values for all headers.  Any
omitted fields in the headers use the value from the default header, and
the default header takes defaults from a “default default header”
defined in the code.  Here’s a default header from a different tarball:

    (default-header
      (uid 1199)
      (gid 30)
      (magic "ustar ")
      (version " \x00")
      (uname "cagordon")
      (gname "lhea")
      (devmajor-format (width 0))
      (devminor-format (width 0)))

These default values are computed to minimize the noise in the
serialized form.  Here we see for example that each header should have
UID 1199 unless otherwise specified.  We also see that the device fields
should be null strings instead of octal zeros.  Another good example
here is that the magic field has a space after “ustar”, which is not
what modern POSIX says to do.

My tarball reader has minimal support for extended headers, but they are
not serialized cleanly (they survive the round-trip, but they are not
human-readable).

Finally, the “input” field here points to an “swh-directory” object.  It
looks like this:

    (swh-directory
      (version 0)
      (name "hungrycat-0.4.1")
      (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
      (digest
        (sha256
          "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))

I have a little module for computing the directory hash like SWH does
(which is in-turn like what Git does).  I did not verify that the 100
packages where in the SWH archive.  I did verify a couple of packages,
but I hit the rate limit and decided to avoid it for now.

To avoid hitting the SWH archive at all, I introduced a directory cache
so that I can store the directories locally.  If the directory cache is
available, directories are stored and retrieved from it.

> How would we put that in practice?  Good question.  :-)
>
> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!).  A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above.  The nice thing is that the Git repo itself
> could be archived by SWH.  :-)

You mean like <https://git.ngyro.com/disarchive-db/>?  :)

This was generated by a little script built on top of “fold-packages”.
It downloads Gzip’d tarballs used by Guix packages and passes them on to
Disarchive for disassembly.  I limited the number to 100 because it’s
slow and because I’m sure there is a long tail of weird software
archives that are going to be hard to process.  The metadata directory
ended up being 13M and the directory cache 2G.

> Thus, if a tarball vanishes, we’d look it up in the database and
> reconstruct it from its metadata plus content store in SWH.
>
> Thoughts?

Obviously I like the idea.  ;)

Even with the code I have so far, I have a lot of questions.  Mainly I’m
worried about keeping everything working into the future.  It would be
easy to make incompatible changes.  A lot of care would have to be
taken.  Of course, keeping a Guix commit and a Disarchive commit might
be enough to make any assembling reproducible, but there’s a
chicken-and-egg problem there.  What if a tarball from the closure of
one the derivations is missing?  I guess you could work around it, but
it would be tricky.

> Anyhow, we should team up with fellow NixOS and SWH hackers to address
> this, and with developers of other distros as well—this problem is not
> just that of the functional deployment geeks, is it?

I could remove most of the Guix stuff so that it would be easy to
package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
that consumes a “sources.json” file, adds the sources to a Disarchive
database, and pushes everything to a Git repo.  I guess everyone who
cares has to produce a “sources.json” file anyway, so it will be very
little extra work.  Other stuff like changing the serialization format
to JSON would be pretty easy, too.  I’m not well connected to these
other projects, mind you, so I’m not really sure how to reach out.

Sorry about the big mess of code and ideas – I realize I may have taken
the “do-ocracy” approach a little far here.  :)  Even if this is not
“the” solution, hopefully it’s useful for discussion!

-- Tim

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-30 17:36       ` Timothy Sample
@ 2020-07-31 14:41         ` Ludovic Courtès
  2020-08-03 16:59           ` Timothy Sample
  2020-08-26 10:04         ` bug#42162: Recovering source tarballs zimoun
  1 sibling, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2020-07-31 14:41 UTC (permalink / raw)
  To: Timothy Sample; +Cc: 42162, Maurice Brémond

Hi Timothy!

Timothy Sample <samplet@ngyro.com> skribis:

> This jumped out at me because I have been working with compression and
> tarballs for the bootstrapping effort.  I started pulling some threads
> and doing some research, and ended up prototyping an end-to-end solution
> for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata,
> and an SWH directory ID.  It can even put them back together!  :)  There
> are a bunch of problems still, but I think this project is doable in the
> short-term.  I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and
> found and fixed a bunch of little gaffes.  There’s a ton of work to do,
> of course, but here’s another small step.
>
> I call the thing “Disarchive” as in “disassemble a source code archive”.
> You can find it at <https://git.ngyro.com/disarchive/>.  It has a simple
> command-line interface so you can do
>
>     $ disarchive save software-1.0.tar.gz
>
> which serializes a disassembled version of “software-1.0.tar.gz” to the
> database (which is just a directory) specified by the “DISARCHIVE_DB”
> environment variable.  Next, you can run
>
>     $ disarchive load hash-of-something-in-the-db
>
> which will recover an original file from its metadata (stored in the
> database) and data retrieved from the SWH archive or taken from a cache
> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Wooohoo!  Is it that time of the year when people give presents to one
another?  I can’t believe it.  :-)

> Now some implementation details.  The way I’ve set it up is that all of
> the assembly happens through Guix.  Each step in recreating a compressed
> tarball is a fixed-output derivation: the download from SWH, the
> creation of the tarball, and the compression.  I wanted an easy way to
> build and verify things according to a dependency graph without writing
> any code.  Hi Guix Daemon!  I’m not sure if this is a good long-term
> approach, though.  It could work well for reproducibility, but it might
> be easier to let some external service drive my code as a Guix package.
> Either way, it was an easy way to get started.
>
> For disassembly, it takes a Gzip file (containing a single member) and
> breaks it down like this:
>
>     (gzip-member
>       (version 0)
>       (name "hungrycat-0.4.1.tar.gz")
>       (input (sha256
>                "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
>       (header
>         (mtime 0)
>         (extra-flags 2)
>         (os 3))
>       (footer
>         (crc 3863610951)
>         (isize 194560))
>       (compressor gnu-best)
>       (digest
>         (sha256
>           "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))

Awesome.

> The header and footer are read directly from the file.  Finding the
> compressor is harder.  I followed the approach taken by the pristine-tar
> project.  That is, try a bunch of compressors and hope for a match.
> Currently, I have:
>
>     • gnu-best
>     • gnu-best-rsync
>     • gnu
>     • gnu-rsync
>     • gnu-fast
>     • gnu-fast-rsync
>     • zlib-best
>     • zlib
>     • zlib-fast
>     • zlib-best-perl
>     • zlib-perl
>     • zlib-fast-perl
>     • gnu-best-rsync-1.4
>     • gnu-rsync-1.4
>     • gnu-fast-rsync-1.4

I would have used the integers that zlib supports, but I guess that
doesn’t capture this whole gamut of compression setups.  And yeah, it’s
not great that we actually have to try and find the right compression
levels, but there’s no way around it it seems, and as you write, we can
expect a couple of variants to be the most commonly used ones.

> The “input” field likely points to a tarball, which looks like this:
>
>     (tarball
>       (version 0)
>       (name "hungrycat-0.4.1.tar")
>       (input (sha256
>                "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
>       (default-header)
>       (headers
>         ((name "hungrycat-0.4.1/")
>          (mode 493)
>          (mtime 1513360022)
>          (chksum 5058)
>          (typeflag 53))
>         ((name "hungrycat-0.4.1/configure")
>          (mode 493)
>          (size 130263)
>          (mtime 1513360022)
>          (chksum 6043))
>         ...)
>       (padding 3584)
>       (digest
>         (sha256
>           "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))
>
> Originally, I used your code, but I ran into some problems.  Namely,
> real tarballs are not well-behaved.  I wrote new code to keep track of
> subtle things like the formatting of the octal values.

Yeah I guess I was too optimistic.  :-)  I wanted to have the
serialization/deserialization code automatically generated by that
macro, but yeah, it doesn’t capture enough details for real-world
tarballs.

Do you know how frequently you get “weird” tarballs?  I was thinking
about having something that works for plain GNU tar, but it’s even
better to have something that works with “unusual” tarballs!

(BTW the code I posted or the one in Disarchive could perhaps replace
the one in Gash-Utils.  I was frustrated to not see a ‘fold-archive’
procedure there, notably.)

> Even though they are not well-behaved, they are usually
> self-consistent, so I introduced the “default-header” field to set
> default values for all headers.  Any omitted fields in the headers use
> the value from the default header, and the default header takes
> defaults from a “default default header” defined in the code.  Here’s
> a default header from a different tarball:
>
>     (default-header
>       (uid 1199)
>       (gid 30)
>       (magic "ustar ")
>       (version " \x00")
>       (uname "cagordon")
>       (gname "lhea")
>       (devmajor-format (width 0))
>       (devminor-format (width 0)))

Very nice.

> Finally, the “input” field here points to an “swh-directory” object.  It
> looks like this:
>
>     (swh-directory
>       (version 0)
>       (name "hungrycat-0.4.1")
>       (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
>       (digest
>         (sha256
>           "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))

Yay!

> I have a little module for computing the directory hash like SWH does
> (which is in-turn like what Git does).  I did not verify that the 100
> packages where in the SWH archive.  I did verify a couple of packages,
> but I hit the rate limit and decided to avoid it for now.
>
> To avoid hitting the SWH archive at all, I introduced a directory cache
> so that I can store the directories locally.  If the directory cache is
> available, directories are stored and retrieved from it.

I guess we can get back to them eventually to estimate our coverage ratio.

>> I think we’d have to maintain a database that maps tarball hashes to
>> metadata (!).  A simple version of it could be a Git repo where, say,
>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
>> contain the metadata above.  The nice thing is that the Git repo itself
>> could be archived by SWH.  :-)
>
> You mean like <https://git.ngyro.com/disarchive-db/>?  :)

Woow.  :-)

We could actually have a CI job to create the database: it would
basically do ‘disarchive save’ for each tarball and store that using a
layout like the one you used.  Then we could have a job somewhere that
periodically fetches that and adds it to the database.  WDYT?

I think we should leave room for other hash algorithms (in the sexps
above too).

> This was generated by a little script built on top of “fold-packages”.
> It downloads Gzip’d tarballs used by Guix packages and passes them on to
> Disarchive for disassembly.  I limited the number to 100 because it’s
> slow and because I’m sure there is a long tail of weird software
> archives that are going to be hard to process.  The metadata directory
> ended up being 13M and the directory cache 2G.

Neat.

So it does mean that we could pretty much right away add a fall-back in
(guix download) that looks up tarballs in your database and uses
Disarchive to recontruct it, right?  I love solved problems.  :-)

Of course we could improve Disarchive and the database, but it seems to
me that we already have enough to improve the situation.  WDYT?

> Even with the code I have so far, I have a lot of questions.  Mainly I’m
> worried about keeping everything working into the future.  It would be
> easy to make incompatible changes.  A lot of care would have to be
> taken.  Of course, keeping a Guix commit and a Disarchive commit might
> be enough to make any assembling reproducible, but there’s a
> chicken-and-egg problem there.

The way I see it, Guix would always look up tarballs in the HEAD of the
database (no need to pick a specific commit).  Worst that could happen
is we reconstruct a tarball that doesn’t match, and so the daemon errors
out.

Regarding future-proofness, I think we must be super careful about the
file formats (the sexps).  You did pay attention to not having implicit
defaults, which is perfect.  Perhaps one thing to change (or perhaps
it’s already there) is support for other hashes in those sexps: both
hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git
tree with different hash algorithm, IPFS CID, etc.).  Also the ability
to specify several hashes.

That way we could “refresh” the database anytime by adding the hash du
jour for already-present tarballs.

> What if a tarball from the closure of one the derivations is missing?
> I guess you could work around it, but it would be tricky.

Well, more generally, we’ll have to monitor archive coverage.  But I
don’t think the issue is specific to this method.

>> Anyhow, we should team up with fellow NixOS and SWH hackers to address
>> this, and with developers of other distros as well—this problem is not
>> just that of the functional deployment geeks, is it?
>
> I could remove most of the Guix stuff so that it would be easy to
> package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
> that consumes a “sources.json” file, adds the sources to a Disarchive
> database, and pushes everything to a Git repo.  I guess everyone who
> cares has to produce a “sources.json” file anyway, so it will be very
> little extra work.  Other stuff like changing the serialization format
> to JSON would be pretty easy, too.  I’m not well connected to these
> other projects, mind you, so I’m not really sure how to reach out.

If you feel like it, you’re welcome to point them to your work in the
discussion at <https://forge.softwareheritage.org/T2430>.  There’s one
person from NixOS (lewo) participating in the discussion and I’m sure
they’d be interested.  Perhaps they’ll tell whether they care about
having it available as JSON.

> Sorry about the big mess of code and ideas – I realize I may have taken
> the “do-ocracy” approach a little far here.  :)  Even if this is not
> “the” solution, hopefully it’s useful for discussion!

You did great!  I had a very rough sketch and you did the real thing,
that’s just awesome.  :-)

Thanks a lot!

Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-31 14:41         ` Ludovic Courtès
@ 2020-08-03 16:59           ` Timothy Sample
  2020-08-05 17:14             ` Ludovic Courtès
  0 siblings, 1 reply; 55+ messages in thread
From: Timothy Sample @ 2020-08-03 16:59 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162, Maurice Brémond

Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Wooohoo!  Is it that time of the year when people give presents to one
> another?  I can’t believe it.  :-)

Not to be too cynical, but I think it’s just the time of year that I get
frustrated with what I should be working on, and start fantasizing about
green-field projects.  :p

> Timothy Sample <samplet@ngyro.com> skribis:
>
>> The header and footer are read directly from the file.  Finding the
>> compressor is harder.  I followed the approach taken by the pristine-tar
>> project.  That is, try a bunch of compressors and hope for a match.
>> Currently, I have:
>>
>>     • gnu-best
>>     • gnu-best-rsync
>>     • gnu
>>     • gnu-rsync
>>     • gnu-fast
>>     • gnu-fast-rsync
>>     • zlib-best
>>     • zlib
>>     • zlib-fast
>>     • zlib-best-perl
>>     • zlib-perl
>>     • zlib-fast-perl
>>     • gnu-best-rsync-1.4
>>     • gnu-rsync-1.4
>>     • gnu-fast-rsync-1.4
>
> I would have used the integers that zlib supports, but I guess that
> doesn’t capture this whole gamut of compression setups.  And yeah, it’s
> not great that we actually have to try and find the right compression
> levels, but there’s no way around it it seems, and as you write, we can
> expect a couple of variants to be the most commonly used ones.

My first instinct was “this is impossible – a DEFLATE compressor can do
just about whatever it wants!”  Then I looked at pristine-tar and
realized that their hack probably works pretty well.  If I had infinite
time, I would think about some kind of fully general, parameterized LZ77
algorithm that could describe any implementation.  If I had a lot of
time I would peel back the curtain on Gzip and zlib and expose their
tuning parameters.  That would be nicer, but keep in mind we will have
to cover XZ, bzip2, and ZIP, too!  There’s a bit of balance between
quality and coverage.  Any improvement to the representation of the
compression algorithm could be implemented easily: just replace the
names with their improved representation.

One thing pristine-tar does is reorder the compressor list based on the
input metadata.  A Gzip member usually stores its compression level, so
it makes sense to try everything at that level first before moving one.

>> Originally, I used your code, but I ran into some problems.  Namely,
>> real tarballs are not well-behaved.  I wrote new code to keep track of
>> subtle things like the formatting of the octal values.
>
> Yeah I guess I was too optimistic.  :-)  I wanted to have the
> serialization/deserialization code automatically generated by that
> macro, but yeah, it doesn’t capture enough details for real-world
> tarballs.

I enjoyed your implementation!  I might even bring back its style.  It
was a little stiff for trying to figure out exactly what I needed for
reproducing the tarballs.

> Do you know how frequently you get “weird” tarballs?  I was thinking
> about having something that works for plain GNU tar, but it’s even
> better to have something that works with “unusual” tarballs!

I don’t have hard numbers, but I would say that a good handful (5–10%)
have “X-format” fields, meaning their octal formatting is unusual.  (I’m
looking at “grep -A 10 default-header” over all the S-Exp files.)  The
most charming thing is the “uname” and “gname” fields.  For example,
“rtmidi-4.0.0” was made by “gary” from “staff”.  :)

> (BTW the code I posted or the one in Disarchive could perhaps replace
> the one in Gash-Utils.  I was frustrated to not see a ‘fold-archive’
> procedure there, notably.)

I really like “fold-archive”.  One of the reasons I started doing this
is to possibly share code with Gash-Utils.  It’s not as easy as I was
hoping, but I’m planning on improving things there based on my
experience here.  I’ve now worked with four Scheme tar implementations,
maybe if I write a really good one I could cap that number at five!

>> To avoid hitting the SWH archive at all, I introduced a directory cache
>> so that I can store the directories locally.  If the directory cache is
>> available, directories are stored and retrieved from it.
>
> I guess we can get back to them eventually to estimate our coverage ratio.

It would be nice to know, but pretty hard to find out with the rate
limit.  I guess it will improve immensely when we set up a
“sources.json” file.

>> You mean like <https://git.ngyro.com/disarchive-db/>?  :)
>
> Woow.  :-)
>
> We could actually have a CI job to create the database: it would
> basically do ‘disarchive save’ for each tarball and store that using a
> layout like the one you used.  Then we could have a job somewhere that
> periodically fetches that and adds it to the database.  WDYT?

Maybe....  I assume that Disarchive would fail for a few of them.  We
would need a plan for monitoring those failures so that Disarchive can
be improved.  Also, unless I’m misunderstanding something, this means
building the whole database at every commit, no?  That would take a lot
of time and space.  On the other hand, it would be easy enough to try.
If it works, it’s a lot easier than setting up a whole other service.

> I think we should leave room for other hash algorithms (in the sexps
> above too).

It works for different hash algorithms, but not for different directory
hashing methods (like you mention below).

>> This was generated by a little script built on top of “fold-packages”.
>> It downloads Gzip’d tarballs used by Guix packages and passes them on to
>> Disarchive for disassembly.  I limited the number to 100 because it’s
>> slow and because I’m sure there is a long tail of weird software
>> archives that are going to be hard to process.  The metadata directory
>> ended up being 13M and the directory cache 2G.
>
> Neat.
>
> So it does mean that we could pretty much right away add a fall-back in
> (guix download) that looks up tarballs in your database and uses
> Disarchive to recontruct it, right?  I love solved problems.  :-)
>
> Of course we could improve Disarchive and the database, but it seems to
> me that we already have enough to improve the situation.  WDYT?

I would say that we are darn close!  In theory it would work.  It would
be much more practical if we had better coverage in the SWH archive
(i.e., “sources.json”) and a way to get metadata for a source archive
without downloading the entire Disarchive database.  It’s 13M now, but
it will likely be 500M with all the Gzip’d tarballs from a recent commit
of Guix.  It will only grow after that, too.

Of course those are not hard blockers, so ‘(guix download)’ could start
using Disarchive as soon as we package it.  I’ve starting looking into
it, but I’m confused about getting access to Disarchive from the
“out-of-band” download system.  Would it have to become a dependency of
Guix?

>> Even with the code I have so far, I have a lot of questions.  Mainly I’m
>> worried about keeping everything working into the future.  It would be
>> easy to make incompatible changes.  A lot of care would have to be
>> taken.  Of course, keeping a Guix commit and a Disarchive commit might
>> be enough to make any assembling reproducible, but there’s a
>> chicken-and-egg problem there.
>
> The way I see it, Guix would always look up tarballs in the HEAD of the
> database (no need to pick a specific commit).  Worst that could happen
> is we reconstruct a tarball that doesn’t match, and so the daemon errors
> out.

I was imagining an escape hatch beyond this, where one could look up a
provenance record from when Disarchive ingested and verified a source
code archive.  The provenance record would tell you which version of
Guix was used when saving the archive, so you could try your luck with
using “guix time-machine” to reproduce Disarchive’s original
computation.  If we perform database migrations, you would need to
travel back in time in the database, too.  The idea is that you could
work around breakages in Disarchive automatically using the Power of
Guix™.  Just a stray thought, really.

> Regarding future-proofness, I think we must be super careful about the
> file formats (the sexps).  You did pay attention to not having implicit
> defaults, which is perfect.  Perhaps one thing to change (or perhaps
> it’s already there) is support for other hashes in those sexps: both
> hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git
> tree with different hash algorithm, IPFS CID, etc.).  Also the ability
> to specify several hashes.
>
> That way we could “refresh” the database anytime by adding the hash du
> jour for already-present tarballs.

The hash algorithm is already configurable, but the directory hash
method is not.  You’re right that it should be, and that there should be
support for multiple digests.

>> What if a tarball from the closure of one the derivations is missing?
>> I guess you could work around it, but it would be tricky.
>
> Well, more generally, we’ll have to monitor archive coverage.  But I
> don’t think the issue is specific to this method.

Again, I’m thinking about the case where I want to travel back in time
to reproduce a Disarchive computation.  It’s really an unlikely
scenario, I’m just trying to think of everything that could go wrong.

>>> Anyhow, we should team up with fellow NixOS and SWH hackers to address
>>> this, and with developers of other distros as well—this problem is not
>>> just that of the functional deployment geeks, is it?
>>
>> I could remove most of the Guix stuff so that it would be easy to
>> package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
>> that consumes a “sources.json” file, adds the sources to a Disarchive
>> database, and pushes everything to a Git repo.  I guess everyone who
>> cares has to produce a “sources.json” file anyway, so it will be very
>> little extra work.  Other stuff like changing the serialization format
>> to JSON would be pretty easy, too.  I’m not well connected to these
>> other projects, mind you, so I’m not really sure how to reach out.
>
> If you feel like it, you’re welcome to point them to your work in the
> discussion at <https://forge.softwareheritage.org/T2430>.  There’s one
> person from NixOS (lewo) participating in the discussion and I’m sure
> they’d be interested.  Perhaps they’ll tell whether they care about
> having it available as JSON.

Good idea.  I will work out a few more kinks and then bring it up there.
I’ve already rewritten the parts that used the Guix daemon.  Disarchive
now only needs a handful Guix modules ('base32', 'serialization', and
'swh' are the ones that would be hard to remove).

>> Sorry about the big mess of code and ideas – I realize I may have taken
>> the “do-ocracy” approach a little far here.  :)  Even if this is not
>> “the” solution, hopefully it’s useful for discussion!
>
> You did great!  I had a very rough sketch and you did the real thing,
> that’s just awesome.  :-)
>
> Thanks a lot!

My pleasure!  Thanks for the feedback so far.

-- Tim

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-15 16:55       ` zimoun
  2020-07-20  8:39         ` Ludovic Courtès
@ 2020-08-03 21:10         ` Ricardo Wurmus
  1 sibling, 0 replies; 55+ messages in thread
From: Ricardo Wurmus @ 2020-08-03 21:10 UTC (permalink / raw)
  To: zimoun; +Cc: 42162


zimoun <zimon.toutoune@gmail.com> writes:

> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could be "git-fetch".  Today the source is over url-fetch but it could
> be over git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@git.bioconductor.org:packages/flowCore.

We should do that (and soon), especially because Bioconductor does not
keep an archive of old releases.  We can discuss this on a separate
issue lest we derail the discussion at hand.

-- 
Ricardo




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-08-03 16:59           ` Timothy Sample
@ 2020-08-05 17:14             ` Ludovic Courtès
  2020-08-05 18:57               ` Timothy Sample
  0 siblings, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2020-08-05 17:14 UTC (permalink / raw)
  To: Timothy Sample; +Cc: 42162, Maurice Brémond

Hello!

Timothy Sample <samplet@ngyro.com> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Wooohoo!  Is it that time of the year when people give presents to one
>> another?  I can’t believe it.  :-)
>
> Not to be too cynical, but I think it’s just the time of year that I get
> frustrated with what I should be working on, and start fantasizing about
> green-field projects.  :p

:-)

>> Timothy Sample <samplet@ngyro.com> skribis:
>>
>>> The header and footer are read directly from the file.  Finding the
>>> compressor is harder.  I followed the approach taken by the pristine-tar
>>> project.  That is, try a bunch of compressors and hope for a match.
>>> Currently, I have:
>>>
>>>     • gnu-best
>>>     • gnu-best-rsync
>>>     • gnu
>>>     • gnu-rsync
>>>     • gnu-fast
>>>     • gnu-fast-rsync
>>>     • zlib-best
>>>     • zlib
>>>     • zlib-fast
>>>     • zlib-best-perl
>>>     • zlib-perl
>>>     • zlib-fast-perl
>>>     • gnu-best-rsync-1.4
>>>     • gnu-rsync-1.4
>>>     • gnu-fast-rsync-1.4
>>
>> I would have used the integers that zlib supports, but I guess that
>> doesn’t capture this whole gamut of compression setups.  And yeah, it’s
>> not great that we actually have to try and find the right compression
>> levels, but there’s no way around it it seems, and as you write, we can
>> expect a couple of variants to be the most commonly used ones.
>
> My first instinct was “this is impossible – a DEFLATE compressor can do
> just about whatever it wants!”  Then I looked at pristine-tar and
> realized that their hack probably works pretty well.  If I had infinite
> time, I would think about some kind of fully general, parameterized LZ77
> algorithm that could describe any implementation.  If I had a lot of
> time I would peel back the curtain on Gzip and zlib and expose their
> tuning parameters.  That would be nicer, but keep in mind we will have
> to cover XZ, bzip2, and ZIP, too!  There’s a bit of balance between
> quality and coverage.  Any improvement to the representation of the
> compression algorithm could be implemented easily: just replace the
> names with their improved representation.

Yup, it makes sense to not spend too much time on this bit.  I guess
we’d already have good coverage with gzip and xz.

>> (BTW the code I posted or the one in Disarchive could perhaps replace
>> the one in Gash-Utils.  I was frustrated to not see a ‘fold-archive’
>> procedure there, notably.)
>
> I really like “fold-archive”.  One of the reasons I started doing this
> is to possibly share code with Gash-Utils.  It’s not as easy as I was
> hoping, but I’m planning on improving things there based on my
> experience here.  I’ve now worked with four Scheme tar implementations,
> maybe if I write a really good one I could cap that number at five!

Heh.  :-)  The needs are different anyway.  In Gash-Utils the focus is
probably on simplicity/maintainability, whereas here you really want to
cover all the details of the wire representation.

>>> To avoid hitting the SWH archive at all, I introduced a directory cache
>>> so that I can store the directories locally.  If the directory cache is
>>> available, directories are stored and retrieved from it.
>>
>> I guess we can get back to them eventually to estimate our coverage ratio.
>
> It would be nice to know, but pretty hard to find out with the rate
> limit.  I guess it will improve immensely when we set up a
> “sources.json” file.

Note that we have <https://guix.gnu.org/sources.json>.  Last I checked,
SWH was ingesting it in its “qualification” instance, so it should be
ingesting it for good real soon if it’s not doing it already.

>>> You mean like <https://git.ngyro.com/disarchive-db/>?  :)
>>
>> Woow.  :-)
>>
>> We could actually have a CI job to create the database: it would
>> basically do ‘disarchive save’ for each tarball and store that using a
>> layout like the one you used.  Then we could have a job somewhere that
>> periodically fetches that and adds it to the database.  WDYT?
>
> Maybe....  I assume that Disarchive would fail for a few of them.  We
> would need a plan for monitoring those failures so that Disarchive can
> be improved.  Also, unless I’m misunderstanding something, this means
> building the whole database at every commit, no?  That would take a lot
> of time and space.  On the other hand, it would be easy enough to try.
> If it works, it’s a lot easier than setting up a whole other service.

One can easily write a procedure that takes a tarball and returns a
<computed-file> that builds its database entry.  So at each commit, we’d
just rebuild things that have changed.

>> I think we should leave room for other hash algorithms (in the sexps
>> above too).
>
> It works for different hash algorithms, but not for different directory
> hashing methods (like you mention below).

OK.

[...]

>> So it does mean that we could pretty much right away add a fall-back in
>> (guix download) that looks up tarballs in your database and uses
>> Disarchive to recontruct it, right?  I love solved problems.  :-)
>>
>> Of course we could improve Disarchive and the database, but it seems to
>> me that we already have enough to improve the situation.  WDYT?
>
> I would say that we are darn close!  In theory it would work.  It would
> be much more practical if we had better coverage in the SWH archive
> (i.e., “sources.json”) and a way to get metadata for a source archive
> without downloading the entire Disarchive database.  It’s 13M now, but
> it will likely be 500M with all the Gzip’d tarballs from a recent commit
> of Guix.  It will only grow after that, too.

If we expose the database over HTTP (like over cgit), we can arrange so
that (guix download) simply GETs db.example.org/sha256/xyz.  No need to
fetch the whole database.

It might be more reasonable to have a real database and a real service
around it, I’m sure Chris Baines would agree ;-), but we can choose URLs
that could easily be implemented by a “real” service instead of cgit in
the future.

> Of course those are not hard blockers, so ‘(guix download)’ could start
> using Disarchive as soon as we package it.  I’ve starting looking into
> it, but I’m confused about getting access to Disarchive from the
> “out-of-band” download system.  Would it have to become a dependency of
> Guix?

Yes.  It could be a behind-the-scenes dependency of “builtin:download”;
it doesn’t have to be a dependency of each and every fixed-output
derivation.

> I was imagining an escape hatch beyond this, where one could look up a
> provenance record from when Disarchive ingested and verified a source
> code archive.  The provenance record would tell you which version of
> Guix was used when saving the archive, so you could try your luck with
> using “guix time-machine” to reproduce Disarchive’s original
> computation.  If we perform database migrations, you would need to
> travel back in time in the database, too.  The idea is that you could
> work around breakages in Disarchive automatically using the Power of
> Guix™.  Just a stray thought, really.

Seems to me it Shouldn’t Be Necessary?  :-)

I mean, as long as the format is extensible and “future-proof”, we’ll
always be able to rebuild tarballs and then re-disassemble them if we
need to compute new hashes or whatever.

>> If you feel like it, you’re welcome to point them to your work in the
>> discussion at <https://forge.softwareheritage.org/T2430>.  There’s one
>> person from NixOS (lewo) participating in the discussion and I’m sure
>> they’d be interested.  Perhaps they’ll tell whether they care about
>> having it available as JSON.
>
> Good idea.  I will work out a few more kinks and then bring it up there.
> I’ve already rewritten the parts that used the Guix daemon.  Disarchive
> now only needs a handful Guix modules ('base32', 'serialization', and
> 'swh' are the ones that would be hard to remove).

An option would be to use (gcrypt base64); another one would be to
bundle (guix base32).

I was thinking that it might be best to not use Guix for computations.
For example, have “disarchive save” not build derivations and instead do
everything “here and now”.  That would make it easier for others to
adopt.  Wait, looking at the Git history, it looks like you already
addressed that point, neat.  :-)

Thank you!

Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-08-05 17:14             ` Ludovic Courtès
@ 2020-08-05 18:57               ` Timothy Sample
  2020-08-23 16:21                 ` Ludovic Courtès
  2020-11-03 14:26                 ` Ludovic Courtès
  0 siblings, 2 replies; 55+ messages in thread
From: Timothy Sample @ 2020-08-05 18:57 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162, Maurice Brémond

Hey,

Ludovic Courtès <ludo@gnu.org> writes:

> Note that we have <https://guix.gnu.org/sources.json>.  Last I checked,
> SWH was ingesting it in its “qualification” instance, so it should be
> ingesting it for good real soon if it’s not doing it already.

Oh fantastic!  I was going to volunteer to do it, so that’s one thing
off my list.

> One can easily write a procedure that takes a tarball and returns a
> <computed-file> that builds its database entry.  So at each commit, we’d
> just rebuild things that have changed.

That makes more sense.  I will give this a shot soon.

> If we expose the database over HTTP (like over cgit), we can arrange so
> that (guix download) simply GETs db.example.org/sha256/xyz.  No need to
> fetch the whole database.
>
> It might be more reasonable to have a real database and a real service
> around it, I’m sure Chris Baines would agree ;-), but we can choose URLs
> that could easily be implemented by a “real” service instead of cgit in
> the future.

I got it working over cgit shortly after sending my last message.  :)  So
far, I am very much on team “good enough for now”.

> Timothy Sample <samplet@ngyro.com> skribis:
>
>> I was imagining an escape hatch beyond this, where one could look up a
>> provenance record from when Disarchive ingested and verified a source
>> code archive.  The provenance record would tell you which version of
>> Guix was used when saving the archive, so you could try your luck with
>> using “guix time-machine” to reproduce Disarchive’s original
>> computation.  If we perform database migrations, you would need to
>> travel back in time in the database, too.  The idea is that you could
>> work around breakages in Disarchive automatically using the Power of
>> Guix™.  Just a stray thought, really.
>
> Seems to me it Shouldn’t Be Necessary?  :-)
>
> I mean, as long as the format is extensible and “future-proof”, we’ll
> always be able to rebuild tarballs and then re-disassemble them if we
> need to compute new hashes or whatever.

If Disarchive relies on external compressors, there’s an outside chance
that those compressors could change under our feet.  In that case, one
would want to be able to track down exactly which version of XZ was used
when Disarchive verified that it could reassemble a given source
archive.  Maybe I’m being paranoid, but if the database entries are
being computed by the CI infrastructure it would be pretty easy to note
the Guix commit just in case.

> I was thinking that it might be best to not use Guix for computations.
> For example, have “disarchive save” not build derivations and instead do
> everything “here and now”.  That would make it easier for others to
> adopt.  Wait, looking at the Git history, it looks like you already
> addressed that point, neat.  :-)

Since my last message I managed to remove Guix as dependency completely.
Right now it loads ‘(guix swh)’ opportunistically, but I might just copy
the code in.  Directory references now support multiple “addresses” so
that you could have Nix-style, SWH-style, IPFS-style, etc.  Hopefully my
next message will have a WIP patch enabling Guix to use Disarchive!

-- Tim

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-08-05 18:57               ` Timothy Sample
@ 2020-08-23 16:21                 ` Ludovic Courtès
  2020-11-03 14:26                 ` Ludovic Courtès
  1 sibling, 0 replies; 55+ messages in thread
From: Ludovic Courtès @ 2020-08-23 16:21 UTC (permalink / raw)
  To: Timothy Sample; +Cc: 42162, Maurice Brémond

Hello!

Timothy Sample <samplet@ngyro.com> skribis:

>> If we expose the database over HTTP (like over cgit), we can arrange so
>> that (guix download) simply GETs db.example.org/sha256/xyz.  No need to
>> fetch the whole database.
>>
>> It might be more reasonable to have a real database and a real service
>> around it, I’m sure Chris Baines would agree ;-), but we can choose URLs
>> that could easily be implemented by a “real” service instead of cgit in
>> the future.
>
> I got it working over cgit shortly after sending my last message.  :)  So
> far, I am very much on team “good enough for now”.

Wonderful.  :-)

>> Timothy Sample <samplet@ngyro.com> skribis:
>>
>>> I was imagining an escape hatch beyond this, where one could look up a
>>> provenance record from when Disarchive ingested and verified a source
>>> code archive.  The provenance record would tell you which version of
>>> Guix was used when saving the archive, so you could try your luck with
>>> using “guix time-machine” to reproduce Disarchive’s original
>>> computation.  If we perform database migrations, you would need to
>>> travel back in time in the database, too.  The idea is that you could
>>> work around breakages in Disarchive automatically using the Power of
>>> Guix™.  Just a stray thought, really.
>>
>> Seems to me it Shouldn’t Be Necessary?  :-)
>>
>> I mean, as long as the format is extensible and “future-proof”, we’ll
>> always be able to rebuild tarballs and then re-disassemble them if we
>> need to compute new hashes or whatever.
>
> If Disarchive relies on external compressors, there’s an outside chance
> that those compressors could change under our feet.  In that case, one
> would want to be able to track down exactly which version of XZ was used
> when Disarchive verified that it could reassemble a given source
> archive.

Oh, true.  Gzip and bzip2 are more-or-less “set in stone”, but xz, lzip,
or zstd could change.  Recording the exact version of the implementation
would be a good stopgap.

> Maybe I’m being paranoid, but if the database entries are being
> computed by the CI infrastructure it would be pretty easy to note the
> Guix commit just in case.

Yeah, that makes sense.  At least we could have “notes” in the file
format to store that kind of info.  Using CI is also a good idea.

>> I was thinking that it might be best to not use Guix for computations.
>> For example, have “disarchive save” not build derivations and instead do
>> everything “here and now”.  That would make it easier for others to
>> adopt.  Wait, looking at the Git history, it looks like you already
>> addressed that point, neat.  :-)
>
> Since my last message I managed to remove Guix as dependency completely.
> Right now it loads ‘(guix swh)’ opportunistically, but I might just copy
> the code in.  Directory references now support multiple “addresses” so
> that you could have Nix-style, SWH-style, IPFS-style, etc.  Hopefully my
> next message will have a WIP patch enabling Guix to use Disarchive!

Neat, looking forward to it!

Thank you,
Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-07-30 17:36       ` Timothy Sample
  2020-07-31 14:41         ` Ludovic Courtès
@ 2020-08-26 10:04         ` zimoun
  2020-08-26 21:11           ` Timothy Sample
  1 sibling, 1 reply; 55+ messages in thread
From: zimoun @ 2020-08-26 10:04 UTC (permalink / raw)
  To: Timothy Sample, Ludovic Courtès; +Cc: 42162, Maurice Brémond

Dear Timothy,

On Thu, 30 Jul 2020 at 13:36, Timothy Sample <samplet@ngyro.com> wrote:

> I call the thing “Disarchive” as in “disassemble a source code archive”.
> You can find it at <https://git.ngyro.com/disarchive/>.  It has a simple
> command-line interface so you can do
>
>     $ disarchive save software-1.0.tar.gz
>
> which serializes a disassembled version of “software-1.0.tar.gz” to the
> database (which is just a directory) specified by the “DISARCHIVE_DB”
> environment variable.  Next, you can run
>
>     $ disarchive load hash-of-something-in-the-db
>
> which will recover an original file from its metadata (stored in the
> database) and data retrieved from the SWH archive or taken from a cache
> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Really nice!  Thank you!


>> I think we’d have to maintain a database that maps tarball hashes to
>> metadata (!).  A simple version of it could be a Git repo where, say,
>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
>> contain the metadata above.  The nice thing is that the Git repo itself
>> could be archived by SWH.  :-)
>
> You mean like <https://git.ngyro.com/disarchive-db/>?  :)

[...]

> This was generated by a little script built on top of “fold-packages”.
> It downloads Gzip’d tarballs used by Guix packages and passes them on to
> Disarchive for disassembly.  I limited the number to 100 because it’s
> slow and because I’m sure there is a long tail of weird software
> archives that are going to be hard to process.  The metadata directory
> ended up being 13M and the directory cache 2G.

One question is how this database scales?

For example, a quick back-to-envelop estimation leads to ~1.2GB metadata
for ~14k packages and then an increase of ~700MB per year, both with the
Ludo’s code [1].

[1] <http://issues.guix.gnu.org/issue/42162#11>



> I could remove most of the Guix stuff so that it would be easy to
> package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
> that consumes a “sources.json” file, adds the sources to a Disarchive
> database, and pushes everything to a Git repo.  I guess everyone who
> cares has to produce a “sources.json” file anyway, so it will be very
> little extra work.  Other stuff like changing the serialization format
> to JSON would be pretty easy, too.  I’m not well connected to these
> other projects, mind you, so I’m not really sure how to reach out.

This service could be really useful.  Yes, it could be easy to update
the database each time Guix produces a new “sources.json”.

As mentioned [2], should this service be part of SWH (download cooking
task)?  Or project side?

[2] <https://forge.softwareheritage.org/T2430#47486>


Thank you again for this piece for work.

All the best,
simon




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-08-26 10:04         ` bug#42162: Recovering source tarballs zimoun
@ 2020-08-26 21:11           ` Timothy Sample
  2020-08-27  9:41             ` zimoun
  0 siblings, 1 reply; 55+ messages in thread
From: Timothy Sample @ 2020-08-26 21:11 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Maurice Brémond

Hi zimoun,

zimoun <zimon.toutoune@gmail.com> writes:

> One question is how this database scales?
>
> For example, a quick back-to-envelop estimation leads to ~1.2GB metadata
> for ~14k packages and then an increase of ~700MB per year, both with the
> Ludo’s code [1].
>
> [1] <http://issues.guix.gnu.org/issue/42162#11>

It’s a good question.  A good part of the size comes from the
representation rather than the data.  Compression helps a lot here.  I
have a database of 3,912 packages.  It’s 295M uncompressed (which is a
little better than your estimation).  If I pass each file through Lzip,
it shrinks down to 60M.  That’s more like 15.5K per package, which is
almost an order of magnitude smaller than the estimation you used
(120K).  I think that makes the numbers rather pleasant, but it comes at
the expense of easy storing in Git.

> As mentioned [2], should this service be part of SWH (download cooking
> task)?  Or project side?
>
> [2] <https://forge.softwareheritage.org/T2430#47486>

It would be interesting to just have SWH absorb the project.  Since
other distros already know how to produce a “sources.json” and how to
query the SWH archive, it would mean that they benefit for free (and so
would Guix, for that matter).  I’m open to that, but right now having
the freedom to experiment is important.

-- Tim

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-08-26 21:11           ` Timothy Sample
@ 2020-08-27  9:41             ` zimoun
  2020-08-27 12:49               ` Ludovic Courtès
  2020-08-27 18:06               ` Bengt Richter
  0 siblings, 2 replies; 55+ messages in thread
From: zimoun @ 2020-08-27  9:41 UTC (permalink / raw)
  To: Timothy Sample; +Cc: 42162, Maurice Brémond

Hi,

On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:
>
>> One question is how this database scales?
>>
>> For example, a quick back-to-envelop estimation leads to ~1.2GB metadata
>> for ~14k packages and then an increase of ~700MB per year, both with the
>> Ludo’s code [1].
>>
>> [1] <http://issues.guix.gnu.org/issue/42162#11>
>
> It’s a good question.  A good part of the size comes from the
> representation rather than the data.  Compression helps a lot here.  I
> have a database of 3,912 packages.  It’s 295M uncompressed (which is a
> little better than your estimation).  If I pass each file through Lzip,
> it shrinks down to 60M.  That’s more like 15.5K per package, which is
> almost an order of magnitude smaller than the estimation you used
> (120K).  I think that makes the numbers rather pleasant, but it comes at
> the expense of easy storing in Git.

Thank you for these numbers.  Really interesting!

First, I do not know if the database needs to be stored with Git.  What
should be the advantage? (naive question :-))


On SWH T2430 [1], you explain the “default-header” trick to cut down the
size.  Nice!

Moreover, the format is a long list, e.g.,

--8<---------------cut here---------------start------------->8---
(headers
    ((name "raptor2-2.0.15/")
     (mode 493)
     (mtime 1414909500)
     (chksum 4225)
     (typeflag 53))
    ((name "raptor2-2.0.15/build/")
     (mode 493)
     (mtime 1414909497)
     (chksum 4797)
     (typeflag 53))
    ((name "raptor2-2.0.15/build/ltversion.m4")
     (size 690)
     (mtime 1414908273)
     (chksum 5958))

     […])
--8<---------------cut here---------------end--------------->8---

which is human-readable.  Is it useful?


Instead, one could imagine shorter keywords:

    ((na "raptor2-2.0.15/")
     (mo 493)
     (mt 1414909500)
     (ch 4225)
     (ty 53))

which using your database (commit fc50927) reduces from 295MB to 279MB.

Or even plain list:

   (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
   (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)

where the first element provides the “type” of list to ease the reader.


Well, the 2 naive questions are: does it make sense to
 - have the database stored under Git?
 - have an human-readable format?


Thank you again for pushing forward this topic. :-)

All the best,
simon

[1] https://forge.softwareheritage.org/T2430#47522




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-08-27  9:41             ` zimoun
@ 2020-08-27 12:49               ` Ludovic Courtès
  2020-08-27 18:06               ` Bengt Richter
  1 sibling, 0 replies; 55+ messages in thread
From: Ludovic Courtès @ 2020-08-27 12:49 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Maurice Brémond

Hi!

zimoun <zimon.toutoune@gmail.com> skribis:

> Moreover, the format is a long list, e.g.,
>
> (headers
>     ((name "raptor2-2.0.15/")
>      (mode 493)
>      (mtime 1414909500)
>      (chksum 4225)
>      (typeflag 53))
>     ((name "raptor2-2.0.15/build/")
>      (mode 493)
>      (mtime 1414909497)
>      (chksum 4797)
>      (typeflag 53))
>     ((name "raptor2-2.0.15/build/ltversion.m4")
>      (size 690)
>      (mtime 1414908273)
>      (chksum 5958))
>
>      […])
>
> which is human-readable.  Is it useful?
>
>
> Instead, one could imagine shorter keywords:
>
>     ((na "raptor2-2.0.15/")
>      (mo 493)
>      (mt 1414909500)
>      (ch 4225)
>      (ty 53))
>
> which using your database (commit fc50927) reduces from 295MB to 279MB.

I think it’s nice, at least at this stage, that it’s
human-readable—“premature optimization is the root of all evil”.  :-)

I guess it won’t be difficult to make the format more dense eventually
if that is deemed necessary, using ‘write’ instead of ‘pretty-print’,
using tricks like you write, or even going binary as a last resort.

Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-08-27  9:41             ` zimoun
  2020-08-27 12:49               ` Ludovic Courtès
@ 2020-08-27 18:06               ` Bengt Richter
  1 sibling, 0 replies; 55+ messages in thread
From: Bengt Richter @ 2020-08-27 18:06 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Maurice Brémond

Hi,

On +2020-08-27 11:41:24 +0200, zimoun wrote:
> Hi,
> 
> On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote:
> > zimoun <zimon.toutoune@gmail.com> writes:
> >
> >> One question is how this database scales?
> >>
> >> For example, a quick back-to-envelop estimation leads to ~1.2GB metadata
> >> for ~14k packages and then an increase of ~700MB per year, both with the
> >> Ludo’s code [1].
> >>
> >> [1] <http://issues.guix.gnu.org/issue/42162#11>
> >
> > It’s a good question.  A good part of the size comes from the
> > representation rather than the data.  Compression helps a lot here.  I
> > have a database of 3,912 packages.  It’s 295M uncompressed (which is a
> > little better than your estimation).  If I pass each file through Lzip,
> > it shrinks down to 60M.  That’s more like 15.5K per package, which is
> > almost an order of magnitude smaller than the estimation you used
> > (120K).  I think that makes the numbers rather pleasant, but it comes at
> > the expense of easy storing in Git.
> 
> Thank you for these numbers.  Really interesting!
> 
> First, I do not know if the database needs to be stored with Git.  What
> should be the advantage? (naive question :-))
> 
> 
> On SWH T2430 [1], you explain the “default-header” trick to cut down the
> size.  Nice!
> 
> Moreover, the format is a long list, e.g.,
> 
> --8<---------------cut here---------------start------------->8---
> (headers

How about
    (X-v1-headers
(borrowing from rfc2045 MIME usage indicating as-yet-not-a-formal-standard)
The idea is to make it easy to script the change to "(headers" once
there is consensus for declaring a new standard. The "v1-" part could allow
a simultaneous "(X-v2-headers" alternative for zimoun's concise suggestion,
or even a base64 of a compressed format. There's lots that could be borrowed from
the MIME rfc's :)

--8<---------------cut here---------------start------------->8---
6.3.  New Content-Transfer-Encodings

   Implementors may, if necessary, define private Content-Transfer-
   Encoding values, but must use an x-token, which is a name prefixed by
   "X-", to indicate its non-standard status, e.g., "Content-Transfer-
   Encoding: x-my-new-encoding".  Additional standardized Content-
   Transfer-Encoding values must be specified by a standards-track RFC.
   The requirements such specifications must meet are given in RFC 2048.
   As such, all content-transfer-encoding namespace except that
   beginning with "X-" is explicitly reserved to the IETF for future
   use.

   Unlike media types and subtypes, the creation of new Content-
   Transfer-Encoding values is STRONGLY discouraged, as it seems likely
   to hinder interoperability with little potential benefit
--8<---------------cut here---------------end--------------->8---

>     ((name "raptor2-2.0.15/")
>      (mode 493)
If you want to be more human-readable with mode, I would put
a chmod argument in place of 493 :)

--8<---------------cut here---------------start------------->8---
$ printf "%o\n" 493
755
$ 
--8<---------------cut here---------------end--------------->8---

Hm, could this be a security risk??
I mean, could a mode typo here inadvertently open a door for a nasty mod
by oportunistic code buried in a later-executed apparently unrelated app?

>      (mtime 1414909500)
One of these might be more human-recognizable :)
--8<---------------cut here---------------start------------->8---
$ date --date='@1414909497' -Is
2014-11-02T07:24:57+01:00
$ date --date='@1414909497' -uIs
2014-11-02T06:24:57+00:00
$ TZ=America/Buenos_Aires date --date='@1414909497' -Is
2014-11-02T03:24:57-03:00
$
$ date --date='@1414909497' -u '+%Y%m%d_%H%M%S'
20141102_062457
# vs 1414909497, which, yes, costs 5 chars less
$ 
--8<---------------cut here---------------end--------------->8---

>      (chksum 4225)
>      (typeflag 53))
>     ((name "raptor2-2.0.15/build/")
>      (mode 493)
>      (mtime 1414909497)
>      (chksum 4797)
>      (typeflag 53))
>     ((name "raptor2-2.0.15/build/ltversion.m4")
>      (size 690)
>      (mtime 1414908273)
>      (chksum 5958))
> 
>      […])
> --8<---------------cut here---------------end--------------->8---
> 
> which is human-readable.  Is it useful?
> 
> 
> Instead, one could imagine shorter keywords:
>
(X-v2-headers  ;; ;-)
>     ((na "raptor2-2.0.15/")
>      (mo 493)
>      (mt 1414909500)
>      (ch 4225)
>      (ty 53))
> 
> which using your database (commit fc50927) reduces from 295MB to 279MB.
> 
> Or even plain list:
>
(X-v3-headers
>    (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
>    (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)
> 
> where the first element provides the “type” of list to ease the reader.
> 
> 
> Well, the 2 naive questions are: does it make sense to
>  - have the database stored under Git?
>  - have an human-readable format?
> 
> 
> Thank you again for pushing forward this topic. :-)
> 
> All the best,
> simon
> 
> [1] https://forge.softwareheritage.org/T2430#47522
> 
> 
> 

Prefixing "X-" can obviously be used with any tentative name for anything.

I am suggesting it as a counter to premature (and likely clashing) bindings
of valuable names, which IMO is as bad as premature optimization :)

Naming is too important to be defined by first-user flag-planting, ISTM.
-- 
Regards,
Bengt Richter




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-08-05 18:57               ` Timothy Sample
  2020-08-23 16:21                 ` Ludovic Courtès
@ 2020-11-03 14:26                 ` Ludovic Courtès
  2020-11-03 16:37                   ` zimoun
  2020-11-03 19:20                   ` Timothy Sample
  1 sibling, 2 replies; 55+ messages in thread
From: Ludovic Courtès @ 2020-11-03 14:26 UTC (permalink / raw)
  To: Timothy Sample; +Cc: 42162

Hi Timothy,

I hope you’re well.  I was wondering if you’ve had the chance to fiddle
with Disarchive since the summer?

I’m thinking there are small steps we could take to move forward:

  1. Have a Disarchive package in Guix (and one for guile-quickcheck,
     kudos on that one!).

  2. Have a Cuirass job running on ci.guix.gnu.org to build and publish
     the disarchive-db.

  3. Integrate Disarchive in (guix download) to reconstruct tarballs.

WDYT?

Thanks,
Ludo’, who’s still very much excited about these perspectives!

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-11-03 14:26                 ` Ludovic Courtès
@ 2020-11-03 16:37                   ` zimoun
  2020-11-03 19:20                   ` Timothy Sample
  1 sibling, 0 replies; 55+ messages in thread
From: zimoun @ 2020-11-03 16:37 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162

Hi,

On Tue, 3 Nov 2020 at 15:26, Ludovic Courtès <ludo@gnu.org> wrote:

>   2. Have a Cuirass job running on ci.guix.gnu.org to build and publish
>      the disarchive-db.

One question is: how does the database scale?  And only the real world
can show it. ;-)

> Ludo’, who’s still very much excited about these perspectives!

Sounds awesome!

On my side, I asked twice on #swh-devel if it is possible to setup a
higher rate limit for one specific machine.  I have in mind one
machine located at my place (Univ. Paris, ex Paris 7 Diderot) because
of proximity and because I want to generate (script) some report about
how much Guix is in SWH.  Whatever!
Instead, we could ask for Berlin or for one machine of INRIA Bordeaux,
maybe the machine running guix.gnu.org or the one running
hpc.guix.info.  WDYT?

BTW, not related to tarballs and I have not worked so much on (running
out of time), but I would like to integrate hg-fetch and svn-fetch
with SWH, first to "guix lint -c archival" then second to
sources.json.  The save seems not so hard, but the lookup needs some
experiments with the SWH API.
The big picture is to have all the ingestion of the Guix packages done
by the automatically generated sources.json file and not via
time-to-time "guix lint -c archival" (should be recommended for custom
channels).

All the best,
simon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-11-03 14:26                 ` Ludovic Courtès
  2020-11-03 16:37                   ` zimoun
@ 2020-11-03 19:20                   ` Timothy Sample
  2020-11-04 16:49                     ` Ludovic Courtès
  1 sibling, 1 reply; 55+ messages in thread
From: Timothy Sample @ 2020-11-03 19:20 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162

Hi Ludo,

Ludovic Courtès <ludo@gnu.org> writes:

> I hope you’re well.  I was wondering if you’ve had the chance to fiddle
> with Disarchive since the summer?

Sort of!  I managed to get the entire corpus of tarballs that I started
with to work (about 4000 archives).  After that, I started writing some
documentation.  The goal there was to be more careful with serialization
format.  Starting to think clearly about the format and how to ensure
long-term compatibility gave me a bit of vertigo, so I took a break.  :)

I was kind of hoping the initial excitement at SWH would push the
project along, but that seems to have died down (for now).  Going back
to making sure it works for Guix is probably the best way to develop it
until I hear more from SWH.

> I’m thinking there are small steps we could take to move forward:
>
>   1. Have a Disarchive package in Guix (and one for guile-quickcheck,
>      kudos on that one!).

This will be easy.  The hang-up I had earlier was that I vendored the
pristine-tar Gzip utility (“zgz”).  Since then I don’t think it’s such a
big deal.

(I wrote Guile-QuickCheck ages ago!  It was rotting away on my disk
because I couldn’t figure out a good way to use it with, say, Gash.  It
has exposed several Disarchive bugs already.)

>   2. Have a Cuirass job running on ci.guix.gnu.org to build and publish
>      the disarchive-db.

I’m interested in running Cuirass locally for other reasons, so I should
have a good test environment to figure this out.  To be honest, I’ve had
trouble figuring out Cuirass in the past, so I was dragging my feet a
bit.

>   3. Integrate Disarchive in (guix download) to reconstruct tarballs.

I had a very simple patch that did this!  It was less exciting when it
sounded like SWH was going to use Disarchive directly.  However, like I
wrote, making Disarchive work for Guix is probably the best way to make
it work for SWH if they want it in the future.

> WDYT?

This all will have to wait in the queue for a bit longer, but I should
be able to return to it soon.  I think the steps listed above are good,
along with some changes I want to make to Disarchive itself.

--Tim

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: Recovering source tarballs
  2020-11-03 19:20                   ` Timothy Sample
@ 2020-11-04 16:49                     ` Ludovic Courtès
  2022-09-29  0:32                       ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Maxim Cournoyer
  0 siblings, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2020-11-04 16:49 UTC (permalink / raw)
  To: Timothy Sample; +Cc: 42162

Hello!

Timothy Sample <samplet@ngyro.com> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> I hope you’re well.  I was wondering if you’ve had the chance to fiddle
>> with Disarchive since the summer?
>
> Sort of!  I managed to get the entire corpus of tarballs that I started
> with to work (about 4000 archives).  After that, I started writing some
> documentation.  The goal there was to be more careful with serialization
> format.  Starting to think clearly about the format and how to ensure
> long-term compatibility gave me a bit of vertigo, so I took a break.  :)
>
> I was kind of hoping the initial excitement at SWH would push the
> project along, but that seems to have died down (for now).  Going back
> to making sure it works for Guix is probably the best way to develop it
> until I hear more from SWH.

Yeah, I suppose they have enough on their plate and won’t add it to
their agenda until we have shown that it works for us.

>> I’m thinking there are small steps we could take to move forward:
>>
>>   1. Have a Disarchive package in Guix (and one for guile-quickcheck,
>>      kudos on that one!).
>
> This will be easy.  The hang-up I had earlier was that I vendored the
> pristine-tar Gzip utility (“zgz”).  Since then I don’t think it’s such a
> big deal.

Yeah.

> (I wrote Guile-QuickCheck ages ago!  It was rotting away on my disk
> because I couldn’t figure out a good way to use it with, say, Gash.  It
> has exposed several Disarchive bugs already.)

Neat!  I’m sure many of us would love to use it.  :-)

> This all will have to wait in the queue for a bit longer, but I should
> be able to return to it soon.  I think the steps listed above are good,
> along with some changes I want to make to Disarchive itself.

Alright!  Let us know if you think there are tasks that people should
just pick and work on in the meantime.

Thanks for the prompt reply!

Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2020-07-02  7:29 bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Ludovic Courtès
  2020-07-02  8:50 ` zimoun
@ 2021-01-10 19:32 ` Maxim Cournoyer
  2021-01-13 10:39   ` Ludovic Courtès
       [not found] ` <handler.42162.D42162.16105343699609.notifdone@debbugs.gnu.org>
  2 siblings, 1 reply; 55+ messages in thread
From: Maxim Cournoyer @ 2021-01-10 19:32 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162-done, Maurice Brémond

Hello Ludovic,

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> Hello!
>
> The hosting site gforge.inria.fr will be taken off-line in December
> 2020.  This GForge instance hosts source code as tarballs, Subversion
> repos, and Git repos.  Users have been invited to migrate to
> gitlab.inria.fr, which is Git only.  It seems that Software Heritage
> hasn’t archived (yet) all of gforge.inria.fr.  Let’s keep track of the
> situation in this issue.
>
> The following packages have their source on gforge.inria.fr:
>
> scheme@(guile-user)> ,pp packages-on-gforge
> $7 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
>  #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
>  #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>
>  #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
>  #<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640>
>  #<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780>
>  #<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0>
>  #<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0>
>  #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
>  #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
>  #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
>
>
> ‘isl’ (a dependency of GCC) has its source on gforge.inria.fr but it’s
> also mirrored at gcc.gnu.org apparently.
>
> Of these, the following are available on Software Heritage:
>
> scheme@(guile-user)> ,pp archived-source
> $8 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
>  #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>
>  #<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640>
>  #<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780>
>  #<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0>
>  #<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0>
>  #<package isl@0.18 gnu/packages/gcc.scm:925 7f632dc82320>
>  #<package isl@0.11.1 gnu/packages/gcc.scm:939 7f632dc82280>)

I ran the code you had attached to the original message and got:

,pp packages-on-gforge
$2 = ()
scheme@(guile-user)> ,pp archived-source
$3 = ()

Closing,

Thank you.

Maxim




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2021-01-10 19:32 ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Maxim Cournoyer
@ 2021-01-13 10:39   ` Ludovic Courtès
  2021-01-13 12:27     ` Andreas Enge
  2021-01-13 15:07     ` Andreas Enge
  0 siblings, 2 replies; 55+ messages in thread
From: Ludovic Courtès @ 2021-01-13 10:39 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: 42162-done, Maurice Brémond, andreas.enge

[-- Attachment #1: Type: text/plain, Size: 3547 bytes --]

Hi Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

>> The following packages have their source on gforge.inria.fr:
>>
>> scheme@(guile-user)> ,pp packages-on-gforge
>> $7 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
>>  #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
>>  #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>

[...]

> I ran the code you had attached to the original message and got:
>
> ,pp packages-on-gforge
> $2 = ()
> scheme@(guile-user)> ,pp archived-source
> $3 = ()

Oh, it’s due to a bug, where the wrong ‘origin?’ predicate was taken.
After hiding the “wrong” one:

  #:use-module ((guix swh) #:hide (origin?))

I get:

--8<---------------cut here---------------start------------->8---
scheme@(guix-user)> ,pp packages-on-gforge
$1 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3964 7fa8a522b280>
 #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:281 7fa8a4f44dc0>
 #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:343 7fa8a4f44c80>
 #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7fa8afd8aa00>
 #<package scotch@6.1.0 gnu/packages/maths.scm:3083 7fa8a69c8d20>
 #<package pt-scotch@6.1.0 gnu/packages/maths.scm:3229 7fa8a69c8be0>
 #<package scotch32@6.1.0 gnu/packages/maths.scm:3182 7fa8a69c8c80>
 #<package pt-scotch32@6.1.0 gnu/packages/maths.scm:3253 7fa8a69c8b40>
 #<package isl@0.22.1 gnu/packages/gcc.scm:932 7fa8a64cbdc0>
 #<package isl@0.11.1 gnu/packages/gcc.scm:997 7fa8a64cbc80>
 #<package isl@0.18 gnu/packages/gcc.scm:983 7fa8a64cbd20>
 #<package gf2x@1.2 gnu/packages/algebra.scm:104 7fa8a4f66500>
 #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:672 7fa8a4f70be0>
 #<package cmh@1.0 gnu/packages/algebra.scm:325 7fa8a4f660a0>)
scheme@(guix-user)> ,pp archived-source
$2 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:281 7fa8a4f44dc0>
 #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:343 7fa8a4f44c80>
 #<package scotch@6.1.0 gnu/packages/maths.scm:3083 7fa8a69c8d20>
 #<package pt-scotch@6.1.0 gnu/packages/maths.scm:3229 7fa8a69c8be0>
 #<package scotch32@6.1.0 gnu/packages/maths.scm:3182 7fa8a69c8c80>
 #<package pt-scotch32@6.1.0 gnu/packages/maths.scm:3253 7fa8a69c8b40>
 #<package isl@0.11.1 gnu/packages/gcc.scm:997 7fa8a64cbc80>
 #<package isl@0.18 gnu/packages/gcc.scm:983 7fa8a64cbd20>)
--8<---------------cut here---------------end--------------->8---

Attaching the fixed script for clarity.

BTW, gforge.inria.fr shutdown has been delayed a bit, but most active
projects have started migrating to gitlab.inria.fr or elsewhere, so
hopefully we should be able to start updating our package recipes
accordingly.  It’s likely, though, that tarballs were lost in the
migration.

For example, Scotch is now at <https://gitlab.inria.fr/scotch/scotch>.
<https://gitlab.inria.fr/scotch/scotch/-/releases> shows “assets” for
the 6.1.0 release, but these are auto-generated tarballs instead of the
handcrafted one found on gforge.inria.fr (but this one is fine since its
tarball is archived as-is on SWH.)

ISL, MPFI, and GMP-ECM haven’t migrated, it seems.  CMH is now at
<https://gitlab.inria.fr/cmh/cmh> but without its tarballs.

Andreas, do you happen to know about the status of these?

We can already change Scotch and CMH to ‘git-fetch’ I think.  That
doesn’t solve the problem for earlier Guix revisions though, and I hope
Disarchive will save us!

Thanks,
Ludo’.


[-- Attachment #2: gforge.scm --]
[-- Type: text/plain, Size: 1304 bytes --]

(use-modules (guix) (gnu)
             (guix svn-download)
             (guix git-download)
             ((guix swh) #:hide (origin?))
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-26))

(define (gforge? package)
  (define (gforge-string? str)
    (string-contains str "gforge.inria.fr"))

  (match (package-source package)
    ((? origin? o)
     (match (origin-uri o)
       ((? string? url)
        (gforge-string? url))
       (((? string? urls) ...)
        (any gforge-string? urls))                ;or 'find'
       ((? git-reference? ref)
        (gforge-string? (git-reference-url ref)))
       ((? svn-reference? ref)
        (gforge-string? (svn-reference-url ref)))
       (_ #f)))
    (_ #f)))

(define packages-on-gforge
  (fold-packages (lambda (package result)
                   (if (gforge? package)
                       (cons package result)
                       result))
                 '()))

(define archived-source
  (filter (lambda (package)
            (let* ((origin (package-source package))
                   (hash  (origin-hash origin)))
              (lookup-content (content-hash-value hash)
                              (symbol->string
                               (content-hash-algorithm hash)))))
          packages-on-gforge))


^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2021-01-13 10:39   ` Ludovic Courtès
@ 2021-01-13 12:27     ` Andreas Enge
  2021-01-13 15:07     ` Andreas Enge
  1 sibling, 0 replies; 55+ messages in thread
From: Andreas Enge @ 2021-01-13 12:27 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162, Maurice Brémond, Maxim Cournoyer

Hello,

Am Wed, Jan 13, 2021 at 11:39:19AM +0100 schrieb Ludovic Courtès:
> ISL, MPFI, and GMP-ECM haven’t migrated, it seems.  CMH is now at
> <https://gitlab.inria.fr/cmh/cmh> but without its tarballs.
> 
> Andreas, do you happen to know about the status of these?

For CMH, the tarballs are available from its (new) homepage:
   http://www.multiprecision.org/cmh/home.html
I can update the location at the next release, which I should prepare
some time soon (TM).

Concerning MPFI and GMP-ECM, I can ask their respective authors to keep
me updated; I have no doubts they are going to migrate their projects.

For ISL, I do not know.

Andreas

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
       [not found] ` <handler.42162.D42162.16105343699609.notifdone@debbugs.gnu.org>
@ 2021-01-13 14:28   ` Ludovic Courtès
  2021-01-14 14:21     ` Maxim Cournoyer
  2021-10-04 15:59     ` bug#42162: gforge.inria.fr is off-line Ludovic Courtès
  0 siblings, 2 replies; 55+ messages in thread
From: Ludovic Courtès @ 2021-01-13 14:28 UTC (permalink / raw)
  To: 42162; +Cc: Maurice

[-- Attachment #1: Type: text/plain, Size: 240 bytes --]

help-debbugs@gnu.org (GNU bug Tracking System) skribis:

> We can already change Scotch and CMH to ‘git-fetch’ I think.

For Scotch, the ‘v6.1.0’ tag at gitlab.inria.fr provides different
content than the tarball on gforge:


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: the diff --]
[-- Type: text/x-patch, Size: 3000 bytes --]

Nur en /tmp/scotch_6.1.0/: bin
Nur en /tmp/scotch_6.1.0/doc/src/ptscotch: p.ps
Nur en /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout: .gitignore
Nur en /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout: .gitlab-ci.yml
Nur en /tmp/scotch_6.1.0/: include
Nur en /tmp/scotch_6.1.0/: lib
diff -ru /tmp/scotch_6.1.0/src/libscotch/library.h /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotch/library.h
--- /tmp/scotch_6.1.0/src/libscotch/library.h	1970-01-01 01:00:01.000000000 +0100
+++ /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotch/library.h	1970-01-01 01:00:01.000000000 +0100
@@ -67,8 +67,6 @@
 
 /*+ Integer type. +*/
 
-#include <stdint.h>
-
 typedef DUMMYIDX SCOTCH_Idx;
 
 typedef DUMMYINT SCOTCH_Num;
diff -ru /tmp/scotch_6.1.0/src/libscotch/Makefile /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotch/Makefile
--- /tmp/scotch_6.1.0/src/libscotch/Makefile	1970-01-01 01:00:01.000000000 +0100
+++ /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotch/Makefile	1970-01-01 01:00:01.000000000 +0100
@@ -2320,28 +2320,6 @@
 					common.h				\
 					scotch.h
 
-library_graph_diam$(OBJ)	:	library_graph_diam.c			\
-					module.h				\
-					common.h				\
-					graph.h					\
-					scotch.h
-
-library_graph_diam_f$(OBJ)	:	library_graph_diam.c			\
-					module.h				\
-					common.h				\
-					scotch.h
-
-library_graph_induce$(OBJ)	:	library_graph_diam.c			\
-					module.h				\
-					common.h				\
-					graph.h					\
-					scotch.h
-
-library_graph_induce_f$(OBJ)	:	library_graph_diam.c			\
-					module.h				\
-					common.h				\
-					scotch.h
-
 library_graph_io_chac$(OBJ)	:	library_graph_io_chac.c			\
 					module.h				\
 					common.h				\
diff -ru /tmp/scotch_6.1.0/src/libscotchmetis/library_metis.h /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotchmetis/library_metis.h
--- /tmp/scotch_6.1.0/src/libscotchmetis/library_metis.h	1970-01-01 01:00:01.000000000 +0100
+++ /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotchmetis/library_metis.h	1970-01-01 01:00:01.000000000 +0100
@@ -106,7 +106,6 @@
 */
 
 #ifndef SCOTCH_H                                  /* In case "scotch.h" not included before */
-#include <stdint.h>
 typedef DUMMYINT SCOTCH_Num;
 #endif /* SCOTCH_H */
 
diff -ru /tmp/scotch_6.1.0/src/libscotchmetis/library_parmetis.h /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotchmetis/library_parmetis.h
--- /tmp/scotch_6.1.0/src/libscotchmetis/library_parmetis.h	1970-01-01 01:00:01.000000000 +0100
+++ /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotchmetis/library_parmetis.h	1970-01-01 01:00:01.000000000 +0100
@@ -106,7 +106,6 @@
 */
 
 #ifndef SCOTCH_H                                  /* In case "scotch.h" not included before */
-#include <stdint.h>
 typedef DUMMYINT SCOTCH_Num;
 #endif /* SCOTCH_H */
 

[-- Attachment #3: Type: text/plain, Size: 214 bytes --]


There’s not much we can do if upstream isn’t more cautious though.
Perhaps we can still update to the “new” 6.1.0, maybe labeling it
“6.1.0b”?

Attached a tentative patch.

Thanks,
Ludo’.


[-- Attachment #4: Type: text/x-patch, Size: 1753 bytes --]

diff --git a/gnu/packages/maths.scm b/gnu/packages/maths.scm
index 7866bcc6eb..4f8f79052d 100644
--- a/gnu/packages/maths.scm
+++ b/gnu/packages/maths.scm
@@ -12,7 +12,7 @@
 ;;; Copyright © 2015 Fabian Harfert <fhmgufs@web.de>
 ;;; Copyright © 2016 Roel Janssen <roel@gnu.org>
 ;;; Copyright © 2016, 2018, 2020 Kei Kebreau <kkebreau@posteo.net>
-;;; Copyright © 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2016, 2017, 2018, 2019, 2020, 2021 Ludovic Courtès <ludo@gnu.org>
 ;;; Copyright © 2016 Leo Famulari <leo@famulari.name>
 ;;; Copyright © 2016, 2017 Thomas Danckaert <post@thomasdanckaert.be>
 ;;; Copyright © 2017, 2018, 2019, 2020 Paul Garlick <pgarlick@tourbillion-technology.com>
@@ -3083,13 +3083,15 @@ implemented in ANSI C, and MPI for communications.")
   (package
     (name "scotch")
     (version "6.1.0")
-    (source
-     (origin
-      (method url-fetch)
-      (uri (string-append "https://gforge.inria.fr/frs/download.php/"
-                          "latestfile/298/scotch_" version ".tar.gz"))
+    (source (origin
+              (method git-fetch)
+              (uri (git-reference
+                    (url "https://gitlab.inria.fr/scotch/scotch")
+                    (commit (string-append "v" version))))
+              (file-name (git-file-name name version))
               (sha256
-       (base32 "1184fcv4wa2df8szb5lan6pjh0raarr45pk8ilpvbz23naikzg53"))
+               (base32
+                "164jqsy75j7zfnwngj10jc4060shhxni3z8ykklhqjykdrinir55"))
               (patches (search-patches "scotch-build-parallelism.patch"
                                        "scotch-integer-declarations.patch"))))
     (build-system gnu-build-system)

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2021-01-13 10:39   ` Ludovic Courtès
  2021-01-13 12:27     ` Andreas Enge
@ 2021-01-13 15:07     ` Andreas Enge
  1 sibling, 0 replies; 55+ messages in thread
From: Andreas Enge @ 2021-01-13 15:07 UTC (permalink / raw)
  To: Ludovic Courtès
  Cc: 42162, Maurice Brémond, Maxim Cournoyer, andreas.enge

Am Wed, Jan 13, 2021 at 11:39:19AM +0100 schrieb Ludovic Courtès:
> ISL, MPFI, and GMP-ECM haven’t migrated, it seems.

gmp-ecm has migrated to gitlab.inria.fr; I just pushed a commit with an
updated URI. Besides the automatically created gitlab releases with git
snapshots, the maintainer also uploads a release tarball. I chose to use
the latter, which requires to manually update a hash together with the
version number upon a new release.

Andreas

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2021-01-13 14:28   ` Ludovic Courtès
@ 2021-01-14 14:21     ` Maxim Cournoyer
  2021-10-04 15:59     ` bug#42162: gforge.inria.fr is off-line Ludovic Courtès
  1 sibling, 0 replies; 55+ messages in thread
From: Maxim Cournoyer @ 2021-01-14 14:21 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162, Maurice

Hi Ludovic,

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

[...]

> There’s not much we can do if upstream isn’t more cautious though.
> Perhaps we can still update to the “new” 6.1.0, maybe labeling it
> “6.1.0b”?

I'd prefer to append a '-1' revision rather than changing the version
string itself; as that is IMO the business of upstream.

Thanks,

Maxim




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr is off-line
  2021-01-13 14:28   ` Ludovic Courtès
  2021-01-14 14:21     ` Maxim Cournoyer
@ 2021-10-04 15:59     ` Ludovic Courtès
  2021-10-04 17:50       ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 zimoun
  1 sibling, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2021-10-04 15:59 UTC (permalink / raw)
  To: 42162; +Cc: Maurice Brémond, andreas.enge, Maxim Cournoyer

Hi!

Ludovic Courtès <ludovic.courtes@inria.fr> skribis:

> help-debbugs@gnu.org (GNU bug Tracking System) skribis:
>
>> We can already change Scotch and CMH to ‘git-fetch’ I think.
>
> For Scotch, the ‘v6.1.0’ tag at gitlab.inria.fr provides different
> content than the tarball on gforge:

[...]

> There’s not much we can do if upstream isn’t more cautious though.
> Perhaps we can still update to the “new” 6.1.0, maybe labeling it
> “6.1.0b”?
>
> Attached a tentative patch.

Believe it or not, gforge.inria.fr was finally phased out on
Sept. 30th.  And believe it or not, despite all the work and all the
chat :-), we lost the source tarball of Scotch 6.1.1 for a short period
of time (I found a copy and uploaded it to berlin a couple of hours
ago).

Going back to the script at the beginning of this bug report, we get (on
688a4db071736a772e6b5515d7c03fe501c3c15a):

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,pp packages-on-gforge
$2 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:4357 7f08823d8630>
 #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:566 7f088675c630>
 #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:628 7f088675c4d0>
 #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f0881609160>
 #<package scotch-shared@6.1.1 gnu/packages/maths.scm:3732 7f0882964c60>
 #<package pt-scotch32@6.1.1 gnu/packages/maths.scm:3814 7f0882964b00>
 #<package pt-scotch-shared@6.1.1 gnu/packages/maths.scm:3837 7f0882964a50>
 #<package scotch32@6.1.1 gnu/packages/maths.scm:3684 7f0882964d10>
 #<package why3@1.3.3 gnu/packages/maths.scm:6904 7f0882357e70>
 #<package pt-scotch@6.1.1 gnu/packages/maths.scm:3790 7f0882964bb0>
 #<package scotch@6.1.1 gnu/packages/maths.scm:3581 7f0882964dc0>
 #<package isl@0.18 gnu/packages/gcc.scm:1103 7f088161cbb0>
 #<package isl@0.22.1 gnu/packages/gcc.scm:1052 7f088161cc60>
 #<package isl@0.11.1 gnu/packages/gcc.scm:1117 7f088161cb00>
 #<package gf2x@1.2 gnu/packages/algebra.scm:107 7f0880397b00>
 #<package gappa@1.3.5 gnu/packages/algebra.scm:1273 7f08803a76e0>)
scheme@(guile-user)> ,pp archived-source
$3 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:566 7f088675c630>
 #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:628 7f088675c4d0>
 #<package isl@0.18 gnu/packages/gcc.scm:1103 7f088161cbb0>
 #<package isl@0.11.1 gnu/packages/gcc.scm:1117 7f088161cb00>)
scheme@(guile-user)> ,pp (lset-difference eq? packages-on-gforge archived-source)
$4 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:4357 7f08823d8630>
 #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f0881609160>
 #<package scotch-shared@6.1.1 gnu/packages/maths.scm:3732 7f0882964c60>
 #<package pt-scotch32@6.1.1 gnu/packages/maths.scm:3814 7f0882964b00>
 #<package pt-scotch-shared@6.1.1 gnu/packages/maths.scm:3837 7f0882964a50>
 #<package scotch32@6.1.1 gnu/packages/maths.scm:3684 7f0882964d10>
 #<package why3@1.3.3 gnu/packages/maths.scm:6904 7f0882357e70>
 #<package pt-scotch@6.1.1 gnu/packages/maths.scm:3790 7f0882964bb0>
 #<package scotch@6.1.1 gnu/packages/maths.scm:3581 7f0882964dc0>
 #<package isl@0.22.1 gnu/packages/gcc.scm:1052 7f088161cc60>
 #<package gf2x@1.2 gnu/packages/algebra.scm:107 7f0880397b00>
 #<package gappa@1.3.5 gnu/packages/algebra.scm:1273 7f08803a76e0>)
--8<---------------cut here---------------end--------------->8---

All this to say that we must really get our act together with Disarchive
:-), and salvage all these tarballs until then.

There are redirects in place for some of these, but probably not all.

Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2021-10-04 15:59     ` bug#42162: gforge.inria.fr is off-line Ludovic Courtès
@ 2021-10-04 17:50       ` zimoun
  2021-10-07 16:07         ` Ludovic Courtès
  0 siblings, 1 reply; 55+ messages in thread
From: zimoun @ 2021-10-04 17:50 UTC (permalink / raw)
  To: Ludovic Courtès
  Cc: 42162, Maurice Brémond, Maxim Cournoyer, andreas.enge

Hi Ludo,

On Mon, 04 Oct 2021 at 17:59, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> Believe it or not, gforge.inria.fr was finally phased out on
> Sept. 30th.  And believe it or not, despite all the work and all the
> chat :-), we lost the source tarball of Scotch 6.1.1 for a short period
> of time (I found a copy and uploaded it to berlin a couple of hours
> ago).

Euh, I do not understand.  From bug#43442 [1] on Wed, 16 Sep 2020,
Scotch was not missing.  And from [2] neither.

Nah, the hole is the (double) update (from 6.0.6 to 6.1.0 then 6.1.1)
without manually taking care of this bug report; by switching from
url-fetch to git-fetch for instance.  Somehow, it was bounded to happen
because we lack automatic tools despite the fact they are there.

Indeed, hard to believe. :-)

As I am asking in this thread [3], the Guix project has the ressource,
storage speaking, to archive these tarballs -- waiting a robust
long-term automatic system.  But we (the Guix projet) cannot because we
duplicate the effort on keeping twice all the build outputs.  Somehow,
between Berlin and Bordeaux, coherent policies for conservancy are
missing. IMHO.

1: <http://issues.guix.gnu.org/issue/43442>
2: <http://issues.guix.gnu.org/issue/42162#0>
3: <https://lists.gnu.org/archive/html/guix-devel/2021-09/msg00174.html>

> All this to say that we must really get our act together with Disarchive
> :-), and salvage all these tarballs until then.

Definetly!  We are witnesssing missing tarballs here.  But many more
could be missing from Berlin or Bordeaux and also upstream should have
disappeared.

Cheers,
simon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2021-10-04 17:50       ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 zimoun
@ 2021-10-07 16:07         ` Ludovic Courtès
  2021-10-09 17:29           ` raingloom
  2021-10-11  8:41           ` zimoun
  0 siblings, 2 replies; 55+ messages in thread
From: Ludovic Courtès @ 2021-10-07 16:07 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Maurice Brémond, Maxim Cournoyer, andreas.enge

Hi!

zimoun <zimon.toutoune@gmail.com> skribis:

> Euh, I do not understand.  From bug#43442 [1] on Wed, 16 Sep 2020,
> Scotch was not missing.  And from [2] neither.
>
> Nah, the hole is the (double) update (from 6.0.6 to 6.1.0 then 6.1.1)
> without manually taking care of this bug report; by switching from
> url-fetch to git-fetch for instance.  Somehow, it was bounded to happen
> because we lack automatic tools despite the fact they are there.
>
> Indeed, hard to believe. :-)

I guess, in our mind, the problem was fixed long ago.  :-)

> As I am asking in this thread [3], the Guix project has the ressource,
> storage speaking, to archive these tarballs -- waiting a robust
> long-term automatic system.  But we (the Guix projet) cannot because we
> duplicate the effort on keeping twice all the build outputs.  Somehow,
> between Berlin and Bordeaux, coherent policies for conservancy are
> missing. IMHO.

So I think we’re lucky that we can try different solutions at once.

The best solution is the one that won’t rely solely on the Guix project:
SWH + Disarchive.  We’re getting there!

The second-best solution is to improve our tooling so we can actually
keep source code in a more controlled way.  That’s what I had in mind
with <https://ci.guix.gnu.org/jobset/source>.  We have storage space for
that on berlin, but it’s not infinite.

Another approach is to use ‘git-fetch’ more, at least for non-Autotools
packages (that’s the case for Scotch, for instance.)

So we can do all these things, and we’ll have to push hard to get the
Disarchive option past the finish line because it’s the most promising
long-term.

Thanks,
Ludo’.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2021-10-07 16:07         ` Ludovic Courtès
@ 2021-10-09 17:29           ` raingloom
  2021-10-11  8:41           ` zimoun
  1 sibling, 0 replies; 55+ messages in thread
From: raingloom @ 2021-10-09 17:29 UTC (permalink / raw)
  To: 42162

On Thu, 07 Oct 2021 18:07:16 +0200
Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> Hi!
> 
> zimoun <zimon.toutoune@gmail.com> skribis:
> 
> > Euh, I do not understand.  From bug#43442 [1] on Wed, 16 Sep 2020,
> > Scotch was not missing.  And from [2] neither.
> >
> > Nah, the hole is the (double) update (from 6.0.6 to 6.1.0 then
> > 6.1.1) without manually taking care of this bug report; by
> > switching from url-fetch to git-fetch for instance.  Somehow, it
> > was bounded to happen because we lack automatic tools despite the
> > fact they are there.
> >
> > Indeed, hard to believe. :-)  
> 
> I guess, in our mind, the problem was fixed long ago.  :-)
> 
> > As I am asking in this thread [3], the Guix project has the
> > ressource, storage speaking, to archive these tarballs -- waiting a
> > robust long-term automatic system.  But we (the Guix projet) cannot
> > because we duplicate the effort on keeping twice all the build
> > outputs.  Somehow, between Berlin and Bordeaux, coherent policies
> > for conservancy are missing. IMHO.  
> 
> So I think we’re lucky that we can try different solutions at once.
> 
> The best solution is the one that won’t rely solely on the Guix
> project: SWH + Disarchive.  We’re getting there!
> 
> The second-best solution is to improve our tooling so we can actually
> keep source code in a more controlled way.  That’s what I had in mind
> with <https://ci.guix.gnu.org/jobset/source>.  We have storage space
> for that on berlin, but it’s not infinite.
> 
> Another approach is to use ‘git-fetch’ more, at least for
> non-Autotools packages (that’s the case for Scotch, for instance.)

Out of curiosity, why only non-autotools?




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2021-10-07 16:07         ` Ludovic Courtès
  2021-10-09 17:29           ` raingloom
@ 2021-10-11  8:41           ` zimoun
  2021-10-12  9:24             ` Ludovic Courtès
  1 sibling, 1 reply; 55+ messages in thread
From: zimoun @ 2021-10-11  8:41 UTC (permalink / raw)
  To: Ludovic Courtès
  Cc: 42162, Maurice Brémond, Maxim Cournoyer, andreas.enge

Hi,

On Thu, 7 Oct 2021 at 18:07, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> I guess, in our mind, the problem was fixed long ago.  :-)

Yes, to me the 2 remaining packages was from
<http://issues.guix.gnu.org/43442#0> but moved already to Gitlab.
Whatever, :-)

> > As I am asking in this thread [3], the Guix project has the ressource,
> > storage speaking, to archive these tarballs -- waiting a robust
> > long-term automatic system.  But we (the Guix projet) cannot because we
> > duplicate the effort on keeping twice all the build outputs.  Somehow,
> > between Berlin and Bordeaux, coherent policies for conservancy are
> > missing. IMHO.
>
> So I think we’re lucky that we can try different solutions at once.

Well, it is not what I am observing.  Anyway. :-)

> The best solution is the one that won’t rely solely on the Guix project:
> SWH + Disarchive.  We’re getting there!

Yes.  Although, it is hard to define "the Guix project". :-)
Well, the remaining question is where to set the Disarchive
database... but hardware could be floating around once it is ready.
;-)

> The second-best solution is to improve our tooling so we can actually
> keep source code in a more controlled way.  That’s what I had in mind
> with <https://ci.guix.gnu.org/jobset/source>.  We have storage space for
> that on berlin, but it’s not infinite.

If Berlin has space, why so much derivations are missing when running
time-machine?

Well, aside the implementation that ci.guix.gnu.org fetches from repo
every X minutes, i.e., drops all the commits (and the associated
derivations) pushed in the meantime.  And that bordeaux.guix.gnu.org
fetches from guix-commits the commit batch, i.e., builds only one
commit of this batch.

> Another approach is to use ‘git-fetch’ more, at least for non-Autotools
> packages (that’s the case for Scotch, for instance.)

This is what I suggested when opening this thread [1] more than one
year ago.  Reading the discussion and keeping in mind the inertia, I
do not think it is a viable path.  For instance, you know all the
pitfalls and you updated Scotch without switching to git-fetch -- no
criticism :-)  just a realistic matter of facts to have good coverage.

<https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html>

> So we can do all these things, and we’ll have to push hard to get the
> Disarchive option past the finish line because it’s the most promising
> long-term.

Agree.  Even, I think it is the only long-term option. :-)

All the best,
simon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2021-10-11  8:41           ` zimoun
@ 2021-10-12  9:24             ` Ludovic Courtès
  2021-10-12 10:50               ` zimoun
  0 siblings, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2021-10-12  9:24 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Maurice Brémond, Maxim Cournoyer, andreas.enge

Hello!

I sense a lot of impatience in your message :-), and I also see many
questions.  It is up to us all to answer them, I’ll just reply
selectively here.

zimoun <zimon.toutoune@gmail.com> skribis:

> On Thu, 7 Oct 2021 at 18:07, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

[...]

>> The second-best solution is to improve our tooling so we can actually
>> keep source code in a more controlled way.  That’s what I had in mind
>> with <https://ci.guix.gnu.org/jobset/source>.  We have storage space for
>> that on berlin, but it’s not infinite.
>
> If Berlin has space, why so much derivations are missing when running
> time-machine?

That’s not related to the question at hand, but it would be worth
investigating, first by trying to quantify that.

For the record, the ‘guix publish’ config on berlin is here:

  https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/modules/sysadmin/services.scm#n485

If I read that correctly, nars have a TTL of 180 days (this is the time
a nar is retained after the last time it has been requested, so it’s a
lower bound.)

>> Another approach is to use ‘git-fetch’ more, at least for non-Autotools
>> packages (that’s the case for Scotch, for instance.)
>
> This is what I suggested when opening this thread [1] more than one
> year ago.  Reading the discussion and keeping in mind the inertia, I
> do not think it is a viable path.  For instance, you know all the
> pitfalls and you updated Scotch without switching to git-fetch -- no
> criticism :-)  just a realistic matter of facts to have good coverage.
>
> <https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html>

Right, and I agree Scotch is a package that can definitely use
‘git-fetch’ (there are bootstrapping considerations of packages low in
the stack, for instance you wouldn’t want to have Git fetched over
‘git-fetch’, but for packages like this there’s no reason not to use
‘git-fetch’.)

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2021-10-12  9:24             ` Ludovic Courtès
@ 2021-10-12 10:50               ` zimoun
  2021-10-12 16:04                 ` Substitute retention Ludovic Courtès
  0 siblings, 1 reply; 55+ messages in thread
From: zimoun @ 2021-10-12 10:50 UTC (permalink / raw)
  To: Ludovic Courtès
  Cc: 42162, Maurice Brémond, Maxim Cournoyer, andreas.enge

Hi Ludo,

On Tue, 12 Oct 2021 at 11:24, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> I sense a lot of impatience in your message :-), and I also see many
> questions.  It is up to us all to answer them, I’ll just reply
> selectively here.

Impatience?  Probably. :-)

> zimoun <zimon.toutoune@gmail.com> skribis:
>> On Thu, 7 Oct 2021 at 18:07, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
>
> [...]
>
>>> The second-best solution is to improve our tooling so we can actually
>>> keep source code in a more controlled way.  That’s what I had in mind
>>> with <https://ci.guix.gnu.org/jobset/source>.  We have storage space for
>>> that on berlin, but it’s not infinite.
>>
>> If Berlin has space, why so much derivations are missing when running
>> time-machine?
>
> That’s not related to the question at hand, but it would be worth
> investigating, first by trying to quantify that.

The question seems related. :-)  Because you are saying “we have storage
space for that on Berlin”…

> For the record, the ‘guix publish’ config on berlin is here:
>
>   https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/modules/sysadmin/services.scm#n485
>
> If I read that correctly, nars have a TTL of 180 days (this is the time
> a nar is retained after the last time it has been requested, so it’s a
> lower bound.)

…and the NARs are more or less removed after 180 days if no one asked
for them during these 180 days, IIUC.  This policy seems to keep under
control the size of the storage, I guess.  And I provide an annoying
example of such policy. :-)

Anyway, I agree it is not, for now, the core of the question at
hand. :-)

About quantifying, it is clearly not related to the question at
hand. ;-)

Just for the record, a back to envelope computations.  180 days before
today was April 15th (M-x calendar C-u 180 C-b).  It means 6996 commits
(35aaf1fe10 is my current last commit).

    git log --format="%cd" --after=2021-04-15 | wc -l
    6996

However, these commits are pushed by batch.  Roughly, it reads:

    git log --format="%cd" --after=2021-04-15 --date=unix \
        | awk 'NR == 1{old= $1; next}{print old - $1; old = $1}' \
        | sort -n | uniq -c | grep -e "0$" | head
          1 -1542620
       3388 0
         14 10
          6 20
          5 30
          2 40
          4 50
          1 60
          2 70
          2 80

(Take the ’awk’ with care, I am not sure of what I am doing. :-)  And,
it is rough because timezone etc.)

Other said 3388/6996= ~50% of commits are pushed at the same time, i.e.,
missed by both build farms using 2 different strategies to collect the
thing to build (fetch every 5 minutes or fetch from guix-commits).  It
is a quick back to envelope so keep that with some salt. :-)

On that number, after 180 days (6 months), it is hard to evaluate the
rate of the time-machine queries.  And from my experience (no number to
back), running time-machine on a commit older than this 180 days implies
to build derivations.  Or it is a lucky day. :-)

Drifting, right?  Let focus on the question at hand.   However, this
question of long-term policy asked at:

<https://lists.gnu.org/archive/html/guix-devel/2021-09/msg00215.html>

appears to me worth. :-)

Cheers,
simon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Substitute retention
  2021-10-12 10:50               ` zimoun
@ 2021-10-12 16:04                 ` Ludovic Courtès
  2021-10-12 18:06                   ` zimoun
  0 siblings, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2021-10-12 16:04 UTC (permalink / raw)
  To: zimoun; +Cc: guix-devel

Hi!

(Moving to guix-devel from <https://issues.guix.gnu.org/42162#43>.)

zimoun <zimon.toutoune@gmail.com> skribis:

>> For the record, the ‘guix publish’ config on berlin is here:
>>
>>   https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/modules/sysadmin/services.scm#n485
>>
>> If I read that correctly, nars have a TTL of 180 days (this is the time
>> a nar is retained after the last time it has been requested, so it’s a
>> lower bound.)

[...]

> Just for the record, a back to envelope computations.  180 days before
> today was April 15th (M-x calendar C-u 180 C-b).  It means 6996 commits
> (35aaf1fe10 is my current last commit).
>
>     git log --format="%cd" --after=2021-04-15 | wc -l
>     6996
>
> However, these commits are pushed by batch.  Roughly, it reads:
>
>     git log --format="%cd" --after=2021-04-15 --date=unix \
>         | awk 'NR == 1{old= $1; next}{print old - $1; old = $1}' \
>         | sort -n | uniq -c | grep -e "0$" | head
>           1 -1542620
>        3388 0
>          14 10
>           6 20
>           5 30
>           2 40
>           4 50
>           1 60
>           2 70
>           2 80
>
> (Take the ’awk’ with care, I am not sure of what I am doing. :-)  And,
> it is rough because timezone etc.)
>
> Other said 3388/6996= ~50% of commits are pushed at the same time, i.e.,
> missed by both build farms using 2 different strategies to collect the
> thing to build (fetch every 5 minutes or fetch from guix-commits).  It
> is a quick back to envelope so keep that with some salt. :-)

OK.

> On that number, after 180 days (6 months), it is hard to evaluate the
> rate of the time-machine queries.  And from my experience (no number to
> back), running time-machine on a commit older than this 180 days implies
> to build derivations.  Or it is a lucky day. :-)

Right.

So what can we do to address this issue?  I *think* we could use a
higher TTL on berlin, and we can try that right away (9 months to being
with?).

However, there is an upper bound anyway.  To make informed decisions on
the retention policy, we should monitor storage space on berlin/bayfront
to better estimate what can be done.  We have Zabbix but it’s not
accessible from the outside; maybe we could graph storage space
somewhere so people can grab the data and work on those estimates?

What if we decide that we need to provide substitutes for 2y old
commits?  In that case, we need a plan to scale up.  That could be
renting storage space somewhere.  That’s largely non-technical work that
needs attention.

There are also technical tweaks that could help: distinguishing between
“important” substitutes that we want to keep, and less important
substitutes (how?); identifying “equivalence classes” for builds of a
given package; etc.  The outcome is unclear and it’ll take time.

Thoughts?

Ludo’.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Substitute retention
  2021-10-12 16:04                 ` Substitute retention Ludovic Courtès
@ 2021-10-12 18:06                   ` zimoun
  2021-10-15  9:27                     ` Ludovic Courtès
  0 siblings, 1 reply; 55+ messages in thread
From: zimoun @ 2021-10-12 18:06 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel

Hey,

On Tue, 12 Oct 2021 at 18:04, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> (Moving to guix-devel from <https://issues.guix.gnu.org/42162#43>.)

I was preparing a report.  You have been faster than me. :-)

Two questions are raising, IIUC:

 1. the “modular” derivations for all the commits
 2. long-term support for substitutes

>> Other said 3388/6996= ~50% of commits are pushed at the same time, i.e.,
>> missed by both build farms using 2 different strategies to collect the
>> thing to build (fetch every 5 minutes or fetch from guix-commits).  It
>> is a quick back to envelope so keep that with some salt. :-)
>
> OK.

To make it explicit of #1, I was talking about the “modular” Guix, i.e.,
when running “guix pull” or “guix time-machine” it leads to build the
derivations module-import.drv, guix-<hash>.drv, guix-command.drv,
guix-module-union.drv, guix-<hash>-modules.drv,
guix-packages-modules.drv, guix-system-tests-modules.drv,
guix-packages-base-modules.drv, etc.  On slow machines, it can be
unpleasant; not to say unpractical.  Even for recent commits.

The recent addition of ’channel-with-substitutes-available’ helps when
going forward (guix pull) if the build farm does not have yet these.

The issue is going backward (guix time-machine).

Basically, commit 59d10c3112 is from March 14, 2020 and it takes ~29min
on my slow laptop.  And to compare apple to apple, let take another
commit one year later from March 14, 2021, e.g., commit 7327295462.  It
takes ~5min on the same machine.

--8<---------------cut here---------------start------------->8---
$ time guix time-machine --commit=59d10c3112 -- --version
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
The following derivations will be built:
   /gnu/store/zvy89f9xb53fbqvfrm7lql8mbfrsfk1b-compute-guix-derivation.drv
   /gnu/store/7y80kn1bypnbm869hvcq8841mr6nqvfm-module-import-compiled.drv
   /gnu/store/amwvgaf45722k6jn4r39983zsgmbyp2g-module-import.drv
   /gnu/store/h3h0qfiw5100zkwfb919r7vn0q06ksqy-config.scm.drv
   /gnu/store/jkwhdilsbxb18hx6gi4i2rj0v06mfbab-module-import.drv
   /gnu/store/sixfy4sazai667n99pxa5h7wzzaabw79-module-import-compiled.drv

[...]

Computing Guix derivation for 'x86_64-linux'...  WARNING: (guix build emacs-build-system): imported module (guix build utils) overrides core binding `delete'
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%

[...]

substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
The following derivations will be built:
   /gnu/store/50ymxym19h8whzg3ajcl6kdjmq3p7qrg-profile.drv
   /gnu/store/l0znp7g83lbylbv97nd3ahz8rnrvxfrf-guix-59d10c311.drv
   /gnu/store/bvxrnp8bydl3zsbcg7j8j7m0qfygdhfs-guix-command.drv
   /gnu/store/xmsciylxx1j7nbry5cv1lm7595a8rilr-guix-module-union.drv
   /gnu/store/yvkw65kv9bhvx41750dchxw98qqxv64b-guix-59d10c311-modules.drv
   /gnu/store/6hrn3bpvcg8571mckzcj519xf2kqn2sl-guix-packages-base-modules.drv
   /gnu/store/nwl8z8cx9pdwn0sx5i5j5mp0bkdi64mm-guix-packages-base.drv
   /gnu/store/0j0271nbm2l526m5xs7zpd686qqrjz7w-guix-core.drv
   /gnu/store/2134jzhhckh871vlscw6dmwqqhny9zxg-guix-core-source.drv
   /gnu/store/mhik7ggrf4z5f38nsg7g0gbijm916b98-config.scm.drv
   /gnu/store/53zlyv96nrrm4h5ns7nmmndj8jys38f7-guix-extra.drv
   /gnu/store/7dn3f27i48jp6zvlwanzk8mfy828k6cm-guix-config-modules.drv
   /gnu/store/6xy3yyvjqvjyrlrwk1lzs12knbvilqpy-guix-config-source.drv
   /gnu/store/15ihwkqzyz4r4b4rppb92qcawha6a7p7-config.scm.drv
   /gnu/store/cacawv4yib8pa2ajzw0kyaihgym72mww-guix-config.drv
   /gnu/store/baivfv20hzm799v0wvdrcfaimh4aw22a-guix-extra-modules.drv
   /gnu/store/cxpkd0jkxapzbmg5vfmn4fy30yd7vlhm-guix-core-modules.drv
   /gnu/store/g3nqbybggh7dc2qd9gkj7swfmgmiigpp-guix-packages-modules.drv
   /gnu/store/nq9mzr00ny3nrsldvcq9r4va4fhb26sq-guix-packages.drv
   /gnu/store/i9dfjh5mf3r83447x7fa75hv1hnp9myv-guix-cli-modules.drv
   /gnu/store/xdsld8rhrawgngv93qx6lk9cgmql908c-guix-cli.drv
   /gnu/store/qnzm0a1gcwnfvkii7vk93rimzzp3mcf9-guix-system.drv
   /gnu/store/x2g2s3rhw0bv9qdp21yghgl2sij15dkr-guix-system-modules.drv
   /gnu/store/zandkjnylznvdj8jfsfgssvmvs6jfyph-guix-system-tests-modules.drv
   /gnu/store/45bs7y1h3gx2m7qry6r621klgkmv47wl-guix-system-tests.drv
   /gnu/store/caj7qjxvhrksk3jrkpsxqnx4kg7mlj9d-guix-daemon.drv
   /gnu/store/mah24wyy6bd51c27ww45hsqnjxhcn0yx-guix-manual.drv
   /gnu/store/002i672yl1192x7wvhkdbih94qffmcdk-guix-translated-texinfo.drv
   /gnu/store/5sf9aalvic81qlm06lq3a1pwdb2b3bm0-inferior-script.scm.drv
   /gnu/store/k612gnziiy3hn2dnrj33w4mw84kcnynm-profile.drv

11.7 MB will be downloaded

[...]

real	29m14.588s
user	0m56.126s
sys	0m1.032s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
$ time guix time-machine --commit=7327295462 -- --version
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%

[...]

building /gnu/store/6xq8vxpl51l8b3fz6sxpyspa1w5chbk9-module-import.drv...
 module-import  2KiB                                           43KiB/s 00:00 [##################] 100.0%

 module-import-compiled  1.5MiB                               607KiB/s 00:03 [##################] 100.0%

 module-import-compiled  1.5MiB                               606KiB/s 00:03 [##################] 100.0%

building /gnu/store/pbv19dhrlqr2lnzphmydn4zrrdccghf2-compute-guix-derivation.drv...
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
@ substituter-started /gnu/store/fbn395nfpbp4d4fr6jsbmwcx6n10kg16-python-minimal-3.8.2 substitute
@ download-started /gnu/store/fbn395nfpbp4d4fr6jsbmwcx6n10kg16-python-minimal-3.8.2 https://ci.guix.gnu.org/nar/lzip/fbn395nfpbp4d4fr6jsbmwcx6n10kg16-python-minimal-3
[...]
@ substituter-succeeded /gnu/store/fx0cdzzppd8jc09sianbq6gl1h7mxx3x-zziplib-0.13.72

substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
The following derivations will be built:
   /gnu/store/8kmb40b2r3cx5zxpcrwa73a4lkaxjd9l-profile.drv
   /gnu/store/la6c7m2b6izy22vv8xpyvpz1ajyq72br-profile.drv
   /gnu/store/nljv2wnw0wqkyk0am8722gdwah3b0cx2-guix-732729546.drv
   /gnu/store/bmrx03y52d8dhhcpyf9i8j4zn2fg7pip-guix-command.drv
   /gnu/store/pg26n125c9bmvk4lxxp9ssd9havk89wc-guix-module-union.drv
   /gnu/store/mp0c2ad09axrq8zwhh3ycfk4a2mrgvm2-guix-732729546-modules.drv
   /gnu/store/0gx5jr1vgkdm5ajfcscladhzjx2gz5l2-guix-system-modules.drv
   /gnu/store/1sqwx35rn2qinlzib74zfjanxzzgmza3-guix-packages-base-modules.drv
   /gnu/store/6d64jyviixsjvfjgpxv3lyzq7l35y0f3-guix-config-modules.drv
   /gnu/store/5v24gh1pbqy6jzyl9x52wzxa6qprwl6v-guix-config-source.drv
   /gnu/store/pv6rskbpg8rzq3wi92m3iw7a2524r994-config.scm.drv
   /gnu/store/i6ncmiw6agrlq290drkg94scn12sv4v8-guix-config.drv
   /gnu/store/6l3l0qsq280pwmrnb0z2dfiyc7g5ff5i-guix-extra-modules.drv
   /gnu/store/hdy7q3xm8jalssc6jim2fk1wqgxqbfm9-guix-system-tests-modules.drv
   /gnu/store/ki86v53288fnx8sih3zlfi9qnjx2lzay-guix-packages-modules.drv
   /gnu/store/l63wxbqaq98nfjkq2cyshb6q8rjxjd6h-guix-core-modules.drv
   /gnu/store/b1qky9s98cfgp88xan9pqg0k9k0rlzrm-guix-core-source.drv
   /gnu/store/wmfssg7yyz2hrwanash7yk8f86faghlf-guix-cli-modules.drv
   /gnu/store/z1xxvp4hmffrpmbl0ll5y87w5pyfma9l-guix-daemon.drv
   /gnu/store/ms6pkrkggd0rl4fh5hfh20gcva7ryip5-inferior-script.scm.drv

28.7 MB will be downloaded

[...]

real	5m56.451s
user	3m52.055s
sys	0m1.738s
--8<---------------cut here---------------end--------------->8---

To be on the same wavelength,

--8<---------------cut here---------------start------------->8---
$ git log --format="%h %cd" --after=2021-03-14 --reverse | head -n16
[...]
2babf7d831 Sun Mar 14 19:16:55 2021 +0100
b15720182e Sun Mar 14 13:24:21 2021 -0500
207aa62e6b Sun Mar 14 13:24:21 2021 -0500
30f5381487 Sun Mar 14 13:24:21 2021 -0500
af25357b7d Sun Mar 14 13:24:21 2021 -0500
7164d2105a Sun Mar 14 13:24:21 2021 -0500
078f3288e2 Sun Mar 14 13:24:21 2021 -0500
5a31eb7d35 Sun Mar 14 13:24:21 2021 -0500
620206b680 Sun Mar 14 13:24:22 2021 -0500
b76762a9b7 Sun Mar 14 13:24:22 2021 -0500
cbfcbb79df Sun Mar 14 19:43:35 2021 +0100
--8<---------------cut here---------------end--------------->8---

and Cuirass builds only one of b15720182e, 207aa62e6b, 30f5381487,
af25357b7d, 7164d2105a, 078f3288e2, 5a31eb7d35, 620206b680 or
b76762a9b7.

Considering the Build Coordinator, it uses guix-commits and from my
understanding it reads:

<https://lists.gnu.org/archive/html/guix-commits/2021-03/msg01201.html>

therefore, b15720182e would be missed but not b76762a9b7–which would be
missed by Cuirass.

Cuirass and the Build Coordinator cannot each build the both commits
b15720182e and b76762a9b7.

Cuirass check every 5 minutes and Build Coordinator reads “state” from
guix-commits.  Other said, none of them builds all these “modular”
derivations for all the commits; even for recent commits.

The rough estimate is half of commits are missed by both build farms.
Therefore, using “guix time-machine” with a random commit and one gets
1/2 probability to build something just to get the inferior – aside the
TTL policy.

(It is mitigated because the both build farms use different strategies
and thus they do not miss the same commits. \o/)

>> On that number, after 180 days (6 months), it is hard to evaluate the
>> rate of the time-machine queries.  And from my experience (no number to
>> back), running time-machine on a commit older than this 180 days implies
>> to build derivations.  Or it is a lucky day. :-)
>
> Right.
>
> So what can we do to address this issue?  I *think* we could use a
> higher TTL on berlin, and we can try that right away (9 months to being
> with?).

I *think* the issue is not TTL for question #1.  :-) But the issue that
the both build farms do not build these “modular” derivations for all
the commits.  Here, I am focused on x86_64-linux which is the case of
interest for such topic (scientific context), IMHO.

Considering to build for every commit for all architectures is not
affordable.

I agree that increasing the TTL will help for question #2 about
long-support of substitutes.

> However, there is an upper bound anyway.  To make informed decisions on
> the retention policy, we should monitor storage space on berlin/bayfront
> to better estimate what can be done.  We have Zabbix but it’s not
> accessible from the outside; maybe we could graph storage space
> somewhere so people can grab the data and work on those estimates?

Based on the size of these derivations for one commit, we could
extrapolate back to envelope.  Well, question #1 seems doable
storage-speaking.

The issue of #1 is to build these derivations for all the commits.
IMHO.

About #2, yeah if some data are available, I can try to make some
estimates.

Well, #1 seems actionable.  However, #2 raises…

> What if we decide that we need to provide substitutes for 2y old
> commits?  In that case, we need a plan to scale up.  That could be
> renting storage space somewhere.  That’s largely non-technical work that
> needs attention.

…a strong question. :-) What do “we” do for what “we” build?

Indeed, numbers are missing to make informed decisions on long-term
storage of substitutes.  What is Nix doing?

> There are also technical tweaks that could help: distinguishing between
> “important” substitutes that we want to keep, and less important
> substitutes (how?); identifying “equivalence classes” for builds of a
> given package; etc.  The outcome is unclear and it’ll take time.

I agree it will take time. :-)

I think that having 2 build farms building in parallel is a strength.
So let exploit it. :-) What one could have in mind is to challenge the
outputs; if they are identical, let keep only one version “somewhere”
and remove the other from the “elsewhere”.

For instance, we (I? with help) could resume this discussion:

<https://lists.gnu.org/archive/html/guix-devel/2020-10/msg00181.html>

Or maybe, for the identical outputs, one could imagine (dream? for) a
cooking service for missing outputs.  Well, I do not know how this is
actionable. :-)

Cheers,
simon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Substitute retention
  2021-10-12 18:06                   ` zimoun
@ 2021-10-15  9:27                     ` Ludovic Courtès
  0 siblings, 0 replies; 55+ messages in thread
From: Ludovic Courtès @ 2021-10-15  9:27 UTC (permalink / raw)
  To: zimoun; +Cc: guix-devel

Hi!

zimoun <zimon.toutoune@gmail.com> skribis:

>>> missed by both build farms using 2 different strategies to collect the
>>> thing to build (fetch every 5 minutes or fetch from guix-commits).  It
>>> is a quick back to envelope so keep that with some salt. :-)
>>
>> OK.
>
> To make it explicit of #1, I was talking about the “modular” Guix, i.e.,
> when running “guix pull” or “guix time-machine” it leads to build the
> derivations module-import.drv, guix-<hash>.drv, guix-command.drv,
> guix-module-union.drv, guix-<hash>-modules.drv,
> guix-packages-modules.drv, guix-system-tests-modules.drv,
> guix-packages-base-modules.drv, etc.  On slow machines, it can be
> unpleasant; not to say unpractical.  Even for recent commits.

Ah I see.  Yeah, this can be kinda annoying, and amplified by the fact
that CI only builds at each push, not at each commit.

That said, this is mitigated by the fact that one typically travels to a
previously-fetched commit, which is a commit that has been built by CI
rather than a commit in between two pushes.

> Basically, commit 59d10c3112 is from March 14, 2020 and it takes ~29min
> on my slow laptop.  And to compare apple to apple, let take another
> commit one year later from March 14, 2021, e.g., commit 7327295462.  It
> takes ~5min on the same machine.

Yeah, OK.

> To be on the same wavelength,
>
> $ git log --format="%h %cd" --after=2021-03-14 --reverse | head -n16
> [...]
> 2babf7d831 Sun Mar 14 19:16:55 2021 +0100
> b15720182e Sun Mar 14 13:24:21 2021 -0500
> 207aa62e6b Sun Mar 14 13:24:21 2021 -0500
> 30f5381487 Sun Mar 14 13:24:21 2021 -0500
> af25357b7d Sun Mar 14 13:24:21 2021 -0500
> 7164d2105a Sun Mar 14 13:24:21 2021 -0500
> 078f3288e2 Sun Mar 14 13:24:21 2021 -0500
> 5a31eb7d35 Sun Mar 14 13:24:21 2021 -0500
> 620206b680 Sun Mar 14 13:24:22 2021 -0500
> b76762a9b7 Sun Mar 14 13:24:22 2021 -0500
> cbfcbb79df Sun Mar 14 19:43:35 2021 +0100
>
> and Cuirass builds only one of b15720182e, 207aa62e6b, 30f5381487,
> af25357b7d, 7164d2105a, 078f3288e2, 5a31eb7d35, 620206b680 or
> b76762a9b7.
>
> Considering the Build Coordinator, it uses guix-commits and from my
> understanding it reads:
>
> <https://lists.gnu.org/archive/html/guix-commits/2021-03/msg01201.html>
>
> therefore, b15720182e would be missed but not b76762a9b7–which would be
> missed by Cuirass.
>
> Cuirass and the Build Coordinator cannot each build the both commits
> b15720182e and b76762a9b7.
>
> Cuirass check every 5 minutes and Build Coordinator reads “state” from
> guix-commits.  Other said, none of them builds all these “modular”
> derivations for all the commits; even for recent commits.
>
> The rough estimate is half of commits are missed by both build farms.
> Therefore, using “guix time-machine” with a random commit and one gets
> 1/2 probability to build something just to get the inferior – aside the
> TTL policy.

Right.  Not every derivation produced by (guix self) needs to be rebuilt
in between two commits, but anything that depends on *package-modules*
typically has to be rebuilt.

We can reduce the amount of rebuilt like I did in commit
abd38dcee16f0ac71191527c38dcd3659111e2ba, but you’ll always have the big
(gnu packages …) derivation.

>> So what can we do to address this issue?  I *think* we could use a
>> higher TTL on berlin, and we can try that right away (9 months to being
>> with?).
>
> I *think* the issue is not TTL for question #1.  :-) But the issue that
> the both build farms do not build these “modular” derivations for all
> the commits.  Here, I am focused on x86_64-linux which is the case of
> interest for such topic (scientific context), IMHO.
>
> Considering to build for every commit for all architectures is not
> affordable.
>
> I agree that increasing the TTL will help for question #2 about
> long-support of substitutes.

Understood!

>> However, there is an upper bound anyway.  To make informed decisions on
>> the retention policy, we should monitor storage space on berlin/bayfront
>> to better estimate what can be done.  We have Zabbix but it’s not
>> accessible from the outside; maybe we could graph storage space
>> somewhere so people can grab the data and work on those estimates?
>
> Based on the size of these derivations for one commit, we could
> extrapolate back to envelope.  Well, question #1 seems doable
> storage-speaking.
>
> The issue of #1 is to build these derivations for all the commits.
> IMHO.
>
> About #2, yeah if some data are available, I can try to make some
> estimates.
>
>
> Well, #1 seems actionable.  However, #2 raises…
>
>> What if we decide that we need to provide substitutes for 2y old
>> commits?  In that case, we need a plan to scale up.  That could be
>> renting storage space somewhere.  That’s largely non-technical work that
>> needs attention.
>
> …a strong question. :-) What do “we” do for what “we” build?
>
> Indeed, numbers are missing to make informed decisions on long-term
> storage of substitutes.  What is Nix doing?

Nix, AFAIK, is doing like everyone else: pouring money on Amazon.  Last
I heard they’d retain substitutes basically indefinitely on Amazon S3
(incidentally, one motivation for them to work with Software Heritage,
AIUI, is that it would allow them to store less data on the storage they
pay for :-)).

For the record, berlin (aka ci.guix.gnu.org; it was donated by the Max
Delbrück Center, MDC, and is generously hosted by them) has a 37 TiB
disk for /gnu/store and “baked” substitutes.  That’s a lot.

Technically though, a lot of it is used by less important substitutes
such as disk images or intermediate ‘core-updates’ substitutes.

In the end we seem to be filling it more quickly than you’d think!

Perhaps we need a better strategy with a low TTL for, say, intermediate
‘core-updates’ substitutes (no need to keep them more than a few weeks
if we know we’re doing a world rebuild right after).  It cannot be done
as things are though because ‘guix publish’ doesn’t distinguish between
store items.

Or we could restart the Amazon front-end that Chris Marusich had set up
right before 1.0 was released.  Or we could build our own front-end for
substitute delivery as a proxy to berlin, thereby distributing the
burden.

Thoughts?

> I think that having 2 build farms building in parallel is a strength.
> So let exploit it. :-) What one could have in mind is to challenge the
> outputs; if they are identical, let keep only one version “somewhere”
> and remove the other from the “elsewhere”.
>
> For instance, we (I? with help) could resume this discussion:
>
> <https://lists.gnu.org/archive/html/guix-devel/2020-10/msg00181.html>

I hadn’t seen this message, interesting!

Note however that bordeaux.guix has a tenth of the storage space of
berlin (3.6 TiB), so right now we probably can’t count on it for
long-term substitute storage.

> Or maybe, for the identical outputs, one could imagine (dream? for) a
> cooking service for missing outputs.  Well, I do not know how this is
> actionable. :-)

Well, if we keep .drv around, we could arrange so that ‘guix publish’
rebuilds on-demand, after all.  I’m not sure how practical that would
be, though.

Ludo’.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2020-11-04 16:49                     ` Ludovic Courtès
@ 2022-09-29  0:32                       ` Maxim Cournoyer
  2022-09-29 10:56                         ` zimoun
  0 siblings, 1 reply; 55+ messages in thread
From: Maxim Cournoyer @ 2022-09-29  0:32 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162, Timothy Sample, zimoun

Hi,

Can this issue be closed?

Otherwise, what remains to be acted upon?

Thanks,

Maxim




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2022-09-29  0:32                       ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Maxim Cournoyer
@ 2022-09-29 10:56                         ` zimoun
  2022-09-29 15:00                           ` Ludovic Courtès
  0 siblings, 1 reply; 55+ messages in thread
From: zimoun @ 2022-09-29 10:56 UTC (permalink / raw)
  To: Maxim Cournoyer, Ludovic Courtès; +Cc: 42162, Timothy Sample

Hi Maxim,

On Wed, 28 Sep 2022 at 20:32, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote:

> Can this issue be closed?

This “meta” bug raises 2 levels of issues for long-term:

 1. save the source code,
 2. save the current binary substitutes.

For #1, we have now a roadmap to tackle this via Disarchive or
sources.json and SWH [1,2].  Therefore, this point of the “meta” bug can
be closed.

However, about #2, we do not have a roadmap, AFAIK.  For instance
«Substitute retention» [3] is still an issue.  The recent outage of
Berlin exemplifies the potential problems.  (Note that because of energy
troubles in Europe, I do not exclude some potential short blackout or
power outage for short period of time of the Berlin server.)

Why kept binary substitutes?  For instance, it is not possible to
rebuild some packages on modern CPU; see OpenBLAS [4].  Therefore,
waiting a better solution, it appears to me a pragmatic solution to keep
as much substitutes as we are able to.  We do not have a clear policy
between Berlin and Bordeaux.

1: <https://forge.softwareheritage.org/T3781>
2: <https://forge.softwareheritage.org/T4538>
3: <https://issues.guix.gnu.org/issue/42162#42>
4: <https://yhetil.org/guix/86o83oywza.fsf@gmail.com>

> Otherwise, what remains to be acted upon?

The next action for this #2 is to address what I sent on guix-sysadmin
about duplicating the storage of binary substitutes in some machine
offered by INRAe / Univ. of Montpellier.

And we need to draw a roadmap to tackle this #2.  Then, this “meta” bug
could be closed, IMHO.

Cheers,
simon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2022-09-29 10:56                         ` zimoun
@ 2022-09-29 15:00                           ` Ludovic Courtès
  2022-09-30  3:10                             ` Maxim Cournoyer
  0 siblings, 1 reply; 55+ messages in thread
From: Ludovic Courtès @ 2022-09-29 15:00 UTC (permalink / raw)
  To: zimoun; +Cc: 42162, Timothy Sample, Maxim Cournoyer

Hi,

zimoun <zimon.toutoune@gmail.com> skribis:

> This “meta” bug raises 2 levels of issues for long-term:
>
>  1. save the source code,
>  2. save the current binary substitutes.

Maybe we can close this bug and open an issue for each of these, or
discuss them on guix-devel until there are actionable items that come
out of it?

Or better yet: we could file more specific bugs such as “Disarchive
lacks bzip2 support” or “SWH integration does not support Subversion”.

Thoughts?

Ludo’.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2022-09-29 15:00                           ` Ludovic Courtès
@ 2022-09-30  3:10                             ` Maxim Cournoyer
  2022-09-30 12:13                               ` zimoun
  2022-09-30 18:17                               ` Maxime Devos
  0 siblings, 2 replies; 55+ messages in thread
From: Maxim Cournoyer @ 2022-09-30  3:10 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 42162-done, Timothy Sample, zimoun

Hi,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> zimoun <zimon.toutoune@gmail.com> skribis:
>
>> This “meta” bug raises 2 levels of issues for long-term:
>>
>>  1. save the source code,
>>  2. save the current binary substitutes.
>
> Maybe we can close this bug and open an issue for each of these, or
> discuss them on guix-devel until there are actionable items that come
> out of it?
>
> Or better yet: we could file more specific bugs such as “Disarchive
> lacks bzip2 support” or “SWH integration does not support Subversion”.
>
> Thoughts?

Wholly agreed, this thread is already too long and the original problem
was fixed.  Closing.

Thanks,
-- 
Maxim




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2022-09-30  3:10                             ` Maxim Cournoyer
@ 2022-09-30 12:13                               ` zimoun
  2022-10-01 22:04                                 ` Ludovic Courtès
  2022-10-03 15:20                                 ` Maxim Cournoyer
  2022-09-30 18:17                               ` Maxime Devos
  1 sibling, 2 replies; 55+ messages in thread
From: zimoun @ 2022-09-30 12:13 UTC (permalink / raw)
  To: Maxim Cournoyer, Ludovic Courtès; +Cc: 42162-done, Timothy Sample

Hi,

On Thu, 29 Sep 2022 at 23:10, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote:

> Wholly agreed, this thread is already too long and the original problem
> was fixed.  Closing.

I disagree, the original problem is not fixed; as I explained.  Well,
since you consider it is, please also close the related patch#43442.

Cheers,
simon




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2022-09-30  3:10                             ` Maxim Cournoyer
  2022-09-30 12:13                               ` zimoun
@ 2022-09-30 18:17                               ` Maxime Devos
  1 sibling, 0 replies; 55+ messages in thread
From: Maxime Devos @ 2022-09-30 18:17 UTC (permalink / raw)
  To: 42162, maxim.cournoyer, ludovic.courtes


[-- Attachment #1.1.1: Type: text/plain, Size: 956 bytes --]



On 30-09-2022 05:10, Maxim Cournoyer wrote:
> Hi,
> 
> Ludovic Courtès <ludo@gnu.org> writes:
> 
>> Hi,
>>
>> zimoun <zimon.toutoune@gmail.com> skribis:
>>
>>> This “meta” bug raises 2 levels of issues for long-term:
>>>
>>>   1. save the source code,
>>>   2. save the current binary substitutes.
>>
>> Maybe we can close this bug and open an issue for each of these, or
>> discuss them on guix-devel until there are actionable items that come
>> out of it?
>>
>> Or better yet: we could file more specific bugs such as “Disarchive
>> lacks bzip2 support” or “SWH integration does not support Subversion”.
>>
>> Thoughts?
> 
> Wholly agreed, this thread is already too long and the original problem
> was fixed.  Closing.

I don't follow, our gf2x package (and a few others) still download from 
gforge.inria.fr?  Which to me seems to be what zimoun is referring to 
(in the response)?

Greetings,
Maxime.

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 929 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2022-09-30 12:13                               ` zimoun
@ 2022-10-01 22:04                                 ` Ludovic Courtès
  2022-10-03 15:20                                 ` Maxim Cournoyer
  1 sibling, 0 replies; 55+ messages in thread
From: Ludovic Courtès @ 2022-10-01 22:04 UTC (permalink / raw)
  To: zimoun; +Cc: 42162-done, Timothy Sample, Maxim Cournoyer

Hi,

zimoun <zimon.toutoune@gmail.com> skribis:

> On Thu, 29 Sep 2022 at 23:10, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote:
>
>> Wholly agreed, this thread is already too long and the original problem
>> was fixed.  Closing.
>
> I disagree, the original problem is not fixed; as I explained.

What would you think of opening specific issues as I proposed, including
(I forgot that one) an issue on substitute preservation?

I was thinking that this could make it clearer what actions remain to be
taken.  WDYT?

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2022-09-30 12:13                               ` zimoun
  2022-10-01 22:04                                 ` Ludovic Courtès
@ 2022-10-03 15:20                                 ` Maxim Cournoyer
  2022-10-04 21:26                                   ` Ludovic Courtès
  1 sibling, 1 reply; 55+ messages in thread
From: Maxim Cournoyer @ 2022-10-03 15:20 UTC (permalink / raw)
  To: zimoun; +Cc: 42162-done, Timothy Sample, Ludovic Courtès

Hi,

zimoun <zimon.toutoune@gmail.com> writes:

> Hi,
>
> On Thu, 29 Sep 2022 at 23:10, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote:
>
>> Wholly agreed, this thread is already too long and the original problem
>> was fixed.  Closing.
>
> I disagree, the original problem is not fixed; as I explained.  Well,
> since you consider it is, please also close the related patch#43442.

Oh, from Ludovic's answer, I thought that the original issue had drifted
to tangential problems; apologies for drawing the wrong conclusion.

[time passes]

I've now migrated the remaining packages off gforge.inria.fr; see the
commits ending with 06201b76e5ef811245d627706c90117a0e9813d4.  I tried
to update ocaml-dose3, but aborted that effort, since it'd require
packaging ocaml-parmap:

--8<---------------cut here---------------start------------->8---
modified   gnu/packages/ocaml.scm
@@ -646,7 +646,7 @@ (define-public ocaml-mccs
 (define-public ocaml-dose3
   (package
     (name "ocaml-dose3")
-    (version "5.0.1")
+    (version "7.0.0")
     (source (origin
               (method git-fetch)
               (uri (git-reference
@@ -655,29 +655,19 @@ (define-public ocaml-dose3
               (file-name (git-file-name name version))
               (sha256
                (base32
-                "0dxkw37gj8z45kd0dnrlfgpj8yycq0dphs8kjm9kvq9xc8rikxp3"))
-              (patches
-               (search-patches
-                "ocaml-dose3-add-unix-dependency.patch"
-                "ocaml-dose3-Fix-for-ocaml-4.06.patch"
-                "ocaml-dose3-dont-make-printconf.patch"
-                "ocaml-dose3-Install-mli-cmx-etc.patch"))))
-    (build-system ocaml-build-system)
-    (arguments
-     `(#:tests? #f                      ;the test suite requires python 2
-       #:configure-flags
-       ,#~(list (string-append "SHELL="
-                               #+(file-append (canonical-package bash-minimal)
-                                              "/bin/sh")))
-       #:make-flags
-       ,#~(list (string-append "LIBDIR=" #$output "/lib/ocaml/site-lib"))))
-    (propagated-inputs
-     (list ocaml-graph ocaml-cudf ocaml-extlib ocaml-re))
+                "0hcjh68svicap7j9bghgkp49xa12qhxa1pygmrgc9qwm0m4dhirb"))))
+    (build-system dune-build-system)
     (native-inputs
      (list perl
+           python-wrapper
            ocaml-extlib
            ocamlbuild
            ocaml-cppo))
+    (propagated-inputs
+     (list ocaml-cudf
+           ocaml-extlib
+           ocaml-graph
+           ocaml-re))
     (home-page "https://www.mancoosi.org/software/")
     (synopsis "Package distribution management framework")
     (description "Dose3 is a framework made of several OCaml libraries for
--8<---------------cut here---------------end--------------->8---

Thanks,

-- 
Maxim




^ permalink raw reply	[flat|nested] 55+ messages in thread

* bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
  2022-10-03 15:20                                 ` Maxim Cournoyer
@ 2022-10-04 21:26                                   ` Ludovic Courtès
  0 siblings, 0 replies; 55+ messages in thread
From: Ludovic Courtès @ 2022-10-04 21:26 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: 42162-done, Timothy Sample, zimoun

Hello,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> zimoun <zimon.toutoune@gmail.com> writes:
>
>> Hi,
>>
>> On Thu, 29 Sep 2022 at 23:10, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote:
>>
>>> Wholly agreed, this thread is already too long and the original problem
>>> was fixed.  Closing.
>>
>> I disagree, the original problem is not fixed; as I explained.  Well,
>> since you consider it is, please also close the related patch#43442.
>
> Oh, from Ludovic's answer, I thought that the original issue had drifted
> to tangential problems; apologies for drawing the wrong conclusion.

Sorry for the misleading comment.

> I've now migrated the remaining packages off gforge.inria.fr; see the
> commits ending with 06201b76e5ef811245d627706c90117a0e9813d4.

Woow, you rock!!  (Would be nice to check Disarchive coverage of the
previously-used tarballs, to see how good we’re doing.)

> I tried to update ocaml-dose3, but aborted that effort, since it'd
> require packaging ocaml-parmap:

Something for the OCaml team!  :-)

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2022-10-04 21:53 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-02  7:29 bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Ludovic Courtès
2020-07-02  8:50 ` zimoun
2020-07-02 10:03   ` Ludovic Courtès
2020-07-11 15:50     ` bug#42162: Recovering source tarballs Ludovic Courtès
2020-07-13 19:20       ` Christopher Baines
2020-07-20 21:27         ` zimoun
2020-07-15 16:55       ` zimoun
2020-07-20  8:39         ` Ludovic Courtès
2020-07-20 15:52           ` zimoun
2020-07-20 17:05             ` Dr. Arne Babenhauserheide
2020-07-20 19:59               ` zimoun
2020-07-21 21:22             ` Ludovic Courtès
2020-07-22  0:27               ` zimoun
2020-07-22 10:28                 ` Ludovic Courtès
2020-08-03 21:10         ` Ricardo Wurmus
2020-07-30 17:36       ` Timothy Sample
2020-07-31 14:41         ` Ludovic Courtès
2020-08-03 16:59           ` Timothy Sample
2020-08-05 17:14             ` Ludovic Courtès
2020-08-05 18:57               ` Timothy Sample
2020-08-23 16:21                 ` Ludovic Courtès
2020-11-03 14:26                 ` Ludovic Courtès
2020-11-03 16:37                   ` zimoun
2020-11-03 19:20                   ` Timothy Sample
2020-11-04 16:49                     ` Ludovic Courtès
2022-09-29  0:32                       ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Maxim Cournoyer
2022-09-29 10:56                         ` zimoun
2022-09-29 15:00                           ` Ludovic Courtès
2022-09-30  3:10                             ` Maxim Cournoyer
2022-09-30 12:13                               ` zimoun
2022-10-01 22:04                                 ` Ludovic Courtès
2022-10-03 15:20                                 ` Maxim Cournoyer
2022-10-04 21:26                                   ` Ludovic Courtès
2022-09-30 18:17                               ` Maxime Devos
2020-08-26 10:04         ` bug#42162: Recovering source tarballs zimoun
2020-08-26 21:11           ` Timothy Sample
2020-08-27  9:41             ` zimoun
2020-08-27 12:49               ` Ludovic Courtès
2020-08-27 18:06               ` Bengt Richter
2021-01-10 19:32 ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 Maxim Cournoyer
2021-01-13 10:39   ` Ludovic Courtès
2021-01-13 12:27     ` Andreas Enge
2021-01-13 15:07     ` Andreas Enge
     [not found] ` <handler.42162.D42162.16105343699609.notifdone@debbugs.gnu.org>
2021-01-13 14:28   ` Ludovic Courtès
2021-01-14 14:21     ` Maxim Cournoyer
2021-10-04 15:59     ` bug#42162: gforge.inria.fr is off-line Ludovic Courtès
2021-10-04 17:50       ` bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020 zimoun
2021-10-07 16:07         ` Ludovic Courtès
2021-10-09 17:29           ` raingloom
2021-10-11  8:41           ` zimoun
2021-10-12  9:24             ` Ludovic Courtès
2021-10-12 10:50               ` zimoun
2021-10-12 16:04                 ` Substitute retention Ludovic Courtès
2021-10-12 18:06                   ` zimoun
2021-10-15  9:27                     ` Ludovic Courtès

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.