intrinsic vs extrinsic identifier: toward more robustness?

unofficial mirror of guix-devel@gnu.org 

* intrinsic vs extrinsic identifier: toward more robustness?
@ 2023-03-03 18:07 Simon Tournier
  2023-03-04  0:08 ` Maxime Devos
  2023-03-16 17:45 ` Ludovic Courtès
  0 siblings, 2 replies; 9+ messages in thread
From: Simon Tournier @ 2023-03-03 18:07 UTC (permalink / raw)
  To: Guix Devel

Hi,

I would like to open a discussion about how we identify the source
origin (fixed output).  It is of vital importance for being robust in
the long term (say 3-5 years).  It matters in the Reproducible Research
context, but not only there.

# First thing first
===================

## What is an intrinsic identifier or an extrinsic one?
=======================================================

 - extrinsic: uses a register to keep the correspondence between the
   identifier and the object; say a version label such as a Git tag.

 - intrinsic: intimately bound to the designated object itself; say a
   hash such as a Git blob or tree identifier and, to some extent, a
   commit.

The register must be a trusted authority and it resolves by mapping the
key identifier to the object.  Having the object at hand does not give
any clue about the key identifier.  And collisions are very frequent;
two key identifiers resolve to the same content – hopefully! we call
that mirrors. ;-)

Intrinsic identifier also relies on a (trusted) map but collisions are
avoided as much as possible.  Somehow it strongly reduces the power of
the authority and it is often more robust.
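
To make the distinction concrete, here is a minimal Python sketch.  It
is not Guix code; the names `register`, `resolve` and `intrinsic_id`
and the mirror labels are made up for illustration.  The extrinsic
label needs the register to resolve, while the intrinsic identifier can
be recomputed from the object alone.

```python
import hashlib

# Extrinsic: a register (here a plain dict) maps labels to objects.
# Two labels may resolve to the same content, as mirrors do.
register = {
    "mirror-a/hello-2.12.1.tar.gz": b"tarball bytes",
    "mirror-b/hello-2.12.1.tar.gz": b"tarball bytes",
}

def resolve(label):
    """Look up LABEL in the register; without the register nothing resolves."""
    return register[label]

def intrinsic_id(content):
    """Compute an identifier from CONTENT alone, no register involved."""
    return hashlib.sha256(content).hexdigest()

a = resolve("mirror-a/hello-2.12.1.tar.gz")
b = resolve("mirror-b/hello-2.12.1.tar.gz")
# Different extrinsic labels, same intrinsic identifier.
print(intrinsic_id(a) == intrinsic_id(b))
```

Having only the bytes at hand, one can recompute `intrinsic_id` but
cannot guess which labels the register associates with them.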

Please note that identification and integrity are not the same thing.
Since intrinsic identifiers often rely on cryptographic hash functions,
and integrity checks do too, the two are often confused.

Whatever the intrinsic identifier we consider – even ones based on a very
weak cryptographic hash function such as MD5, or on a non-cryptographic
hash function such as Pearson hashing, etc. – the integrity check is
currently done by SHA256.

## For example, consider this source origin,
==============

    (source (origin
              (method url-fetch)
              (uri (string-append "mirror://gnu/hello/hello-" version
                                  ".tar.gz"))
              (sha256
               (base32
                "086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd"))))

where ’mirror://gnu’ is resolved by Guix itself.  Or this one,

    (source
     (origin
       (method git-fetch)
       (uri (git-reference
             (url "https://github.com/FluxML/Zygote.jl")
             (commit (string-append "v" version))))
       (file-name (git-file-name name version))
       (sha256
        (base32 "02bgj6m1j25sm3pa5sgmds706qpxk1qsbm0s2j3rjlrz9xn7glgk"))))

where Guix clones the repository and then checks out what the field
’commit’ specifies.

Here both are extrinsic identifiers.  For the first example, the register
is defined by ’%mirrors’.  For the second example, the register is the
folder ’.git/’.

An intrinsic identifier could be a plain hash or a hash of serialized
data.  Using Guix b8f6ead:

--8<---------------cut here---------------start------------->8---
$ guix hash -S none -H sha256 -f nix-base32 -x $(guix build hello -S)
086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd

$ guix hash -S git -H sha256 -f nix-base32 -x $(guix build hello -S)
11kaw6m19rdj3d55y4cygk6k9zv6sn2iz4gpimx0j99ps87ij29l

$ guix hash -S nar -H sha256 -f nix-base32 -x /gnu/store/3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz
1lvqpbk2k1sb39z8jfxixf7p7v8sj4z6mmpa44nnmff3w1y6h8lh
--8<---------------cut here---------------end--------------->8---
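
The ’nix-base32’ format is only another spelling of the same digest
bytes that ’-f hex’ would print.  As a sketch of my understanding of
that encoding (the function name `nix_base32` is mine, and this is not
the Guix implementation), in Python:

```python
# The alphabet leaves out the letters e, o, t and u.
ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"

def nix_base32(digest):
    """Encode the bytes DIGEST in the nix-base32 spelling."""
    length = (len(digest) * 8 - 1) // 5 + 1
    chars = []
    # Groups of 5 bits are read starting from the end of the digest.
    for n in range(length - 1, -1, -1):
        i, j = divmod(n * 5, 8)
        c = digest[i] >> j
        if i + 1 < len(digest):
            c |= digest[i + 1] << (8 - j)
        chars.append(ALPHABET[c & 0x1F])
    return "".join(chars)

# Hexadecimal SHA256 of the hello-2.12.1 tarball (plain file hash).
hello = bytes.fromhex(
    "8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20")
print(nix_base32(hello))
```

The printed string should be the one found in the ’sha256’ field of the
’hello’ origin.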

Or some Git-like tree md5 of the decompressed data, e.g.,

--8<---------------cut here---------------start------------->8---
$ guix hash -S git -H md5 -f hex -x hello-2.12.1
3db60bcfecf17a5dd81e3fb5bfb1c191
--8<---------------cut here---------------end--------------->8---

Or some others.

--8<---------------cut here---------------start------------->8---
$ git clone https://github.com/FluxML/Zygote.jl
$ git -C Zygote.jl checkout v0.6.41

$ guix hash -S nar -H sha256 -f nix-base32 -x Zygote.jl
02bgj6m1j25sm3pa5sgmds706qpxk1qsbm0s2j3rjlrz9xn7glgk

$ guix hash -S git -H sha1 -f hex -x Zygote.jl
3cfdb31b517eec4173584fba2b1aa65daad46e09
--8<---------------cut here---------------end--------------->8---

# Second thing second
=====================

All that said, Guix uses extrinsic identifiers for almost all origins,
if not all.  Even for the ’git-fetch’ method.

Consider that GitHub disappears and the default build farms ci.guix and
bordeaux.guix are unreachable for whatever reason.  Then Guix will
fall back to Software Heritage and will exploit its resolver.

--8<---------------cut here---------------start------------->8---
Initialized empty Git repository in /gnu/store/ns1f3b4wm5n470bczd2k5li6xpgbqkz7-julia-zygote-0.6.41-checkout/.git/
fatal: unable to access 'https://github.com/FluxML/Zygote.jl/': Could not resolve host: github.com
Failed to do a shallow fetch; retrying a full fetch...
fatal: unable to access 'https://github.com/FluxML/Zygote.jl/': Could not resolve host: github.com
git-fetch: '/gnu/store/55ba5ragbd5sd4r45n0q24vrxx9rigrm-git-minimal-2.39.1/bin/git fetch origin' failed with exit code 128
Trying content-addressed mirror at berlin.guix.gnu.org...
Trying content-addressed mirror at berlin.guix.gnu.org...
Trying to download from Software Heritage...
SWH: found revision 4777767737b4c95d2cea842933c5b2edae2771b2 with directory at 'https://archive.softwareheritage.org/api/1/directory/3cfdb31b517eec4173584fba2b1aa65daad46e09/'
swh:1:dir:3cfdb31b517eec4173584fba2b1aa65daad46e09/
--8<---------------cut here---------------end--------------->8---

It is SWH that finds the revision
4777767737b4c95d2cea842933c5b2edae2771b2 from the contextual information
(URL + version label), and from this revision SWH associates the content
having the intrinsic identifier
swh:1:dir:3cfdb31b517eec4173584fba2b1aa65daad46e09.

## First, please note that the SWHID is just Git,
========

--8<---------------cut here---------------start------------->8---
guix hash -S git -H sha1 -f hex \
     /gnu/store/ns1f3b4wm5n470bczd2k5li6xpgbqkz7-julia-zygote-0.6.41-checkout
3cfdb31b517eec4173584fba2b1aa65daad46e09
--8<---------------cut here---------------end--------------->8---
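
For readers who want to see what "just Git" means, here is a rough
Python sketch of the Git tree computation.  It is not the code behind
’guix hash -S git’; the function names are mine, and the handling of
empty directories and of anything other than regular files, symlinks
and directories is an assumption.

```python
import hashlib
import os
import stat
import tempfile

def git_object_id(kind, payload):
    """SHA1 of a Git object: '<kind> <size>' + NUL byte + payload."""
    header = kind.encode() + b" " + str(len(payload)).encode() + b"\0"
    return hashlib.sha1(header + payload).digest()

def tree_id(directory):
    """Return the raw Git tree identifier of DIRECTORY, skipping '.git'."""
    entries = []
    for name in os.listdir(directory):
        if name == ".git":
            continue
        path = os.path.join(directory, name)
        st = os.lstat(path)
        if stat.S_ISLNK(st.st_mode):
            target = os.fsencode(os.readlink(path))
            mode, oid = b"120000", git_object_id("blob", target)
        elif stat.S_ISDIR(st.st_mode):
            mode, oid = b"40000", tree_id(path)
        else:
            mode = b"100755" if st.st_mode & stat.S_IXUSR else b"100644"
            with open(path, "rb") as f:
                oid = git_object_id("blob", f.read())
        entries.append((os.fsencode(name), mode, oid))
    # Git sorts tree entries as if directory names ended with '/'.
    entries.sort(key=lambda e: e[0] + (b"/" if e[1] == b"40000" else b""))
    payload = b"".join(mode + b" " + name + b"\0" + oid
                       for name, mode, oid in entries)
    return git_object_id("tree", payload)

def swhid_dir(directory):
    """Spell the tree identifier of DIRECTORY as a 'swh:1:dir:' SWHID."""
    return "swh:1:dir:" + tree_id(directory).hex()

with tempfile.TemporaryDirectory() as empty:
    print(swhid_dir(empty))
```

On an empty directory this prints the well-known identifier of the
empty Git tree, 4b825dc642cb6eb9a060e54bf8d69288fbee4904, prefixed by
’swh:1:dir:’.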

In other words, the SWH information is essentially the same information
as the one of the Git objects.  Specifically, from the Git checkout,

--8<---------------cut here---------------start------------->8---
$ git cat-file -p v0.6.41
object 4777767737b4c95d2cea842933c5b2edae2771b2
type commit
tag v0.6.41

$ git cat-file -p 4777767737b4c95d2cea842933c5b2edae2771b2
tree 3cfdb31b517eec4173584fba2b1aa65daad46e09
--8<---------------cut here---------------end--------------->8---

## Second, SWH acts as a resolver here, i.e.,
=========

     (find (lambda (branch)
             (or
              ;; Git specific.
              (string=? (string-append "refs/tags/" tag)
                        (branch-name branch))
              ;; Hg specific.
              (string=? tag
                        (branch-name branch))))
           (snapshot-branches snapshot))

and this is not robust.  For one, it fails for Git lightweight tags, as
exposed by the package ’open-zwave’ at tag 1.6.

--8<---------------cut here---------------start------------->8---
$ for t in $(git tag); do printf "$t "; git cat-file -t $t ;done
Rel-1.0 commit
V1.5 tag
v1.2 commit
v1.3 tag
v1.4 tag
v1.6 commit
--8<---------------cut here---------------end--------------->8---

It means that the code above would be able to find V1.5 or v1.4 but not
v1.6 or v1.2.  Well, we can consider that a bug and improve the
snapshot machinery to also collect more ’refs’.  But, for two…

…the current code (guix swh) does not deal with several snapshots and
only considers the latest one.  Therefore, it fails for some in-place
replacements: upstream tags a specific revision, later removes that tag,
and then re-uses the same tag label for another revision (booo!).  If
SWH ingests after the first tag, SWH creates one snapshot; then if SWH
ingests again after the re-tag, SWH creates another snapshot.
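
A sketch of what looking at all the snapshots could mean, in Python.
The data layout below (a list of snapshots, each with a ’branches’
mapping whose values carry ’target_type’ and ’target’) is an assumption
loosely modelled on what the SWH API returns, and the identifiers are
placeholders; this is not the (guix swh) code.

```python
def candidates(snapshots, tag):
    """Return every distinct target that TAG designated in any snapshot."""
    names = ("refs/tags/" + tag, tag)   # Git-style and Hg-style names
    found = []
    for snapshot in snapshots:
        for name in names:
            branch = snapshot["branches"].get(name)
            if branch is None:
                continue
            # Keep annotated tags ("release") and lightweight tags
            # ("revision") alike.
            target = (branch["target_type"], branch["target"])
            if target not in found:
                found.append(target)
    return found

# Hypothetical history: upstream re-used the label v1.0 between two visits.
snapshots = [
    {"branches": {"refs/tags/v1.0": {"target_type": "revision",
                                     "target": "aaaa"}}},
    {"branches": {"refs/tags/v1.0": {"target_type": "release",
                                     "target": "bbbb"}}},
]
print(candidates(snapshots, "v1.0"))
```

When more than one candidate comes back, the label alone cannot decide;
only an intrinsic identifier recorded at packaging time could tell
which one was meant.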

## Third, Disarchive is helping.
========

As an aside, adding a layer to maintain does not help when speaking
about the long term (3-5 years); reducing the number of layers is often
better.  That said, there is work in progress to have Disarchive
features directly from SWH.

What does Disarchive do?  It maps various intrinsic identifiers.

Remember ’hello’ from above?

--8<---------------cut here---------------start------------->8---
$ guix shell disarchive guile-lzma guile
$ disarchive disassemble hello-2.12.1
(disarchive
  (version 0)
  (directory-ref
    (version 0)
    (name "hello-2.12.1")
    (addresses
      (swhid "swh:1:dir:ad5fc7c3062e8426b7936588e7a27d51ace0e508"))
    (digest
      (sha256
        "cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4"))))

$ guix hash -S git -H sha1 -f hex hello-2.12.1
ad5fc7c3062e8426b7936588e7a27d51ace0e508
$ guix hash -S git -H sha256 -f hex hello-2.12.1
cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4
--8<---------------cut here---------------end--------------->8---

Well, the fixed output is a compressed tarball, for which it reads,

--8<---------------cut here---------------start------------->8---
$ disarchive disassemble $(guix build -S hello)
(disarchive
  (gzip-member
    (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz")
    (digest
      (sha256
        "8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20"))
    (header (mtime 0) (extra-flags 2) (os 3))
    (footer (crc 2707092614) (isize 4945920))
    (compressor gnu-best-rsync)
    (input (tarball
             (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar")
             (digest
               (sha256
                 "a2c33fd13c555015433956bcf06609293a34ce5c5e6a2070990bfb86070dc554"))
[...]

    (input (directory-ref
             (version 0)
             (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1")
             (addresses
               (swhid "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
             (digest
               (sha256
                 "1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0"))))))))
--8<---------------cut here---------------end--------------->8---

where the values are just (considering that ’guix hash -S none -H sha256
-f hex’ is equivalent to ’sha256sum’)

--8<---------------cut here---------------start------------->8---
$ guix hash -S none -H sha256 -f hex $(guix build hello -S)
8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20
$ gzip -d $(guix build -S hello) -c | sha256sum
a2c33fd13c555015433956bcf06609293a34ce5c5e6a2070990bfb86070dc554  -
--8<---------------cut here---------------end--------------->8---
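
The same two digests can be computed with the Python standard library.
This small sketch (the function name `digests` and the sample bytes are
mine) also shows why the gzip header matters: the same content
compressed with a different ’mtime’ gets a different outer digest.

```python
import gzip
import hashlib

def digests(compressed):
    """Return (sha256 of the gzip bytes, sha256 of the decompressed bytes)."""
    outer = hashlib.sha256(compressed).hexdigest()
    inner = hashlib.sha256(gzip.decompress(compressed)).hexdigest()
    return outer, inner

content = b"pretend this is a tarball"
first = digests(gzip.compress(content, mtime=0))
second = digests(gzip.compress(content, mtime=1))

print(first[1] == second[1])   # same decompressed content
print(first[0] == second[0])   # but not the same compressed file
```

This is the kind of metadata (header, footer, compressor) that the
’gzip-member’ record keeps next to the inner digest.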

However, the fields ’swhid’ and the other SHA256 ’digest’ are different
from the ones above.  That is because of the elided [...] part.  It
probably comes from the normalization process.  Well, I am not sure I
deeply understand why it is different, but that’s another story. :-)

## Fourth, it misses a bridge using NAR normalization (serialization).
=========

Disarchive can (or could) provide a bridge (map) between SWHID+SHA1 and
NAR+SHA256.  But it would be nice if it were implemented in SWH
directly.  It would ease the previous drawbacks.

For the interested reader, discussion there
<https://gitlab.softwareheritage.org/swh/meta/-/issues/4538>.  Moreover,
<https://gitlab.softwareheritage.org/swh/meta/-/issues/4538#note_121067>
provides simple examples about NAR and how to implement it using Python.
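
To give an idea of how small the NAR serialization is, here is my own
Python sketch of it (not the code from the linked note; the function
names are mine and corner cases are surely missing).  Every token is a
length-prefixed string padded to 8 bytes, and directory entries are
sorted by name.

```python
import hashlib
import os
import stat
import struct

def nar_string(data):
    """64-bit little-endian length, then DATA zero-padded to 8 bytes."""
    padding = (8 - len(data) % 8) % 8
    return struct.pack("<Q", len(data)) + data + b"\0" * padding

def nar_node(path):
    """Serialize the file, symlink or directory at PATH (a bytes path)."""
    st = os.lstat(path)
    out = nar_string(b"(") + nar_string(b"type")
    if stat.S_ISLNK(st.st_mode):
        out += nar_string(b"symlink") + nar_string(b"target")
        out += nar_string(os.readlink(path))
    elif stat.S_ISDIR(st.st_mode):
        out += nar_string(b"directory")
        for name in sorted(os.listdir(path)):
            out += nar_string(b"entry") + nar_string(b"(")
            out += nar_string(b"name") + nar_string(name)
            out += nar_string(b"node") + nar_node(os.path.join(path, name))
            out += nar_string(b")")
    else:
        out += nar_string(b"regular")
        if st.st_mode & stat.S_IXUSR:
            out += nar_string(b"executable") + nar_string(b"")
        with open(path, "rb") as f:
            out += nar_string(b"contents") + nar_string(f.read())
    return out + nar_string(b")")

def nar(path):
    """Return the NAR serialization of PATH as bytes."""
    return nar_string(b"nix-archive-1") + nar_node(os.fsencode(path))

def nar_sha256(path):
    """Hexadecimal SHA256 of the NAR serialization of PATH."""
    return hashlib.sha256(nar(path)).hexdigest()
```

If the sketch is right, `nar_sha256` on a checkout should agree with
’guix hash -S nar -H sha256 -f hex’; I have not checked it beyond toy
inputs.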

# Discussion asking for comments and feedback
=============================================

Still there?  If yes, thanks for reading. :-)

As shown in,

1: <https://lists.gnu.org/archive/html/guix-devel/2023-02/msg00398.html>
2: <https://lists.gnu.org/archive/html/guix-devel/2023-03/msg00007.html>

we have holes and we are not currently robust for the long term (3-5
years) if our lovely build farms are down for whatever reason.

For sure, we have to fix the holes and bugs. :-)  However, I am asking
what we could add for having more robustness on the long term.

It is neither affordable nor wanted to switch from the current
extrinsic identification to a completely intrinsic one.  Although it
would fix many issues. ;-)

Guix and ’guix time-machine’ provide all the machinery for being able
to redeploy later but, as I have tried to point out in the two links
above [1,2], we are lacking tools for retrieving contents; well, having
the machinery does not mean that such machinery works well or is
robust. :-)

The discussion could also fit how to distribute using ERIS.

At some point, I was thinking of having something like “guix freeze -m
manifest.scm” returning a map of all the sources from the deep bootstrap
to the leaf packages described in manifest.scm.  However, maybe
something is poor in the metadata we collect at packaging time.

For instance, substitutes work more or less using intrinsic
identifiers, so it helps, I guess. :-)

Well, we could imagine the addition of another optional field, say under
’properties’, that could store the intrinsic identifier of the
fixed outputs such as SWHID or Git tree / commit hash or else.  It would
add robustness for later.

Or maybe an optional field of the ’origin’ record for the same purpose.

WDYT?

Cheers,
simon


* Re: intrinsic vs extrinsic identifier: toward more robustness?
  2023-03-03 18:07 intrinsic vs extrinsic identifier: toward more robustness? Simon Tournier
@ 2023-03-04  0:08 ` Maxime Devos
  2023-03-04  4:10   ` Maxim Cournoyer
  2023-03-05 20:21   ` Simon Tournier
  2023-03-16 17:45 ` Ludovic Courtès
  1 sibling, 2 replies; 9+ messages in thread
From: Maxime Devos @ 2023-03-04  0:08 UTC (permalink / raw)
  To: Simon Tournier, Guix Devel

Op 03-03-2023 om 19:07 schreef Simon Tournier:
> Hi,
> 
> I would like to open a discussion about how we identify the source
> origin (fixed output).  It is of vitally importance for being robust on
> the long-term (say 3-5 years).  It matters in Reproducible Research
> context, but not only.
> 
> # First thing first
> ===================
> 
> ## What is an intrinsic identifier or an extrinsic one?
> =======================================================
> 
>   - extrinsic: use a register to keep the correspondence between the
>     identifier and the object; say label version as Git tag.
> 
>   - intrinsic: intimately bound to the designated object itself; say hash
>     as Git blob or tree and at some extent commit.
 >
 > [... some reordering for convenience of replying ...]
 >
 > Please note that the identification and the integrity is not the same.
 > Since intrinsic identifier often uses cryptographic hash functions and
 > integrity too, it is often confusing.

To my understanding, there is only one 'real' identifier in Guix: the
(sha256 (base32 ...)) (*).  Those other identifiers like the URL in
url-fetch and git-fetch are just hints on where to find the object --
very important hints without which finding the object is much more
likely to fail, but just hints nonetheless.

While identification and integrity might be different concepts, 
content-based identifiers like (sha256 (base32 ...)) accomplish both at 
the same time.

(*) FWIW, I would like to point out that Guix theoretically supports 
some other hashes as well, though they aren't used for any in-tree packages.

> The register must be a trusted authority and it resolves by mapping the
> key identifier to the object.  Having the object at hand does not give
> any clue about the key identifier.  And collisions are very frequent;
> two key identifiers resolve to the same content – hopefully! we call
> that mirrors. ;-)

I first thought you were writing about 'extrinsic -> intrinsic (e.g.
hash-based)' registers, so I was confused by your comment about
collisions -- to my understanding, no sha256sum collisions are known.
Going by your comment about mirrors, I think you meant an 'intrinsic ->
extrinsic' map instead, e.g. 'sha256 -> a bunch of appropriate URLs'.

> Intrinsic identifier also relies on a (trusted) map but collisions are
> avoided as much as possible.  Somehow it strongly reduces the power of
> the authority and it is often more robust.

Who is 'the authority' here, how does the absence of collisions reduce
the power of the authority, and what is your point about reducing the
power of the authority?  I was thinking of ‘the authority=Guix package
definition’, but then only the 'more robust' part of your conclusion
makes sense to me.  Also, as you used 'but' instead of 'and', it appears
you consider relying on a trusted map to be a bad thing, but that
appears to be basic security and patch review to me.

> Whatever the intrinsic identifier we consider – even ones based on very
> weak cryptographic hash function as MD5, or based on non-crytographic
> hash function as Pearson hashing, etc. – the integrity check is
> currently done by SHA256.

How about using the hash of the integrity check as an intrinsic
identifier, like is done currently?  I mean, we hash it with sha256 for
the integrity check anyway, so we might as well reuse it.

> ## For example, consider this source origin,
> ==============
> 
>      (source (origin
>                (method url-fetch)
>                (uri (string-append "mirror://gnu/hello/hello-" version
>                                    ".tar.gz"))
>                (sha256
>                 (base32
>                  "086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd"))))
> 
> where ’mirror://gnu’ is resolved by Guix itself.  Or this one,
> 
>      (source
>       (origin
>         (method git-fetch)
>         (uri (git-reference
>               (url "https://github.com/FluxML/Zygote.jl")
>               (commit (string-append "v" version))))
>         (file-name (git-file-name name version))
>         (sha256
>          (base32 "02bgj6m1j25sm3pa5sgmds706qpxk1qsbm0s2j3rjlrz9xn7glgk"))))
> 
> where Guix clones then checks out at the specification of the field
> ’commit’.
> 
> Here both are extrinsic identifiers.  For the first example, the register
> is defined by ’%mirrors’.  For the second example, the register is the
> folder ’.git/’.
> 
> Intrinsic identifier could be plain hash or hashed serialized data.
> Using Guix b8f6ead:
> 
> --8<---------------cut here---------------start------------->8---
> $ guix hash -S none -H sha256 -f nix-base32 -x $(guix build hello -S)
> 086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd
> 
> $ guix hash -S git -H sha256 -f nix-base32 -x $(guix build hello -S)
> 11kaw6m19rdj3d55y4cygk6k9zv6sn2iz4gpimx0j99ps87ij29l
> 
> $ guix hash -S nar -H sha256 -f nix-base32 -x /gnu/store/3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz
> 1lvqpbk2k1sb39z8jfxixf7p7v8sj4z6mmpa44nnmff3w1y6h8lh
> --8<---------------cut here---------------end--------------->8---
> 
> Or some Git-like tree md5 of the decompressed data, e.g.,
> 
> --8<---------------cut here---------------start------------->8---
> $ guix hash -S git -H md5 -f hex -x hello-2.12.1
> 3db60bcfecf17a5dd81e3fb5bfb1c191
> --8<---------------cut here---------------end--------------->8---
> 
> Or some others.
> 
> --8<---------------cut here---------------start------------->8---
> $ git clone https://github.com/FluxML/Zygote.jl
> $ git -C Zygote.jl checkout v0.6.41
> 
> $ guix hash -S nar -H sha256 -f nix-base32 -x Zygote.jl
> 02bgj6m1j25sm3pa5sgmds706qpxk1qsbm0s2j3rjlrz9xn7glgk
> 
> $ guix hash -S git -H sha1 -f hex -x Zygote.jl
> 3cfdb31b517eec4173584fba2b1aa65daad46e09
> --8<---------------cut here---------------end--------------->8---
> 
> 
> # Second thing second
> =====================
> 
> All that’s said, Guix uses extrinsic identifiers for almost all origins,
> if not all.  Even for ’git-fetch’ method.
For git-fetch, the value of the 'commit' field is intrinsic (except when 
it's a tag instead).

> Consider that GitHub disappears and the default build farms ci.guix and
> bordeaux.guix are unreachable for whatever reason.  Then Guix will
> fallback to Software Heritage and will exploits its resolver.
> 
> --8<---------------cut here---------------start------------->8---
> Initialized empty Git repository in /gnu/store/ns1f3b4wm5n470bczd2k5li6xpgbqkz7-julia-zygote-0.6.41-checkout/.git/
> fatal: unable to access 'https://github.com/FluxML/Zygote.jl/': Could not resolve host: github.com
> Failed to do a shallow fetch; retrying a full fetch...
> fatal: unable to access 'https://github.com/FluxML/Zygote.jl/': Could not resolve host: github.com
> git-fetch: '/gnu/store/55ba5ragbd5sd4r45n0q24vrxx9rigrm-git-minimal-2.39.1/bin/git fetch origin' failed with exit code 128
> Trying content-addressed mirror at berlin.guix.gnu.org...
> Trying content-addressed mirror at berlin.guix.gnu.org...
> Trying to download from Software Heritage...
> SWH: found revision 4777767737b4c95d2cea842933c5b2edae2771b2 with directory at 'https://archive.softwareheritage.org/api/1/directory/3cfdb31b517eec4173584fba2b1aa65daad46e09/'
> swh:1:dir:3cfdb31b517eec4173584fba2b1aa65daad46e09/
> --8<---------------cut here---------------end--------------->8---
> 
> That’s SWH which finds the revision
> 4777767737b4c95d2cea842933c5b2edae2771b2 from the contextual information
> URL + label version and from this revision SWH associates the content
> having the intrinsic identifier
> swh:1:dir:3cfdb31b517eec4173584fba2b1aa65daad46e09.
> 
> 
> ## First, please note that the SWHID is just Git,
> ========
> 
> --8<---------------cut here---------------start------------->8---
> guix hash -S git -H sha1 -f hex \
>       /gnu/store/ns1f3b4wm5n470bczd2k5li6xpgbqkz7-julia-zygote-0.6.41-checkout
> 3cfdb31b517eec4173584fba2b1aa65daad46e09
> --8<---------------cut here---------------end--------------->8---
> 
> Other said, SWH information is somehow the same information as the one
> of Git objects.  Specifically, from the Git checkout,
> 
> --8<---------------cut here---------------start------------->8---
> $ git cat-file -p v0.6.41
> object 4777767737b4c95d2cea842933c5b2edae2771b2
> type commit
> tag v0.6.41
> 
> $ git cat-file -p 4777767737b4c95d2cea842933c5b2edae2771b2
> tree 3cfdb31b517eec4173584fba2b1aa65daad46e09
> --8<---------------cut here---------------end--------------->8---
> 
> 
> ## Second, SWH acts as a resolver here, i.e.,
> =========
> 
>       (find (lambda (branch)
>               (or
>                ;; Git specific.
>                (string=? (string-append "refs/tags/" tag)
>                          (branch-name branch))
>                ;; Hg specific.
>                (string=? tag
>                          (branch-name branch))))
>             (snapshot-branches snapshot))
> 
> and this is not robust.  For one, it fails for Git lightweight tag as
> exposed with the package ’open-zwave’ tag 1.6.
> 
> --8<---------------cut here---------------start------------->8---
> $ for t in $(git tag); do printf "$t "; git cat-file -t $t ;done
> Rel-1.0 commit
> V1.5 tag
> v1.2 commit
> v1.3 tag
> v1.4 tag
> v1.6 commit
> --8<---------------cut here---------------end--------------->8---
> 
> It means that the code above would be able to find V1.5 or v1.4 but not
> v1.6 or v1.2.  Well, we can consider that as a bug and improve the
> snapshot machinery for also collecting more ’refs’.  But, for two…
> 
> …the current code (guix swh) does not deal with several snapshots and
> only consider the latest one.  Therefore, it fails for some in-place
> replacements – upstream tags a specific revision then later removes it
> and upstream re-use the same tag label for another revision booo!, if
> SWH ingests after the first tag, SWH creates one snapshot, then if SWH
> ingests again after the second re-tag, SWH creates another snapshot.

This can be solved by placing the actual commit in the 'commit' field of 
git-reference, instead of the tag name, then things are completely 
unambiguous -- this and its opposite were discussed in ‘On raw strings 
in <origin> commit field’ (*), IIRC.

(*) Also maybe that thread about tricking peer review.

I didn't understand the position that commit field should contain the 
(indirect, fragile) tag instead of the (direct, robust) commit, but 
those differences could be sidestepped by having both a 'tag' field and 
a 'commit' field, IIUC.

The 'commit' field would be used for downloading the source code, and 
the 'tag' field would be used by a not-yet-existing linter that would 
check whether the (immutable) commit matches the current value (varying 
over time) of the tag.
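
A rough sketch of the check such a linter would do, in Python.  The
function names are made up, the input is the text printed by 'git
ls-remote --tags URL', and the tag object identifier in the example is
a placeholder:

```python
def tag_commit(ls_remote_output, tag):
    """Return the commit that TAG currently designates, or None."""
    refs = {}
    for line in ls_remote_output.splitlines():
        oid, ref = line.split("\t", 1)
        refs[ref] = oid
    name = "refs/tags/" + tag
    # For annotated tags, 'git ls-remote' also prints a peeled 'NAME^{}'
    # line giving the commit behind the tag object.
    return refs.get(name + "^{}", refs.get(name))

def lint(pinned_commit, tag, ls_remote_output):
    """Return a warning string, or None when tag and commit still agree."""
    current = tag_commit(ls_remote_output, tag)
    if current is None:
        return f"tag {tag} no longer exists upstream"
    if current != pinned_commit:
        return f"tag {tag} now points to {current}, not {pinned_commit}"
    return None

# Placeholder tag object id; the commit is the one from the log above.
output = ("1111111111111111111111111111111111111111\trefs/tags/v0.6.41\n"
          "4777767737b4c95d2cea842933c5b2edae2771b2\trefs/tags/v0.6.41^{}\n")
print(lint("4777767737b4c95d2cea842933c5b2edae2771b2", "v0.6.41", output))
```

The download itself would never look at the tag; only the linter would.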

> 
> ## Third, Disarchive is helping.
> ========
> 
> Aside adding a layer to maintain does not help when speaking about
> long-term (3-5 years), well, the reduction of layers is often better for
> long-term.  That’s said, there is a work in progress to have Disarchive
> features directly from SWH.
> 
> What does Disarchive do?  It maps various intrinsic identifiers.
> 
> Remember ’hello’ from above?
> 
> --8<---------------cut here---------------start------------->8---
> $ guix shell disarchive guile-lzma guile
> $ disarchive disassemble hello-2.12.1
> (disarchive
>    (version 0)
>    (directory-ref
>      (version 0)
>      (name "hello-2.12.1")
>      (addresses
>        (swhid "swh:1:dir:ad5fc7c3062e8426b7936588e7a27d51ace0e508"))
>      (digest
>        (sha256
>          "cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4"))))
> 
> $ guix hash -S git -H sha1 -f hex hello-2.12.1
> ad5fc7c3062e8426b7936588e7a27d51ace0e508
> $ guix hash -S git -H sha256 -f hex hello-2.12.1
> cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4
> --8<---------------cut here---------------end--------------->8---
> 
> Well, the fixed-outputs is a compressed tarball, it reads,
> 
> --8<---------------cut here---------------start------------->8---
> $ disarchive disassemble $(guix build -S hello)
> (disarchive
>    (gzip-member
>      (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz")
>      (digest
>        (sha256
>          "8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20"))
>      (header (mtime 0) (extra-flags 2) (os 3))
>      (footer (crc 2707092614) (isize 4945920))
>      (compressor gnu-best-rsync)
>      (input (tarball
>               (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar")
>               (digest
>                 (sha256
>                   "a2c33fd13c555015433956bcf06609293a34ce5c5e6a2070990bfb86070dc554"))
> [...]
> 
>      (input (directory-ref
>               (version 0)
>               (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1")
>               (addresses
>                 (swhid "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
>               (digest
>                 (sha256
>                   "1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0"))))))))
> --8<---------------cut here---------------end--------------->8---
> 
> where the values are just (considering that ’guix hash -S none -H sha256
> -f hex’ is equivalent to ’sha256sum’)
> 
> --8<---------------cut here---------------start------------->8---
> $ guix hash -S none -H sha256 -f hex $(guix build hello -S)
> 8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20
> $ gzip -d $(guix build -S hello) -c | sha256sum
> a2c33fd13c555015433956bcf06609293a34ce5c5e6a2070990bfb86070dc554  -
> --8<---------------cut here---------------end--------------->8---
> 
> However the fields ’swhid’ and the other SHA256 ’digest’ are different
> from above.  That’s because the dots [...] part.  It probably comes from
> the normalization process. Well, I am not sure to deeply understand why
> it is different but that’s another story. :-)

The reason for the normalisation was something about SWH only providing
tarballs whose contents are equal to the ingested tarball; the tarballs
are not bit-for-bit identical to the ingested tarball.  But Guix needs
bit-for-bit identical tarballs, so Disarchive contains the information
that was stripped out by SWH to complement the tarballs provided by
SWH.

> ## Fourth, it misses a bridge using NAR normalization (serialization).
> =========
> 
> Disarchive can (or could) provides a bridge (map) between SWHID+SHA1 and
> NAR+SHA256.  But it could be nice if it was implemented in SWH
> directly.  It would ease previous drawbacks.
> 
> For the interested reader, discussion there
> <https://gitlab.softwareheritage.org/swh/meta/-/issues/4538>.  Moreover,
> <https://gitlab.softwareheritage.org/swh/meta/-/issues/4538#note_121067>
> provides simple examples about NAR and how to implement it using Python.

I think nar stuff should be kept outside SWH.  It doesn't seem scalable 
to me for SWH to support the format of every distribution.  Likewise, I 
think that SWH identifiers should _not_ become an intrinsic identifier 
that is recorded in package definitions -- if there are other archives 
that are somewhat SWH-like archives, then Guix should support them too 
even if they don't use SWH identifiers for whatever reason, and 
including the identifier of every single archive seems unscalable to me.

I believe I have a solution on how to solve the ‘everyone uses different 
identifiers, how to map between them’ problem, but it will take some 
paragraphs:

At some point in the past, when thinking about downloading source code
over GNUnet File-sharing (FS), I had the problem that Guix and GNUnet
use different intrinsic identifiers -- Guix uses the NAR hash for
querying substitute servers, whereas FS has a system of its own that's
more convenient for P2P file-sharing stuff.

The problem then was to somehow map the NAR hash to the FS identifier.
I couldn't do this the Disarchive way, because the point was to be _P2P_ 
and Disarchive ... isn't.

A straightforward solution would be to just replace the https:// by 
gnunet:// in the origin (like in https://issues.guix.gnu.org/44199, 
except that patch doesn't support fallbacks to other URLs like url-fetch 
does).

The problem was that people demanded that gnunet:// should only be 
supported once there is actually source code on GNUnet and GNUnet is 
stable, but why would people put source code on GNUnet when no 
distribution supports it and how would GNUnet become stable without any 
users?

To work around these circular demands, I started 'rehash':
<https://lists.gnu.org/archive/html/guix-patches/2021-01/msg01067.html>
(current location: https://notabug.org/maximed/rehash).  It is a (P2P!) 
GNUnet service that maintains a 'SHA1512<->GNUnet FS URI' mapping, or 
more generally, a 'this hash type<->that hash type' mapping.

(It is just a service on top of the DHT, so the same could easily be 
done for BitTorrent or IPFS.)
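
To illustrate the idea only (this is _not_ rehash's API or wire format;
the class and method names are made up, there is no DHT involved, and
sha512 merely stands in for 'that other hash type'), a toy in-memory
version in Python:

```python
import hashlib

class HashMapping:
    """Toy 'this hash type <-> that hash type' mapping, kept in memory."""

    def __init__(self):
        self.table = {}

    def publish(self, content, types=("sha256", "sha512")):
        """Record that all the hashes of CONTENT designate the same object."""
        ids = [(t, hashlib.new(t, content).hexdigest()) for t in types]
        for key in ids:
            others = [i for i in ids if i != key]
            self.table.setdefault(key, set()).update(others)
        return ids

    def lookup(self, hash_type, digest):
        """Return the other known identifiers of the same object."""
        return self.table.get((hash_type, digest), set())

def verify(content, hash_type, digest):
    """Check a claimed identifier once the content has been fetched."""
    return hashlib.new(hash_type, content).hexdigest() == digest

mapping = HashMapping()
(kind_a, id_a), (kind_b, id_b) = mapping.publish(b"some source tarball")
print(mapping.lookup(kind_a, id_a) == {(kind_b, id_b)})
```

A bogus mapping can only be detected after the download, with something
like `verify` against the hash one already trusts.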

It's rather incomplete at the moment (there is no verification or
reputation mechanism at all, so the network could be flooded with bogus
mappings; mappings are only in the DHT, not stored on disk, so they are
lost on reboot; the POC Guix integration is a bit limited), but the
basics are there -- the POC successfully downloaded a substitute over
GNUnet _without_ having to include the FS URI in the narinfo (*)!

I'm writing about substitutes here, but the exact same approach could be 
done for plain source code.

(*) I might have misremembered; I can't find the POC on 
issues.guix.gnu.org again, and I'm not sure if the POC used rehash or if 
it just included the FS URI in the narinfo.

(TBC, I haven't been working on Rehash lately, but rather Scheme-GNUnet: 
a Scheme port of the GNUnet libraries that's less limited than 
Guile-GNUnet.  Idea is to make GNUnet-FS and rehash more convenient to 
use from Scheme, and in particular, in Guix.)

> # Discussion asking for comments and feedback
> =============================================
> 
> Still there?  If yes, thanks for reading. :-)
> 
> As shown in,
> 
> 1: <https://lists.gnu.org/archive/html/guix-devel/2023-02/msg00398.html>
> 2: <https://lists.gnu.org/archive/html/guix-devel/2023-03/msg00007.html>
> 
> we have holes and we are not currently robust for long-term (3-5 years)
> if our lovely build-farms are down for whatever reasons.
> 
> For sure, we have to fix the holes and bugs. :-)  However, I am asking
> what we could add for having more robustness on the long term.

I recommend P2P systems as an additional archival method to complement 
SWH -- while less reliable than SWH, it's also not a SPOF while SWH is a 
huge SPOF.

More concretely, 'guix perform-download' could automatically insert the 
tarball into the local peer (GNUnet, IPFS, whatever) (to make it 
available to P2P if it was only available from non-P2P http and the like 
previously) and 'guix perform-download' could support 'ipfs://', 
'gnunet://' ... URIs.

(Remember, you can put _multiple_ URLs in an 'origin', as mirrors / 
fallbacks / ...!)
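
A rough sketch of the 'multiple URLs as fallbacks' behaviour, written in
Python rather than as Guix code.  The function name, the 'fetchers' table
and the hex-encoded SHA-256 are assumptions for illustration (Guix itself
writes the hash in base32 and has its own download code).  The point is
only that each location is tried in turn and that the content hash, not
the location, decides whether the result is accepted.

```python
import hashlib


def fetch_with_fallbacks(uris, fetchers, expected_sha256):
    """Try each URI in order; return the first content whose SHA-256 matches.

    `fetchers` maps a scheme ("https", "ipfs", "gnunet", ...) to a function
    taking the URI and returning bytes.  Both are hypothetical here.
    """
    for uri in uris:
        scheme = uri.split(":", 1)[0]
        fetch = fetchers.get(scheme)
        if fetch is None:
            continue  # no fetcher for this kind of URI, try the next one
        try:
            data = fetch(uri)
        except OSError:
            continue  # location unreachable, try the next one
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data  # integrity check passed
    return None  # every location failed or returned the wrong content
```

Adding a P2P URI to the list then costs nothing for users without the
corresponding fetcher: the entry is simply skipped.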

git-fetch and friends would be trickier, but I assume something could be 
worked out (at worst Guix could just insert the nar into the peer and 
let the P2P URI just be a reference to the nar).

There is also the option of 'more mirrors': it should be possible to 
adjust the downloading code to look at the Debian archives, say.  (I had 
some success with finding 'disappeared sources' in other distributions 
in the past.)

> It is not affordable, neither wanted, to switch from the current
> extrinsic identification to a complete intrinsic one.  Although it would
> fix many issues. ;-)

How about in-between: include both an intrinsic identifier (the 
sha256sum) and an extrinsic identifier (the URLs to locate the object 
at), like the status quo.

Additionally, additional P2P identifiers could be added -- e.g. ipfs:// 
URIs could be added for url-fetch -- multiple URLs are allowed!  These 
additions could be automated with some script (go over the package 
origins one-by-one, download it, compute the P2P-network identifier, add 
that identifier to the origin).

> Guix and ’guix time-machine’ provides all the machinery for being able
> to redeploy later but as I have tried to point in the two links above
> [1,2], we are lacking tools for retrieving contents; well having the
> machinery does not mean that such machinery works well or is robust. :-)
> 
> The discussion could also fit how to distribute using ERIS.

ERIS is not a method on its own; you need to combine it with a P2P 
network that uses ERIS.  I do not understand the special focus on ERIS.

There are also various missing bits with ERIS currently
(see various comments at <https://issues.guix.gnu.org/52555#18>).
As such, I propose to use the standard encodings used by current P2P FS 
networks instead -- if an automated script is used as proposed above, 
going for standard P2P first doesn't inhibit Guix from switching to ERIS 
in the future.

> At some point, I was thinking to have something like “guix freeze -m
> manifest.scm” returning a map of all the sources from the deep bootstrap
> to the leaf packages described in manifest.scm.  However, maybe
> something is poor in the metadata we collect at package time.

That sounds like "guix build --sources=transitive" to me, except for 
being even more transitive.  I propose making this an additional option 
for the --sources argument instead.

> For instance, the substitutions work more or less using intrinsic
> identifier so it helps, I guess. :-)
> 
> Well, we could imagine the addition of another option field, say under
> ’properties’, that could store the intrinsic identifier of the
> fixed-outputs such as SWHID or Git tree / commit hash or else.  It would
> add robustness for later.
>
> Or maybe an optional field of the ’origin’ record for the same purpose.

It needs to be in the 'origin' record, not the 'package' record.  The 
fetchers (url-fetch, git-fetch, ...) only have access to the origin 
stuff, and origins can exist outside the context of a package.

Greetings,
Maxime.


* Re: intrinsic vs extrinsic identifier: toward more robustness?
  2023-03-04  0:08 ` Maxime Devos
@ 2023-03-04  4:10   ` Maxim Cournoyer
  2023-03-05 20:21   ` Simon Tournier
  1 sibling, 0 replies; 9+ messages in thread
From: Maxim Cournoyer @ 2023-03-04  4:10 UTC (permalink / raw)
  To: Maxime Devos; +Cc: Simon Tournier, Guix Devel

Hi Maxime (it's been some time, welcome back!)

Maxime Devos <maximedevos@telenet.be> writes:

[...]

> I think nar stuff should be kept outside SWH.  It doesn't seem
> scalable to me for SWH to support the format of every distribution.
> Likewise, I think that SWH identifiers should _not_ become an
> intrinsic identifier that is recorded in package definitions -- if
> there are other archives that are somewhat SWH-like archives, then
> Guix should support them too even if they don't use SWH identifiers
> for whatever reason, and including the identifier of every single
> archive seems unscalable to me.
>
> I believe I have a solution on how to solve the ‘everyone uses
> different identifiers, how to map between them’ problem, but it will
> take some paragraphs:
>
> At some point in the past, when thinking about downloading source code
> over GNUnet File-sharing (FS), I had the problem that Guix and GNUnet
> uses different intrinsic identifiers -- Guix uses the NAR hash for
> querying substitute servers, whereas FS has a system of its own that's
> more convenient for P2P file-sharing stuff.
>
> The problem then was to somehow map the NAR hash to the FS identifier.
> I couldn't do this the Disarchive way, because the point was to be
> _P2P_ and Disarchive ... isn't.
>
> A straightforward solution would be to just replace the https:// by
> gnunet:// in the origin (like in https://issues.guix.gnu.org/44199,
> except that patch doesn't support fallbacks to other URLs like
> url-fetch does).
>
> The problem was that people demanded that gnunet:// should only be
> supported once there is actually source code on GNUnet and GNUnet is
> stable, but why would people put source code on GNUnet when no
> distribution supports it and how would GNUnet become stable without
> any users?
>
> To work-around these circular demands, I started 'rehash':
> <https://lists.gnu.org/archive/html/guix-patches/2021-01/msg01067.html>
> (current location: https://notabug.org/maximed/rehash).  It is a
> (P2P!) GNUnet service that maintains a 'SHA512<->GNUnet FS URI'
> mapping, or more generally, a 'this hash type<->that hash type'
> mapping.
>
> (It is just a service on top of the DHT, so the same could easily be
> done for BitTorrent or IPFS.)
>
> It's rather incomplete at the moment (there is no verification or
> reputation mechanism at all so the network could be flooded with bogus
> mappings, mappings are only in DHT, not stored on disk, so they are
> lost on reboot, the POC Guix integration is a bit limited), but the
> basics are there -- the POC successfully downloaded a substitute over
> GNUnet _without_ having to include FS URI in the narinfo (*)!.
>
> I'm writing about substitutes here, but the exact same approach could
> be done for plain source code.
>
> (*) I might have misremembered; I can't find the POC on
> issues.guix.gnu.org again, and I'm not sure if the POC used rehash or
> if it just included the FS URI in the narinfo.
>
> (TBC, I haven't been working on Rehash lately, but rather
> Scheme-GNUnet: a Scheme port of the GNUnet libraries that's less
> limited than Guile-GNUnet.  Idea is to make GNUnet-FS and rehash more
> convenient to use from Scheme, and in particular, in Guix.)

Thanks for sharing your efforts on the P2P in Guix/GNUnet front!  P2P
seems like it'd make substitutes mirroring easy and improve robustness
as the network gets populated.  It's very interesting; it'd definitely
make an interesting summer internship :-).

Keep up the good and inspiring hacks!

-- 
Thanks,
Maxim


* Re: intrinsic vs extrinsic identifier: toward more robustness?
  2023-03-04  0:08 ` Maxime Devos
  2023-03-04  4:10   ` Maxim Cournoyer
@ 2023-03-05 20:21   ` Simon Tournier
  2023-03-06 12:22     ` Maxime Devos
  1 sibling, 1 reply; 9+ messages in thread
From: Simon Tournier @ 2023-03-05 20:21 UTC (permalink / raw)
  To: Maxime Devos, Guix Devel

Hi Maxime,

Thanks for your comments.

On Sat, 04 Mar 2023 at 01:08, Maxime Devos <maximedevos@telenet.be> wrote:

> To my understanding, there is only one 'real' identifier in Guix: the 
> (sha256sum (base32 ...)) (*).  Those other identifiers like the URL in 
> url-fetch and git-fetch are just hints on where to find the object -- 
> very important hints without which finding the object is much more 
> likely to fail, but just hints nonetheless.

I am not sure I understand what you mean by “hint”.  I would not call
URLs something like “just hints on where to find the object”.

NAR+SHA256 is only the ’real’ identifier when you allow
substitutes.  Otherwise, Guix fetches using the ’uri’ field of the
’origin’.  And that’s the scenario I am envisioning here: for whatever
reason, all the data in the Bordeaux and Berlin stores are gone, and
then it is a hard time for “guix time-machine”.

>> Intrinsic identifier also relies on a (trusted) map but collisions are
>> avoided as much as possible.  Somehow it strongly reduces the power of
>> the authority and it is often more robust.
>
> Who is 'the authority' here, how does the absence of collision reduces 
> the power of the authority, and what is your point about reducing the 
> power of the authority?

Considering intrinsic identifier, the “authority” is the data itself,
somehow.  In content-addressed systems, the “authority” is diluted or
absent.

>> Whatever the intrinsic identifier we consider – even ones based on very
>> weak cryptographic hash function as MD5, or based on non-crytographic
>> hash function as Pearson hashing, etc. – the integrity check is
>> currently done by SHA256.
>
> How about using the hash of the integrity check as an intrinsic 
> identifier, like is done currently?  I mean, we hash it anyway with 
> sha256 for the integrity check anyway, might as reuse it.

Maybe ask the GNUnet folks to address by NAR+SHA256 instead in their
specification. ;-)

Kidding aside, your comment raises two points of view:

 1. Guix is fetching data from elsewhere and this elsewhere is not using
    the NAR+SHA256 intrinsic identifier.  Therefore, the question is how
    to adapt the source origin to take this elsewhere into account?

 2. Replace the NAR+SHA256 integrity checksum by what content-addressed
    systems use as intrinsic identifier.  IMHO, that’s a bad idea for
    two reasons: (a) security, for instance SHA1 as used by SWH is not
    secure and (b) it will be unmanageable in practise.

>> All that’s said, Guix uses extrinsic identifiers for almost all origins,
>> if not all.  Even for ’git-fetch’ method.
>
> For git-fetch, the value of the 'commit' field is intrinsic (except when 
> it's a tag instead).

No, that is imprecise.  The exception is *not* a tag label as the value
of the ’commit’ field; the exception is a Git commit hash as the value.

> This can be solved by placing the actual commit in the 'commit' field of 
> git-reference, instead of the tag name, then things are completely 
> unambiguous -- this and its opposite were discussed in ‘On raw strings 
> in <origin> commit field’ (*), IIRC.

The thread you are referencing [1] is based on misunderstandings.  I
would like to move forward, hence my detailed email. :-)

1: <https://yhetil.org/guix/6e451a878b749d4afb6eede9b476e5faabb0d609.camel@gmail.com/#r>

> (*) Also maybe that thread about tricking peer review.
>
> I didn't understand the position that commit field should contain the 
> (indirect, fragile) tag instead of the (direct, robust) commit, but 
> those differences could be sidestepped by having both a 'tag' field and 
> a 'commit' field, IIUC.

I would not frame it this way.  My view is not to replace something by
something else but to add something, or several things.

> The problem then was to somehow map the NAR hash to the FS identifier.

Yes, that’s the problem. :-) The GNUnet FS identifier is one case.  And
my question here is: could we augment the source origin to be able to
deal with various identifiers?

> A straightforward solution would be to just replace the https:// by 
> gnunet:// in the origin (like in https://issues.guix.gnu.org/44199, 
> except that patch doesn't support fallbacks to other URLs like url-fetch 
> does).

Somehow, your proposition would be to have a list as URI, right?

     (origin
       (method gnunet-fetch)
       (uri
        (list
          (string-append "mirror://gnu/hello/hello-" version
                           ".tar.gz")
          "gnunet://fs/chk/TY48PGS5RVX643NT2B7GDNFCBT4DWG692PF4YNHERR96K6MSFRZ4ZWRPQ4KVKZV29MGRZTWAMY9ETTST4B6VFM47JR2JS5PWBTPVXB0.8A9HRYABJ7HDA7B0"
          "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
       (file-name "gnunet-hello-2.10.tar.gz")
       (sha256
        (base32
         "0ssi1wpaf7plaswqqjwigppsg5fyh99vdlb9kzl7c9lng89ndq1i")))

>> It is not affordable, neither wanted, to switch from the current
>> extrinsic identification to a complete intrinsic one.  Although it would
>> fix many issues. ;-)
>
> How about in-between: include both an intrinsic identifier (the 
> sha256sum) and an extrinsic identifier (the URLs to locate the object 
> at), like the status quo.

That’s what I am proposing between the lines. :-)

The question is which design.  For instance, it could go under the field
’properties’ similarly as “upstream name” or potentially other
“metadata”.  Or it could go under the source origin field.

Well, however as you pointed, being a ’properties’ would not be as
easy.  And as you also pointed, the integrity field could be something
else than ’sha256’, so maybe we could have a list here.

>> The discussion could also fit how to distribute using ERIS.
>
> ERIS is not a method on its own; you need to combine it with a P2P 
> network that uses ERIS.  I do not understand the special focus on ERIS.

Yes, indeed.  However, to my knowledge, each P2P network can use its own
identifier and, from my understanding, ERIS can rely on any of them.
Therefore, wanting guix-daemon to be able to use ERIS somehow implies a
discussion about the identifiers used by the P2P networks.

Am I missing something?

>> At some point, I was thinking to have something like “guix freeze -m
>> manifest.scm” returning a map of all the sources from the deep bootstrap
>> to the leaf packages described in manifest.scm.  However, maybe
>> something is poor in the metadata we collect at package time.
>
> That sounds like "guix build --sources=transitive" to me, except for 
> being even more transitive.  I propose making this an additional option 
> for the --sources argument instead.

No.  “guix build --sources=transitive” returns an archive containing all
the sources.  Instead, I would like all the various identifiers (URL,
NAR, SWHID, GNUnet, etc.) of all the transitive sources.

Cheers,
simon

PS:

>> However the fields ’swhid’ and the other SHA256 ’digest’ are different
>> from above.  That’s because the dots [...] part.  It probably comes from
>> the normalization process. Well, I am not sure to deeply understand why
>> it is different but that’s another story. :-)
>
> The reason for the normalisation was something about SWH only providing 
> tarballs whose contents are equal to the ingested tarball; the tarballs 
> are not bit-for-bit identical to the ingested tarball.  But Guix needs 
> bit-for-bit identical tarballs, so Disarchive contains the information 
> that was stripped-out by SWH to complement the tarballs provided by 
> Disarchive.

SWH is not in the picture with the example I provided. :-)  Yes, the
dots part is related to some normalization and “metadata”.

What I do not understand is, if the result of “guix build hello -S” is
manually uncompressed and untarred, the content corresponds to:

    $ guix hash -S git -H sha256 -f hex hello-2.12.1
    cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4

The tool ’disarchive’ disassembles the compressed archive: it first
provides the hash of the compressed archive (.tar.gz), then stores
metadata about the compression level, algorithm, etc., then provides the
hash of the uncompressed archive (.tar), then stores metadata about the
files, and last it provides the hash of the tree.  It reads,

    (input (directory-ref
             (version 0)
             (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1")
             (addresses
               (swhid "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
             (digest
               (sha256
                 "1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0"))))))))

and I do not understand why it is not the same as manually computed; see
above.   Well, that’s a detail and not relevant to the current
discussion since it is part of how Disarchive works internally.

* Re: intrinsic vs extrinsic identifier: toward more robustness?
  2023-03-05 20:21   ` Simon Tournier
@ 2023-03-06 12:22     ` Maxime Devos
  2023-03-06 13:42       ` Simon Tournier
  0 siblings, 1 reply; 9+ messages in thread
From: Maxime Devos @ 2023-03-06 12:22 UTC (permalink / raw)
  To: Simon Tournier, Guix Devel


Op 05-03-2023 om 21:21 schreef Simon Tournier:
>>> Whatever the intrinsic identifier we consider – even ones based on very
>>> weak cryptographic hash function as MD5, or based on non-crytographic
>>> hash function as Pearson hashing, etc. – the integrity check is
>>> currently done by SHA256.
>>
>> How about using the hash of the integrity check as an intrinsic
>> identifier, like is done currently?  I mean, we hash it anyway with
>> sha256 for the integrity check anyway, might as reuse it.
> 
> Maybe ask GNUnet folk to address by NAR+SHA256 instead on their
> specification. ;-)

Obviously, Guix should replace NAR+SHA256 by GNUnet FS URIs /j.

> Kidding aside, your comment rises two points of view:
> 
>   1. Guix is fetching data from elsewhere and this elsewhere is not using
>      NAR+SHA256 intrinsic identifier.  Therefore, the question is how to
>      adapt the source origin for taking into account this elsewhere?
> 
>   2. Replace the NAR+SHA256 integrity checksum by what content-addressed
>      systems use as intrinsic identifier.  IMHO, that’s a bad idea for
>      two reasons: (a) security, for instance SHA1 as used by SWH is not
>      secure and (b) it will be unmanageable in practise.

I was thinking of (1), not (2).

>>> All that’s said, Guix uses extrinsic identifiers for almost all origins,
>>> if not all.  Even for ’git-fetch’ method.
>>
>> For git-fetch, the value of the 'commit' field is intrinsic (except when
>> it's a tag instead).
> 
> No, it is imprecise.  The exception is *not* label tag as value for the
> ’commit’ field but the exception is Git commit hash as value.

Are you referring to the fact that currently, the 'commit' field usually 
contains a tag name, and that it containing a commit is the exception?
If so, that doesn't contradict my claim.

>> This can be solved by placing the actual commit in the 'commit' field of
>> git-reference, instead of the tag name, then things are completely
>> unambiguous -- this and its opposite were discussed in ‘On raw strings
>> in <origin> commit field’ (*), IIRC.
> 
> The thread you are referencing [1] is based on misunderstandings.  I
> would like to move forward, hence my detailed email. :-)
> 
> 1: <https://yhetil.org/guix/6e451a878b749d4afb6eede9b476e5faabb0d609.camel@gmail.com/#r>

Your email is about intrinsic identifiers and more robustness, yet it 
doesn't mention using git commits more anywhere.  As such, I do not 
follow ‘hence my detailed email’ -- it contains detail, but it misses 
some relevant detail that I pointed out in my previous response.

Also, with ‘move forward’, do you mean ‘move forward’, or ‘maintain 
status quo’?  Because given that you are replying to the proposed 
solution (that even avoids problems pointed out in those threads) by 
saying nothing of technical importance and by pointing to some 
contentious things, it really appears to be the latter to me.

>> (*) Also maybe that thread about tricking peer review.
>>
>> I didn't understand the position that commit field should contain the
>> (indirect, fragile) tag instead of the (direct, robust) commit, but
>> those differences could be sidestepped by having both a 'tag' field and
>> a 'commit' field, IIUC.
> 
> I would not frame this way.  My view is not to replace something by
> something else, instead, is to add something and/or several things.

I was thinking of adding the commit (intrinsic) to the git-reference, 
instead of only having a tag (extrinsic) in the git-reference as is 
mostly done currently.

I also want to mention that, except for a general notion of 'more 
robustness' and a specific command "guix freeze -m manifest.scm" and 
such, you never mentioned what your view was, so I had to guess.

>> The problem then was to somehow map the NAR hash to the FS identifier.
> 
> Yes, that’s the problem. :-) GNUnet FS identifier is one case.  And my
> discussion here is: could we augment source origin to be able to deal
> with various identifier?
> 
> 
>> A straightforward solution would be to just replace the https:// by
>> gnunet:// in the origin (like in https://issues.guix.gnu.org/44199,
>> except that patch doesn't support fallbacks to other URLs like url-fetch
>> does).
> 
> Somehow, your proposition would be to have a list as URI, right?
> 
>       (origin
>         (method gnunet-fetch)
>         (uri
>          (list
>            (string-append "mirror://gnu/hello/hello-" version
>                             ".tar.gz")
>            "gnunet://fs/chk/TY48PGS5RVX643NT2B7GDNFCBT4DWG692PF4YNHERR96K6MSFRZ4ZWRPQ4KVKZV29MGRZTWAMY9ETTST4B6VFM47JR2JS5PWBTPVXB0.8A9HRYABJ7HDA7B0"
>            "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
>         (file-name "gnunet-hello-2.10.tar.gz")
>         (sha256
>          (base32
>           "0ssi1wpaf7plaswqqjwigppsg5fyh99vdlb9kzl7c9lng89ndq1i")))

Yes, though in a proper version of 44199 (which doesn't exist yet) it 
would just be integrated into url-fetch instead of having a separate 
gnunet-fetch.

>>> It is not affordable, neither wanted, to switch from the current
>>> extrinsic identification to a complete intrinsic one.  Although it would
>>> fix many issues. ;-)
>>
>> How about in-between: include both an intrinsic identifier (the
>> sha256sum) and an extrinsic identifier (the URLs to locate the object
>> at), like the status quo.
> 
> That’s what I am proposing between the lines. :-)

I recommend being explicit.

> The question is which design.  For instance, it could go under the field
> ’properties’ similarly as “upstream name” or potentially other
> “metadata”.  Or it could go under the source origin field.
> 
> Well, however as you pointed, being a ’properties’ would not be as
> easy.  And as you also pointed, the integrity field could be something
> else than ’sha256’, so maybe we could have a list here.

To be clear, my comment on Guix supporting other things than sha256 was 
just a statement of fact, not a proposal to use that mechanism (and 
neither a proposal to not use that mechanism).

>>> The discussion could also fit how to distribute using ERIS.
>>
>> ERIS is not a method on its own; you need to combine it with a P2P
>> network that uses ERIS.  I do not understand the special focus on ERIS.
> 
> Yes, indeed.  However, to my knowledge, each P2P can use its own
> identifier and from my understanding, ERIS relies on whatever P2P.
> Therefore, willing guix-daemon being able to use ERIS, it somehow
> implies a discussion about the identifiers used by the P2P networks.
> 
> Do I miss something?

I don't have any issue with ERIS itself (*).  The issue I have with 
ERIS, is that it often appears to be treated as some panacea that 
transcends all P2P systems and is fundamentally different from other 
identifiers used by other P2P systems, but <https://xkcd.com/927/> 
applies here -- while it might become some universal standard, it isn't yet.

Hence, ‘I do not understand the __special__ focus on ERIS’ (emphasis 
added).  As long as the ERIS identifier is treated as one among many 
instead of somehow being considered special, it's fine to me.

(*) Besides several technical issues in its current implementation -- 
the implementation of ERIS is optimised for classical transports instead 
of P2P transports, ERIS is only implemented for IPFS currently and ERIS 
doesn't have a deduplication system for directories.  (In GNUnet, and 
I think in IPFS and BitTorrent too, if two directories 
(e.g. store items) that have a file in common were put into the P2P 
network, then for the network's purposes these two files are the same 
file, so availability of one store item aids the availability of 
another store item.)

>>> At some point, I was thinking to have something like “guix freeze -m
>>> manifest.scm” returning a map of all the sources from the deep bootstrap
>>> to the leaf packages described in manifest.scm.  However, maybe
>>> something is poor in the metadata we collect at package time.
>>
>> That sounds like "guix build --sources=transitive" to me, except for
>> being even more transitive.  I propose making this an additional option
>> for the --sources argument instead.
> 
> No.  “guix build --sources=transitive” returns an archive containing all
> the sources.  Instead, I would like the all various identifiers (URL,
> NAR, SWHID, GNUnet, etc.) of all the transitive sources.

I do not see how making a list of all identifiers helps with robustness 
-- you need the object the identifiers point to, not the identifier itself.

Unless the goal is to use the map of package->identifiers to determine 
which packages are currently lacking redundancy (i.e., have few 
identifiers), which to be clear seems reasonable to me.
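
For that second use, a small Python sketch of what such a report could
look like.  The function name, the 'minimum' threshold and the example
map are hypothetical; no existing Guix command produces this map today.
It simply lists the sources that have fewer distinct identifiers than
wanted.

```python
def lacking_redundancy(identifier_map, minimum=2):
    """Return the names whose sources have fewer than `minimum` identifiers."""
    return sorted(
        name
        for name, identifiers in identifier_map.items()
        if len(set(identifiers)) < minimum
    )


# Hypothetical map of package name -> known identifiers of its source.
example = {
    "hello": [
        "mirror://gnu/hello/hello-2.12.1.tar.gz",
        "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a",
    ],
    "some-package": ["https://example.org/some-package.tar.gz"],
}

print(lacking_redundancy(example))
```

Here only "some-package" is reported, since it is reachable through a
single location.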
> Cheers,
> simon
> 
> PS:
> 
>>> However the fields ’swhid’ and the other SHA256 ’digest’ are different
>>> from above.  That’s because the dots [...] part.  It probably comes from
>>> the normalization process. Well, I am not sure to deeply understand why
>>> it is different but that’s another story. :-)
>>
>> The reason for the normalisation was something about SWH only providing
>> tarballs whose contents are equal to the ingested tarball; the tarballs
>> are not bit-for-bit identical to the ingested tarball.  But Guix needs
>> bit-for-bit identical tarballs, so Disarchive contains the information
>> that was stripped-out by SWH to complement the tarballs provided by
>> Disarchive.
> 
> SWH is not in the picture with the example I provided. :-)  Yes, the
> dots part is related to some normalization and “metadata”.

Your question was about where the differences come from.  The answer is 
‘because SWH normalisation stuff’.  As such, SWH is in the picture.

> What I do not understand is, if “guix build hello -S” is manually
> uncompressed and untar, the content corresponds to:
> 
>      $ guix hash -S git -H sha256 -f hex hello-2.12.1
>      cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4
> 
> The tool ’disarchive’ disassembles the compressed archive; it first
> provides the hash of the compressed archive (.tar.gz), then store
> metadata about compression level, algorithm etc, then provides the hash
> of the uncompressed archive (.tar), then store metadata about files and
> last it provides the hash of the tree, it reads,
> 
>      (input (directory-ref
>               (version 0)
>               (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1")
>               (addresses
>                 (swhid "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
>               (digest
>                 (sha256
>                   "1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0"))))))))
> 
> and I do not understand why it is not the same as manually computed; see
> above.   Well, that’s a detail and not relevant to the current
> discussion since it is part of how Disarchive works internally.

You are hashing the 'hello-2.12.1' directory, which is the only 
directory in the tarball.  However, while it is considered bad practice, 
a tarball can contain multiple top-level entries.  As such, you should 
consider the tarball as an encoding of a directory that happens to 
contain the 'hello-2.12.1' directory, and hash the wrapper directory 
instead of its member hello-2.12.1:

$ mkdir a
$ cd a
$ tar -xf /gnu/store/3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz
$ guix hash -Sgit -H sha256 -f hex .
1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0

Using these steps, the value in the (digest (sha256 ...)) is recovered.
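
Why the wrapper directory gets a different hash can be seen with a small
Python sketch of Git-style tree hashing.  This is an illustration under
assumptions, not what 'guix hash' runs: it uses Git's SHA-1 object format
(the commands above use the Git serializer with SHA-256), it only handles
plain files with mode 100644, and the in-memory dict representation is
made up for the example.

```python
import hashlib


def git_object_id(kind: bytes, payload: bytes) -> bytes:
    """SHA-1 of a Git object: '<kind> <length>\\0' followed by the payload."""
    header = kind + b" " + str(len(payload)).encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).digest()


def tree_id(tree: dict) -> bytes:
    """Hash a directory given as {name: bytes (file) or dict (subdirectory)}."""
    entries = []
    for name, node in tree.items():
        if isinstance(node, dict):
            # Git sorts directories as if their name ended with '/'.
            entries.append((name + "/", b"40000", name, tree_id(node)))
        else:
            entries.append((name, b"100644", name, git_object_id(b"blob", node)))
    entries.sort(key=lambda entry: entry[0].encode())
    payload = b"".join(
        mode + b" " + name.encode() + b"\0" + oid
        for _, mode, name, oid in entries
    )
    return git_object_id(b"tree", payload)


member = {"README": b"hello\n"}       # stands for the hello-2.12.1 directory
wrapper = {"hello-2.12.1": member}    # stands for the directory it was extracted into
print(tree_id(member).hex() != tree_id(wrapper).hex())
```

The wrapper is itself a tree with one entry, so its identifier is
computed over that entry and necessarily differs from the identifier of
the directory it contains.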

Greetings,
Maxime.


* Re: intrinsic vs extrinsic identifier: toward more robustness?
  2023-03-06 12:22     ` Maxime Devos
@ 2023-03-06 13:42       ` Simon Tournier
  0 siblings, 0 replies; 9+ messages in thread
From: Simon Tournier @ 2023-03-06 13:42 UTC (permalink / raw)
  To: Maxime Devos, Guix Devel

Hi,

On Mon, 06 Mar 2023 at 13:22, Maxime Devos <maximedevos@telenet.be> wrote:

>>> For git-fetch, the value of the 'commit' field is intrinsic (except when
>>> it's a tag instead).
>> 
>> No, it is imprecise.  The exception is *not* label tag as value for the
>> ’commit’ field but the exception is Git commit hash as value.
>
> Are you referring to the fact that currently, the 'commit' field usually 
> contains a tag name, and that it containing a commit is the exception?

Yes.

> If so, that doesn't contradict my claim.

There is no contradiction but imprecision.

> I do not see how making a list of all identifiers helps with robustness 
> -- you need the object the identifiers point to, not the identifier itself.

If you have the identifiers, you have a chance to find the content
again.  For example, in addition to NAR+SHA256, we could also store
Git+SHA1 or plain SHA256 or something else.  It would help in exploiting
other content-addressed systems.  For instance, SWH stores,

        "checksums": {
            "sha1": "3a48fbd0a69c7875dc18bd48a16da04d1512ed47",
            "sha1_git": "69cb76019a474330e99666f147ecb85e44de1ce6",
            "sha256": "e62e0f13f9025642a52f9fcb12ca0c31d5e05f78e97224f55b3d70d47c73b549"
        },

and maybe ’sha256_nar’ soon.  Somehow, we have a list of mirrors, so why
not similarly have a list of intrinsic identifiers?
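
As an illustration, the three checksums above can be computed for a
single file with a few lines of Python.  The function name is an
assumption; the definitions are the usual ones: plain SHA-1 and SHA-256
of the bytes, and for ’sha1_git’ the SHA-1 of the content prefixed by the
Git blob header.  A NAR-based hash is left out because it needs the NAR
serialization, which is not sketched here.

```python
import hashlib


def identifiers(data: bytes) -> dict:
    """Compute several intrinsic identifiers for the same file content."""
    # Git hashes a blob as 'blob <length>\0' followed by the content.
    git_blob = b"blob " + str(len(data)).encode("ascii") + b"\0" + data
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_blob).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


print(identifiers(b"hello\n")["sha1_git"])
# ce013625030ba8dba906f756967f9e9ca394464a, same as 'git hash-object'
```

All three values come from one pass over content that has to be
downloaded anyway, so recording them could be automated.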

> You are hashing the 'hello-2.12.1' directory

Thanks!  Having the noise too close and I missed the obvious. :-)

Cheers,
simon

* Re: intrinsic vs extrinsic identifier: toward more robustness?
  2023-03-03 18:07 intrinsic vs extrinsic identifier: toward more robustness? Simon Tournier
  2023-03-04  0:08 ` Maxime Devos
@ 2023-03-16 17:45 ` Ludovic Courtès
  2023-04-06 12:15   ` Simon Tournier
  2023-10-04  8:52   ` content-address hint? (was Re: intrinsic vs extrinsic identifier: toward more robustness?) Simon Tournier
  1 sibling, 2 replies; 9+ messages in thread
From: Ludovic Courtès @ 2023-03-16 17:45 UTC (permalink / raw)
  To: Simon Tournier; +Cc: Guix Devel

Hi!

Thanks for starting this discussion!

Simon Tournier <zimon.toutoune@gmail.com> skribis:

> For sure, we have to fix the holes and bugs. :-)  However, I am asking
> what we could add for having more robustness on the long term.
>
> It is not affordable, neither wanted, to switch from the current
> extrinsic identification to a complete intrinsic one.  Although it would
> fix many issues. ;-)

Sources (fixed-output derivations) are already content-addressed, by
definition (I prefer “content addressing” over “intrinsic
identification” because that’s a more widely recognized term).

In a way, as Maxime was saying, the URL/URI is just a hint; what
matters is the content hash that appears in the origin.

So it seems to me that the basics are already in place.

What’s missing, both in SWH and in Guix, is the ability to store
multiple hashes.  SWH could certainly store several hashes, computed
using different serialization and hash algorithm combinations.

This is what you suggested at
<https://gitlab.softwareheritage.org/swh/meta/-/issues/4538>; it was
also discussed in the thread at
<https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00019.html>.  It
would be awesome if SWH would store Nar hashes; that would solve all our
problems, as you explained.

The other option—storing multiple hashes for each origin in Guix—doesn’t
sound practical: I can’t imagine packages storing and updating more than
one content hash per package.  That doesn’t sound reasonable.  Plus it
would be a long-term solution and wouldn’t help today.

Thoughts?

Ludo’.

* Re: intrinsic vs extrinsic identifier: toward more robustness?
From: Simon Tournier @ 2023-04-06 12:15 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Guix Devel

Hi,

On Thu, 16 Mar 2023 at 18:45, Ludovic Courtès <ludo@gnu.org> wrote:

>> For sure, we have to fix the holes and bugs. :-)  However, I am asking
>> what we could add for having more robustness on the long term.

> Sources (fixed-output derivations) are already content-addressed, by
> definition (I prefer “content addressing” over “intrinsic
> identification” because that’s a more widely recognized term).

This is the case when you consider that the result of the fixed-output
derivation is already inside the Guix “ecosystem”…

> In a way, as Maxime was saying, the URL/URI is just a hint; what
> matters is the content hash that appears in the origin.

…but otherwise the URL/URI is not just a “hint”.  Or could you explain
what you mean by a “hint”?

Maybe I misunderstand something, but as far as I understand, the URL/URI
is a “hint” only when substitutes are available; otherwise Guix relies on
the plain URL/URI for fetching data.

--8<---------------cut here---------------start------------->8---
$ guix build hello -S --no-substitutes --check
The following derivation will be built:
  /gnu/store/3hxraqxb0zklq065zjrxcs199ynmvicy-hello-2.12.1.tar.gz.drv
building /gnu/store/3hxraqxb0zklq065zjrxcs199ynmvicy-hello-2.12.1.tar.gz.drv...

Starting download of /gnu/store/1s6xba6nafkxb242kafkg3x10jkdn2n9-hello-2.12.1.tar.gz
From https://ftpmirror.gnu.org/gnu/hello/hello-2.12.1.tar.gz...
following redirection to `https://mirror.cyberbits.eu/gnu/hello/hello-2.12.1.tar.gz'...
downloading from https://ftpmirror.gnu.org/gnu/hello/hello-2.12.1.tar.gz ...

warning: rewriting hashes in `/gnu/store/3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz'; cross fingers
--8<---------------cut here---------------end--------------->8---

In other words, when speaking about robustness (in the broad sense), I
think we cannot assume that the “content addressing” provided by the
derivation,
--8<---------------cut here---------------start------------->8---
Derive
([("out","/gnu/store/3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz","sha256","8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20")]
 ,[]
 ,["/gnu/store/0mxnx8l4fgigvd7gakwdk6hc6im4wnai-disarchive-mirrors","/gnu/store/ckxc05iflc8jagdxwh4z1cxc23mb6i6q-mirrors","/gnu/store/wg1yp2vx8gb7qmcgyibqnwblahpp4bjg-content-addressed-mirrors"]
 ,"x86_64-linux","builtin:download",[]
 ,[("content-addressed-mirrors","/gnu/store/wg1yp2vx8gb7qmcgyibqnwblahpp4bjg-content-addressed-mirrors")
   ,("disarchive-mirrors","/gnu/store/0mxnx8l4fgigvd7gakwdk6hc6im4wnai-disarchive-mirrors")
   ,("impureEnvVars","http_proxy https_proxy LC_ALL LC_MESSAGES LANG COLUMNS")
   ,("mirrors","/gnu/store/ckxc05iflc8jagdxwh4z1cxc23mb6i6q-mirrors")
   ,("out","/gnu/store/3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz")
   ,("preferLocalBuild","1")
   ,("url","\"mirror://gnu/hello/hello-2.12.1.tar.gz\"")])
--8<---------------cut here---------------end--------------->8---

is still available; when it is not, Guix has to rely on another system
(here ’url’).  I am proposing to optionally add more “content addressing”
identifiers than the current NAR+SHA256 (and URL/URI), so that Guix can
then exploit other “content addressing” systems.

> So it seems to me that the basics are already in place.

Well, there are two possible choices: (1) rely on an external service
that would bridge the different content addressing systems (such as
extending the Disarchive database, or hoping SWH will do it :-)), but
this external service needs to be always available; or (2) extend the
information carried by packages (optional fields, etc.).

Moreover, with (1), all third-party channels would have to be ingested
by this external service.  For SWH, that’s possible.  For the Disarchive
database, it would mean registering each third-party channel, or having
the channel maintain its own database.  With (2), by contrast, the
identifier would optionally be part of the package definition.

> What’s missing, both in SWH and in Guix, is the ability to store
> multiple hashes.  SWH could certainly store several hashes, computed
> using different serialization and hash algorithm combinations.

Please note that currently Guix relies on a “hint” when SWH is used as a
fallback.  For instance, in most cases of git-fetch, Guix provides the
context (URL and Git tag) to the SWH API and lets SWH resolve it in
order to find the content addressing identifier.  It works in many cases
but it fails for “history of history” cases, e.g., when upstream does
in-place tag replacement.

And this strategy does not work with Subversion (svn-fetch), Mercurial
(hg-fetch) or others.  It requires more work on our side (parse the
result of the query, extract relevant information, etc.).  Nothing
impossible but far from done, IMHO. :-)

Well, I still have mixed feelings about the SWH fallback robustness. :-)

> This is what you suggested at
> <https://gitlab.softwareheritage.org/swh/meta/-/issues/4538>; it was
> also discussed in the thread at
> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00019.html>.  It
> would be awesome if SWH would store Nar hashes; that would solve all our
> problems, as you explained.

Yeah that’s nice. :-)  The progress is tracked by,

    https://gitlab.softwareheritage.org/swh/meta/-/issues/4979

and the first part for computing NAR is now merged, IIUC, with:

    https://gitlab.softwareheritage.org/swh/devel/swh-loader-core/-/merge_requests/459

However, exposing this NAR hash via their API and then bridging NAR ->
SWHID is not planned on the SWH side yet, AFAIK.

> The other option—storing multiple hashes for each origin in Guix—doesn’t
> sound practical: I can’t imagine packages storing and updating more than
> one content hash per package.  That doesn’t sound reasonable.  Plus it
> would be a long-term solution and wouldn’t help today.

Storing a list of content addressing identifiers (NAR+SHA256, Git+SHA1,
GNUnet, IPFS, etc.) would add robustness, IMHO.

In other words, it is not affordable to have a ’gnunet-fetch’ method as
proposed in [1], but we could optionally have,

     (origin
       (method url-fetch)
       (uri (string-append "mirror://gnu/hello/hello-" version
                           ".tar.gz"))
       (sha256
        (base32
         "086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd"))
       (identifiers
        (list
         (gnunet "Y48PGS5RVX643NT2B7GDNFCBT4DWG692PF4YNHERR96K6MSFRZ4ZWRPQ4KVKZV29MGRZTWAMY9ETTST4B6VFM47JR2JS5PWBTPVXB0.8A9HRYABJ7HDA7B0")
         (git+sha1 "swh:1:dir:013573086777370b558b1a9ecb6d0dca9bb8ea18")
         (none+sha1 "8f261739d33d31867ab9c5fa26f973c37da26ca5"))))

And we could also have Git commit hash (for packages using git-fetch
method), etc.

Having an optional ’identifiers’ field would help today for all fetch
methods other than url-fetch and git-fetch.

For sure, it is not straightforward.  For instance, how do we ensure
consistency?  Via “guix lint”?  Something else?
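
As a rough idea of what such a consistency check could look like, here
is a Python sketch (not an existing “guix lint” checker).  It assumes
the hypothetical ’identifiers’ field is available as a mapping from
scheme name to declared value; `lint_identifiers` and the `VERIFIERS`
table are made-up names, and `none+sha256` is an assumed scheme added
next to the `none+sha1` of the example above.

```python
import hashlib

# Hypothetical table: identifier scheme -> function recomputing that
# identifier from the raw file content.  Schemes such as GNUnet or SWHID
# are deliberately absent: they cannot be recomputed this simply.
VERIFIERS = {
    "none+sha1": lambda content: hashlib.sha1(content).hexdigest(),
    "none+sha256": lambda content: hashlib.sha256(content).hexdigest(),
}


def lint_identifiers(content: bytes, identifiers: dict) -> list:
    """Return warnings for declared identifiers that cannot be confirmed."""
    warnings = []
    for scheme, declared in identifiers.items():
        verifier = VERIFIERS.get(scheme)
        if verifier is None:
            warnings.append(f"{scheme}: cannot be checked locally")
        elif verifier(content) != declared:
            warnings.append(f"{scheme}: declared {declared} does not match content")
    return warnings


if __name__ == "__main__":
    source = b"pretend this is hello-2.12.1.tar.gz"
    declared = {
        "none+sha1": hashlib.sha1(source).hexdigest(),
        "gnunet": "some-gnunet-identifier",  # placeholder value
    }
    for warning in lint_identifiers(source, declared):
        print("warning:", warning)
```

The sketch also shows the limit: identifiers that need an external
system to resolve can only be reported as unchecked, not as wrong.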

Well, on the other hand, sometimes I would like to have a list of
sources using different fetch methods, say try first this url-fetch,
then this git-fetch, then this SWH fallback, etc.

To me, the other viable option would be to extend the Disarchive database
and the services around it.

Thoughts?

Cheers,
simon

1: https://issues.guix.gnu.org/44199#0-lineno68

* content-address hint? (was Re: intrinsic vs extrinsic identifier: toward more robustness?)
From: Simon Tournier @ 2023-10-04  8:52 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Guix Devel

Hi Ludo,

On Thu, 16 Mar 2023 at 18:45, Ludovic Courtès <ludo@gnu.org> wrote:

> Thanks for starting this discussion!

I feel this discussion is still pending, so I am resuming. :-)

If context is missing, the thread starts here.

        intrinsic vs extrinsic identifier: toward more robustness?
        Simon Tournier <zimon.toutoune@gmail.com>
        Fri, 03 Mar 2023 19:07:23 +0100
        id:87jzzxd7z8.fsf@gmail.com
        https://lists.gnu.org/archive/html/guix-devel/2023-03
        https://yhetil.org/guix/87jzzxd7z8.fsf@gmail.com

> Sources (fixed-output derivations) are already content-addressed, by
> definition (I prefer “content addressing” over “intrinsic
> identification” because that’s a more widely recognized term).

From my understanding, this is correct only when the sources live in the
Guix project infrastructure.  I agree that if the source is
substitutable (= the source exists on one of the substitute servers,
i.e., Guix project servers), then the fixed-output derivation is
content-addressed.

For instance, let us consider this fixed-output derivation:

--8<---------------cut here---------------start------------->8---
Derive
([("out","/gnu/store/n1k6jppyasn20zr6m8sfyv5ll07ibyf1-asciidoc-8.6.10.tar.gz","sha256","9e52f8578d891beaef25730a92a6e723596ddbd07bfe0d2a56486fcf63a0b983")]
 ,[]
 ,["/gnu/store/5iw2ivjw5njyyvi7avyphfcibgbqdbsc-mirrors","/gnu/store/vwyxp1dq4lb97n6b20w5cqxasy2dai79-content-addressed-mirrors"]
 ,"x86_64-linux","builtin:download",[]
 ,[("content-addressed-mirrors","/gnu/store/vwyxp1dq4lb97n6b20w5cqxasy2dai79-content-addressed-mirrors")
   ,("impureEnvVars","http_proxy https_proxy LC_ALL LC_MESSAGES LANG COLUMNS")
   ,("mirrors","/gnu/store/5iw2ivjw5njyyvi7avyphfcibgbqdbsc-mirrors")
   ,("out","/gnu/store/n1k6jppyasn20zr6m8sfyv5ll07ibyf1-asciidoc-8.6.10.tar.gz")
   ,("preferLocalBuild","1")
   ,("url","\"https://github.com/asciidoc/asciidoc/archive/8.6.10.tar.gz\"")])
--8<---------------cut here---------------end--------------->8---

I agree that the “url” field is not needed as long as the content exists
on one of the “content-addressed-mirrors”.  If one opens that file, then
the code reads:

--8<---------------cut here---------------start------------->8---
(begin
  (use-modules
   (guix base32))
  (define
    (guix-publish host)
    (lambda
        (file algo hash)
      (string-append "https://" host "/file/" file "/"
                     (symbol->string algo)
                     "/"
                     (bytevector->nix-base32-string hash))))
  (module-autoload!
   (current-module)
   (quote
    (guix base16))
   (quote
    (bytevector->base16-string)))
  (list
   (guix-publish "ci.guix.gnu.org")
   (lambda
       (file algo hash)
     (string-append "https://tarballs.nixos.org/"
                    (symbol->string algo)
                    "/"
                    (bytevector->nix-base32-string hash)))
   (lambda
       (file algo hash)
     (string-append "https://archive.softwareheritage.org/api/1/content/"
                    (symbol->string algo)
                    ":"
                    (bytevector->base16-string hash)
                    "/raw/"))))
--8<---------------cut here---------------end--------------->8---

Therefore, the look-up is done by content address against these 3
servers.
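
For readers less at ease with Scheme, here is a Python transliteration
of the URL construction above, as a sketch only.  The function names
are mine, and `nix_base32` is my own rendering of the Nix-style base32
encoding that `bytevector->nix-base32-string` produces.  The example
takes the hex digest from the derivation above.

```python
NIX_BASE32_ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"


def nix_base32(data: bytes) -> str:
    """Encode bytes in the Nix/Guix flavour of base32 (sketch)."""
    length = (len(data) * 8 - 1) // 5 + 1
    chars = []
    # Characters are emitted from the most significant 5-bit group down.
    for n in range(length - 1, -1, -1):
        i, j = divmod(n * 5, 8)
        value = data[i] >> j
        if i + 1 < len(data):
            value |= data[i + 1] << (8 - j)
        chars.append(NIX_BASE32_ALPHABET[value & 0x1F])
    return "".join(chars)


def content_addressed_urls(file: str, algo: str, digest: bytes) -> list:
    """Build the three candidate URLs, in the same order as the list above."""
    return [
        f"https://ci.guix.gnu.org/file/{file}/{algo}/{nix_base32(digest)}",
        f"https://tarballs.nixos.org/{algo}/{nix_base32(digest)}",
        "https://archive.softwareheritage.org/api/1/content/"
        f"{algo}:{digest.hex()}/raw/",
    ]


if __name__ == "__main__":
    digest = bytes.fromhex(
        "9e52f8578d891beaef25730a92a6e723596ddbd07bfe0d2a56486fcf63a0b983"
    )
    for url in content_addressed_urls("asciidoc-8.6.10.tar.gz", "sha256", digest):
        print(url)
```

Note that none of these URLs depends on the upstream “url” field: only
the file name, the algorithm and the hash are used.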

> In a way, as Maxime was saying, the URL/URI is just a hint; what
> matters is the content hash that appears in the origin.

However, from my understanding, it is incorrect to speak of content
addressing when the source (fixed-output derivation) does not exist,
for whatever reason, on any substitute server.

The URL/URI is not “just a hint”.  It *is* the location from where the
data are fetched.  And it is not content-addressed.  If I am incorrect,
could you please explain?

Please note that if even one source is missing, then the whole castle
falls down.  In other words, robustness means hunting down the corner
cases. :-)

If I want to time-machine to d63ee94d63c667e0c63651d6b775460f4c67497d
from Sat Jan 4 2020, and need Git, then it fails because:

    sha256 hash mismatch for /gnu/store/n1k6jppyasn20zr6m8sfyv5ll07ibyf1-asciidoc-8.6.10.tar.gz:
      expected hash: 10xrl1iwyvs8aqm0vzkvs3dnsn93wyk942kk4ppyl6w9imbzhlly
      actual hash:   1sh341j7ripkdb2wn6yf3rciln8ll89351b3d55gpkj89wypkmi2

Game over. )-:
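
This failure can be pictured with a small Python sketch (not Guix’s
actual download code).  Every candidate location is only a way to
obtain bytes, and the bytes are accepted only if their SHA256 matches
the hash recorded in the origin.  The names `fetch_verified` and
`HashMismatch` are hypothetical, and the fetchers below are fakes that
do no network access.

```python
import hashlib
from typing import Callable, Iterable, Optional


class HashMismatch(Exception):
    """Raised when no candidate location yields the expected content."""


def fetch_verified(
    fetchers: Iterable[Callable[[], Optional[bytes]]],
    expected_sha256: str,
) -> bytes:
    """Return the first fetched content whose SHA256 matches.

    Each fetcher returns bytes, or None when the location is unavailable.
    """
    seen = []
    for fetch in fetchers:
        data = fetch()
        if data is None:
            continue  # location down or content missing: try the next one
        actual = hashlib.sha256(data).hexdigest()
        if actual == expected_sha256:
            return data
        seen.append(actual)
    raise HashMismatch(f"expected {expected_sha256}, got {seen or 'nothing'}")


if __name__ == "__main__":
    original = b"tarball as released in 2020"
    expected = hashlib.sha256(original).hexdigest()

    def mirror_down():
        return None

    def upstream_url():
        return b"tarball regenerated in place by upstream"

    def archive_copy():
        return original

    # With a content-addressed copy somewhere, the changed URL does not matter.
    assert fetch_verified([mirror_down, upstream_url, archive_copy], expected) == original

    # Without it, only the changed URL is left.
    try:
        fetch_verified([mirror_down, upstream_url], expected)
    except HashMismatch as error:
        print("hash mismatch:", error)
```

In the second call the hash still protects integrity, but it cannot
bring the content back: the URL was the only way to locate it.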

Do we share the same understanding?

> What’s missing, both in SWH and in Guix, is the ability to store
> multiple hashes.  SWH could certainly store several hashes, computed
> using different serialization and hash algorithm combinations.

[...]

> The other option—storing multiple hashes for each origin in Guix—doesn’t
> sound practical: I can’t imagine packages storing and updating more than
> one content hash per package.  That doesn’t sound reasonable.  Plus it
> would be a long-term solution and wouldn’t help today.

Yes, the core question is where to store the database mapping these
multiple hashes.

Software Heritage (SWH) is one option, although (1) it has not been
discussed yet how the Nar hashes will be publicly exposed, if they are
at all, and (2) it is unclear whether SWH will implement a Nar -> SWHID
resolver.

On the other hand, on the Guix side, we are already building a database
mapping multiple hashes: the Disarchive database. :-)

The question with the Disarchive database is its redundancy, IMHO.
Concretely, if disarchive.guix.gnu.org is down, game over.  I wish a
long life to the Guix project :-) but it would appear more robust to me
to have a counter-measure.  The big picture is: if I publish a paper
that details some numerical processing using Guix, then having a Guix
installation at hand should be the only condition for redoing it.

Last, please note that Guix is already storing multiple hashes for some
origins.  It is the case for the ’git-fetch’ method, for example.  All
the packages using a plain Git commit hash are in effect storing two
content-addressed hashes (Git and Nar).

If one needs examples of how badly upstream can manage their mutable
Git tags, here are two recent cases:

        bug#66015: Removal of python-pyxel
        Simon Tournier <zimon.toutoune@gmail.com>
        Fri, 15 Sep 2023 21:09:59 +0200
        id:874jjv9rso.fsf@gmail.com
        https://issues.guix.gnu.org/66015
        https://issues.guix.gnu.org/msgid/874jjv9rso.fsf@gmail.com
        https://yhetil.org/guix/874jjv9rso.fsf@gmail.com

and

        [bug#66013] [PATCH 0/4] gnu: bap, python-glcontext: Fix hash and update.
        Simon Tournier <zimon.toutoune@gmail.com>
        Fri, 15 Sep 2023 20:38:34 +0200
        id:cover.1694800551.git.zimon.toutoune@gmail.com
        https://issues.guix.gnu.org/66013
        https://issues.guix.gnu.org/msgid/cover.1694800551.git.zimon.toutoune@gmail.com
        https://yhetil.org/guix/cover.1694800551.git.zimon.toutoune@gmail.com

All in all, I think we will have more robustness if the Guix I am
running implements on its own some built-in features for content
addressing instead of relying on external databases.  It is not clear
to me how exactly, hence the discussion. :-)

Another angle on the problem of multiple hashes is the use of IPFS,
GNUnet and friends.

    ( I leave aside long-term vs today because the time-frame I am
 interested in is: “guarantees” that I will be able to redo, 3 years
 later, what I will be doing in the very near future.  And now I am
 trying to redo something from 3 years back to spot the potential
 problems and fix them or improve.  I do not really care about redoing
 Guix as it was 3 years ago because almost no one published papers using
 Guix 3 years ago. ;-) Guix is becoming popular in scientific contexts,
 yeah! so my interest in this robustness is for when Guix will be just a
 bit more popular. )

Cheers,
simon
