From: Maxime Devos <maximedevos@telenet.be>
To: Simon Tournier <zimon.toutoune@gmail.com>,
	Guix Devel <guix-devel@gnu.org>
Subject: Re: intrinsic vs extrinsic identifier: toward more robustness?
Date: Sat, 4 Mar 2023 01:08:08 +0100
Message-ID: <09d3d861-0390-3ce6-30c7-22a1e2685787@telenet.be>
In-Reply-To: <87jzzxd7z8.fsf@gmail.com>


Op 03-03-2023 om 19:07 schreef Simon Tournier:
> Hi,
> 
> I would like to open a discussion about how we identify the source
> origin (fixed output).  It is of vitally importance for being robust on
> the long-term (say 3-5 years).  It matters in Reproducible Research
> context, but not only.
> 
> # First thing first
> ===================
> 
> ## What is an intrinsic identifier or an extrinsic one?
> =======================================================
> 
>   - extrinsic: use a register to keep the correspondence between the
>     identifier and the object; say label version as Git tag.
> 
>   - intrinsic: intimately bound to the designated object itself; say hash
>     as Git blob or tree and at some extent commit.
> 
> [... some reordering for convenience of replying ...]
> 
> Please note that the identification and the integrity is not the same.
> Since intrinsic identifier often uses cryptographic hash functions and
> integrity too, it is often confusing.

To my understanding, there is only one 'real' identifier in Guix: the 
(sha256 (base32 ...)) (*).  Other identifiers, like the URL in 
url-fetch and git-fetch, are just hints on where to find the object -- 
very important hints, without which finding the object is much more 
likely to fail, but hints nonetheless.

While identification and integrity might be different concepts, 
content-based identifiers like (sha256 (base32 ...)) accomplish both at 
the same time.

(*) FWIW, I would like to point out that Guix theoretically supports 
some other hashes as well, though they aren't used for any in-tree packages.
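
To illustrate how a content-based identifier does both jobs at once,
here is a minimal sketch of a content-addressed store.  It is not how
Guix or its content-addressed mirrors are implemented; 'store_put' and
'store_get' are hypothetical helper names, and only sha256sum from
coreutils is assumed.

```shell
#!/bin/sh
# Sketch of a content-addressed store: the SHA-256 digest is both the
# lookup key (identification) and the checksum (integrity).
# store_put and store_get are hypothetical helpers, not Guix commands.

# Print the hexadecimal SHA-256 digest of a file.
digest_of() {
    sha256sum "$1" | cut -d ' ' -f 1
}

# store_put FILE STORE: copy FILE into STORE under its own digest and
# print that digest.
store_put() {
    file=$1; store=$2
    id=$(digest_of "$file")
    mkdir -p "$store"
    cp "$file" "$store/$id"
    printf '%s\n' "$id"
}

# store_get ID STORE: print the object named ID, but only if its content
# still hashes to ID.
store_get() {
    id=$1; store=$2
    [ -f "$store/$id" ] || return 1
    [ "$(digest_of "$store/$id")" = "$id" ] || return 1
    cat "$store/$id"
}
```

Whoever holds the digest can both ask for the object and check what
comes back, without having to trust the place it was fetched from.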

> The register must be a trusted authority and it resolves by mapping the
> key identifier to the object.  Having the object at hand does not give
> any clue about the key identifier.  And collisions are very frequent;
> two key identifiers resolve to the same content – hopefully! we call
> that mirrors. ;-)

I first thought you were writing about 'extrinsic -> intrinsic (e.g. 
hash-based)' registers, so I was confused by your comment about 
collisions -- to my understanding, no SHA-256 collisions are known. 
Going by your comment about mirrors, I think you meant an 'intrinsic -> 
extrinsic' map instead, e.g. 'sha256 -> a bunch of appropriate URLs'.

> Intrinsic identifier also relies on a (trusted) map but collisions are
> avoided as much as possible.  Somehow it strongly reduces the power of
> the authority and it is often more robust.

Who is 'the authority' here, how does the absence of collisions reduce 
the power of the authority, and what is your point about reducing the 
power of the authority?  I was thinking of ‘the authority = the Guix 
package definition’, but then only the 'more robust' part of your 
conclusion makes sense to me.  Also, as you used 'but' instead of 'and', 
it appears you consider relying on a trusted map to be a bad thing, 
whereas to me it is just basic security and patch review.

> Whatever the intrinsic identifier we consider – even ones based on very
> weak cryptographic hash function as MD5, or based on non-crytographic
> hash function as Pearson hashing, etc. – the integrity check is
> currently done by SHA256.

How about using the hash of the integrity check as the intrinsic 
identifier, as is currently done?  We hash the source with sha256 for 
the integrity check anyway, so we might as well reuse it.

> ## For example, consider this source origin,
> ==============
> 
>      (source (origin
>                (method url-fetch)
>                (uri (string-append "mirror://gnu/hello/hello-" version
>                                    ".tar.gz"))
>                (sha256
>                 (base32
>                  "086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd"))))
> 
> where ’mirror://gnu’ is resolved by Guix itself.  Or this one,
> 
>      (source
>       (origin
>         (method git-fetch)
>         (uri (git-reference
>               (url "https://github.com/FluxML/Zygote.jl")
>               (commit (string-append "v" version))))
>         (file-name (git-file-name name version))
>         (sha256
>          (base32 "02bgj6m1j25sm3pa5sgmds706qpxk1qsbm0s2j3rjlrz9xn7glgk"))))
> 
> where Guix clones then checks out at the specification of the field
> ’commit’.
> 
> Here both are extrinsic identifiers.  For the first example, the register
> is defined by ’%mirrors’.  For the second example, the register is the
> folder ’.git/’.
> 
> Intrinsic identifier could be plain hash or hashed serialized data.
> Using Guix b8f6ead:
> 
> --8<---------------cut here---------------start------------->8---
> $ guix hash -S none -H sha256 -f nix-base32 -x $(guix build hello -S)
> 086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd
> 
> $ guix hash -S git -H sha256 -f nix-base32 -x $(guix build hello -S)
> 11kaw6m19rdj3d55y4cygk6k9zv6sn2iz4gpimx0j99ps87ij29l
> 
> $ guix hash -S nar -H sha256 -f nix-base32 -x /gnu/store/3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz
> 1lvqpbk2k1sb39z8jfxixf7p7v8sj4z6mmpa44nnmff3w1y6h8lh
> --8<---------------cut here---------------end--------------->8---
> 
> Or some Git-like tree md5 of the decompressed data, e.g.,
> 
> --8<---------------cut here---------------start------------->8---
> $ guix hash -S git -H md5 -f hex -x hello-2.12.1
> 3db60bcfecf17a5dd81e3fb5bfb1c191
> --8<---------------cut here---------------end--------------->8---
> 
> Or some others.
> 
> --8<---------------cut here---------------start------------->8---
> $ git clone https://github.com/FluxML/Zygote.jl
> $ git -C Zygote.jl checkout v0.6.41
> 
> $ guix hash -S nar -H sha256 -f nix-base32 -x Zygote.jl
> 02bgj6m1j25sm3pa5sgmds706qpxk1qsbm0s2j3rjlrz9xn7glgk
> 
> $ guix hash -S git -H sha1 -f hex -x Zygote.jl
> 3cfdb31b517eec4173584fba2b1aa65daad46e09
> --8<---------------cut here---------------end--------------->8---
> 
> 
> # Second thing second
> =====================
> 
> All that’s said, Guix uses extrinsic identifiers for almost all origins,
> if not all.  Even for ’git-fetch’ method.
For git-fetch, the value of the 'commit' field is intrinsic (except when 
it's a tag instead).

> Consider that GitHub disappears and the default build farms ci.guix and
> bordeaux.guix are unreachable for whatever reason.  Then Guix will
> fallback to Software Heritage and will exploits its resolver.
> 
> --8<---------------cut here---------------start------------->8---
> Initialized empty Git repository in /gnu/store/ns1f3b4wm5n470bczd2k5li6xpgbqkz7-julia-zygote-0.6.41-checkout/.git/
> fatal: unable to access 'https://github.com/FluxML/Zygote.jl/': Could not resolve host: github.com
> Failed to do a shallow fetch; retrying a full fetch...
> fatal: unable to access 'https://github.com/FluxML/Zygote.jl/': Could not resolve host: github.com
> git-fetch: '/gnu/store/55ba5ragbd5sd4r45n0q24vrxx9rigrm-git-minimal-2.39.1/bin/git fetch origin' failed with exit code 128
> Trying content-addressed mirror at berlin.guix.gnu.org...
> Trying content-addressed mirror at berlin.guix.gnu.org...
> Trying to download from Software Heritage...
> SWH: found revision 4777767737b4c95d2cea842933c5b2edae2771b2 with directory at 'https://archive.softwareheritage.org/api/1/directory/3cfdb31b517eec4173584fba2b1aa65daad46e09/'
> swh:1:dir:3cfdb31b517eec4173584fba2b1aa65daad46e09/
> --8<---------------cut here---------------end--------------->8---
> 
> That’s SWH which finds the revision
> 4777767737b4c95d2cea842933c5b2edae2771b2 from the contextual information
> URL + label version and from this revision SWH associates the content
> having the intrinsic identifier
> swh:1:dir:3cfdb31b517eec4173584fba2b1aa65daad46e09.
> 
> 
> ## First, please note that the SWHID is just Git,
> ========
> 
> --8<---------------cut here---------------start------------->8---
> guix hash -S git -H sha1 -f hex \
>       /gnu/store/ns1f3b4wm5n470bczd2k5li6xpgbqkz7-julia-zygote-0.6.41-checkout
> 3cfdb31b517eec4173584fba2b1aa65daad46e09
> --8<---------------cut here---------------end--------------->8---
> 
> Other said, SWH information is somehow the same information as the one
> of Git objects.  Specifically, from the Git checkout,
> 
> --8<---------------cut here---------------start------------->8---
> $ git cat-file -p v0.6.41
> object 4777767737b4c95d2cea842933c5b2edae2771b2
> type commit
> tag v0.6.41
> 
> $ git cat-file -p 4777767737b4c95d2cea842933c5b2edae2771b2
> tree 3cfdb31b517eec4173584fba2b1aa65daad46e09
> --8<---------------cut here---------------end--------------->8---
> 
> 
> ## Second, SWH acts as a resolver here, i.e.,
> =========
> 
>       (find (lambda (branch)
>               (or
>                ;; Git specific.
>                (string=? (string-append "refs/tags/" tag)
>                          (branch-name branch))
>                ;; Hg specific.
>                (string=? tag
>                          (branch-name branch))))
>             (snapshot-branches snapshot))
> 
> and this is not robust.  For one, it fails for Git lightweight tag as
> exposed with the package ’open-zwave’ tag 1.6.
> 
> --8<---------------cut here---------------start------------->8---
> $ for t in $(git tag); do printf "$t "; git cat-file -t $t ;done
> Rel-1.0 commit
> V1.5 tag
> v1.2 commit
> v1.3 tag
> v1.4 tag
> v1.6 commit
> --8<---------------cut here---------------end--------------->8---
> 
> It means that the code above would be able to find V1.5 or v1.4 but not
> v1.6 or v1.2.  Well, we can consider that as a bug and improve the
> snapshot machinery for also collecting more ’refs’.  But, for two…
> 
> …the current code (guix swh) does not deal with several snapshots and
> only consider the latest one.  Therefore, it fails for some in-place
> replacements – upstream tags a specific revision then later removes it
> and upstream re-use the same tag label for another revision booo!, if
> SWH ingests after the first tag, SWH creates one snapshot, then if SWH
> ingests again after the second re-tag, SWH creates another snapshot.

This can be solved by placing the actual commit in the 'commit' field of 
git-reference instead of the tag name; then things are completely 
unambiguous.  This and its opposite were discussed in ‘On raw strings 
in <origin> commit field’ (*), IIRC.

(*) Also maybe that thread about tricking peer review.

I didn't understand the position that the commit field should contain 
the (indirect, fragile) tag instead of the (direct, robust) commit, but 
those differences could be sidestepped by having both a 'tag' field and 
a 'commit' field, IIUC.

The 'commit' field would be used for downloading the source code, and 
the 'tag' field would be used by a not-yet-existing linter that would 
check whether the (immutable) commit matches the current value (varying 
over time) of the tag.
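
A rough sketch of what such a check could do, assuming it is fed the
output of 'git ls-remote --tags URL' on standard input.  This is not an
existing Guix linter; 'resolve_tag' and 'check_pin' are hypothetical
names.  For annotated tags, ls-remote prints an extra peeled
"refs/tags/NAME^{}" line carrying the commit, which is preferred here.

```shell
#!/bin/sh
# Hypothetical tag-vs-commit check; reads `git ls-remote --tags` output
# (lines of the form "<hash><TAB>refs/tags/<name>") on stdin.

# resolve_tag TAG: print the commit TAG points to, preferring the peeled
# "^{}" entry that ls-remote emits for annotated tags.
resolve_tag() {
    awk -v ref="refs/tags/$1" '
        $2 == ref          { plain = $1 }
        $2 == (ref "^{}")  { peeled = $1 }
        END {
            if (peeled != "")      print peeled
            else if (plain != "")  print plain
            else                   exit 1
        }'
}

# check_pin TAG COMMIT: succeed only if TAG currently resolves to COMMIT.
check_pin() {
    tag=$1; pinned=$2
    current=$(resolve_tag "$tag") || {
        echo "tag $tag: not found upstream"
        return 2
    }
    if [ "$current" = "$pinned" ]; then
        echo "tag $tag: still points to $pinned"
    else
        echo "tag $tag: moved from $pinned to $current"
        return 1
    fi
}
```

With the Zygote.jl example quoted above, it could be invoked as:

    git ls-remote --tags https://github.com/FluxML/Zygote.jl \
      | check_pin v0.6.41 4777767737b4c95d2cea842933c5b2edae2771b2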

> 
> ## Third, Disarchive is helping.
> ========
> 
> Aside adding a layer to maintain does not help when speaking about
> long-term (3-5 years), well, the reduction of layers is often better for
> long-term.  That’s said, there is a work in progress to have Disarchive
> features directly from SWH.
> 
> What does Disarchive do?  It maps various intrinsic identifiers.
> 
> Remember ’hello’ from above?
> 
> --8<---------------cut here---------------start------------->8---
> $ guix shell disarchive guile-lzma guile
> $ disarchive disassemble hello-2.12.1
> (disarchive
>    (version 0)
>    (directory-ref
>      (version 0)
>      (name "hello-2.12.1")
>      (addresses
>        (swhid "swh:1:dir:ad5fc7c3062e8426b7936588e7a27d51ace0e508"))
>      (digest
>        (sha256
>          "cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4"))))
> 
> $ guix hash -S git -H sha1 -f hex hello-2.12.1
> ad5fc7c3062e8426b7936588e7a27d51ace0e508
> $ guix hash -S git -H sha256 -f hex hello-2.12.1
> cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4
> --8<---------------cut here---------------end--------------->8---
> 
> Well, the fixed-outputs is a compressed tarball, it reads,
> 
> --8<---------------cut here---------------start------------->8---
> $ disarchive disassemble $(guix build -S hello)
> (disarchive
>    (gzip-member
>      (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz")
>      (digest
>        (sha256
>          "8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20"))
>      (header (mtime 0) (extra-flags 2) (os 3))
>      (footer (crc 2707092614) (isize 4945920))
>      (compressor gnu-best-rsync)
>      (input (tarball
>               (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar")
>               (digest
>                 (sha256
>                   "a2c33fd13c555015433956bcf06609293a34ce5c5e6a2070990bfb86070dc554"))
> [...]
> 
>      (input (directory-ref
>               (version 0)
>               (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1")
>               (addresses
>                 (swhid "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
>               (digest
>                 (sha256
>                   "1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0"))))))))
> --8<---------------cut here---------------end--------------->8---
> 
> where the values are just (considering that ’guix hash -S none -H sha256
> -f hex’ is equivalent to ’sha256sum’)
> 
> --8<---------------cut here---------------start------------->8---
> $ guix hash -S none -H sha256 -f hex $(guix build hello -S)
> 8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20
> $ gzip -d $(guix build -S hello) -c | sha256sum
> a2c33fd13c555015433956bcf06609293a34ce5c5e6a2070990bfb86070dc554  -
> --8<---------------cut here---------------end--------------->8---
> 
> However the fields ’swhid’ and the other SHA256 ’digest’ are different
> from above.  That’s because the dots [...] part.  It probably comes from
> the normalization process. Well, I am not sure to deeply understand why
> it is different but that’s another story. :-)

The reason for the normalisation was something about SWH only providing 
tarballs whose contents are equal to the ingested tarball; the tarballs 
are not bit-for-bit identical to the ingested tarball.  But Guix needs 
bit-for-bit identical tarballs, so Disarchive records the information 
that was stripped out by SWH, to complement the tarballs provided by 
SWH.
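
The difference between 'same contents' and 'bit-for-bit identical' can
be seen with gzip alone.  The sketch below (an illustration only, it
uses neither Disarchive nor SWH) compresses the same data with two
compression levels: the compressed files will typically hash
differently, while the decompressed streams hash the same.  That is why
details such as the 'compressor', 'header' and 'footer' fields in the
Disarchive output quoted above need to be recorded somewhere.

```shell
#!/bin/sh
# Same payload, two compression settings.
payload=$(mktemp)
seq 1 20000 > "$payload"          # any compressible data will do
gzip -n -1 -c "$payload" > "$payload.fast.gz"
gzip -n -9 -c "$payload" > "$payload.best.gz"

# Digests of the compressed files: this is the kind of hash a
# fixed-output sha256 covers, and the two usually differ.
sha256sum "$payload.fast.gz" "$payload.best.gz"

# Digests of the decompressed streams: identical.
content_fast=$(gzip -dc "$payload.fast.gz" | sha256sum | cut -d ' ' -f 1)
content_best=$(gzip -dc "$payload.best.gz" | sha256sum | cut -d ' ' -f 1)
[ "$content_fast" = "$content_best" ] && echo "same content"
```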

> ## Fourth, it misses a bridge using NAR normalization (serialization).
> =========
> 
> Disarchive can (or could) provides a bridge (map) between SWHID+SHA1 and
> NAR+SHA256.  But it could be nice if it was implemented in SWH
> directly.  It would ease previous drawbacks.
> 
> For the interested reader, discussion there
> <https://gitlab.softwareheritage.org/swh/meta/-/issues/4538>.  Moreover,
> <https://gitlab.softwareheritage.org/swh/meta/-/issues/4538#note_121067>
> provides simple examples about NAR and how to implement it using Python.

I think nar stuff should be kept outside SWH.  It doesn't seem scalable 
to me for SWH to support the format of every distribution.  Likewise, I 
think that SWH identifiers should _not_ become an intrinsic identifier 
that is recorded in package definitions -- if there are other archives 
that are somewhat SWH-like archives, then Guix should support them too 
even if they don't use SWH identifiers for whatever reason, and 
including the identifier of every single archive seems unscalable to me.

I believe I have a solution to the ‘everyone uses different 
identifiers, how to map between them’ problem, but it will take some 
paragraphs:

At some point in the past, when thinking about downloading source code 
over GNUnet File-sharing (FS), I had the problem that Guix and GNUnet 
use different intrinsic identifiers -- Guix uses the NAR hash for 
querying substitute servers, whereas FS has a system of its own that's 
more convenient for P2P file-sharing stuff.

The problem then was to somehow map the NAR hash to the FS identifier.
I couldn't do this the Disarchive way, because the point was to be _P2P_ 
and Disarchive ... isn't.

A straightforward solution would be to just replace the https:// by 
gnunet:// in the origin (like in https://issues.guix.gnu.org/44199, 
except that patch doesn't support fallbacks to other URLs like url-fetch 
does).

The problem was that people demanded that gnunet:// should only be 
supported once there is actually source code on GNUnet and GNUnet is 
stable, but why would people put source code on GNUnet when no 
distribution supports it and how would GNUnet become stable without any 
users?

To work around these circular demands, I started 'rehash':
<https://lists.gnu.org/archive/html/guix-patches/2021-01/msg01067.html>
(current location: https://notabug.org/maximed/rehash).  It is a (P2P!) 
GNUnet service that maintains a 'SHA512<->GNUnet FS URI' mapping, or 
more generally, a 'this hash type<->that hash type' mapping.

(It is just a service on top of the DHT, so the same could easily be 
done for BitTorrent or IPFS.)
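
To make the 'this hash type<->that hash type' idea concrete, here is a
toy, purely local sketch: a flat text file stands in for the DHT, and
SHA-256 and SHA-512 stand in for the two identifier schemes (chosen only
because both tools are widely available).  It does not use rehash or
GNUnet; 'map_add' and 'map_lookup' are hypothetical names.

```shell
#!/bin/sh
# Toy hash-to-hash mapping table; each line is "sha256:<hex> sha512:<hex>".

# map_add FILE TABLE: record both identifiers of FILE in TABLE.
map_add() {
    a=$(sha256sum "$1" | cut -d ' ' -f 1)
    b=$(sha512sum "$1" | cut -d ' ' -f 1)
    printf 'sha256:%s sha512:%s\n' "$a" "$b" >> "$2"
}

# map_lookup ID TABLE: print every identifier recorded alongside ID, in
# either direction; fail if ID is unknown.
map_lookup() {
    awk -v id="$1" '
        $1 == id { print $2; found = 1 }
        $2 == id { print $1; found = 1 }
        END { exit !found }' "$2"
}
```

A client that starts from a hash it already trusts can re-hash whatever
it eventually downloads, so a bogus mapping wastes bandwidth but should
not get wrong content accepted.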

It's rather incomplete at the moment: there is no verification or 
reputation mechanism at all, so the network could be flooded with bogus 
mappings; mappings are only kept in the DHT, not stored on disk, so they 
are lost on reboot; and the POC Guix integration is a bit limited.  But 
the basics are there -- the POC successfully downloaded a substitute 
over GNUnet _without_ having to include the FS URI in the narinfo (*)!

I'm writing about substitutes here, but the exact same approach could be 
used for plain source code.

(*) I might have misremembered; I can't find the POC on 
issues.guix.gnu.org again, and I'm not sure if the POC used rehash or if 
it just included the FS URI in the narinfo.

(TBC, I haven't been working on rehash lately, but rather on 
Scheme-GNUnet: a Scheme port of the GNUnet libraries that's less limited 
than Guile-GNUnet.  The idea is to make GNUnet-FS and rehash more 
convenient to use from Scheme, and in particular, in Guix.)

> # Discussion asking for comments and feedback
> =============================================
> 
> Still there?  If yes, thanks for reading. :-)
> 
> As shown in,
> 
> 1: <https://lists.gnu.org/archive/html/guix-devel/2023-02/msg00398.html>
> 2: <https://lists.gnu.org/archive/html/guix-devel/2023-03/msg00007.html>
> 
> we have holes and we are not currently robust for long-term (3-5 years)
> if our lovely build-farms are down for whatever reasons.
> 
> For sure, we have to fix the holes and bugs. :-)  However, I am asking
> what we could add for having more robustness on the long term.

I recommend P2P systems as an additional archival method to complement 
SWH -- while less reliable than SWH, they are not a SPOF, whereas SWH is 
a huge one.

More concretely, 'guix perform-download' could automatically insert the 
tarball into the local peer (GNUnet, IPFS, whatever) (to make it 
available to P2P if it was only available from non-P2P http and the like 
previously) and 'guix perform-download' could support 'ipfs://', 
'gnunet://' ... URIs.

(Remember, you can put _multiple_ URLs in an 'origin', as mirrors / 
fallbacks / ...!)
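
As a sketch of the fallback idea (not how 'guix perform-download' is
implemented), the loop below tries a list of locations in order and
keeps the first download whose SHA-256 matches.  'fetch_to' and
'fetch_first' are hypothetical names; 'fetch_to' assumes curl and would
have to be replaced by whatever tool understands 'ipfs://',
'gnunet://', etc.

```shell
#!/bin/sh
# fetch_to URL DEST: download one URL (assumption: curl is available;
# swap in any tool that handles the URI schemes you need).
fetch_to() {
    curl -fsSL -o "$2" "$1"
}

# fetch_first EXPECTED_SHA256 DEST URL...: try each URL in order and keep
# the first download whose SHA-256 digest matches EXPECTED_SHA256.
fetch_first() {
    expected=$1; dest=$2; shift 2
    for url in "$@"; do
        if fetch_to "$url" "$dest.part" 2>/dev/null &&
           [ "$(sha256sum "$dest.part" | cut -d ' ' -f 1)" = "$expected" ]
        then
            mv "$dest.part" "$dest"
            echo "fetched from $url"
            return 0
        fi
        rm -f "$dest.part"
    done
    echo "all locations failed" >&2
    return 1
}
```

Because the hash is checked after every attempt, the order and
trustworthiness of the locations only affect availability and speed, not
what ends up being accepted.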

git-fetch and friends would be trickier, but I assume something could be 
worked out (at worst Guix could just insert the nar into the peer and 
let the P2P URI just be a reference to the nar).

There is also the option of 'more mirrors': it should be possible to 
adjust the downloading code to look at the Debian archives, say.  (I had 
some success with finding 'disappeared sources' in other distributions 
in the past.)

> It is not affordable, neither wanted, to switch from the current
> extrinsic identification to a complete intrinsic one.  Although it would
> fix many issues. ;-)

How about something in between: include both an intrinsic identifier 
(the sha256 hash) and an extrinsic identifier (the URLs to locate the 
object at), as in the status quo?

In addition, P2P identifiers could be added -- e.g. ipfs:// URIs could 
be added for url-fetch -- multiple URLs are allowed!  These additions 
could be automated with some script (go over the package origins one by 
one: download the source, compute the P2P-network identifier, add that 
identifier to the origin).

> Guix and ’guix time-machine’ provides all the machinery for being able
> to redeploy later but as I have tried to point in the two links above
> [1,2], we are lacking tools for retrieving contents; well having the
> machinery does not mean that such machinery works well or is robust. :-)
> 
> The discussion could also fit how to distribute using ERIS.

ERIS is not a method on its own; you need to combine it with a P2P 
network that uses ERIS.  I do not understand the special focus on ERIS.

There are also various missing bits with ERIS currently
(see various comments at <https://issues.guix.gnu.org/52555#18>).
As such, I propose to use the standard encodings used by current P2P FS 
networks instead -- if an automated script is used as proposed above, 
going for standard P2P first doesn't inhibit Guix from switching to ERIS 
in the future.

> At some point, I was thinking to have something like “guix freeze -m
> manifest.scm” returning a map of all the sources from the deep bootstrap
> to the leaf packages described in manifest.scm.  However, maybe
> something is poor in the metadata we collect at package time.

That sounds like 'guix build --sources=transitive' to me, except for 
being even more transitive.  I propose making this an additional option 
for the --sources argument instead.

> For instance, the substitutions work more or less using intrinsic
> identifier so it helps, I guess. :-)
> 
> Well, we could imagine the addition of another option field, say under
> ’properties’, that could store the intrinsic identifier of the
> fixed-outputs such as SWHID or Git tree / commit hash or else.  It would
> add robustness for later.
> 
> Or maybe an optional field of the ’origin’ record for the same purpose.

It needs to be in the 'origin' record, not the 'package' record.  The 
fetchers (url-fetch, git-fetch, ...) only have access to the origin 
stuff, and origins can exist outside the context of a package.

Greetings,
Maxime.
