From: Simon Tournier <zimon.toutoune@gmail.com>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: Guix Devel <guix-devel@gnu.org>
Subject: content-address hint? (was Re: intrinsic vs extrinsic identifier: toward more robustness?)
Date: Wed, 04 Oct 2023 10:52:58 +0200
Message-ID: <87a5sy3h1x.fsf@gmail.com>
In-Reply-To: <87a60cbnf7.fsf@gnu.org>

Hi Ludo,

On Thu, 16 Mar 2023 at 18:45, Ludovic Courtès <ludo@gnu.org> wrote:

> Thanks for starting this discussion!

I feel this discussion is still pending, so I am resuming. :-)

If context is missing, the thread starts here.

        intrinsic vs extrinsic identifier: toward more robustness?
        Simon Tournier <zimon.toutoune@gmail.com>
        Fri, 03 Mar 2023 19:07:23 +0100
        id:87jzzxd7z8.fsf@gmail.com
        https://lists.gnu.org/archive/html/guix-devel/2023-03
        https://yhetil.org/guix/87jzzxd7z8.fsf@gmail.com


> Sources (fixed-output derivations) are already content-addressed, by
> definition (I prefer “content addressing” over “intrinsic
> identification” because that’s a more widely recognized term).

From my understanding, this is correct only when the sources live in the
Guix project infrastructure.  I agree that if the source is
substitutable (= the source exists on one of the substitute servers,
i.e., Guix project servers), then the fixed-output derivation is
content-addressed.

For instance, let us consider this fixed-output derivation:

--8<---------------cut here---------------start------------->8---
Derive
([("out","/gnu/store/n1k6jppyasn20zr6m8sfyv5ll07ibyf1-asciidoc-8.6.10.tar.gz","sha256","9e52f8578d891beaef25730a92a6e723596ddbd07bfe0d2a56486fcf63a0b983")]
 ,[]
 ,["/gnu/store/5iw2ivjw5njyyvi7avyphfcibgbqdbsc-mirrors","/gnu/store/vwyxp1dq4lb97n6b20w5cqxasy2dai79-content-addressed-mirrors"]
 ,"x86_64-linux","builtin:download",[]
 ,[("content-addressed-mirrors","/gnu/store/vwyxp1dq4lb97n6b20w5cqxasy2dai79-content-addressed-mirrors")
   ,("impureEnvVars","http_proxy https_proxy LC_ALL LC_MESSAGES LANG COLUMNS")
   ,("mirrors","/gnu/store/5iw2ivjw5njyyvi7avyphfcibgbqdbsc-mirrors")
   ,("out","/gnu/store/n1k6jppyasn20zr6m8sfyv5ll07ibyf1-asciidoc-8.6.10.tar.gz")
   ,("preferLocalBuild","1")
   ,("url","\"https://github.com/asciidoc/asciidoc/archive/8.6.10.tar.gz\"")])
--8<---------------cut here---------------end--------------->8---

I agree that the “url” field is not needed as long as the content
exists on one of the servers listed in “content-addressed-mirrors”.  If
one opens that file, then the code reads:

--8<---------------cut here---------------start------------->8---
(begin
  (use-modules
   (guix base32))
  (define
    (guix-publish host)
    (lambda
        (file algo hash)
      (string-append "https://" host "/file/" file "/"
                     (symbol->string algo)
                     "/"
                     (bytevector->nix-base32-string hash))))
  (module-autoload!
   (current-module)
   (quote
    (guix base16))
   (quote
    (bytevector->base16-string)))
  (list
   (guix-publish "ci.guix.gnu.org")
   (lambda
       (file algo hash)
     (string-append "https://tarballs.nixos.org/"
                    (symbol->string algo)
                    "/"
                    (bytevector->nix-base32-string hash)))
   (lambda
       (file algo hash)
     (string-append "https://archive.softwareheritage.org/api/1/content/"
                    (symbol->string algo)
                    ":"
                    (bytevector->base16-string hash)
                    "/raw/"))))
--8<---------------cut here---------------end--------------->8---

Therefore, the look-up on these 3 servers is done by content address:
each URL is built from the hash, not from the upstream location.
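
To make that look-up concrete, here is a minimal Python sketch that
rebuilds the three candidate URLs from the Scheme code above.  It is not
Guix code: the names nix_base32 and candidate_urls are mine, and passing
the bare file name "asciidoc-8.6.10.tar.gz" as the “file” argument is an
assumption.  The encoding follows the nix-base32 scheme (5-bit groups
read from the most significant end of the little-endian digest).

```python
# Alphabet used by Nix/Guix base32 (no e, o, t, u).
NIX_BASE32_ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"


def nix_base32(digest: bytes) -> str:
    """Encode a digest in the nix-base32 representation."""
    length = (len(digest) * 8 - 1) // 5 + 1
    chars = []
    for n in range(length - 1, -1, -1):
        byte_index, bit_offset = divmod(n * 5, 8)
        value = digest[byte_index] >> bit_offset
        if byte_index + 1 < len(digest):
            value |= digest[byte_index + 1] << (8 - bit_offset)
        chars.append(NIX_BASE32_ALPHABET[value & 0x1F])
    return "".join(chars)


def candidate_urls(file_name: str, sha256_hex: str) -> list:
    """Return the URLs the three mirrors would be queried with."""
    digest = bytes.fromhex(sha256_hex)
    b32 = nix_base32(digest)
    b16 = digest.hex()
    return [
        # (guix-publish "ci.guix.gnu.org")
        f"https://ci.guix.gnu.org/file/{file_name}/sha256/{b32}",
        f"https://tarballs.nixos.org/sha256/{b32}",
        # Software Heritage expects base16 (hexadecimal), not base32.
        f"https://archive.softwareheritage.org/api/1/content/sha256:{b16}/raw/",
    ]


if __name__ == "__main__":
    # sha256 taken from the "out" field of the derivation above.
    asciidoc = "9e52f8578d891beaef25730a92a6e723596ddbd07bfe0d2a56486fcf63a0b983"
    for url in candidate_urls("asciidoc-8.6.10.tar.gz", asciidoc):
        print(url)
```

For the asciidoc derivation, nix_base32 yields
10xrl1iwyvs8aqm0vzkvs3dnsn93wyk942kk4ppyl6w9imbzhlly, i.e., the same
digest as the hexadecimal one in the derivation, only printed
differently.  None of the three URLs mentions github.com.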


> In a way, like Maxime was saying, the URL/URI is just a hint; what
> matters is the content hash that appears in the origin.

However, from my understanding, it is incorrect to speak about content
addressing when the source (fixed-output derivation) does not exist,
for whatever reason, on any substitute server.

The URL/URI is not “just a hint”.  It *is* the location from where the
data are fetched.  And it is not content-addressed.  If I am incorrect,
could you please explain?

Please note that if even one source is missing, then the whole castle
falls down.  In other words, robustness means hunting down the corner
cases. :-)

If I want to time-machine to d63ee94d63c667e0c63651d6b775460f4c67497d
from Sat Jan 4 2020, and need Git, then it fails because:

    sha256 hash mismatch for /gnu/store/n1k6jppyasn20zr6m8sfyv5ll07ibyf1-asciidoc-8.6.10.tar.gz:
      expected hash: 10xrl1iwyvs8aqm0vzkvs3dnsn93wyk942kk4ppyl6w9imbzhlly
      actual hash:   1sh341j7ripkdb2wn6yf3rciln8ll89351b3d55gpkj89wypkmi2

Game over. )-:
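
The failure mode can be sketched in a few lines of Python.  This is not
how Guix implements its downloader; fetch_verified and the injected
“fetch” callable are hypothetical names, used only to show that the hash
can reject wrong content but cannot locate the right content once every
known location is gone or has changed.

```python
import hashlib


def fetch_verified(urls, expected_sha256_hex, fetch):
    """Return the first payload whose SHA-256 matches the expected one.

    `fetch` maps a URL to bytes, or to None when the location is
    unreachable or no longer serves anything.
    """
    mismatches = []
    for url in urls:
        data = fetch(url)
        if data is None:
            continue  # mirror down or content never archived there
        actual = hashlib.sha256(data).hexdigest()
        if actual == expected_sha256_hex:
            return data
        mismatches.append((url, actual))  # e.g. regenerated tarball
    raise LookupError(f"no location served the expected content: {mismatches}")


if __name__ == "__main__":
    released = b"tarball as released"
    expected = hashlib.sha256(released).hexdigest()
    # Hypothetical locations: the mirror lost the file and upstream
    # now serves different bytes under the same URL.
    locations = {
        "https://mirror.example/sha256/" + expected: None,
        "https://upstream.example/archive/1.0.tar.gz": b"tarball regenerated",
    }
    try:
        fetch_verified(locations, expected, locations.get)
    except LookupError as error:
        print("game over:", error)
```

The check is content-based, the retrieval is location-based; the second
part is where the robustness is lost.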

Do we share the same understanding?


> What’s missing, both in SWH and in Guix, is the ability to store
> multiple hashes.  SWH could certainly store several hashes, computed
> using different serialization and hash algorithm combinations.

[...]

> The other option—storing multiple hashes for each origin in Guix—doesn’t
> sound practical: I can’t imagine packages storing and updating more than
> one content hash per package.  That doesn’t sound reasonable.  Plus it
> would be a long-term solution and wouldn’t help today.

Yes, the core question is where to store the database mapping these
multiple hashes.

Software Heritage (SWH) is one option, although 1. it has not been
discussed yet how the Nar hashes would be publicly exposed, if at all,
and 2. it is not known whether SWH will implement a resolver
Nar -> SWHID.
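
To illustrate what “mapping multiple hashes” means, here is a small
Python sketch, under loud assumptions: it only covers a single file
identified by its flat SHA-256 (not a Nar hash of a directory), and the
HashIndex class is a hypothetical in-memory stand-in for whatever
database or resolver would hold the mapping.  The SWHID of a file
content is derived from its Git blob hash, so the same bytes already
have at least two unrelated-looking identifiers.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Hash content the way Git hashes a blob: header, NUL, then data."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def identifiers(content: bytes) -> dict:
    """Compute two intrinsic identifiers for the same bytes."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),
        "swhid": "swh:1:cnt:" + git_blob_sha1(content),
    }


class HashIndex:
    """Hypothetical resolver: flat SHA-256 (hex) -> SWHID."""

    def __init__(self):
        self._by_sha256 = {}

    def add(self, content: bytes) -> dict:
        ids = identifiers(content)
        self._by_sha256[ids["sha256"]] = ids["swhid"]
        return ids

    def resolve(self, sha256_hex: str):
        # None means the mapping is unknown, which is the weak point
        # of any external database.
        return self._by_sha256.get(sha256_hex)


if __name__ == "__main__":
    index = HashIndex()
    ids = index.add(b"hello\n")
    print(ids["sha256"], "->", index.resolve(ids["sha256"]))
```

Neither identifier can be computed from the other without the content
itself, hence the need to store the mapping somewhere.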

On the other hand, on the Guix side, we are already building a database
mapping multiple hashes: the Disarchive database. :-)

The question with the Disarchive database is its lack of redundancy,
IMHO.  Concretely, if disarchive.guix.gnu.org is down, game over.  I
wish a long life to the Guix project :-) but it would seem more robust
to me to have a counter-measure.  The big picture is: if I publish a
paper which details some numerical processing done using Guix, then
having a Guix installation at hand should be the only condition for
redoing it.

Last, please note that Guix is already storing multiple hashes for some
origins.  This is the case for the ‘git-fetch’ method, for example.  All
the packages using a plain Git commit hash are somehow storing two
content-addressed hashes (Git and Nar).

If one needs examples of the ugly way upstream can manage their mutable
Git tags, here are two recent cases:

        bug#66015: Removal of python-pyxel
        Simon Tournier <zimon.toutoune@gmail.com>
        Fri, 15 Sep 2023 21:09:59 +0200
        id:874jjv9rso.fsf@gmail.com
        https://issues.guix.gnu.org/66015
        https://issues.guix.gnu.org/msgid/874jjv9rso.fsf@gmail.com
        https://yhetil.org/guix/874jjv9rso.fsf@gmail.com

and

        [bug#66013] [PATCH 0/4] gnu: bap, python-glcontext: Fix hash and update.
        Simon Tournier <zimon.toutoune@gmail.com>
        Fri, 15 Sep 2023 20:38:34 +0200
        id:cover.1694800551.git.zimon.toutoune@gmail.com
        https://issues.guix.gnu.org/66013
        https://issues.guix.gnu.org/msgid/cover.1694800551.git.zimon.toutoune@gmail.com
        https://yhetil.org/guix/cover.1694800551.git.zimon.toutoune@gmail.com


All in all, I think we would gain robustness if the Guix I am running
implemented, on its own, some builtin features for content addressing
instead of relying on external databases.  It is not clear to me how
exactly, hence the discussion. :-)

Another angle on the problem of multiple hashes is the use of IPFS,
GNUnet and friends.

    ( I leave aside long-term vs today because the time-frame I am
 interested in is: “guarantees” that I will be able to redo, 3 years
 from now, what I am doing in the very near future.  And now I am trying
 to redo something from 3 years back to spot the potential problems and
 fix them or improve.  I do not really care about the state of redoing
 Guix as of 3 years ago because almost no one published papers using
 Guix 3 years ago. ;-) Guix is becoming popular in scientific contexts,
 yeah! so my interest in this robustness is for when Guix will be just a
 bit more popular. )

Cheers,
simon

