Re: intrinsic vs extrinsic identifier: toward more robustness?


From: Maxime Devos <maximedevos@telenet.be>
To: Simon Tournier <zimon.toutoune@gmail.com>,
	Guix Devel <guix-devel@gnu.org>
Subject: Re: intrinsic vs extrinsic identifier: toward more robustness?
Date: Mon, 6 Mar 2023 13:22:24 +0100
Message-ID: <842cf5e6-ff29-ec08-2e6f-01708a6316e6@telenet.be>
In-Reply-To: <86sfej0x1d.fsf@gmail.com>


On 05-03-2023 at 21:21, Simon Tournier wrote:
>>> Whatever the intrinsic identifier we consider – even ones based on very
>>> weak cryptographic hash functions such as MD5, or on non-cryptographic
>>> hash functions such as Pearson hashing, etc. – the integrity check is
>>> currently done by SHA256.
>>
>> How about using the hash of the integrity check as an intrinsic
>> identifier, as is done currently?  I mean, we hash it with sha256
>> for the integrity check anyway, so we might as well reuse it.
> 
> Maybe ask GNUnet folk to address by NAR+SHA256 instead on their
> specification. ;-)

Obviously, Guix should replace NAR+SHA256 by GNUnet FS URIs /j.

> Kidding aside, your comment raises two points of view:
> 
>   1. Guix is fetching data from elsewhere and this elsewhere is not using
>      the NAR+SHA256 intrinsic identifier.  Therefore, the question is how to
>      adapt the source origin to take this elsewhere into account?
> 
>   2. Replace the NAR+SHA256 integrity checksum by what content-addressed
>      systems use as intrinsic identifier.  IMHO, that’s a bad idea for
>      two reasons: (a) security, for instance SHA1 as used by SWH is not
>      secure and (b) it will be unmanageable in practice.

I was thinking of (1), not (2).

>>> All that’s said, Guix uses extrinsic identifiers for almost all origins,
>>> if not all.  Even for ’git-fetch’ method.
>>
>> For git-fetch, the value of the 'commit' field is intrinsic (except when
>> it's a tag instead).
> 
> No, that is imprecise.  The exception is *not* a tag label as value for the
> ’commit’ field; the exception is a Git commit hash as value.

Are you referring to the fact that currently, the 'commit' field usually 
contains a tag name, and that it containing a commit is the exception?
If so, that doesn't contradict my claim.

>> This can be solved by placing the actual commit in the 'commit' field of
>> git-reference, instead of the tag name, then things are completely
>> unambiguous -- this and its opposite were discussed in ‘On raw strings
>> in <origin> commit field’ (*), IIRC.
> 
> The thread you are referencing [1] is based on misunderstandings.  I
> would like to move forward, hence my detailed email. :-)
> 
> 1: <https://yhetil.org/guix/6e451a878b749d4afb6eede9b476e5faabb0d609.camel@gmail.com/#r>

Your email is about intrinsic identifiers and more robustness, yet it 
doesn't mention using git commits more anywhere.  As such, I do not 
follow ‘hence my detailed email’ -- it contains detail, but it misses 
some relevant detail that I pointed out in my previous response.

Also, with ‘move forward’, do you mean ‘move forward’, or ‘maintain 
status quo’?  Because given that you are replying to the proposed 
solution (that even avoids problems pointed out in those threads) by 
saying nothing of technical importance and by pointing to some 
contentious things, it really appears the latter to me.

>> (*) Also maybe that thread about tricking peer review.
>>
>> I didn't understand the position that commit field should contain the
>> (indirect, fragile) tag instead of the (direct, robust) commit, but
>> those differences could be sidestepped by having both a 'tag' field and
>> a 'commit' field, IIUC.
> 
> I would not frame it this way.  My view is not to replace something by
> something else but, instead, to add something and/or several things.

I was thinking of adding the commit (intrinsic) to the git-reference, 
instead of only having a tag (extrinsic) in the git-reference as is 
mostly done currently.
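
To make the idea concrete, here is a minimal Python sketch of a reference that records both. It is not Guix code: the `GitReference` class, its `tag` field and the `resolve_tag` callback are hypothetical names used only for illustration. The commit is what identifies the source; the tag is kept as a human-readable hint that can be checked against the commit.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GitReference:
    """Hypothetical record: an intrinsic commit plus an optional extrinsic tag."""
    url: str
    commit: str                 # intrinsic: derived from the content itself
    tag: Optional[str] = None   # extrinsic: a movable label chosen by upstream


def tag_still_matches(ref: GitReference,
                      resolve_tag: Callable[[str, str], Optional[str]]) -> bool:
    """Return True when the tag (if any) still resolves to the recorded commit.

    `resolve_tag(url, tag)` is an assumed callback returning the commit the tag
    currently points to, or None when the tag no longer exists.
    """
    if ref.tag is None:
        return True
    return resolve_tag(ref.url, ref.tag) == ref.commit
```

With such a record, a moved or deleted tag becomes a detectable inconsistency instead of a silent change of source.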

I also want to mention that, except for a general notion of 'more 
robustness' and a specific command "guix freeze -m manifest.scm" and 
such, you never mentioned what your view was, so I had to guess.

>> The problem then was to somehow map the NAR hash to the FS identifier.
> 
> Yes, that’s the problem. :-) GNUnet FS identifier is one case.  And my
> discussion here is: could we augment source origin to be able to deal
> with various identifiers?
> 
> 
>> A straightforward solution would be to just replace the https:// by
>> gnunet:// in the origin (like in https://issues.guix.gnu.org/44199,
>> except that patch doesn't support fallbacks to other URLs like url-fetch
>> does).
> 
> Somehow, your proposition would be to have a list as URI, right?
> 
>       (origin
>         (method gnunet-fetch)
>         (uri
>          (list
>            (string-append "mirror://gnu/hello/hello-" version
>                           ".tar.gz")
>            "gnunet://fs/chk/TY48PGS5RVX643NT2B7GDNFCBT4DWG692PF4YNHERR96K6MSFRZ4ZWRPQ4KVKZV29MGRZTWAMY9ETTST4B6VFM47JR2JS5PWBTPVXB0.8A9HRYABJ7HDA7B0"
>            "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
>         (file-name "gnunet-hello-2.10.tar.gz")
>         (sha256
>          (base32
>           "0ssi1wpaf7plaswqqjwigppsg5fyh99vdlb9kzl7c9lng89ndq1i")))

Yes, though in a proper version of 44199 (which doesn't exist yet) it 
would just be integrated into url-fetch instead of having a separate 
gnunet-fetch.
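
As a rough illustration of what such an integrated fetch could do, the following Python sketch tries a list of URIs in order and accepts the first payload whose SHA-256 matches. It is not how url-fetch is implemented; the function name, the `fetchers` mapping and the use of a hex digest are assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict, Iterable, Optional

# Assumed interface: a fetcher takes a URI and returns the bytes, or None on failure.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallback(uris: Iterable[str],
                        expected_sha256: str,
                        fetchers: Dict[str, Fetcher]) -> Optional[bytes]:
    """Try each URI in order; return the first payload with the expected hash."""
    for uri in uris:
        scheme = uri.partition(":")[0]      # "https", "gnunet", "swh", ...
        fetcher = fetchers.get(scheme)
        if fetcher is None:
            continue                        # no backend for this kind of identifier
        try:
            data = fetcher(uri)
        except OSError:
            continue                        # network or I/O failure: try the next URI
        # Whatever the identifier was, the integrity check stays the same.
        if data is not None and hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
    return None
```

The point of the design is that the identifiers only say where to look; a wrong or corrupted answer from any backend is rejected by the one integrity check and the next URI is tried.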

>>> It is neither affordable nor wanted to switch from the current
>>> extrinsic identification to a complete intrinsic one.  Although it would
>>> fix many issues. ;-)
>>
>> How about in-between: include both an intrinsic identifier (the
>> sha256sum) and an extrinsic identifier (the URLs to locate the object
>> at), like the status quo.
> 
> That’s what I am proposing between the lines. :-)

I recommend being explicit.

> The question is which design.  For instance, it could go under the field
> ’properties’ similarly as “upstream name” or potentially other
> “metadata”.  Or it could go under the source origin field.
> 
> Well, however, as you pointed out, going under ’properties’ would not be
> as easy.  And as you also pointed out, the integrity field could be
> something else than ’sha256’, so maybe we could have a list here.

To be clear, my comment on Guix supporting other things than sha256 was 
just a statement of fact, not a proposal to use that mechanism (and 
neither a proposal to not use that mechanism).

>>> The discussion could also fit how to distribute using ERIS.
>>
>> ERIS is not a method on its own; you need to combine it with a P2P
>> network that uses ERIS.  I do not understand the special focus on ERIS.
> 
> Yes, indeed.  However, to my knowledge, each P2P can use its own
> identifier and from my understanding, ERIS relies on whatever P2P.
> Therefore, wanting guix-daemon to be able to use ERIS somehow
> implies a discussion about the identifiers used by the P2P networks.
> 
> Am I missing something?

I don't have any issue with ERIS itself (*).  The issue I have with 
ERIS, is that it often appears to be treated as some panacea that 
transcends all P2P systems and is fundamentally different from other 
identifiers used by other P2P systems, but <https://xkcd.com/927/> 
applies here -- while it might become some universal standard, it isn't yet.

Hence, ‘I do not understand the __special__ focus on ERIS’ (emphasis 
added).  As long as the ERIS identifier is treated as one among many 
instead of somehow being considered special, it's fine to me.

(*) Besides several technical issues in its current implementation -- 
the implementation of ERIS is optimised for classical transports instead 
of P2P transports, ERIS is only implemented for IPFS currently and ERIS 
doesn't have a deduplication system for directories.  (In GNUnet and 
BitTorrent, and I think in IPFS too, if two directories 
(e.g. store items) that have a file in common were put into the P2P, 
then for the P2P's purposes these two files are the same file, so 
availability of one store item aids the availability of another store item.)
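
A toy model of that file-level deduplication, in Python. This is only a sketch: real networks split files into blocks and use their own identifiers, and the `add_directory` helper and the in-memory `store` dictionary are made up for the illustration.

```python
import hashlib
from typing import Dict


def add_directory(store: Dict[str, bytes], files: Dict[str, bytes]) -> Dict[str, str]:
    """Insert a directory (file name -> content) into a content-addressed store.

    Returns a manifest mapping each file name to the key of its content.
    Identical contents get the same key, so they are stored only once.
    """
    manifest = {}
    for name, content in files.items():
        key = hashlib.sha256(content).hexdigest()
        store.setdefault(key, content)   # no-op when the content is already there
        manifest[name] = key
    return manifest
```

If two store items both contain the same COPYING file, both manifests point to the same key, so any peer holding one item can serve that file for the other.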

>>> At some point, I was thinking to have something like “guix freeze -m
>>> manifest.scm” returning a map of all the sources from the deep bootstrap
>>> to the leaf packages described in manifest.scm.  However, maybe
>>> something is poor in the metadata we collect at package time.
>>
>> That sounds like "guix build --sources=transitive" to me, except for
>> being even more transitive.  I propose making this an additional option
>> for the --sources argument instead.
> 
> No.  “guix build --sources=transitive” returns an archive containing all
> the sources.  Instead, I would like all the various identifiers (URL,
> NAR, SWHID, GNUnet, etc.) of all the transitive sources.

I do not see how making a list of all identifiers helps with robustness 
-- you need the object the identifiers point to, not the identifier itself.

Unless the goal is to use the map of package->identifiers to determine 
which packages are currently lacking redundancy (i.e., have few 
identifiers), which to be clear seems reasonable to me.
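
For that use, the map could be processed along these lines. This is a Python sketch only; the shape of the map, the function name and the threshold of two identifiers are assumptions, since no such command exists yet.

```python
from typing import Dict, Iterable, List


def lacking_redundancy(identifiers_by_package: Dict[str, Iterable[str]],
                       minimum: int = 2) -> List[str]:
    """Return the sorted names of packages with fewer than `minimum` distinct identifiers."""
    return sorted(
        name
        for name, identifiers in identifiers_by_package.items()
        if len(set(identifiers)) < minimum
    )
```

Given a map where "hello" has a mirror URL and a SWHID while another package only has a single URL, only the latter would be reported.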

> Cheers,
> simon
> 
> PS:
> 
>>> However the fields ’swhid’ and the other SHA256 ’digest’ are different
>>> from above.  That’s because of the dots [...] part.  It probably comes from
>>> the normalization process. Well, I am not sure I deeply understand why
>>> it is different but that’s another story. :-)
>>
>> The reason for the normalisation was something about SWH only providing
>> tarballs whose contents are equal to the ingested tarball; the tarballs
>> are not bit-for-bit identical to the ingested tarball.  But Guix needs
>> bit-for-bit identical tarballs, so Disarchive contains the information
>> that was stripped out by SWH to complement the tarballs provided by
>> SWH.
> 
> SWH is not in the picture with the example I provided. :-)  Yes, the
> dots part is related to some normalization and “metadata”.

Your question was about where the differences come from.  The answer is 
‘because SWH normalisation stuff’.  As such, SWH is in the picture.

> What I do not understand is that, if the result of “guix build hello -S”
> is manually uncompressed and untarred, the content corresponds to:
> 
>      $ guix hash -S git -H sha256 -f hex hello-2.12.1
>      cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4
> 
> The tool ’disarchive’ disassembles the compressed archive; it first
> provides the hash of the compressed archive (.tar.gz), then stores
> metadata about compression level, algorithm, etc., then provides the hash
> of the uncompressed archive (.tar), then stores metadata about files and
> last it provides the hash of the tree, which reads,
> 
>      (input (directory-ref
>               (version 0)
>               (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1")
>               (addresses
>                 (swhid "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
>               (digest
>                 (sha256
>                   "1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0"))))))))
> 
> and I do not understand why it is not the same as manually computed; see
> above.   Well, that’s a detail and not relevant to the current
> discussion since it is part of how Disarchive works internally.

You are hashing the 'hello-2.12.1' directory, which is the only 
directory in the tarball.  However, while it is considered bad practice, 
a tarball can contain multiple top-level entries.  As such, you should 
consider the tarball as an encoding of a directory that happens to 
contain the 'hello-2.12.1' directory, and hash the wrapper directory 
instead of its member hello-2.12.1:

$ mkdir a
$ cd a
$ tar -xf /gnu/store/3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz
$ guix hash -Sgit -H sha256 -f hex .
1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0

Using these steps, the value in the (digest (sha256 ...)) is recovered.
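
The same wrapper-versus-member distinction can be reproduced without Guix. The Python sketch below computes a Git-style tree identifier (SHA-1, the kind of value that appears in a swh:1:dir: identifier), so its output is not the sha256 value printed by "guix hash" above; it only shows that a directory and the directory wrapping it have different identifiers. The helper names are mine, and unlike Git itself the sketch also counts empty directories.

```python
import hashlib
import os


def git_object_id(kind: str, payload: bytes) -> str:
    """SHA-1 of a Git object: header "<kind> <size>", a NUL byte, then the payload."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def blob_id(path: str) -> str:
    """Identifier of a regular file's content."""
    with open(path, "rb") as handle:
        return git_object_id("blob", handle.read())


def tree_id(directory: str) -> str:
    """Identifier of a directory, computed recursively like a Git tree."""
    entries = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            mode = b"120000"
            oid = git_object_id("blob", os.fsencode(os.readlink(path)))
        elif os.path.isdir(path):
            mode, oid = b"40000", tree_id(path)
        else:
            executable = os.lstat(path).st_mode & 0o100
            mode = b"100755" if executable else b"100644"
            oid = blob_id(path)
        raw_name = os.fsencode(name)
        # Git orders entries as if directory names ended with "/".
        sort_key = raw_name + (b"/" if mode == b"40000" else b"")
        entries.append((sort_key, mode, raw_name, bytes.fromhex(oid)))
    payload = b"".join(mode + b" " + raw_name + b"\0" + raw_oid
                       for _, mode, raw_name, raw_oid in sorted(entries))
    return git_object_id("tree", payload)
```

Calling tree_id on directory "a" from the steps above and on "a/hello-2.12.1" gives two different identifiers, because the first tree has one extra level whose single entry is the second tree.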

Greetings,
Maxime.
