On raw strings in <origin> commit field

* On raw strings in <origin> commit field
@ 2021-12-28 20:55 Liliana Marie Prikler
  2021-12-29  8:39 ` zimoun
                   ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Liliana Marie Prikler @ 2021-12-28 20:55 UTC (permalink / raw)
  To: guix-devel

Hi Guix,

when Ricardo recently added guile-aiscm to Guix, I was confused that
both the version field of the package and the commit field of the
git-reference used in its origin were raw strings.  It turns out that
this is a rare pattern, observed in fewer than 200 packages currently in
Guix.  The reason to do so (as far as I understand, and as was explained
to me on IRC) is that commit tags are in principle mutable and hence
cannot be relied on when fetching sources.  I do have a few issues with
that explanation, but before that let's take a step back and discuss the
relation of version and commit.

Consider a package being added or updated in Guix.  At the time of
commit, we have the tag v1.2.3 pointing towards commit deadbeef.  We
therefore create a guix package with version "1.2.3" pointing to said
commit (either directly or indirectly).  At this point, one of the
following holds:
  (1) Guix "1.2.3" -> upstream "v1.2.3" -> upstream "deadbeef"
  (2) Guix "1.2.3" -> upstream "deadbeef" <- upstream "v1.2.3"
From either, it follows that Guix "1.2.3" = upstream "v1.2.3".  If
upstream keeps their tags around, then both forms are equivalent, but
(1) is more convenient; it allows us to derive commit from version,
which is often done through an affine mapping.

Problems arise when upstreams move or delete tags.  At this point,
Guix packages that use them break and are no longer able to fetch their
source code.  Raw commits are in principle resilient to this kind of
denial of service; to break them, upstreams would have to actually
delete the commits themselves, and any backups such as SWH would have to
lose them too.  There is certainly an argument for robustness to be made
here, particularly concerning `guix time-machine', though as noted it
is not infallible.

It should be noted that in the case of moved or deleted tags, the
assertion Guix "1.2.3" = upstream "v1.2.3" no longer holds.  Widespread
use of this pattern under the above reasoning would imply that those
upstreams can't be trusted to have stable tags, when there are probably
few offenders in that category (considering also that Guix is not the
only tool they'd break if they did move or delete tags).  More
importantly, if we do have an untrustworthy upstream, it could be
argued that referring to some tag is as good as referring to a random
commit, and that let-bound commits and revisions ought to be used
instead.

As any good Sith would, the above talks in absolutes, or at the very
least uses default logic without considerable fallbacks.  On the note
of fallbacks, we do also have the issue that Guix fails on the first
download that does not match the hash instead of e.g. continuing to SWH
to fetch an archive of the old tag (as well as other fallback-related
issues, also including the "Tricking Peer Review" thread).  Putting
those aside for a while, there is an all but endless number of
upstreams for which we can't tell ahead of time whether they will act
nicely or not.  The status quo for most of our packages is to assume
that they do and fail loudly if they don't.  The proposed alternative
is to assume they don't and miss out on nice things if they do. 
However, even under that assumption we also miss out on ninja version
bumps and the only way of noticing other than paranoid amounts of
checking whether the tag moved would be to wait for a mail from
upstream claiming that they actually wanted us to notice the ninja
bump.

Neither of the above is really satisfactory.  At the very least, if raw
strings are to be used in the commit fields for tags that "once
existed, but maybe no longer point to that commit", I'd want a comment
like the ones I find in minetest.scm to mentally prepare me for what
I'm about to read in the rest of the package description, but I'd much
prefer using let-bound commit/revision pairs.  Perhaps we could make
revision "0" (alternatively #f if we don't want current versions to
break) special in that a git-version with it expands to just version.  
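
To make the revision idea concrete, here is a minimal sketch in Python
(not Guix code).  The first function mirrors what git-version does as
far as I know, i.e. VERSION-REVISION.COMMIT truncated to 7 characters;
the second, git_version_proposed, is a hypothetical name for the
suggested behaviour and does not exist anywhere.

```python
def git_version(version, revision, commit):
    """Mirror of the current behaviour: VERSION-REVISION.<7 chars of COMMIT>."""
    return f"{version}-{revision}.{commit[:7]}"


def git_version_proposed(version, revision, commit):
    """Hypothetical variant: revision "0" (or None, standing in for #f)
    means "this commit is the tagged release", so only VERSION is used."""
    if revision in ("0", None):
        return version
    return git_version(version, revision, commit)


commit = "c78b91edb7c17c6fbf3b294452f44e91d75e3c67"
print(git_version("0.23.1", "1", commit))           # 0.23.1-1.c78b91e
print(git_version_proposed("0.23.1", "0", commit))  # 0.23.1
```

Treating only None as special would keep existing packages that already
use revision "0" from changing their version strings.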

Long-term, we might want to support having multiple <git-references> in
git-fetch -- if the first one fails due to a hash mismatch, we would
warn about that instead of producing an error and thereafter continue
with the second, third, etc. similar to how we currently have mirror://
urls for some well-known mirrored repositories.  That way, we have a
system to warn us about naughty upstreams while also providing
robustness for the time machine.
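
A rough sketch of that fallback logic, in Python rather than Scheme and
under simplifying assumptions: the references are opaque values, the
fetch function is supplied by the caller (hypothetical here), and the
hash is a plain SHA-256 over the fetched bytes rather than over a NAR
serialization as Guix would use.

```python
import hashlib
import warnings


def fetch_with_fallback(references, expected_sha256, fetch):
    """Try each reference in order and return (reference, data) for the
    first one whose content matches EXPECTED_SHA256.

    A failed fetch or a hash mismatch only produces a warning; an error
    is raised once every reference has been tried without success.
    """
    for ref in references:
        try:
            data = fetch(ref)
        except OSError as err:
            warnings.warn(f"{ref}: fetch failed ({err}); trying next reference")
            continue
        actual = hashlib.sha256(data).hexdigest()
        if actual == expected_sha256:
            return ref, data
        warnings.warn(f"{ref}: hash mismatch (got {actual}); trying next reference")
    raise RuntimeError("no reference produced content matching the expected hash")
```

With a tag listed first and the raw commit second, a moved tag would
show up as a warning while the build still gets the expected source.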

What do y'all think?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: On raw strings in <origin> commit field
  2021-12-28 20:55 On raw strings in <origin> commit field Liliana Marie Prikler
@ 2021-12-29  8:39 ` zimoun
  2021-12-29 20:25   ` Liliana Marie Prikler
  2021-12-30  1:13 ` Mark H Weaver
  2021-12-31 17:56 ` Vagrant Cascadian
  2 siblings, 1 reply; 63+ messages in thread
From: zimoun @ 2021-12-29  8:39 UTC (permalink / raw)
  To: Liliana Marie Prikler, guix-devel

Hi,

On Tue, 28 Dec 2021 at 21:55, Liliana Marie Prikler <liliana.prikler@gmail.com> wrote:

> Consider a package being added or updated in Guix.  At the time of
> commit, we have the tag v1.2.3 pointing towards commit deadbeef.  We
> therefore create a guix package with version "1.2.3" pointing to said
> commit (either directly or indirectly).  At this point, one of the
> following holds:
>   (1) Guix "1.2.3" -> upstream "v1.2.3" -> upstream "deadbeef"
>   (2) Guix "1.2.3" -> upstream "deadbeef" <- upstream "v1.2.3"
> From either, we can follow that Guix "1.2.3" = upstream "v1.2.3".  If
> upstream keeps their tags around, then both forms are equivalent, but
> (1) is more convenient; it allows us to derive commit from version,
> which is often done through an affine mapping.

No, tags and commit hashes are not equivalent.  A commit hash is
intrinsic: it depends only on the content.  Tags, on the other hand, are
extrinsic: they depend on an external choice.

From content to hash, there are three keys: 1) how to serialize, 2) how
to hash, and 3) how to represent the hash.  For #1, Git uses its own
serializer and Guix, inheriting from Nix, uses another (Nar); although
the difference is minor.  For #2, Git uses SHA-1 as its default hash
function, whereas Guix uses SHA-256.  And for #3, Git uses hexadecimal
format and Guix uses nix-base32.

The subcommand “guix hash” with the options ’-S, -H’ and ’-f’ exposes
these 3 keys.  For instance:

        $ cat /tmp/foo.txt | git hash-object --stdin
        557db03de997c86a4a028e1ebd3a1ceb225be238
        $ ./pre-inst-env guix hash -S git -H sha1 -f hex /tmp/foo.txt
        557db03de997c86a4a028e1ebd3a1ceb225be238

To make it explicit, the checksum hash of ’git-reference’ could in
principle be removed, because it is somewhat redundant with the commit
hash.  Obviously, it cannot be, for security reasons (SHA-1 is
considered weak).
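
For a single file, the Git side of the three keys can be spelled out in
a few lines of Python.  This is only a sketch of the blob case
(serializer: a "blob <size>" header plus a NUL byte, hash: SHA-1,
format: hex); trees and commits are serialized differently and are not
covered here.

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Return what `git hash-object` prints for a file containing DATA."""
    # 1) serialize: Git prefixes the content with "blob <size>\0"
    serialized = b"blob " + str(len(data)).encode("ascii") + b"\0" + data
    # 2) hash with SHA-1, 3) represent the digest as hexadecimal
    return hashlib.sha1(serialized).hexdigest()


# The well-known id of the empty blob:
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Changing any of the three keys (say SHA-256 instead of SHA-1) gives a
different identifier for the very same content.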

> Problems arise, when upstreams move or delete tags.  At this point,
> guix packages that use them break and are no longer able to fetch their
> source code.  Raw commits are in principle resilient to this kind of
> denial of service; instead upstreams would have to actually delete the
> commits themselves, including also possible backups such as SWH to
> break it.  There is certainly an argument for robustness to be made
> here, particularly concerning `guix time-machine', though as noted it
> is not infallible.  

SWH provides ’swh:id’ which is another triplet (really close to Git).
Basically, content means data and metadata and, to make it short, SWH
handles metadata in its own way for reasons of scale.  And SWH takes
snapshots of Git repositories.

Therefore, to have something really robust, Guix has to rely on a map
from package definition to SWH.

Using a Git commit hash instead of a tag provides this map.  For a tag,
to have something robust, we need an external map from checksum hash to
SWH hash via Git commit hash.  This “external” map is provided by
Disarchive.

> Long-term, we might want to support having multiple <git-references> in
> git-fetch -- if the first one fails due to a hash mismatch, we would
> warn about that instead of producing an error and thereafter continue
> with the second, third, etc. similar to how we currently have mirror://
> urls for some well-known mirrored repositories.  That way, we have a
> system to warn us about naughty upstreams while also providing
> robustness for the time machine.

I think the long-term solution is to completely remove tags and only
use commit hashes, as done for ’guile-aiscm’.  But it will not happen,
for convenience reasons, I guess.

What you are proposing is to mix extrinsic (tag, URL, etc.) with
intrinsic (commit hash, checksum hash, etc.).  Well, I do not know if
this proposed fallback mechanism would ease the maintenance and would
make Guix more robust.

To me, robustness means making a map from intrinsic values to content,
as Disarchive is doing for instance.

Cheers,
simon


* Re: On raw strings in <origin> commit field
  2021-12-29  8:39 ` zimoun
@ 2021-12-29 20:25   ` Liliana Marie Prikler
  2021-12-30 12:43     ` zimoun
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2021-12-29 20:25 UTC (permalink / raw)
  To: zimoun, guix-devel

Hi,

On Wednesday, 29.12.2021 at 09:39 +0100, zimoun wrote:
> Hi,
> 
> On Tue, 28 Dec 2021 at 21:55, Liliana Marie Prikler
> <liliana.prikler@gmail.com> wrote:
> 
> > Consider a package being added or updated in Guix.  At the time of
> > commit, we have the tag v1.2.3 pointing towards commit deadbeef.  We
> > therefore create a guix package with version "1.2.3" pointing to
> > said commit (either directly or indirectly).  At this point, one of
> > the following holds:
> >   (1) Guix "1.2.3" -> upstream "v1.2.3" -> upstream "deadbeef"
> >   (2) Guix "1.2.3" -> upstream "deadbeef" <- upstream "v1.2.3"
> > From either, we can follow that Guix "1.2.3" = upstream "v1.2.3".  If
> > upstream keeps their tags around, then both forms are equivalent, but
> > (1) is more convenient; it allows us to derive commit from version,
> > which is often done through an affine mapping.
> 
> No, tags and hash commit are not equivalent.  Hash commit is intrinsic:
> it only depends on the content.  Whereas, tags are extrinsic, they
> depend on external choice.
The notion of equivalence I am using here is the same as in the
statement "5 ≡ 2 mod 3", wherein the ≡ symbol is ironically called
IDENTICAL TO in Unicode despite being used very differently in
mathematics.  Perhaps there is a language barrier here; in German we
read that as "5 is equivalent to 2 modulo 3" and logic equivalence
functions similarly.

For the record, one could argue that I should have used that symbol for
comparing Guix "1.2.3" to upstream "v1.2.3" because they are in fact
not equal, only equivalent, but that's beside the point.  The point
is, with an upstream behaving as we want upstreams to behave (not just
git ones, url-fetch suffers from the same issue with moving tarballs
for instance), you can substitute one for the other without a change in
meaning; both will fetch the same commit.

> From the content to the hash, three keys: 1) how to serialize and 2)
> how to hash and 3) how to represent the hash.  For #1, Git uses their
> own serializer and Guix, inheriting from Nix, uses another (Nar);
> although the difference is minor.  For #2, Git uses by default SHA-1 as
> hash function, although Guix uses SHA-256.  And for #3, Git uses
> hexadecimal format and Guix uses nix-base32.
> 
> The subcommand “guix hash” with the options ’-S, -H’ and ’-f’ exposes
> these 3 keys.  For instance:
> 
>         $ cat /tmp/foo.txt | git hash-object --stdin
>         557db03de997c86a4a028e1ebd3a1ceb225be238
>         $ ./pre-inst-env guix hash -S git -H sha1 -f hex /tmp/foo.txt
>         557db03de997c86a4a028e1ebd3a1ceb225be238
> 
> To make it explicit, the checksum hash of ’git-reference’ could be
> removed because it is somehow redundant with the commit hash.
> Obviously, it cannot because security reason (SHA-1 is considered as
> weak).
The other way also works.  If Git used a secure hashing function such
as SHA-256 (or SHA-512 or Keccak) and Guix supported that hash, we
could generate a git hash from the Guix hash (assuming also we allow
the origin serializer to be configured, which would be required either
way).

The weakness of SHA-1 also flies in the face of the robustness
argument.  One could maliciously push a commit that replaces an
existing one with the same hash, though it would also break the repo in
doing so.  At least in theory, as no such attack has been done yet. 
Note to self: theoretical attacks on Git are probably off-topic as
well.

> > Problems arise, when upstreams move or delete tags.  At this 
> > point, guix packages that use them break and are no longer able to
> > fetch their source code.  Raw commits are in principle resilient to
> > this kind of denial of service; instead upstreams would have to
> > actually delete the commits themselves, including also possible
> > backups such as SWH to break it.  There is certainly an argument
> > for robustness to be made here, particularly concerning `guix time-
> > machine', though as noted it is not infallible.  
> 
> SWH provides ’swh:id’ which is another triplet (really close to Git).
> Basically, content means data and metadata and to make it short, SWH
> deals their way with metadata for reason of large scale.  And SWH
> does snapshots of Git repositories.
> 
> Therefore, to have something really robust, Guix has to rely on a map
> from package definition to SWH.
> 
> Using Git commit hash instead of tag makes this map.  For tag, to
> have something robust, we need an external map from checksum hash to
> SWH hash via Git commit hash.  This “external” is done by Disarchive.
I don't know too much about Disarchive here, so please enlighten me. 
If it used a pair of origin file name + hash, whether or not the git-
reference uses tags would be irrelevant, no?  Do we have to take values
from the uri field?

> > Long-term, we might want to support having multiple <git-
> > references> in git-fetch -- if the first one fails due to a hash
> > mismatch, we would warn about that instead of producing an error
> > and thereafter continue with the second, third, etc. 
> > similar to how we currently have mirror:// urls for some well-known
> > mirrored repositories.  That way, we have a system to warn us about
> > naughty upstreams while also providing robustness for the time
> > machine.
> 
> I think the long term is to completely remove tag and only use commit
> hash; as done for ’guile-aiscm’.  But it will not happen for
> convenience reasons, I guess.
> 
> What you are proposing is to mix extrinsic (tag, URL, etc.) with
> intrinsic (commit hash, checksum hash, etc.).  Well, I do not know if
> this proposed fallback mechanism would ease the maintenance and would
> make Guix more robust.
I'm not sure the distinction between extrinsic and intrinsic values is
a useful one here.  The only important intrinsic value here is the
content hash, which is unlikely to break [1].  We're using extrinsic
values such as URLs all over the place, including the very line
preceding the commit value of a git-reference (almost) every time --
I'm leaving room here for some person to put the commit before the URL.

> To me, robustness means make a map from intrinsic values to content;
> as Disarchive is doing for instance.
See above, I don't understand why Disarchive would need more than the
content hash as an intrinsic value to do so.

Cheers,
Liliana

[1] "Briefly stated, if you find SHA-256 collisions scary then your
priorities are wrong." https://stackoverflow.com/a/4014407



* Re: On raw strings in <origin> commit field
  2021-12-28 20:55 On raw strings in <origin> commit field Liliana Marie Prikler
  2021-12-29  8:39 ` zimoun
@ 2021-12-30  1:13 ` Mark H Weaver
  2021-12-30 12:56   ` zimoun
  2021-12-31  3:15   ` Liliana Marie Prikler
  2021-12-31 17:56 ` Vagrant Cascadian
  2 siblings, 2 replies; 63+ messages in thread
From: Mark H Weaver @ 2021-12-30  1:13 UTC (permalink / raw)
  To: Liliana Marie Prikler, guix-devel

Hi Liliana,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
> It should be noted, that in the case of moving or deleted tags, the
> assertion Guix "1.2.3" = upstream "v1.2.3" no longer holds.

Agreed, but I don't think that assertion should be our top priority.

For purposes of Guix's core goal of enabling software to be reliably
reproduced in the future, the most important property to preserve is
that 'Guix "1.2.3"' should remain forever immutable.

An obvious corollary is that if upstream mutates the meaning of
'upstream "v1.2.3"' over time, then the equation above will become
false.  That would be an unfortunate result of upstream's actions, but
it's exactly what _needs_ to happen to enable Guix to be reliably
reproducible.

If I perform an experiment with Guix "1.2.3" and publish the results,
and someone later wishes to reproduce those results, they will want
precisely the same 'Guix "1.2.3"' that was used to perform the original
experiment, and not whatever version of the software upstream is now
calling "v1.2.3".

The simple fact is that the way Ricardo wrote the 'guile-aiscm' package
is the right way to ensure that it can be reliably reproduced in the
future.

Guix packages that refer to git _tags_ may cease to be reproducible in
the future if upstream mutates or removes those tags, and it's simply
not feasible to transform our SHA256 hashes (of the NAR-encoded source
checkout) into something that we can use to fetch the archived source
from SWH.  There's simply no hope to make that work, unless we can
convince SWH to maintain a secondary index of their content based on
NAR-encoded source trees, which seems unlikely.

On the other hand, if we refer to git _commit hashes_, then it *is*
feasible for us to fetch the archived source from SWH, regardless of
what upstream has done to its tags in the meantime.
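
As an illustration of why the commit hash helps, here is a small Python
sketch.  It assumes, to the best of my understanding, that SWH
identifiers for revisions have the form swh:1:rev:<40 hex digits> and
that for Git origins those digits are the commit id itself.  The
function name is made up, and the commit used is the guile-aiscm one
quoted elsewhere in this thread.

```python
import re


def swhid_for_commit(commit: str) -> str:
    """Build the SWH revision identifier for a full Git commit id.

    A tag such as "v0.23.1" is rejected: it is only a name, and mapping
    it to a revision needs a lookup against some repository snapshot.
    """
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError(f"{commit!r} is not a full SHA-1 commit id")
    return f"swh:1:rev:{commit}"


print(swhid_for_commit("c78b91edb7c17c6fbf3b294452f44e91d75e3c67"))
# swh:1:rev:c78b91edb7c17c6fbf3b294452f44e91d75e3c67
```

No such offline step exists when all the package records is a tag name
and a SHA256 of the NAR-encoded checkout.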

For that reason alone, I think that the way Ricardo wrote the
guile-aiscm package definition is clearly the right approach, given
Guix's longstanding goals.

> On the note
> of fallbacks, we do also have the issue that Guix fails on the first
> download that does not match the hash instead of e.g. continuing to SWH
> to fetch an archive of the old tag (as well as other fallback-related
> issues, also including the "Tricking Peer Review" thread).

That's a bug that can, and should, be fixed.  The existence of that bug
might temporarily prevent us from enjoying the benefits of Ricardo's
approach, but that's not an argument for adopting practices that push us
farther from our core goals.

What do you think?

      Regards,
        Mark

-- 
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>.


* Re: On raw strings in <origin> commit field
  2021-12-29 20:25   ` Liliana Marie Prikler
@ 2021-12-30 12:43     ` zimoun
  2021-12-31  0:02       ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: zimoun @ 2021-12-30 12:43 UTC (permalink / raw)
  To: Liliana Marie Prikler, guix-devel

Hi Liliana,

On Wed, 29 Dec 2021 at 21:25, Liliana Marie Prikler <liliana.prikler@gmail.com> wrote:
> On Wednesday, 29.12.2021 at 09:39 +0100, zimoun wrote:
>> On Tue, 28 Dec 2021 at 21:55, Liliana Marie Prikler
>> <liliana.prikler@gmail.com> wrote:

> The notion of equivalence I am using here is the same as in the
> statement "5 ≡ 2 mod 3", wherein the ≡ symbol is ironically called
> IDENTICAL TO in Unicode despite being used very differently in
> mathematics.  Perhaps there is a language barrier here; in German we
> read that as "5 is equivalent to 2 modulo 3" and logic equivalence
> functions similarly.

I do not understand what you are arguing against, so I skip it. :-)

> For the record, one could argue that I should have used that symbol for
> comparing Guix "1.2.3" to upstream "v1.2.3" because they are in fact
> not equal, only equivalent, but that's besides the point.  The point
> is, with an upstream behaving as we want upstreams to behave (not just
> git ones, url-fetch suffers from the same issue with moving tarballs
> for instance), you can substitute one for the other without a change in
> meaning; both will fetch the same commit.

If I understand you correctly:

 - Guix "1.2.3" means the field ’version’
 - upstream “v1.2.3” means the upstream tag used by the field ’commit’
   of ’git-reference’.

and yes it is strongly expected that both of these fields match. :-) But
it is irrelevant, IMHO, to your initial message «commit tags are in
principle mutable and hence can not be relied on when fetching sources.
I do have a few issues with that explanation».  It is fortunate, and not
robust, that ’commit’ matches ’version’ via upstream ’tag’.

That is because ’commit’ and ’tag’ are defined differently.

I cannot put it any other way: a Git commit depends only on the
content, whereas a ’tag’ does not.

Versions (or tags) are convenient names for humans.  It is easier to say
version 0.23.1 than
09rdbcr8dinzijyx9h940ann91yjlbg0fangx365llhvy354n840.  And we can deduce
that 0.22.3 is older than 0.23.1, whereas that is impossible for commits.
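
A tiny Python sketch of that ordering, assuming purely numeric
dot-separated versions (real version strings can be messier, and
version_key is just a name picked for the example):

```python
def version_key(version: str):
    """Turn "0.23.1" into (0, 23, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


print(version_key("0.22.3") < version_key("0.23.1"))  # True
print(version_key("0.9.0") < version_key("0.10.0"))   # True
```

Two hashes can of course be compared as strings too, but the result
says nothing about which one came first.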

If you prefer to keep the frame: «you can substitute one for the other
without a change in meaning», then, for what my opinion is worth on that
matter, my probably wrong understanding of your words is that perhaps
you are missing a point about content-addressability.

>> From the content to the hash, three keys: 1) how to serialize and 2)
>> how to hash and 3) how to represent the hash.  For #1, Git uses their
>> own serializer and Guix, inheriting from Nix, uses another (Nar);
>> although the difference is minor.  For #2, Git uses by default SHA-1 as
>> hash function, although Guix uses SHA-256.  And for #3, Git uses
>> hexadecimal format and Guix uses nix-base32.

[...]

>> To make it explicit, the checksum hash of ’git-reference’ could be
>> removed because it is somehow redundant with the commit hash.
>> Obviously, it cannot because security reason (SHA-1 is considered as
>> weak).
>
> The other way also works.  If Git used a secure hashing function such
> as SHA-256 (or SHA-512 or Keccak) and Guix supported that hash, we
> could generate a git hash from the Guix hash (assuming also we allow
> the origin serializer to be configured, which would be required either
> way).

Yes somehow.  To be on the same wavelength, we need to be precise when
we speak about hash here because hash means:

 - serializer: how to deal with all the bits making the full content
   (files, folder, tree, etc.)
 - hashing function
 - format

So yes, in principle, instead of NAR + SHA-256 + Nix-base32, the Guix
project could have chosen Git + SHA-1 + Hex, or Git + SHA-512 + Base64
or any other combination.

(I think this choice, inherited from Nix, is rooted in the daemon
implementation, and another triplet would have required more changes
when starting Guix, I guess.)

However, knowing only the final Guix checksum hash (NAR + SHA-256 +
Nix-base32), say 09rdbcr8dinzijyx9h940ann91yjlbg0fangx365llhvy354n840,
you can easily convert it to any other format (Hex or Base64), but it is
not straightforward to compute the Git commit hash (here
c78b91edb7c17c6fbf3b294452f44e91d75e3c67) from this Guix checksum hash,
because the serializers NAR and Git have minor differences, and mainly
because one uses SHA-256 and the other SHA-1, and it is generally not
possible to convert a hash from one hashing function to another.
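
The difference between changing the format and changing the hashing
function can be sketched in Python.  The example uses hex and Base64
only, since nix-base32 is not in the standard library, and the
"example content" bytes are arbitrary.

```python
import base64
import hashlib

# Changing the *format* is a lossless re-encoding of the same digest.
commit_hex = "c78b91edb7c17c6fbf3b294452f44e91d75e3c67"
raw = bytes.fromhex(commit_hex)               # the 20 raw SHA-1 bytes
as_base64 = base64.b64encode(raw).decode("ascii")
back_to_hex = base64.b64decode(as_base64).hex()
print(back_to_hex == commit_hex)              # True

# Changing the *hashing function* needs the original content: both
# digests are computed from the data, neither from the other.
data = b"example content\n"
sha1_digest = hashlib.sha1(data).digest()      # 20 bytes
sha256_digest = hashlib.sha256(data).digest()  # 32 bytes
print(len(sha1_digest), len(sha256_digest))    # 20 32
```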

To make it short, my point is: a) a Git commit hash has the same
properties as any checksum hash and b) a string tag is obviously not a
checksum.

> I don't know too much about Disarchive here, so please enlighten me. 
> If it used a pair of origin file name + hash, whether or not the git-
> reference uses tags would be irrelevant, no?  Do we have to take values
> from the uri field?

I am not sure I understand the questions.  Maybe the thread starting
here is worth reading:

    <https://yhetil.org/guix/87a6j2w1et.fsf@gnu.org>

Otherwise, could you explain more what you have in mind?

>> To me, robustness means make a map from intrinsic values to content;
>> as Disarchive is doing for instance.
>
> See above, I don't understand why Disarchive would need more than the
> content hash as an intrinsic value to do so.

Basically nothing more, so nothing to understand. :-)

Your initial message started with:

        when Ricardo recently added guile-aiscm to Guix, I was confused
        that both the version field of the package and the commit field
        of the git-reference used in its origin.  It turns out, that
        this is a rare pattern observed in less than 200 packages
        currently in Guix.  The reason to do so (as far as I understand
        and was explained to me in IRC) is that commit tags are in
        principle mutable and hence can not be relied on when fetching
        sources.  I do have a few issues with that explanation, but
        before that let's go a step back and discuss the relation of
        version and commit.

and my intent was to point out that the reason is not really the
“mutable” part; the reason is that it is better to rely on intrinsic
values (discussed in link above).  Obviously, an intrinsic value is
immutable but, IMHO, an intrinsic value is somehow a key point for
lookup in content-addressed systems.  Git-commit hash is one way, SWH-ID
is another, IPFS uses another, GNUnet another, etc.  The recent ERIS
[1,2] is an attempt to bridge them, IIUC.

Addressing ’origin’ by intrinsic values raises the question of which
ones, and The Right Thing is really hard to predict.

My opinion is that the robust long-term solution (i.e., the near future
I want) is to rely on more intrinsic values in ’source’ or ’origin’ and
fewer tags, URLs, etc.  Well, I am fine if we disagree.  You asked «What
do y'all think?», now you know what I think. :-)

Last, sorry if I am misunderstanding you, back to your initial message.
You provided ’guile-aiscm’ as one example of something that confused
you.  Instead of the current definition, you would like this definition

--8<---------------cut here---------------start------------->8---
1 file changed, 1 insertion(+), 1 deletion(-)
gnu/packages/machine-learning.scm | 2 +-

modified   gnu/packages/machine-learning.scm
@@ -299,7 +299,7 @@ (define-public guile-aiscm
               (method git-fetch)
               (uri (git-reference
                     (url "https://github.com/wedesoft/aiscm")
-                    (commit "c78b91edb7c17c6fbf3b294452f44e91d75e3c67")))
+                    (commit (string-append "v" version))))
               (file-name (git-file-name name version))
               (sha256
                (base32
--8<---------------cut here---------------end--------------->8---

?  Or something along these lines,

--8<---------------cut here---------------start------------->8---
(define-public guile-aiscm
  (let ((version "0.23.1")
        (commit "c78b91edb7c17c6fbf3b294452f44e91d75e3c67")
        (revision "0"))
    (package
      (name "guile-aiscm")
      (version (git-version version revision commit))
      (source (origin
                (method git-fetch)
                (uri (git-reference
                      (url "https://github.com/wedesoft/aiscm")
                      (commit commit)))
                (file-name (git-file-name name version))
                (sha256
                 (base32
                  "09rdbcr8dinzijyx9h940ann91yjlbg0fangx365llhvy354n840"))))
[..]
--8<---------------cut here---------------end--------------->8---

?  And your point is that “0.23.1” is redundant with
“c78b91edb7c17c6fbf3b294452f44e91d75e3c67” because of Git, so why not
just use “0.23.1” in ’origin’.  Right?

As things currently stand, I do not think a rationale can be made
in favor of any one of the three main possible definitions (addressing by
tag, by commit, using let).  The only weak justification for addressing
using commit hash is that the lookup when falling back to SWH is easier,
i.e., it is easier when the Git-commit hash is known instead of URL+tag.

These 200 packages can also be seen as real-world experiments
complementing the other ways of addressing in order to find The Right
Way for robust addressing.

My personal preference, for what it is worth, is an explicit reference
to the commit, i.e., the current definition or the ’let’ one.  Note that
this was also discussed: have convenient things such as url+tag for
’uri’ and use the checksum coupled to an external service such as
disarchive.guix.gnu.org; but the definitions would not be
self-consistent anymore.  Heh, The Right Thing is not obvious. :-)

In other words, version and tag are currently first-class while commit
is second-class, somehow.  As you said «it allows us to derive commit
from tag» (tag is mine).  And I think it is inherited from the long
history of releasing software, which is somehow inadequate these days.
Obviously, I do not know how to do it, but it should be the opposite:
commit first-class, which allows us to derive version second-class.

1: <https://inqlab.net/projects/eris/>
2: <http://issues.guix.gnu.org/issue/52555>

Cheers,
simon

PS: You said in initial email «(1) is more convenient; it allows us to
derive commit from version, which is often done through an affine
mapping.».

I do not understand the “affine mapping”.  Why would it be an affine
mapping?  Well, I miss what the affine space is here; I am able to
imagine the set but what would be the vector space?  Bah, you are
probably referring to maths I have never studied. :-)


* Re: On raw strings in <origin> commit field
  2021-12-30  1:13 ` Mark H Weaver
@ 2021-12-30 12:56   ` zimoun
  2021-12-31  3:15   ` Liliana Marie Prikler
  1 sibling, 0 replies; 63+ messages in thread
From: zimoun @ 2021-12-30 12:56 UTC (permalink / raw)
  To: Mark H Weaver, Liliana Marie Prikler, guix-devel

Hi Mark,

On Wed, 29 Dec 2021 at 20:13, Mark H Weaver <mhw@netris.org> wrote:

> Guix packages that refer to git _tags_ may cease to be reproducible in
> the future if upstream mutates or removes those tags, and it's simply
> not feasible to transform our SHA256 hashes (of the NAR-encoded source
> checkout) into something that we can use to fetch the archived source
> from SWH.  There's simply no hope to make that work, unless we can
> convince SWH to maintain a secondary index of their content based on
> NAR-encoded source trees, which seems unlikely.

Yes, that’s the core point. :-)

Basically, url+tag can work with the SWH API.  But SWH stores
snapshots so it is not always straightforward.

> On the other hand, if we refer to git _commit hashes_, then it *is*
> feasible for us to fetch the archived source from SWH, regardless of
> what upstream has done to its tags in the meantime.

Well, IMHO, the main point of the story is how to content-address, i.e.,
using which method.

SWH promotes their own encoding named ’swhid’ (basically, it looks
similar to Git commit).  Many other address types are around.

It has been discussed to maintain this secondary index via a Disarchive
database, potentially bridging from our SHA-256 to other addressing
hashes.  One issue is that the package definition is then not
self-consistent and requires this external service.  On the other hand,
we cannot predict the future, and who could tell which content-addressed
systems will still be there?  ;-)

Content-addressing is not an easy topic. :-)

Cheers,
simon

* Re: On raw strings in <origin> commit field
  2021-12-30 12:43     ` zimoun
@ 2021-12-31  0:02       ` Liliana Marie Prikler
  2021-12-31  1:23         ` zimoun
                           ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Liliana Marie Prikler @ 2021-12-31  0:02 UTC (permalink / raw)
  To: zimoun, guix-devel

Am Donnerstag, dem 30.12.2021 um 13:43 +0100 schrieb zimoun:
> Hi Liliana,
> 
> On Wed, 29 Dec 2021 at 21:25, Liliana Marie Prikler
> <liliana.prikler@gmail.com> wrote:
> > Am Mittwoch, dem 29.12.2021 um 09:39 +0100 schrieb zimoun:
> > > On Tue, 28 Dec 2021 at 21:55, Liliana Marie Prikler
> > > <liliana.prikler@gmail.com> wrote:
> 
> > The notion of equivalence I am using here is the same as in the
> > statement "5 ≡ 2 mod 3", wherein the ≡ symbol is ironically called
> > IDENTICAL TO in Unicode despite being used very differently in
> > mathematics.  Perhaps there is a language barrier here; in German
> > we read that as "5 is equivalent to 2 modulo 3" and logic
> > equivalence functions similarly.
> 
> I do not understand against what you are arguing so I skip it. :-)
I was under the impression that you and I used the word "equivalent"
differently, so I wanted to clear that up.

> > For the record, one could argue that I should have used that symbol
> > for comparing Guix "1.2.3" to upstream "v1.2.3" because they are in
> > fact not equal, only equivalent, but that's besides the point.  The
> > point is, with an upstream behaving as we want upstreams to behave
> > (not just git ones, url-fetch suffers from the same issue with
> > moving tarballs for instance), you can substitute one for the other
> > without a change in meaning; both will fetch the same commit.
> 
> If I understand you correctly:
> 
>  - Guix "1.2.3" means the field ’version’
>  - upstream “v1.2.3” means the upstream tag used by the field
> ’commit’ of ’git-reference’.
> 
> and yes it is strongly expected that these both fields matches. :-)
Well, at least we agree on something.

> But it is irrelevant, IMHO, to your initial message «commit tags are
> in principle mutable and hence can not be relied on when fetching
> sources.  I do have a few issues with that explanation».  It is
> fortunate and not robust that ’commit’ matches ’version’ via upstream
> ’tag’.
It is in fact very relevant to the issue at hand.  In principle,
versioned URLs are not robust, hence we can't have a single package
using url-fetch.  A statement like that is obviously silly, not just
because tarballs that are updated in-place are exceedingly rare, but
also because they violate how we think about versions.  The same holds
for git, with the difference being that we no longer generate a URL
from the version, but a tag.  If that tag can't serve as bridge here,
the version field loses the meaning it had from the strong expectation
that the two of them match.

> Because how ’commit’ and ’tag’ are defined is different.
> 
> I cannot tell it differently than: Git commit depends only on the
> content, although ’tag’ not.
> 
> Version (or tag) is convenient names for humans.  It is easier to
> tell version 0.23.1 than
> 09rdbcr8dinzijyx9h940ann91yjlbg0fangx365llhvy354n840.  And we can
> deduce that 0.22.3 is older than 0.23.1, when it is impossible for
> commits.
Git commit hashes do not just depend on the content.  They also depend
on how much effort you put into solving a proof of work challenge that
won't ever earn you crypto coins [1].
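
To make the "not just the content" part concrete, here is a minimal
sketch of how Git derives an object id.  It assumes Guile-Gcrypt is
installed (for the (gcrypt hash) and (gcrypt base16) modules); the
author and committer lines are made up for illustration.  Two commits
that point at the very same tree still get different ids as soon as a
timestamp differs.

```scheme
(use-modules (gcrypt hash)        ;sha1, from Guile-Gcrypt (assumed installed)
             (gcrypt base16)      ;bytevector->base16-string
             (rnrs bytevectors))

;; Git hashes the bytes "<type> <size>NUL<body>" with SHA-1.
(define (git-object-id type body)
  (let* ((payload (string->utf8 body))
         (header  (string->utf8
                   (string-append type " "
                                  (number->string
                                   (bytevector-length payload))
                                  (string #\nul))))
         (object  (make-bytevector (+ (bytevector-length header)
                                      (bytevector-length payload)))))
    (bytevector-copy! header 0 object 0 (bytevector-length header))
    (bytevector-copy! payload 0 object (bytevector-length header)
                      (bytevector-length payload))
    (bytevector->base16-string (sha1 object))))

;; Sanity check against the well-known id of the empty tree.
(git-object-id "tree" "")
;; => "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

;; Hypothetical commit metadata; only the committer timestamp varies.
(define (commit-body timestamp)
  (string-append
   "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
   "author Jane Doe <jane@example.org> 1640900000 +0000\n"
   "committer Jane Doe <jane@example.org> "
   (number->string timestamp) " +0000\n"
   "\n"
   "Release 0.1.0\n"))

;; Same tree, hence same content, but two different commit ids.
(string=? (git-object-id "commit" (commit-body 1640900000))
          (git-object-id "commit" (commit-body 1640900001)))
;; => #f
```

Tweaking such metadata until the id has a desired shape is all the
"proof of work" amounts to.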

> If you prefer to keep the frame: «you can substitute one for the
> other without a change in meaning», then, for what my opinion is
> worth on that matter, my probably wrong understanding of your words
> is that perhaps you are missing a point about content-addressability.
To be fair, I did not consider content-addressability here, because my
main concern is natural intelligence based verification.  

> > > From the content to the hash, three keys: 1) how to serialize and
> > > 2) how to hash and 3) how to represent the hash.  For #1, Git
> > > uses their own serializer and Guix, inheriting from Nix, uses
> > > another (Nar); although the difference is minor.  For #2, Git
> > > uses by default SHA-1 as hash function, although Guix uses SHA-
> > > 256.  And for #3, Git uses hexadecimal format and Guix uses nix-
> > > base32.
> 
> [...]
> 
> > > To make it explicit, the checksum hash of ’git-reference’ could
> > > be removed because it is somehow redundant with the commit hash.
> > > Obviously, it cannot because security reason (SHA-1 is considered
> > > as weak).
> > 
> > The other way also works.  If Git used a secure hashing function
> > such as SHA-256 (or SHA-512 or Keccak) and Guix supported that
> > hash, we could generate a git hash from the Guix hash (assuming
> > also we allow the origin serializer to be configured, which would
> > be required either way).
> 
> Yes somehow.  To be on the same wavelength, we need to be precise
> when we speak about hash here because hash means:
> 
>  - serializer: how to deal with all the bits making the full content
>    (files, folder, tree, etc.)
>  - hashing function
>  - format
> 
> So yes, on principles, instead of NAR + SHA-256 + Nix-base32, the
> Guix project could have chosen Git + SHA-1 + Hex, or Git + SHA-512 +
> Base64 or any other combinations.
> 
> (I think this choice inherited from Nix is rooted in daemon
> implementations and another triplet would have been more changes when
> starting Guix, I guess.)
> 
> However, knowing only the final Guix checksum hash (NAR + SHA-256 +
> Nix-base32), say
> 09rdbcr8dinzijyx9h940ann91yjlbg0fangx365llhvy354n840,
> you can easily replace by any other formats (Hex or Base64), but it
> is not straightforward to compute the Git commit hash (here
> c78b91edb7c17c6fbf3b294452f44e91d75e3c67) from this Guix checksum
> hash, because the serializer NAR and Git have minor differences, and
> mainly because one uses SHA-256 and the other SHA-1 – and it is
> generally not possible to convert the hash from one hashing function
> to another hashing function.
> 
> To make it short, my point is: a) a Git commit hash owns the same
> properties as any checksum hash and b) a string tag is obviously not
> a checksum.
Ad b), I never claimed that a string tag is a checksum.  All I'm
claiming is that *under normal circumstances* we would expect it to
point to just one commit over time, similar to how we expect a mirror URL
to point to the same tarball no matter who ends up delivering it.  Or
how we expect the same substitutes from different servers ;)

Ad a) given that the Git hash (or checksum if you will) is weaker than
other checksums used in Guix, I simply wanted to reassert that it ought
to be the first to vanish if any of them vanishes, not the last.  I
understand my crypto well enough to know that we can't simply change
serializers post hash creation; if we wanted to encode our origin
hashes in an SWH-friendly fashion, we would need to change the API
accordingly.

> > I don't know too much about Disarchive here, so please enlighten
> > me. If it used a pair of origin file name + hash, whether or not
> > the git-reference uses tags would be irrelevant, no?  Do we have to
> > take values from the uri field?
> 
> I am not sure to understand the questions.  Maybe the thread starting
> here is worth:
> 
>     <https://yhetil.org/guix/87a6j2w1et.fsf@gnu.org>
> 
> Otherwise, could you explain more what you have in mind?
I'mma quote Ludo for a change.

> SWH records the “history of the history”.  It can tell you what the
> tag pointed to at the time of a specific snapshot.
This just reiterates my point of Guix not trying hard enough with
fallbacks.  Let's say I archive git.evil.org/malicious-repo at version
0.1.0 no fewer than 64 times because I just keep changing the initial
release, and Guix still refers to it by tag because I am also in
charge of updating the Guix package and have not yet caught up to the
fact that revision/commit pairs are good, actually.  Since each of
those 64 archives has a different NAR hash, we could try fetching all
of them from SWH and pick the one that fits.

Now obviously, in the real world, we would probably switch to a
version/revision pair for an upstream that violated our expectations
once, perhaps twice, so the overhead would not be as dramatic outside
of constructed examples.  Still, for the sake of "robustness", we might
want to decide what's a robust number of retries to not get into a DoS
loop. 
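
As a rough sketch of what such a bounded fallback could look like, in
plain Guile: the procedure name, the snapshot representation and the
fetch-hash argument are all hypothetical stand-ins here, not existing
Guix or SWH interfaces.

```scheme
(use-modules (srfi srfi-1))             ;find, take

;; CANDIDATES is a list of archived snapshots, newest first.  FETCH-HASH
;; maps a snapshot to the hash of its contents.  Both are hypothetical.
(define* (first-matching-snapshot candidates expected-hash fetch-hash
                                  #:key (max-attempts 8))
  "Return the first of at most MAX-ATTEMPTS CANDIDATES whose hash equals
EXPECTED-HASH, or #f if none of them does."
  (find (lambda (snapshot)
          (equal? (fetch-hash snapshot) expected-hash))
        (take candidates (min max-attempts (length candidates)))))

;; Toy data: snapshots modelled as (date . hash) pairs.
(define snapshots
  '(("2021-10" . "hash-c") ("2021-06" . "hash-b") ("2021-01" . "hash-a")))

(first-matching-snapshot snapshots "hash-b" cdr)
;; => ("2021-06" . "hash-b")

(first-matching-snapshot snapshots "hash-a" cdr #:max-attempts 2)
;; => #f, because the third snapshot is never tried
```

The cap is what keeps a tag that moved 64 times from turning into 64
downloads.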

> > > To me, robustness means make a map from intrinsic values to
> > > content; as Disarchive is doing for instance.
> > 
> > See above, I don't understand why Disarchive would need more than
> > the content hash as an intrinsic value to do so.
> 
> Basically nothing more, so nothing to understand. :-)
> 
> Your initial messages started with:
> 
>         [...]
> 
> and my intent was to point the reason is not really the “mutable”
> part but the reason is because it is better to rely on intrinsic
> values (discussed in link above).  
By content hash, I meant NAR hash or Guix hash, not commit hash.  Sorry
for the confusion.

> Obviously, intrinsic value is immutable but, IMHO, intrinsic value is
> somehow a key-point for lookup in content-address systems.  Git-
> commit hash is one way, SWH-ID is another, IPFS uses another, GNUnet
> another, etc.  The recent ERIS [1,2] is an
> attempt to bridge, IIUC.
> 
> Addressing ’origin’ by intrinsic values implies which ones and The
> Right Thing is really hard to predict.
I don't think I agree with that assessment.  "Guix for Racket packages"
(it was called Xiden back then, but appears to have changed to denxi)
had the insane idea of allowing more than one hash in a package
definition and the source would have to match all of them.  We could do
the same in Guix, but it'd be another core-updates cycle until then.

> My opinion is that robust long-term – i.e., near future I want – is
> to rely on more intrinsic values in ’source’ or ’origin’ and less
> tags, urls, etc.  Well, I am fine if we disagree.  You asked «What do
> y'all think?», now you know what I think. :-)
> 
> Last, sorry if I am misunderstanding you, back to your initial
> message.  You provided ’guile-aiscm’ as one example of something that
> confused you.  Instead of the current definition, you would like this
> definition
> 
> --8<---------------cut here---------------start------------->8---
> 1 file changed, 1 insertion(+), 1 deletion(-)
> gnu/packages/machine-learning.scm | 2 +-
> 
> modified   gnu/packages/machine-learning.scm
> @@ -299,7 +299,7 @@ (define-public guile-aiscm
>                (method git-fetch)
>                (uri (git-reference
>                      (url "https://github.com/wedesoft/aiscm")
> -                    (commit "c78b91edb7c17c6fbf3b294452f44e91d75e3c67")))
> +                    (commit (string-append "v" version))))
>                (file-name (git-file-name name version))
>                (sha256
>                 (base32
> --8<---------------cut here---------------end--------------->8---
That would have been a perfectly fine definition in my opinion, yes.

> ?  Or something like along these lines,
> 
> --8<---------------cut here---------------start------------->8---
> (define-public guile-aiscm
>   (let ((version "0.23.1")
>         (commit "c78b91edb7c17c6fbf3b294452f44e91d75e3c67")
>         (revision "0"))
>     (package
>       (name "guile-aiscm")
>       (version (git-version version revision commit))
>       (source (origin
>                 (method git-fetch)
>                 (uri (git-reference
>                       (url "https://github.com/wedesoft/aiscm")
>                       (commit commit)))
>                 (file-name (git-file-name name version))
>                 (sha256
>                  (base32
>                   "09rdbcr8dinzijyx9h940ann91yjlbg0fangx365llhvy354n840"))))
> [..]
> --8<---------------cut here---------------end--------------->8---
> 
> ?  And your point is that “0.23.1” is redundant with
> “c78b91edb7c17c6fbf3b294452f44e91d75e3c67” because Git so why not
> just use “0.23.1” in ’origin’.  Right?
We typically don't let-bind version (i.e. we only bind revision and
commit, which is probably a wise idea as version is syntax inside
package), but sure, that's also a fine definition.  I would wonder why
you are doing that for a commit that is itself a release, but if you're
explaining to me "Well, I don't trust this weird wedesoft fellow, they
sound like the kind of person/company to change their tags more often
than their underwear" or even better had evidence of such a change, I'd
agree and push.

> In the current matter of facts, I do not think any rationale can be
> made in favor of one of the three main possible definitions
> (addressing by tag, by commit, using let).  The only weak
> justification for addressing using commit hash is that the lookup
> when fallbacking to SWH is easier, i.e., it is easier when the Git-
> commit hash is known instead of URL+tag.
In my personal opinion, the version+raw commit style can be discredited
using Cantor's diagonal argument.

> These 200 packages can also be seen as real-world experiments
> complementing the other ways of addressing in order to find The Right
> Way for robust addressing.
If a comment spanning four lines is the most reasonable way of
explaining said style to others in the source code, that alone serves
as an argument for let-binding. 

> My personal preference, for what it is worth, is an explicit
> reference to the commit, i.e., the current definition or the ’let’
> one.  Note it was also discussed this: have convenient things as
> url+tag for ’uri’ and use checksum coupled to an external service as
> disarchive.guix.gnu.org; but the definitions would be not self-
> consistent anymore.  Heh, The Right Thing is not obvious. :-)
I have trouble understanding this.  Using origin file-names and hashes
for computing fallbacks would be a good thing, no?  We could completely
decouple that from anything related to the method; if we have a backup
elsewhere, we can use it.

> Other said, version and tag are currently first-class when commit is
> second-class, somehow.  As you said «it allows us to derive commit
> from tag» (tag is mine).  And I think it is inherited from the long
> history about releasing software which is now somehow inadequate
> these days.  Obviously, I do not know how to do but it should be the
> contrary: commit first-class which allows us to derive version
> second-class.
Let's put humans before machines, they're not our overlords (yet).

> PS: You said in initial email «(1) is more convenient; it allows us
> to derive commit from version, which is often done through an affine
> mapping.».
> 
> I do not understand the “affine mapping”.  Why would it be an affine
> mapping?  Well, I miss what is the affine space here, I am able to
> imagine the set but what would be the vector space?  Bah you are
> probably referring to maths I have never studied. :-)
I thought affine mappings were a fine substitute for bijective ones,
but it turns out this time it was I who sucks at maths.  The original
point I was making though, is that we often just have to prepend "v" or
some other version marker to get from the Guix version to the tag, for
which it doesn't matter if that's an affine mapping or a bijective one,
as it's both affine and bijective.
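
Spelled out in plain Guile, the version-to-tag step is just the
following; the procedure names are made up for illustration, and the
"v" prefix is merely the most common convention.

```scheme
;; Guix version -> upstream tag.
(define (version->tag version)
  (string-append "v" version))

;; Upstream tag -> Guix version, or #f if the tag lacks the prefix.
(define (tag->version tag)
  (and (string-prefix? "v" tag)
       (string-drop tag 1)))

(version->tag "0.23.1")                  ;=> "v0.23.1"
(tag->version (version->tag "0.23.1"))   ;=> "0.23.1"
```

Since the round trip gives back the version, no information is lost in
going from version to tag, which is all the original point needs.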

Cheers


* Re: On raw strings in <origin> commit field
  2021-12-31  0:02       ` Liliana Marie Prikler
@ 2021-12-31  1:23         ` zimoun
  2021-12-31  3:27           ` Liliana Marie Prikler
  2021-12-31 23:36         ` Mark H Weaver
  2021-12-31 23:56         ` Mark H Weaver
  2 siblings, 1 reply; 63+ messages in thread
From: zimoun @ 2021-12-31  1:23 UTC (permalink / raw)
  To: Liliana Marie Prikler, guix-devel

Hi Liliana,

I have read all your emails a couple of times and I am sorry, I am still
missing the point you are raising.  I feel we are failing to explain
ourselves to each other; that’s fine, it happens sometimes :-) I hope
others will find the intersection of this discussion.

Honestly, I am lost somewhere between affine space and
Cantor’s diagonal argument. ;-)

I agree with this statement:

    >> SWH records the “history of the history”.  It can tell you what the
    >> tag pointed to at the time of a specific snapshot.
    This just reiterates my point of Guix not trying hard enough with
    fallbacks.

and in my view, the path to a more robust fallback is via more
immutable content addresses and intrinsic values, and fewer mutable,
breakable strings such as URL+tag.  Obviously, 1) not everything is
black or white and many things are grey, as always, and 2) we have to
deal with this broken world of URL+tag, thus I hope we will improve the
SWH fallback to search through various snapshots instead of considering
only the last one.

However, consider that SWH is an archive, not a forge or a mirror.  It
means that SWH ingests this or that only every X months.  Therefore, you
have no guarantee that the snapshots represent the complete history of
the history.

For sure, upstream can remove some commits between two ingestions.  But,
most of the time, commits (history) are kept and the bad practise is to
just move the pointer (tag) from one commit to another.

> I'mma quote Ludo for a change.

And for completeness, let me quote Ludo again from the same thread. :-)

        No, I think we should consider always referring to commits
        instead of tags.  It’s annoying from a readability viewpoint,
        but it would ensure reproducibility.  Even flatpak has this
        policy.  :-)

          https://github.com/flathub/flathub/wiki/App-Requirements

<https://yhetil.org/guix/87mtmr2a3t.fsf_-_@gnu.org/>

Cheers,
simon

* Re: On raw strings in <origin> commit field
  2021-12-30  1:13 ` Mark H Weaver
  2021-12-30 12:56   ` zimoun
@ 2021-12-31  3:15   ` Liliana Marie Prikler
  2021-12-31  7:57     ` Taylan Kammer
  2022-01-01  1:41     ` Mark H Weaver
  1 sibling, 2 replies; 63+ messages in thread
From: Liliana Marie Prikler @ 2021-12-31  3:15 UTC (permalink / raw)
  To: Mark H Weaver, guix-devel

Hi Mark,

Am Mittwoch, dem 29.12.2021 um 20:13 -0500 schrieb Mark H Weaver:
> Hi Liliana,
> 
> Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
> > It should be noted, that in the case of moving or deleted tags, the
> > assertion Guix "1.2.3" = upstream "v1.2.3" no longer holds.
> 
> Agreed, but I don't think that assertion should be our top priority.
> 
> For purposes of Guix's core goal of enabling software to be reliably
> reproduced in the future, the most important property to preserve is
> that 'Guix "1.2.3"' should remain forever immutable.
> 
> An obvious corollary is that if upstream mutates the meaning of
> 'upstream "v1.2.3"' over time, then the equation above will become
> false.  That would be an unfortunate result of upstream's actions, but
> it's exactly what _needs_ to happen to enable Guix to be reliably
> reproducible.
> 
> If I perform an experiment with Guix "1.2.3" and publish the results,
> and someone later wishes to reproduce those results, they will want
> precisely the same 'Guix "1.2.3"' that was used to perform the original
> experiment, and not whatever version of the software upstream is now
> calling "v1.2.3".
I agree with you so far, though with some nuance.  Obviously, when
travelling back in time, we want Guix' "1.2.3" to be whatever it was by
that point, but on the other hand, we also want a recently pulled Guix
to have a reasonably recent "v1.2.3" if it claims to have "1.2.3".  So
we have two proposals at odds with each other, with only the origin-
hash to determine which interpretation to prioritize.

> The simple fact is that the way Ricardo wrote the 'guile-aiscm' package
> is the right way to ensure that it can be reliably reproduced in the
> future.
And here I disagree.  This reasoning presupposes that we have to ensure
that the package still points to the same commit if the tag changes,
which itself presupposes that the tag does change.  However, if we are
always talking about more than one possible "1.2.3" (with the included
future tag that we have yet to witness), we lose the basis by which we
currently assign "1.2.3" as the version (instead of using git-version,
as we expect it won't be "1.2.3" at some future point).  This scheme
only makes sense when it doesn't make sense and when it doesn't make
sense it makes sense.

> Guix packages that refer to git _tags_ may cease to be reproducible in
> the future if upstream mutates or removes those tags, and it's simply
> not feasible to transform our SHA256 hashes (of the NAR-encoded source
> checkout) into something that we can use to fetch the archived source
> from SWH.  There's simply no hope to make that work, unless we can
> convince SWH to maintain a secondary index of their content based on
> NAR-encoded source trees, which seems unlikely.
As pointed out elsewhere, SWH keeps a history of the tags that we could
look up until one matches, and there'd also be the option to keep a
secondary index ourselves (or have a third party do it).

> On the other hand, if we refer to git _commit hashes_, then it *is*
> feasible for us to fetch the archived source from SWH, regardless of
> what upstream has done to its tags in the meantime.
> 
> For that reason alone, I think that the way Ricardo wrote the
> guile-aiscm package definition is clearly the right approach, given
> Guix's longstanding goals.
To me, it rather sounds like a workaround for longstanding bugs [1, 2].
And then again it rests on the assumption that upstream does awful
things to their tags which makes no sense when it makes sense.

> > On the note of fallbacks, we do also have the issue that Guix fails
> > on the first download that does not match the hash instead of e.g.
> > continuing to SWH to fetch an archive of the old tag (as well as
> > other fallback-related issues, also including the "Tricking Peer
> > Review" thread).
> 
> That's a bug that can, and should, be fixed.  The existence of that
> bug might temporarily prevent us from enjoying the benefits of
> Ricardo's approach, but that's not an argument for adopting practices
> that push us farther from our core goals.
> 
> What do you think?
Which bug are you talking about?  "Tricking Peer Review" or the
fallback thing?  If it's the fallback thing, then that's an enabler for
Ricardo's approach, since as you pointed out the commit will still be
fetched correctly from SWH (if not from the main repo itself). 
Insufficient fallbacks are what make it painful to refer to moving
commit tags by tag, since our SWH lookup currently breaks when it
doesn't need to.  Having sufficient fallbacks would mean that we could
use git tags as we did before even in those cases, since the SWH (or
other) fallback would kick in and give us the historical version
matching our origin-hash.

Now, "Tricking Peer Review" is a harder thing to circumvent.  We would
need to issue a warning, preferably a big one, if fallbacks do kick in
unintentionally, i.e. particularly outside of time-machine.

Cheers

[1] https://issues.guix.gnu.org/28659
[2] https://issues.guix.gnu.org/39575


* Re: On raw strings in <origin> commit field
  2021-12-31  1:23         ` zimoun
@ 2021-12-31  3:27           ` Liliana Marie Prikler
  2021-12-31  9:31             ` Ricardo Wurmus
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2021-12-31  3:27 UTC (permalink / raw)
  To: zimoun, guix-devel

Hi,

Am Freitag, dem 31.12.2021 um 02:23 +0100 schrieb zimoun:
> Hi Liliana,
> 
> I have read all your emails a couple of times and I am sorry I am
> still missing what you are raising.  Because I feel we are failing to
> explain each other, that’s fine, it happens sometimes :-) I hope
> others will find the intersection of this discussion.
> 
> Honestly I am lost in the middle of somewhere between affine space
> and Cantor’s diagonal argument. ;-)
To shorten it, we currently have the following problem.  We can specify
the commit field
  (1) by tag, which is not robust,
  (2) by raw commit, which is confusing and can be misleading, or
  (3) by let-bound commit with a special version field, which is still
      confusing if we're at an epsilon revision (epsilon meaning nothing
      changed since release and we're not expecting change), but beats
      (1) in robustness and (2) in readability at the cost of having a
      funky version field.
I am arguing that we should go with either option one or three, but
never two.
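
For reference, the three options could be written side by side roughly
as follows.  This is only a sketch: the helper aiscm-source and the
variable names are made up, the URL, commit and hash are the ones quoted
earlier in this thread, and it assumes that upstream's v0.23.1 tag still
points at that commit.

```scheme
(use-modules (guix packages)            ;origin, base32
             (guix git-download))       ;git-fetch, git-reference, ...

(define %aiscm-url "https://github.com/wedesoft/aiscm")
(define %aiscm-commit "c78b91edb7c17c6fbf3b294452f44e91d75e3c67")

;; Hypothetical helper: VERSION only feeds the file name, COMMIT (a tag
;; or a commit id) is what git-fetch checks out.
(define (aiscm-source version commit)
  (origin
    (method git-fetch)
    (uri (git-reference
          (url %aiscm-url)
          (commit commit)))
    (file-name (git-file-name "guile-aiscm" version))
    (sha256
     (base32 "09rdbcr8dinzijyx9h940ann91yjlbg0fangx365llhvy354n840"))))

;; Option 1: the commit field is derived from the version via the tag.
(define by-tag
  (let ((version "0.23.1"))
    (aiscm-source version (string-append "v" version))))

;; Option 2: a plain version next to an unrelated-looking raw commit.
(define by-raw-commit
  (aiscm-source "0.23.1" %aiscm-commit))

;; Option 3: let-bound commit and revision; the version is built by
;; git-version and thus visibly tied to the commit.
(define by-let-bound-commit
  (let ((commit %aiscm-commit)
        (revision "0"))
    (aiscm-source (git-version "0.23.1" revision commit) commit)))
```

In option 2 nothing in the code connects "0.23.1" to the commit string,
which is the part I find misleading.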

> I agree with this statement:
> 
>     >> SWH records the “history of the history”.  It can tell you what
>     >> the tag pointed to at the time of a specific snapshot.
>     This just reiterates my point of Guix not trying hard enough with
>     fallbacks.
> 
> and in my views, the path to robustify the fallback is via more
> immutable content-address and intrinsic values and less mutable
> broken string as URL+tag.  Obviously, 1) all is not white or black
> and many things are grey as always and 2) we have to deal with this
> broken world of URL+tag, thus I hope we will improve the fallback SWH
> through various snapshots instead of considering only the last one.
> 
> However consider that SWH is an archive, not a forge or a mirror.  It
> means that SWH ingests this or that only every X months.  Therefore,
> you have no guarantee that the snapshots represents the complete
> history of history.
> 
> For sure, upstream can remove some commits between two ingestions. 
> But, most of the time, commits (history) are kept and the bad
> practise is to just move the pointer (tag) from one commit to
> another.
To be fair, you raise a good point with SWH not being infallible, so we
might want to have fallbacks that work regardless of it.  That being
said, we don't support git-mirrors yet either, do we?

> > I'mma quote Ludo for a change.
> 
> And for completeness, let quote Ludo again from the same thread. :-)
> 
>         No, I think we should consider always referring to commits
>         instead of tags.  It’s annoying from a readability viewpoint,
>         but it would ensure reproducibility.  Even flatpak has this
>         policy.  :-)
> 
>           https://github.com/flathub/flathub/wiki/App-Requirements
... 

Fine, I'mma quote flathub for a change.
> When building from a git tag, both the tag name and the commit id
> should be specified, like so:
> 
>    "tag": "1.0.4",
>    "commit": "cdfb19b90587bc0c44404fae30c139f9ec1cca5c"
It's almost as though they know a commit without a tag has no intrinsic
meaning.  Also, I'm pretty sure flatpak could care less about hashes if
asked to (unlike Guix, which requires you to provide one to be granted
network access), so the SHA-1 hash here, which is optional except by
policy, is not comparable to the SHA-256 hash we always require.

Also, even if we do decide to sacrifice readability for the greater
reproducibility, I think this approach is especially hostile to the
future reader in the very case it is concerned about.  For all they
know, this particular Guix package of version "1.0.4" could have had
cosmetic changes applied, which were not at all cosmetic, but rather a
downgrade to 0.1.4.

Cheers


* Re: On raw strings in <origin> commit field
  2021-12-31  3:15   ` Liliana Marie Prikler
@ 2021-12-31  7:57     ` Taylan Kammer
  2021-12-31 10:55       ` Liliana Marie Prikler
  2022-01-01  1:41     ` Mark H Weaver
  1 sibling, 1 reply; 63+ messages in thread
From: Taylan Kammer @ 2021-12-31  7:57 UTC (permalink / raw)
  To: Liliana Marie Prikler, Mark H Weaver, guix-devel

On 31.12.2021 04:15, Liliana Marie Prikler wrote:
>                                                   [...] Obviously, when
> travelling back in time, we want Guix' "1.2.3" to be whatever it was by
> that point, but on the other hand, we also want a recently pulled Guix
> to have a reasonably recent "v1.2.3" if it claims to have "1.2.3". [...]

I think here lies the crux of the disagreement.  As far as I understand,
Guix doesn't intend to support the notion that one version string could
represent two different actual versions of a program throughout time.

Rather, I think, the reason Guix keeps both the tag and commit ref is
simply that the tag could disappear from the repo.  (In my experience,
that's easy to do by accident when you clone a repo and push it to a
new location.  You have to fetch and push the tags explicitly.)

If a tag ever *was* changed to point to a different commit, meaning that
the same version string now represents a different actual version, then
I think Guix would give that version a new name, such as "1.2.3-new" or
whatever.  I don't know if this ever actually happened, but I think this
is how Guix would probably want to deal with it if it does.  Having one
string represent two different actual versions is just really terrible
and I don't see Guix ever supporting such a practice.

[tangent follows]

(A software developer might argue that two different commits actually
are the same version of the software, say for instance because only a
minor change in the build system or README file or such was made, i.e.
files that are considered "not part of the end-product," but in Guix
land I think we wouldn't let that fly.  Maybe an exception would be
made if it was proven that the actual package produced by Guix from
both commits will always be bit-identical.  Even then, better not.)

P.S. I hope I'm actually helping to add clarity to the thread instead
of more confusion by adding my voice.  I was just skimming the ML,
found this thread interesting, and thought I might be able to add
clarity, because it seemed a little confusing. :-)

-- 
Taylan

* Re: On raw strings in <origin> commit field
  2021-12-31  3:27           ` Liliana Marie Prikler
@ 2021-12-31  9:31             ` Ricardo Wurmus
  2021-12-31 11:07               ` Liliana Marie Prikler
  2021-12-31 13:15               ` zimoun
  0 siblings, 2 replies; 63+ messages in thread
From: Ricardo Wurmus @ 2021-12-31  9:31 UTC (permalink / raw)
  To: Liliana Marie Prikler; +Cc: guix-devel


Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

>> And for completeness, let quote Ludo again from the same thread. :-)
>> 
>>         No, I think we should consider always referring to commits
>>         instead of tags.  It’s annoying from a readability viewpoint,
>>         but it would ensure reproducibility.  Even flatpak has this
>>         policy.  :-)
>> 
>>           https://github.com/flathub/flathub/wiki/App-Requirements
> ... 
>
> Fine, I'mma quote flathub for a change.
>> When building from a git tag, both the tag name and the commit id
>> should be specified, like so:
>> 
>>    "tag": "1.0.4",
>>    "commit": "cdfb19b90587bc0c44404fae30c139f9ec1cca5c"
> It's almost as though they know a commit without a tag has no intrinsic
> meaning.  Also, I'm pretty sure flatpak could care less about hashes if
> asked to (unlike Guix, which requires you provide one to be granted
> network access), so the optional by means other than policy SHA-1 hash
> here is not comparable to the required SHA-256 hash we always have.

In the past I’ve also added a comment above the raw commit, stating that
it corresponds to the given version.

I have no strong feelings for or against any of the proposed options.  I
think that using raw commits might not be great for our tooling because
we’re not reusing an existing version string and would need to remember
to update the raw commit as well.  But other than that I don’t find the
raw commit to introduce readability problems for humans.

-- 
Ricardo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: On raw strings in <origin> commit field
  2021-12-31  7:57     ` Taylan Kammer
@ 2021-12-31 10:55       ` Liliana Marie Prikler
  0 siblings, 0 replies; 63+ messages in thread
From: Liliana Marie Prikler @ 2021-12-31 10:55 UTC (permalink / raw)
  To: Taylan Kammer, Mark H Weaver, guix-devel

Hi,

Am Freitag, dem 31.12.2021 um 08:57 +0100 schrieb Taylan Kammer:
> On 31.12.2021 04:15, Liliana Marie Prikler wrote:
> >                                                   [...] Obviously,
> > when travelling back in time, we want Guix' "1.2.3" to be whatever
> > it was by that point, but on the other hand, we also want a
> > recently pulled Guix to have a reasonably recent "v1.2.3" if it
> > claims to have "1.2.3". [...]
> 
> I think here lies the crux of the disagreement.  As far as I
> understand, Guix doesn't intend to support the notion that one
> version string could represent two different actual versions of a
> program throughout time.
It does not [intend ...], but the failure mode is important here,
particularly also for outside observers.  An outside observer seeing
that Guix uses commit deadbeef for "1.2.3" whereas upstream has
bedeadaf for the same might not know that upstream moved their tag
and given that committers can do much without oversight, they could
also sneak in a malicious deadbeef when upstream "1.2.3" was actually
d000000d all along.  If Ricardo had pushed to staging or core-updates,
that commit could have gone unnoticed for far longer (and just to be
sure, I do trust Ricardo to pick the right commit and would likely not
even bother checking if I was used to that scheme).

Seeing `guix build' fail because upstream hopped tags is frustrating
from a reproducibility angle, but it makes it somewhat easier to assign
blame and move forward.  Similarly, if we use git-version where we are
unsure if upstreams play nice, we never claim to package a canonical
"1.2.3", but a particular commit that advertises being "1.2.3" through
other means, such as configure files.  It would be obvious that Guix
always packaged that commit.

> Rather, I think, the reason Guix keeps both the tag and commit ref is
> simply that the tag could disappear from the repo.  (In my
> experience, that's easy to do by accident when you clone a repo and
> push it to a new location.  You have to fetch and push the tags
> explicitly.)
Changing locations is not an issue here as we don't have git mirrors.

> If a tag ever *was* changed to point to a different commit, meaning
> that the same version string now represents a different actual
> version, then I think Guix would give that version a new name, such
> as "1.2.3-new" or whatever.  I don't know if this ever actually
> happened, but I think this is how Guix would probably want to deal
> with it if it does.  Having one string represent two different actual
> versions is just really terrible and I don't see Guix ever supporting
> such a practice.
Currently, Guix "supports" this practice by going from tags to (git-
version base revision commit), i.e. doing what you'd expect it to do. 
I think we did find a few badly behaving upstreams by virtue of using
tag and (hopefully) moved to git-version for all of those.

The alternative proposal would support this practice by not caring
about tags and let upstreams do as they please because they can't break
our tooling anyway, YOLO.  In a sense, we are trying to find technical
solutions to social issues here.

> [tangent follows]
> 
> (A software developer might argue that two different commits actually
> are the same version of the software, say for instance because only a
> minor change in the build system or README file or such was made,
> i.e. files that are considered "not part of the end-product," but in
> Guix land I think we wouldn't let that fare.  Maybe an exception
> would be made if it was proven that the actual package produced by
> Guix from both commits will always be bit-identical.  Even then,
> better not.)
If the documentation is included in the end product (which it hopefully
is), then yes, that hash would also change.  If you change your gitlab
CI yaml, because you typo'd hard and then the CI failed to build a
release tarball, I think we as Guix can see that this is a one time
thing while you're still young and move to the new commit.  If you do
it more often then yeah, no love from us.

> P.S. I hope I'm actually helping to add clarity to the thread instead
> of more confusion by adding my voice.  I was just skimming the ML,
> found this thread interesting, and thought I might be able to add
> clarity, because it seemed a little confusing. :-)
At least imo your opinion helps, as it also helps me formulate my own
ideas clearly.  If there's a particular thing you're confused about, do
not hesitate to ask :)

Cheers


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: On raw strings in <origin> commit field
  2021-12-31  9:31             ` Ricardo Wurmus
@ 2021-12-31 11:07               ` Liliana Marie Prikler
  2021-12-31 12:31                 ` Ricardo Wurmus
  2021-12-31 13:15               ` zimoun
  1 sibling, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2021-12-31 11:07 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

Hi Ricardo,

Am Freitag, dem 31.12.2021 um 10:31 +0100 schrieb Ricardo Wurmus:
> In the past I’ve also added a comment above the raw commit, stating
> that it corresponds to the given version.
> 
> I have no strong feelings for or against any of the proposed
> options.  I think that using raw commits might not be great for our
> tooling because we’re not reusing an existing version string and
> would need to remember to update the raw commit as well.  But other
> than that I don’t find the raw commit to introduce readability
> problems for humans.
In German, we have the word "Betriebsblindheit", which describes the
state of being so used to a routine that it's no longer questioned. 
Particularly here, you're used to raw commit hashes, so you no longer
feel the need to add a comment explaining that it corresponds to a
given tag, which others (such as myself, your past self and possibly
your future self) would need at least until they themselves also turn a
blind eye to raw hashes.  

I don't think I want to achieve a state of mass raw commit blindness as
I fear that would undermine some of the security that git-version
currently provides us with.  WDYT?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: On raw strings in <origin> commit field
  2021-12-31 11:07               ` Liliana Marie Prikler
@ 2021-12-31 12:31                 ` Ricardo Wurmus
  2021-12-31 13:18                   ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: Ricardo Wurmus @ 2021-12-31 12:31 UTC (permalink / raw)
  To: Liliana Marie Prikler; +Cc: guix-devel


Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

> Particularly here, you're used to raw commit hashes, so you no longer
> feel the need to add a comment explaining that it corresponds to a
> given tag, which others (such as myself, your past self and possibly
> your future self) would need at least until they themselves also turn a
> blind eye to raw hashes.  

FWIW this is not applicable here.  I’m certainly not “used” to using raw
commit hashes.  I adopted this practise after Ludo’s comments that were
quoted upthread.

I’ll go whichever way the maintainers decide, because I think there are
upsides and downsides to all proposed methods, and I just can’t make
myself feel passionately about any of them :)

-- 
Ricardo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: On raw strings in <origin> commit field
  2021-12-31  9:31             ` Ricardo Wurmus
  2021-12-31 11:07               ` Liliana Marie Prikler
@ 2021-12-31 13:15               ` zimoun
  2021-12-31 15:19                 ` Liliana Marie Prikler
  1 sibling, 1 reply; 63+ messages in thread
From: zimoun @ 2021-12-31 13:15 UTC (permalink / raw)
  To: Ricardo Wurmus, Liliana Marie Prikler; +Cc: guix-devel

Hi all,

On Fri, 31 Dec 2021 at 10:31, Ricardo Wurmus <rekado@elephly.net> wrote:

> I have no strong feelings for or against any of the proposed options.  I
> think that using raw commits might not be great for our tooling because
> we’re not reusing an existing version string and would need to remember
> to update the raw commit as well.  But other than that I don’t find the
> raw commit to introduce readability problems for humans.

By tooling, Ricardo, do you mean the ’importers’ and other ’updaters’?

Well, a general minor comment about readability and metadata.  The
anatomy of a package is:

--8<---------------cut here---------------start------------->8---
(define-public a-symbol
  (package
    (name "a-name")
    (version "1.2.3")
    (source (origin
              (method git-fetch)
              (uri (git-reference
                    (url "https://an-url.somewhere")
                    (commit ????)))
              (file-name (git-file-name name version))
              (sha256
               (base32
                "09rdbcr8dinzijyx9h940ann91yjlbg0fangx365llhvy354n840"))))
    (build-system gnu-build-system)
    (home-page "https://another-url.somewhere")
    (synopsis "Guile extension for numerical arrays and tensors")
    (description "AIscm is a Guile extension for numerical arrays and tensors.
Performance is achieved by using the LLVM JIT compiler.")
    (license license:gpl3+)))
--8<---------------cut here---------------end--------------->8---

and here, a-symbol, a-name and various home-page, synopsis, description
are Guix specific.  They are metadata added by Guix packagers.

Version is also Guix specific.  Sometimes, we patch; for security
reasons, for fixing a bug, for quickly backporting something, for
removing non-free bits, for unbundling stuff, for making it work with the
rest of Guix packages or for whatever other reasons – or we apply some
options for building specifically for Guix.  Then, the version “1.2.3”
is not always changed and therefore it does not necessarily correspond to
what upstream refers to as “1.2.3”, or what Debian calls “1.2.3”, etc.

The field ’version’ is Guix specific, at the same level of metadata as
’name’, ’home-page’, ’synopsis’ or ’description’.  In other words, these
fields only depend on choices made by the Guix packagers.

Then, the ’origin’ part is not Guix specific.  It is only upstream
specific.

Obviously, as a package distributor, Guix keeps its specific ’version’
as close as possible to what upstream refers to as their version, most
of the time using Git tags.  This tag is upstream specific: sometimes
it is “v1.2.3”, sometimes “1.2.3”, sometimes “release-1.2.3”, sometimes
“r1.2.3”, or whatever else.  We often map ’version’ to ’tag’ using
’string-append’.

For other methods than git-fetch, we also use a map, but instead, from
’version’ to URL, or from ’version’ to ’changeset’, or from ’version’ to
’revision’, etc.
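
To make the two options for the ’commit’ field concrete, here is a
minimal sketch.  It assumes an upstream that tags its releases as
“v1.2.3”; the URL is the placeholder from the example above, and the
commit hash and the names ’example-version’, ’reference-by-tag’ and
’reference-by-commit’ are made up for illustration.

```scheme
;; Sketch only: the URL and the commit hash are placeholders, not a
;; real repository.
(use-modules (guix git-download))

;; Hypothetical stand-in for the package's 'version' field.
(define example-version "1.2.3")

;; Option 1: derive the tag from the version with 'string-append'.
(define reference-by-tag
  (git-reference
   (url "https://an-url.somewhere")
   (commit (string-append "v" example-version))))

;; Option 2: spell out the raw commit, with a comment tying it back
;; to the tag it was taken from.
(define reference-by-commit
  (git-reference
   (url "https://an-url.somewhere")
   ;; Commit that the tag "v1.2.3" pointed to (hypothetical hash).
   (commit "deadbeefdeadbeefdeadbeefdeadbeefdeadbeef")))
```

With the first form, updating ’version’ is enough to follow a new
upstream tag; with the second, the hash has to be updated by hand but
no longer depends on upstream keeping the tag in place.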

On a side note, I fail to see why using a commit hash is an issue for
’git-fetch’ – despite its content-addressing advantages – when it seems
not to be one for ’svn-fetch’, as in:

--8<---------------cut here---------------start------------->8---
    (version "0.5.1")
    (source
     (origin
       (method svn-fetch)
       (uri (svn-reference
             (url (string-append
                   "https://code.call-cc.org/svn/chicken-eggs/"
                   "release/5/srfi-1/tags/"
                   version))
             (revision 39055)
             (user-name "anonymous")
             (password "")))
       (file-name (string-append "chicken-srfi-1" version "-checkout"))
       (sha256
        (base32
         "02940zsjrmn7c34rnp1rllm2nahh9jvszlzrw8ak4pf31q09cmq1"))))
--8<---------------cut here---------------end--------------->8---

or other example

--8<---------------cut here---------------start------------->8---
  (let ((revision 505)
        (release "1.09.01"))
    (package
      (name "fullswof-2d")
      (version release)
      (source (origin
               (method svn-fetch)
               (uri (svn-reference
                     (url (string-append "https://subversion.renater.fr/"
                                         "anonscm/svn/fullswof-2d/tags/"
                                         "release-" version))
                     (revision revision)))
               (file-name (string-append "fullswof-2d-" version "-checkout"))
               (sha256
                (base32
                 "16v08dx7h7n4wyddzbwimazwyj74ynis12mpjfkay4243npy44b8"))))
--8<---------------cut here---------------end--------------->8---

I leave aside the readability point for git-fetch or any others since it
is only a matter of habits or, more precisely, collective conventions and
a bit of personal preference. :-)

When we speak about robustness and long-term, the issue is the field
’uri’.  Having something extrinsic, i.e., which does not depend on the
content, as URL+tag or URL+revision or just URL leads to fragile
fetching methods depending on the Moon phase.

What Disarchive is currently doing for url-fetch is somehow to index by
integrity field, depending only on the content itself (sha256; usually
not using nix-base32 format referred as ’base32’ in ’origin’ but instead
’base16’ format, whatever).  In short and quickly said, Disarchive-DB
does 2 things more or less, first it somehow maps from this integrity
hash to swhid hash allowing to lookup in SWH archive and fetches the
data, and second it stores metadata, indexed by integrity field,
allowing to reassemble the content = data + metadata.

We were discussing to do this strategy for all the fetching methods.
And potentially add more than swhid hash as content-address systems;
somehow.

All the robustness now relies on the availability of the Disarchive
service.  Based on this context, what I find missing in the whole
discussion is that Git has a built-in solution (the commit hash), and
the arguments for not using it appear weak to me considering the easy
advantage it brings.

It is a difficult topic to know what information the ’uri’ field should
contain for robust long-term; a topic with a lot of unknowns, although
many solutions are around, they are a strong change of habits and
changing my own habits is already hard, so a collective change is a big
collective challenge. :-)

For instance, SWH promotes swhid instead of DOI for referencing the
publications.  I am not sure it is really popular outside a small French
subgroup. ;-)

Somehow, find some rationale –readability, matching versions, etc.– and
then find counter-measures of their flaws to keep extrinsic values –tag,
revision, etc.– is, for what my opinion is worth, not the correct level
or frame when thinking about robustness and long-term.

Cheers,
simon

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: On raw strings in <origin> commit field
  2021-12-31 12:31                 ` Ricardo Wurmus
@ 2021-12-31 13:18                   ` Liliana Marie Prikler
  0 siblings, 0 replies; 63+ messages in thread
From: Liliana Marie Prikler @ 2021-12-31 13:18 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

Am Freitag, dem 31.12.2021 um 13:31 +0100 schrieb Ricardo Wurmus:
> 
> Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
> 
> > Particularly here, you're used to raw commit hashes, so you no
> > longer feel the need to add a comment explaining that it
> > corresponds to a given tag, which others (such as myself, your past
> > self and possibly your future self) would need at least until they
> > themselves also turn a blind eye to raw hashes.  
> 
> FWIW this is not applicable here.  I’m certainly not “used” to using
> raw commit hashes.  I adopted this practise after Ludo’s comments
> that were quoted upthread.
Pardon me for misunderstanding then.  For what it's worth, I do not
read that quote as "let's use raw hashes without context", though, I
think it makes the much weaker statement that we ought to bind the
commit field to a value that is actually a commit (as opposed to a
branch or a tag).  Throughout Guix this has mostly meant let-binding a
revision/commit pair.
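
For illustration, such a let-bound revision/commit pair could look like
the following sketch.  The URL, the commit and the name
’example-origin’ are placeholders, and the sha256 is simply copied from
the example earlier in the thread, so it will not match any real
checkout.

```scheme
;; Sketch only: placeholder URL, commit and hash.
(use-modules (guix packages)
             (guix git-download))

(define example-origin
  (let* ((base "1.2.3")       ;what upstream advertises as its version
         (revision "0")       ;bumped whenever we move to another commit
         (commit "deadbeefdeadbeefdeadbeefdeadbeefdeadbeef")
         ;; 'git-version' folds the revision and the abbreviated commit
         ;; into the version string.
         (version (git-version base revision commit)))
    (origin
      (method git-fetch)
      (uri (git-reference
            (url "https://an-url.somewhere")
            (commit commit)))
      (file-name (git-file-name "a-name" version))
      (sha256
       (base32
        "09rdbcr8dinzijyx9h940ann91yjlbg0fangx365llhvy354n840")))))
```

Here the version string would come out as "1.2.3-0.deadbee", so the
commit stays visible in the version instead of sitting alone in the
commit field.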

> I’ll go whichever way the maintainers decide, because I think there
> are upsides and downsides to all proposed methods, and I just can’t
> make myself feel passionately about any of them :)
Fair enough, it was mentioned that we ought to RFC here and Tobias also
mentioned that in IRC, but it so far didn't happen.  I'm not sure
whether this is the right thread for discussion given that I'm rather
opinionated, but let's see what happens.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: On raw strings in <origin> commit field
  2021-12-31 13:15               ` zimoun
@ 2021-12-31 15:19                 ` Liliana Marie Prikler
  2021-12-31 17:21                   ` zimoun
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2021-12-31 15:19 UTC (permalink / raw)
  To: zimoun, Ricardo Wurmus; +Cc: guix-devel

Hi,

Am Freitag, dem 31.12.2021 um 14:15 +0100 schrieb zimoun:
> [...]
> Version is also Guix specific.  Sometimes, we patch; for security
> reasons, for fixing a bug, for quickly backporting something, for
> removing non-free bits, for unbundling stuff, for making work with
> the rest of Guix packages or for whatever other reasons – or we apply
> some options for building specifically for Guix.  Then, the version
> “1.2.3” is not always changed and therefore it does not necessary
> correspond to what upstream refers as “1.2.3”, or what Debian calls
> “1.2.3”, etc.
I think this is generally what you expect from a distro.  From my
personal experience with Debian-based distros and Gentoo, they do
sometimes need to use revisions for when they update their own patches,
but we do have the upper hand here with `guix describe'.

> [...]
> 
> On a side note, I miss why using commit hash is an issue for ’git-
> fetch’ – despite the fact of content-address advantages – when it
> seems not for ’svn-fetch’ as in:
> 
> --8<---------------cut here---------------start------------->8---
>     (version "0.5.1")
>     (source
>      (origin
>        (method svn-fetch)
>        (uri (svn-reference
>              (url (string-append
>                    "https://code.call-cc.org/svn/chicken-eggs/"
>                    "release/5/srfi-1/tags/"
>                    version))
>              (revision 39055)
>              (user-name "anonymous")
>              (password "")))
>        (file-name (string-append "chicken-srfi-1" version "-checkout"))
>        (sha256
>         (base32
>          "02940zsjrmn7c34rnp1rllm2nahh9jvszlzrw8ak4pf31q09cmq1"))))
> --8<---------------cut here---------------end--------------->8---
> 
> or other example
> 
> --8<---------------cut here---------------start------------->8---
>   (let ((revision 505)
>         (release "1.09.01"))
>     (package
>       (name "fullswof-2d")
>       (version release)
>       (source (origin
>                (method svn-fetch)
>                (uri (svn-reference
>                      (url (string-append "https://subversion.renater.fr/"
>                                          "anonscm/svn/fullswof-2d/tags/"
>                                          "release-" version))
>                      (revision revision)))
>                (file-name (string-append "fullswof-2d-" version "-checkout"))
>                (sha256
>                 (base32
>                  "16v08dx7h7n4wyddzbwimazwyj74ynis12mpjfkay4243npy44b8"))))
> --8<---------------cut here---------------end--------------->8---
> 
> I let aside the readability point for git-fetch or any others since
> it is only habits or more precisely collective conventions and a bit
> of personal preferences. :-).
I don't think the SVN comparison here is fair.  For one, the examples
reference the tagged revision doubly -- once by revision, once by tag.
We don't have a way of doing that for git (currently).  Plus, an SVN
revision does have more intrinsic meaning than a git hash.

> When we speak about robustness and long-term, the issue is the field
> ’uri’.  Having something extrinsic, i.e., which does not depend on
> the content, as URL+tag or URL+revision or just URL leads to fragile
> fetching methods depending on the Moon phase.
> 
> What Disarchive is currently doing for url-fetch is somehow to index
> by integrity field, depending only on the content itself (sha256;
> usually not using nix-base32 format referred as ’base32’ in ’origin’
> but instead ’base16’ format, whatever).  In short and quickly said,
> Disarchive-DB does 2 things more or less, first it somehow maps from
> this integrity hash to swhid hash allowing to lookup in SWH archive
> and fetches the data, and second it stores metadata, indexed by
> integrity field, allowing to reassemble the content = data +
> metadata.
> 
> We were discussing to do this strategy for all the fetching methods.
> And potentially add more than swhid hash as content-address systems;
> somehow.
You're also missing the part in which it currently relies on a single
server to do all this, but there are plans to move it out to multiple
ones, i.e. adding fallbacks/redundancy to your fallback mechanism,
which for the record is a good idea to have.

> All the robustness now relies on the availability of the Disarchive
> service.  Based on this context, what I miss in all the discussion is
> that Git owns a built-in solution (commit hash) and the arguments for
> not using it appears to me weak considering the easy advantage it
> brings.
> 
> It is a difficult topic to know what information the ’uri’ field
> should contain for robust long-term; a topic with a lot of unknowns,
> although many solutions are around, they are a strong change of
> habits and changing my own habits is already hard, so a collective
> change is a big collective challenge. :-)
We're going back to Cantor's argument for raw commits.  I'm not opposed
to using commits as value of the commit field (let-bound commits
reflected in the version, that is), but let's not forget that this
robustness argument still presupposes that the (commit tag) binding is
the point of failure.  This probably holds to some degree for "npm-
something", but we also have a fair amount of e.g. GNOME-related
packages which we trust to have robust tags and the only reason we
don't use mirror://gnome to refer to them is because it's not in GNOME
mirrors (yet). 

> For instance, SWH promotes swhid instead of DOI for referencing the
> publications.  I am not sure it is really popular outside a small
> French subgroup. ;-)
Completely off-topic, but isn't part of the point of DOIs that you can
fetch the revised paper as well?  I can understand putting OpenData
behind an SWH ID rather than a DOI, but the paper itself?  Why?

> Somehow, find some rationale –readability, matching versions, etc.–
> and then find counter-measures of their flaws to keep extrinsic
> values –tag, revision, etc.– is, for what my opinion is worth, not
> the correct level or frame when thinking about robustness and long-
> term.
For what it's worth, I don't think content addressing everything
(particularly relying on a single service to do so) is robust in the
long term, it just introduces larger failure points.  The only robust
way of increasing robustness is to add more fallbacks and redundancies
(and actually use them).

Cheers


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: On raw strings in <origin> commit field
  2021-12-31 15:19                 ` Liliana Marie Prikler
@ 2021-12-31 17:21                   ` zimoun
  2021-12-31 20:52                     ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: zimoun @ 2021-12-31 17:21 UTC (permalink / raw)
  To: Liliana Marie Prikler, Ricardo Wurmus; +Cc: guix-devel

Hi,

On Fri, 31 Dec 2021 at 16:19, Liliana Marie Prikler <liliana.prikler@gmail.com> wrote:

> You're also missing the part in which it currently relies on a single
> server to do all this, but there are plans to move it out to multiple
> ones, i.e. adding fallbacks/redundancy to your fallback mechanism,
> which for the record is a good idea to have.

I do not see why you guess I am missing a part.  Anyway.

Redundancy adds one kind of robustness: resilience.  Obviously it helps.
For sure, I want that too because it is the straightforward, “easy”
and “quick” way to have robustness.  However this assumes all the
redundant nodes of the web of nets will be still up, at least enough to
have this…  robustness.  Me too, I hope Guix will be popular and all
redundancies still running when I will be old or dead.  But I will not
bet on that assumption.

What Timothy is doing with Preservation of Guix and a window of ~2years
shows that any web of nets is really fragile.  I do not see why the one
we are building around Guix will be different.

Instead of trying to have robustness by adding more and more, from my
point of view, it appears to me the occasion to rethink and try to have
robustness with less.

I agree with you that various fallbacks is one good direction to go.
SWH is one thing because it is currently well supported (by UNESCO for
instance).  But many others are also worth considering, maybe IPFS or
GNUnet.

>> It is a difficult topic to know what information the ’uri’ field
>> should contain for robust long-term; a topic with a lot of unknowns,
>> although many solutions are around, they are a strong change of
>> habits and changing my own habits is already hard, so a collective
>> change is a big collective challenge. :-)
>
> We're going back to Cantor's argument for raw commits.  I'm not opposed
> to using commits as value of the commit field (let-bound commits
> reflected in the version, that is), but let's not forget that this
> robustness argument still presupposes that the (commit tag) binding is
> the point of failure.  This probably holds to some degree for "npm-
> something", but we also have a fair amount of e.g. GNOME-related
> packages which we trust to have robust tags and the only reason we
> don't use mirror://gnome to refer to them is because it's not in GNOME
> mirrors (yet). 

Because this point of failure for tags potentially exists, the
counter-measure would be to add more (check integrity, fall back to
other servers, etc.), and even that could be impossible if the tag
changed and the change propagated to all of them.

I am not saying either that we have to replace all the tags by commit
hashes tomorrow.  My point is just that this tag in the ’uri’ field
does not appear to me to be a correct design.  For sure, I agree it is
convenient but I think it is not The Right Thing.  Sadly, I do not know
what The Right Thing is – and commit hash is probably not The Right
Thing but it seems to me a direction to explore.

>> For instance, SWH promotes swhid instead of DOI for referencing the
>> publications.  I am not sure it is really popular outside a small
>> French subgroup. ;-)
>
> Completely off-topic, but isn't part of the point of DOIs that you can
> fetch the revised paper as well?  I can understand putting OpenData
> behind an SWH ID rather than a DOI, but the paper itself?  Why?

If you find it off-topic, fine.  My point is to say that DOI (extrinsic)
is known not to be The Right Thing for referencing and that an intrinsic
identifier is really better, but it seems hard to convince people to
switch.

For instance, DOI is known to be fragile because it relies on an
external centralized mutable index to have the bijection between the
identifier and the content.  If today I cite doi:123abc then tomorrow
when you reach this very same identifier doi:123abc, then you have no
guarantee that it is the same content.  Obviously, it is not an issue by
itself, but in scientific context where fraud is something, once the
centralized mutable index is corrupted, done!

Because SWH-ID only depends on the content itself, it allows
decentralization and integrity check.

Do not get me wrong, I am not comparing Git SHA-1 hash with an
integrity check. :-)  Well, maybe the interested reader can have a look
at:

<https://www.softwareheritage.org/2020/07/09/intrinsic-vs-extrinsic-identifiers/>

All in all, I was trying to point that this extrinsic vs intrinsic thing
is bigger than ’git-fetch’ and commit hash vs tag and the root appears
to me in exploring what the ’uri’ field should contain.  This DOI was an
example to show the topic is not easy.

>> Somehow, find some rationale –readability, matching versions, etc.–
>> and then find counter-measures of their flaws to keep extrinsic
>> values –tag, revision, etc.– is, for what my opinion is worth, not
>> the correct level or frame when thinking about robustness and long-
>> term.
>
> For what it's worth, I don't think content addressing everything
> (particularly relying on a single service to do so) is robust in the
> long term, it just introduces larger failure points.  The only robust
> way of increasing robustness is to add more fallbacks and redundancies
> (and actually use them).

We disagree; especially on “only robust way” and “add more”.  And from
my side, now I exposed all, I guess. ;-)

Cheers,
simon

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: On raw strings in <origin> commit field
  2021-12-28 20:55 On raw strings in <origin> commit field Liliana Marie Prikler
  2021-12-29  8:39 ` zimoun
  2021-12-30  1:13 ` Mark H Weaver
@ 2021-12-31 17:56 ` Vagrant Cascadian
  2022-01-03 15:51   ` Ludovic Courtès
  2 siblings, 1 reply; 63+ messages in thread
From: Vagrant Cascadian @ 2021-12-31 17:56 UTC (permalink / raw)
  To: Liliana Marie Prikler, guix-devel

[-- Attachment #1: Type: text/plain, Size: 1454 bytes --]

On 2021-12-28, Liliana Marie Prikler wrote:
> Consider a package being added or updated in Guix.  At the time of
> commit, we have the tag v1.2.3 pointing towards commit deadbeef.  We
> therefore create a guix package with version "1.2.3" pointing to said
> commit (either directly or indirectly).  At this point, one of the
> following holds:
>   (1) Guix "1.2.3" -> upstream "v1.2.3" -> upstream "deadbeef"
>   (2) Guix "1.2.3" -> upstream "deadbeef" <- upstream "v1.2.3"
> From either, we can follow that Guix "1.2.3" = upstream "v1.2.3".  If
> upstream keeps their tags around, then both forms are equivalent, but
> (1) is more convenient; it allows us to derive commit from version,
> which is often done through an affine mapping.

How about using the output of git describe, which can unambiguously
include the most relevant tag, the number of commits since that tag, and
the commit hash:

  $ git describe --long --abbrev=41
  v1.3.0-13278-g60661adfb8ffa28e1acfcfea27c6cc2fc70f88fe

  $ git describe --long --abbrev=41 v1.3.0
  v1.3.0-0-ga0178d34f582b50e9bdbb0403943129ae5b560ff

I *think* I've used such git references in the commit field of packages
before, and guix seemed fine with it. Occasionally, I've seen git
describe pick an odd tag to base on. Not sure how it interacts with
software heritage, or multiple tags, or renamed tags... but in theory it
could work, and would allow us to detect tag changes "upstream".


live well,
  vagrant

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: On raw strings in <origin> commit field
  2021-12-31 17:21                   ` zimoun
@ 2021-12-31 20:52                     ` Liliana Marie Prikler
  0 siblings, 0 replies; 63+ messages in thread
From: Liliana Marie Prikler @ 2021-12-31 20:52 UTC (permalink / raw)
  To: zimoun, Ricardo Wurmus; +Cc: guix-devel

Hi,

Am Freitag, dem 31.12.2021 um 18:21 +0100 schrieb zimoun:
> Redundancy adds one kind of robustness: resilience.  [...]  However
> this assumes all the redundant nodes of the web of nets will be still
> up, at least enough to have this…  robustness.  Me too, I hope Guix
> will be popular and all redundancies still running when I will be old
> or dead.  But I will not bet on that assumption.
I think we can live with one or two redundant nodes dying over time;
the great thing about redundancy is that things will still work out
fine if a sufficient number of them (typically one) is still around at
the time of query.  So it'd be robust enough to actually work for, let's
say, 10 years, but I have no illusions that our time machine will ever be
able to go back a lifetime (which is also a reason why I don't think
using commit hashes everywhere will magically result in the robustness
that appears to be desired here).

> What Timothy is doing with Preservation of Guix and a window of
> ~2years shows that any web of nets is really fragile.  I do not see
> why the one we are building around Guix will be different.
> 
> Instead of trying to have robustness by adding more and more, from my
> point of view, it appears to me the occasion to rethink and try to
> have robustness with less.
> 
> I agree with you that various fallbacks is one good direction to go.
> SWH is one thing because it is currently well supported (by UNESCO
> for instance).  But many others are also worth.  Maybe IPFS or GNUnet
> are worth.
Why not both?  Or all three, because for what it's worth SWH will also
be around for some while.  We would still need federated Disarchive
instances to match origins to SWH IDs, IPFS files and whatever GNUnet
has.

> > > It is a difficult topic to know what information the ’uri’ field
> > > should contain for robust long-term; a topic with a lot of
> > > unknowns, although many solutions are around, they are a strong
> > > change of habits and changing my own habits is already hard, so a
> > > collective change is a big collective challenge. :-)
> > We're going back to Cantor's argument for raw commits.  I'm not
> > opposed to using commits as value of the commit field (let-bound
> > commits reflected in the version, that is), but let's not forget
> > that this robustness argument still presupposes that the (commit
> > tag) binding is the point of failure.  This probably holds to some
> > degree for "npm-something", but we also have a fair amount of e.g.
> > GNOME-related packages which we trust to have robust tags and the
> > only reason we don't use mirror://gnome to refer to them is because
> > it's not in GNOME mirrors (yet). 
> 
> Because this point of failure for tag potentially exists, the
> counter-measure would be to add more (check integrity, fallback to
> other servers, etc.) and even it could be impossible if the tag
> changed and propagated to all.
> 
> I am not saying neither that we have to replace tomorrow all the tags
> by commit hashes.  My point is just that this tag in the ’uri’ field
> does not appears to me a correct design.  For sure, I agree it is
> convenient but I think it is not The Right Thing.  Sadly, I do not
> know what The Right Thing is – and commit hash is probably not The
> Right Thing but it seems to me a direction to explore.
I don't think there's a single Right Thing to be had here.

> > > For instance, SWH promotes swhid instead of DOI for referencing
> > > the publications.  I am not sure it is really popular outside a
> > > small French subgroup. ;-)
> > 
> > Completely off-topic, but isn't part of the point of DOIs that you
> > can fetch the revised paper as well?  I can understand putting
> > OpenData behind an SWH ID rather than a DOI, but the paper itself? 
> > Why?
> 
> If you find it off-topic, fine.  My point is to say that DOI
> (extrinsic) is not known to not be The Right Thing for referencing
> and intrinsic identifier is really better but it seems hard to
> convince people to switch.
> 
> For instance, DOI is known to be fragile because it relies on an
> external centralized mutable index to have the bijection between the
> identifier and the content.  If today I cite doi:123abc then tomorrow
> when you reach this very same identifier doi:123abc, then you have no
> guarantee that it is the same content.  Obviously, it is not an issue
> by itself, but in scientific context where fraud is something, once
> the centralized mutable index is corrupted, done!
I'm not sure to what extent there's a central index on all DOIs.  As
far as I can see most things are actually handled by DOI registration
agencies, which of course one could possibly corrupt in much the same
manner.

But you don't just cite a DOI, typically.  You also have all that
analog stuff like author, title, publisher, etc.  Assuming the
publisher (or an archive of their publications) still exists, you can
use that to cross-check.

> Because SWH-ID only depends on the content itself, it allows
> decentralization and integrity check.
> 
> Do not take me wrong, I am not comparing Git SHA-1 hash with an
> integrity check. :-)  Well, maybe the interested reader can give a
> look at:
> 
> <
> https://www.softwareheritage.org/2020/07/09/intrinsic-vs-extrinsic-identifiers/
> >
> 
> All in all, I was trying to point that this extrinsic vs intrinsic
> thing is bigger than ’git-fetch’ and commit hash vs tag and the root
> appears to me in exploring what the ’uri’ field should contain.  This
> DOI was an example to show the topic is not easy.
Point taken, "it's not easy" is something we can all easily agree on :)

But the larger issue with DOIs vs SWH IDs is that I typically don't
need to refer to other papers by exact content, which those intrinsic
tagging mechanisms rely on.  If I quote a book from 2015 and you read
the 2025 edition, chances are that the main body is still the same,
with perhaps one or two typos fixed and a new foreword.  For future
academics, it might also be interesting to know whether what I claimed
back in 2022 still holds then or if it has since been superseded.

For historians, it might instead be valuable to periodically check
whether the content behind the DOI changes and, if so, archive a
new snapshot (similar to what archive.org, SWH et al. do).  Then, if
the DOI gets lost or some evil company or government tries to bring out
a censored version of my paper or the paper I'm citing, you can browse
the archive to check what's behind all those sections that have been
painted black.

Note that the archive must be queryable in much the same manner as
you'd type a query into a normal search engine.  If it only
relied on content tagging, the evil agency could just simply hand you a
broken ID or even one that refers to a maliciously crafted page of
theirs.  Assuming they let you track down my paper in the first place.

TL;DR (even though you should read the full thing anyway): Despite what
archives specializing in intrinsic identifiers might tell
you, they are not a panacea.  I could go even further off-topic and
show that NaCl is a social construct, but I'd rather stop here.

Cheers



* Re: On raw strings in <origin> commit field
  2021-12-31  0:02       ` Liliana Marie Prikler
  2021-12-31  1:23         ` zimoun
@ 2021-12-31 23:36         ` Mark H Weaver
  2022-01-01  1:33           ` Liliana Marie Prikler
  2021-12-31 23:56         ` Mark H Weaver
  2 siblings, 1 reply; 63+ messages in thread
From: Mark H Weaver @ 2021-12-31 23:36 UTC (permalink / raw)
  To: Liliana Marie Prikler, zimoun, guix-devel

Hi Liliana,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
> In my personal opinion, the version+raw commit style can be discredited
> using Cantor's diagonal argument.

You've mentioned Cantor's diagonalization argument at least twice in
this thread so far, but although I'm familiar with that kind of argument
for showing that certain sets are uncountable, I don't understand how it
applies here.  Can you please elaborate?

      Thanks,
        Mark

-- 
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>.



* Re: On raw strings in <origin> commit field
  2021-12-31  0:02       ` Liliana Marie Prikler
  2021-12-31  1:23         ` zimoun
  2021-12-31 23:36         ` Mark H Weaver
@ 2021-12-31 23:56         ` Mark H Weaver
  2022-01-01  0:15           ` Liliana Marie Prikler
  2 siblings, 1 reply; 63+ messages in thread
From: Mark H Weaver @ 2021-12-31 23:56 UTC (permalink / raw)
  To: Liliana Marie Prikler, zimoun, guix-devel

Hi Liliana,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

> Git commit hashes do not just depend on the content.  They also depend
> on how much effort you put into solving a proof of work challenge that
> won't ever earn you crypto coins [1].

My knowledge of git is admittedly not that strong, but my understanding
is that git commit hashes depend solely on the contents of the tree plus
the commit log and the commit history leading up to that point.  Am I
mistaken?

The "[1]" at the end of your sentence led me to expect a corresponding
footnote, but I wasn't able to easily find it.  Can you please provide a
clarifying reference, or at least elaborate on what you meant here?

       Thanks,
         Mark



* Re: On raw strings in <origin> commit field
  2021-12-31 23:56         ` Mark H Weaver
@ 2022-01-01  0:15           ` Liliana Marie Prikler
  0 siblings, 0 replies; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-01  0:15 UTC (permalink / raw)
  To: Mark H Weaver, zimoun, guix-devel

Am Freitag, dem 31.12.2021 um 18:56 -0500 schrieb Mark H Weaver:
> Hi Liliana,
> 
> Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
> 
> > Git commit hashes do not just depend on the content.  They also
> > depend on how much effort you put into solving a proof of work
> > challenge that won't ever earn you crypto coins [1].
> 
> My knowledge of git is admittedly not that strong, but my understanding
> is that git commit hashes depend solely on the contents of the tree
> plus the commit log and the commit history leading up to that point. 
> Am I mistaken?
> 
> The "[1]" at the end of your sentence led me to expect a corresponding
> footnote, but I wasn't able to easily find it.  Can you please provide
> a clarifying reference, or at least elaborate on what you meant here?
Ahh, my bad, I meant to put in

[1] https://github.com/tochev/git-vanity

at the end.  It's a tool to set the short hash to an arbitrarily chosen
value by manipulating the commit log (as you'd expect).
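
For illustration, here is a minimal Python sketch of the mechanism (not
git-vanity's actual code).  It relies on the fact that a Git object ID
is the SHA-1 of a "<kind> <size>" header, a NUL byte, and the object
body.  The author identity, timestamp and the "nonce:" trailer are made
up for the example:

```python
import hashlib


def git_object_id(kind, body):
    # Git hashes "<kind> <size>", a NUL byte, then the raw object body.
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_object(tree, message):
    # Identity and timestamp are made up for illustration.
    who = "A U Thor <author@example.org> 1640995200 +0000"
    text = f"tree {tree}\nauthor {who}\ncommitter {who}\n\n{message}\n"
    return text.encode()


def vanity_commit(tree, message, prefix, max_tries=100000):
    # Append a counter to the log message until the ID has the prefix.
    for nonce in range(max_tries):
        candidate = f"{message}\n\nnonce: {nonce}"
        commit_id = git_object_id("commit", commit_object(tree, candidate))
        if commit_id.startswith(prefix):
            return commit_id, candidate
    raise RuntimeError("no match within max_tries")


EMPTY_TREE = git_object_id("tree", b"")
first = git_object_id("commit", commit_object(EMPTY_TREE, "Release 1.2.3"))
second = git_object_id("commit", commit_object(EMPTY_TREE, "Release 1.2.3."))
vanity_id, vanity_message = vanity_commit(EMPTY_TREE, "Release 1.2.3", "00")
print(first != second, vanity_id)
```

The tree is identical in all three commits; only the log text differs,
and that alone is enough to change (or steer) the commit ID.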

Cheers



* Re: On raw strings in <origin> commit field
  2021-12-31 23:36         ` Mark H Weaver
@ 2022-01-01  1:33           ` Liliana Marie Prikler
  2022-01-01  5:00             ` Mark H Weaver
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-01  1:33 UTC (permalink / raw)
  To: Mark H Weaver, zimoun, guix-devel

Am Freitag, dem 31.12.2021 um 18:36 -0500 schrieb Mark H Weaver:
> Hi Liliana,
> 
> Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
> > In my personal opinion, the version+raw commit style can be
> > discredited using Cantor's diagonal argument.
> 
> You've mentioned Cantor's diagonalization argument at least twice in
> this thread so far, but although I'm familiar with that kind of
> argument for showing that certain sets are uncountable, I don't
> understand how it applies here.  Can you please elaborate?
Okay, so let's write out the full argument.  At a certain date, we
package or update P to version V through reference to tag T (at commit
C).  Because we can't trust T to remain valid, we only keep (V, C)
around for "robustness".

Now notice how version V is generated by referring to T.  Without loss
of generality, assume that T is invalidated, otherwise nothing to
prove.  Since V is created through reference to T, it is also
invalidated as being the canonical V, whichever it is.  A similar
argument can be made for C as well.  So both (V, C) are invalidated and
the only thing we can claim is "yeah, upstream did tag that T at some
point".

Let us now assume that T is never invalidated.  In this case (V, C)
remain robust for all observable time, but so would (V, T).  Hence
there is no robustness to be gained in this scenario.

Now what if we were to instead define V' := (B, N, C') with N being a
number to order the different Cs under B and C' being the first few
bytes of C.  Since V' clearly points to C, there is a clear link
established between the two even if T is lost at some point and we
coincidentally have B := clean(T) for some cleaning function clean.

Now obviously V' is exactly what git-version does and there are some
problems with it if we move back to the real world.  For one, I don't
think our updater would currently detect that upstream moved T to a
newer commit, whereas using tag for commit makes us notice breakages
loudly (too loudly as some argue, hence the move away from it). 
However, since I'm a "people first, machines second" girl, I am willing
to ignore this minor inconvenience and take the robustness if that's
the extent of the issues it brings.
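
As a rough illustration of V' = (B, N, C'), here is a Python sketch
that mimics the shape of the strings git-version produces (base
version, revision number, first seven characters of the commit).  The
function clean is a hypothetical stand-in for whatever maps a tag to a
base version; stripping a leading "v" is only one possible choice:

```python
def clean(tag):
    # Hypothetical cleaning function: drop a leading "v" from the tag.
    return tag[1:] if tag.startswith("v") else tag


def git_version(version, revision, commit):
    # Same shape as Guix's git-version: VERSION-REVISION.COMMIT-PREFIX.
    return f"{version}-{revision}.{commit[:7]}"


base = clean("v1.3.0")
print(git_version(base, "0", "a0178d34f582b50e9bdbb0403943129ae5b560ff"))
```

The resulting version string names the commit it was built from, so
the link between version and commit survives even if the tag is later
moved or deleted.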

To state something that probably hasn't gotten enough attention here,
my main problem is not that we are adding robustness by using commits
in the commit field more often, my problem is that we're using raw
commits when the version field would suggest we're using a tag.  One
could object that long versions would become unreadable; this is
largely a non-issue on the command line, but assuming that it is an
issue, I did provide other potential solutions.

So the main question here is: Do we really want raw strings in the
commit field?  Is there no better way of providing "robustness"?


* Re: On raw strings in <origin> commit field
  2021-12-31  3:15   ` Liliana Marie Prikler
  2021-12-31  7:57     ` Taylan Kammer
@ 2022-01-01  1:41     ` Mark H Weaver
  2022-01-01 11:12       ` Liliana Marie Prikler
  1 sibling, 1 reply; 63+ messages in thread
From: Mark H Weaver @ 2022-01-01  1:41 UTC (permalink / raw)
  To: Liliana Marie Prikler, guix-devel

Hi Liliana,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

> Am Mittwoch, dem 29.12.2021 um 20:13 -0500 schrieb Mark H Weaver:
[...]
>> The simple fact is that the way Ricardo wrote the 'guile-aiscm' package
>> is the right way to ensure that it can be reliably reproduced in the
>> future.
> And here I disagree.  This reasoning presupposes that we have to ensure
> that the package still points to the same commit if the tag changes,
> which itself presupposes that the tag does change.

I disagree with the last line above.  What makes you think that I'm
presupposing that the tag does change?

There's a difference between "presupposing that the tag does change" and
"not assuming that the tag will not change".  Do you see the difference?

> However, if we are
> always talking about more than one possible "1.2.3" (with the included
> future tag that we have yet to witness), we lose the basis by which we
> currently assign "1.2.3" as the version 

I see what you're getting at here, but still I disagree.  Our basis for
associating version "1.2.3" with commit XYZ is simply that upstream had
indicated that version "1.2.3" was commit XYZ.  That historical fact is
immutable.

If upstream later indicates that version "1.2.3" is now commit YYZ, I
don't think that invalidates our basis for continuing to associate
version "1.2.3" with commit XYZ.  The aforementioned immutable
historical fact still remains our basis and justification for making
that association.

Perhaps some people would prefer to use a distro where version "1.2.3"
of package FOO could mean a different thing tomorrow than it means
today.  Personally, that's not what I want.

If upstream changes their mind about the meaning of version "1.2.3", I
want that to correspond to a different version number in Guix, perhaps
"1.2.3a" or something, as Taylan suggested.  Incidentally, I vaguely
recall that we've done that in the past, but I don't know if we've done
it consistently.

> As pointed out elsewhere, SWH keeps a history of the tags that we could
> look up until one matches,

Only if SWH took a snapshot at the right time.  I would guess that
mutations of release tags usually happen within a few days after the
release tag is first created.  Relying on SWH to take a snapshot within
that possibly quite small time interval doesn't sound very robust to me.

> and there'd also be the option to keep a
> secondary index ourselves (or have a third party do it).

That's true, but then we'd be adding another piece of centralized
infrastructure that users would need to rely upon in order to reliably
reproduce their systems.  That infrastructure would have to be
maintained indefinitely.  If we failed to keep up maintenance, then
users could run into problems reproducing their older systems.

It seems to me clearly better to avoid relying on a piece of centralized
infrastructure if it can be easily avoided, no?

>> On the other hand, if we refer to git _commit hashes_, then it *is*
>> feasible for us to fetch the archived source from SWH, regardless of
>> what upstream has done to its tags in the meantime.
>> 
>> For that reason alone, I think that the way Ricardo wrote the
>> guile-aiscm package definition is clearly the right approach, given
>> Guix's longstanding goals.
> To me, it rather sounds like a workaround for longstanding bugs [1, 2].
[...]
> [1] https://issues.guix.gnu.org/28659
> [2] https://issues.guix.gnu.org/39575

I don't understand how it's a workaround for those bugs.  Even if those
bugs were fixed, we'd still need a reliable way to find the git commit
that matches the one expected by a git-fetch <origin> record, i.e. the
one that will produce a source checkout with the expected SHA256 hash.

Am I missing something?

>> > On the note of fallbacks, we do also have the issue that Guix fails
>> > on the first download that does not match the hash instead of e.g.
>> > continuing to SWH to fetch an archive of the old tag (as well as
>> > other fallback-related issues, also including the "Tricking Peer
>> > Review" thread).
>> 
>> That's a bug that can, and should, be fixed.  The existence of that
>> bug might temporarily prevent us from enjoying the benefits of
>> Ricardo's approach, but that's not an argument for adopting practices
>> that push us farther from our core goals.
>> 
>> What do you think?
> Which bug are you talking about?  "Tricking Peer Review" or the
> fallback thing?

I was talking about the fallback issue.

> If it's the fallback thing, then that's an enabler for
> Ricardo's approach, since as you pointed out the commit will still be
> fetched correctly from SWH (if not from the main repo itself). 

Right.

> Now, "Tricking Peer Review" is a harder thing to circumvent.  We would
> need to issue a warning, preferably a big one if fallbacks do kick in
> unintended, i.e. particularly outside of time-machine.

Regarding "Tricking Peer Review": I think it would be ideal for package
definitions to include both the git tag _and_ the git commit hash, and
to teach our linter to raise an alarm when the expected tags are missing
or fail to match the expected commit hash.
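
A minimal sketch of such a check, in Python rather than as an actual
Guix lint checker.  The function name check_tag and the shape of its
arguments are assumptions; the tag-to-commit mapping could for instance
be built from the output of `git ls-remote --tags`:

```python
def check_tag(expected_tag, expected_commit, upstream_tags):
    """Return None if the tag still matches, else a warning string.

    upstream_tags maps tag names to the commit each one currently
    points to upstream.
    """
    actual = upstream_tags.get(expected_tag)
    if actual is None:
        return f"tag {expected_tag} is missing upstream"
    if actual != expected_commit:
        return (f"tag {expected_tag} moved: expected {expected_commit}, "
                f"found {actual}")
    return None


# Placeholder values, not real commits.
print(check_tag("v1.2.3", "deadbeef", {"v1.2.3": "deadbeef"}))
print(check_tag("v1.2.3", "deadbeef", {"v1.2.3": "cafebabe"}))
print(check_tag("v1.2.3", "deadbeef", {}))
```

Since the package would record both values, the source could still be
fetched by commit hash while the tag comparison serves only as an
alarm.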

For similar reasons, it would also be good to include the fingerprints
of upstream PGP signing keys in our package definitions, and to teach
our linter to check those signatures and that they match the SHA256
hashes in our recipes.

What do you think?

      Regards,
        Mark



* Re: On raw strings in <origin> commit field
  2022-01-01  1:33           ` Liliana Marie Prikler
@ 2022-01-01  5:00             ` Mark H Weaver
  2022-01-01 10:33               ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: Mark H Weaver @ 2022-01-01  5:00 UTC (permalink / raw)
  To: Liliana Marie Prikler, zimoun, guix-devel

Hi Liliana,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

> Am Freitag, dem 31.12.2021 um 18:36 -0500 schrieb Mark H Weaver:
>> Hi Liliana,
>> 
>> Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
>> > In my personal opinion, the version+raw commit style can be
>> > discredited using Cantor's diagonal argument.
>> 
>> You've mentioned Cantor's diagonalization argument at least twice in
>> this thread so far, but although I'm familiar with that kind of
>> argument for showing that certain sets are uncountable, I don't
>> understand how it applies here.  Can you please elaborate?
> Okay, so let's write out the full argument.  At a certain date, we
> package or update P to version V through reference to tag T (at commit
> C).  Because we can't trust T to remain valid, we only keep (V, C)
> around for "robustness".
>
> Now notice, how version V is generated by referring to T.  Without loss
> of generality, assume that T is invalidated, otherwise nothing to
> prove.  Since V is created through reference to T, it is also
> invalidated as being the canonical V, whichever it is.  A similar
> argument can be made for C as well.  So both (V, C) are invalidated and
> the only thing we can claim is "yeah, upstream did tag that T at some
> point".
>
> Let us now assume, that T is never invalidated.  In this case (V, C)
> remain robust for all observable time, but so would (V, T).  Hence
> there is no robustness to be gained in this scenario.
>
> Now what if we were to instead define V' := (B, N, C') with N being a
> number to order the different Cs under B and C' being the first few
> bytes of C.  Since V' clearly points to C, there is a clear link
> established between the two even if T is lost at some point and we
> coincidentally have B := clean(T) for some cleaning function clean.
>
> Now obviously V' is exactly what git-version does and there are some
> problems with it if we move back to the real world.  For one, I don't
> think our updater would currently detect that upstream moved T to a
> newer commit, whereas using tag for commit makes us notice breakages
> loudly (too loudly as some argue, hence the move away from it). 
> However, since I'm a "people first, machines second" girl, I am willing
> to ignore this minor inconvenience and take the robustness if that's
> the extent of the issues it brings.
>
> To state something that probably hasn't gotten enough attention here,
> my main problem is not that we are adding robustness by using commits
> in the commit field more often, my problem is that we're using raw
> commits when the version field would suggest we're using a tag.  One
> could raise the issue that long versions would become unreadable and
> this is largely a non-issue on the command line, but assuming that it
> is, I did provide other potential solutions.
>
> So the main question here is: Do we really want raw strings in the
> commit field?  Is there no better way of providing "robustness"?

Where is the Cantor-style diagonalization argument that you spoke of?

      Regards,
        Mark




* Re: On raw strings in <origin> commit field
  2022-01-01  5:00             ` Mark H Weaver
@ 2022-01-01 10:33               ` Liliana Marie Prikler
  2022-01-01 20:37                 ` Mark H Weaver
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-01 10:33 UTC (permalink / raw)
  To: Mark H Weaver, zimoun, guix-devel

> Where is the Cantor-style diagonalization argument that you spoke of?
You skipped over it, read again.  The key point is that you're
referencing the thing you think will be invalidated to create your
scheme.



* Re: On raw strings in <origin> commit field
  2022-01-01  1:41     ` Mark H Weaver
@ 2022-01-01 11:12       ` Liliana Marie Prikler
  2022-01-01 17:45         ` Timothy Sample
                           ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-01 11:12 UTC (permalink / raw)
  To: Mark H Weaver, guix-devel

Am Freitag, dem 31.12.2021 um 20:41 -0500 schrieb Mark H Weaver:
> I disagree with the last line above.  What makes you think that I'm
> presupposing that the tag does change?
> 
> There's a difference between "presupposing that the tag does change"
> and "not assuming that the tag will not change".  Do you see the
> difference?
I'm pretty sure ¬assume(¬X) = assume(¬¬X) in this context.  You have to
start with some assumptions and while ideally we'd like to encode "I
don't care", we do not have a system that allows us to do so.

> > However, if we are always talking about more than one possible
> > "1.2.3" (with the included future tag that we have yet to witness),
> > we lose the basis by which we currently assign "1.2.3" as the
> > version 
> 
> I see what you're getting at here, but still I disagree.  Our basis
> for associating version "1.2.3" with commit XYZ is simply that
> upstream had indicated that version "1.2.3" was commit XYZ.  That
> historical fact is immutable.
History is a social construct, it's not immutable.

> If upstream later indicates that version "1.2.3" is now commit YYZ, I
> don't think that invalidates our basis for continuing to associate
> version "1.2.3" with commit XYZ.  The aforementioned immutable
> historical fact still remains our basis and justification for making
> that association.
I'm pretty sure it does, particularly to a future observer who may not
have the luxury of a history to distinguish that record from one in
which a malicious committer linked those versions and tag together and
then no one bothered to check.

> Perhaps some people would prefer to use a distro where version
> "1.2.3" of package FOO could mean a different thing tomorrow than it
> means today.  Personally, that's not what I want.
> 
> If upstream changes their mind about the meaning of version "1.2.3",
> I want that to correspond to a different version number in Guix,
> perhaps "1.2.3a" or something, as Taylan suggested.  Incidentally, I
> vaguely recall that we've done that in the past, but I don't know if
> we've done it consistently.
The entire point here is to use git-version in combination with let-
bound commit hashes, yes.

> > As pointed out elsewhere, SWH keeps a history of the tags that we
> > could look up until one matches,
> 
> Only if SWH took a snapshot at the right time.  I would guess that
> mutations of release tags usually happen within a few days after the
> release tag is first created.  Relying on SWH to take a snapshot
> within that possibly quite small time interval doesn't sound very
> robust to me.
If the scenario is "a few days within release" vs "literally forever",
there would for one only be a relatively short range of bad Guix
versions having a broken reference (limiting impact) and for another,
the majority of the audience would also associate the latter with said
version.  There's really no good argument from the robustness side to
be had here.

> > and there'd also be the option to keep a secondary index ourselves
> > (or have a third party do it).
> 
> That's true, but then we'd be adding another piece of centralized
> infrastructure that users would need to rely upon in order to
> reliably reproduce their systems.  That infrastructure would have to
> be maintained indefinitely.  If we failed to keep up maintenance, the
> users could run into problems reproducing their older systems.
> 
> It seems to me clearly better to avoid relying on a piece of
> centralized infrastructure if it can be easily avoided, no?
We can make that a key-value store for which you write a distributed
MapReduce function in Erlang if it makes you happier.

> > > On the other hand, if we refer to git _commit hashes_, then it
> > > *is* feasible for us to fetch the archived source from SWH,
> > > regardless of what upstream has done to its tags in the meantime.
> > > 
> > > For that reason alone, I think that the way Ricardo wrote the
> > > guile-aiscm package definition is clearly the right approach,
> > > given Guix's longstanding goals.
> > To me, it rather sounds like a workaround for longstanding bugs [1,
> > 2].
> [...]
> > [1] https://issues.guix.gnu.org/28659
> > [2] https://issues.guix.gnu.org/39575
> 
> I don't understand how it's a workaround for those bugs.  Even if
> those bugs were fixed, we'd still need a reliable way to find the git
> commit that matches the one expected by a git-fetch <origin> record,
> i.e. the one that will produce a source checkout with the expected
> SHA256 hash.
> 
> Am I missing something?
We are working on the base assumption here that we have an (array of)
reachable fallbacks in any case; I don't think it's too big of a leap
to assume that we can keep a mapping
  (origin-file-name x origin-hash) → canonicalized-uri
around either as part of said fallbacks or in parallel.
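
In its simplest form such a mapping is a plain key-value lookup.  A
Python sketch, where the index contents, the URI format and the
function name canonical_uri are all placeholders for illustration:

```python
# Hypothetical index: (origin file name, content hash) -> canonical URI.
# The single entry is a placeholder, not a real Guix origin.
ORIGIN_INDEX = {
    ("foo-1.2.3-checkout", "sha256:placeholder"):
        "https://example.org/foo.git#deadbeef",
}


def canonical_uri(file_name, content_hash, index=ORIGIN_INDEX):
    # Return the recorded URI, or None if this origin was never indexed.
    return index.get((file_name, content_hash))


print(canonical_uri("foo-1.2.3-checkout", "sha256:placeholder"))
print(canonical_uri("foo-1.2.3-checkout", "sha256:something-else"))
```

Keying on the content hash means a moved tag does not change the
lookup result; only a different checkout does.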

> Regarding "Tricking Peer Review": I think it would be ideal for
> package definitions to include both the git tag _and_ the git commit
> hash, and to teach our linter to raise an alarm when the expected
> tags are missing or fail to match the expected commit hash.
That is among the solutions I've proposed here, so naturally I'd be
fine with it.

> For similar reasons, it would also be good to include the
> fingerprints of upstream PGP signing keys in our package definitions,
> and to teach our linter to check those signatures and that they match
> the SHA256 hashes in our recipes.
> 
> What do you think?
I think that is one of the main things we could import over from Guix
for Racket users (previously Xiden, currently denxi).  I.e. we could
have 

  (origin
    ...
    (sha256 some-hash)
    (sha512 some-other-hash)
    (pgp-signature sig)
    [other validation forms...]
    [patches and snippet])

We would have to break record ABI for that, but imo with field
sanitizers that's something we could code up.  If at some time in the
future SHA-2 is broken, we can then still rely on the robustness that
breaking all of these hashes would be difficult and perhaps not worth
it for GNU Hello.

Now obviously, there is a performance tradeoff here.  You don't want to
only check signatures all the time for a relatively minor build
(particularly with Rust where the build phase is literally copy-paste
for 90% of the packages).  So we'd have to add a configuration option
on the sliding scale between "only check the weakest" over "only check
the strongest" over "check at least N at random or all of them" to
"check everything always".

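The sliding scale could be sketched roughly as follows (Python, purely
illustrative).  The policy names, the function select_checks and the
assumption that the checks are listed from weakest to strongest are all
made up for this example:

```python
import random


def select_checks(checks, policy, n=1, rng=random):
    """Pick which validation forms to run.

    checks is assumed to be ordered from weakest to strongest.
    """
    if policy == "weakest":
        return checks[:1]
    if policy == "strongest":
        return checks[-1:]
    if policy == "sample":
        # N checks at random, or all of them if fewer than N exist.
        return rng.sample(checks, min(n, len(checks)))
    if policy == "all":
        return list(checks)
    raise ValueError(f"unknown policy: {policy}")


forms = ["sha256", "sha512", "pgp-signature"]
print(select_checks(forms, "weakest"))
print(select_checks(forms, "sample", n=2))
```

Which form counts as "weakest" or "strongest" is itself a judgment
call that the configuration would have to encode.
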
Cheers



* Re: On raw strings in <origin> commit field
  2022-01-01 11:12       ` Liliana Marie Prikler
@ 2022-01-01 17:45         ` Timothy Sample
  2022-01-01 19:52           ` Liliana Marie Prikler
  2022-01-03 15:46           ` Ludovic Courtès
  2022-01-01 20:19         ` Mark H Weaver
  2022-01-02  2:07         ` Bengt Richter
  2 siblings, 2 replies; 63+ messages in thread
From: Timothy Sample @ 2022-01-01 17:45 UTC (permalink / raw)
  To: Liliana Marie Prikler; +Cc: guix-devel

Hi all,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

> Am Freitag, dem 31.12.2021 um 20:41 -0500 schrieb Mark H Weaver:
>
>> If upstream later indicates that version "1.2.3" is now commit YYZ, I
>> don't think that invalidates our basis for continuing to associate
>> version "1.2.3" with commit XYZ.  The aforementioned immutable
>> historical fact still remains our basis and justification for making
>> that association.
>
> I'm pretty sure it does, particularly to a future observer who may not
> have the luxury of a history to distinguish that record from one in
> which a malicious committer linked those versions and tag together and
> then no one bothered to check.

If you want a concrete example to think through, there’s ‘eclib’.  Our
package says it’s version “20190909”, but that’s not what upstream calls
version “20190909”.  It looks like when we packaged ‘eclib’, that tag
pointed to commit 19e7e3e74268bf78bd9a1c4ba07597d5434fb166, but now it
points to bfbbd7c414521e1bf5e718a2925ea8ad845a2e87.

If you try to build ‘eclib’, everything will work great, since we can
grab the checkout from our servers.  If you use

    $ guix build --check -S eclib

you get a hash mismatch.  We have CI jobs for sources, but they aren’t
checking this: <https://ci.guix.gnu.org/build/319/details>.  That job
succeeds after downloading the checkout from our servers.

There are two things I can highlight from this case.

First, as expected, finding the original commit was painful.  SWH did
not record the old version of the tag.  Comparing it with the checkout
from our servers showed that the differences were very minor.  With that
in mind, I moved backwards through the commit history with ‘guix hash’
until I found a match.  As pointed out many times, if I had the original
commit, I could just ask SWH for it directly.

Second, these cases are very, very rare.  (I’ve essentially checked
every Git origin since Guix version 1.0.0, and this problem is not one
that worries me).  “Tricking Peer Review”-style problems seem to be much
more prevalent.  When tracking down a “difficult” Git origin, the first
thing I do is grep the Guix Git history for an “oops I committed the
wrong hash” message.  I recommend we focus our energies there before
worrying too much about replacing tags with commits or using both or
whatever.
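
The history walk described under “First” can be sketched roughly as
follows.  This is only an illustration: ‘find_commit_by_hash’ is a
made-up name, not an existing tool, and it assumes a scratch clone of
the upstream repository (the loop leaves it on a detached HEAD) and an
expected hash written in the format that ‘guix hash’ prints.

```sh
# find_commit_by_hash EXPECTED [START]
# Hypothetical helper: walk history backwards from START (default HEAD)
# and print the first commit whose checkout hashes to EXPECTED.
# Run it in a scratch clone only; it checks out each commit in turn.
find_commit_by_hash() {
    expected=$1
    start=${2:-HEAD}
    for commit in $(git rev-list "$start"); do
        git checkout --quiet --detach "$commit" || return 2
        # -r hashes the directory recursively, -x skips the .git directory.
        if [ "$(guix hash -rx .)" = "$expected" ]; then
            printf '%s\n' "$commit"
            return 0
        fi
    done
    return 1
}
```

The function prints the matching commit and returns 0, or returns 1 if
no commit reachable from START produces the expected hash.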

>> Regarding "Tricking Peer Review": I think it would be ideal for
>> package definitions to include both the git tag _and_ the git commit
>> hash, and to teach our linter to raise an alarm when the expected
>> tags are missing or fail to match the expected commit hash.
>
> That is among the solutions I've proposed here, so naturally I'd be
> fine with it.

Given what I wrote above, maybe we could start by updating the linter so
that ‘check-source’ actually checks that it gets the right result.
Right now it uses a few heuristics to check that the result looks okay
(for instance, it checks if the result is suspiciously small).  Maybe it
should just go through the whole download process and verify the hash?
Alternatively (or additionally), the CI “source” specification could be
configured to avoid using our servers as a fallback when checking
sources.

I agree that adding more identifiers (commit hashes or whatever) makes
things more robust, but the cost is more work when creating, updating,
and reviewing packages.  I think we should start by verifying the
identifiers we already have (i.e., checking that the URI and method of
the origin produce the right output).  It would solve many existing
problems and would serve as a nice foundation for future improvements.

And as a bonus, if you want to be really kind to future time travellers,
when fixing an errant hash, please include a nice hint as to what the
original hash was for (like a commit hash).  We have commit
ca5a791f6285b08506ccd662d5911ccf0c4d1ece in our repo, which says:

> The previous hash was from the "dev" branch of the repository.

I can’t find the source for the previous hash, and if I could actually
travel through time, I would change the commit message to:

> The previous hash was from commit abcd0123..., which comes from the
> "dev" branch of the repository.

:)

-- Tim


* Re: On raw strings in <origin> commit field
  2022-01-01 17:45         ` Timothy Sample
@ 2022-01-01 19:52           ` Liliana Marie Prikler
  2022-01-02 23:00             ` Timothy Sample
  2022-01-03 15:46           ` Ludovic Courtès
  1 sibling, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-01 19:52 UTC (permalink / raw)
  To: Timothy Sample; +Cc: guix-devel

Hi Timothy,

Am Samstag, dem 01.01.2022 um 12:45 -0500 schrieb Timothy Sample:
> If you want a concrete example to think through, there’s ‘eclib’. 
> Our package says it’s version “20190909”, but that’s not what
> upstream calls version “20190909”.  It looks like when we packaged
> ‘eclib’, that tag pointed to commit
> 19e7e3e74268bf78bd9a1c4ba07597d5434fb166, but now
> it points to bfbbd7c414521e1bf5e718a2925ea8ad845a2e87.
> 
> If you try to build ‘eclib’, everything will work great, since we can
> grab the checkout from our servers.  If you use
> 
>     $ guix build --check -S eclib
> 
> you get a hash mismatch.  We have CI jobs for sources, but they
> aren’t checking this: <https://ci.guix.gnu.org/build/319/details>. 
> That job succeeds after downloading the checkout from our servers.
With the robustness framework that is talked about here, this is only
partially robust.  If you have substitutes disabled, then a normal
`guix build -S eclib' also fails, and if CI eventually garbage collects
the source, the same happens for everyone.

If we simply hardcoded the hash on the other hand, none of that would
happen at all, you couldn't even use `guix build --check -S' as an
oracle.

> There are two things I can highlight from this case.
> 
> First, as expected, finding the original commit was painful.  SWH did
> not record the old version of the tag.  Comparing it with the
> checkout from our servers showed that the differences were very
> minor.  With that in mind, I moved backwards through the commit
> history with ‘guix hash’ until I found a match.  As pointed out many
> times, if I had the original commit, I could just ask SWH for it
> directly.
> 
> Second, these cases are very, very rare.  (I’ve essentially checked
> every Git origin since Guix version 1.0.0, and this problem is not
> one that worries me).  “Tricking Peer Review”-style problems seem to
> be much more prevalent.  When tracking down a “difficult” Git origin,
> the first thing I do is grep the Guix Git history for an “oops I
> committed the wrong hash” message.  I recommend we focus our energies
> there before worrying too much about replacing tags with commits or
> using both or whatever.
Since you are our expert on preservation, would you mind if I ask you
for some estimates on how painful it is to track down such commits in
general, if it could be made easier were you to record tag → commit
(alternatively file-name x sha256 → SWHID) maps periodically (or if you
already have such a map and those arise while creating it), and how
many “Tricking Peer Review”-style problems you think are currently
around?
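
A rough sketch of what recording such a tag → commit map could look
like, run periodically per upstream.  `snapshot_tags' and the output
layout are invented for illustration; the only real tool it relies on
is `git ls-remote'.

```sh
# snapshot_tags URL
# Hypothetical helper: print one "DATE URL TAG COMMIT" line per upstream tag.
# For annotated tags, ls-remote also lists a peeled "TAG^{}" entry holding
# the commit; that entry takes precedence over the tag object's own hash.
snapshot_tags() {
    url=$1
    today=$(date -u +%Y-%m-%d)
    git ls-remote --tags "$url" |
        awk -v url="$url" -v today="$today" '
            {
                ref = $2
                sub(/^refs\/tags\//, "", ref)
                if (substr(ref, length(ref) - 2) == "^{}") {
                    ref = substr(ref, 1, length(ref) - 3)
                    peeled[ref] = 1
                    commit[ref] = $1
                } else if (!(ref in peeled)) {
                    commit[ref] = $1
                }
            }
            END {
                for (ref in commit)
                    print today, url, ref, commit[ref]
            }' |
        sort
}
```

Appending the output to a log on each run would give a history in
which a moved tag shows up as two lines with the same tag but
different commits.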

> > > Regarding "Tricking Peer Review": I think it would be ideal for
> > > package definitions to include both the git tag _and_ the git
> > > commit hash, and to teach our linter to raise an alarm when the
> > > expected tags are missing or fail to match the expected commit
> > > hash.
> > 
> > That is among the solutions I've proposed here, so naturally I'd be
> > fine with it.
> 
> Given what I wrote above, maybe we could start by updating the linter
> so that ‘check-source’ actually checks that it gets the right result.
> Right now it uses a few heuristics to check that the result looks
> okay (for instance, it checks if the result is suspiciously small). 
> Maybe it should just go through the whole download process and verify
> the hash?  Alternatively (or additionally), the CI “source”
> specification could be configured to avoid using our servers as a
> fallback when checking sources.
I think substitutes should be disabled for the source download of a
"check-source".  Even if a substitute or SWH fallback exists, that's
not what we want to check here, no?
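
On the command line, a first approximation might look like the sketch
below.  `check_sources' is a made-up helper; `--no-substitutes',
`--check' and `-S' are existing `guix build' options.  Note the
assumption that this only keeps substitutes out of the picture: any
fallback built into the download itself would still apply.

```sh
# check_sources PACKAGE...
# Hypothetical helper: re-fetch each package's source without substitutes
# and report the packages for which the check fails, e.g. because of a
# hash mismatch (or because the source was never in the store to begin
# with).
check_sources() {
    status=0
    for pkg in "$@"; do
        if ! guix build --no-substitutes --check -S "$pkg" >/dev/null 2>&1; then
            printf 'source check failed: %s\n' "$pkg"
            status=1
        fi
    done
    return $status
}
```

Running `check_sources eclib' should then report the package if the
re-download no longer matches the recorded hash.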

> I agree that adding more identifiers (commit hashes or whatever)
> makes things more robust, but the cost is more work when creating,
> updating, and reviewing packages.  I think we should start by
> verifying the identifiers we already have (i.e., checking that the
> URI and method of the origin produce the right output).  It would
> solve many existing problems and would serve as a nice foundation for
> future improvements.
Is this something we can reasonably expect our current CI or CI in
general to handle (assuming we tweaked the linter to behave as you
intend)?  Or would it make more sense to implement this as a
weekly/monthly cronjob?

> And as a bonus, if you want to be really kind to future time
> travellers, when fixing an errant hash, please include a nice hint as
> to what the original hash was for (like a commit hash).  We have
> commit ca5a791f6285b08506ccd662d5911ccf0c4d1ece in our repo, which
> says:
> 
> > The previous hash was from the "dev" branch of the repository.
> 
> I can’t find the source for the previous hash, and if I could
> actually travel through time, I would change the commit message to:
> 
> > The previous hash was from commit abcd0123..., which comes from the
> > "dev" branch of the repository.
+1 from me for useful commit messages.




* Re: On raw strings in <origin> commit field
  2022-01-01 11:12       ` Liliana Marie Prikler
  2022-01-01 17:45         ` Timothy Sample
@ 2022-01-01 20:19         ` Mark H Weaver
  2022-01-01 23:20           ` Liliana Marie Prikler
  2022-01-02  2:07         ` Bengt Richter
  2 siblings, 1 reply; 63+ messages in thread
From: Mark H Weaver @ 2022-01-01 20:19 UTC (permalink / raw)
  To: Liliana Marie Prikler, guix-devel

Hi Liliana,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

> Am Freitag, dem 31.12.2021 um 20:41 -0500 schrieb Mark H Weaver:
>> I disagree with the last line above.  What makes you think that I'm
>> presupposing that the tag does change?
>> 
>> There's a difference between "presupposing that the tag does change"
>> and "not assuming that the tag will not change".  Do you see the
>> difference?
> I'm pretty sure ¬assume(¬X) = assume(¬¬X) in this concept.

No, that's certainly false.  On the left-hand side of that equation
there is an absence of any assumptions, and on the right-hand side there
is the assumption that ¬¬X is true.

Perhaps something is getting lost in translation between our languages.

> You have to
> start with some assumptions and while ideally we'd like to encode "I
> don't care", we do not have a system that allows us to do so.
>
>> > However, if we are always talking about more than one possible
>> > "1.2.3" (with the included future tag that we have yet to witness),
>> > we lose the basis by which we currently assign "1.2.3" as the
>> > version 
>> 
>> I see what you're getting at here, but still I disagree.  Our basis
>> for associating version "1.2.3" with commit XYZ is simply that
>> upstream had indicated that version "1.2.3" was commit XYZ.  That
>> historical fact is immutable.
> History is a social construct, it's not immutable.

I agree that /our knowledge of history/ is a social construct, and thus
mutable, but that's not what I was referring to here.  I was referring
to the facts of what /actually happened/ in the past, which is
admittedly unknowable to us (especially in recent times).

>> If upstream later indicates that version "1.2.3" is now commit YYZ, I
>> don't think that invalidates our basis for continuing to associate
>> version "1.2.3" with commit XYZ.  The aforementioned immutable
>> historical fact still remains our basis and justification for making
>> that association.
> I'm pretty sure it does, particularly to a future observer who may not
> have the luxury of a history to distinguish that record from one in
> which a malicious committer linked those versions and tag together and
> then no one bothered to check.

It's a valid point.  However, your denial of the existence of any
immutable historical facts (which is somewhat defensible) is starkly at
odds with the fundamental principles of GNU Guix and its core goals.

I can understand your desire to have "FOO@1.2.3" in Guix correspond to
the most recent announcement from the upstream FOO project on what they
consider version "1.2.3" of their package to be.  This is obviously a
desirable property for a package manager to have.

The problem is that it's incompatible with properties of Guix that I
consider to be far more important.  I don't want the meaning of
"FOO@1.2.3" in Guix to depend on what time it is when I ask the
question.  I want it to mean the same thing tomorrow that it means
today.  Reproducibility requires this.  It requires some notion of
immutability.

I don't see how to reconcile Guix's core goals with your apparent goal
of having "FOO@1.2.3" match upstream's latest idea of what "1.2.3" means
to them, without abandoning the use of package version numbers in Guix
altogether.

We might simply want different things from a package manager.
De gustibus non disputandum est.

      Regards,
        Mark

-- 
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>.


* Re: On raw strings in <origin> commit field
  2022-01-01 10:33               ` Liliana Marie Prikler
@ 2022-01-01 20:37                 ` Mark H Weaver
  2022-01-01 22:55                   ` Liliana Marie Prikler
  2022-01-02 19:30                   ` zimoun
  0 siblings, 2 replies; 63+ messages in thread
From: Mark H Weaver @ 2022-01-01 20:37 UTC (permalink / raw)
  To: Liliana Marie Prikler, zimoun, guix-devel

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

>> Where is the Cantor-style diagonalization argument that you spoke of?
> You skipped over it, read again.  The key point is that you're
> referencing the thing you think will be invalidated to create your
> scheme.

I've carefully read your message at least 4 times, but I've been unable
to find anything resembling Cantor's diagonalization argument in there.
Does anyone else see it?  Perhaps my powers of recognition are too weak.

      Regards,
        Mark

-- 
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>.



* Re: On raw strings in <origin> commit field
  2022-01-01 20:37                 ` Mark H Weaver
@ 2022-01-01 22:55                   ` Liliana Marie Prikler
  2022-01-02 22:57                     ` Mark H Weaver
  2022-01-02 19:30                   ` zimoun
  1 sibling, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-01 22:55 UTC (permalink / raw)
  To: Mark H Weaver, zimoun, guix-devel

Am Samstag, dem 01.01.2022 um 15:37 -0500 schrieb Mark H Weaver:
> Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
> 
> > > Where is the Cantor-style diagonalization argument that you spoke
> > > of?
> > You skipped over it, read again.  The key point is that you're
> > referencing the thing you think will be invalidated to create your
> > scheme.
> 
> I've carefully read your message at least 4 times, but I've been
> unable to find anything resembling Cantor's diagonalization argument
> in there.  Does anyone else see it?  Perhaps my powers of recognition
> are too weak.
Can I help your powers of recognition by describing everything in terms
of Turing machines?



* Re: On raw strings in <origin> commit field
  2022-01-01 20:19         ` Mark H Weaver
@ 2022-01-01 23:20           ` Liliana Marie Prikler
  2022-01-02 12:25             ` Mark H Weaver
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-01 23:20 UTC (permalink / raw)
  To: Mark H Weaver, guix-devel

Hi,

Am Samstag, dem 01.01.2022 um 15:19 -0500 schrieb Mark H Weaver:
> Hi Liliana,
> 
> Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
> 
> > Am Freitag, dem 31.12.2021 um 20:41 -0500 schrieb Mark H Weaver:
> > > I disagree with the last line above.  What makes you think that
> > > I'm presupposing that the tag does change?
> > > 
> > > There's a difference between "presupposing that the tag does
> > > change" and "not assuming that the tag will not change".  Do you
> > > see the difference?
> > I'm pretty sure ¬assume(¬X) = assume(¬¬X) in this concept.
To correct my own typo here, I meant context, not concept.

> No, that's certainly false.  On the left-hand side of that equation
> there is an absence of any assumptions, and on the right-hand side
> there is the assumption that ¬¬X is true.
> 
> Perhaps something is getting lost in translation between our
> languages.
You are not a tabula rasa.  By the time you push a commit or even by
the time you submit it to the mailing lists, you have already made a
bunch of assumptions, some of them just more implicit than others. 
Assuming you care enough to make an informed use of the raw commit
pattern rather than just copying it from elsewhere, this holds even
more so.

> > > 
> > > > However, if we are always talking about more than one possible
> > > > "1.2.3" (with the included future tag that we have yet to
> > > > witness), we lose the basis by which we currently assign
> > > > "1.2.3" as the version 
> > > 
> > > I see what you're getting at here, but still I disagree.  Our
> > > basis for associating version "1.2.3" with commit XYZ is simply
> > > that upstream had indicated that version "1.2.3" was commit XYZ. 
> > > That historical fact is immutable.
> > History is a social construct, it's not immutable.
> 
> I agree that /our knowledge of history/ is a social construct, and
> thus mutable, but that's not what I was referring to here.  I was
> referring to the facts of what /actually happened/ in the past, which
> is admittedly unknowable to us (especially in recent times).
We'd have to go into metaphysics in order to debate that and I'm sure
you could chase that rabbit hole to find an answer that works for
software version control, but at the same time we are dealing with an
even larger field of unanswered questions at that point and not making
any advancements to the problem we're actually trying to solve
whatsoever.

> > 
> > > If upstream later indicates that version "1.2.3" is now commit
> > > YYZ, I don't think that invalidates our basis for continuing to
> > > associate version "1.2.3" with commit XYZ.  The aforementioned
> > > immutable historical fact still remains our basis and
> > > justification for making that association.
> > I'm pretty sure it does, particularly to a future observer who may
> > not have the luxury of a history to distinguish that record from
> > one in which a malicious committer linked those versions and tag
> > together and then no one bothered to check.
> 
> It's a valid point.  However, your denial of the existence of any
> immutable historical facts (which is somewhat defensible) is starkly
> at odds with the fundamental principles of GNU Guix and its core
> goals.
> 
> I can understand your desire to have "FOO@1.2.3" in Guix correspond
> to the most recent announcement from the upstream FOO project on what
> they consider version "1.2.3" of their package to be.  This is
> obviously a desirable property for a package manager to have.
> 
> The problem is that it's incompatible with properties of Guix that I
> consider to be far more important.  I don't want the meaning of
> "FOO@1.2.3" in Guix to depend on what time it is when I ask the
> question.  I want it to mean the same thing tomorrow that it means
> today.  Reproducibility requires this.  It requires some notion of
> immutability.
> 
> I don't see how to reconcile Guix's core goals with your apparent
> goal of having "FOO@1.2.3" match upstream's latest idea of what
> "1.2.3" means to them, without abandoning the use of package version
> numbers in Guix altogether.
> 
> We might simply want different things from a package manager.
> De gustibus non disputandum est.
I think you are (intentionally or not) ignoring multiple key
assumptions here that we can work with.  Some because we're dealing
with software, some because we're working with Guix.  For instance, the
assumption (you might call it "fact"), that Guix can identify an origin
by a combination of filename and hash.  Or the assumption that version
numbers are monotonically increasing and typically not reissued.

When you claim for a specific git-reference that you need to prepare
for the occasion of a tag being overwritten, you are making an
assumption that such an event will indeed take place in the future and
justifying your action based on said assumption.  When I claim "yeah,
GNOME has been around for more than twenty years and they're pretty
adamant about versioning", I am doing exactly the same.  However,
notice how from my assumption it logically follows – as long as said
assumption remains valid – that there exists a single release 41.0
which I can refer to (by tag), but from yours – again, as long as it is
valid – it follows that there is no single version N that you can refer
to by whatever commit you have.

Now what if I'm wrong?  If I am indeed wrong and GNOME 43 is tagged
twice or even thrice, I'd be willing to use git-version for all GNOME
stuff that we take from git.  I'd also be willing to move to git
origins for GNOME stuff if mirror://gnome becomes unreliable.  But what
if you're correct?  Are you willing to use git-version for every single
git-reference out there?



* Re: On raw strings in <origin> commit field
  2022-01-01 11:12       ` Liliana Marie Prikler
  2022-01-01 17:45         ` Timothy Sample
  2022-01-01 20:19         ` Mark H Weaver
@ 2022-01-02  2:07         ` Bengt Richter
  2 siblings, 0 replies; 63+ messages in thread
From: Bengt Richter @ 2022-01-02  2:07 UTC (permalink / raw)
  To: Liliana Marie Prikler; +Cc: guix-devel

[0]    https://cdn.quotesgram.com/img/31/40/532049644-676813c5150a0168ad089c40202f742e.jpg

On +2022-01-01 12:12:33 +0100, Liliana Marie Prikler wrote:
> Am Freitag, dem 31.12.2021 um 20:41 -0500 schrieb Mark H Weaver:
> > I disagree with the last line above.  What makes you think that I'm
> > presupposing that the tag does change?
> > 
> > There's a difference between "presupposing that the tag does change"
> > and "not assuming that the tag will not change".  Do you see the
> > difference?
> I'm pretty sure ¬assume(¬X) = assume(¬¬X) in this concept.  You have to
> start with some assumptions and while ideally we'd like to encode "I
> don't care", we do not have a system that allows us to do so.
> 
> > > However, if we are always talking about more than one possible
> > > "1.2.3" (with the included future tag that we have yet to witness),
> > > we lose the basis by which we currently assign "1.2.3" as the
> > > version 
> > 
> > I see what you're getting at here, but still I disagree.  Our basis
> > for associating version "1.2.3" with commit XYZ is simply that
> > upstream had indicated that version "1.2.3" was commit XYZ.  That
> > historical fact is immutable.
> History is a social construct, it's not immutable.
>

--8<---------------cut here---------------start------------->8---
“When I use a word,” Humpty Dumpty said in rather a scornful tone,
“it means just what I choose it to mean — neither more nor less.”

“The question is,” said Alice, “whether you can make words mean so many different things.”

“The question is,” said Humpty Dumpty, “which is to be master – – that’s all.”
--8<---------------cut here---------------end--------------->8---

-- 
Regards,
Bengt Richter



* Re: On raw strings in <origin> commit field
  2022-01-01 23:20           ` Liliana Marie Prikler
@ 2022-01-02 12:25             ` Mark H Weaver
  2022-01-02 14:09               ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: Mark H Weaver @ 2022-01-02 12:25 UTC (permalink / raw)
  To: Liliana Marie Prikler, guix-devel

Hi Liliana,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

> Am Samstag, dem 01.01.2022 um 15:19 -0500 schrieb Mark H Weaver:
>> Hi Liliana,
>> 
>> Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
>> 
>> > Am Freitag, dem 31.12.2021 um 20:41 -0500 schrieb Mark H Weaver:
>> > > I disagree with the last line above.  What makes you think that
>> > > I'm presupposing that the tag does change?
>> > > 
>> > > There's a difference between "presupposing that the tag does
>> > > change" and "not assuming that the tag will not change".  Do you
>> > > see the difference?
>> > I'm pretty sure ¬assume(¬X) = assume(¬¬X) in this concept.
> To correct my own typo here, I meant context, not concept.
>
>> No, that's certainly false.  On the left-hand side of that equation
>> there is an absence of any assumptions, and on the right-hand side
>> there is the assumption that ¬¬X is true.
>> 
>> Perhaps something is getting lost in translation between our
>> languages.

Repeating myself: our difficulties understanding each other on this
point might be due to a translation issue.  Earlier, you wrote:

> And here I disagree.  This reasoning presupposes that we have to ensure
> that the package still points to the same commit if the tag changes,
> which itself presupposes that the tag does change.

and I replied (quoted above) "I disagree with the last line above".

However, I'll note that a single-word substitution would eliminate my
objection.  If you substitute "might change" or "could change" in place
of "does change" in the text above, then I would more-or-less
agree with what you wrote.

In English, the phrases "could change" and "might change" indicate a
/possibility/ of change.  In other words, they indicate an absence of
knowledge about whether change will occur.

On the other hand, "does change" suggests to my ears (as a native
English speaker) that change is /known to occur/.  In other words, if
you say "X does change" and then X is observed to remain constant over
some suitably long time interval, that would call into question the
veracity of your words.

To give an example from the Scheme programming language, if you showed
me the following code template:

   (let ((LST '(1 2 3)))
     <body>)

I would say "LST could change".  I would *not* say "LST does change".

With this in mind, here are your words again:

> And here I disagree.  This reasoning presupposes that we have to ensure
> that the package still points to the same commit if the tag changes,
> which itself presupposes that the tag does change.

In the last line, you're telling me that my reasoning "presupposes that
the tag does change", which to my ears suggests that you think I'm
assuming that _every_ tag will be mutated sooner or later.

If, instead, the last line above read: "which itself presupposes that
the tag *could* change", that essentially means that I'm preparing for
the /possibility/ of change, which is true.

What do you think?

      Regards,
        Mark

-- 
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>.


* Re: On raw strings in <origin> commit field
  2022-01-02 12:25             ` Mark H Weaver
@ 2022-01-02 14:09               ` Liliana Marie Prikler
  0 siblings, 0 replies; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-02 14:09 UTC (permalink / raw)
  To: Mark H Weaver, guix-devel

Hi Mark,

Am Sonntag, dem 02.01.2022 um 07:25 -0500 schrieb Mark H Weaver:
> Repeating myself: our difficulties understanding each other on this
> point might be due to a translation issue.  Earlier, you wrote:
> 
> > And here I disagree.  This reasoning presupposes that we have to
> > ensure that the package still points to the same commit if the tag
> > changes, which itself presupposes that the tag does change.
> 
> and I replied (quoted above) "I disagree with the last line above".
> 
> However, I'll note that a single-word substitution would eliminate my
> objection.  If you substitute "might change" or "could change" in
> place of "does change" in the text above, then I would more-or-less
> agree with what you wrote.
> 
> In English, the phrases "could change" and "might change" indicate a
> /possibility/ of change.  In other words, they indicate an absence of
> knowledge about whether change will occur.
We can agree on an absence of knowledge, but an absence of knowledge is
not an absence of assumption, which is why I'm making a stronger
statement than what you agree with.

> On the other hand, "does change" suggests to my ears (as a native
> English speaker) that change is /known to occur/.  In other words, if
> you say "X does change" and then X is observed to remain constant
> over some suitably long time interval, that would call into question
> the veracity of your words.
It's not quite as harsh, since either way we are making predictions
about uncertain events in the future.  We could for instance make a bet
on a given package, whether a given tag will be moved/dropped within
1/2/5/10 years and whoever wins gets the commit in which we updated the
package description as NFT, bragging rights, or something similarly
trivial.  There is no such thing as absolute truth to either statement
(the tag changes/does not change), there is only an observation whether
one or the other holds in past tense at a given point in time.

> To give an example from the Scheme programming language, if you
> showed me the following code template:
> 
>    (let ((LST '(1 2 3)))
>      <body>)
> 
> I would say "LST could change".  I would *not* say "LST does change".
Whether or not LST changes obviously depends on BODY here, but this
form is ill-suited to draw comparison.  C++ const (correctness) would
be closer to what we're discussing, as would be the following Scheme
code inside a module:

  (define LST '(1 2 3))
  (define (F) (peek LST))

Let's assume that some outsider redefines LST to '(4 5 6).  What would
be the value printed and returned by F?  This depends on whether the
compiler was optimistic and inlined LST or pessimistic and did not
inline it.  In either case, the compiler makes an assumption about
whether LST *does change* and it can be wrong.

> With this in mind, here are your words again:
> 
> > And here I disagree.  This reasoning presupposes that we have to
> > ensure that the package still points to the same commit if the tag
> > changes, which itself presupposes that the tag does change.
> 
> In the last line, you're telling me that my reasoning "presupposes
> that the tag does change", which to my ears suggests that you think
> I'm assuming that _every_ tag will be mutated sooner or later.
> 
> If, instead, the last line above read: "which itself presupposes that
> the tag *could* change", that essentially means that I'm preparing
> for the /possibility/ of change, which is true.
"Tags in Git can change" is not an assumption that one person can make
and another dispute.  It is a fact/rule/whatever given by Git, that
they can.  In a similar manner, it is a fact, that servers can move to
different locations in both geographical and virtual address spaces and
can over time serve different content.

The question is what policies we derive from said facts, which is a
typical is/ought dilemma.  Since at this point we are far removed from
facts that we can state with certainty, we have to instead rely on
assumptions, whether they come from experience, gut feelings or an old
lady with a crystal ball down the street.  

This does not mean that you assume each and every tag out in the wild
changes every few commits to the Guix repository.  But it does mean
that you find it likely and/or troublesome enough to warrant a policy,
which in turn is based either on the assumption that it is likelier to
change than not or some ad-hoc justification to bet against the odds as
well as a rough assumption on said odds.  It's like asking someone to
estimate whether the glass is full or empty, but with the added bonus
that we don't even know how much water it contains.

Now I am very open about my assumption that tags won't break for a
large number of packages.  I can also understand if you make a
different assumption for one package out there and thereby justify
using git-version, and there's even an argument to be made that all
git-based packages ought to use it by generalizing that assumption.  But you
cannot assume both to hold at the same time without being inconsistent
and there's nothing meaningful to derive from not knowing because
you'll make a guess either way.

Cheers


* Re: On raw strings in <origin> commit field
  2022-01-01 20:37                 ` Mark H Weaver
  2022-01-01 22:55                   ` Liliana Marie Prikler
@ 2022-01-02 19:30                   ` zimoun
  2022-01-02 21:35                     ` Liliana Marie Prikler
  1 sibling, 1 reply; 63+ messages in thread
From: zimoun @ 2022-01-02 19:30 UTC (permalink / raw)
  To: Mark H Weaver, Liliana Marie Prikler, guix-devel

Hi Mark, Liliana, all,

On Sat, 01 Jan 2022 at 15:37, Mark H Weaver <mhw@netris.org> wrote:

>>> Where is the Cantor-style diagonalization argument that you spoke of?
>>
>> You skipped over it, read again.  The key point is that you're
>> referencing the thing you think will be invalidated to create your
>> scheme.
>
> I've carefully read your message at least 4 times, but I've been unable
> to find anything resembling Cantor's diagonalization argument in there.
> Does anyone else see it?  Perhaps my powers of recognition are too weak.

Mark, I do not see the diagonalization either.  Liliana, please point
out explicitly what acts as the diagonal in your reasoning.  Or anyone
else, if I am missing the obvious.

That said, I think the reasoning is doomed earlier.  It reads,

        Okay, so let's write out the full argument.  At a certain date,
        we package or update P to version V through reference to tag T
        (at commit C).  Because we can't trust T to remain valid, we
        only keep (V, C) around for "robustness".

        Now notice, how version V is generated by referring to T.  Without loss
        of generality, assume that T is invalidated, otherwise nothing to
        prove.  Since V is created through reference to T, it is also
        invalidated as being the canonical V, whichever it is.  A similar
        argument can be made for C as well.  So both (V, C) are invalidated and
        the only thing we can claim is "yeah, upstream did tag that T at some
        point".

<https://yhetil.org/guix/762e9fb7116c442bf0f8f63221bf32fa2b77f2cf.camel@gmail.com>

And this statement «Without loss of generality, assume that T is
invalidated, otherwise nothing to prove.  Since V is created through
reference to T, it is also invalidated as being the canonical V,
whichever it is.» is not precise enough.

Because the pair (V,C) is fixed by Guix packagers, whatever T becomes,
the pair (V,C) is not invalidated.  What is invalidated
is the match between what upstream calls version (what we can name
upstream canonical) and what the Guix project considers as version (what
the end-user expects similar to the upstream canonical one).  Timothy
provided an example of such mismatch.

The reasoning requires 2 versions: the upstream canonical version V’
linked to T.  And the Guix-related version noted V and used by the pair
(V,C) in the package definition.  It seems hard to me to derive logical
arguments for the link between V and V’.

Moreover, the pair (V,C) is stored inside the immutable Guix history.
Yeah, maybe tomorrow we will all be crazy and rewrite all the Guix
history, because yes Guix history is just one DAG we collectively agree
on – I would not say it is a social construct though, anyway.

As I tried to explain
<https://yhetil.org/guix/86ilv46hls.fsf@gmail.com>, the Guix ’version’
field matches as much as possible upstream “canonical” version defined
by tag, commit, url, revision, etc. but both are not the same thing.  In
addition, Git-tag and Git-commit are not the same philosophical
ontology.

Last on this point, using ’git-version’ and commit or tag to define Guix
version (the field ’version’) is not related to the issue of referring
by tag or commit in ’uri’ field.  That’s not the same level.

Moreover, Liliana you wrote:

        Git commit hashes do not just depend on the content.  They also
        depend on how much effort you put into solving a proof of work
        challenge that won't ever earn you crypto coins [1].

pointing to <https://github.com/tochev/git-vanity>.  The statement «Git
commit hashes do not just depend on the content» is wrong.
Specifically, it is by adding more well-chosen content that the hash is
tricked.  Now, Liliana, you can define “content” as useful content
as opposed to meta-content, etc.  Well, that is fine, but the statement
still appears wrong to me because a Git commit hash only depends on the
content itself.
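
To make this concrete, here is a minimal sketch (Python, standard
library only) of how Git derives an object ID: SHA-1 over a short header
plus the bytes of the object.  The identity line and the messages are
made up for illustration; only the hashing scheme is Git's.

```python
import hashlib

def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 over '<kind> <size>' + NUL + body, as Git does for its objects."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def commit_body(tree_id: str, message: str) -> bytes:
    # Hypothetical author/committer line, invented for this example.
    ident = "Alice Example <alice@example.org> 1641150000 +0000"
    return (f"tree {tree_id}\n"
            f"author {ident}\n"
            f"committer {ident}\n"
            f"\n{message}\n").encode()

# The ID of the empty tree; both commits below point at the same tree.
EMPTY_TREE = git_object_id("tree", b"")

plain = git_object_id("commit", commit_body(EMPTY_TREE, "Release 1.2.3"))
padded = git_object_id("commit",
                       commit_body(EMPTY_TREE, "Release 1.2.3\n\nnonce: 42"))

# Same tree, different commit text, hence different commit IDs.
print(plain != padded)
```

The two commits carry the same files and differ only in the bytes of
the commit object.  A vanity tool searches for such extra bytes until
the ID has the wanted prefix, so the ID is still a function of the
bytes that were hashed and nothing else.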

If your point is that Git using SHA-1 is subject to chosen-prefix attacks,
yeah, that has been well known since,

    https://sha-mbles.github.io/

and it is even discussed in the long thread “Tricking peer review”
<https://yhetil.org/guix/874k9if7am.fsf@inria.fr>.  For instance, see my
email about SWH case <https://yhetil.org/guix/86r1cgcb8r.fsf@gmail.com>.

Even more, we discussed chosen-prefix attack and SHA-1 for channel
fallback starting here <https://issues.guix.gnu.org/44187#10>.

Somehow, as always and even outside content-address system, you have to
distinguish between identifier and integrity.  Anyway.

Despite being aware (before this discussion) of many flaws, I still
think that intrinsic values are better than extrinsic values for
referencing source code, papers or anything else.  And yes, intrinsic
values are not the perfect solution, but they are really better than
extrinsic ones; being imperfect does not make them any less worthwhile
or preferable.  The impression all your lengthy answers give is that,
since intrinsic referencing has some well-known issues, we should find
overcomplicated strategies to fix extrinsic referencing, which has even
more issues.

Well, I am confused by what you are trying to raise in all this now
lengthy thread – I have read it several times. :-)

To me, the points for more intrinsic values and less extrinsic ones in
’uri’ field are:

 1. yes, upstream extrinsic addressing is handy
 2. but it adds a lookup burden
 3. the uri requires intrinsic addressing to ease the long term
 4. it is not clear which intrinsic values to pick and how to transition

and for ’version’ field, we can do whatever we want as Guix packagers.

Then, we can come up with overcomplicated philosophico-logical and
metaphysical arguments and question the meaning of life; for sure it is
interesting – at least I find it sometimes interesting :-) – but I am
doubtful it helps in cooking the rice. ;-)

Cheers,
simon


* Re: On raw strings in <origin> commit field
  2022-01-02 19:30                   ` zimoun
@ 2022-01-02 21:35                     ` Liliana Marie Prikler
  2022-01-03  9:22                       ` zimoun
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-02 21:35 UTC (permalink / raw)
  To: zimoun, Mark H Weaver, guix-devel

Hi Simon,

On Sunday, 02.01.2022 at 20:30 +0100, zimoun wrote:
> Last on this point, using ’git-version’ and commit or tag to define
> Guix version (the field ’version’) is not related to the issue of
> referring by tag or commit in ’uri’ field.  That’s not the same
> level.
Look at the title, now back to me, now back to the title, now back to
me.  Sadly, that title is not about whether to use a commit in the
<git-reference>, but whether to use a *raw* commit string without
indicating so in the version field of the <package>.  So while a
preference of git-fetch over url-fetch, commit over tag (or the other
way round) are perhaps adjacent, the use of this style rather than let-
bound commits with git-versions is pretty damn relevant.

> The statement still appears to me wrong because Git commit hash only
> depends on the content itself.
If you define content through the NAR hash used by Guix, I'm pretty
sure vanity commits invalidate that statement.

> If your point is that Git using SHA-1 is subject to chosen-prefix
> attack, yeah it is well-known since,
> 
>     https://sha-mbles.github.io/
> 
> and it is even discussed in the long thread “Tricking peer review”
> <https://yhetil.org/guix/874k9if7am.fsf@inria.fr>.  For instance, see
> my email about SWH case <
> https://yhetil.org/guix/86r1cgcb8r.fsf@gmail.com>.
> 
> Even more, we discussed chosen-prefix attack and SHA-1 for channel
> fallback starting here <https://issues.guix.gnu.org/44187#10>.
> 
> Somehow, as always and even outside content-address system, you have
> to distinguish between identifier and integrity.  Anyway.
That might work against injecting evil content, but the attack surface
for a denial of service (which is what we need to consider in our
robustness argument) is a little larger, don't you think?  Heck, if I'm
in control of the forge and I used a preimage attack to push a
different commit under the same hash, Guix would not even bother using
SWH and just error out (once substitutes have been garbage-collected,
which again is a prerequisite for any robustness argument and may
therefore be assumed).

> In despite of being aware (before this discussion) of many flaws, I
> am still thinking that intrinsic values is better than extrinsic
> values for referencing source or paper or else.  And yes, intrinsic
> values are not the perfect solution but it is really better than
> extrinsic ones; and it is not because it is not perfect that it is
> not worth or preferable.
> Although not perfect, it is still better.  Where the impression of
> all your lengthy answers provide is that intrinsic referencing has
> some well-known issues so let find overcomplicated strategies to fix
> extrinsic referencing which has even more issues.
If intrinsic values are so good, why wouldn't you use them for versions
then?  :P

Our issues here are not of technological nature, they are social
issues.  I don't care if we're using SemVer, CalVer, jiffies since the
start of the pandemic, BibleVerse or years since the birth of Kim Il Sŏng to
version and url-fetch, svn-fetch, git-fetch, or butterfly-fetch to
fetch software.  I care about consistency, both internal and external,
which for any given package means using (V(T), T) or (V(T, C), C) and
not (V(T), C), though of course the choice of which might vary.
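
The three styles can be told apart mechanically.  The following is a
rough sketch in Python: git_version mirrors the string that Guix's
`git-version' produces, while style is a hypothetical helper that
assumes tags look like "v" followed by the version.  The commit ID is
a placeholder.

```python
def git_version(version: str, revision: str, commit: str) -> str:
    """Mirror of the string built by Guix's git-version:
    VERSION-REVISION.<first 7 characters of COMMIT>."""
    return f"{version}-{revision}.{commit[:7]}"

def style(version: str, ref: str) -> str:
    """Classify a (version, ref) pair.  Hypothetical helper; it assumes
    that upstream tags are spelled 'v' + version."""
    is_commit = len(ref) == 40 and all(c in "0123456789abcdef" for c in ref)
    if not is_commit:
        return "(V(T), T)" if ref == "v" + version else "unknown"
    if version.endswith("." + ref[:7]):
        return "(V(T, C), C)"
    return "(V(T), C)"

COMMIT = "deadbeef" * 5  # placeholder 40-digit commit ID

print(style("1.2.3", "v1.2.3"))                          # version and tag
print(style(git_version("1.2.3", "0", COMMIT), COMMIT))  # git-version and commit
print(style("1.2.3", COMMIT))                            # the mixed style
```

Only the last pair is the mixed style argued against here: the version
claims to be derived from the tag while the source is fetched by commit.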

> Well, I am confused by what you are trying to raise in all this now
> lengthy thread – I have read it several times. :-)
> 
> To me, the points for more intrinsic values and less extrinsic ones
> in ’uri’ field are:
> 
>  1. yes, upstream extrinsic addressing are handy
>  2. but they add burden for lookup
>  3. uri requires intrinsic addressing to ease long term
>  4. it is not clear what intrinsic values pick and how to transition
> 
> and for ’version’ field, we can do whatever we want as Guix
> packagers.
There are a few equivalent ways of formulating my core point here, so
pick the one you're most comfortable with:
1. If you can't trust tag T to uniquely label version V(T), you cannot
do so for the commit C it points to either.
2. If you can't trust tag T to always point to commit C, there is no
basis on which you can claim V(T) is provided by C.
3. If you version a package V(T) and fetch it using commit C, but T
points to C', that's an issue.

Bonus claim as a length extender:
4. If you cannot trust commit C to remain for as long as you want it to
stay... well, then you'd better not use git-fetch at all, don't you
think?

> Guix history is just one DAG we collectively agree on – I would not
> say it is a social construct though, anyway.
Oh, but it is, and it has been changed in the past.  Now try finding
out when.  Bonus points if you do so in 10 years without referring to
logs.guix.gnu.org.  Jackpot if you do so after Guix switched its
version control system because Git still uses SHA-1 in year N and it
has been utterly broken at that point.

All our versioning systems, whether "intrinsic" or extrinsic, are
socially constructed through our collective agreement to use the same
software to assign meaning to a particular mapping.



* Re: On raw strings in <origin> commit field
  2022-01-01 22:55                   ` Liliana Marie Prikler
@ 2022-01-02 22:57                     ` Mark H Weaver
  2022-01-03 21:25                       ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: Mark H Weaver @ 2022-01-02 22:57 UTC (permalink / raw)
  To: Liliana Marie Prikler, zimoun, guix-devel

Hi Liliana,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

> On Saturday, 01.01.2022 at 15:37 -0500, Mark H Weaver wrote:
>> Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
>> 
>> > > Where is the Cantor-style diagonalization argument that you spoke
>> > > of?
>> > You skipped over it, read again.  The key point is that you're
>> > referencing the thing you think will be invalidated to create your
>> > scheme.
>> 
>> I've carefully read your message at least 4 times, but I've been
>> unable to find anything resembling Cantor's diagonalization argument
>> in there.  Does anyone else see it?  Perhaps my powers of recognition
>> are too weak.
> Can I help your powers of recognition by describing everything in terms
> of Turing machines?

How about pointing out what acts as the diagonal in your reasoning?

      Thanks,
        Mark

-- 
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>.



* Re: On raw strings in <origin> commit field
  2022-01-01 19:52           ` Liliana Marie Prikler
@ 2022-01-02 23:00             ` Timothy Sample
  0 siblings, 0 replies; 63+ messages in thread
From: Timothy Sample @ 2022-01-02 23:00 UTC (permalink / raw)
  To: Liliana Marie Prikler; +Cc: guix-devel

Hey,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

> Since you are our expert on preservation, would you mind if I ask you
> for some estimates on how painful it is to track down such commits in
> general, if it could be made easier were you to record tag → commit
> (alternatively file-name x sha256 → SWHID) maps periodically (or if you
> already have such a map and those arise while creating it), and how
> many “Tricking Peer Review”-style problems you think are currently
> around?

I haven’t been keeping a detailed log of issues or anything, but I have
some notes and recollections.  As of last month [1], we have 9,554 Git
sources (fixed-output derivations lowered from ‘git-reference’ origins).
Of those, 186 could not be recovered automatically (by simply cloning
the repo or, for about 100 cases with the commit hash, checking SWH).
Most of the 186 have ‘(recursive? #t)’, which is something I haven’t
implemented yet (there’s no Guix fallback support for it either).
However, there are 51 of those that should just work but don’t.
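
For a sense of scale, the counts above work out as follows (a quick
calculation in Python; the variable names are mine):

```python
git_sources = 9554   # Git sources in the report
unrecovered = 186    # could not be recovered automatically
unexplained = 51     # should just work but don't

def percent(part: int, whole: int) -> float:
    """Share of whole taken by part, in percent, rounded to two places."""
    return round(100 * part / whole, 2)

print(percent(unrecovered, git_sources))                # 1.95
print(percent(unexplained, git_sources))                # 0.53
print(percent(git_sources - unrecovered, git_sources))  # 98.05
```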

It turns out that most of these are due to my scripts ignoring the wrong
kind of VCS files (like ignoring “.hg”) when hashing.  My scripts follow
the logic of ‘guix hash -S nar -x .’, but Guix actually just deletes the
Git metadata: ‘rm -rf .git; guix hash -S nar .’.  :)
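
Here is a toy illustration of why the two approaches disagree.  It is
only a sketch: the digest is a stand-in for the NAR hash, not the real
serialization, and the list of VCS directory names is an assumption
rather than a copy of what Guix checks.

```python
import hashlib

# Directory names treated as version-control metadata (illustrative list).
VCS_DIRS = {".git", ".hg", ".svn", ".bzr", "CVS"}

def tree_digest(files: dict, skip_dirs: set) -> str:
    """SHA-256 over sorted, length-prefixed (path, content) pairs, skipping
    files that live under a directory named in skip_dirs."""
    h = hashlib.sha256()
    for path in sorted(files):
        if any(part in skip_dirs for part in path.split("/")[:-1]):
            continue
        for chunk in (path.encode(), files[path]):
            h.update(len(chunk).to_bytes(8, "little"))
            h.update(chunk)
    return h.hexdigest()

checkout = {
    ".git/HEAD": b"ref: refs/heads/master\n",
    ".hg/requires": b"store\n",  # stray Mercurial metadata shipped upstream
    "src/main.c": b"int main(void) { return 0; }\n",
}

exclude_all_vcs = tree_digest(checkout, VCS_DIRS)  # in the spirit of `-x`
delete_git_only = tree_digest(checkout, {".git"})  # in the spirit of `rm -rf .git`

# The results differ only because the checkout contains a ".hg" directory.
print(exclude_all_vcs != delete_git_only)
```

For a checkout without any non-Git metadata both digests coincide,
which is why the mismatch only shows up for a small number of sources.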

Another couple are <https://issues.guix.gnu.org/48540>.

There were a handful of mutated tags (around a dozen).  Some of them
were deleted, but the tag name referred to the commit hash (as if the
tag was named by ‘git describe’).  Some of them were changed, but it was
clear that the original tag was just a few commits back.  There was only
one I couldn’t figure out:

    https://github.com/jurplel/qView.git
    at tag 2.0
    with hash 1s29hz44rb5dwzq8d4i4bfg77dr0v3ywpvidpa6xzg7hnnv3mhi5

A similar problem is when the repo URL changes, but the tags are still
the same when you track down another copy of the repo.  I encountered
this a few times.

Another handful (again around a dozen) were hash mistakes in the style
of tricking peer review.  In most cases our commit messages were clear
enough to figure out what the hash was actually for.  There are two
mysterious cases:

    https://git.umaneti.net/flycheck-grammalecte/
    at tag v1.3
    with hash 1f1gapvs9j89qr474103dqgsiyb96phlnsmq5hiv4ba242blg9lb
    (see Guix commit ca5a791f6285b08506ccd662d5911ccf0c4d1ece)

    https://github.com/fdik/libetpan
    at commit 210ba2b3b310b8b7a6ee4a4e35e50f7fa379643f
    with hash 00000nij3ray7nssvq0lzb352wmnab8ffzk7dgff2c68mvjbh1l6
    (the hash kinda looks fake, but it was like that for a long time)

There are two other cases that are basically “typos” in the hash.  One
is clearly just an edit to the hash to make the build fail and print the
correct hash (see commits 618df2e335acb49a27ca014b555ede34f79503f3 and
bdc7f72fe4391ede313a0388ddd17cbb053931c9).  The other one is commit
c0dc4179091f85fe4b8a2bbdb07c154a7f0408ed, which changes the hash of the
package ‘zimg’ without mentioning anything about it in the commit
message.  This is fixed in b08c4f5fceff6064baedea3385703689b8a72e47
(back to the original hash).  Tobias might remember what happened there,
but it looks like an honest mistake to me.  I have no clue what that
other hash was for.

Note for all of this that my scripts treat the SHA256 hash as *the*
identifier for a source.  That is, if a tag is mutated and someone
adjusts the origin URI to point to the commit that the tag used to refer
to, I would not notice.  Similarly, for tricking peer review: fixing the
URI to match the hash is invisible to me.  It’s only when we fix the
hash to match the URI that I notice.

See also zimoun’s analysis of the same thing, but with older data:
<https://lists.gnu.org/archive/html/guix-devel/2021-12/msg00032.html>.

[1] https://ngyro.com/pog-reports/2021-12-06/

> On Saturday, 01.01.2022 at 12:45 -0500, Timothy Sample wrote:
>
>> Given what I wrote above, maybe we could start by updating the linter
>> so that ‘check-source’ actually checks that it gets the right result.
>> Right now it uses a few heuristics to check that the result looks
>> okay (for instance, it checks if the result is suspiciously small). 
>> Maybe it should just go through the whole download process and verify
>> the hash?  Alternatively (or additionally), the CI “source”
>> specification could be configured to avoid using our servers as a
>> fallback when checking sources.
>
> I think substitutes should be disabled for the source download of a
> "check-source".  Even if a substitute or SWH fallback exists, that's
> not what we want to check here, no?

Exactly.  It should just fetch the source as naïvely as possible, akin
to ‘GUIX_DOWNLOAD_FALLBACK_TEST=none guix build --check -S ...’ (or with
‘--substitute-urls=""’ or whatever).

>> I agree that adding more identifiers (commit hashes or whatever)
>> makes things more robust, but the cost is more work when creating,
>> updating, and reviewing packages.  I think we should start by
>> verifying the identifiers we already have (i.e., checking that the
>> URI and method of the origin produce the right output).  It would
>> solve many existing problems and would serve as a nice foundation for
>> future improvements.
>
> Is this something we can reasonably expect our current CI or CI in
> general to handle (assuming we tweaked the linter to behave as you
> intend?)  Or would it make more sense to implement this as a
> weekly/monthly cronjob?

I really only mentioned the CI because I had to explain to myself why it
didn’t notice the problem.  I think the linter is probably the better
place to improve things here.  It’s something I’m willing to work on,
but I would need to understand why it doesn’t check the hash already.
It seems like one of those things someone may have already thought about
and decided against.

-- Tim


* Re: On raw strings in <origin> commit field
  2022-01-02 21:35                     ` Liliana Marie Prikler
@ 2022-01-03  9:22                       ` zimoun
  2022-01-03 18:13                         ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: zimoun @ 2022-01-03  9:22 UTC (permalink / raw)
  To: Liliana Marie Prikler, Mark H Weaver, guix-devel

Hi Liliana,

On Sun, 02 Jan 2022 at 22:35, Liliana Marie Prikler <liliana.prikler@gmail.com> wrote:

>> The statement still appears to me wrong because Git commit hash only
>> depends on the content itself.
>
> If you define content through the NAR hash used by Guix, I'm pretty
> sure vanity commits invalidate that statement.

I do not understand what it means – not to say I think your comment does
not make sense at all.  Well, I already took the time to explain twice
how it works.

<https://yhetil.org/guix/86y243kdoo.fsf@gmail.com>
<https://yhetil.org/guix/867dbmi7pf.fsf@gmail.com>

Maybe you also deny the Git documentation saying «Git is a
content-addressable filesystem.»

<https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>

I have the impression that you are trying to keep your statement by
stretching how it concretely works.  Instead of just saying: “My bad, I was
going too far with the chosen-prefix attack of SHA-1”.  And that’s fine
because that’s a valid objection.  (Even if I was already aware,
mentioning such issue helps for a sane collective discussion, IMHO.)

In other words, I totally miss why you are raising what now appears to me as
bad faith.  Anyway.

For the rest, I could comment on several items but I will not.  I would
repeat myself, and the other minor disagreements are not worth putting
energy into resolving. :-)

Cheers,
simon


* Re: On raw strings in <origin> commit field
  2022-01-01 17:45         ` Timothy Sample
  2022-01-01 19:52           ` Liliana Marie Prikler
@ 2022-01-03 15:46           ` Ludovic Courtès
  1 sibling, 0 replies; 63+ messages in thread
From: Ludovic Courtès @ 2022-01-03 15:46 UTC (permalink / raw)
  To: Timothy Sample; +Cc: guix-devel, Liliana Marie Prikler

Hello!

Timothy Sample <samplet@ngyro.com> writes:

> If you want a concrete example to think through, there’s ‘eclib’.  Our
> package says it’s version “20190909”, but that’s not what upstream calls
> version “20190909”.  It looks like when we packaged ‘eclib’, that tag
> pointed to commit 19e7e3e74268bf78bd9a1c4ba07597d5434fb166, but now it
> points to bfbbd7c414521e1bf5e718a2925ea8ad845a2e87.

[...]

> First, as expected, finding the original commit was painful.  SWH did
> not record the old version of the tag.

It probably did: SWH archives the “history of histories”.  However, our
SWH code, ‘lookup-origin-revision’, is looking at the tag found in the
latest snapshot, which is not helpful in this case.

[...]

> Second, these cases are very, very rare.  (I’ve essentially checked
> every Git origin since Guix version 1.0.0, and this problem is not one
> that worries me).  “Tricking Peer Review”-style problems seem to be much
> more prevalent.  When tracking down a “difficult” Git origin, the first
> thing I do is grep the Guix Git history for a “oops I committed the
> wrong hash” message.  I recommend we focus our energies there before
> worrying too much about replacing tags with commits or using both or
> whatever.

Agreed, it’d be nice to address, but not concern #1.

> And as a bonus, if you want to be really kind to future time travellers,
> when fixing an errant hash, please include a nice hint as to what the
> original hash was for (like a commit hash).  We have commit
> ca5a791f6285b08506ccd662d5911ccf0c4d1ece in our repo, which says:
>
>> The previous hash was from the "dev" branch of the repository.
>
> I can’t find the source for the previous hash, and if I could actually
> travel through time, I would change the commit message to:
>
>> The previous hash was from commit abcd0123..., which comes from the
>> "dev" branch of the repository.

In commit 944bd79113b9c856b11dd2b40d40e0274a9f4dd9 I added an
explanation right in the source; I think that’s a transparent and clear
way of handling issues with tags modified in place.

Ludo’.



* Re: On raw strings in <origin> commit field
  2021-12-31 17:56 ` Vagrant Cascadian
@ 2022-01-03 15:51   ` Ludovic Courtès
  2022-01-03 16:29     ` Vagrant Cascadian
  0 siblings, 1 reply; 63+ messages in thread
From: Ludovic Courtès @ 2022-01-03 15:51 UTC (permalink / raw)
  To: Vagrant Cascadian; +Cc: guix-devel, Liliana Marie Prikler

Hello,

Vagrant Cascadian <vagrant@debian.org> writes:

> How about using the output of git describe, which can unambiguously
> include the most relevant tag, the number of commits since that tag, and
> the commit hash:
>
>   $ git describe --long --abbrev=41
>   v1.3.0-13278-g60661adfb8ffa28e1acfcfea27c6cc2fc70f88fe
>
>   $ git describe --long --abbrev=41 v1.3.0
>   v1.3.0-0-ga0178d34f582b50e9bdbb0403943129ae5b560ff

What does ‘git checkout’ do when passed such a string?  Does it ignore
the tag part?

> I *think* I've used such git references in the commit field of packages
> before, and guix seemed fine with it. Occasionally, I've seen git
> describe pick an odd tag to base on. Not sure how it interacts with
> software heritage, or multiple tags, or renamed tags... but in theory it
> could work, and would allow us to detect tag changes "upstream".

For SWH, we need either a tag or a commit ID.

Thanks,
Ludo’.



* Re: On raw strings in <origin> commit field
  2022-01-03 15:51   ` Ludovic Courtès
@ 2022-01-03 16:29     ` Vagrant Cascadian
  0 siblings, 0 replies; 63+ messages in thread
From: Vagrant Cascadian @ 2022-01-03 16:29 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel, Liliana Marie Prikler

On 2022-01-03, Ludovic Courtès wrote:
> Vagrant Cascadian <vagrant@debian.org> writes:
>
>> How about using the output of git describe, which can unambiguously
>> include the most relevant tag, the number of commits since that tag, and
>> the commit hash:
>>
>>   $ git describe --long --abbrev=41
>>   v1.3.0-13278-g60661adfb8ffa28e1acfcfea27c6cc2fc70f88fe
>>
>>   $ git describe --long --abbrev=41 v1.3.0
>>   v1.3.0-0-ga0178d34f582b50e9bdbb0403943129ae5b560ff
>
> What does ‘git checkout’ do when passed such a string?  Does it ignore
> the tag part?

Technically, I have not tried it where the tag no longer exists, but
when the tag does exist, it checks it out. If the tag is moved or no
longer exists and git does not handle that well, the fallback can be to
the full commit id, e.g. the part without the vX.Y.Z-N-g, as it is the
same as a commit ID.

This at least documents both the tag at the time the committer updated
the guix package, as well as the commit ID, with the ability to somewhat
gracefully fall back to the raw commit ID.

I daresay, it seems like the best of both worlds, with the main downside
of being a little verbose.

>> I *think* I've used such git references in the commit field of packages
>> before, and guix seemed fine with it. Occasionally, I've seen git
>> describe pick an odd tag to base on. Not sure how it interacts with
>> software heritage, or multiple tags, or renamed tags... but in theory it
>> could work, and would allow us to detect tag changes "upstream".
>
> For SWH, we need either a tag or a commit ID.

The commit id can be programmatically derived from the git describe
format, if you know in fact it is the git describe format with the
appropriate --long and --abbrev=41 arguments.
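
As a rough sketch of that derivation (Python; the function name is made
up, and it assumes SHA-1 repositories, where the full ID is 40 hex
digits):

```python
import re

# "<tag>-<count>-g<40 hex digits>", the shape of the two examples quoted
# above.  The greedy tag group lets tags contain hyphens themselves.
DESCRIBE_RE = re.compile(
    r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<commit>[0-9a-f]{40})$")

def parse_describe(ref: str):
    """Split a long-format describe string into (tag, count, commit)."""
    m = DESCRIBE_RE.match(ref)
    if m is None:
        raise ValueError(f"not in long describe format: {ref!r}")
    return m["tag"], int(m["count"]), m["commit"]

tag, count, commit = parse_describe(
    "v1.3.0-13278-g60661adfb8ffa28e1acfcfea27c6cc2fc70f88fe")
print(tag)     # v1.3.0
print(count)   # 13278
print(commit)  # 60661adfb8ffa28e1acfcfea27c6cc2fc70f88fe
```

A plain tag or a bare commit ID does not match the pattern, so a caller
can tell the formats apart and only fall back to the commit part when
the string really is in this format.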

live well,
  vagrant


* Re: On raw strings in <origin> commit field
  2022-01-03  9:22                       ` zimoun
@ 2022-01-03 18:13                         ` Liliana Marie Prikler
  2022-01-03 19:07                           ` zimoun
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-03 18:13 UTC (permalink / raw)
  To: zimoun, Mark H Weaver, guix-devel

On Monday, 03.01.2022 at 10:22 +0100, zimoun wrote:
> Hi Liliana,
> 
> On Sun, 02 Jan 2022 at 22:35, Liliana Marie Prikler
> <liliana.prikler@gmail.com> wrote:
> 
> > > The statement still appears to me wrong because Git commit hash
> > > only depends on the content itself.
> > 
> > If you define content through the NAR hash used by Guix, I'm pretty
> > sure vanity commits invalidate that statement.
> 
> I do not understand what it means – not to say I think your comment
> does not make sense at all.  Well, I already took the time to explain
> twice how it works.
> 
> <https://yhetil.org/guix/86y243kdoo.fsf@gmail.com>
> <https://yhetil.org/guix/867dbmi7pf.fsf@gmail.com>
Nothing against Yhetil, but that page did break for me yesterday with a
502.  If you have anything important to say, (partially) quoting
yourself would be much preferred while still adding said link for
curious outsiders, because then I can use an intrinsic lookup mechanism
using only my own mailbox rather than an extrinsic one.

Anyway, the point here is a rather simple one that you can base on your
own explanations.  Due to the different ways Guix and Git filter,
serialize and hash content, you can have two objects O and O', such
that Git hashes O and O' differently, but Guix does not, and similarly
two objects O and O' such that Guix hashes them differently, but Git
does not.  Finding particular values for O and O' would in some cases
be computationally expensive, especially if you want to force a hash
collision in SHA-256 instead of reusing the same files but attaching a
different commit message, but theoretically possible, and if theoretical
possibilities are something you want to base your policies on, that is a
thing to consider.

> Maybe you also deny the Git documentation saying «Git is a
> content-addressable filesystem.»
> 
> <https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>
I don't see why I ought to.  At this point we're very far removed from
my original claim.  However, speaking about file systems, they do
support a variety of operations and one of them which Git has goes by
the familiar command "rm -rf /".  (And even if Git didn't, Git over
HTTPS certainly does particularly when only considering 'git clone', so
that'd again be a moot point to argue).

All file systems, content-addressable or otherwise will run out of
names if the space to store them is finite while the number of files to
store is not.  Some allow the user to overwrite existing files even if
that pool has not yet been exhausted.  You might see that as a
vulnerability.  I don't really care.

> I have the impression that you are trying to keep your statement by
> stretching how it concretely works.  Instead of just say: “My bad, I
> was going too far with the chosen-prefix attack of SHA-1”.  And
> that’s fine because that’s a valid objection.  (Even if I was already
> aware, mentioning such issue helps for a sane collective discussion,
> IMHO.)
I'm not trying to sell you on Fossil, but even before SHA-mbles,
SHAttered were the first to claim that Git was broken due to their
attack [1].  Which doesn't necessarily mean their attack is practical
against Git, for they haven't demonstrated it, but if you want to stoke
fear, go ahead.  (Speaking towards a general you, not you personally.)

I'm not trying to stoke fear, I'm arguing that "raw string in <git-
reference> for robustness" is a bad take for a multitude of reasons. 
No matter what scenarios you think up for other repos out there, the
worst effect on Guix in the foreseeable future is that it's going to
barf a hash mismatch at you if you try to `guix build -S'.  Which if
you want to weaken the robustness claim even more is going to happen
for a dozen commits in a selection in the span of a few years. 
Depending on where you live, it's likelier you (again general you) got
the rona within the last week than a package caught the hashies.

[1] https://shattered.io/


* Re: On raw strings in <origin> commit field
  2022-01-03 18:13                         ` Liliana Marie Prikler
@ 2022-01-03 19:07                           ` zimoun
  2022-01-03 20:19                             ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: zimoun @ 2022-01-03 19:07 UTC (permalink / raw)
  To: Liliana Marie Prikler, Mark H Weaver, guix-devel

Hi Liliana,

On Mon, 03 Jan 2022 at 19:13, Liliana Marie Prikler <liliana.prikler@gmail.com> wrote:
> Am Montag, dem 03.01.2022 um 10:22 +0100 schrieb zimoun:
>> On Sun, 02 Jan 2022 at 22:35, Liliana Marie Prikler
>> <liliana.prikler@gmail.com> wrote:
>> 
>> > > The statement still appears to me wrong because Git commit hash
>> > > only depends on the content itself.
>> > 
>> > If you define content through the NAR hash used by Guix, I'm pretty
>> > sure vanity commits invalidate that statement.
>> 
>> I do not understand what it means – not to say I think your comment
>> does not make sense at all.  Well, I already took the time to explain
>> twice how it works.
>> 
>> <https://yhetil.org/guix/86y243kdoo.fsf@gmail.com>
>> <https://yhetil.org/guix/867dbmi7pf.fsf@gmail.com>
> Nothing against Yhetil, but that page did break for me yesterday with a
> 502.  If you have anything important to say, (partially) quoting
> yourself would be much preferred while still adding said link for
> curious outsiders, because then I can use an intrinsic lookup mechanism
> using only my own mailbox rather than an extrinsic one.

This is somehow intrinsic* because public-inbox uses Message-ID as URL.
Therefore, using emacs-notmuch, you just have to search for
id:86y243kdoo.fsf@gmail.com for instance.

*intrinsic: no it is not intrinsic but self-contained. :-)


> Anyway, the point here is a rather simple one that you can base on your
> own explanations.  Due to the different ways Guix and Git filter,
> serialize and hash content, you can have two objects O and O', such
> that Git hashes O and O' differently, but Guix does not, and similarly
> two objects O and O' such that Guix hashes them differently, but Git
> does not.  Finding particular values for O and O' would in some cases
> be computationally expensive, especially if you want to force a hash
> collision in SHA-256 instead of reusing the same files but attaching a
> different commit message, but theoretically possible, and if theoretic
> possibilities is something you want to base your policies on, that is a
> thing to consider.

A collision in a hashing function does not mean that the hash does not
*only* depend on the content.  A collision means that two contents produce
the same hash.  The final hash depends only on the content, whatever
the serializer is and however weak the hashing function is.


> I'm not trying to stoke fear, I'm arguing that "raw string in <git-
> reference> for robustness" is a bad take for a multitude of reasons.

1) No one is advocating replacing everything tomorrow with “Git SHA-1 commit
hash in <git-reference>”.  Instead, people explained what the
motivations for doing so are, what it would fix, and so on.

2) I still fail to see your multitude of reasons why it is bad.  Yes, for
sure, introducing more intrinsic values is not straightforward, socially
and in terms of tooling, but I have not read a multitude of fundamental
reasons against it.  Anyway.


Cheers,
simon



* Re: On raw strings in <origin> commit field
  2022-01-03 19:07                           ` zimoun
@ 2022-01-03 20:19                             ` Liliana Marie Prikler
  2022-01-03 23:00                               ` zimoun
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-03 20:19 UTC (permalink / raw)
  To: zimoun, Mark H Weaver, guix-devel

Am Montag, dem 03.01.2022 um 20:07 +0100 schrieb zimoun:
> This is somehow intrinsic* because public-inbox uses Message-ID as URL.
> Therefore, using emacs-notmuch, you just have to search for
> id:86y243kdoo.fsf@gmail.com for instance.
> 
> *intrinsic: no it is not intrinsic but self-contained. :-)
I am very happy with my mail reader not showing Message IDs for the
most part, but fair enough, I'll extend that courtesy to you and make
it do so.

In any case, I am not sure what I'm missing from those messages here. 
We both agree, that Guix and Git use different (S, H, F) with S being
the serializer, H a cryptographic hash function and F a formatter, for
their fundamental operations.  And that while Guix can produce git
hashes "internally" (I'm pretty sure it just calls to libgit, but it's
fine if it doesn't), it does not use them for anything meaningful in
its own logic.  In particular, Guix's content addressing scheme is based
on SHA-256 NAR hashes, which ought to be robust enough for anything
that relies on content addressing. 
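
To make the (S, H, F) triple concrete, here is a minimal Python sketch.
The blob header is Git's documented `blob <size>\0` prefix; `raw_serializer`
is a hypothetical stand-in (it is NOT the NAR format), and `address` is a
name chosen only for this illustration.

```python
import hashlib

def git_blob_serializer(data: bytes) -> bytes:
    # S for Git blobs: a "blob <size>\0" header followed by the raw bytes.
    return b"blob " + str(len(data)).encode() + b"\0" + data

def raw_serializer(data: bytes) -> bytes:
    # Hypothetical stand-in serializer: the bytes as-is (not NAR).
    return data

def hex_formatter(digest: bytes) -> str:
    # F: lowercase hexadecimal.
    return digest.hex()

def address(data: bytes, serializer, hash_name: str, formatter) -> str:
    # Compose S, then H, then F.
    return formatter(hashlib.new(hash_name, serializer(data)).digest())

content = b"hello\n"
git_id = address(content, git_blob_serializer, "sha1", hex_formatter)
other_id = address(content, raw_serializer, "sha256", hex_formatter)
print(git_id)
print(other_id)
```

With S = Git blob serialization, H = SHA-1 and F = hex, `git_id` is what
`git hash-object` reports for the same bytes.  Swapping any one of the
three components gives a different address for identical content.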

> > Anyway, the point here is a rather simple one that you can base on
> > your own explanations.  Due to the different ways Guix and Git
> > filter, serialize and hash content, you can have two objects O and
> > O', such that Git hashes O and O' differently, but Guix does not, and
> > similarly two objects O and O' such that Guix hashes them
> > differently, but Git does not.  Finding particular values for O and
> > O' would in some cases be computationally expensive, especially if
> > you want to force a hash collision in SHA-256 instead of reusing the
> > same files but attaching a different commit message, but
> > theoretically possible, and if theoretic possibilities is something
> > you want to base your policies on, that is a thing to consider.
> 
> Collision with hashing functions does not mean that the hash does not
> *only* depend on the content.  Collision means that 2 contents provides
> the same hash.  The final hashes only depends on the content, whatever
> the serializer is and as weak as the hashing function is.
That's not the case I'm making here.  The case I'm making is that Git
considers some content content, which Guix does not consider content. 
If I push the same file to two branches, once with the commit message
"Hello Mark" and once with "Hello Simon", they're the same file to
Guix, but different files to Git.
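
The point can be sketched in Python, assuming Git's usual object layout
(SHA-1 over a `<type> <size>\0` header plus body).  The author identity,
timestamp and file name are made up for the example, and the sketch does
not compute a NAR hash; it only shows that both commits point at the same
tree while getting different commit ids.

```python
import hashlib

def git_object_id(kind: str, body: bytes) -> str:
    # Git ids are SHA-1 over "<kind> <size>\0" followed by the body.
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_with_one_file(name: str, blob_id: str) -> bytes:
    # One tree entry: "<mode> <name>\0" plus the 20-byte binary blob id.
    return b"100644 " + name.encode() + b"\0" + bytes.fromhex(blob_id)

def commit_body(tree_id: str, message: str) -> bytes:
    # Hypothetical identity and timestamp, identical for both commits.
    ident = "A U Thor <author@example.org> 1641240000 +0000"
    return (f"tree {tree_id}\n"
            f"author {ident}\n"
            f"committer {ident}\n"
            f"\n{message}\n").encode()

blob_id = git_object_id("blob", b"hello\n")
tree_id = git_object_id("tree", tree_with_one_file("hello.txt", blob_id))
commit_mark = git_object_id("commit", commit_body(tree_id, "Hello Mark"))
commit_simon = git_object_id("commit", commit_body(tree_id, "Hello Simon"))

print("tree:  ", tree_id)
print("commit:", commit_mark)
print("commit:", commit_simon)
```

Only the commit message differs between the two commit objects, which is
enough to change the commit id while the file and tree ids stay the same.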

> 
> > I'm not trying to stoke fear, I'm arguing that "raw string in <git-
> > reference> for robustness" is a bad take for a multitude of
> > reasons.
> 
> 1) No one is advocating to replace tomorrow all by “Git SHA-1 commit
> hash in <git-reference>”.  Instead, people exposed what are the
> motivations to do so, what it would fix, and so on.
> 
> 2) I am still failing to understand your multitude bad reasons.  Yes
> for sure, introducing more intrinsic values is not straightforward,
> socially and about toolings, but I have not read multitude
> fundamentally bad reasons.  Anyway.
I think you're missing an important section of the exchange between
messages 867dbmi7pf.fsf@gmail.com and
3d448fe42f0c43574db96fa26aecd7da5fd5a95d.camel@gmail.com concerning
alternative styles, which you've since ignored.

My issues with the proposed style are:
1. It is inconsistent, choosing both to trust and not trust a tag
simultaneously.
2. It does not communicate anything about that choice to the reader.
3. It enables a particular class of "Tricking Peer Review" style
problems.
4a. The issue it tries to address is mostly a social one.
4b. Even if content addressing solves it, SHA-1 is a poor hash and
likely to introduce a similar robustness problem anyway.

Once again, there are tangible benefits to using a (let-bound) commit
inside a <git-reference>, particularly if you have a strong reason to
believe that an upstream might be unreliable.  However, virtually all
of these go down the drain if you do not do so in tandem with git-
version.

Cheers


* Re: On raw strings in <origin> commit field
  2022-01-02 22:57                     ` Mark H Weaver
@ 2022-01-03 21:25                       ` Liliana Marie Prikler
  2022-01-03 23:14                         ` Mark H Weaver
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-03 21:25 UTC (permalink / raw)
  To: Mark H Weaver, zimoun, guix-devel

Am Sonntag, dem 02.01.2022 um 17:57 -0500 schrieb Mark H Weaver:
> Hi Liliana,
> 
> Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
> 
> > Am Samstag, dem 01.01.2022 um 15:37 -0500 schrieb Mark H Weaver:
> > > Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
> > > > > Where is the Cantor-style diagonalization argument that you
> > > > > spoke of?
> > > > You skipped over it, read again.  The key point is that you're
> > > > referencing the thing you think will be invalidated to create
> > > > your scheme.
> > > I've carefully read your message at least 4 times, but I've been
> > > unable to find anything resembling Cantor's diagonalization
> > > argument in there.  Does anyone else see it?  Perhaps my powers
> > > of recognition are too weak.
> > Can I help your powers of recognition by describing everything in
> > terms of Turing machines?
> 
> How about pointing out what acts as the diagonal in your reasoning?
If you are talking specifically about the uncountability of real
numbers, that'd be quite deep down (as in an uncountability of push
actions to a particular Git repo, particularly if we also allow
reinitialization).  My overview was admittedly too high-level; I jumped
ahead to the "this sentence is a lie" statement, which is "I trust,
that this tag which I don't trust, resolves to a particular commit". 
Mea culpa.



* Re: On raw strings in <origin> commit field
  2022-01-03 20:19                             ` Liliana Marie Prikler
@ 2022-01-03 23:00                               ` zimoun
  2022-01-04  5:23                                 ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: zimoun @ 2022-01-03 23:00 UTC (permalink / raw)
  To: Liliana Marie Prikler, Mark H Weaver, guix-devel

Hi Liliana,

On Mon, 03 Jan 2022 at 21:19, Liliana Marie Prikler <liliana.prikler@gmail.com> wrote:

> That's not the case I'm making here.  The case I'm making is that Git
> considers some content content, which Guix does not consider content.
> If I push the same file to two branches, once with the commit message
> "Hello Mark" and once with "Hello Simon", they're the same file to
> Guix, but different files to Git.

You are saying what I predicted you would say :-) ; here I quote myself:

        The statement «Git commit hashes do not just depend on the
        content» is wrong.  Specifically, it is by adding more
        well-chosen content that the hash is tricked.  Now, Liliana, you
        can define “content“ by useful content opposed as meta-content,
        etc.  Well, it is fine but the statement still appears to me
        wrong because Git commit hash only depends on the content
        itself.

        <https://yhetil.org/guix/86bl0url52.fsf@gmail.com>

Sorry, I used meta-content instead of “content content”. :-) In any
case, that is part of the content that is hashed.  Well, maybe I
appear picky but I feel there is a fundamental misunderstanding
somewhere. :-)

Let me just recap. :-)  You wrote, full quotation:

        Git commit hashes do not just depend on the content.  They also
        depend on how much effort you put into solving a proof of work
        challenge that won't ever earn you crypto coins [1].

        <https://yhetil.org/guix/3d448fe42f0c43574db96fa26aecd7da5fd5a95d.camel@gmail.com>

And the statement is incorrect, especially in the light of your last
comment.  If you read my first email, I took the example:

        $ cat /tmp/foo.txt | git hash-object --stdin
        557db03de997c86a4a028e1ebd3a1ceb225be238

        <https://yhetil.org/guix/86y243kdoo.fsf@gmail.com>

where there is no “content content” (using your own words).  Git is a
serializer, NAR is another.  The content that you serialize does not
matter,

        $ echo hello > hello.txt

        $ cat hello.txt | git hash-object --stdin
        ce013625030ba8dba906f756967f9e9ca394464a

        $ guix hash -S git -H sha1 -f hex hello.txt
        ce013625030ba8dba906f756967f9e9ca394464a

Or using the default format and hash function, comparing Git and Nar
serializers:

        $ guix hash -S nar hello.txt
        04zwf782yjwnh3q6hz5izfd6jyip8kgw6g6yj43fiqhbyhdd0dqw

        $ guix hash -S git hello.txt
        1d7bp5nmgpi5j1ikglw3l7ry7dzczlhp8wl79arl75g2kqyxiy1c

The content can be one file, some files, folders, etc., or Git objects
such as a Git commit object or a Git tree object.  Therefore, the Git
commit hash only depends on the content itself, i.e., the Git commit object,
as explained by the pointer provided earlier in the thread,

    <https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>

On the other hand, we could build our own Guix-DVCS using
NAR+SHA-256+Nix-base32 instead of Git's Git+SHA-1+Hex. ;-)

Lastly, an identifier is different from an integrity checksum.  In
particular, if a collision is possible, then it invalidates the integrity
checksum.  A collision for an identifier does not matter so much
security-wise, because it just implies that the lookup is not unique or
that one content is unreachable.  Well, a collision is for sure an issue and
can break the content-addressed system, and can even lead to security
troubles in extreme cases, but also for sure, it does not change the relation
between Git commit hash and content.
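
A toy illustration of “one content is unreachable”, as a hedged Python
sketch: the identifier is deliberately weakened to a single hex digit so
that collisions are guaranteed, and the store keeps the first content it
sees for a given id.  Both `weak_id` and the first-writer-wins policy are
assumptions made for this example only.

```python
import hashlib

def weak_id(data: bytes) -> str:
    # Deliberately weak identifier: first hex digit of SHA-1 (16 values).
    return hashlib.sha1(data).hexdigest()[:1]

store = {}        # id -> content, first writer wins
unreachable = []  # contents that lost their id to an earlier content

for i in range(17):  # 17 contents into 16 ids: a collision must occur
    content = f"content {i}\n".encode()
    key = weak_id(content)
    if key in store and store[key] != content:
        unreachable.append(content)
    else:
        store[key] = content

print(len(store), "reachable,", len(unreachable), "unreachable")
```

In both cases the id is still a pure function of the content; the
collision only breaks the lookup from id back to content.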

For the rest, it does not matter if we agree or not because this
discussion is far from cooking the rice – well, there are no concrete
outcomes.  Your last paragraph,

        Once again, there are tangible benefits to using a (let-bound)
        commit inside a <git-reference>, particularly if you have a
        strong reason to believe that an upstream might be unreliable.
        However, virtually all of these go down the drain if you do not
        do so in tandem with git- version.

is an acceptable ending. :-) Therefore, from my side, I consider it as a
final word.

Cheers,
simon


* Re: On raw strings in <origin> commit field
  2022-01-03 21:25                       ` Liliana Marie Prikler
@ 2022-01-03 23:14                         ` Mark H Weaver
  2022-01-04 19:55                           ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: Mark H Weaver @ 2022-01-03 23:14 UTC (permalink / raw)
  To: Liliana Marie Prikler, zimoun, guix-devel

Hi Liliana,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

> Am Sonntag, dem 02.01.2022 um 17:57 -0500 schrieb Mark H Weaver:
>> Hi Liliana,
>> 
>> Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
>> 
>> > Am Samstag, dem 01.01.2022 um 15:37 -0500 schrieb Mark H Weaver:
>> > > Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
>> > > > > Where is the Cantor-style diagonalization argument that you
>> > > > > spoke of?
>> > > > You skipped over it, read again.  The key point is that you're
>> > > > referencing the thing you think will be invalidated to create
>> > > > your scheme.
>> > > I've carefully read your message at least 4 times, but I've been
>> > > unable to find anything resembling Cantor's diagonalization
>> > > argument in there.  Does anyone else see it?  Perhaps my powers
>> > > of recognition are too weak.
>> > Can I help your powers of recognition by describing everything in
>> > terms of Turing machines?
>> 
>> How about pointing out what acts as the diagonal in your reasoning?
> If you are talking specifically about the uncountability of real
> numbers, that'd be quite deep down (as in an uncountability of push
> actions to a particular Git repo, particularly if we also allow
> reinitialization).

Hmm.  I think that the set of push actions to a particular Git repo is
countable, even if we allow reinitialization.  What makes you think that
it's uncountable?  I'd be very curious to see your argument for that.

> My overview was admittedly too high-level; I jumped
> ahead to the "this sentence is a lie" statement, which is "I trust,
> that this tag which I don't trust, resolves to a particular commit". 
> Mea culpa.

I appreciate these words, thank you.

     Regards,
       Mark

-- 
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>.



* Re: On raw strings in <origin> commit field
  2022-01-03 23:00                               ` zimoun
@ 2022-01-04  5:23                                 ` Liliana Marie Prikler
  2022-01-04  8:51                                   ` zimoun
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-04  5:23 UTC (permalink / raw)
  To: zimoun, Mark H Weaver, guix-devel

Hi simon,

Am Dienstag, dem 04.01.2022 um 00:00 +0100 schrieb zimoun:
> [...]
> 
> You are saying what I predicted you will say :-) ; here I quote
> myself:
> 
>         The statement «Git commit hashes do not just depend on the
>         content» is wrong.  [...]
> 
> Sorry, I used meta-content instead of “content content”. :-) In all
> cases, that is part of the content that is hashed.  Well, maybe I
> appears picky but I feel there is a fundamental misunderstanding
> somewhere. :-)
> 
> [...]
>         $ echo hello > hello.txt
> 
>         $ cat hello.txt | git hash-object --stdin
>         ce013625030ba8dba906f756967f9e9ca394464a
> 
>         $ guix hash -S git -H sha1 -f hex hello.txt
>         ce013625030ba8dba906f756967f9e9ca394464a
> 
> Or using the default format and hash function, comparing Git and Nar
> serializers:
> 
>         $ guix hash -S nar hello.txt
>         04zwf782yjwnh3q6hz5izfd6jyip8kgw6g6yj43fiqhbyhdd0dqw
> 
>         $ guix hash -S git hello.txt
>         1d7bp5nmgpi5j1ikglw3l7ry7dzczlhp8wl79arl75g2kqyxiy1c
> 
> The content can be one file, some files, folders, etc.  or Git
> objects as Git commit object or Git tree object or whatever. 
> Therefore, Git commit hash only depends on the content itself, i.e.,
> Git commit object; as explained by the pointer provided earlier in
> the thread,
> 
>     <https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>
At some point in there (you can figure out yourself where), you are
mistakenly equating file hashes and commit hashes, which you then
compare to other tools that only regard the files as content.  One
of them is immutable for all I know, the other is subject to very
observable changes.

> Last, identifier is different from integrity checksum.  Especially,
> if a collision is possible, then it invalidates the integrity
> checksum.  A collision for an identifier does not matter so much
> security-wise, because it just implies that the lookup is not unique
> or that one content is unreachable.  Well, collision is for sure an
> issue and can break the content-address system, even can lead to
> security troubles for extreme cases, but also for sure, it does not
> change the relation between Git commit hash and content.
"Lookup is not unique" is exactly the problem we have for robustness
with tags, however.

> Therefore, from my side, I consider it as a final word.
Fair enough, we can agree on the conclusion even if reached by
different methods.

Cheers



* Re: On raw strings in <origin> commit field
  2022-01-04  5:23                                 ` Liliana Marie Prikler
@ 2022-01-04  8:51                                   ` zimoun
  2022-01-04 13:15                                     ` zimoun
  0 siblings, 1 reply; 63+ messages in thread
From: zimoun @ 2022-01-04  8:51 UTC (permalink / raw)
  To: Liliana Marie Prikler, Mark H Weaver, guix-devel


On Tue, 04 Jan 2022 at 06:23, Liliana Marie Prikler <liliana.prikler@gmail.com> wrote:

>> The content can be one file, some files, folders, etc.  or Git
>> objects as Git commit object or Git tree object or whatever. 
>> Therefore, Git commit hash only depends on the content itself, i.e.,
>> Git commit object; as explained by the pointer provided earlier in
>> the thread,
>> 
>>     <https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>
>
> At some point in there (you can figure out yourself where), you are
> mistakenly equating file hashes and commit hashes, which you make
> comparison to other tools which only regard the files as content.  One
> of them is immutable for all I know, the other is subject to very
> observable changes.

Incorrect.

I quote myself:

        then, for what my opinion is worth on that matter, my probably
        wrong understanding of your words is that perhaps you are
        missing a point about content-addressability.

Loop.



* Re: On raw strings in <origin> commit field
  2022-01-04  8:51                                   ` zimoun
@ 2022-01-04 13:15                                     ` zimoun
  2022-01-04 19:45                                       ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: zimoun @ 2022-01-04 13:15 UTC (permalink / raw)
  To: Liliana Marie Prikler, Mark H Weaver, guix-devel

Hi Liliana,

On Tue, 04 Jan 2022 at 09:51, zimoun <zimon.toutoune@gmail.com> wrote:
> On Tue, 04 Jan 2022 at 06:23, Liliana Marie Prikler <liliana.prikler@gmail.com> wrote:
>
>>> The content can be one file, some files, folders, etc.  or Git
>>> objects as Git commit object or Git tree object or whatever. 
>>> Therefore, Git commit hash only depends on the content itself, i.e.,
>>> Git commit object; as explained by the pointer provided earlier in
>>> the thread,
>>> 
>>>     <https://git-scm.com/book/en/v2/Git-Internals-Git-Objects>
>>
>> At some point in there (you can figure out yourself where), you are
>> mistakenly equating file hashes and commit hashes, which you make
>> comparison to other tools which only regard the files as content.  One
>> of them is immutable for all I know, the other is subject to very
>> observable changes.

Let us pick one commit in the Git history, for instance:
e598e46913c661bc92df813d537eeb6be5a86471. 

--8<---------------cut here---------------start------------->8---
$ git --no-pager show e598e46913c661bc92df813d537eeb6be5a86471

commit e598e46913c661bc92df813d537eeb6be5a86471
Author: Tobias Geerinckx-Rice <me@tobias.gr>
Date:   Tue Oct 26 02:01:41 2021 +0200

    gnu: darkhttpd: Update to 1.13.
    
    * gnu/packages/web.scm (darkhttpd): Update to 1.13.
    [source]: Use GIT-FETCH and GIT-FILE-NAME.
    [arguments]: Don't explicitly return #t from phases.

diff --git a/gnu/packages/web.scm b/gnu/packages/web.scm
index dc5a9d61a8..2bd3c4ea13 100644
--- a/gnu/packages/web.scm
+++ b/gnu/packages/web.scm
@@ -5791,28 +5791,28 @@ (define-public surfraw
[...]
--8<---------------cut here---------------end--------------->8---

Let us copy the content of this Git commit somewhere:

    $ mkdir -p /tmp/kikoo
    $ cp .git/objects/e5/98e46913c661bc92df813d537eeb6be5a86471 \
         /tmp/kikoo/content

Now, let us run the following inside a container:

        $ cd /tmp/kikoo
        $ guix shell -C coreutils gzip
        [env]$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" \
                  | cat - content | gzip -d | sha1sum

        gzip: stdin: unexpected end of file
        e598e46913c661bc92df813d537eeb6be5a86471  -

Explain that to me.  « Git commit hash only depends on the content itself,
i.e., Git commit object », as I wrote.

Instead of taking a superior tone «(you can figure out yourself where)»,
I would prefer that you correctly read the messages I wrote.  Maybe
that’s why my previous email was a bit harsh, sorry.


Specifically, the content of this Git commit is just:

--8<---------------cut here---------------start------------->8---
$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" |cat - content | gzip -d

commit 662tree 44207c0d8c9e885b156bb98562221beb2ab8f7bf
parent b8fc7c23596b542b81306829b31cf255d908fa65
author Tobias Geerinckx-Rice <me@tobias.gr> 1635206501 +0200
committer Tobias Geerinckx-Rice <me@tobias.gr> 1635206568 +0200
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iIMEABYKACsWIQT12iAyS4c9C3o4dnINsP+IT1VteQUCYXdFqA0cbWVAdG9iaWFz
 LmdyAAoJEA2w/4hPVW15OoAA+gIzCyrlUEBUFdLp0CBFW1GxjVzYiSFMmk5aNDNu
 OngEAQDoisl2dQK7lLvIl/2xXDqoJ2CAoiqZ1DbGNyg3Yt/DCA==
 =RH+I
 -----END PGP SIGNATURE-----

gnu: darkhttpd: Update to 1.13.

* gnu/packages/web.scm (darkhttpd): Update to 1.13.
[source]: Use GIT-FETCH and GIT-FILE-NAME.
[arguments]: Don't explicitly return #t from phases.
--8<---------------cut here---------------end--------------->8---

The files and folders are in a Git tree object, referred to in the Git
commit object by ’tree 44207c0d8c9e885b156bb98562221beb2ab8f7bf’.

I will let you run this command from the Guix git repo:

    $ git cat-file -p 44207c0d8c9e885b156bb98562221beb2ab8f7bf
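
For readers who prefer not to rely on the `gzip` header trick, the same
decompress-and-hash step can be sketched in Python.  This assumes loose
objects are plain zlib streams, which is how Git stores them; the commit
body below is a shortened, made-up stand-in rather than the real darkhttpd
commit, so the resulting id differs from the one above.
`read_loose_object` is a hypothetical helper name.

```python
import hashlib
import zlib

def read_loose_object(compressed: bytes):
    # Inflate the zlib stream, then split "<kind> <size>\0" from the body.
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\0")
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    # The object id is SHA-1 over the whole inflated stream, header included.
    return kind.decode(), body, hashlib.sha1(raw).hexdigest()

# Made-up commit body that only references the tree id quoted above.
body = b"tree 44207c0d8c9e885b156bb98562221beb2ab8f7bf\n\nexample message\n"
raw = b"commit " + str(len(body)).encode() + b"\0" + body
stored = zlib.compress(raw)  # what would sit under .git/objects/xx/...

kind, parsed_body, object_id = read_loose_object(stored)
print(kind, object_id)
print(parsed_body.decode().splitlines()[0])
```

The commit object carries only a reference to the tree, plus metadata and
the message; the files themselves are reached through that tree id.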


Please re-read all your answers and mine.  I hope you will see where
you were incorrect.


Cheers,
simon



* Re: On raw strings in <origin> commit field
  2022-01-04 13:15                                     ` zimoun
@ 2022-01-04 19:45                                       ` Liliana Marie Prikler
  2022-01-04 19:53                                         ` zimoun
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-04 19:45 UTC (permalink / raw)
  To: zimoun, Mark H Weaver, guix-devel

Hi simon,

> Please re-read all your answers and mines.  I hope you will see where
> you were incorrect.
I don't think there's anything to see here.  Believe it or not, but
you've so far been boiling my water multiple times only to then throw
it into my face as I attempt to put the rice in.  Admittedly, that's a
little frustrating.  However, I am a big girl and I can handle getting
hot and wet.

> [F]or what my opinion is worth on that matter, my probably wrong
> understanding of your words is that perhaps you are missing a point
> about content-addressability.

Am Dienstag, dem 04.01.2022 um 14:15 +0100 schrieb zimoun:
> Let pick one commit in the Git history, for instance:
> e598e46913c661bc92df813d537eeb6be5a86471.  [...]
> 
> Explain me.  « Git commit hash only depends on the content itself,
> i.e., Git commit object », as I wrote.
That's exactly the point.  For the entirety of this discussion, I've
been assuming the content (i.e. "the content content" or "the content
without meta-content" or however else you want to term it) to be "that
which is hashed by Guix", which if using git-fetch is the working
directory (using Git parlance correctly here, hopefully, correct me if
not) sans the Git subdirectory.  I am sure you have at least a rough
understanding of how that ought to work yourself, but for a more in-
depth analysis of what goes into that, see Timothy's message
> Note for all of this that my scripts treat the SHA256 hash as *the*
> identifier for a source.  That is, if a tag is mutated and someone
> adjusts the origin URI to point to the commit that the tag used to
> refer to, I would not notice.  Similarly, for tricking peer review:
> fixing the URI to match the hash is invisible to me.  It’s only when we
> fix the hash to match the URI that I notice.
from 87ee5pspza.fsf@ngyro.com

> Instead of taking a superior tone «(you can figure out yourself
> where)», I would prefer that you correctly read the messages I
> wrote.  Maybe, that’s why my previous email is probably is bit harsh,
> sorry.
I apologize, I had not intended that to be a superior tone.  I wanted
this to be a less authoritarian version of « you have to figure out
yourself where », leaving open some room for you to not bother any
longer, but my attempt failed.

For the sake of transparency, you are (in my opinion at least) making a
leap here in that you think I somehow care about the hash of a commit
message, which in fact I couldn't care less about other than the
obvious fact that it changes with it.  I can't pinpoint where exactly
along these lines you might have mistakenly got that impression or I
might mistakenly have conveyed it, but you appear to have a rather
convincing reason for you to do so.  Therefore, before going off on yet
another tangent I wanted you to make sure whether that is in fact the
case.


In short, not all content addressing schemes are equal and the content
by which Git addresses its commits is completely irrelevant to Guix (by
virtue of it deleting its means to do so anyway).  I hope that cleared
up any misconceptions.  If not, feel free to ask.

Cheers



* Re: On raw strings in <origin> commit field
  2022-01-04 19:45                                       ` Liliana Marie Prikler
@ 2022-01-04 19:53                                         ` zimoun
  0 siblings, 0 replies; 63+ messages in thread
From: zimoun @ 2022-01-04 19:53 UTC (permalink / raw)
  To: Liliana Marie Prikler; +Cc: Guix Devel

Hi Liliana,

On Tue, 4 Jan 2022 at 20:45, Liliana Marie Prikler
<liliana.prikler@gmail.com> wrote:

> I don't think there's anything to see here.  Believe it or not, but
> you've so far been boiling my water multiple times only to then throw
> it into my face as I attempt to put the rice in.  Admittedly, that's a
> little frustrating.  However, I am a big girl and I can handle getting
> hot and wet.

I apologize.  It was not my intent.

For the rest, we have said what we had to say.

Cheers,
simon



* Re: On raw strings in <origin> commit field
  2022-01-03 23:14                         ` Mark H Weaver
@ 2022-01-04 19:55                           ` Liliana Marie Prikler
  2022-01-04 23:42                             ` Mark H Weaver
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-04 19:55 UTC (permalink / raw)
  To: Mark H Weaver, zimoun, guix-devel

Am Montag, dem 03.01.2022 um 18:14 -0500 schrieb Mark H Weaver:
> 
> > If you are talking specifically about the uncountability of real
> > numbers, that'd be quite deep down (as in an uncountability of push
> > actions to a particular Git repo, particularly if we also allow
> > reinitialization).
> 
> Hmm.  I think that the set of push actions to a particular Git repo
> is countable, even if we allow reinitialization.  What makes you
> think that it's uncountable?  I'd be very curious to see your
> argument for that.
I can always add another by force-pushing a tree with a single commit,
garbage collecting and then pushing regular commits on top, which
should be able to reuse previously existing hashes.  Even if not, since
we identify a repo by URL, I could ask the forge to delete it and then
push a completely reinitialized repo under the same name.

Speaking about destructive operations, there is one course of actions
that an upstream could take if they deem 2^160 to be too small an
address space, which would be sane for consumers using tags, but insane
for those using commits.  And that would be squashing all the commits
between two tags to a single one, correctly reapplying tags and (force-
)pushing the resulting repo.  Though again, the blow for commit users
might be softened through git vanity this time :)

I'm not saying the above should have any impact on what we feed to git-
fetch, but it's a fun shower thought.



* Re: On raw strings in <origin> commit field
  2022-01-04 19:55                           ` Liliana Marie Prikler
@ 2022-01-04 23:42                             ` Mark H Weaver
  2022-01-05  9:28                               ` Mark H Weaver
  0 siblings, 1 reply; 63+ messages in thread
From: Mark H Weaver @ 2022-01-04 23:42 UTC (permalink / raw)
  To: Liliana Marie Prikler, zimoun, guix-devel

Hi Liliana,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

> Am Montag, dem 03.01.2022 um 18:14 -0500 schrieb Mark H Weaver:
>> 
>> > If you are talking specifically about the uncountability of real
>> > numbers, that'd be quite deep down (as in an uncountability of push
>> > actions to a particular Git repo, particularly if we also allow
>> > reinitialization).
>> 
>> Hmm.  I think that the set of push actions to a particular Git repo
>> is countable, even if we allow reinitialization.  What makes you
>> think that it's uncountable?  I'd be very curious to see your
>> argument for that.

> I can always add another by force-pushing a tree with a single commit,
> garbage collecting and then pushing regular commits on top, which
> should be able to reuse previously existing hashes.  Even if not, since
> we identify a repo by URL, I could ask the forge to delete it and then
> push a completely reinitialized repo under the same name.

Sorry, but this is not even close to a valid argument that the set of
possible push actions to a Git repo is uncountable.  In fact, it's quite
easy to prove that the set is countable.  Any mathematician will know this.

I'll also note that the Cantor-style diagonalization argument, which you
have repeatedly claimed to have, is still nowhere to be found.  To make
matters worse, you gaslit me earlier by claiming that I was failing to
see it in your earlier message, when it's clearly not there.  If you
have such an argument, you're apparently unwilling to share it with us.

       Mark

-- 
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>.



* Re: On raw strings in <origin> commit field
  2022-01-04 23:42                             ` Mark H Weaver
@ 2022-01-05  9:28                               ` Mark H Weaver
  2022-01-05 20:43                                 ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: Mark H Weaver @ 2022-01-05  9:28 UTC (permalink / raw)
  To: Liliana Marie Prikler, zimoun, guix-devel

Hi Liliana,

Earlier, I wrote:
> Sorry, but this is not even close to a valid argument that the set of
> possible push actions to a Git repo is uncountable.  In fact, it's quite
> easy to prove that the set is countable.  Any mathematician will know this.

I suppose I should give a proof.  I'll give two proofs.

First I'll give a very simple proof that depends on assuming that every
push action can be uniquely represented as a message of finite length.

You may imagine this representation as being something like the output
of "git log --patch --format=fuller" but with _every_ piece of
information that git has about the pushed commits included.

Better yet, you could imagine the representation as being closer to what
git push actually sends to the server: a set of references to mutate,
and a pack of objects reachable by those references.  As an
optimization, there is a more complex two-way exchange to avoid sending
objects that the server already has, but for purposes of this proof you
could imagine eliminating that optimization.

Anyway, if you'll accept that every push action can be represented as a
message of finite length, then the proof is very simple:

Observe that there exists a one-to-one correspondence between push
actions and a subset of messages of finite length, and moreover that
there exists a one-to-one correspondence between messages of finite
length and a subset of the natural numbers.

Therefore, there exists a one-to-one correspondence between push actions
and a subset of the natural numbers.  Thus, the set of push actions is
countable, by definition.  Q.E.D.
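
The second correspondence can be made concrete.  What follows is only a
minimal sketch, assuming a message is a finite byte string; the function
names are hypothetical and nothing here is taken from Git itself.  The
message is read as a base-256 numeral with a sentinel byte in front, so
that leading zero bytes are not lost.

```python
def message_to_nat(message: bytes) -> int:
    """Map a finite byte string to a natural number, injectively."""
    # The 0x01 sentinel keeps b"", b"\x00", b"\x00\x00", ... distinct.
    return int.from_bytes(b"\x01" + message, "big")


def nat_to_message(n: int) -> bytes:
    """Invert message_to_nat on its image; reject other numbers."""
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    if not raw or raw[0] != 1:
        raise ValueError("not the encoding of any message")
    return raw[1:]
```

Not every natural number is hit by this encoding, which is fine: the
proof only needs a correspondence with a subset of the natural numbers.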

=-=-=

If you don't accept the assumption above, here's a more detailed proof
that could be adapted to *any* kind of action that can be communicated
over a digital communications channel in finite time:

First, note that every push action can be performed by exchanging a
finite number of messages of finite length between the client and
server, hereafter referred to as a "finite exchange".

Let P be the set of possible push actions.

Let V be the subset of finite exchanges that represent a push action,
hereafter referred to as a "valid exchange".

Let f : V -> P be a function that maps valid exchanges to the push
actions that they represent.

Observe that f is surjective, i.e. for all p in P, there exists v in V
such that f(v) = p.  In other words, for every push action, there exists
at least one valid exchange that represents that push action.

Since f : V -> P is a surjective function, it follows that the
cardinality of V is greater than or equal to the cardinality of P.

Now, note that any valid exchange can be represented as a bit string of
finite length, and therefore as a natural number.  (There are many ways
to do this; the details are left as an exercise).

Therefore, the set V of valid exchanges is in one-to-one correspondence
with a subset of the natural numbers, and is therefore countable, by
definition.

Since V is countable, and it has cardinality greater than or equal to
the cardinality of P, it follows that P (the set of push actions) is
also countable.  Q.E.D.
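
For the step left as an exercise, here is one possible encoding, given
as a sketch under assumptions: an exchange is modelled as a finite list
of byte strings, and the framing (decimal length, colon, payload) is an
illustrative choice, not Git's wire protocol.  All names are
hypothetical.

```python
from typing import List


def encode_exchange(messages: List[bytes]) -> bytes:
    """Flatten a finite list of finite messages into one byte string."""
    out = bytearray()
    for m in messages:
        # Length prefix makes message boundaries unambiguous.
        out += str(len(m)).encode("ascii") + b":" + m
    return bytes(out)


def decode_exchange(data: bytes) -> List[bytes]:
    """Recover the list of messages from encode_exchange's output."""
    messages = []
    i = 0
    while i < len(data):
        colon = data.index(b":", i)
        length = int(data[i:colon].decode("ascii"))
        start = colon + 1
        messages.append(data[start:start + length])
        i = start + length
    return messages


def exchange_to_nat(messages: List[bytes]) -> int:
    """Map an exchange to a natural number, injectively."""
    # Sentinel byte 0x01 keeps leading zero bytes significant.
    return int.from_bytes(b"\x01" + encode_exchange(messages), "big")
```

Because decode_exchange inverts encode_exchange, two different
exchanges can never share a number, which is all the argument requires.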

      Mark



* Re: On raw strings in <origin> commit field
  2022-01-05  9:28                               ` Mark H Weaver
@ 2022-01-05 20:43                                 ` Liliana Marie Prikler
  2022-01-06 10:38                                   ` Mark H Weaver
  0 siblings, 1 reply; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-05 20:43 UTC (permalink / raw)
  To: Mark H Weaver, zimoun, guix-devel

Am Mittwoch, dem 05.01.2022 um 04:28 -0500 schrieb Mark H Weaver:
> Hi Liliana,
> 
> Earlier, I wrote:
> > Sorry, but this is not even close to a valid argument that the set
> > of possible push actions to a Git repo is uncountable.  In fact,
> > it's quite easy to prove that the set is countable.  Any
> > mathematician will know this.
> 
> I suppose I should give a proof.  I'll give two proofs.
Now, those two proofs are nice and I'm not going to dispute them. 
However, just for the record I do think that you're paying a little too
much attention to formalisms and too little to everything else I'm
saying.  In the case of Cantor vs. the liar paradox (which I've wrongly
attributed to Cantor because his proof is being used in Gödel's
incompleteness theorem and the Halting Problem¹), that's completely on
me, although I'm not sure whether that suffices for gaslighting.  In
any case, it was not intentional.  In this context, on the other hand,
it ought to have been comparatively easy to infer that I was talking
about push actions (plural) as sequences, not as individual push
actions like you've used for your proof.

From here on, I will assume that each individual push action is finite
as you did, but I don't think that using communications of finite
length are a helpful building block here.  Porting Git to Turing
Machines would have the effect of allowing an infinite tape shared
between multiple machines and they could possibly run forever. 
However, I am going to assume that each individual push action is
finite anyway, as your assumptions will typically hold for reasons of
practicability.  Note, though, that this would not suffice for
robustness arguments.  Whatever communications you make at a given
time, you can not be sure that your observations will be replicated if
you repeat them (even if restricted to fetching only).

As we're trying to generalize your proof for a single push action to be
chosen among a finite set to all communications to a series of push
actions, we do encounter a problem if we were to encode this as a mere
list of push actions.  This can be done by a rather simple Cantor
proof:  The set of lists of a particular type T which admits at least
two values is uncountable (a list of booleans can be directly mapped to
a binary number and thus Cantor's original proof applied).  Since there
are more than two permissible push actions, we have a problem.

Luckily, there is a straightforward solution.  A push action can only
create new commits (of which there are finitely many, both in terms of
the content committed, as per your argument, and in their count, given
the limited number of SHA-1 hashes), create new branches (which must
have finitely long names, as per your argument), or create tags (which
also must have finitely long names, again per your argument).  There are no
other operations worth mentioning and a push action must contain at
least one action to be valid.  Therefore, a valid sequence of push
actions is finite, as it will inevitably end up using all SHA-1 hashes
as well as all possible branch and tag names in some configuration. 
Q.E.D.

Remember the ¹ I wrote earlier?  Now originally, I had planned to write
a formal proof to show how validating a tag (i.e. making sure it does
not get assigned to a different commit) and likewise validating a
commit after SHA-whateverthenextattackwillbenamed would be co-RE-hard.
It would likely have invoked Cantor at some point, though again perhaps
at a lower level than you'd have liked.  But I am sure you agree, that
this proof that Git is completely safe, actually, is much nicer.  There
shouldn't be anything that could discredit it, least of all something
past me would have said.  WDYT?


* Re: On raw strings in <origin> commit field
  2022-01-05 20:43                                 ` Liliana Marie Prikler
@ 2022-01-06 10:38                                   ` Mark H Weaver
  2022-01-06 11:25                                     ` Liliana Marie Prikler
  0 siblings, 1 reply; 63+ messages in thread
From: Mark H Weaver @ 2022-01-06 10:38 UTC (permalink / raw)
  To: Liliana Marie Prikler, zimoun, guix-devel

Hi Liliana,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:
> it ought to have been comparatively easy to infer that I was talking
> about push actions (plural) as sequences, not as individual push
> actions like you've used for your proof.

It makes no difference, because the set of push actions is closed under
sequential composition.  Moreover, the argument in my proof applies to
*any* kind of action that can be communicated over a digital
communications channel in finite time.

> From here on, I will assume that each individual push action is finite
> as you did, but I don't think that using communications of finite
> length are a helpful building block here.

Really?  You don't think communications of finite length are a helpful
building block here?

Note that in the real world, every observable time interval is finite.

> Porting Git to Turing
> Machines would have the effect of allowing an infinite tape shared
> between multiple machines and they could possibly run forever.

I'm sorry, I thought we were talking about "Git" as it exists in the
real world.  Please recall that the motivation for my proof was to
refute the following claim of yours from a few messages ago:

>> How about pointing out what acts as the diagonal in your reasoning?
>
> If you are talking specifically about the uncountability of real
> numbers, that'd be quite deep down (as in an uncountability of push
> actions to a particular Git repo, particularly if we also allow
> reinitialization).

You referred to "Git" here.  That's why my proof was about Git.  You
didn't say that you were talking about some theoretical variant of Git
that supports "push actions" that literally *never* end, and that runs
on a theoretical machine with infinite memory that can never be built.

> As we're trying to generalize your proof for a single push action to
> be chosen among a finite set to all communications to a series of push
> actions, we do encounter a problem if we were to encode this as a mere
> list of push actions.  This can be done by a rather simple Cantor
> proof: The set of lists of a particular type T which admits at least
> two values is uncountable (a list of booleans can be directly mapped
> to a binary number and thus Cantor's original proof applied).

No, that's incorrect.  The set of lists of booleans is countable.
Moreover, for any type T, if the set of objects of type T is countable,
then the set of _lists_ of objects of type T is also countable.

Note that lists are finite, by definition.
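
To illustrate the claim about lists of booleans, here is a small sketch
(the function names are hypothetical) of an explicit one-to-one
correspondence between finite lists of booleans and the natural
numbers.  Cantor's diagonal argument applies to infinite sequences, not
to finite lists like these.

```python
def bools_to_nat(bits):
    """Map a finite list of booleans to a natural number (a bijection)."""
    # Write a leading 1, then the bits in binary; subtract 1 so that
    # the empty list maps to 0.
    n = 1
    for b in bits:
        n = 2 * n + (1 if b else 0)
    return n - 1


def nat_to_bools(n):
    """Inverse of bools_to_nat."""
    digits = bin(n + 1)[3:]  # drop the '0b' prefix and the leading 1
    return [c == "1" for c in digits]
```

Every natural number decodes to exactly one list, and every list encodes
to exactly one number, so the set of such lists is countable.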

If you intend to claim that by "push actions", you meant to include
infinite streams of push actions: this has no relevance to the real
world.

Even if we live in an open universe, the fact remains that at any
arbitrary point in time, the age of any git repository is always finite.
Therefore, no one can ever observe the result of an infinite-length
"push action" in the real world.  They can only ever observe the result
of some finite prefix of that so-called "push action".

* * *

I'm sorry, but I'm growing tired of this discussion.  I suppose you will
want to have the last word.  I'll try to resist the temptation to
correct any errors in it, but my silence should not be interpreted as
acceptance of your future claims.

     Regards,
       Mark



* Re: On raw strings in <origin> commit field
  2022-01-06 10:38                                   ` Mark H Weaver
@ 2022-01-06 11:25                                     ` Liliana Marie Prikler
  0 siblings, 0 replies; 63+ messages in thread
From: Liliana Marie Prikler @ 2022-01-06 11:25 UTC (permalink / raw)
  To: Mark H Weaver, zimoun, guix-devel

Hi,

Am Donnerstag, dem 06.01.2022 um 05:38 -0500 schrieb Mark H Weaver:
> > From here on, I will assume that each individual push action is
> > finite as you did, but I don't think that using communications of
> > finite length are a helpful building block here.
> 
> Really?  You don't think communications of finite length are a
> helpful building block here?
> 
> Note that in the real world, every observable time interval is
> finite.
The reason I don't is because if we go all the way back to what we
actually want to do, we're invoking claims of an uncertain future.  At
the time of pushing a package description to guix.git, we already ran
whatever tests we had for the upstream and determined what we believe
to be a safe course of action.  In theory, this belief could be
shattered mere seconds after.

> You referred to "Git" here.  That's why my proof was about Git.  You
> didn't say that you were talking about some theoretical variant of
> Git that supports "push actions" that literally *never* end, and that
> runs on a theoretical machine with infinite memory that can never be
> built.
Now to be sure, we're both talking about Git here; the porting to
Turing machines is only relevant because, if we go that deep into
Computer Science basics, we might as well use them.
The scary parts of Git that we use to make robustness claims are:
1. In the future, upstream might delete the repo.
2. In the future, upstream might update or delete a branch.
3. In the future, upstream might update or delete a tag.
4. In the future, upstream might update or delete a commit.

> No, that's incorrect.  The set of lists of booleans is countable.
> Moreover, for any type T, if the set of objects of type T is
> countable, then the set of _lists_ of objects of type T is also
> countable.

> Note that lists are finite, by definition.
I'm pretty sure that there are bounded and unbounded lists, but perhaps I
am confusing the latter for infinite streams as you've pointed out.

> If you intend to claim that by "push actions", you meant to include
> infinite streams of push actions: this has no relevance to the real
> world.
> 
> Even if we live in an open universe, the fact remains that at any
> arbitrary point in time, the age of any git repository is always
> finite.  Therefore, no one can ever observe the result of an
> infinite-length "push action" in the real world.  They can only ever
> observe the result of some finite prefix of that so-called "push
> action".
Even in a closed universe, we can only jump to arbitrary points in time
of which we've already collected a snapshot.  Statements regarding
anything else always carry uncertainty.

That's also a reason to bring Turing machines into this equation.  A
Turing machine would try to observe an infinite stream of push actions
if we program it that way.  This raises the Halting and Non-Halting
problems among others.

Cheers



end of thread, other threads:[~2022-01-06 11:29 UTC | newest]

Thread overview: 63+ messages
2021-12-28 20:55 On raw strings in <origin> commit field Liliana Marie Prikler
2021-12-29  8:39 ` zimoun
2021-12-29 20:25   ` Liliana Marie Prikler
2021-12-30 12:43     ` zimoun
2021-12-31  0:02       ` Liliana Marie Prikler
2021-12-31  1:23         ` zimoun
2021-12-31  3:27           ` Liliana Marie Prikler
2021-12-31  9:31             ` Ricardo Wurmus
2021-12-31 11:07               ` Liliana Marie Prikler
2021-12-31 12:31                 ` Ricardo Wurmus
2021-12-31 13:18                   ` Liliana Marie Prikler
2021-12-31 13:15               ` zimoun
2021-12-31 15:19                 ` Liliana Marie Prikler
2021-12-31 17:21                   ` zimoun
2021-12-31 20:52                     ` Liliana Marie Prikler
2021-12-31 23:36         ` Mark H Weaver
2022-01-01  1:33           ` Liliana Marie Prikler
2022-01-01  5:00             ` Mark H Weaver
2022-01-01 10:33               ` Liliana Marie Prikler
2022-01-01 20:37                 ` Mark H Weaver
2022-01-01 22:55                   ` Liliana Marie Prikler
2022-01-02 22:57                     ` Mark H Weaver
2022-01-03 21:25                       ` Liliana Marie Prikler
2022-01-03 23:14                         ` Mark H Weaver
2022-01-04 19:55                           ` Liliana Marie Prikler
2022-01-04 23:42                             ` Mark H Weaver
2022-01-05  9:28                               ` Mark H Weaver
2022-01-05 20:43                                 ` Liliana Marie Prikler
2022-01-06 10:38                                   ` Mark H Weaver
2022-01-06 11:25                                     ` Liliana Marie Prikler
2022-01-02 19:30                   ` zimoun
2022-01-02 21:35                     ` Liliana Marie Prikler
2022-01-03  9:22                       ` zimoun
2022-01-03 18:13                         ` Liliana Marie Prikler
2022-01-03 19:07                           ` zimoun
2022-01-03 20:19                             ` Liliana Marie Prikler
2022-01-03 23:00                               ` zimoun
2022-01-04  5:23                                 ` Liliana Marie Prikler
2022-01-04  8:51                                   ` zimoun
2022-01-04 13:15                                     ` zimoun
2022-01-04 19:45                                       ` Liliana Marie Prikler
2022-01-04 19:53                                         ` zimoun
2021-12-31 23:56         ` Mark H Weaver
2022-01-01  0:15           ` Liliana Marie Prikler
2021-12-30  1:13 ` Mark H Weaver
2021-12-30 12:56   ` zimoun
2021-12-31  3:15   ` Liliana Marie Prikler
2021-12-31  7:57     ` Taylan Kammer
2021-12-31 10:55       ` Liliana Marie Prikler
2022-01-01  1:41     ` Mark H Weaver
2022-01-01 11:12       ` Liliana Marie Prikler
2022-01-01 17:45         ` Timothy Sample
2022-01-01 19:52           ` Liliana Marie Prikler
2022-01-02 23:00             ` Timothy Sample
2022-01-03 15:46           ` Ludovic Courtès
2022-01-01 20:19         ` Mark H Weaver
2022-01-01 23:20           ` Liliana Marie Prikler
2022-01-02 12:25             ` Mark H Weaver
2022-01-02 14:09               ` Liliana Marie Prikler
2022-01-02  2:07         ` Bengt Richter
2021-12-31 17:56 ` Vagrant Cascadian
2022-01-03 15:51   ` Ludovic Courtès
2022-01-03 16:29     ` Vagrant Cascadian

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git
