From: Timothy Sample <samplet@ngyro.com>
To: Liliana Marie Prikler <liliana.prikler@gmail.com>
Cc: guix-devel@gnu.org
Subject: Re: On raw strings in <origin> commit field
Date: Sun, 02 Jan 2022 18:00:09 -0500
Message-ID: <87ee5pspza.fsf@ngyro.com>
In-Reply-To: ea6a072346578e34026410e6a8413cea5494f247.camel@gmail.com

Hey,

Liliana Marie Prikler <liliana.prikler@gmail.com> writes:

> Since you are our expert on preservation, would you mind if I ask you
> for some estimates on how painful it is to track down such commits in
> general, if it could be made easier were you to record tag → commit
> (alternatively file-name x sha256 → SWHID) maps periodically (or if you
> already have such a map and those arise while creating it), and how
> many “Tricking Peer Review”-style problems you think are currently
> around?

I haven’t been keeping a detailed log of issues or anything, but I have
some notes and recollections.  As of last month [1], we have 9,554 Git
sources (fixed-output derivations lowered from ‘git-reference’ origins).
Of those, 186 could not be recovered automatically (by simply cloning
the repo or, for about 100 cases with the commit hash, checking SWH).
Most of the 186 have ‘(recursive? #t)’, which is something I haven’t
implemented yet (there’s no Guix fallback support for it either).
However, there are 51 of those that should just work but don’t.

It turns out that most of these are due to my scripts ignoring the wrong
kind of VCS files (like ignoring “.hg”) when hashing.  My scripts follow
the logic of ‘guix hash -S nar -x .’, but Guix actually just deletes the
Git metadata: ‘rm -rf .git; guix hash -S nar .’.  :)
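
To make the mismatch concrete, here is a minimal Python sketch.  It is
not the actual script and it does not implement the nar format: the
‘tree_digest’ helper, the demo files, and the list of VCS directory
names are all assumptions made up for illustration.  It only shows that
skipping every VCS directory and skipping only “.git” give different
digests as soon as the checkout contains something like “.hg”.

```python
import hashlib
import os
import tempfile

# Assumption: an illustrative list of VCS directory names, not Guix's exact list.
VCS_DIRS = {".git", ".hg", ".svn", ".bzr", "CVS"}


def tree_digest(root, skip_dirs):
    """Return a SHA-256 hex digest over relative paths and file contents.

    Simplified stand-in for a nar hash; directories named in SKIP_DIRS
    are pruned from the walk.
    """
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place and fix the traversal order.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode() + b"\0")
            with open(path, "rb") as f:
                h.update(f.read())
            h.update(b"\0")
    return h.hexdigest()


def make_demo_checkout(root):
    """Create a tiny fake checkout holding both .git and .hg metadata."""
    for rel, data in [
        ("README", b"hello\n"),
        (".git/HEAD", b"ref: refs/heads/master\n"),
        (".hg/requires", b"store\n"),
    ]:
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    make_demo_checkout(tmp)
    exclude_all_vcs = tree_digest(tmp, VCS_DIRS)   # like ‘guix hash -S nar -x .’
    only_drop_git = tree_digest(tmp, {".git"})     # like ‘rm -rf .git; guix hash -S nar .’
    print(exclude_all_vcs != only_drop_git)        # prints True
```

For a checkout with no VCS directories other than “.git”, the two
digests agree, which would explain why only a minority of sources are
affected.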

Another couple are <https://issues.guix.gnu.org/48540>.

There were a handful of mutated tags (around a dozen).  Some of them
were deleted, but the tag name referred to the commit hash (as if the
tag was named by ‘git describe’).  Some of them were changed, but it was
clear that the original tag was just a few commits back.  There was only
one I couldn’t figure out:

    https://github.com/jurplel/qView.git
    at tag 2.0
    with hash 1s29hz44rb5dwzq8d4i4bfg77dr0v3ywpvidpa6xzg7hnnv3mhi5

A similar problem is when the repo URL changes, but the tags are still
the same when you track down another copy of the repo.  I encountered
this a few times.

Another handful (again around a dozen) were hash mistakes in the style
of tricking peer review.  In most cases our commit messages were clear
enough to figure out what the hash was actually for.  There are two
mysterious cases:

    https://git.umaneti.net/flycheck-grammalecte/
    at tag v1.3
    with hash 1f1gapvs9j89qr474103dqgsiyb96phlnsmq5hiv4ba242blg9lb
    (see Guix commit ca5a791f6285b08506ccd662d5911ccf0c4d1ece)

    https://github.com/fdik/libetpan
    at commit 210ba2b3b310b8b7a6ee4a4e35e50f7fa379643f
    with hash 00000nij3ray7nssvq0lzb352wmnab8ffzk7dgff2c68mvjbh1l6
    (the hash kinda looks fake, but it was like that for a long time)

There are two other cases that are basically “typos” in the hash.  One
is clearly just an edit to the hash to make the build fail and print the
correct hash (see commits 618df2e335acb49a27ca014b555ede34f79503f3 and
bdc7f72fe4391ede313a0388ddd17cbb053931c9).  The other one is commit
c0dc4179091f85fe4b8a2bbdb07c154a7f0408ed, which changes the hash of the
package ‘zimg’ without mentioning anything about it in the commit
message.  This is fixed in b08c4f5fceff6064baedea3385703689b8a72e47
(back to the original hash).  Tobias might remember what happened there,
but it looks like an honest mistake to me.  I have no clue what that
other hash was for.

Note for all of this that my scripts treat the SHA256 hash as *the*
identifier for a source.  That is, if a tag is mutated and someone
adjusts the origin URI to point to the commit that the tag used to refer
to, I would not notice.  Similarly, for tricking peer review: fixing the
URI to match the hash is invisible to me.  It’s only when we fix the
hash to match the URI that I notice.
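
The asymmetry can be sketched in a few lines of Python.  This is only an
illustration under assumptions: the ‘newly_seen_hashes’ function, the
tuple layout, the example URL, and the placeholder hash strings are
hypothetical and are not taken from the real scripts or data.

```python
def newly_seen_hashes(old_origins, new_origins):
    """Return the hashes that appear in NEW_ORIGINS but not in OLD_ORIGINS.

    Each origin is a (uri, ref, sha256) tuple.  Only the sha256 field is
    used as identity, so changes to the uri or ref alone go unnoticed.
    """
    old = {sha for _, _, sha in old_origins}
    new = {sha for _, _, sha in new_origins}
    return sorted(new - old)


# Placeholder values for illustration only.
before = [("https://example.org/repo.git", "v1.0", "hash-A")]

# The URI side is fixed to match the hash: the hash is unchanged.
uri_fixed = [("https://example.org/repo.git", "commit abc123", "hash-A")]

# The hash is fixed to match the URI: a new hash shows up.
hash_fixed = [("https://example.org/repo.git", "v1.0", "hash-B")]

print(newly_seen_hashes(before, uri_fixed))   # prints []
print(newly_seen_hashes(before, hash_fixed))  # prints ['hash-B']
```

Only the second kind of fix introduces a hash that has to be recovered,
so only that kind shows up in the numbers above.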

See also zimoun’s analysis of the same thing, but with older data:
<https://lists.gnu.org/archive/html/guix-devel/2021-12/msg00032.html>.

[1] https://ngyro.com/pog-reports/2021-12-06/

> On Saturday, 01.01.2022 at 12:45 -0500, Timothy Sample wrote:
>
>> Given what I wrote above, maybe we could start by updating the linter
>> so that ‘check-source’ actually checks that it gets the right result.
>> Right now it uses a few heuristics to check that the result looks
>> okay (for instance, it checks if the result is suspiciously small). 
>> Maybe it should just go through the whole download process and verify
>> the hash?  Alternatively (or additionally), the CI “source”
>> specification could be configured to avoid using our servers as a
>> fallback when checking sources.
>
> I think substitutes should be disabled for the source download of a
> "check-source".  Even if a substitute or SWH fallback exists, that's
> not what we want to check here, no?

Exactly.  It should just fetch the source as naïvely as possible, akin
to ‘GUIX_DOWNLOAD_FALLBACK_TEST=none guix build --check -S ...’ (or with
‘--substitute-urls=""’ or whatever).
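
As a rough sketch of why a size heuristic is not enough, consider the
following Python.  It is not the linter’s code: ‘looks_ok’, ‘verifies’,
the size threshold, and the use of a plain hex SHA-256 over raw bytes
(rather than a nar hash) are all assumptions for illustration.

```python
import hashlib


def looks_ok(data, min_size=1024):
    """Heuristic check: only rejects suspiciously small results."""
    return len(data) >= min_size


def verifies(data, expected_sha256_hex):
    """Full check: the fetched bytes must match the recorded hash."""
    return hashlib.sha256(data).hexdigest() == expected_sha256_hex


original = b"A" * 4096   # what the recorded hash was computed from
mutated = b"B" * 4096    # what the upstream URI serves today
expected = hashlib.sha256(original).hexdigest()

print(looks_ok(mutated), verifies(mutated, expected))    # prints True False
print(looks_ok(original), verifies(original, expected))  # prints True True
```

A mutated tag or a mismatched hash yields a result of plausible size, so
it passes the heuristic and only fails once the hash is compared.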

>> I agree that adding more identifiers (commit hashes or whatever)
>> makes things more robust, but the cost is more work when creating,
>> updating, and reviewing packages.  I think we should start by
>> verifying the identifiers we already have (i.e., checking that the
>> URI and method of the origin produce the right output).  It would
>> solve many existing problems and would serve as a nice foundation for
>> future improvements.
>
> Is this something we can reasonably expect our current CI or CI in
> general to handle (assuming we tweaked the linter to behave as you
> intend?)  Or would it make more sense to implement this as a
> weekly/monthly cronjob?

I really only mentioned the CI because I had to explain to myself why it
didn’t notice the problem.  I think the linter is probably the better
place to improve things here.  It’s something I’m willing to work on,
but I would need to understand why it doesn’t check the hash already.
It seems like one of those things someone may have already thought about
and decided against.


-- Tim

