* Preservation of Guix Report
@ 2021-10-20 19:48 Timothy Sample
2021-10-21 2:04 ` Timothy Sample
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Timothy Sample @ 2021-10-20 19:48 UTC (permalink / raw)
To: guix-devel
Hi everyone!
Early this summer I did a bunch of work trying to figure out which Guix
sources are preserved by the SWH archive. I’m finally ready to share
some preliminary results!
https://ngyro.com/pog-reports/2021-10-20/
This report is already quite outdated, though. It only covers commits
up to the end of May, and sometime in June is when the sources were
checked against the SWH archive. I’m sharing it now to avoid any
further delays.
What’s cool is that the report is automated. Next on my list is to
update the database and generate a new report. Then, we can compare the
results and see if we are improving. (My read on the results so far is
that improving “sources.json” will yield big improvements, but we might
not be able to get to that before the next report.)
The report itself only provides a very high level overview. If you want
to check on specifics, you will have to download the database. There’s
a link at the bottom of the report as well as a link to a detailed
schema definition. Anyone interested in making some sense of the 5,043
known missing sources is encouraged to look there. However, I can say
from my own investigation that a lot of them are kinda boring. For
instance, 3,435 are from crates.io, CRAN, Hackage, Bioconductor, and
CPAN:
    select count(*)
    from fods
    join fod_references using (fod_id)
    where not is_in_swh
      and (reference like '%crates.io%' or
           reference like '%/cran/%' or
           reference like '%hackage%' or
           reference like '%/bioconductor.%' or
           reference like '%/cpan/%');
    => 3435
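For anyone who wants the same number split per repository, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the downloaded database is an SQLite file and reuses only the table and column names that appear in the query above. The function name and the labels are made up for illustration. Note that the per-pattern counts can add up to more than the combined count if one reference matches several patterns.

```python
import sqlite3

# Substring patterns copied from the query above, keyed by a readable label.
PATTERNS = {
    "crates.io": "%crates.io%",
    "CRAN": "%/cran/%",
    "Hackage": "%hackage%",
    "Bioconductor": "%/bioconductor.%",
    "CPAN": "%/cpan/%",
}


def missing_by_repository(conn):
    """Count references to sources missing from SWH, one count per pattern.

    `conn` is an open sqlite3 connection to the report database
    (assumed to expose the `fods` and `fod_references` tables).
    """
    counts = {}
    for label, pattern in PATTERNS.items():
        row = conn.execute(
            "select count(*) from fods"
            " join fod_references using (fod_id)"
            " where not is_in_swh and reference like ?",
            (pattern,),
        ).fetchone()
        counts[label] = row[0]
    return counts
```

It could be called as `missing_by_repository(sqlite3.connect("pog.db"))`, where "pog.db" is a placeholder for wherever the downloaded database was saved.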
It’s surprising to me that SWH is not already getting these from
“sources.json”. I picked an arbitrary one, “rust-quote-0.6”, and it’s
simply not in “sources.json”. On the other hand, I bet SWH would like a
crates.io (and CRAN, etc.) loader, too.
One other more interesting approach might be to check Git sources:
    select count(*)
    from fods
    join fod_references using (fod_id)
    where not is_in_swh
      and reference like '(git-reference%';
    => 336
There are fewer, but they might be more interesting. Just be sure to
check that they haven’t made it into the SWH archive since June. For
instance, I just checked “asciidoc@9.1.0” and learned that the database
has “NOT is_in_swh”, but it is now in the SWH archive. So, caveat
emptor, I guess. Maybe it would be wise to wait for a more recent
report before diving in.
One other way to help would be to suggest improvements to the report. I
don’t want to fiddle with it too much, but if there is some simple graph
or table or list that should be there, I’m happy to give it a go.
-- Tim
* Re: Preservation of Guix Report
2021-10-20 19:48 Preservation of Guix Report Timothy Sample
@ 2021-10-21 2:04 ` Timothy Sample
2021-10-21 7:39 ` zimoun
2021-10-21 20:47 ` Ludovic Courtès
2 siblings, 0 replies; 13+ messages in thread
From: Timothy Sample @ 2021-10-21 2:04 UTC (permalink / raw)
To: guix-devel
Hi again,
Rereading this a few hours later, I found an error.
Timothy Sample <samplet@ngyro.com> writes:
> It’s surprising to me that SWH is not already getting these from
> “sources.json”. I picked an arbitrary one, “rust-quote-0.6”, and it’s
> simply not in “sources.json”.
It is in fact there! I made a mistake while grepping.
In the database, there’s “rust-quote-0.6@0.6.12” and
“rust-quote-0.6@0.6.13”. The SWH archive has version 0.6.13, but not
0.6.12. Looking back, 0.6.13 was released in July 2019, so maybe 0.6.12
predates “sources.json”?
AFAIK, we have no way of getting 0.6.12 into the SWH archive. There’s
been some talk of making historical “sources.json” files.... I imagine
this is one of many little things to work out.
-- Tim
* Re: Preservation of Guix Report
2021-10-20 19:48 Preservation of Guix Report Timothy Sample
2021-10-21 2:04 ` Timothy Sample
@ 2021-10-21 7:39 ` zimoun
2021-10-21 16:26 ` Timothy Sample
2021-10-21 20:47 ` Ludovic Courtès
2 siblings, 1 reply; 13+ messages in thread
From: zimoun @ 2021-10-21 7:39 UTC (permalink / raw)
To: Timothy Sample, guix-devel
Hi Timothy,
On Wed, 20 Oct 2021 at 15:48, Timothy Sample <samplet@ngyro.com> wrote:
> Early this summer I did a bunch of work trying to figure out which Guix
> sources are preserved by the SWH archive. I’m finally ready to share
> some preliminary results!
>
> https://ngyro.com/pog-reports/2021-10-20/
Cool! Really interesting.
> What’s cool is that the report is automated. Next on my list is to
> update the database and generate a new report. Then, we can compare the
> results and see if we are improving. (My read on the results so far is
> that improving “sources.json” will yield big improvements, but we might
> not be able to get to that before the next report.)
Here are two minor comments:
1. For a couple of days now, I have been running:
$ GUIX_SWH_TOKEN=$TOKEN guix lint -c archival
where $TOKEN is provided by the SWH Authentication service [1].
Instead of a rate limit of 120, it is 1200. Therefore, more
’git-fetch’ packages are added. I am in the process of automating
that, but do not hold your breath. :-)
2. For still unknown reasons, the bridge between SWH and Disarchive has
some holes. For instance,
$ guix lint -c archival znc
gnu/packages/messaging.scm:996:12: znc@1.8.2: Disarchive entry refers to non-existent SWH directory '33a3b509b5ff8e9039626d11b7a800281884cf2a'
$ wget https://guix.gnu.org/sources.json
$ cat sources.json | jq | grep znc
"integrity": "sha256-IwbxlQzncsWlmlf1SG1Zu5yrmEl8RfxJy8RawN7BGbs="
"integrity": "sha256-q0jatpd+j0PW//szIo0ViGX2jd5wJtEjxpPXcznc8rs="
"https://znc.in/releases/archive/znc-1.8.2.tar.gz"
$ guix download https://znc.in/releases/archive/znc-1.8.2.tar.gz
Starting download of /tmp/guix-file.hnjWTE
From https://znc.in/releases/archive/znc-1.8.2.tar.gz...
znc-1.8.2.tar.gz 2.0MiB 599KiB/s 00:03 [##################] 100.0%
/gnu/store/58khbiwp2ghhzg00gnzdy2jlfv49vajm-znc-1.8.2.tar.gz
03fyi0j44zcanj1rsdx93hkdskwfvhbywjiwd17f9q1a7yp8l8zz
Therefore, something is wrong somewhere. Because of #1, I detect
many such examples. I do not know whether the SWH-ID computed by
Disarchive is incorrect or whether SWH has not ingested the source.
Investigation required. :-)
1: <https://archive.softwareheritage.org/api/>
> It’s surprising to me that SWH is not already getting these from
> “sources.json”. I picked an arbitrary one, “rust-quote-0.6”, and it’s
> simply not in “sources.json”. On the other hand, I bet SWH would like a
> crates.io (and CRAN, etc.) loader, too.
From the SWH doc, there is a CRAN lister [2], but I have not checked what
they concretely ingest. On our side, we use ’url-fetch’, and it seems
possible to me that there is a tiny mismatch between what is inside the
release tarball (what we concretely use) and what SWH ingests directly
from CRAN.
2: <https://docs.softwareheritage.org/devel/apidoc/swh.lister.cran.html?highlight=cran#module-swh.lister.cran>
And to answer your question [3] about “sources.json”, I think the
ingestion started after commit
35bb77108fc7f2339da0b5be139043a5f3f21493 from guix-artwork. In other
words, SWH started to ingest from “sources.json” after July 2020,
probably around September 2020.
3: <https://lists.gnu.org/archive/html/guix-devel/2021-10/msg00141.html>
> One other way to help would be to suggest improvements to the report. I
> don’t want to fiddle with it too much, but if there is some simple graph
> or table or list that should be there, I’m happy to give it a go.
For the Missing and Unknown fields, could you distinguish the kind of
origin? Is it mainly git-fetch or url-fetch or others?
It would help to spot which issues to work on (sources.json, SWH side,
Disarchive, etc.).
Cheers,
simon
* Re: Preservation of Guix Report
2021-10-21 7:39 ` zimoun
@ 2021-10-21 16:26 ` Timothy Sample
2021-10-22 7:58 ` zimoun
0 siblings, 1 reply; 13+ messages in thread
From: Timothy Sample @ 2021-10-21 16:26 UTC (permalink / raw)
To: zimoun; +Cc: guix-devel
Hi zimoun,
zimoun <zimon.toutoune@gmail.com> writes:
> 2. For still unknown reasons, the bridge between SWH and Disarchive has
> some holes. For instance,
>
> $ guix lint -c archival znc
> gnu/packages/messaging.scm:996:12: znc@1.8.2: Disarchive entry refers to non-existent SWH directory '33a3b509b5ff8e9039626d11b7a800281884cf2a'
>
> [...]
>
> Therefore, something is wrong somewhere. Because of #1, I detect
> many of such examples. I do not know if SWH-ID computed by
> Disarchive is incorrect [...].
Bingo!
According to SWH (emphasis mine):
SWHIDs for contents, directories, revisions, and releases are, *at
present*, compatible with the Git way of computing identifiers for
its objects.
This is not true anymore. As they go on to say:
Note that Git compatibility is incidental and is not guaranteed to
be maintained in future versions of this scheme (or Git).
Disarchive does it the Git way, and SWH does something slightly
different. The SWH hash is 4e58dc09b8362caf1265102130a593b070562a68,
but the Git hash is 33a3b509b5ff8e9039626d11b7a800281884cf2a. The
difference is that Disarchive, like Git, ignores empty directories. It
makes sense that an archival project like SWH would not do that, and
they indeed don’t.
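To make the difference concrete, here is a minimal sketch of a Git-style directory hash with a switch for empty directories. It is not Disarchive's or SWH's actual code. It assumes that the SWH variant records an empty directory as an ordinary entry pointing at the empty tree, and the function names are made up for illustration.

```python
import hashlib
import os
import stat


def git_object_id(kind: str, payload: bytes) -> bytes:
    """Hash a payload the way Git does: SHA-1 of '<kind> <size>\\0<payload>'."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).digest()


def tree_id(path: str, keep_empty_dirs: bool):
    """Return the raw 20-byte tree id of `path`.

    With keep_empty_dirs=False (the Git way), a directory that ends up with
    no entries is dropped from its parent, and None is returned for it.
    With keep_empty_dirs=True (the assumed SWH way), it is kept.
    """
    entries = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        mode = os.lstat(full).st_mode
        raw_name = os.fsencode(name)
        if stat.S_ISLNK(mode):
            target = os.fsencode(os.readlink(full))
            entries.append((raw_name, b"120000", git_object_id("blob", target)))
        elif stat.S_ISDIR(mode):
            child = tree_id(full, keep_empty_dirs)
            if child is not None:
                # Git sorts directories as if their name ended with '/'.
                entries.append((raw_name + b"/", b"40000", child))
        elif stat.S_ISREG(mode):
            with open(full, "rb") as f:
                data = f.read()
            perm = b"100755" if mode & 0o100 else b"100644"
            entries.append((raw_name, perm, git_object_id("blob", data)))
    if not entries and not keep_empty_dirs:
        return None
    payload = b"".join(
        perm + b" " + key.rstrip(b"/") + b"\0" + oid
        for key, perm, oid in sorted(entries)
    )
    return git_object_id("tree", payload)
```

For a tree without empty directories, both modes give the same identifier. As soon as one empty directory is present, `tree_id(path, True).hex()` and `tree_id(path, False).hex()` diverge, which is the kind of mismatch seen for znc above.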
Fixing this in Disarchive is going to make a *huge* difference, so that
is now high priority for me (it’s a one line change, but I want to fix
it, release it, update Guix, and recompute the report).
> And answering to your question [3] about “sources.json”, I think the
> ingestion started after this commit
> 35bb77108fc7f2339da0b5be139043a5f3f21493 from guix-artwork. Other said,
> SWH started to ingest from “sources.json” after July 2020; probably
> around September 2020.
>
> 3: <https://lists.gnu.org/archive/html/guix-devel/2021-10/msg00141.html>
Thanks! While investigating the above problem, I found a page that
lists what SWH is getting from us [1] and another showing when they are
scanning “sources.json” [2]. I don’t know if you’ve seen them before,
but they will be invaluable for figuring this stuff out.
[1] https://archive.softwareheritage.org/browse/origin/branches/?origin_url=https://guix.gnu.org/sources.json
[2] https://archive.softwareheritage.org/browse/origin/visits/?origin_url=https://guix.gnu.org/sources.json
> For the Missing and Unknown fields, could you distinguish the kind of
> origin? Is it mainly git-fetch or url-fetch or others?
Good idea. I think I can do this easily enough. I might shelve it for
a bit, because I’m too excited to update the report with the Disarchive
hash fix. :)
-- Tim
* Re: Preservation of Guix Report
2021-10-21 16:26 ` Timothy Sample
@ 2021-10-22 7:58 ` zimoun
0 siblings, 0 replies; 13+ messages in thread
From: zimoun @ 2021-10-22 7:58 UTC (permalink / raw)
To: Timothy Sample; +Cc: guix-devel
Hi Timothy,
On Thu, 21 Oct 2021 at 12:26, Timothy Sample <samplet@ngyro.com> wrote:
> Fixing this in Disarchive is going to make a *huge* difference, so that
> is now high priority for me (it’s a one line change, but I want to fix
> it, release it, update Guix, and recompute the report).
Cool! Excited to get the new report.
> Thanks! While investigating the above problem, I found a page that
> lists what SWH is getting from us [1] and another showing when they are
> scanning “sources.json” [2]. I don’t know if you’ve seen them before,
> but they will be invaluable for figuring this stuff out.
>
> [1] https://archive.softwareheritage.org/browse/origin/branches/?origin_url=https://guix.gnu.org/sources.json
> [2] https://archive.softwareheritage.org/browse/origin/visits/?origin_url=https://guix.gnu.org/sources.json
I did not know these. However, I am confused about how to use them. For
instance, clicking on a random orange link:
<https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://guix.gnu.org/sources.json&timestamp=2021-04-27T16:01:28Z>
which points to a snapshot of the Git repo of Guix. I would be
interested to know what “partial” means, and to read the log.
Well, let’s roam on #swh-devel. ;-)
Cheers,
simon
* Re: Preservation of Guix Report
2021-10-20 19:48 Preservation of Guix Report Timothy Sample
2021-10-21 2:04 ` Timothy Sample
2021-10-21 7:39 ` zimoun
@ 2021-10-21 20:47 ` Ludovic Courtès
2021-10-22 7:53 ` zimoun
2021-10-22 14:19 ` Preservation of Guix Report Timothy Sample
2 siblings, 2 replies; 13+ messages in thread
From: Ludovic Courtès @ 2021-10-21 20:47 UTC (permalink / raw)
To: Timothy Sample; +Cc: guix-devel
Hi Timothy!
Timothy Sample <samplet@ngyro.com> skribis:
> Early this summer I did a bunch of work trying to figure out which Guix
> sources are preserved by the SWH archive. I’m finally ready to share
> some preliminary results!
>
> https://ngyro.com/pog-reports/2021-10-20/
>
> This report is already quite outdated, though. It only covers commits
> up to the end of May, and sometime in June is when the sources were
> checked against the SWH archive. I’m sharing it now to avoid any
> further delays.
This is truly awesome! (Did you manage to grab all that info with the
default rate limit?!)
I can’t wait for the updated report now that Simon and yourself have
identified that SWHID computation bug!
Some of our <git-reference> refer to tags, not commits. How do you
determine whether they’re saved?
‘guix lint -c archival’ uses ‘lookup-origin-revision’, which is a good
approximation, but it’s not 100% reliable because tags can be modified
and that procedure only tells you that a same-named tag was found, not
that it’s the commit you were expecting. (And really, we should stop
referring to tags.)
Thank you! 👍🏽
Ludo’.
* Re: Preservation of Guix Report
2021-10-21 20:47 ` Ludovic Courtès
@ 2021-10-22 7:53 ` zimoun
2021-10-29 14:12 ` Mutable Git tags & Software Heritage Ludovic Courtès
2021-10-22 14:19 ` Preservation of Guix Report Timothy Sample
1 sibling, 1 reply; 13+ messages in thread
From: zimoun @ 2021-10-22 7:53 UTC (permalink / raw)
To: Ludovic Courtès, Timothy Sample; +Cc: guix-devel
Hi Ludo,
On Thu, 21 Oct 2021 at 22:47, Ludovic Courtès <ludo@gnu.org> wrote:
> ‘guix lint -c archival’ uses ‘lookup-origin-revision’, which is a good
> approximation, but it’s not 100% reliable because tags can be modified
> and that procedure only tells you that a same-named tag was found, not
> that it’s the commit you were expecting. (And really, we should stop
> referring to tags.)
At packaging time, Guix uses a tag. Then “guix lint” saves the upstream
repo, which contains the correct tag. Now, upstream replaces the tag
in place and saves to SWH on their own. How does SWH deal with this case?
Well, because it is not affordable to switch from the current
tag-address to immutable commit-address for defining packages, in order
to be 100% reliable, any fallback should use Disarchive-DB, which stores
the mapping from checksum to swhid for all kinds of origins.
Is that what you have in mind?
Cheers,
simon
* Mutable Git tags & Software Heritage
2021-10-22 7:53 ` zimoun
@ 2021-10-29 14:12 ` Ludovic Courtès
2021-10-30 16:19 ` zimoun
0 siblings, 1 reply; 13+ messages in thread
From: Ludovic Courtès @ 2021-10-29 14:12 UTC (permalink / raw)
To: zimoun; +Cc: guix-devel
Hi,
zimoun <zimon.toutoune@gmail.com> skribis:
> On Thu, 21 Oct 2021 at 22:47, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> ‘guix lint -c archival’ uses ‘lookup-origin-revision’, which is a good
>> approximation, but it’s not 100% reliable because tags can be modified
>> and that procedure only tells you that a same-named tag was found, not
>> that it’s the commit you were expecting. (And really, we should stop
>> referring to tags.)
>
> At package time, Guix uses tag. Then “guix lint” saves the upstream
> repo; containing the correct tag. Now, upstream replaces in-place the
> tag and saves to SWH by their own. How does SWH deal with this case?
SWH records the “history of the history”. It can tell you what the tag
pointed to at the time of a specific snapshot.
However our fallback code picks the tag as it exists in the latest
snapshot, and thus it could pick “the wrong one” if the tag was modified
over time.
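A toy model of the problem, as a sketch only: the `Snapshot` type, the function names, and the commit ids below are all made up, and the real lookup goes through the SWH API rather than an in-memory list. It contrasts "take the tag from the latest snapshot" with "look for a snapshot where the tag points at the commit we expect".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    date: str   # ISO 8601 date of the visit, so string order is time order
    tags: dict  # tag name -> commit id as seen during that visit


def resolve_latest(snapshots, tag):
    """What the fallback described above does: trust the most recent visit."""
    latest = max(snapshots, key=lambda s: s.date)
    return latest.tags.get(tag)


def first_visit_with(snapshots, tag, expected_commit):
    """Return the date of the earliest visit where `tag` pointed at
    `expected_commit`, or None if no visit ever saw that pairing."""
    for snap in sorted(snapshots, key=lambda s: s.date):
        if snap.tags.get(tag) == expected_commit:
            return snap.date
    return None


# Hypothetical history: upstream moved the tag between two visits.
history = [
    Snapshot("2021-01-01", {"v1.0": "aaa111"}),
    Snapshot("2021-06-01", {"v1.0": "bbb222"}),
]
```

With this history, `resolve_latest(history, "v1.0")` yields the moved tag's commit, while `first_visit_with(history, "v1.0", "aaa111")` still finds the visit that recorded the commit the package was built from.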
> Well, because it is not affordable to switch from the current
> tag-address to immutable commit-address for defining packages, in order
> to be 100% reliable, any fallback should use Disarchive-DB which stores
> the mapping from checksum to swhid; for all kind origins.
>
> Is it what you have in mind?
No, I think we should consider always referring to commits instead of
tags. It’s annoying from a readability viewpoint, but it would ensure
reproducibility. Even flatpak has this policy. :-)
https://github.com/flathub/flathub/wiki/App-Requirements
Ludo’.
* Re: Mutable Git tags & Software Heritage
2021-10-29 14:12 ` Mutable Git tags & Software Heritage Ludovic Courtès
@ 2021-10-30 16:19 ` zimoun
2021-11-09 16:55 ` Ludovic Courtès
0 siblings, 1 reply; 13+ messages in thread
From: zimoun @ 2021-10-30 16:19 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: guix-devel
Hi,
On Fri, 29 Oct 2021 at 16:12, Ludovic Courtès <ludo@gnu.org> wrote:
>> At package time, Guix uses tag. Then “guix lint” saves the upstream
>> repo; containing the correct tag. Now, upstream replaces in-place the
>> tag and saves to SWH by their own. How does SWH deal with this case?
>
> SWH records the “history of the history”. It can tell you what the tag
> pointed to at the time of a specific snapshot.
>
> However our fallback code picks the tag as it exists in the latest
> snapshot, and thus it could pick “the wrong one” if the tag was modified
> over time.
Ah yes. Once read, it seems obvious. :-) Thanks for explaining.
>> Well, because it is not affordable to switch from the current
>> tag-address to immutable commit-address for defining packages, in order
>> to be 100% reliable, any fallback should use Disarchive-DB which stores
>> the mapping from checksum to swhid; for all kind origins.
>>
>> Is it what you have in mind?
>
> No, I think we should consider always referring to commits instead of
> tags. It’s annoying from a readability viewpoint, but it would ensure
> reproducibility. Even flatpak has this policy. :-)
Ah, IMHO, «it is not affordable to switch from the current tag-address
to immutable commit-address for defining packages, in order to be 100%
reliable» :-)
Do you think the switch from tags to commits is really doable?
In other words, do you think it would be possible to automate such a task?
Because, in my experience, it was long and quite boring to
manually clean up various R packages, switching them from various
locations to the correct ones (Bioconductor, CRAN, etc.). Cleaning up
Python2 is also quite something. I mean, it is not doable manually, IMHO.
Cheers,
simon
PS: Ah, referring to commits instead of tags could be an RFC. ;-)
* Re: Mutable Git tags & Software Heritage
2021-10-30 16:19 ` zimoun
@ 2021-11-09 16:55 ` Ludovic Courtès
0 siblings, 0 replies; 13+ messages in thread
From: Ludovic Courtès @ 2021-11-09 16:55 UTC (permalink / raw)
To: zimoun; +Cc: guix-devel
Hello,
zimoun <zimon.toutoune@gmail.com> skribis:
> On Fri, 29 Oct 2021 at 16:12, Ludovic Courtès <ludo@gnu.org> wrote:
[...]
>>> Well, because it is not affordable to switch from the current
>>> tag-address to immutable commit-address for defining packages, in order
>>> to be 100% reliable, any fallback should use Disarchive-DB which stores
>>> the mapping from checksum to swhid; for all kind origins.
>>>
>>> Is it what you have in mind?
>>
>> No, I think we should consider always referring to commits instead of
>> tags. It’s annoying from a readability viewpoint, but it would ensure
>> reproducibility. Even flatpak has this policy. :-)
>
> Ah, IMHO, «it is not affordable to switch from the current tag-address
> to immutable commit-address for defining packages, in order to be 100%
> reliable» :-)
>
> Do you think the switch from tag to commit instead is really doable?
>
> Other said, do you think it should be possible to automatize such task?
With enough effort, anything can be automated. That said, I think it
would be fine to change things incrementally as packages are
added/updated.
> PS: Ah, refer to commit instead of tags could be a RFC. ;-)
Yup! :-)
Thanks,
Ludo’.
* Re: Preservation of Guix Report
2021-10-21 20:47 ` Ludovic Courtès
2021-10-22 7:53 ` zimoun
@ 2021-10-22 14:19 ` Timothy Sample
2021-10-22 17:32 ` Timothy Sample
2021-10-29 14:20 ` Ludovic Courtès
1 sibling, 2 replies; 13+ messages in thread
From: Timothy Sample @ 2021-10-22 14:19 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: guix-devel
Hey,
Ludovic Courtès <ludo@gnu.org> writes:
> Timothy Sample <samplet@ngyro.com> skribis:
>
>> Early this summer I did a bunch of work trying to figure out which Guix
>> sources are preserved by the SWH archive. I’m finally ready to share
>> some preliminary results!
>>
>> https://ngyro.com/pog-reports/2021-10-20/
>>
>> This report is already quite outdated, though. It only covers commits
>> up to the end of May, and sometime in June is when the sources were
>> checked against the SWH archive. I’m sharing it now to avoid any
>> further delays.
>
> This is truly awesome! (Did you manage to grab all that info with the
> default rate limit?!)
Yes, but I have another trick. The “known” endpoint [1]. If you
already know the SWHIDs you want to check, you can check 1,000 per call.
With the anonymous rate limit, I can check 120,000 every hour, which is
plenty.
[1] https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html#get--api-1-content-known-(sha1)[,(sha1),%20...,(sha1)]-
> I can’t wait for the updated report now that Simon and yourself have
> identified that SWHID computation bug!
I’m computing SWHIDs while writing this. Not long now!
> Some of our <git-reference> refer to tags, not commits. How do you
> determine whether they’re saved?
The short answer is “elbow grease”. Basically, I’m taking a “work
harder, not smarter” approach. :p I go out and obtain the source,
verify it with Guix’s hash, and then compute the SWHID. This is another
thing we could move to the CI infrastructure, but I think there might be
some hiccoughs. For git-references, I believe we can’t just compute the
ID after the download derivation – we would have to change the download
derivation itself. Maybe add an ‘swhid’ output? It’s a little more
complicated than just throwing up some scripts, anyway.
> ‘guix lint -c archival’ uses ‘lookup-origin-revision’, which is a good
> approximation, but it’s not 100% reliable because tags can be modified
> and that procedure only tells you that a same-named tag was found, not
> that it’s the commit you were expecting. (And really, we should stop
> referring to tags.)
Like zimoun said elsewhere in this thread, having an explicit mapping
from Guix hash to SWHID will improve reliability quite a bit. It’s hard
to get to 100%, though! With the reports, we will eventually be able to
check everything. However, there’s still a small possibility of bugs
and false positives. Ultimately, I’m hoping the reports will help
detect small problems (some specific source is missing) and guide our
efforts on big problems (xz support in Disarchive or support for more
version control systems, etc.).
-- Tim
* Re: Preservation of Guix Report
2021-10-22 14:19 ` Preservation of Guix Report Timothy Sample
@ 2021-10-22 17:32 ` Timothy Sample
2021-10-29 14:20 ` Ludovic Courtès
1 sibling, 0 replies; 13+ messages in thread
From: Timothy Sample @ 2021-10-22 17:32 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: guix-devel
Hi again,
Timothy Sample <samplet@ngyro.com> writes:
> Yes, but I have another trick. The “known” endpoint [1]. If you
> already know the SWHIDs you want to check, you can check 1,000 per call.
> With the anonymous rate limit, I can check 120,000 every hour, which is
> plenty.
>
> [1]
> https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html#get--api-1-content-known-(sha1)[,(sha1),%20...,(sha1)]-
That’s the wrong “known” endpoint. I meant to link to this one:
https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html#post--api-1-known-
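For reference, a minimal sketch of the batching described above, using only the Python standard library. The base URL, the JSON-list request body, and the reply shape (each SWHID mapped to an object with a "known" field) are assumptions based on the linked documentation, so check them there before relying on this. The function names are made up.

```python
import json
import urllib.request

# Assumed public endpoint; see the documentation linked above.
KNOWN_URL = "https://archive.softwareheritage.org/api/1/known/"
BATCH_SIZE = 1000      # SWHIDs per call, as stated above
CALLS_PER_HOUR = 120   # anonymous rate limit, as stated above


def batches(swhids, size=BATCH_SIZE):
    """Split a list of SWHIDs into chunks of at most `size` items."""
    for start in range(0, len(swhids), size):
        yield swhids[start:start + size]


def known_request(batch):
    """Build (but do not send) the POST request for one batch."""
    body = json.dumps(batch).encode()
    return urllib.request.Request(
        KNOWN_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_known(swhids):
    """Send every batch and return a dict mapping SWHID -> bool.

    Performs network calls and does no rate-limit handling.
    """
    result = {}
    for batch in batches(swhids):
        with urllib.request.urlopen(known_request(batch)) as response:
            reply = json.load(response)
        for swhid, info in reply.items():
            result[swhid] = info["known"]
    return result
```

The arithmetic behind "120,000 every hour" is simply `BATCH_SIZE * CALLS_PER_HOUR`.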
* Re: Preservation of Guix Report
2021-10-22 14:19 ` Preservation of Guix Report Timothy Sample
2021-10-22 17:32 ` Timothy Sample
@ 2021-10-29 14:20 ` Ludovic Courtès
1 sibling, 0 replies; 13+ messages in thread
From: Ludovic Courtès @ 2021-10-29 14:20 UTC (permalink / raw)
To: Timothy Sample; +Cc: guix-devel
Hello!
Timothy Sample <samplet@ngyro.com> skribis:
> Ludovic Courtès <ludo@gnu.org> writes:
[...]
>> This is truly awesome! (Did you manage to grab all that info with the
>> default rate limit?!)
>
> Yes, but I have another trick. The “known” endpoint [1]. If you
> already know the SWHIDs you want to check, you can check 1,000 per call.
> With the anonymous rate limit, I can check 120,000 every hour, which is
> plenty.
>
> [1] https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html#get--api-1-content-known-(sha1)[,(sha1),%20...,(sha1)]-
Oh, smart.
>> Some of our <git-reference> refer to tags, not commits. How do you
>> determine whether they’re saved?
>
> The short answer is “elbow grease”. Basically, I’m taking a “work
> harder, not smarter” approach. :p I go out and obtain the source,
> verify it with Guix’s hash, and then compute the SWHID. This is another
> thing we could move to the CI infrastructure, but I think there might be
> some hiccoughs. For git-references, I believe we can’t just compute the
> ID after the download derivation – we would have to change the download
> derivation itself. Maybe add an ‘swhid’ output? It’s a little more
> complicated than just throwing up some scripts, anyway.
Just like we have ‘etc/disarchive-manifest.scm’, we could have a thing
that computes the SWHID of all the ‘git-fetch’ origins, for instance,
using the Disarchive code. Would that help?
That would allow us to maintain a mapping from nar hash to swh:dir hash.
>> ‘guix lint -c archival’ uses ‘lookup-origin-revision’, which is a good
>> approximation, but it’s not 100% reliable because tags can be modified
>> and that procedure only tells you that a same-named tag was found, not
>> that it’s the commit you were expecting. (And really, we should stop
>> referring to tags.)
>
> Like zimoun said elsewhere in this thread, having an explicit mapping
> from Guix hash to SWHID will improve reliability quite a bit. It’s hard
> to get to 100%, though! With the reports, we will eventually be able to
> check everything. However, there’s still a small possibility of bugs
> and false positives. Ultimately, I’m hoping the reports will help
> detect small problems (some specific source is missing) and guide our
> efforts on big problems (xz support in Disarchive or support for more
> version control systems, etc.).
Definitely, thumbs up!
Ludo’.