unofficial mirror of guix-science@gnu.org 
* Software Heritage fifth anniversary event
@ 2021-12-01  9:41 Ludovic Courtès
  2021-12-01 18:04 ` Timothy Sample
  2021-12-03  3:58 ` Maxim Cournoyer
  0 siblings, 2 replies; 11+ messages in thread
From: Ludovic Courtès @ 2021-12-01  9:41 UTC (permalink / raw)
  To: guix-devel, guix-science; +Cc: zimoun, Timothy Sample

Hello Guix!

I had the pleasure of attending the Software Heritage fifth anniversary
event yesterday at the UNESCO headquarters (fancy!) and at Inria in
Paris.

I learned about things others are doing with SWH (notably in the
cultural and scientific fields) and had discussions with hackers (people
who work on Subversion, CVS, Mercurial, and Bazaar “loaders”, for
instance).  I gave a 10–15mn talk on how Guix uses SWH, what Disarchive
is, what the current status of the “preservation of Guix” is, and what
remains to be done:

  https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/swh-unesco-2021/talk.20211130.pdf

(There was a great talk about Maneage¹ right before mine.)

I chatted with the SWH tech team; they’re obviously very busy solving
all sorts of scalability challenges :-) but they’re also truly
interested in what we’re doing and in supporting our use case.  Off the
top of my head, here are some of the topics discussed:

  • ingesting past revisions: if we can give them ‘sources.json’ for
    past revisions, they’re happy to ingest them;

  • rate limit: we can find an arrangement to raise it for the purposes
    of statistics gathering like Simon and Timothy have been doing (we
    can discuss the details off-list);

  • Disarchive: they’d like to better understand the “unknowns” in the
    PoG plots (I wasn’t sure if it was non-tar.gz tarballs or what) and
    to work on the definitely-missing origins that show up there;
    they’re not opposed to the idea of eventually hosting or maintaining
    the Disarchive database (in fact one of the developers thought we
    were hosting it in Git and that as such they were already archiving
    it—maybe we could go back to Git?);

  • bit-for-bit archival: there’s a tension between making SWH a
    “canonical” representation of VCS repos and making it a faithful,
    bit-for-bit identical copy of the original, and there are different
    opinions in the team here; our use case pretty much requires
    bit-for-bit copies, and fortunately this is what SWH is giving us in
    practice for Git repos, so checkout authentication (for example)
    should work even when fetching Guix from SWH.

There were other discussions about Guix and Nix and I was pleased to see
people were enthusiastic about functional package management and about
our whole endeavor.

Anyway I think we can take this as an opportunity to increase bandwidth
with the SWH developers!

Thanks,
Ludo’.

¹ https://maneage.org/



* Re: Software Heritage fifth anniversary event
  2021-12-01  9:41 Software Heritage fifth anniversary event Ludovic Courtès
@ 2021-12-01 18:04 ` Timothy Sample
  2021-12-02  8:59   ` Ludovic Courtès
  2021-12-02 13:17   ` zimoun
  2021-12-03  3:58 ` Maxim Cournoyer
  1 sibling, 2 replies; 11+ messages in thread
From: Timothy Sample @ 2021-12-01 18:04 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel, guix-science

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> I gave a 10–15mn talk on how Guix uses SWH, what Disarchive is, what
> the current status of the “preservation of Guix” is, and what remains
> to be done:
>
>   https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/swh-unesco-2021/talk.20211130.pdf

Wow – great work!

> I chatted with the SWH tech team; they’re obviously very busy solving
> all sorts of scalability challenges :-) but they’re also truly
> interested in what we’re doing and in supporting our use case.  Off the
> top of my head, here are some of the topics discussed:
>
>   • ingesting past revisions: if we can give them ‘sources.json’ for
>     past revisions, they’re happy to ingest them;

This is something I can probably coax out of the Preservation of Guix
database.  That might be the cheapest way to do it.  Alternatively, when
we get “sources.json” built with Cuirass, we could tell Cuirass to build
out a sample of previous commits to get pretty good coverage.  (Side
note: eventually we could verify the coverage of the sampling approach
using the Data Service, which has processed a very exhaustive list of
commits.)
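
To make the sampling idea concrete, here is a minimal sketch of how
the coverage of a sample could be measured against an exhaustive list.
It assumes we already have, for each commit, the set of source names
it references; the function name and the toy history are made up for
illustration and are not taken from Cuirass or the Data Service.

```python
def sampled_coverage(sources_by_commit, step):
    """Return the fraction of all distinct sources that are seen when
    only every `step`-th commit is visited."""
    commits = list(sources_by_commit)  # dict order stands in for history order
    all_sources = set().union(*sources_by_commit.values())
    sampled = set()
    for commit in commits[::step]:
        sampled |= sources_by_commit[commit]
    return len(sampled) / len(all_sources) if all_sources else 1.0


# Hypothetical history: four commits and the tarballs each one refers to.
history = {
    "c1": {"hello-2.10.tar.gz", "zlib-1.2.11.tar.gz"},
    "c2": {"hello-2.10.tar.gz", "zlib-1.2.11.tar.gz"},
    "c3": {"hello-2.10.tar.gz", "zlib-1.2.12.tar.gz"},
    "c4": {"hello-2.11.tar.gz", "zlib-1.2.12.tar.gz"},
}

print(sampled_coverage(history, 2))  # 0.75: hello-2.11 only appears in c4
```

The exhaustive side of the comparison is what the Data Service would
provide; the sample is what a handful of Cuirass builds would give us.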

>   • rate limit: we can find an arrangement to raise it for the purposes
>     of statistics gathering like Simon and Timothy have been doing (we
>     can discuss the details off-list);

Cool!  So far it hasn’t been a concern for me, but it would help in the
future if we want to try and track down Git repositories that have gone
missing.

>   • Disarchive: they’d like to better understand the “unknowns” in the
>     PoG plots (I wasn’t sure if it was non-tar.gz tarballs or what) and
>     to work on the definitely-missing origins that show up there;

Many of the unknowns are there for me to track Disarchive progress.
It’s not really the clearest reporting, but it tracks more what Guix can
handle automatically than what we could theoretically know about.
Basically something is “known” if it can be downloaded from upstream,
and either: it’s a non-recursive Git reference; or it’s something
Disarchive can handle.  Hence, we know nothing about other version
control systems and, say, “.tar.bz2” archives.  Also, all these things
are based on heuristics.  :)  As we get closer to 100% known, we can
start analyzing everything more closely.
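
The rule above can be sketched as a small predicate.  This is only an
illustration of the heuristic as described here, not the PoG code: the
Source record, its field names, and the list of suffixes that
Disarchive is assumed to handle are all hypothetical.

```python
from dataclasses import dataclass

# Assumption: only gzip-compressed tarballs count as handled by Disarchive.
DISARCHIVE_SUFFIXES = (".tar.gz", ".tgz")


@dataclass
class Source:
    kind: str                # "git", "url", or another fetch method
    downloadable: bool       # can it still be fetched from upstream?
    recursive: bool = False  # Git reference with submodules
    filename: str = ""       # file name for "url" sources


def is_known(src):
    """A source is "known" if upstream still serves it and it is either
    a non-recursive Git reference or something Disarchive can handle."""
    if not src.downloadable:
        return False
    if src.kind == "git":
        return not src.recursive
    if src.kind == "url":
        return src.filename.endswith(DISARCHIVE_SUFFIXES)
    return False  # other version control systems: nothing is known


print(is_known(Source("url", True, filename="foo-1.0.tar.bz2")))  # False
```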

>     they’re not opposed to the idea of eventually hosting or maintaining
>     the Disarchive database (in fact one of the developers thought we
>     were hosting it in Git and that as such they were already archiving
>     it—maybe we could go back to Git?);

It’s a possibility, but right now I’m hopeful that the database will be
in the care of SWH directly before too long.  I’d rather wait and see at
this point.  I’m sure we could manage it, but the uncompressed size of
the Disarchive specification of a Chromium tarball is 366M.  Storing all
the XZ specifications uncompressed is over 20G.  It would be a big Git
repo!

>   • bit-for-bit archival: there’s a tension between making SWH a
>     “canonical” representation of VCS repos and making it a faithful,
>     bit-for-bit identical copy of the original, and there are different
>     opinions in the team here; our use case pretty much requires
>     bit-for-bit copies, and fortunately this is what SWH is giving us in
>     practice for Git repos, so checkout authentication (for example)
>     should work even when fetching Guix from SWH.

That’s interesting.  I’m sure most of us in the Guix camp are on team
bit-for-bit, but I’m sure we can all agree that it’s not easy to get
there.

> There were other discussions about Guix and Nix and I was pleased to see
> people were enthusiastic about functional package management and about
> our whole endeavor.
>
> Anyway I think we can take this as an opportunity to increase bandwidth
> with the SWH developers!

Good idea.  It’s nice when our efforts and experience produce something
useful to the broader free software community.  :)


-- Tim



* Re: Software Heritage fifth anniversary event
  2021-12-01 18:04 ` Timothy Sample
@ 2021-12-02  8:59   ` Ludovic Courtès
  2021-12-02 13:17   ` zimoun
  1 sibling, 0 replies; 11+ messages in thread
From: Ludovic Courtès @ 2021-12-02  8:59 UTC (permalink / raw)
  To: Timothy Sample; +Cc: guix-devel, guix-science, zimoun

Hi!

Timothy Sample <samplet@ngyro.com> skribis:

> Ludovic Courtès <ludovic.courtes@inria.fr> writes:

[...]

>>   • Disarchive: they’d like to better understand the “unknowns” in the
>>     PoG plots (I wasn’t sure if it was non-tar.gz tarballs or what) and
>>     to work on the definitely-missing origins that show up there;
>
> Many of the unknowns are there for me to track Disarchive progress.
> It’s not really the clearest reporting, but it tracks more what Guix can
> handle automatically than what we could theoretically know about.
> Basically something is “known” if it can be downloaded from upstream,
> and either: it’s a non-recursive Git reference; or it’s something
> Disarchive can handle.  Hence, we know nothing about other version
> control systems and, say, “.tar.bz2” archives.  Also, all these things
> are based on heuristics.  :)  As we get closer to 100% known, we can
> start analyzing everything more closely.

Right.  Perhaps at some point we can give them (say on swh-devel) this
explanation so they have a clearer view of how far Disarchive is from
being “production-ready” from an SWH perspective.  Valentin of the SWH
team played a lot with pristine-tar and I’m sure they’d have useful
feedback to give.

>>     they’re not opposed to the idea of eventually hosting or maintaining
>>     the Disarchive database (in fact one of the developers thought we
>>     were hosting it in Git and that as such they were already archiving
>>     it—maybe we could go back to Git?);
>
> It’s a possibility, but right now I’m hopeful that the database will be
> in the care of SWH directly before too long.  I’d rather wait and see at
> this point.  I’m sure we could manage it, but the uncompressed size of
> the Disarchive specification of a Chromium tarball is 366M.  Storing all
> the XZ specifications uncompressed is over 20G.  It would be a big Git
> repo!

Indeed!

So, in passing, you’re telling us that xz support is kinda ready, right?
:-)

Thanks!

Ludo’.



* Re: Software Heritage fifth anniversary event
  2021-12-01 18:04 ` Timothy Sample
  2021-12-02  8:59   ` Ludovic Courtès
@ 2021-12-02 13:17   ` zimoun
  2021-12-02 14:04     ` Timothy Sample
  2021-12-02 15:02     ` Ludovic Courtès
  1 sibling, 2 replies; 11+ messages in thread
From: zimoun @ 2021-12-02 13:17 UTC (permalink / raw)
  To: Timothy Sample; +Cc: Ludovic Courtès, Guix Devel, guix-science

Hi,

On Wed, 1 Dec 2021 at 19:04, Timothy Sample <samplet@ngyro.com> wrote:
> Ludovic Courtès <ludovic.courtes@inria.fr> writes:
>
> > I gave a 10–15mn talk on how Guix uses SWH, what Disarchive is, what
> > the current status of the “preservation of Guix” is, and what remains
> > to be done:
> >
> >   https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/swh-unesco-2021/talk.20211130.pdf

Thank you Ludo for this nice write-up!  I hope the stream was recorded
and will soon be available to all. :-)


> > I chatted with the SWH tech team; they’re obviously very busy solving
> > all sorts of scalability challenges :-) but they’re also truly
> > interested in what we’re doing and in supporting our use case.  Off the
> > top of my head, here are some of the topics discussed:
> >
> >   • ingesting past revisions: if we can give them ‘sources.json’ for
> >     past revisions, they’re happy to ingest them;
>
> This is something I can probably coax out of the Preservation of Guix
> database.  That might be the cheapest way to do it.  Alternatively, when
> we get “sources.json” built with Cuirass, we could tell Cuirass to build
> out a sample of previous commits to get pretty good coverage.  (Side
> note: eventually we could verify the coverage of the sampling approach
> using the Data Service, which has processed a very exhaustive list of
> commits.)

Let's avoid "quirks", because right now ingestion requires too many manual checks. :-)

For instance, "guix lint -c archival" works well, but contributors and
pushers do not run it systematically, especially for packages that are
updated quickly.  This is mainly what we see: 35 vs. 24 missing
type:git sources in the PoG reports [1,2].

On the other hand, 'sources.json' is built with the Guix website.  But
SWH ingests only the tarball items from there.

It is not clear to me how to add both to CI: sending save requests for
git-fetch packages and building 'sources.json'.

Last, not all packages are equal.  We could have 99.99% coverage, but
if the missing 0.01% of packages are deep in the graph, then the whole
house of cards falls down.  Somehow, we need to work on the graph and
spot the "important" ones, or at least sort them.  Argh, it is
something I have wanted to do for a long time (it would help when a
release is coming), but days only have 24 hours. ;-)
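
As a rough sketch of what "sorting them" could mean: rank each package
by how many other packages depend on it, directly or transitively, so
that a missing source deep in the graph shows up first.  The graph
below is a toy with hypothetical names; a real run would use the Guix
package graph.

```python
def transitive_dependents(deps):
    """Map each package to the set of packages that depend on it,
    directly or through other packages.  `deps` maps a package to the
    list of its direct inputs."""
    dependents = {pkg: set() for pkg in deps}
    for inputs in deps.values():
        for dep in inputs:
            dependents.setdefault(dep, set())

    def closure(pkg, seen):
        # Depth-first walk collecting every input reachable from `pkg`.
        for dep in deps.get(pkg, ()):
            if dep not in seen:
                seen.add(dep)
                closure(dep, seen)
        return seen

    for pkg in deps:
        for dep in closure(pkg, set()):
            dependents[dep].add(pkg)
    return dependents


def rank_by_impact(deps):
    """Packages sorted by number of transitive dependents, largest first."""
    dependents = transitive_dependents(deps)
    return sorted(dependents, key=lambda p: (-len(dependents[p]), p))


# Toy graph: numpy -> python -> zlib -> libc
deps = {
    "libc": [],
    "zlib": ["libc"],
    "python": ["zlib", "libc"],
    "numpy": ["python"],
}

print(rank_by_impact(deps))  # ['libc', 'zlib', 'python', 'numpy']
```

Here a missing "libc" source would break three other packages, while a
missing "numpy" source breaks only itself.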


1: https://ngyro.com/pog-reports/2021-10-31/
2: https://ngyro.com/pog-reports/2021-11-30/


> >   • rate limit: we can find an arrangement to raise it for the purposes
> >     of statistics gathering like Simon and Timothy have been doing (we
> >     can discuss the details off-list);
>
> Cool!  So far it hasn’t been a concern for me, but it would help in the
> future if we want to try and track down Git repositories that have gone
> missing.

Timothy, could you provide again the entry point you use?


> >     they’re not opposed to the idea of eventually hosting or maintaining
> >     the Disarchive database (in fact one of the developers thought we
> >     were hosting it in Git and that as such they were already archiving
> >     it—maybe we could go back to Git?);
>
> It’s a possibility, but right now I’m hopeful that the database will be
> in the care of SWH directly before too long.  I’d rather wait and see at
> this point.  I’m sure we could manage it, but the uncompressed size of
> the Disarchive specification of a Chromium tarball is 366M.  Storing all
> the XZ specifications uncompressed is over 20G.  It would be a big Git
> repo!

Hehe!  That's something we discussed at the very beginning of Disarchive. :-)

If Disarchive-DB is managed by SWH, some people may have security
concerns.  I mean, today SWH ingests an archive, and this archive is
checksummed using a robust algorithm, say Foo.  Using the content from
SWH and the metadata from Disarchive-DB, the archive is rebuilt, and
because Foo is robust, it is possible to check that the rebuilt
archive matches the expected checksum.  Later, Foo becomes weak and
preimage attacks become possible.  All one has is the expected
checksum using Foo.  Therefore, SWH could cheat and introduce
something into the content and/or the metadata that still matches the
expected Foo checksum.  If the two databases are independent, this is
harder. :-)

Well, this assumes that SWH will still be there when Foo is broken.
Currently Foo is SHA-256, so who knows. :-)

From a scientific perspective, this scenario (SWH corrupted) is really
low on the list of issues. ;-)
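
For the interested reader, the check I have in mind looks roughly like
the sketch below.  The rebuild function is only a stand-in (it just
concatenates bytes); the real reassembly is what Disarchive does, and
none of these names come from Disarchive or SWH.

```python
import hashlib


def rebuild(metadata: bytes, content: bytes) -> bytes:
    """Stand-in for reassembling an archive from its metadata
    (Disarchive-DB) and its content (SWH)."""
    return metadata + content


def verify(metadata: bytes, content: bytes, expected_sha256: str) -> bool:
    """Check the rebuilt archive against the checksum recorded earlier,
    for instance in the package definition."""
    digest = hashlib.sha256(rebuild(metadata, content)).hexdigest()
    return digest == expected_sha256


# The expected checksum is recorded once, from the original archive.
expected = hashlib.sha256(rebuild(b"header", b"payload")).hexdigest()

print(verify(b"header", b"payload", expected))   # True
print(verify(b"header", b"tampered", expected))  # False
```

The check only tells us something for as long as nobody can craft a
different (metadata, content) pair with the same digest, which is why
it matters who controls each of the two inputs.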


> >   • bit-for-bit archival: there’s a tension between making SWH a
> >     “canonical” representation of VCS repos and making it a faithful,
> >     bit-for-bit identical copy of the original, and there are different
> >     opinions in the team here; our use case pretty much requires
> >     bit-for-bit copies, and fortunately this is what SWH is giving us in
> >     practice for Git repos, so checkout authentication (for example)
> >     should work even when fetching Guix from SWH.

The main issue is the lookup.  Non-bit-for-bit archival implies that
people store an SWH lookup key (a swhid, I guess) at ingestion time;
otherwise it becomes nearly impossible to find the content again.  To
me, the tension is in the meaning of preservation of source code,
i.e., between archiving for reading and archiving for compiling.  In
the case of compilation, all lookups must be automated, and so non
bit-for-bit archival means making swhid THE standard for
serialization, somehow replacing all the other checksums.
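
To illustrate what such a lookup key is: for a single file, the swhid
is built on the same hash Git uses for blobs, so it can be computed
locally from the bytes alone.  A minimal sketch (the function name is
mine; directory and revision identifiers need more work than this):

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the swhid of a file content: the SHA-1 of a Git-style
    blob header followed by the bytes, prefixed with "swh:1:cnt:"."""
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# The empty file has the well-known Git blob hash e69de29b...
print(content_swhid(b""))
```

The point is that this identifier is not the SHA-256 Guix records for
an origin, so one has to be derived or stored next to the other.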


> > Anyway I think we can take this as an opportunity to increase bandwidth
> > with the SWH developers!

Yeah, let's have a good story! :-)


Cheers,
simon



* Re: Software Heritage fifth anniversary event
  2021-12-02 13:17   ` zimoun
@ 2021-12-02 14:04     ` Timothy Sample
  2021-12-02 15:49       ` zimoun
  2021-12-02 15:02     ` Ludovic Courtès
  1 sibling, 1 reply; 11+ messages in thread
From: Timothy Sample @ 2021-12-02 14:04 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel, Ludovic Courtès, guix-science

Hi,

zimoun <zimon.toutoune@gmail.com> writes:

> Timothy, could you provide again the entry point you use?

https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html#post--api-1-known-
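
A minimal sketch of how that endpoint can be queried, assuming the
public archive at archive.softwareheritage.org, a JSON list of swhids
as the request body, and a response that maps each swhid to an object
with a "known" field.  The function names are made up for this
example, and each call counts against the rate limit.

```python
import json
import urllib.request

# Assumption: the public instance; adjust for another deployment.
KNOWN_URL = "https://archive.softwareheritage.org/api/1/known/"


def build_known_request(swhids):
    """Prepare a POST request whose body is a JSON list of swhids."""
    body = json.dumps(list(swhids)).encode("utf-8")
    return urllib.request.Request(
        KNOWN_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def missing_swhids(swhids):
    """Return the swhids that the archive reports as not known.
    This performs a network request."""
    swhids = list(swhids)
    with urllib.request.urlopen(build_known_request(swhids)) as response:
        result = json.load(response)
    return [s for s in swhids if not result.get(s, {}).get("known", False)]
```

Batching many identifiers into one request is what keeps the number
of calls low.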


-- Tim



* Re: Software Heritage fifth anniversary event
  2021-12-02 13:17   ` zimoun
  2021-12-02 14:04     ` Timothy Sample
@ 2021-12-02 15:02     ` Ludovic Courtès
  1 sibling, 0 replies; 11+ messages in thread
From: Ludovic Courtès @ 2021-12-02 15:02 UTC (permalink / raw)
  To: zimoun; +Cc: Timothy Sample, Guix Devel, guix-science

Hello!

zimoun <zimon.toutoune@gmail.com> skribis:

>> >   • bit-for-bit archival: there’s a tension between making SWH a
>> >     “canonical” representation of VCS repos and making it a faithful,
>> >     bit-for-bit identical copy of the original, and there are different
>> >     opinions in the team here; our use case pretty much requires
>> >     bit-for-bit copies, and fortunately this is what SWH is giving us in
>> >     practice for Git repos, so checkout authentication (for example)
>> >     should work even when fetching Guix from SWH.
>
> The main issue is the lookup.  Non-bit-for-bit archival implies that
> people store an SWH lookup key (a swhid, I guess) at ingestion time;
> otherwise it becomes nearly impossible to find the content again.  To
> me, the tension is in the meaning of preservation of source code,
> i.e., between archiving for reading and archiving for compiling.

Exactly, I guess that’s the big difference.  Also: allowing archived
content to be authenticated by third parties vs. having to trust SWH.

> In the case of compilation, all lookups must be automated, and so
> non bit-for-bit archival means making swhid THE standard for
> serialization, somehow replacing all the other checksums.

Yes, but even if that eventually happens, it’s going to take time.

Ludo’.



* Re: Software Heritage fifth anniversary event
  2021-12-02 14:04     ` Timothy Sample
@ 2021-12-02 15:49       ` zimoun
  2021-12-02 18:04         ` Ludovic Courtès
  0 siblings, 1 reply; 11+ messages in thread
From: zimoun @ 2021-12-02 15:49 UTC (permalink / raw)
  To: Timothy Sample; +Cc: Guix Devel, Ludovic Courtès, guix-science

Hi Tim,

On Thu, 2 Dec 2021 at 15:04, Timothy Sample <samplet@ngyro.com> wrote:

> https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html#post--api-1-known-

Thanks!

For the interested reader, http://logs.guix.gnu.org/guix/2021-12-02.log#164713

Cheers,
simon



* Re: Software Heritage fifth anniversary event
  2021-12-02 15:49       ` zimoun
@ 2021-12-02 18:04         ` Ludovic Courtès
  0 siblings, 0 replies; 11+ messages in thread
From: Ludovic Courtès @ 2021-12-02 18:04 UTC (permalink / raw)
  To: zimoun; +Cc: Timothy Sample, Guix Devel, guix-science

zimoun <zimon.toutoune@gmail.com> skribis:

> On Thu, 2 Dec 2021 at 15:04, Timothy Sample <samplet@ngyro.com> wrote:
>
>> https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html#post--api-1-known-
>
> Thanks!
>
> For the interested reader, http://logs.guix.gnu.org/guix/2021-12-02.log#164713

Nice!  Would be nice to integrate the relevant bits of
<https://git.ngyro.com/preservation-of-guix/tree/pog/swhid.scm> into
(guix swh) eventually.

Ludo’.



* Re: Software Heritage fifth anniversary event
  2021-12-01  9:41 Software Heritage fifth anniversary event Ludovic Courtès
  2021-12-01 18:04 ` Timothy Sample
@ 2021-12-03  3:58 ` Maxim Cournoyer
  2021-12-03 11:01   ` Ludovic Courtès
  1 sibling, 1 reply; 11+ messages in thread
From: Maxim Cournoyer @ 2021-12-03  3:58 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel, guix-science

Hello Ludovic!

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> Hello Guix!
>
> I had the pleasure of attending the Software Heritage fifth anniversary
> event yesterday at the UNESCO headquarters (fancy!) and at Inria in
> Paris.
>
> I learned about things others are doing with SWH (notably in the
> cultural and scientific fields) and had discussions with hackers (people
> who work on Subversion, CVS, Mercurial, and Bazaar “loaders”, for
> instance).  I gave a 10–15mn talk on how Guix uses SWH, what Disarchive
> is, what the current status of the “preservation of Guix” is, and what
> remains to be done:
>
>   https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/swh-unesco-2021/talk.20211130.pdf
>
> (There was a great talk about Maneage¹ right before mine.)

Interesting!  I'm glad you could showcase Guix as a solution to software
deployment there.

Thank you for sharing the slides, they were interesting/entertaining.  I
note that they were written prior to 'guix shell' :-).

Cheers,

Maxim



* Re: Software Heritage fifth anniversary event
  2021-12-03  3:58 ` Maxim Cournoyer
@ 2021-12-03 11:01   ` Ludovic Courtès
  2021-12-05  5:10     ` Maxim Cournoyer
  0 siblings, 1 reply; 11+ messages in thread
From: Ludovic Courtès @ 2021-12-03 11:01 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: guix-devel, guix-science

Hey!

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> Thank you for sharing the slides, they were interesting/entertaining.  I
> note that they were written prior to 'guix shell' :-).

I thought I’d rather leave ‘guix environment’ in there as someone
downloading 1.3.0 won’t have ‘guix shell’.

Likewise, I recently gave a packaging tutorial and I was extremely
frustrated that I had to teach things `(("like" ,this)).  :-)

Ludo’.



* Re: Software Heritage fifth anniversary event
  2021-12-03 11:01   ` Ludovic Courtès
@ 2021-12-05  5:10     ` Maxim Cournoyer
  0 siblings, 0 replies; 11+ messages in thread
From: Maxim Cournoyer @ 2021-12-05  5:10 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-devel, guix-science

Hi!

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> Hey!
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> Thank you for sharing the slides, they were interesting/entertaining.  I
>> note that they were written prior to 'guix shell' :-).
>
> I thought I’d rather leave ‘guix environment’ in there as someone
> downloading 1.3.0 won’t have ‘guix shell’.
>
> Likewise, I recently gave a packaging tutorial and I was extremely
> frustrated that I had to teach things `(("like" ,this)).  :-)

Eh :-).  We can all forget about that syntax soon...

Thanks!

Maxim


