* Software Heritage fifth anniversary event
From: Ludovic Courtès @ 2021-12-01 9:41 UTC
To: guix-devel, guix-science; +Cc: zimoun, Timothy Sample

Hello Guix!

I had the pleasure of attending the Software Heritage fifth anniversary
event yesterday at the UNESCO headquarters (fancy!) and at Inria in
Paris.

I learned about things others are doing with SWH (notably in the
cultural and scientific fields) and had discussions with hackers (people
who work on Subversion, CVS, Mercurial, and Bazaar “loaders”, for
instance).  I gave a 10–15 min talk on how Guix uses SWH, what Disarchive
is, what the current status of the “preservation of Guix” is, and what
remains to be done:

  https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/swh-unesco-2021/talk.20211130.pdf

(There was a great talk about Maneage¹ right before mine.)

I chatted with the SWH tech team; they’re obviously very busy solving
all sorts of scalability challenges :-) but they’re also truly
interested in what we’re doing and in supporting our use case.  Off the
top of my head, here are some of the topics discussed:

  • ingesting past revisions: if we can give them ‘sources.json’ for
    past revisions, they’re happy to ingest them;

  • rate limit: we can find an arrangement to raise it for the purposes
    of statistics gathering like Simon and Timothy have been doing (we
    can discuss the details off-list);

  • Disarchive: they’d like to better understand the “unknowns” in the
    PoG plots (I wasn’t sure if it was non-tar.gz tarballs or what) and
    to work on the definitely-missing origins that show up there;
    they’re not opposed to the idea of eventually hosting or maintaining
    the Disarchive database (in fact one of the developers thought we
    were hosting it in Git and that as such they were already archiving
    it; maybe we could go back to Git?);

  • bit-for-bit archival: there’s a tension between making SWH a
    “canonical” representation of VCS repos and making it a faithful,
    bit-for-bit identical copy of the original, and there are different
    opinions in the team here; our use case pretty much requires
    bit-for-bit copies, and fortunately this is what SWH is giving us in
    practice for Git repos, so checkout authentication (for example)
    should work even when fetching Guix from SWH.

There were other discussions about Guix and Nix and I was pleased to see
people were enthusiastic about functional package management and about
our whole endeavor.

Anyway I think we can take this as an opportunity to increase bandwidth
with the SWH developers!

Thanks,
Ludo’.

¹ https://maneage.org/
* Re: Software Heritage fifth anniversary event
From: Timothy Sample @ 2021-12-01 18:04 UTC
To: Ludovic Courtès; +Cc: guix-devel, guix-science

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> I gave a 10–15 min talk on how Guix uses SWH, what Disarchive is, what
> the current status of the “preservation of Guix” is, and what remains
> to be done:
>
> https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/swh-unesco-2021/talk.20211130.pdf

Wow – great work!

> I chatted with the SWH tech team; they’re obviously very busy solving
> all sorts of scalability challenges :-) but they’re also truly
> interested in what we’re doing and in supporting our use case.  Off the
> top of my head, here are some of the topics discussed:
>
>   • ingesting past revisions: if we can give them ‘sources.json’ for
>     past revisions, they’re happy to ingest them;

This is something I can probably coax out of the Preservation of Guix
database.  That might be the cheapest way to do it.  Alternatively, when
we get “sources.json” built with Cuirass, we could tell Cuirass to build
out a sample of previous commits to get pretty good coverage.  (Side
note: eventually we could verify the coverage of the sampling approach
using the Data Service, which has processed a very exhaustive list of
commits.)

>   • rate limit: we can find an arrangement to raise it for the purposes
>     of statistics gathering like Simon and Timothy have been doing (we
>     can discuss the details off-list);

Cool!  So far it hasn’t been a concern for me, but it would help in the
future if we want to try and track down Git repositories that have gone
missing.

>   • Disarchive: they’d like to better understand the “unknowns” in the
>     PoG plots (I wasn’t sure if it was non-tar.gz tarballs or what) and
>     to work on the definitely-missing origins that show up there;

Many of the unknowns are there for me to track Disarchive progress.
It’s not really the clearest reporting, but it tracks more what Guix can
handle automatically than what we could theoretically know about.
Basically something is “known” if it can be downloaded from upstream,
and either: it’s a non-recursive Git reference; or it’s something
Disarchive can handle.  Hence, we know nothing about other version
control systems and, say, “.tar.bz2” archives.  Also, all these things
are based on heuristics.  :)  As we get closer to 100% known, we can
start analyzing everything more closely.

>     they’re not opposed to the idea of eventually hosting or maintaining
>     the Disarchive database (in fact one of the developers thought we
>     were hosting it in Git and that as such they were already archiving
>     it; maybe we could go back to Git?);

It’s a possibility, but right now I’m hopeful that the database will be
in the care of SWH directly before too long.  I’d rather wait and see at
this point.  I’m sure we could manage it, but the uncompressed size of
the Disarchive specification of a Chromium tarball is 366M.  Storing all
the XZ specifications uncompressed is over 20G.  It would be a big Git
repo!

>   • bit-for-bit archival: there’s a tension between making SWH a
>     “canonical” representation of VCS repos and making it a faithful,
>     bit-for-bit identical copy of the original, and there are different
>     opinions in the team here; our use case pretty much requires
>     bit-for-bit copies, and fortunately this is what SWH is giving us in
>     practice for Git repos, so checkout authentication (for example)
>     should work even when fetching Guix from SWH.

That’s interesting.  I’m sure most of us in the Guix camp are on team
bit-for-bit, but I’m sure we can all agree that it’s not easy to get
there.

> There were other discussions about Guix and Nix and I was pleased to see
> people were enthusiastic about functional package management and about
> our whole endeavor.
>
> Anyway I think we can take this as an opportunity to increase bandwidth
> with the SWH developers!

Good idea.  It’s nice when our efforts and experience produce something
useful to the broader free software community.  :)

-- 
Tim
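The working definition of “known” given in the message above can be read
as a small predicate.  The Python sketch below is only an illustration
of that rule: the `Source` record, its field names, and the list of
archive suffixes Disarchive is assumed to handle are all hypothetical
and are not taken from the Preservation of Guix code.

```python
from dataclasses import dataclass

# ASSUMPTION: suffixes Disarchive is taken to handle in this sketch.
# The thread only states that ".tar.bz2" was not handled at the time.
DISARCHIVE_SUFFIXES = (".tar.gz", ".tgz")


@dataclass
class Source:
    """Hypothetical description of one package source."""
    kind: str                # "git", "url", "svn", "hg", ...
    location: str            # URL or file name of the source
    downloadable: bool       # can it still be fetched from upstream?
    recursive: bool = False  # Git checkout that needs submodules?


def is_known(src: Source) -> bool:
    """Apply the heuristic: downloadable from upstream, and either a
    non-recursive Git reference or an archive Disarchive can handle."""
    if not src.downloadable:
        return False
    if src.kind == "git":
        return not src.recursive
    if src.kind == "url":
        return src.location.endswith(DISARCHIVE_SUFFIXES)
    # Other version control systems count as unknown.
    return False
```

With this reading, a Subversion checkout or a “.tar.bz2” tarball is
reported as unknown even when upstream still serves it, which matches
the point that the plots track what can be handled automatically rather
than what could be known in principle.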
* Re: Software Heritage fifth anniversary event
From: Ludovic Courtès @ 2021-12-02 8:59 UTC
To: Timothy Sample; +Cc: guix-devel, guix-science, zimoun

Hi!

Timothy Sample <samplet@ngyro.com> writes:

> Ludovic Courtès <ludovic.courtes@inria.fr> writes:

[...]

>>   • Disarchive: they’d like to better understand the “unknowns” in the
>>     PoG plots (I wasn’t sure if it was non-tar.gz tarballs or what) and
>>     to work on the definitely-missing origins that show up there;
>
> Many of the unknowns are there for me to track Disarchive progress.
> It’s not really the clearest reporting, but it tracks more what Guix can
> handle automatically than what we could theoretically know about.
> Basically something is “known” if it can be downloaded from upstream,
> and either: it’s a non-recursive Git reference; or it’s something
> Disarchive can handle.  Hence, we know nothing about other version
> control systems and, say, “.tar.bz2” archives.  Also, all these things
> are based on heuristics.  :)  As we get closer to 100% known, we can
> start analyzing everything more closely.

Right.  Perhaps at some point we can give them (say on swh-devel) this
explanation so they have a clearer view of how far Disarchive is from
being “production-ready” from an SWH perspective.

Valentin of the SWH team played a lot with pristine-tar and I’m sure
they’d have useful feedback to give.

>>     they’re not opposed to the idea of eventually hosting or maintaining
>>     the Disarchive database (in fact one of the developers thought we
>>     were hosting it in Git and that as such they were already archiving
>>     it; maybe we could go back to Git?);
>
> It’s a possibility, but right now I’m hopeful that the database will be
> in the care of SWH directly before too long.  I’d rather wait and see at
> this point.  I’m sure we could manage it, but the uncompressed size of
> the Disarchive specification of a Chromium tarball is 366M.  Storing all
> the XZ specifications uncompressed is over 20G.  It would be a big Git
> repo!

Indeed!

So, in passing, you’re telling us that xz support is kinda ready,
right?  :-)

Thanks!

Ludo’.
* Re: Software Heritage fifth anniversary event
From: zimoun @ 2021-12-02 13:17 UTC
To: Timothy Sample; +Cc: Ludovic Courtès, Guix Devel, guix-science

Hi,

On Wed, 1 Dec 2021 at 19:04, Timothy Sample <samplet@ngyro.com> wrote:
> Ludovic Courtès <ludovic.courtes@inria.fr> writes:
>
> > I gave a 10–15 min talk on how Guix uses SWH, what Disarchive is, what
> > the current status of the “preservation of Guix” is, and what remains
> > to be done:
> >
> > https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/swh-unesco-2021/talk.20211130.pdf

Thank you Ludo for this nice write-up!  I hope the stream was recorded
and will soon be available to all. :-)

> > I chatted with the SWH tech team; they’re obviously very busy solving
> > all sorts of scalability challenges :-) but they’re also truly
> > interested in what we’re doing and in supporting our use case.  Off the
> > top of my head, here are some of the topics discussed:
> >
> >   • ingesting past revisions: if we can give them ‘sources.json’ for
> >     past revisions, they’re happy to ingest them;
>
> This is something I can probably coax out of the Preservation of Guix
> database.  That might be the cheapest way to do it.  Alternatively, when
> we get “sources.json” built with Cuirass, we could tell Cuirass to build
> out a sample of previous commits to get pretty good coverage.  (Side
> note: eventually we could verify the coverage of the sampling approach
> using the Data Service, which has processed a very exhaustive list of
> commits.)

Let's avoid quirks, because right now the ingestion requires too many
manual checks. :-)

For instance, "guix lint -c archival" works well but it is not
systematically run by contributors or pushers, especially for quickly
updated packages.  This is mainly what we see: 35 vs 24 missing
type:git from PoG [1,2].

On the other hand, 'sources.json' is built with the Guix website.  But
SWH ingests only the tarball items from there.  It is not clear to me
how to add both to CI: the save requests for git-fetch packages and the
build of 'sources.json'.

Last, not all packages are equal.  We could have 99.99% coverage, but
if the missing 0.01% of packages are deep in the graph, then the whole
house of cards falls down.  Somehow, we need to work on the graph and
spot the "important" ones, or at least sort them.  Argh, it is something
I have wanted to do for a long time (it would help when a release is
coming), but days only have 24 hours. ;-)

1: https://ngyro.com/pog-reports/2021-10-31/
2: https://ngyro.com/pog-reports/2021-11-30/

> >   • rate limit: we can find an arrangement to raise it for the purposes
> >     of statistics gathering like Simon and Timothy have been doing (we
> >     can discuss the details off-list);
>
> Cool!  So far it hasn’t been a concern for me, but it would help in the
> future if we want to try and track down Git repositories that have gone
> missing.

Timothy, could you provide again the entry point you use?

> >     they’re not opposed to the idea of eventually hosting or maintaining
> >     the Disarchive database (in fact one of the developers thought we
> >     were hosting it in Git and that as such they were already archiving
> >     it; maybe we could go back to Git?);
>
> It’s a possibility, but right now I’m hopeful that the database will be
> in the care of SWH directly before too long.  I’d rather wait and see at
> this point.  I’m sure we could manage it, but the uncompressed size of
> the Disarchive specification of a Chromium tarball is 366M.  Storing all
> the XZ specifications uncompressed is over 20G.  It would be a big Git
> repo!

Hehe!  That's something we discussed at the very beginning of
Disarchive. :-)

If Disarchive-DB is managed by SWH, some people might have security
concerns.  I mean, today SWH ingests an archive.  Today, this archive
is checksummed using a robust algorithm, say Foo.  Using the content
from SWH and the metadata from Disarchive-DB, the archive is rebuilt,
and because Foo is robust, it is possible to check that the rebuild
matches the expected checksum.  Later, Foo becomes weak and a preimage
attack is possible.  All one has is the expected checksum computed with
Foo.  Therefore, SWH could cheat and introduce something in the content
and/or the metadata that still matches the Foo checksum.  If the two
databases are independent, then that is harder. :-)

Well, the assumption is that SWH would still be there when Foo is
broken.  Currently Foo is SHA-256, so who knows. :-)  From a scientific
perspective, this scenario (SWH corrupted) is really low on the list of
issues. ;-)

> >   • bit-for-bit archival: there’s a tension between making SWH a
> >     “canonical” representation of VCS repos and making it a faithful,
> >     bit-for-bit identical copy of the original, and there are different
> >     opinions in the team here; our use case pretty much requires
> >     bit-for-bit copies, and fortunately this is what SWH is giving us in
> >     practice for Git repos, so checkout authentication (for example)
> >     should work even when fetching Guix from SWH.

The main issue is the lookup.  Non bit-for-bit archival implies that
people store a SWH lookup key (swhid I guess) at ingestion time,
otherwise it becomes nearly impossible to find the content again.  To
me, the tension is in the meaning of preservation of source code, i.e.,
between archiving for reading or archiving for compiling.

In the case of compilation, all lookups must be automated, and so
non bit-for-bit archival means: make swhid THE standard for
serialization, somehow replacing all the other checksums.

> > Anyway I think we can take this as an opportunity to increase bandwidth
> > with the SWH developers!

Yeah, let's have a good story! :-)

Cheers,
simon
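The checksum scenario in the message above comes down to one comparison
at rebuild time: the rebuilt archive is hashed and compared with the
digest recorded at ingestion.  The Python sketch below illustrates only
that step; `rebuild_matches` and its parameters are hypothetical names,
and the bytes passed in stand for whatever would be reassembled from SWH
content and Disarchive metadata.

```python
import hashlib


def rebuild_matches(rebuilt: bytes, expected_hex: str,
                    algorithm: str = "sha256") -> bool:
    """Return True if the rebuilt archive hashes to the recorded digest.

    ASSUMPTION: `expected_hex` is the hexadecimal digest recorded when
    the archive was first seen; "sha256" plays the role of "Foo".
    """
    actual = hashlib.new(algorithm, rebuilt).hexdigest()
    return actual == expected_hex.lower()
```

The check is only as strong as the hash function: if preimages for
“Foo” become feasible, a different byte string can pass this test,
which is why the message argues that keeping the content and the
metadata in independent hands makes cheating harder.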
* Re: Software Heritage fifth anniversary event
From: Timothy Sample @ 2021-12-02 14:04 UTC
To: zimoun; +Cc: Guix Devel, Ludovic Courtès, guix-science

Hi,

zimoun <zimon.toutoune@gmail.com> writes:

> Timothy, could you provide again the entry point you use?

https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html#post--api-1-known-

-- 
Tim
* Re: Software Heritage fifth anniversary event
From: zimoun @ 2021-12-02 15:49 UTC
To: Timothy Sample; +Cc: Guix Devel, Ludovic Courtès, guix-science

Hi Tim,

On Thu, 2 Dec 2021 at 15:04, Timothy Sample <samplet@ngyro.com> wrote:

> https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html#post--api-1-known-

Thanks!

For the interested reader, http://logs.guix.gnu.org/guix/2021-12-02.log#164713

Cheers,
simon
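For readers who have not used the entry point linked above: the
“known” endpoint takes a JSON list of SWHIDs in a POST request and
answers with a JSON object that maps each SWHID to a `known` flag.  The
Python sketch below shows one possible way to build such a request and
read the answer.  The host name `archive.softwareheritage.org` and the
helper names are assumptions of this sketch, not something stated in
the thread, and rate limits apply to real queries.

```python
import json
import urllib.request

# ASSUMPTION: public archive host; the path comes from the linked docs.
KNOWN_URL = "https://archive.softwareheritage.org/api/1/known/"


def build_known_request(swhids: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST request listing the SWHIDs."""
    body = json.dumps(swhids).encode("utf-8")
    return urllib.request.Request(
        KNOWN_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def missing_from(response_body: bytes) -> list[str]:
    """Return the SWHIDs the archive reports as not known."""
    result = json.loads(response_body)
    return sorted(swhid for swhid, info in result.items()
                  if not info["known"])


def query_missing(swhids: list[str]) -> list[str]:
    """Send the request; needs network access to the archive."""
    with urllib.request.urlopen(build_known_request(swhids)) as resp:
        return missing_from(resp.read())
```

Keeping request construction and response parsing separate from the
network call makes it easy to batch identifiers and to test the logic
without touching the archive.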
* Re: Software Heritage fifth anniversary event
From: Ludovic Courtès @ 2021-12-02 18:04 UTC
To: zimoun; +Cc: Timothy Sample, Guix Devel, guix-science

zimoun <zimon.toutoune@gmail.com> writes:

> On Thu, 2 Dec 2021 at 15:04, Timothy Sample <samplet@ngyro.com> wrote:
>
>> https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html#post--api-1-known-
>
> Thanks!
>
> For the interested reader, http://logs.guix.gnu.org/guix/2021-12-02.log#164713

Nice!  It would be nice to integrate the relevant bits of
<https://git.ngyro.com/preservation-of-guix/tree/pog/swhid.scm> into
(guix swh) eventually.

Ludo’.
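As background on what computing a SWHID involves: for a single file
(“content” object), the identifier is `swh:1:cnt:` followed by the same
SHA-1 that Git assigns to a blob with those bytes.  The Python sketch
below shows only that simplest case; it is an illustration and makes no
claim about how pog/swhid.scm is written.  Directory and revision
identifiers, which matter more for Git checkouts, are not covered here.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file's content.

    The digest is the Git blob hash: SHA-1 over the header
    "blob <length>" plus a NUL byte, followed by the raw bytes.
    """
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# The empty file has the well-known Git blob hash e69de29b...
print(content_swhid(b""))
# → swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Identifiers computed this way could then be submitted in batches to the
“known” endpoint mentioned earlier in the thread.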
* Re: Software Heritage fifth anniversary event
From: Ludovic Courtès @ 2021-12-02 15:02 UTC
To: zimoun; +Cc: Timothy Sample, Guix Devel, guix-science

Hello!

zimoun <zimon.toutoune@gmail.com> writes:

>> >   • bit-for-bit archival: there’s a tension between making SWH a
>> >     “canonical” representation of VCS repos and making it a faithful,
>> >     bit-for-bit identical copy of the original, and there are different
>> >     opinions in the team here; our use case pretty much requires
>> >     bit-for-bit copies, and fortunately this is what SWH is giving us in
>> >     practice for Git repos, so checkout authentication (for example)
>> >     should work even when fetching Guix from SWH.
>
> The main issue is the lookup.  Non bit-for-bit archival implies that
> people store a SWH lookup key (swhid I guess) at ingestion time,
> otherwise it becomes nearly impossible to find the content again.  To
> me, the tension is in the meaning of preservation of source code, i.e.,
> between archiving for reading or archiving for compiling.

Exactly, I guess that’s the big difference.  Also: allowing archived
content to be authenticated by third parties vs. having to trust SWH.

> In the case of compilation, all lookups must be automated, and so
> non bit-for-bit archival means: make swhid THE standard for
> serialization, somehow replacing all the other checksums.

Yes, but even if that eventually happens, it’s going to take time.

Ludo’.
* Re: Software Heritage fifth anniversary event
From: Maxim Cournoyer @ 2021-12-03 3:58 UTC
To: Ludovic Courtès; +Cc: guix-devel, guix-science

Hello Ludovic!

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> Hello Guix!
>
> I had the pleasure of attending the Software Heritage fifth anniversary
> event yesterday at the UNESCO headquarters (fancy!) and at Inria in
> Paris.
>
> I learned about things others are doing with SWH (notably in the
> cultural and scientific fields) and had discussions with hackers (people
> who work on Subversion, CVS, Mercurial, and Bazaar “loaders”, for
> instance).  I gave a 10–15 min talk on how Guix uses SWH, what Disarchive
> is, what the current status of the “preservation of Guix” is, and what
> remains to be done:
>
> https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/swh-unesco-2021/talk.20211130.pdf
>
> (There was a great talk about Maneage¹ right before mine.)

Interesting!  I'm glad you could showcase Guix as a solution to software
deployment there.

Thank you for sharing the slides, they were interesting/entertaining.  I
note that they were written prior to 'guix shell' :-).

Cheers,

Maxim
* Re: Software Heritage fifth anniversary event
From: Ludovic Courtès @ 2021-12-03 11:01 UTC
To: Maxim Cournoyer; +Cc: guix-devel, guix-science

Hey!

Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

> Thank you for sharing the slides, they were interesting/entertaining.  I
> note that they were written prior to 'guix shell' :-).

I thought I’d rather leave ‘guix environment’ in there as someone
downloading 1.3.0 won’t have ‘guix shell’.

Likewise, I recently gave a packaging tutorial and I was extremely
frustrated that I had to teach things `(("like" ,this)).  :-)

Ludo’.
* Re: Software Heritage fifth anniversary event
From: Maxim Cournoyer @ 2021-12-05 5:10 UTC
To: Ludovic Courtès; +Cc: guix-devel, guix-science

Hi!

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> Hey!
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:
>
>> Thank you for sharing the slides, they were interesting/entertaining.  I
>> note that they were written prior to 'guix shell' :-).
>
> I thought I’d rather leave ‘guix environment’ in there as someone
> downloading 1.3.0 won’t have ‘guix shell’.
>
> Likewise, I recently gave a packaging tutorial and I was extremely
> frustrated that I had to teach things `(("like" ,this)).  :-)

Eh :-).  We can all forget about that syntax soon...

Thanks!

Maxim
Thread overview: 11+ messages

2021-12-01  9:41 Software Heritage fifth anniversary event Ludovic Courtès
2021-12-01 18:04 ` Timothy Sample
2021-12-02  8:59   ` Ludovic Courtès
2021-12-02 13:17   ` zimoun
2021-12-02 14:04     ` Timothy Sample
2021-12-02 15:49       ` zimoun
2021-12-02 18:04         ` Ludovic Courtès
2021-12-02 15:02     ` Ludovic Courtès
2021-12-03  3:58 ` Maxim Cournoyer
2021-12-03 11:01   ` Ludovic Courtès
2021-12-05  5:10     ` Maxim Cournoyer