From: zimoun
Date: Thu, 2 Dec 2021 14:17:39 +0100
Subject: Re: Software Heritage fifth anniversary event
To: Timothy Sample
Cc: Ludovic Courtès, Guix Devel, guix-science@gnu.org

Hi,

On Wed, 1 Dec 2021 at 19:04, Timothy Sample wrote:
> Ludovic Courtès writes:
>
> > I gave a 10–15mn talk on how Guix uses SWH, what Disarchive is, what
> > the current status of the “preservation of Guix” is, and what remains
> > to be done:
> >
> >   https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/talks/swh-unesco-2021/talk.20211130.pdf

Thank you Ludo for this nice write-up!  I hope the stream was recorded
and will soon be available to all. :-)

> > I chatted with the SWH tech team; they’re obviously very busy solving
> > all sorts of scalability challenges :-) but they’re also truly
> > interested in what we’re doing and in supporting our use case.  Off the
> > top of my head, here are some of the topics discussed:
> >
> >   • ingesting past revisions: if we can give them ‘sources.json’ for
> >     past revisions, they’re happy to ingest them;
>
> This is something I can probably coax out of the Preservation of Guix
> database.  That might be the cheapest way to do it.  Alternatively, when
> we get “sources.json” built with Cuirass, we could tell Cuirass to build
> out a sample of previous commits to get pretty good coverage.  (Side
> note: eventually we could verify the coverage of the sampling approach
> using the Data Service, which has a processed a very exhaustive list of
> commits.)

Let's avoid "quirks", because right now the ingestion requires too many
manual checks. :-)

For instance, "guix lint -c archival" works well, but contributors and
committers do not run it systematically, especially for quickly updated
packages.  This is mainly what we see: 35 vs 24 missing type:git in the
PoG reports [1,2].

On the other hand, 'sources.json' is built with the Guix website, but
SWH ingests only the tarball items from it.  It is not clear to me how
to add both to CI: sending save requests for git-fetch packages and
building 'sources.json'.

Last, not all packages are equal.  We could have 99.99% coverage, but
if the missing 0.01% of packages are deep in the graph, then the whole
house of cards falls down.  Somehow, we need to work on the graph and
spot the "important" packages, or at least sort them.  Argh, it is
something I have wanted to do for a long time (it would help when a
release is coming), but days only have 24h. ;-)

1: https://ngyro.com/pog-reports/2021-10-31/
2: https://ngyro.com/pog-reports/2021-11-30/

> >   • rate limit: we can find an arrangement to raise it for the purposes
> >     of statistics gathering like Simon and Timothy have been doing (we
> >     can discuss the details off-list);
>
> Cool!  So far it hasn’t been a concern for me, but it would help in the
> future if want to try and track down Git repositories that have gone
> missing.

Timothy, could you provide again the entry point you use?

> >     they’re not opposed to the idea of eventually hosting or maintaining
> >     the Disarchive database (in fact one of the developers thought we
> >     were hosting it in Git and that as such they were already archiving
> >     it -- maybe we could go back to Git?);
>
> It’s a possibility, but right now I’m hopeful that the database will be
> in the care of SWH directly before too long.  I’d rather wait and see at
> this point.
> I’m sure we could manage it, but the uncompressed size of
> the Disarchive specification of a Chromium tarball is 366M.  Storing all
> the XZ specifications uncompressed is over 20G.  It would be a big Git
> repo!

Hehe!  That's something we discussed at the very beginning of
Disarchive. :-)

If Disarchive-DB is managed by SWH, some people might have security
concerns.  I mean, today SWH ingests an archive, and this archive is
checksummed using a robust algorithm, say Foo.  Using the content from
SWH and the metadata from Disarchive-DB, the archive is rebuilt, and
because Foo is robust, it is possible to check that the rebuild matches
the expected checksum.

Later, Foo becomes weak and a preimage attack is possible.  All one has
is the expected checksum using Foo.  Therefore, SWH could cheat and
introduce something in the content and/or the metadata that still
matches the expected Foo checksum.  If the 2 databases are independent,
then it is harder. :-)

Well, the assumption is that SWH would still be there when Foo is
broken.  Currently Foo is SHA-256, so who knows. :-)

From a scientific perspective, this scenario (SWH corrupted) is really
low on the list of issues. ;-)

> >   • bit-for-bit archival: there’s a tension between making SWH a
> >     “canonical” representation of VCS repos and making it a faithful,
> >     bit-for-bit identical copy of the original, and there are different
> >     opinions in the team here; our use case pretty much requires
> >     bit-for-bit copies, and fortunately this is what SWH is giving us in
> >     practice for Git repos, so checkout authentication (for example)
> >     should work even when fetching Guix from SWH.

The main issue is the lookup.  Non bit-for-bit archival implies that
people store a SWH lookup key (swhid I guess) at ingestion time,
otherwise it becomes nearly impossible to find the content again.

To me, the tension is in the meaning of preservation of source code,
i.e., between archiving for reading or archiving for compiling.
In the case of compilation, all the lookups must be automated, and so
non bit-for-bit archival means: make swhid THE standard for
serialization, somehow replacing all the other checksums.

> > Anyway I think we can take this as an opportunity to increase bandwidth
> > with the SWH developers!

Yeah, let's have a good story! :-)

Cheers,
simon
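
PS: To make the lookup point a bit more concrete, here is a minimal
Python sketch.  It assumes the simplest case, a single file, where (as
far as I understand the SWHID scheme) the content identifier is the Git
blob hash prefixed with "swh:1:cnt:".  The function names are mine and
purely illustrative; this is not how Guix or SWH implement anything.
It only shows that the SHA-256 checksum and the swhid are two unrelated
keys computed from the same bytes, so one cannot be derived from the
other once the bytes are gone.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Plain SHA-256 of the bytes, hex-encoded (illustrative helper)."""
    return hashlib.sha256(data).hexdigest()


def swhid_for_content(data: bytes) -> str:
    """SWHID of a single file's content (illustrative helper).

    Assumption: content identifiers reuse the Git blob hash, i.e.
    SHA-1 over the header b"blob <length>\\0" followed by the bytes.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    sample = b"hello\n"
    # Two independent keys for the very same bytes.
    print("sha256:", sha256_hex(sample))
    print("swhid: ", swhid_for_content(sample))
```

For a tarball the situation is worse than for a single file, since the
archive is a directory tree plus metadata, which is exactly why storing
the lookup key at ingestion time matters.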