From: Simon Tournier
To: Guix Devel
Subject: intrinsic vs extrinsic identifier: toward more robustness?
Date: Fri, 03 Mar 2023 19:07:23 +0100
Message-ID: <87jzzxd7z8.fsf@gmail.com>

Hi,

I would like to open a discussion about how we identify the source
origin (fixed output).  It is vitally important for being robust in the
long term (say 3-5 years).  It matters in the Reproducible Research
context, but not only there.


# First thing first
===================

## What is an intrinsic identifier or an extrinsic one?
=======================================================

 - extrinsic: uses a register to keep the correspondence between the
   identifier and the object; e.g., a version label as a Git tag.

 - intrinsic: intimately bound to the designated object itself; e.g., a
   hash as for a Git blob or tree and, to some extent, a commit.
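
To make the two definitions concrete, here is a minimal Python sketch.
The dict `register` and the helper names are hypothetical and have
nothing to do with Guix internals; SHA-256 over the raw bytes only
stands in for whichever hash a real intrinsic scheme would use.

```python
import hashlib

# Hypothetical register: the authority that maps a label to some content.
register = {"v1.0": b"first release\n"}


def resolve_extrinsic(label: str) -> bytes:
    """Extrinsic: the label means whatever the register says it means."""
    return register[label]


def intrinsic_id(content: bytes) -> str:
    """Intrinsic: the identifier is computed from the content itself."""
    return hashlib.sha256(content).hexdigest()


# Identifier of the content the label designates at first.
before = intrinsic_id(resolve_extrinsic("v1.0"))

# Upstream re-tags in place: same label, different content.
register["v1.0"] = b"re-tagged release\n"
after = intrinsic_id(resolve_extrinsic("v1.0"))

# The label did not change, but the intrinsic identifier did.
print("label unchanged, content changed:", before != after)
```

With the object at hand, anyone can recompute `intrinsic_id`; nothing
in the object lets them recover the label "v1.0".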
The register must be a trusted authority, and it resolves by mapping the
key identifier to the object.  Having the object at hand does not give
any clue about the key identifier.  And collisions are very frequent:
two key identifiers resolve to the same content (hopefully!); we call
that mirrors. ;-)

An intrinsic identifier also relies on a (trusted) map, but collisions
are avoided as much as possible.  Somehow it strongly reduces the power
of the authority, and it is often more robust.

Please note that identification and integrity are not the same thing.
Since intrinsic identifiers often use cryptographic hash functions, and
integrity checks do too, it is often confusing.

Whatever the intrinsic identifier we consider (even ones based on a very
weak cryptographic hash function such as MD5, or on a non-cryptographic
hash function such as Pearson hashing, etc.), the integrity check is
currently done by SHA256.


## For example, consider this source origin,
==============

    (source (origin
              (method url-fetch)
              (uri (string-append "mirror://gnu/hello/hello-" version
                                  ".tar.gz"))
              (sha256
               (base32
                "086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd"))))

where ’mirror://gnu’ is resolved by Guix itself.  Or this one,

    (source
     (origin
       (method git-fetch)
       (uri (git-reference
             (url "https://github.com/FluxML/Zygote.jl")
             (commit (string-append "v" version))))
       (file-name (git-file-name name version))
       (sha256
        (base32
         "02bgj6m1j25sm3pa5sgmds706qpxk1qsbm0s2j3rjlrz9xn7glgk"))))

where Guix clones and then checks out what the field ’commit’
specifies.

Here both are extrinsic identifiers.  For the first example, the
register is defined by ’%mirrors’.  For the second example, the register
is the folder ’.git/’.

An intrinsic identifier could be a plain hash or a hash of serialized
data.
Using Guix b8f6ead:

--8<---------------cut here---------------start------------->8---
$ guix hash -S none -H sha256 -f nix-base32 -x $(guix build hello -S)
086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd

$ guix hash -S git -H sha256 -f nix-base32 -x $(guix build hello -S)
11kaw6m19rdj3d55y4cygk6k9zv6sn2iz4gpimx0j99ps87ij29l

$ guix hash -S nar -H sha256 -f nix-base32 -x /gnu/store/3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz
1lvqpbk2k1sb39z8jfxixf7p7v8sj4z6mmpa44nnmff3w1y6h8lh
--8<---------------cut here---------------end--------------->8---

Or some Git-like tree md5 of the decompressed data, e.g.,

--8<---------------cut here---------------start------------->8---
$ guix hash -S git -H md5 -f hex -x hello-2.12.1
3db60bcfecf17a5dd81e3fb5bfb1c191
--8<---------------cut here---------------end--------------->8---

Or some others.

--8<---------------cut here---------------start------------->8---
$ git clone https://github.com/FluxML/Zygote.jl
$ git -C Zygote.jl checkout v0.6.41

$ guix hash -S nar -H sha256 -f nix-base32 -x Zygote.jl
02bgj6m1j25sm3pa5sgmds706qpxk1qsbm0s2j3rjlrz9xn7glgk

$ guix hash -S git -H sha1 -f hex -x Zygote.jl
3cfdb31b517eec4173584fba2b1aa65daad46e09
--8<---------------cut here---------------end--------------->8---


# Second thing second
=====================

That said, Guix uses extrinsic identifiers for almost all origins, if
not all.  Even for the ’git-fetch’ method.

Consider that GitHub disappears and the default build farms ci.guix and
bordeaux.guix are unreachable for whatever reason.  Then Guix will fall
back to Software Heritage and will exploit its resolver.
--8<---------------cut here---------------start------------->8---
Initialized empty Git repository in /gnu/store/ns1f3b4wm5n470bczd2k5li6xpgbqkz7-julia-zygote-0.6.41-checkout/.git/
fatal: unable to access 'https://github.com/FluxML/Zygote.jl/': Could not resolve host: github.com
Failed to do a shallow fetch; retrying a full fetch...
fatal: unable to access 'https://github.com/FluxML/Zygote.jl/': Could not resolve host: github.com
git-fetch: '/gnu/store/55ba5ragbd5sd4r45n0q24vrxx9rigrm-git-minimal-2.39.1/bin/git fetch origin' failed with exit code 128
Trying content-addressed mirror at berlin.guix.gnu.org...
Trying content-addressed mirror at berlin.guix.gnu.org...
Trying to download from Software Heritage...
SWH: found revision 4777767737b4c95d2cea842933c5b2edae2771b2 with directory at 'https://archive.softwareheritage.org/api/1/directory/3cfdb31b517eec4173584fba2b1aa65daad46e09/'
swh:1:dir:3cfdb31b517eec4173584fba2b1aa65daad46e09/
--8<---------------cut here---------------end--------------->8---

It is SWH that finds the revision
4777767737b4c95d2cea842933c5b2edae2771b2 from the contextual information
(URL + version label), and from this revision SWH associates the content
having the intrinsic identifier
swh:1:dir:3cfdb31b517eec4173584fba2b1aa65daad46e09.


## First, please note that the SWHID is just Git,
========

--8<---------------cut here---------------start------------->8---
guix hash -S git -H sha1 -f hex \
     /gnu/store/ns1f3b4wm5n470bczd2k5li6xpgbqkz7-julia-zygote-0.6.41-checkout
3cfdb31b517eec4173584fba2b1aa65daad46e09
--8<---------------cut here---------------end--------------->8---

In other words, the SWH information is essentially the same information
as that of Git objects.
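
To illustrate why the two coincide, here is a minimal Python sketch of
how Git derives the identifier of a blob and of a tree; to my
understanding, the `swh:1:cnt:` and `swh:1:dir:` identifiers reuse
exactly these SHA-1 values.  The function names are mine, the sketch
only handles in-memory entries, and it is not what `guix hash -S git`
actually runs.

```python
import hashlib


def git_object_id(kind: str, payload: bytes) -> str:
    """SHA-1 over the header "<kind> <size>\\0" followed by the payload."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def blob_id(content: bytes) -> str:
    """Identifier of a file content (Git blob)."""
    return git_object_id("blob", content)


def tree_id(entries) -> str:
    """Identifier of a directory (Git tree).

    `entries` is a list of (mode, name, hex_id) tuples, for instance
    ("100644", "README", blob_id(b"...")) or ("40000", "src", tree_id([...])).
    """
    def sort_key(entry):
        mode, name, _ = entry
        # Git sorts directory names as if they ended with a slash.
        return name.encode() + (b"/" if mode == "40000" else b"")

    payload = b"".join(
        mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
        for mode, name, hex_id in sorted(entries, key=sort_key)
    )
    return git_object_id("tree", payload)


# A tiny directory with one regular file, built purely in memory.
readme = blob_id(b"test content\n")
directory = tree_id([("100644", "README", readme)])
print("swh:1:cnt:" + readme)
print("swh:1:dir:" + directory)
```

The identifier of a directory thus depends only on the names, modes and
contents below it, not on any URL or tag.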
Specifically, from the Git checkout,

--8<---------------cut here---------------start------------->8---
$ git cat-file -p v0.6.41
object 4777767737b4c95d2cea842933c5b2edae2771b2
type commit
tag v0.6.41

$ git cat-file -p 4777767737b4c95d2cea842933c5b2edae2771b2
tree 3cfdb31b517eec4173584fba2b1aa65daad46e09
--8<---------------cut here---------------end--------------->8---


## Second, SWH acts as a resolver here, i.e.,
=========

    (find (lambda (branch)
            (or
             ;; Git specific.
             (string=? (string-append "refs/tags/" tag)
                       (branch-name branch))
             ;; Hg specific.
             (string=? tag
                       (branch-name branch))))
          (snapshot-branches snapshot))

and this is not robust.  For one, it fails for Git lightweight tags, as
exposed with the package ’open-zwave’ tag 1.6.

--8<---------------cut here---------------start------------->8---
$ for t in $(git tag); do printf "$t "; git cat-file -t $t ;done
Rel-1.0 commit
V1.5 tag
v1.2 commit
v1.3 tag
v1.4 tag
v1.6 commit
--8<---------------cut here---------------end--------------->8---

It means that the code above would be able to find V1.5 or v1.4 but not
v1.6 or v1.2.  Well, we can consider that as a bug and improve the
snapshot machinery to also collect more ’refs’.  But, for two…

…the current code (guix swh) does not deal with several snapshots and
only considers the latest one.  Therefore, it fails for some in-place
replacements: upstream tags a specific revision, later removes the tag,
and then re-uses the same tag label for another revision (booo!).  If
SWH ingests after the first tag, SWH creates one snapshot; then if SWH
ingests again after the re-tag, SWH creates another snapshot.


## Third, Disarchive is helping.
========

As an aside, adding a layer to maintain does not help when speaking
about the long term (3-5 years); well, reducing the number of layers is
often better for the long term.
That said, there is work in progress to have Disarchive features
directly from SWH.

What does Disarchive do?  It maps between various intrinsic identifiers.
Remember ’hello’ from above?

--8<---------------cut here---------------start------------->8---
$ guix shell disarchive guile-lzma guile
$ disarchive disassemble hello-2.12.1
(disarchive
  (version 0)
  (directory-ref
    (version 0)
    (name "hello-2.12.1")
    (addresses
      (swhid "swh:1:dir:ad5fc7c3062e8426b7936588e7a27d51ace0e508"))
    (digest
      (sha256
        "cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4"))))

$ guix hash -S git -H sha1 -f hex hello-2.12.1
ad5fc7c3062e8426b7936588e7a27d51ace0e508

$ guix hash -S git -H sha256 -f hex hello-2.12.1
cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4
--8<---------------cut here---------------end--------------->8---

Well, the fixed output is a compressed tarball; it reads,

--8<---------------cut here---------------start------------->8---
$ disarchive disassemble $(guix build -S hello)
(disarchive
  (gzip-member
    (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar.gz")
    (digest
      (sha256
        "8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20"))
    (header (mtime 0) (extra-flags 2) (os 3))
    (footer (crc 2707092614) (isize 4945920))
    (compressor gnu-best-rsync)
    (input (tarball
             (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1.tar")
             (digest
               (sha256
                 "a2c33fd13c555015433956bcf06609293a34ce5c5e6a2070990bfb86070dc554"))
[...]
             (input (directory-ref
                      (version 0)
                      (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1")
                      (addresses
                        (swhid "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
                      (digest
                        (sha256
                          "1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0"))))))))
--8<---------------cut here---------------end--------------->8---

where the values are just (considering that ’guix hash -S none -H sha256
-f hex’ is equivalent to ’sha256sum’)

--8<---------------cut here---------------start------------->8---
$ guix hash -S none -H sha256 -f hex $(guix build hello -S)
8d99142afd92576f30b0cd7cb42a8dc6809998bc5d607d88761f512e26c7db20

$ gzip -d $(guix build -S hello) -c | sha256sum
a2c33fd13c555015433956bcf06609293a34ce5c5e6a2070990bfb86070dc554  -
--8<---------------cut here---------------end--------------->8---

However, the fields ’swhid’ and the other SHA256 ’digest’ are different
from above.  That’s because of the elided [...] part.  It probably comes
from the normalization process.  Well, I am not sure I deeply understand
why it is different, but that’s another story. :-)


## Fourth, it misses a bridge using NAR normalization (serialization).
=========

Disarchive can (or could) provide a bridge (map) between SWHID+SHA1 and
NAR+SHA256.  But it would be nice if it was implemented in SWH directly.
It would ease the previous drawbacks.

For the interested reader, discussion there .  Moreover, provides simple
examples about NAR and how to implement it using Python.


# Discussion asking for comments and feedback
=============================================

Still there?  If yes, thanks for reading. :-)

As shown in,

1:
2:

we have holes and we are not currently robust for the long term (3-5
years) if our lovely build farms are down for whatever reason.  For
sure, we have to fix the holes and bugs.
:-)

However, I am asking what we could add to gain more robustness in the
long term.  It is neither affordable nor wanted to switch from the
current extrinsic identification to a completely intrinsic one, although
it would fix many issues. ;-)

Guix and ’guix time-machine’ provide all the machinery for being able to
redeploy later but, as I have tried to point out in the two links above
[1,2], we are lacking tools for retrieving contents; well, having the
machinery does not mean that such machinery works well or is robust. :-)

The discussion could also cover how to distribute using ERIS.

At some point, I was thinking of having something like “guix freeze -m
manifest.scm” returning a map of all the sources from the deep bootstrap
to the leaf packages described in manifest.scm.

However, maybe something is lacking in the metadata we collect at
packaging time.  For instance, substitutes work more or less using an
intrinsic identifier, so that helps, I guess. :-)

Well, we could imagine the addition of another optional field, say under
’properties’, that could store the intrinsic identifier of the fixed
outputs, such as a SWHID or a Git tree / commit hash or something else.
It would add robustness for later.

Or maybe an optional field of the ’origin’ record for the same purpose.

WDYT?

Cheers,
simon