From: Miguel Ángel Arruga Vivas
To: Julien Lepiller
Subject: Re: Identical files across subsequent package revisions
Date: Wed, 23 Dec 2020 20:07:40 +0100
Message-ID: <87v9cswdc3.fsf@gmail.com>
In-Reply-To: <077ECD6C-AB0D-4FEA-ABBA-82550834265E@lepiller.eu> (Julien Lepiller's message of "Wed, 23 Dec 2020 10:40:00 -0500")
References: <87wnx9wlea.fsf@gnu.org> <878s9oy8f7.fsf@gmail.com> <86wnx8r4ys.fsf@gmail.com> <077ECD6C-AB0D-4FEA-ABBA-82550834265E@lepiller.eu>
Cc: guix-devel@gnu.org

Hi Julien and Simon,

Julien Lepiller writes:

> On 23 December 2020 09:07:23 GMT-05:00, zimoun wrote:
>> Hi,
>>
>> On Wed, 23 Dec 2020 at 14:10, Miguel Ángel Arruga Vivas wrote:
>>> Another idea that might fit well into that kind of protocol---with
>>> harder impact on the design, and probably with a high cost on the
>>> runtime---would be the "upgrade" of the deduplication process towards
>>> a content-based file system as git does[2]. This way a description
>>> of the nar contents (size, hash) could trigger the retrieval only of
>>> the needed files not found in the current store.
>>
>> Is it not related to Content-Addressed Store? i.e., «intensional
>> model»?
>>
>> Chap. 6:
>> Nix RFC:
>
> I think this is different, because we're talking about sub-element
> content-addressing. The intensional model is about content-addressing
> whole store elements. I think the idea would be to save individual
> files in, say, /gnu/store/.links, and let nar or narinfo files
> describe the files to retrieve. If we are missing some, we'd download
> them, then create hardlinks. This could even help our deduplication I
> think :)

Exactly.
My first approach would be a tree %links-prefix/hash/size, where all
the internal contents of each store item would be hard linked: mainly
Git's approach with some touches here and there. Some of them probably
rely on too much good will; the first approach isn't usually the
best. :-)

- The garbage collection process could check whether any hard link to
  those files remains and remove them otherwise, deleting the hash
  folder when needed.

- The deduplication process could be in charge of moving the files and
  placing hard links instead. Hash collisions between files of the same
  size are always a possibility, so some mechanism is needed to handle
  these cases[1] and the vulnerabilities associated with them.

- The substitution process could retrieve from the server the
  information about the needed files, check which contents are already
  available and which ones must be retrieved, and ensure that the
  "generated nar" is the same as the one from the server. This is
  closely related to the deduplication process and the mechanism used
  there[2].

Happy hacking!
Miguel

[1] Perhaps using two different hash algorithms instead of one, or
different salts, could be enough for the "common" error case, as a
collision in both is quite improbable. Collisions remain possible
whenever the file size is bigger than the hash size, so a final
fallback to comparing the actual bytes is probably needed.

[2] The folder could be hashed, even with a temporary salt agreed with
the server, to perform an independent/real-time check. Any issue here
has bigger consequences too, as no byte-by-byte comparison is possible
before the actual transmission.
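
To make the %links-prefix/hash/size idea above a bit more concrete, here is a rough Python sketch of the three processes. It is only an illustration under my own assumptions, not how the Guix daemon works: SHA-256 as the hash, the function names, and the (name, hash, size) manifest format are all hypothetical.

```python
import filecmp
import hashlib
import os


def content_key(path):
    """Return (sha256 hex digest, size in bytes) of a regular file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


def deduplicate(path, links_prefix):
    """Make PATH a hard link to LINKS_PREFIX/<hash>/<size>; return that entry."""
    file_hash, size = content_key(path)
    target = os.path.join(links_prefix, file_hash, str(size))
    if not os.path.exists(target):
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.link(path, target)  # the first copy becomes the shared entry
        return target
    if os.path.samefile(path, target):
        return target  # already deduplicated
    if not filecmp.cmp(path, target, shallow=False):
        # Same hash and size but different bytes: the byte-level
        # fallback mentioned in footnote [1].
        raise RuntimeError("hash collision for " + path)
    tmp = path + ".dedup-tmp"  # hypothetical temporary name
    os.link(target, tmp)
    os.replace(tmp, path)  # atomically swap the copy for the hard link
    return target


def missing_files(manifest, links_prefix):
    """Return the (name, hash, size) entries that are not in the links tree."""
    return [entry for entry in manifest
            if not os.path.exists(
                os.path.join(links_prefix, entry[1], str(entry[2])))]


def collect_garbage(links_prefix):
    """Drop entries nobody else links to, then any empty hash folders."""
    for file_hash in os.listdir(links_prefix):
        folder = os.path.join(links_prefix, file_hash)
        for name in os.listdir(folder):
            entry = os.path.join(folder, name)
            if os.stat(entry).st_nlink == 1:  # only the links tree holds it
                os.remove(entry)
        if not os.listdir(folder):
            os.rmdir(folder)
```

A substituter following this sketch would call missing_files on the server's description of a store item, download only those entries, and hard link the rest from the tree. The check on the regenerated nar from footnote [2] is left out here.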