From: Ludovic Courtès
To: pukkamustard
Subject: Re: Identical files across subsequent package revisions
References: <87wnx9wlea.fsf@gnu.org> <86v9ckmleq.fsf@posteo.net>
Date: Wed, 06 Jan 2021 10:27:28 +0100
In-Reply-To: <86v9ckmleq.fsf@posteo.net> (pukkamustard@posteo.net's message of "Tue, 29 Dec 2020 21:01:33 +0100")
Message-ID: <87mtxmpg8v.fsf@gnu.org>
Cc: guix-devel@gnu.org
List-Id: "Development of GNU Guix and the GNU System distribution."

Hi!

pukkamustard writes:

> Your research inspired me to conduct some experiments towards
> de-duplication.
>
> For two similar packages (emacs-27.1 and emacs-26.3) I was able to
> de-duplicate ~12% using EROFS and ERIS. Still far from the ~85%
> similarity, but an attempt I'd like to share.
>
> The two main ingredients:
>
> - EROFS (Enhanced Read-Only File-System) is a read-only, compressed
>   file-system comparable to SquashFS. It has some properties that make
>   it more suitable than SquashFS (it aligns content to fixed block
>   size). EROFS is in mainline Linux Kernel since v5.4.
>
> - ERIS (Encoding for Robust Immutable Storage) is an encoding of content
>   into uniformly sized blocks that I've been working on. It de-couples
>   encoding of content from storage and transport layer.
>   Transport layers can be things like IPFS, GNUNet, Named Data Network
>   or just a plain old HTTP service.
>
> I make EROFS images of the packages and encode them with ERIS, which
> de-duplicates blocks as part of the encoding process.
>
> With this I manage to de-duplicate between 12-17% (depending on some
> parameters).

Very nice! I wonder what the file-level similarity is compared to the
block-level similarity.

> This could allow:
>
> - Directly mounting packages instead of unarchiving (a la distri)

Yeah, I’m not sure about this part. It would be a radical change for
Guix in terms of code, and I also wonder about the efficiency: sure
mounting the package would be instantaneous, but subsequent reads would
be slowed down compared to the current approach. Maybe the slowdown is
only on the first hit though, and maybe it’s hardly measurable, dunno.

> - Peer-to-peer distribution of packages (that's what ERIS is for)

Yup, looking forward to that.

> - De-duplicating common content in packages to a certain extent (topic
>   of this thread)
>
> A more in-depth write-up:
> https://gitlab.com/openengiadina/eris/-/tree/main/examples/dedup-fs

Great writeup and nice tooling that you have here!

Thanks,
Ludo’.
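
For readers who want to try the block-level measurement discussed in this
message, here is a minimal sketch. It assumes plain SHA-256 over fixed-size
blocks; it is not the ERIS encoding and it does not read EROFS images. The
helper names `block_digests` and `dedup_ratio` and the 4096-byte default are
hypothetical choices made for illustration. The toy inputs show why alignment
to a fixed block size matters: a one-byte shift changes every block.

```python
import hashlib


def block_digests(data: bytes, block_size: int) -> list[bytes]:
    """Split data into fixed-size blocks (zero-padding the last one) and hash each."""
    digests = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size].ljust(block_size, b"\0")
        digests.append(hashlib.sha256(block).digest())
    return digests


def dedup_ratio(images: list[bytes], block_size: int = 4096) -> float:
    """Fraction of blocks that duplicate a block already seen in any image."""
    total = 0
    unique = set()
    for image in images:
        digests = block_digests(image, block_size)
        total += len(digests)
        unique.update(digests)
    return 0.0 if total == 0 else 1 - len(unique) / total


# Toy data with a 4-byte block size (assumption: real images use far larger blocks).
old = b"AAAABBBBCCCCDDDD"
new_aligned = b"AAAAXXXXCCCCDDDD"   # one block changed, the rest stay aligned
new_shifted = b"Z" + old            # same content, shifted by one byte

print(dedup_ratio([old, new_aligned], block_size=4))
print(dedup_ratio([old, new_shifted], block_size=4))
```

With aligned content, three of the eight blocks are duplicates; with the
shifted copy, no block matches. To compare against file-level similarity, the
same function could be called with each file's contents as one "image" and a
block size at least as large as the biggest file, which is again an assumption
about how one might set up the comparison rather than what the write-up does.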