From: pukkamustard
To: Ludovic Courtès
Cc: guix-devel@gnu.org
Subject: Re: Identical files across subsequent package revisions
Date: Tue, 29 Dec 2020 21:01:33 +0100
Message-ID: <86v9ckmleq.fsf@posteo.net>
In-reply-to: <87wnx9wlea.fsf@gnu.org>

Hi Ludo,

> Thoughts? :-)

Super cool!
:)

Your research inspired me to conduct some experiments towards de-duplication.

For two similar packages (emacs-27.1 and emacs-26.3) I was able to de-duplicate ~12% using EROFS and ERIS. Still far from the ~85% similarity, but an attempt I'd like to share.

The two main ingredients:

- EROFS (Enhanced Read-Only File System) is a read-only, compressed file system comparable to SquashFS. It has some properties that make it more suitable than SquashFS here (it aligns content to a fixed block size). EROFS has been in the mainline Linux kernel since v5.4.
- ERIS (Encoding for Robust Immutable Storage) is an encoding of content into uniformly sized blocks that I've been working on. It decouples the encoding of content from the storage and transport layers. Transport layers can be things like IPFS, GNUnet, Named Data Networking or just a plain old HTTP service.

I make EROFS images of the packages and encode them with ERIS, which de-duplicates blocks as part of the encoding process. With this I manage to de-duplicate between 12% and 17% (depending on some parameters).

This could allow:

- Directly mounting packages instead of unarchiving (à la distri)
- Peer-to-peer distribution of packages (that's what ERIS is for)
- De-duplicating common content in packages to a certain extent (the topic of this thread)

A more in-depth write-up:

https://gitlab.com/openengiadina/eris/-/tree/main/examples/dedup-fs

Happy Hacking!
-pukkamustard
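
A rough illustration of the block-level de-duplication idea described above, and of why aligning content to a fixed block size matters. This is only a sketch under stated assumptions, not ERIS itself: it identifies plain fixed-size blocks by their SHA-256 hash, the 4096-byte block size is an arbitrary choice, and the names `split_blocks`, `block_ids` and `dedup_ratio` are hypothetical helpers invented for this example.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Split data into fixed-size blocks, zero-padding the last one."""
    return [
        data[offset:offset + block_size].ljust(block_size, b"\0")
        for offset in range(0, len(data), block_size)
    ]


def block_ids(data: bytes, block_size: int = BLOCK_SIZE) -> Set[str]:
    """Identify each block by the SHA-256 hash of its content."""
    return {hashlib.sha256(b).hexdigest() for b in split_blocks(data, block_size)}


def dedup_ratio(image_a: bytes, image_b: bytes, block_size: int = BLOCK_SIZE) -> float:
    """Fraction of blocks that need not be stored when both images
    share one content-addressed block store."""
    total = len(split_blocks(image_a, block_size)) + len(split_blocks(image_b, block_size))
    if total == 0:
        return 0.0
    unique = len(block_ids(image_a, block_size) | block_ids(image_b, block_size))
    return 1 - unique / total


# Two toy "images" of three blocks each; only the middle block differs.
image_a = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
image_b = b"A" * BLOCK_SIZE + b"X" * BLOCK_SIZE + b"C" * BLOCK_SIZE
print(f"aligned: {dedup_ratio(image_a, image_b):.2f}")

# Shifting the same content by one byte moves every block boundary,
# so no block of the shifted image matches a block of the original.
shifted = b"!" + image_a
print(f"shifted: {dedup_ratio(image_a, shifted):.2f}")
```

In the aligned case two of the six blocks are shared, so a third of the storage is saved. In the shifted case nothing is shared even though the content is nearly identical, which is the kind of effect that makes a file system with block-aligned content a better input for fixed-size block de-duplication.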