From: ludo@gnu.org (Ludovic Courtès)
Subject: Leveraging the synergy of deduplication
Date: Wed, 25 Mar 2015 14:46:46 +0100
Message-ID: <87k2y5w5o9.fsf@gnu.org>
To: guix-devel

Currently the daemon implements a simple yet efficient way to
deduplicate files that are identical among store items.  The
/gnu/store/.links directory contains hard links to files in the store;
the link name is the base32-encoded SHA256 of the file.  When the daemon
adds a new file to the store, it checks in /gnu/store/.links whether an
identical file is already in the store, and if so makes the new file a
hard link to the existing one.

When installing, say, two different variants of texlive, which in
practice are 90% bit-identical, there’s a lot of deduplication
happening.

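To make the mechanism concrete, here is a minimal sketch in Python of
the lookup-then-link step described above.  It is an illustration, not
the daemon's code: the function names are made up, it hashes the raw
file contents, and it names links with the hexadecimal digest, whereas
the daemon uses a base32 encoding.

```python
import hashlib
import os
import tempfile


def content_hash(path):
    """Return the hex SHA-256 of a file's contents.

    Assumption: hex over raw contents keeps the sketch short; the
    daemon's link names are base32-encoded.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def deduplicate(path, links_dir):
    """Make `path` share an inode with any identical file seen before.

    Returns True if `path` was replaced by a hard link to an existing
    file, False if it was merely registered (or already shared).
    """
    os.makedirs(links_dir, exist_ok=True)
    link = os.path.join(links_dir, content_hash(path))
    try:
        # First time these contents are seen: register them by hash.
        os.link(path, link)
        return False
    except FileExistsError:
        pass
    if os.path.samefile(path, link):
        return False  # already deduplicated
    # Identical contents are already known: swap in a hard link.
    # Linking to a temporary name and renaming keeps `path` present
    # at every point in time.
    tmp = path + ".tmp-link"
    os.link(link, tmp)
    os.replace(tmp, path)
    return True


if __name__ == "__main__":
    store = tempfile.mkdtemp()
    a, b = os.path.join(store, "a"), os.path.join(store, "b")
    for p in (a, b):
        with open(p, "wb") as f:
            f.write(b"same bytes")
    links = os.path.join(store, ".links")
    print(deduplicate(a, links), deduplicate(b, links))
    print(os.stat(a).st_ino == os.stat(b).st_ino)
```

Note that the second file is only recognized as a duplicate once it is
fully present on disk, which is exactly the limitation discussed next.
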
However, we still end up downloading the whole texlive archive just to
realize that we already have most of its files in store.

A solution to this would be to change the HTTP substitute protocol.
‘guix publish’ could serve content-addressed files.  For instance,

  http://example.org/1ghws12lrp62vvxxxqmxp7jgxv2p18ihiyq420ag77nh9bw5qsfg.file

would serve the contents of the store file that has the given hash.

The archive format would have to be different from the one currently
implemented by ‘write-file’: for regular files, ‘write-contents’ would
simply write the hash of the contents, and it would be up to the
substituter to go fetch that file if it’s not already in store (which
can be determined by looking it up in /gnu/store/.links.)

This is not very sophisticated, but has the advantage of being
relatively easy to implement in Guix itself.

The downside is that Hydra would most likely not implement this new
protocol (which would give us another incentive to move away from it.)

Thoughts?  Patches?  :-)

Ludo’.

PS: Title inspired by .
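
For discussion, here is a rough sketch in Python of what the
substituter side of the proposal above might look like.  Everything in
it is an assumption made for illustration: the archive is modeled as a
list of (path, hash) pairs, `fetch` is a hypothetical callable standing
in for an HTTP GET of the corresponding ".file" URL, and hashes are hex
SHA-256 of the raw contents rather than the base32 names used in
/gnu/store/.links.

```python
import hashlib
import os
import tempfile


def restore(entries, target_dir, links_dir, fetch):
    """Materialize `entries` under `target_dir`, fetching unknown files only.

    entries: iterable of (relative_path, content_hash) pairs, standing
             in for an archive whose regular files carry a hash
             instead of their contents.
    fetch:   hypothetical callable mapping a hash to bytes.
    Returns the list of hashes that actually had to be fetched.
    """
    os.makedirs(links_dir, exist_ok=True)
    fetched = []
    for rel_path, digest in entries:
        link = os.path.join(links_dir, digest)
        if not os.path.exists(link):
            data = fetch(digest)
            # Verify what the server sent before keeping it.
            if hashlib.sha256(data).hexdigest() != digest:
                raise ValueError(f"hash mismatch for {rel_path}")
            tmp = link + ".part"
            with open(tmp, "wb") as f:
                f.write(data)
            os.replace(tmp, link)
            fetched.append(digest)
        dest = os.path.join(target_dir, rel_path)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        # Known contents are hard-linked, never downloaded again.
        os.link(link, dest)
    return fetched


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    blobs = {hashlib.sha256(b).hexdigest(): b
             for b in (b"shared", b"only in v2")}
    shared, new = list(blobs)
    links = os.path.join(root, ".links")
    restore([("a.sty", shared)], os.path.join(root, "v1"),
            links, blobs.__getitem__)
    second = restore([("a.sty", shared), ("b.sty", new)],
                     os.path.join(root, "v2"), links, blobs.__getitem__)
    print(len(second))  # only the file missing from .links is fetched
```

With two archives that share most of their files, the second call only
fetches the entries whose hashes are missing from the links directory.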