References: <87jzvxoicg.fsf@riseup.net>
From: pukkamustard
To: Csepp
Cc: guix-devel@gnu.org
Subject: Re: distributed substitutes: file slicing
Date: Sun, 25 Jun 2023 09:48:11 +0000
In-reply-to: <87jzvxoicg.fsf@riseup.net>
Message-ID: <861qhzluhp.fsf@posteo.net>
List-Id: "Development of GNU Guix and the GNU System distribution."

Csepp writes:

> I have a question / suggestion about the distributed substitutes
> project: would downloads be split into uniformly sized chunks or could
> the sizes vary?

For the proposal that uses ERIS (https://issues.guix.gnu.org/52555) the
chunks are uniformly sized (32KiB).

> Specifically, in an extreme case where an update introduced a single
> extra byte at the beginning of a file, would that result in completely
> new chunks?

Yes, that would be the case. ERIS uses fixed-size blocks, so in such
extreme cases every chunk would be new, which means very bad
de-duplication.

The reason for using fixed-size blocks is security/privacy. When using
variable-sized blocks, the sizes are observable by a potential censor and
are also a function of the content itself. This leaks information about
the transferred content. I believe there are documented cases of HTTPS
connections being blocked/censored based on size of requests [citation
needed]. This is something ERIS tries to prevent.

That being said, I think there is still room for optimizing the
de-duplication even with fixed-size blocks.

> An alternative I've been thinking about is this:
> find the store references in a file and split it along these references,
> optionally apply further chunking to the non-reference blobs.
>
> It's probably best to do this at the NAR level??

I like the idea! If I understand correctly we would split whenever a
store reference appears. When a single store reference changes (this
probably happens quite often) then only the preceding block changes.

I think there is also a way to do something similar while preserving
fixed-size blocks: Maintain a lookup table for all store references
appearing in a store item.
When serializing, this lookup table goes to the front (or back) with
appropriate padding so that it is block-aligned. All store references in
the remaining serialization are replaced by a reference to the lookup
table. Now when a store reference changes, only the lookup table changes;
the remaining content stays the same and is de-duplicated.

A similar idea for also allowing de-duplication when individual files
change:
https://codeberg.org/eris/eer/src/branch/eris-fs/eer/eris-fs/index.md

Also check out the Guix `wip-digests` branch. There are some related
interesting ideas there.

I'm working on rebasing and updating the decentralized substitute
patches. Sorry for the slowness. They would at first only address
block-wise transfer with a naive encoding that does not do very good
de-duplication. As outlined, I think de-duplication can be added later,
and I think it's great to start thinking about it and experimenting with
ideas.

-pukkamustard
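
A rough sketch of the two effects discussed above, in Python. It is not
the ERIS encoding: blocks are identified by a plain SHA-256 and padded
with zeros, and the function names, the placeholder format and the store
paths are all made up for illustration. It only shows how fixed-size
blocks react to a one-byte shift, and how a block-aligned lookup table
could keep the body blocks stable when a reference changes.

```python
import hashlib

BLOCK_SIZE = 32 * 1024  # 32 KiB, the block size mentioned in the thread


def blocks(data, size=BLOCK_SIZE):
    """Split data into fixed-size blocks, zero-padding the last one.
    (Illustrative padding only; ERIS defines its own scheme.)"""
    padded = data + b"\0" * (-len(data) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]


def block_ids(data):
    """Identify each block by the SHA-256 of its content."""
    return [hashlib.sha256(block).hexdigest() for block in blocks(data)]


def shared(a, b):
    """Number of distinct blocks that a and b have in common."""
    return len(set(block_ids(a)) & set(block_ids(b)))


def filler(seed, length):
    """Deterministic pseudo-random bytes standing in for file content."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:length])


def make_item(ref):
    """A made-up store item that mentions the same reference three times."""
    return b"".join(ref + filler(seed, 40000) for seed in (b"1", b"2", b"3"))


def encode_with_table(data, refs):
    """Hypothetical encoding: a block-aligned table of references first,
    then the body with each reference replaced by a fixed-width
    placeholder. A real format would need escaping so that placeholders
    cannot collide with ordinary content."""
    body = data
    for index, ref in enumerate(refs):
        body = body.replace(ref, b"\0REF" + index.to_bytes(4, "big"))
    table = b"\n".join(refs)
    return b"".join(blocks(table)) + body


# Made-up store paths; only the hash part differs.
old_ref = b"/gnu/store/" + b"a" * 32 + b"-glibc-2.35"
new_ref = b"/gnu/store/" + b"b" * 32 + b"-glibc-2.35"
old_item = make_item(old_ref)
new_item = make_item(new_ref)

# One extra byte at the front shifts every block boundary.
print("shared after 1-byte prefix:", shared(old_item, b"x" + old_item))

# Naive encoding: every block that contains the reference changes.
print("shared, naive encoding:", shared(old_item, new_item))

# Lookup table: only the table block changes, the body blocks are reused.
print("shared, lookup table:",
      shared(encode_with_table(old_item, [old_ref]),
             encode_with_table(new_item, [new_ref])))
```

With this toy data the item spans four blocks. The prefixed copy shares
none of them, the naive encoding shares only the one block that happens
not to contain the reference, and the lookup-table encoding shares all
four body blocks, with only the table block differing.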