Date: Mon, 26 Jun 2023 15:41:32 +0200
From: Florian Klink
To: Csepp
Cc: guix-devel@gnu.org, pukkamustard
Subject: Re: distributed substitutes: file slicing
In-Reply-To: <87jzvxoicg.fsf@riseup.net>
List-Id: "Development of GNU Guix and the GNU System distribution."

On 23-06-21 00:44:06, Csepp wrote:
>I have a question / suggestion about the distributed substitutes
>project: would downloads be split into uniformly sized chunks or could
>the sizes vary?
>Specifically, in an extreme case where an update introduced a single
>extra byte at the beginning of a file, would that result in completely
>new chunks?
>
>An alternative I've been thinking about is this:
>find the store references in a file and split it along these references,
>optionally apply further chunking to the non-reference blobs.
>
>It's probably best to do this at the NAR level??
>
>Storing reference offsets is already something that we should be doing to
>speed other operations up, so this could tie in nicely with that.

A bit late to the party, but I've been toying around with a different
model to represent contents inside store paths - see [tvix-store-docs]
for more details.

Essentially, tvix-store internally uses a model similar to git trees,
but with Blake3 as a digest for blobs (regular file contents).
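
To make the "similar to git trees" idea concrete, here is a minimal
sketch of a content-addressed tree. It is only an illustration, not
tvix-store's actual schema or encoding: the `ingest` helper, the JSON
encoding of directory entries and the field names are all hypothetical,
and BLAKE2b from the Python standard library stands in for Blake3.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    # Stand-in digest: BLAKE2b from the stdlib (assumption; the model
    # described above uses Blake3 for blobs).
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def ingest(tree: dict, blobs: dict, dirs: dict) -> str:
    """Store a nested dict (name -> bytes | dict), return the root digest.

    Files go into `blobs` keyed by their content digest; each directory
    becomes a list of entries that refer to children by digest only.
    """
    entries = []
    for name in sorted(tree):
        node = tree[name]
        if isinstance(node, bytes):
            d = digest(node)
            blobs[d] = node
            entries.append({"name": name, "kind": "file",
                            "digest": d, "size": len(node)})
        else:
            entries.append({"name": name, "kind": "dir",
                            "digest": ingest(node, blobs, dirs)})
    # Hypothetical encoding: canonical JSON of the entry list.
    d = digest(json.dumps(entries, sort_keys=True).encode())
    dirs[d] = entries
    return d


# Two versions of a toy "store path" that differ in one file only.
blobs, dirs = {}, {}
v1 = {"bin": {"hello": b"\x7fELF v1"},
      "share": {"doc": {"README": b"hello docs\n"}}}
v2 = {"bin": {"hello": b"\x7fELF v2"},
      "share": {"doc": {"README": b"hello docs\n"}}}
root1 = ingest(v1, blobs, dirs)
root2 = ingest(v2, blobs, dirs)
```

In this toy model the two roots differ, but the unchanged `share`
subtree and its README blob are stored once and referenced by both, so
a client that already has the first version only needs the new `bin`
directory, the new root and one blob.
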
Even with all that, you can still put on a NAR lens, and get back a
byte-by-byte identical NAR representation of a store path.

Because Blake3 enables [verified streaming][bao], there's no need to
make granular chunking part of the information to encode - it can be a
transport concern only.

It also allows easy "seeking" into different parts of a store path, and
due to content-addressability, easy partial fetching.

I've been playing around with a blob storage implementation that stores
these blobs using content-defined chunking (and eventually exposes more
granular chunking data to clients).

Due to the "decomposition" of the NAR (storing blobs separately from the
"surrounding skeleton"), we always look at file contents separately.

I haven't yet run any benchmarks on whether it makes sense to "blank
out" store paths before ingesting and dynamically apply these
references on top, but I would be interested in some discussion around
such experiments.

flokli

--
[tvix-store-docs]: https://cs.tvl.fyi/depot/-/tree/tvix/store/docs
[bao]: https://github.com/oconnor663/bao
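
As a footnote to the content-defined chunking point and to the
"single extra byte at the beginning of a file" question quoted above,
here is a toy sketch of why chunk boundaries chosen from the content
survive such an insertion while fixed-size chunks do not. It is not the
chunker used by any of the projects mentioned: the gear-style rolling
hash, the table seed, the mask and the minimum size are illustrative
assumptions, and BLAKE2b again stands in for Blake3.

```python
import hashlib
import random

# Fixed pseudo-random table for a gear-style rolling hash (seeded so
# that every run picks the same boundaries). Toy parameters throughout.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]
MASK = (1 << 6) - 1  # cut when the low 6 bits are zero


def cdc_chunks(data: bytes, min_size: int = 16) -> list:
    """Split data where the rolling hash of the recent bytes hits the mask."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        if i + 1 - start >= min_size and (h & MASK) == 0:
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Split data into uniformly sized chunks (the last one may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digests(chunks: list) -> set:
    # Stand-in digest (assumption): BLAKE2b instead of Blake3.
    return {hashlib.blake2b(c, digest_size=32).hexdigest() for c in chunks}


# Reproducible pseudo-random "file", and the same file with one byte
# inserted at the very beginning.
data_rng = random.Random(1)
data = bytes(data_rng.getrandbits(8) for _ in range(4096))
shifted = b"X" + data

cdc_shared = digests(cdc_chunks(data)) & digests(cdc_chunks(shifted))
fixed_shared = digests(fixed_chunks(data)) & digests(fixed_chunks(shifted))
print("content-defined chunks reused:", len(cdc_shared))
print("fixed-size chunks reused:", len(fixed_shared))
```

With fixed-size chunks every boundary moves by one byte, so nothing is
reused. With content-defined boundaries the cut points depend only on
the few bytes preceding them, so the two chunk sequences fall back into
step shortly after the insertion and the remaining chunks keep their
digests.
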