From: Florian Klink <flokli@flokli.de>
To: Csepp <raingloom@riseup.net>
Cc: guix-devel@gnu.org, pukkamustard <pukkamustard@posteo.net>
Subject: Re: distributed substitutes: file slicing
Date: Mon, 26 Jun 2023 15:41:32 +0200
Message-ID: <zuj5tdofidrkweh3tcedtrdjsrzrnhog5zgu5yyog3nezau6rh@4w6gsfmp7izg>
In-Reply-To: <87jzvxoicg.fsf@riseup.net>

On 23-06-21 00:44:06, Csepp wrote:
>I have a question / suggestion about the distributed substitutes
>project: would downloads be split into uniformly sized chunks or could
>the sizes vary?
>Specifically, in an extreme case where an update introduced a single
>extra byte at the beginning of a file, would that result in completely
>new chunks?
>
>An alternative I've been thinking about is this:
>find the store references in a file and split it along these references,
>optionally apply further chunking to the non-reference blobs.
>
>It's probably best to do this at the NAR level??
>
>Storing reference offsets is already something that we should be doing to
>speed other operations up, so this could tie in nicely with that.

A bit late to the party, but I've been toying around with a different
model to represent contents inside store paths - see [tvix-store-docs]
for more details.

Essentially, tvix-store internally uses a model similar to git trees,
but with BLAKE3 as the digest for blobs (regular file contents).
Even with all that, you can still put on a NAR lens, and get back a
byte-by-byte identical NAR representation of a store path.
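
To make the "git trees with blob digests" idea concrete, here is a rough
sketch. It is not the tvix-store encoding: the `Directory` class, the entry
serialization and the helper names are all hypothetical, and BLAKE2b from
Python's standard library stands in for BLAKE3. It only shows the addressing
side (directories refer to children by digest), not the NAR rendering.

```python
import hashlib
from dataclasses import dataclass, field


def blob_digest(contents: bytes) -> bytes:
    # Assumption: BLAKE2b is a stand-in for BLAKE3, which is not in the stdlib.
    return hashlib.blake2b(contents, digest_size=32).digest()


@dataclass
class Directory:
    files: dict = field(default_factory=dict)        # name -> blob digest
    directories: dict = field(default_factory=dict)  # name -> Directory

    def digest(self) -> bytes:
        # Hypothetical encoding: entries sorted by name, each written as
        # kind marker + length-prefixed name + digest of the child.
        h = hashlib.blake2b(digest_size=32)
        entries = [(n, b"F", d) for n, d in self.files.items()]
        entries += [(n, b"D", d.digest()) for n, d in self.directories.items()]
        for name, kind, child in sorted(entries):
            encoded = name.encode()
            h.update(kind + len(encoded).to_bytes(8, "little") + encoded + child)
        return h.digest()


# Two versions of a store path that differ in one file only.
doc = Directory(files={"COPYING": blob_digest(b"license text\n")})
v1 = Directory(files={"hello": blob_digest(b"binary v1")}, directories={"doc": doc})
v2 = Directory(files={"hello": blob_digest(b"binary v2")}, directories={"doc": doc})
```

The root digests of `v1` and `v2` differ, but the `doc` subtree and the
`COPYING` blob keep their digests, so a client that already has them does
not need to fetch them again.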

Because BLAKE3 enables [verified streaming][bao], there's no need to
make granular chunking part of the information to encode - it can be a
transport concern only. It also allows easy "seeking" into different
parts of a store path and, thanks to content-addressability, easy partial
fetching.

I've been playing around with using a blob storage implementation
storing these blobs with content-defined chunking (and eventually
exposing more granular chunking data to clients).
Due to the "decomposition" of the NAR (storing blobs separately from the
"surrounding skeleton"), we always look at file contents separately.

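On the original question (what happens when a single byte is added at the
beginning of a file), the difference between fixed-size and content-defined
chunking can be sketched as follows. This is a toy gear-style rolling hash,
not the chunker used by any particular blob store; the table, the mask and
the helper names are made up for the example, and real implementations also
enforce minimum and maximum chunk sizes.

```python
import hashlib


def fixed_chunks(data: bytes, size: int = 1024) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]


# Hypothetical gear table: one pseudo-random 64-bit value per byte value.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:8], "big") for b in range(256)]
MASK = (1 << 10) - 1  # cut when the low 10 bits are zero: ~1 KiB average


def cdc_chunks(data: bytes) -> list:
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # The low 10 bits of h depend only on the last 10 bytes seen.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        if h & MASK == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(old: list, new: list) -> int:
    """Count how many chunks of `new` are already known from `old`."""
    known = {hashlib.sha256(c).digest() for c in old}
    return sum(hashlib.sha256(c).digest() in known for c in new)


# 64 KiB of deterministic pseudo-random data, then the same with one byte prepended.
old = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(2048))
new = b"\x00" + old

fixed_reused = shared(fixed_chunks(old), fixed_chunks(new))
cdc_reused = shared(cdc_chunks(old), cdc_chunks(new))
```

With fixed-size chunks every boundary shifts by one byte, so nothing is
reused. With content-defined boundaries the chunker resynchronizes shortly
after the inserted byte and most chunks stay the same.
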
I haven't yet run any benchmarks on whether it makes sense to "blank out"
store paths before ingesting and dynamically apply these references
on top, but I would be interested in discussing such experiments.
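
For reference, the "blank out" idea could look roughly like the sketch
below. It is untested against real store contents: the function names and
the placeholder are hypothetical, and it assumes references appear as
literal `/gnu/store/<32 base32 characters>-` byte strings.

```python
import re

# The nix-base32 alphabet omits e, o, t and u; hashes are 32 characters long.
HASH_RE = re.compile(rb"(?<=/gnu/store/)[0-9a-df-np-sv-z]{32}(?=-)")

# Placeholder made of a character outside the alphabet, so it cannot be
# mistaken for a real hash.
PLACEHOLDER = b"e" * 32


def blank_out(blob: bytes):
    """Replace store hashes with a placeholder; return blob and (offset, hash) list."""
    refs = [(m.start(), m.group()) for m in HASH_RE.finditer(blob)]
    return HASH_RE.sub(PLACEHOLDER, blob), refs


def reapply(blanked: bytes, refs: list) -> bytes:
    """Write the recorded hashes back at their offsets."""
    out = bytearray(blanked)
    for offset, store_hash in refs:
        out[offset:offset + len(store_hash)] = store_hash
    return bytes(out)


# Two builds of the same script that differ only in the referenced hash.
v1 = b"#!/gnu/store/" + b"a" * 32 + b"-bash-5.1/bin/sh\necho hello\n"
v2 = b"#!/gnu/store/" + b"b" * 32 + b"-bash-5.1/bin/sh\necho hello\n"

blanked1, refs1 = blank_out(v1)
blanked2, refs2 = blank_out(v2)
```

Here `blanked1` and `blanked2` are identical, so the blob would only need to
be stored once, with the small `refs` lists kept alongside to reconstruct
each variant.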


flokli

--

[tvix-store-docs]: https://cs.tvl.fyi/depot/-/tree/tvix/store/docs
[bao]: https://github.com/oconnor663/bao


