unofficial mirror of guix-devel@gnu.org 
From: Attila Lendvai <attila@lendvai.name>
To: Csepp <raingloom@riseup.net>
Cc: guix-devel@gnu.org, pukkamustard <pukkamustard@posteo.net>
Subject: Re: distributed substitutes: file slicing
Date: Wed, 21 Jun 2023 14:32:33 +0000
Message-ID: <uFSsmsmAEOWIVPpDPv3ddKQeUk9bW3VTyvpja9tRA77VrXgWLqOuVHyXEaf06NcJluxE-m40yqUiNMtY1mLgX9LMLzaROZXe-oaE7uOMnwU=@lendvai.name>
In-Reply-To: <87jzvxoicg.fsf@riseup.net>

> I have a question / suggestion about the distributed substitutes
> project: would downloads be split into uniformly sized chunks or could
> the sizes vary?
> Specifically, in an extreme case where an update introduced a single
> extra byte at the beginning of a file, would that result in completely
> new chunks?


most (all?) distributed storage solutions have a chunker (including ERIS with its 32k chunks, or Swarm with 4k chunks), and the chunks are content addressed, i.e. chunking also gives you deduplication at chunk granularity.

if the file doesn't just grow at the end, but its content gets shifted by a couple of bytes somewhere in the middle, then this chunk-level deduplication stops working from that point on. in your extreme case of a single extra byte at the beginning, every chunk would be new.
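
to illustrate the above, here is a minimal python sketch of fixed-size, content-addressed chunking. the 4k chunk size, sha256 as the content address, and the helper names are assumptions for the sake of the example; this is not the actual chunker of ERIS or Swarm.

```python
import hashlib
import os

CHUNK_SIZE = 4096  # assumption: 4k chunks, purely for illustration


def chunk_ids(data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return each chunk's content address."""
    return [
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]


def shared_chunks(old: bytes, new: bytes) -> int:
    """Count the chunks of `new` that are already present among `old`'s chunks."""
    known = set(chunk_ids(old))
    return sum(1 for cid in chunk_ids(new) if cid in known)


original = os.urandom(16 * CHUNK_SIZE)  # 16 chunks of random data
middle = 8 * CHUNK_SIZE

appended = original + b"x"                               # file merely grows
shifted = original[:middle] + b"x" + original[middle:]   # one byte inserted mid-file
prefixed = b"x" + original                               # one byte inserted at the start

print(shared_chunks(original, appended))  # all 16 old chunks are reused
print(shared_chunks(original, shifted))   # only the 8 chunks before the insertion
print(shared_chunks(original, prefixed))  # no chunk lines up any more
```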

IIRC rar was the first archiver that introduced a very fast deduplication algorithm that detected even the non-aligned duplicated blocks of varying sizes. i don't think any distributed storage system has anything like that.


> An alternative I've been thinking about is this:
> find the store references in a file and split it along these references,
> optionally apply further chunking to the non-reference blobs.


chunking storage systems store only whole chunks, so splitting files into too many pieces can increase the wasted storage, because the tail of each piece gets padded up to a whole chunk. more so with large chunks, less so with smaller ones.
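
a rough back-of-the-envelope sketch of that overhead, in python. it assumes that every piece is stored separately and that its last chunk is padded to the full chunk size; the file and piece sizes are made up for illustration.

```python
def stored_size(piece_sizes, chunk_size):
    """Bytes occupied when each piece is stored separately in whole chunks."""
    # -(-a // b) is integer ceiling division
    return sum(-(-size // chunk_size) * chunk_size for size in piece_sizes)


def wasted(piece_sizes, chunk_size):
    """Padding overhead: stored bytes minus actual payload bytes."""
    return stored_size(piece_sizes, chunk_size) - sum(piece_sizes)


whole_file = [100_000]       # hypothetical file stored as one piece
split_file = [10_000] * 10   # the same amount of data cut into 10 pieces

for chunk_size in (4 * 1024, 32 * 1024):
    print(chunk_size, wasted(whole_file, chunk_size), wasted(split_file, chunk_size))
# 4096 2400 22880
# 32768 31072 227680
```

with these made-up numbers, splitting costs about 20k of extra padding with 4k chunks, but almost 200k with 32k chunks.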


> It's probably best to do this at the NAR level??
> 
> Storing reference offsets is already something that we should be doing to
> speed other operations up, so this could tie in nicely with that.


if optimizing grafting is worth this amount of trouble, then maybe the best option is to extend the NAR format to store mutable references in a separate table at the end of the file. that would speed up guix operations like grafting, and help any storage system that has deduplication, which includes some copy-on-write filesystems.
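
a minimal python sketch of what such a reference table could look like. everything here is hypothetical: the regular expression, the (offset, hash) table layout and the function names are assumptions for illustration, not the NAR format or guix's actual grafting code.

```python
import re

# assumption: references look like /gnu/store/<32 base32 chars>-<name>
HASH_RE = re.compile(rb"/gnu/store/([0-9a-df-np-sv-z]{32})-")


def reference_table(blob: bytes) -> list[tuple[int, bytes]]:
    """Scan once and return (offset, hash) for every store hash found in blob."""
    return [(m.start(1), m.group(1)) for m in HASH_RE.finditer(blob)]


def graft(blob: bytes, table, mapping) -> bytes:
    """Rewrite hashes at the recorded offsets, without rescanning the blob."""
    out = bytearray(blob)
    for offset, old in table:
        new = mapping.get(old)
        if new is not None:
            # replacements have the same length, so the offsets stay valid
            assert len(new) == len(old)
            out[offset:offset + len(old)] = new
    return bytes(out)


old_hash = b"a" * 32  # made-up hashes
new_hash = b"b" * 32
blob = b"#!/gnu/store/" + old_hash + b"-bash-5.1/bin/bash\nexec foo\n"

table = reference_table(blob)  # this is what would be stored at the end of the file
grafted = graft(blob, table, {old_hash: new_hash})
print(table[0][0], grafted[:60])
```

with the references kept out of band like this, the rest of the file could stay byte-identical across grafts, which is the part that helps deduplication.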

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“If you shut up truth and bury it under the ground, it will but grow, and gather to itself such explosive power that the day it bursts through it will knock down everything that stands in its way.”
	— Émile Zola (1840–1902)

Thread overview: 5+ messages
2023-06-20 22:44 distributed substitutes: file slicing Csepp
2023-06-21 14:32 ` Attila Lendvai [this message]
2023-06-21 23:06 ` Csepp
2023-06-25  9:48 ` pukkamustard
2023-06-26 13:41 ` Florian Klink
