From: pukkamustard <pukkamustard@posteo.net>
To: Csepp <raingloom@riseup.net>
Cc: guix-devel@gnu.org
Subject: Re: distributed substitutes: file slicing
Date: Sun, 25 Jun 2023 09:48:11 +0000
Message-ID: <861qhzluhp.fsf@posteo.net>
In-Reply-To: <87jzvxoicg.fsf@riseup.net>


Csepp <raingloom@riseup.net> writes:

> I have a question / suggestion about the distributed substitutes
> project: would downloads be split into uniformly sized chunks or could
> the sizes vary?

For the proposal that uses ERIS (https://issues.guix.gnu.org/52555) the
chunks are uniformly sized (32KiB).

> Specifically, in an extreme case where an update introduced a single
> extra byte at the beginning of a file, would that result in completely
> new chunks?

Yes, that would be the case.

ERIS uses a fixed block size, so in such an extreme case every chunk
would be new, which means very poor de-duplication.
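
To make the effect concrete, here is a minimal sketch of plain
fixed-size chunking. It is not the ERIS encoding (no encryption, and
SHA-256 plus zero padding are only stand-ins); the names `block_ids`,
`shared_blocks` and `sample_data` are made up for this illustration.

```python
import hashlib

BLOCK_SIZE = 32 * 1024  # 32 KiB, the block size mentioned above


def sample_data(n_blocks: int) -> bytes:
    """Deterministic pseudo-random bytes, n_blocks * BLOCK_SIZE long."""
    count = n_blocks * BLOCK_SIZE // 32  # a SHA-256 digest is 32 bytes
    return b"".join(
        hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(count)
    )


def block_ids(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut data into fixed-size blocks and return one hash per block.

    The last block is zero-padded (an assumption made for this sketch;
    real ERIS padding differs).
    """
    ids = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size].ljust(block_size, b"\0")
        ids.append(hashlib.sha256(block).hexdigest())
    return ids


def shared_blocks(old: bytes, new: bytes) -> int:
    """Number of distinct blocks that both versions have in common."""
    return len(set(block_ids(old)) & set(block_ids(new)))


original = sample_data(4)
appended = original + b"x"   # one extra byte at the end
prepended = b"x" + original  # one extra byte at the beginning

print("shared after append: ", shared_blocks(original, appended))
print("shared after prepend:", shared_blocks(original, prepended))
```

Appending leaves the existing blocks intact, while prepending shifts
every block boundary by one byte, so no block is shared any more.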

The reason for using a fixed block size is security/privacy. When using
variable-sized blocks, the sizes are observable by a potential censor and
are also a function of the content itself. This leaks information about
the transferred content.

I believe there are documented cases of HTTPS connections being
blocked/censored based on size of requests [citation needed]. This is
something ERIS tries to prevent.

That being said, I think there is still room for optimizing the
de-duplication even with fixed-size blocks.

> An alternative I've been thinking about is this:
> find the store references in a file and split it along these references,
> optionally apply further chunking to the non-reference blobs.
>
> It's probably best to do this at the NAR level??

I like the idea!

If I understand correctly, we would split whenever a store reference
appears. When a single store reference changes (this probably happens
quite often), only the preceding block changes.
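
A rough sketch of that reading, assuming a chunk is cut right after
each store reference so that the reference ends the chunk it belongs
to. The regular expression (`/gnu/store/` followed by 32 base32
characters), the helper names and the sample hashes are assumptions
made for this illustration, not code from the patches.

```python
import hashlib
import re

# Assumed shape of a store reference: prefix plus 32 base32 characters.
STORE_REF = re.compile(rb"/gnu/store/[0-9a-df-np-sv-z]{32}")


def split_at_references(data: bytes) -> list:
    """Cut data so that every store reference ends a chunk."""
    chunks = []
    start = 0
    for match in STORE_REF.finditer(data):
        chunks.append(data[start:match.end()])
        start = match.end()
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last reference
    return chunks


def chunk_ids(data: bytes) -> list:
    """One hash per variable-sized chunk."""
    return [hashlib.sha256(c).hexdigest() for c in split_at_references(data)]


# Made-up references; real hashes would of course look different.
ref_a = b"/gnu/store/" + b"a" * 32
ref_b = b"/gnu/store/" + b"b" * 32
ref_c = b"/gnu/store/" + b"c" * 32

old = b"#!" + ref_a + b"-bash/bin/sh\nexec " + ref_b + b"-hello/bin/hello\n"
new = b"#!" + ref_a + b"-bash/bin/sh\nexec " + ref_c + b"-hello/bin/hello\n"

changed = [i for i, (x, y) in enumerate(zip(chunk_ids(old), chunk_ids(new)))
           if x != y]
print("changed chunk indexes:", changed)
```

Only the chunk ending in the changed reference gets a new hash. The
chunk sizes now depend on the content, though, which is the leak
described earlier.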

I think there is also a way to do something similar while preserving
fixed size blocks:

Maintain a lookup table of all store references appearing in a store
item. When serializing, this lookup table goes to the front (or back)
with appropriate padding so that it is block-aligned. All store
references in the remaining serialization are replaced by a reference
into the lookup table. Now, when a store reference changes, only the
lookup table changes; the remaining content stays the same and is
de-duplicated.
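
A minimal sketch of how this could look, under several assumptions:
the table goes to the front, a placeholder is a NUL byte, `REF` and a
4-byte index, and the header stores the table size and body length.
None of this is an existing format. The sketch also does no escaping,
so input that already contains the placeholder marker would not
round-trip.

```python
import hashlib
import re
import struct

BLOCK_SIZE = 32 * 1024
STORE_REF = re.compile(rb"/gnu/store/[0-9a-df-np-sv-z]{32}")  # assumed shape
REF_LEN = len(b"/gnu/store/") + 32
PLACEHOLDER = re.compile(rb"\0REF(.{4})", re.DOTALL)  # hypothetical marker


def pad(data: bytes) -> bytes:
    """Zero-pad data to a multiple of BLOCK_SIZE."""
    return data + b"\0" * (-len(data) % BLOCK_SIZE)


def encode(data: bytes) -> bytes:
    """Move store references into a block-aligned table at the front."""
    table = []

    def substitute(match):
        ref = match.group(0)
        if ref not in table:
            table.append(ref)
        return b"\0REF" + struct.pack(">I", table.index(ref))

    body = STORE_REF.sub(substitute, data)
    header = struct.pack(">II", len(table), len(body)) + b"".join(table)
    return pad(header) + pad(body)


def decode(encoded: bytes) -> bytes:
    """Inverse of encode (for input without stray placeholder markers)."""
    count, body_len = struct.unpack_from(">II", encoded)
    table = [encoded[8 + i * REF_LEN:8 + (i + 1) * REF_LEN]
             for i in range(count)]
    body_start = len(pad(encoded[:8 + count * REF_LEN]))
    body = encoded[body_start:body_start + body_len]
    return PLACEHOLDER.sub(
        lambda m: table[struct.unpack(">I", m.group(1))[0]], body)


def block_ids(data: bytes) -> list:
    """One hash per fixed-size block of the (padded) data."""
    data = pad(data)
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]


def changed_blocks(old: bytes, new: bytes) -> int:
    return sum(a != b for a, b in zip(block_ids(old), block_ids(new)))


# A store item that mentions the same (made-up) dependency four times.
filler = b"some library code\n" * 2000
old = (b"/gnu/store/" + b"a" * 32 + filler) * 4
new = (b"/gnu/store/" + b"b" * 32 + filler) * 4

print("naive blocks changed:  ", changed_blocks(old, new))
print("with table, changed:   ", changed_blocks(encode(old), encode(new)))
```

With the naive encoding, every block that contains the reference
changes. With the table, only the table block changes, and all blocks
still have the same fixed size.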

A similar idea for also allowing de-duplication when individual files
change: https://codeberg.org/eris/eer/src/branch/eris-fs/eer/eris-fs/index.md

Also check out the Guix `wip-digests` branch. There are some related
interesting ideas there.

I'm working on rebasing and updating the decentralized substitute
patches. Sorry for the slowness. At first they would only address
block-wise transfer, with a naive encoding that does not de-duplicate
very well.

As outlined, I think de-duplication can be added later, and I think it's
great to start thinking about it and experimenting with ideas.

-pukkamustard

