From: "Ludovic Courtès" <ludo@gnu.org>
To: pukkamustard <pukkamustard@posteo.net>
Cc: guix-devel@gnu.org
Subject: Re: Identical files across subsequent package revisions
Date: Wed, 06 Jan 2021 10:27:28 +0100
Message-ID: <87mtxmpg8v.fsf@gnu.org>
In-Reply-To: <86v9ckmleq.fsf@posteo.net> (pukkamustard@posteo.net's message of "Tue, 29 Dec 2020 21:01:33 +0100")

Hi!

pukkamustard <pukkamustard@posteo.net> skribis:

> Your research inspired me to conduct some experiments towards
> de-duplication.
>
> For two similar packages (emacs-27.1 and emacs-26.3) I was able to
> de-duplicate ~12% using EROFS and ERIS. Still far from the ~85%
> similarity, but an attempt I'd like to share.
>
> The two main ingredients:
>
> - EROFS (Enhanced Read-Only File-System) is a read-only, compressed
>   file-system comparable to SquashFS. It has some properties that make
>   it more suitable than SquashFS (it aligns content to a fixed block
>   size). EROFS has been in the mainline Linux kernel since v5.4.
>
> - ERIS (Encoding for Robust Immutable Storage) is an encoding of
>   content into uniformly sized blocks that I've been working on. It
>   de-couples encoding of content from the storage and transport
>   layers. Transport layers can be things like IPFS, GNUnet, Named
>   Data Network or just a plain old HTTP service.
>
> I make EROFS images of the packages and encode them with ERIS, which
> de-duplicates blocks as part of the encoding process.
>
> With this I manage to de-duplicate between 12% and 17% (depending
> on some parameters).
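
To make the block-level idea concrete, here is a minimal sketch of how a de-duplication ratio over fixed-size blocks could be measured. It is not the ERIS encoding: a plain SHA-256 hash stands in for it, and the 4096-byte block size and the names `block_hashes` and `dedup_ratio` are assumptions chosen for illustration.

```python
import hashlib


def block_hashes(data: bytes, block_size: int = 4096) -> list[bytes]:
    """Split data into fixed-size blocks (zero-padding the last one)
    and return the SHA-256 digest of each block."""
    hashes = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size].ljust(block_size, b"\0")
        hashes.append(hashlib.sha256(block).digest())
    return hashes


def dedup_ratio(images: list[bytes], block_size: int = 4096) -> float:
    """Fraction of blocks that need not be stored because an identical
    block has already been seen, across all images."""
    total = 0
    unique = set()
    for image in images:
        digests = block_hashes(image, block_size)
        total += len(digests)
        unique.update(digests)
    return 0.0 if total == 0 else 1 - len(unique) / total
```

Two images that share one of their two blocks give a ratio of 0.25, whereas prepending a single byte to an image shifts every block boundary and the ratio drops to 0. That is presumably why the alignment to a fixed block size mentioned above matters.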

Very nice!  I wonder what the file-level similarity is compared to the
block-level similarity.
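
For comparison, file-level similarity between two unpacked package directories could be estimated along these lines. This is only a sketch under assumptions: the function names are made up for illustration, files are compared by SHA-256 of their content regardless of path, and symlinks are skipped.

```python
import hashlib
from pathlib import Path


def file_digests(root: str) -> dict[str, str]:
    """Map each regular file under root (relative path) to the
    SHA-256 hex digest of its content; symlinks are skipped."""
    base = Path(root)
    digests = {}
    for path in sorted(base.rglob("*")):
        if path.is_file() and not path.is_symlink():
            content_hash = hashlib.sha256(path.read_bytes()).hexdigest()
            digests[str(path.relative_to(base))] = content_hash
    return digests


def shared_file_fraction(old_root: str, new_root: str) -> float:
    """Fraction of files in new_root whose content also appears
    somewhere in old_root."""
    old = set(file_digests(old_root).values())
    new = file_digests(new_root)
    if not new:
        return 0.0
    same = sum(1 for digest in new.values() if digest in old)
    return same / len(new)
```

Calling `shared_file_fraction` on the directories of two package revisions gives a number that can be set against the block-level ratio obtained from their images.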

> This could allow:
>
> - Directly mounting packages instead of unarchiving (a la distri)

Yeah, I’m not sure about this part.  It would be a radical change for
Guix in terms of code, and I also wonder about the efficiency: sure
mounting the package would be instantaneous, but subsequent reads would
be slowed down compared to the current approach.  Maybe the slowdown is
only on the first hit though, and maybe it’s hardly measurable, dunno.

> - Peer-to-peer distribution of packages (that's what ERIS is for)

Yup, looking forward to that.

> - De-duplicating common content in packages to a certain extent
>   (topic of this thread)
>
> A more in-depth write-up:
> https://gitlab.com/openengiadina/eris/-/tree/main/examples/dedup-fs

Great writeup and nice tooling that you have here!

Thanks,
Ludo’.

