unofficial mirror of guix-devel@gnu.org 
From: Efraim Flashner <efraim@flashner.co.il>
To: Felix Lechner <felix.lechner@lease-up.com>
Cc: guix-devel@gnu.org
Subject: Daemon deduplication and btrfs compression [was Re: Are 'guix gc' stats exaggerated?]
Date: Sun, 2 Jun 2024 11:24:20 +0300
Message-ID: <ZlwsNCflz6Yb_S86@3900XT>
In-Reply-To: <87a5k5oczg.fsf@lease-up.com>

On Fri, May 31, 2024 at 03:03:47PM -0700, Felix Lechner wrote:
> Hi Efraim,
> 
> On Tue, May 28 2024, Efraim Flashner wrote:
> 
> > As your store grows larger the inherent deduplication from the
> > guix-daemon approaches a 3:1 file deduplication ratio.
> 
> Thank you for your explanations and your data about btrfs!  Btrfs
> compression is a well-understood feature, although even its developers
> acknowledge that the benefit is hard to quantify.
> 
> It probably makes more sense to focus on the Guix daemon here.  I hope
> you don't mind a few clarifying questions.
> 
> Why, please, does the benefit of de-duplication approach a fixed ratio
> of 3:1?  Does the benefit not depend on the number of copies in the
> store, which can vary by any number?  (It sounds like the answer may
> have something to do with store size.)

This is just what I have observed; I don't know of an actual reason why
the ratio would settle at that number. With hardlinks, only files that
are byte-for-byte identical share a link, whereas block-based
deduplication could be more granular. So it's quite likely that multiple
copies of the same package at the same version share the majority of
their files with one another.
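
To make the whole-file idea concrete, here is a minimal Python sketch of
hardlink deduplication keyed on a content hash. It is an illustration
only, not the daemon's code: the `links_dir` directory, the function
names, and the choice of SHA-256 are all assumptions made for the
example, and it ignores permissions and timestamps.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def deduplicate(root, links_dir):
    """Hardlink byte-identical regular files under root (hypothetical helper).

    links_dir holds one canonical link per distinct content hash and must
    be on the same file system as root. Returns the apparent number of
    bytes that no longer need a copy of their own.
    """
    os.makedirs(links_dir, exist_ok=True)
    saved = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # only regular files are candidates
            canonical = os.path.join(links_dir, file_digest(path))
            if not os.path.exists(canonical):
                # First time this content is seen: register it.
                os.link(path, canonical)
                continue
            if os.path.samefile(path, canonical):
                continue  # already shares the canonical inode
            size = os.path.getsize(path)
            # Replace the duplicate atomically with a link to the canonical file.
            tmp = path + ".dedup-tmp"
            os.link(canonical, tmp)
            os.replace(tmp, path)
            saved += size
    return saved
```

Two files that differ in a single byte hash differently and stay
separate, which is the limitation compared with block-based
deduplication.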

> Further, why is the removal of hardlinks counted as saving space even
> when their inode reference count, which is widely available [1], is
> greater than one?

I suspect that this part of the code is in the C++ daemon, which no one
really wants to hack on.  AFAIK Nix turned off deduplication by default
years ago to speed up store operations, so I wouldn't be surprised if
they also haven't worked on that part of the code.
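
For reference, link-count-aware accounting along the lines Felix asks
about could look roughly like the Python sketch below. This is a guess
at the idea, not the daemon's logic: `bytes_freed_by_unlink` and its
`bookkeeping_links` parameter are hypothetical names, and it reports the
apparent file size rather than the on-disk allocation.

```python
import os


def bytes_freed_by_unlink(path, bookkeeping_links=0):
    """Estimate how many bytes removing `path` would really free.

    File data is released only when the last link to the inode is removed.
    `bookkeeping_links` (hypothetical) is the number of other links that
    the collector would remove anyway, such as an entry in a
    deduplication directory.
    """
    st = os.lstat(path)
    # Links that would still point at the inode after this removal.
    remaining = st.st_nlink - 1 - bookkeeping_links
    if remaining > 0:
        return 0  # another name keeps the data alive
    return st.st_size  # apparent size; compression is not considered
```

With this rule, deleting one of three hardlinked copies reports 0 bytes
freed instead of the full file size.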

> Finally, barring a better solution, should our output numbers be divided
> by three to bring them closer to the expected result for users?
> 
> [1] https://en.wikipedia.org/wiki/Hard_link#Reference_counting

(ins)efraim@3900XT ~$ sudo compsize -x /gnu
Processed 39994797 files, 12867013 regular extents (28475611 refs), 20558307 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       56%      437G         776G         2.1T
none       100%      275G         275G         723G
zstd        32%      161G         500G         1.4T

It looks like right now my store is physically using 437GB of space.
Looking only at the TOTAL row, the Uncompressed -> Referenced ratio is
about 2.77:1 and Disk Usage -> Uncompressed is about 1.78:1, so I'm
netting a total of 4.92:1.
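
The arithmetic behind those three figures, as a small Python sketch. It
assumes the G and T suffixes are binary units (1T = 1024G), which is
what reproduces the numbers above; the function name is made up for the
example.

```python
GIB_PER_TIB = 1024  # assumption: compsize suffixes are powers of 1024


def ratios(disk_gib, uncompressed_gib, referenced_gib):
    """Return (deduplication, compression, combined) ratios."""
    dedup = referenced_gib / uncompressed_gib   # Uncompressed -> Referenced
    compression = uncompressed_gib / disk_gib   # Disk Usage -> Uncompressed
    return dedup, compression, dedup * compression


# TOTAL row from the 3900XT output: 437G on disk, 776G uncompressed, 2.1T referenced
dedup, compression, total = ratios(437, 776, 2.1 * GIB_PER_TIB)
print(f"{dedup:.2f}:1  {compression:.2f}:1  {total:.2f}:1")
```

The combined ratio is simply Referenced divided by Disk Usage, so the
two partial ratios multiply.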

Numbers on Berlin are a bit different:

(ins)efraim@berlin ~$ time guix shell compsize -- sudo compsize -x /gnu
Processed 41030472 files, 14521470 regular extents (37470325 refs), 17429255 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       59%      578G         970G         3.2T
none       100%      402G         402G         1.1T
zstd        31%      176G         567G         2.1T

real    45m9.762s
user    1m53.984s
sys     24m37.338s

Uncompressed -> Referenced:     3.4:1
Disk Usage -> Uncompressed:     1.68:1
Total:                          5.67:1

Looking at it another way, on Berlin the data that zstd can compress
goes from 3.79:1 with deduplication alone to 12.22:1 once compression is
added, while the incompressible data stays at 2.8:1 either way.
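
The per-type breakdown can be recomputed from the table itself. Below is
a Python sketch that parses the rows of the Berlin output and prints
both ratios per type. The parser is a hypothetical helper written for
this message, it assumes binary unit suffixes, and it only understands
the five-column layout shown above.

```python
import re

# Assumption: compsize suffixes are powers of 1024.
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert a size such as '578G' or '3.2T' to bytes."""
    m = re.fullmatch(r"([\d.]+)([BKMGT])", text)
    return float(m.group(1)) * UNITS[m.group(2)]


def parse_compsize(output):
    """Map each type to (disk usage, uncompressed, referenced) in bytes."""
    rows = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows have five fields and a percentage in the second one.
        if len(fields) == 5 and fields[1].endswith("%"):
            name, _perc, disk, uncompressed, referenced = fields
            rows[name] = tuple(to_bytes(x) for x in (disk, uncompressed, referenced))
    return rows


BERLIN = """\
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       59%      578G         970G         3.2T
none       100%      402G         402G         1.1T
zstd        31%      176G         567G         2.1T
"""

for name, (disk, uncompressed, referenced) in parse_compsize(BERLIN).items():
    print(f"{name:6} dedup only {referenced / uncompressed:.2f}:1"
          f"  dedup + compression {referenced / disk:.2f}:1")
```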

-- 
Efraim Flashner   <efraim@flashner.co.il>   רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted
