unofficial mirror of guix-devel@gnu.org 
* Are 'guix gc' stats exaggerated?
From: Felix Lechner via Development of GNU Guix and the GNU System distribution. @ 2024-05-26 20:13 UTC (permalink / raw)
  To: guix-devel

Hi,

Today I ran 'guix gc' on equipment with an ext4 root partition.  It had
these space characteristics beforehand:

    Filesystem  Size       Used      Avail     Use% Mounted on
    /dev/dm-3   309047680  157252980 138126064  54% /

or for human eyes:

    /dev/dm-3   295G        150G      132G      54% /

After the run, the drive showed:

    /dev/dm-3   309047680   88267956 207111088  30% /

or for human eyes:

    /dev/dm-3   295G         85G      198G      30% /

By my math, about 65.8 GiB were recovered.
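
The figure can be checked with a minimal sketch. It assumes the df
columns above are in 1K blocks (the GNU df default); the variable names
are only illustrative.

```python
# "Used" column from the df output before and after the run (1K blocks).
used_before_kib = 157_252_980
used_after_kib = 88_267_956

KIB_PER_GIB = 1024 ** 2

# Space that actually became available on the filesystem.
freed_kib = used_before_kib - used_after_kib
freed_gib = freed_kib / KIB_PER_GIB

print(f"freed: {freed_kib} KiB = {freed_gib:.1f} GiB")
```

This prints "freed: 68985024 KiB = 65.8 GiB".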

When 'guix gc' was done, it announced:

    [184389 MiB] deleting '/gnu/store/...'
    deleting `/gnu/store/trash'
    deleting unused links...
    note: currently hard linking saves 59224.03 MiB
    guix gc: freed 110,649.49 MiBs

Seeing the 184389 MiB number, or 180 GiB, already made me suspicious.
It exceeded my drive usage by 30 GiB.  Even the more conservative 110649
MiB "freed," however, are off by a mile. That would have been 108 GiB,
or 42 GiB more than the space actually recovered.
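
The conversions can be reproduced with a short sketch.  It only
re-does the arithmetic on the numbers quoted above and says nothing
about how 'guix gc' computes them; the names are illustrative.

```python
MIB_PER_GIB = 1024

progress_counter_mib = 184_389    # "[184389 MiB] deleting ..."
reported_freed_mib = 110_649.49   # "guix gc: freed 110,649.49 MiBs"

# Space recovered according to df (1K blocks converted to GiB).
df_freed_gib = (157_252_980 - 88_267_956) / 1024 ** 2

progress_counter_gib = progress_counter_mib / MIB_PER_GIB
reported_freed_gib = reported_freed_mib / MIB_PER_GIB

# How far the "freed" figure overshoots what df shows.
gap_gib = reported_freed_gib - df_freed_gib

print(f"progress counter: {progress_counter_gib:.1f} GiB")
print(f"reported freed:   {reported_freed_gib:.1f} GiB")
print(f"gap versus df:    {gap_gib:.1f} GiB")
```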

Am I looking at those numbers the wrong way?  Thanks!

Kind regards
Felix



* Re: Are 'guix gc' stats exaggerated?
From: raingloom @ 2024-05-27  9:10 UTC (permalink / raw)
  To: Felix Lechner; +Cc: guix-devel

On 2024-05-26 22:13, Felix Lechner via "Development of GNU Guix and the
GNU System distribution." wrote:
> Hi,
> 
> Today I ran 'guix gc' on equipment with an ext4 root partition.  It had
> these space characteristics beforehand:
> 
>     Filesystem  Size       Used      Avail     Use% Mounted on
>     /dev/dm-3   309047680  157252980 138126064  54% /
> 
> or for human eyes:
> 
>     /dev/dm-3   295G        150G      132G      54% /
> 
> After the run, the drive showed:
> 
>     /dev/dm-3   309047680   88267956 207111088  30% /
> 
> or for human eyes:
> 
>     /dev/dm-3   295G         85G      198G      30% /
> 
> By my math, about 65.8 GiB were recovered.
> 
> When 'guix gc' was done, it announced:
> 
>     [184389 MiB] deleting '/gnu/store/...'
>     deleting `/gnu/store/trash'
>     deleting unused links...
>     note: currently hard linking saves 59224.03 MiB
>     guix gc: freed 110,649.49 MiBs
> 
> Seeing the 184389 MiB number, or 180 GiB, already made me suspicious.
> It exceeded my drive usage by 30 GiB.  Even the more conservative 110649
> MiB "freed," however, are off by a mile. That would have been 108 GiB,
> or 42 GiB more than the space actually recovered.
> 
> Am I looking at those numbers the wrong way?  Thanks!
> 
> Kind regards
> Felix

Are you using compression? (BTRFS, ZFS, etc)



* Re: Are 'guix gc' stats exaggerated?
From: Felix Lechner via Development of GNU Guix and the GNU System distribution. @ 2024-05-28  2:47 UTC (permalink / raw)
  To: raingloom; +Cc: guix-devel

Hi raingloom,

On Mon, May 27 2024, raingloom@riseup.net wrote:

> Are you using compression? (BTRFS, ZFS, etc)

No, I thought about that, too, but that volume, like all my root
volumes, is straight ext4 on LVM2, on bare metal.

Kind regards
Felix



* Re: Are 'guix gc' stats exaggerated?
From: Efraim Flashner @ 2024-05-28  9:01 UTC (permalink / raw)
  To: Felix Lechner; +Cc: guix-devel


On Sun, May 26, 2024 at 01:13:45PM -0700, Felix Lechner via Development of GNU Guix and the GNU System distribution. wrote:
> Hi,
> 
> Today I ran 'guix gc' on equipment with an ext4 root partition.  It had
> these space characteristics beforehand:
> 
>     Filesystem  Size       Used      Avail     Use% Mounted on
>     /dev/dm-3   309047680  157252980 138126064  54% /
> 
> or for human eyes:
> 
>     /dev/dm-3   295G        150G      132G      54% /
> 
> After the run, the drive showed:
> 
>     /dev/dm-3   309047680   88267956 207111088  30% /
> 
> or for human eyes:
> 
>     /dev/dm-3   295G         85G      198G      30% /
> 
> By my math, about 65.8 GiB were recovered.
> 
> When 'guix gc' was done, it announced:
> 
>     [184389 MiB] deleting '/gnu/store/...'
>     deleting `/gnu/store/trash'
>     deleting unused links...
>     note: currently hard linking saves 59224.03 MiB
>     guix gc: freed 110,649.49 MiBs
> 
> Seeing the 184389 MiB number, or 180 GiB, already made me suspicious.
> It exceeded my drive usage by 30 GiB.  Even the more conservative 110649
> MiB "freed," however, are off by a mile. That would have been 108 GiB,
> or 42 GiB more than the space actually recovered.
> 
> Am I looking at those numbers the wrong way?  Thanks!

As your store grows larger, the inherent file deduplication done by the
guix-daemon approaches, in my experience, a ratio of about 3:1.  If two
files are identical then they are hardlinked to the same inode, so their
data is stored on the drive only once and you save some space.
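
A minimal sketch of how hard links behave on a POSIX filesystem; it is
generic Python and not taken from the daemon.  Two names point at one
inode, and removing one name leaves the data in place.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a")
    second = os.path.join(tmp, "b")

    with open(first, "wb") as f:
        f.write(b"x" * 4096)

    # Create a second name for the same inode; no data is copied.
    os.link(first, second)

    st_first, st_second = os.stat(first), os.stat(second)
    same_inode = st_first.st_ino == st_second.st_ino
    links_with_both = st_first.st_nlink

    # Removing one name only lowers the link count.
    os.unlink(first)
    links_after_unlink = os.stat(second).st_nlink
    still_readable = os.path.getsize(second) == 4096

print(same_inode, links_with_both, links_after_unlink, still_readable)
```

The blocks are returned to the filesystem only when the last link is
removed (and no process still has the file open).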

I have found that if you switch to btrfs and add zstd (level 3)
compression then you get about another 2:1 on top of that, for around
5.5:1.

-- 
Efraim Flashner   <efraim@flashner.co.il>   רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted


* Re: Are 'guix gc' stats exaggerated?
From: Simon Tournier @ 2024-05-31 16:33 UTC (permalink / raw)
  To: Felix Lechner, guix-devel

Hi,

On Sun, 26 May 2024 at 13:13, Felix Lechner via "Development of GNU Guix and the GNU System distribution." <guix-devel@gnu.org> wrote:

> By my math, about 65.8 GiB were recovered.
>
> When 'guix gc' was done, it announced:
>
>     [184389 MiB] deleting '/gnu/store/...'
>     deleting `/gnu/store/trash'
>     deleting unused links...
>     note: currently hard linking saves 59224.03 MiB
>     guix gc: freed 110,649.49 MiBs

Well, 180 GiB does not count deduplication, I guess.  And as Efraim
said, the ratio on average is 3:1 so 65 GiB vs 180 GiB seems consistent,
right?
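
As a rough check of that consistency, here is a sketch that divides the
two figures printed by 'guix gc' by the space df showed as recovered.
It is plain arithmetic on the numbers in this thread, not a statement
about what either figure measures.

```python
# Space recovered according to df, converted from 1K blocks to MiB.
df_freed_mib = (157_252_980 - 88_267_956) / 1024

progress_counter_mib = 184_389   # "[184389 MiB] deleting ..."
reported_freed_mib = 110_649.49  # "guix gc: freed 110,649.49 MiBs"

counter_ratio = progress_counter_mib / df_freed_mib
reported_ratio = reported_freed_mib / df_freed_mib

print(f"counter / df:  {counter_ratio:.2f}:1")
print(f"reported / df: {reported_ratio:.2f}:1")
```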

However, the question is then: what are these 110 GiB?

Cheers,
simon



* Re: Are 'guix gc' stats exaggerated?
From: Felix Lechner via Development of GNU Guix and the GNU System distribution. @ 2024-05-31 22:03 UTC (permalink / raw)
  To: Efraim Flashner; +Cc: guix-devel

Hi Efraim,

On Tue, May 28 2024, Efraim Flashner wrote:

> As your store grows larger the inherent deduplication from the
> guix-daemon approaches a 3:1 file deduplication ratio.

Thank you for your explanations and your data about btrfs!  Btrfs
compression is a well-understood feature, although even its developers
acknowledge that the benefit is hard to quantify.

It probably makes more sense to focus on the Guix daemon here.  I hope
you don't mind a few clarifying questions.

Why, please, does the benefit of de-duplication approach a fixed ratio
of 3:1?  Does the benefit not depend on the number of copies in the
store, which can vary by any number?  (It sounds like the answer may
have something to do with store size.)

Further, why is the removal of hardlinks counted as saving space even
when their inode reference count, which is widely available [1], is
greater than one?

Finally, barring a better solution, should our output numbers be divided
by three to bring them closer to the expected result for users?

Thanks!

Kind regards,
Felix

[1] https://en.wikipedia.org/wiki/Hard_link#Reference_counting


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Daemon deduplication and btrfs compression [was Re: Are 'guix gc' stats exaggerated?]
  2024-05-31 22:03   ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
@ 2024-06-02  8:24     ` Efraim Flashner
  2024-06-06 14:17     ` Are 'guix gc' stats exaggerated? Ludovic Courtès
  1 sibling, 0 replies; 12+ messages in thread
From: Efraim Flashner @ 2024-06-02  8:24 UTC (permalink / raw)
  To: Felix Lechner; +Cc: guix-devel


On Fri, May 31, 2024 at 03:03:47PM -0700, Felix Lechner wrote:
> Hi Efraim,
> 
> On Tue, May 28 2024, Efraim Flashner wrote:
> 
> > As your store grows larger the inherent deduplication from the
> > guix-daemon approaches a 3:1 file deduplication ratio.
> 
> Thank you for your explanations and your data about btrfs!  Btrfs
> compression is a well-understood feature, although even its developers
> acknowledge that the benefit is hard to quantify.
> 
> It probably makes more sense to focus on the Guix daemon here.  I hope
> you don't mind a few clarifying questions.
> 
> Why, please, does the benefit of de-duplication approach a fixed ratio
> of 3:1?  Does the benefit not depend on the number of copies in the
> store, which can vary by any number?  (It sounds like the answer may
> have something to do with store size.)

It would seem that this is just my experience; I'm not sure of an
actual reason why it would be the case.  I believe that with hardlinks
only files which are identical share a link, as opposed to block-based
deduplication, which could deduplicate at a finer granularity.  So it's
quite likely that multiple copies of the same package at the same
version share the majority of their files with the other copies of the
package.

> Further, why is the removal of hardlinks counted as saving space even
> when their inode reference count, which is widely available [1], is
> greater than one?

I suspect that this part of the code is in the C++ daemon, which no one
really wants to hack on.  AFAIK Nix turned off deduplication by default
years ago to speed up store operations, so I wouldn't be surprised if
they also haven't worked on that part of the code.

> Finally, barring a better solution, should our output numbers be divided
> by three to bring them closer to the expected result for users?
> 
> [1] https://en.wikipedia.org/wiki/Hard_link#Reference_counting

(ins)efraim@3900XT ~$ sudo compsize -x /gnu
Processed 39994797 files, 12867013 regular extents (28475611 refs), 20558307 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       56%      437G         776G         2.1T
none       100%      275G         275G         723G
zstd        32%      161G         500G         1.4T

It looks like right now my store is physically using 437G of space.
Looking only at the totals, the Uncompressed -> Referenced ratio is
about 2.77:1 and the Disk Usage -> Uncompressed ratio is about 1.78:1,
so I'm netting a total of about 4.92:1.
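
These ratios can be reproduced from the TOTAL row with a small sketch.
It assumes that compsize's "G" and "T" suffixes are binary units
(1T = 1024G); to_gib is a hypothetical helper, not part of compsize.

```python
# Assumption: compsize suffixes are powers of 1024.
UNITS_IN_GIB = {"G": 1, "T": 1024}


def to_gib(text):
    """Convert a compsize-style size such as '437G' or '2.1T' to GiB."""
    return float(text[:-1]) * UNITS_IN_GIB[text[-1]]


# TOTAL row: Disk Usage, Uncompressed, Referenced.
disk, uncompressed, referenced = (to_gib(s) for s in ("437G", "776G", "2.1T"))

dedup_ratio = referenced / uncompressed    # Uncompressed -> Referenced
compression_ratio = uncompressed / disk    # Disk Usage -> Uncompressed
total_ratio = referenced / disk            # both effects combined

print(f"{dedup_ratio:.2f} {compression_ratio:.2f} {total_ratio:.2f}")
```

This prints "2.77 1.78 4.92".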

Numbers on Berlin are a bit different:

(ins)efraim@berlin ~$ time guix shell compsize -- sudo compsize -x /gnu
Processed 41030472 files, 14521470 regular extents (37470325 refs), 17429255 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       59%      578G         970G         3.2T
none       100%      402G         402G         1.1T
zstd        31%      176G         567G         2.1T

real    45m9.762s
user    1m53.984s
sys     24m37.338s

Uncompressed -> Referenced:     3.4:1
Disk Usage -> Uncompressed:     1.68:1
Total:                          5.67:1

Looking at it another way, on Berlin the bits that are compressible
with zstd move from 3.79:1 with deduplication alone (2.1T / 567G) to
12.22:1 with deduplication and compression together (2.1T / 176G), with
no change (2.8:1) for the uncompressible bits.
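
The per-type figures follow from the Berlin table.  Here is a sketch,
again assuming binary units (1T = 1024G), with the row values copied by
hand from the compsize output above.

```python
TIB = 1024  # GiB per TiB, assuming binary units

rows = {
    # type: (disk usage, uncompressed, referenced), all in GiB
    "TOTAL": (578, 970, 3.2 * TIB),
    "none": (402, 402, 1.1 * TIB),
    "zstd": (176, 567, 2.1 * TIB),
}

# For each row: ratio from deduplication alone, then with compression too.
ratios = {
    kind: (referenced / uncompressed, referenced / disk)
    for kind, (disk, uncompressed, referenced) in rows.items()
}

for kind, (dedup_only, dedup_plus_compression) in ratios.items():
    print(f"{kind:5} {dedup_only:5.2f}:1 -> {dedup_plus_compression:5.2f}:1")
```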

-- 
Efraim Flashner   <efraim@flashner.co.il>   רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted


* Re: Are 'guix gc' stats exaggerated?
From: Ludovic Courtès @ 2024-06-06 14:17 UTC (permalink / raw)
  To: Felix Lechner via Development of GNU Guix and the GNU System distribution.
  Cc: Efraim Flashner, Felix Lechner

Hi Felix,

Felix Lechner via "Development of GNU Guix and the GNU System
distribution." <guix-devel@gnu.org> writes:

> It probably makes more sense to focus on the Guix daemon here.  I hope
> you don't mind a few clarifying questions.
>
> Why, please, does the benefit of de-duplication approach a fixed ratio
> of 3:1?  Does the benefit not depend on the number of copies in the
> store, which can vary by any number?  (It sounds like the answer may
> have something to do with store size.)

Where does that 3:1 figure come from?

> Further, why is the removal of hardlinks counted as saving space even
> when their inode reference count, which is widely available [1], is
> greater than one?

Where do you see that in the code?  After checking ‘removeUnusedLinks’,
I think it counts space savings right.  (OTOH, something somewhere is
counted wrong, as anyone who’s used ‘guix gc -F…’ has seen; not sure
where the bug is!)

Thanks,
Ludo’.



* Re: Are 'guix gc' stats exaggerated?
From: Felix Lechner via Development of GNU Guix and the GNU System distribution. @ 2024-06-06 19:32 UTC (permalink / raw)
  To: Ludovic Courtès,
	Felix Lechner via Development of GNU Guix and the GNU System distribution.
  Cc: Efraim Flashner

Hi Ludo'

On Thu, Jun 06 2024, Ludovic Courtès wrote:

> Where does that 3:1 figure come from?

Efraim's experience, I believe.

> Where do you see that in the code?  After checking
> ‘removeUnusedLinks’, I think it counts space savings right.

Sorry, I didn't look at the code.  I was merely prompted to speculate by
the mentioning of hard links and inferred wrongly, it seems, that the
discrepancy was related---although in fairness I also doubted that a
fixed 3:1 ratio could be credibly explained by deduplication alone.

Also, I don't mean to appear critical.  Thanks to everyone for your hard
work on Guix!

Kind regards
Felix



* Re: Are 'guix gc' stats exaggerated?
From: Efraim Flashner @ 2024-06-09  9:19 UTC (permalink / raw)
  To: Felix Lechner
  Cc: Ludovic Courtès,
	Felix Lechner via Development of GNU Guix and the GNU System distribution.


On Thu, Jun 06, 2024 at 12:32:52PM -0700, Felix Lechner wrote:
> Hi Ludo'
> 
> On Thu, Jun 06 2024, Ludovic Courtès wrote:
> 
> > Where does that 3:1 figure come from?
> 
> Efraim's experience, I believe.

I've found that to be my experience, and posted two compsize outputs to
show where I got my numbers from.

> > Where do you see that in the code?  After checking
> > ‘removeUnusedLinks’, I think it counts space savings right.
> 
> Sorry, I didn't look at the code.  I was merely prompted to speculate by
> the mentioning of hard links and inferred wrongly, it seems, that the
> discrepancy was related---although in fairness I also doubted that a
> fixed 3:1 ratio could be credibly explained by deduplication alone.
> 
> Also, I don't mean to appear critical.  Thanks to everyone for your hard
> work on Guix!

Not having looked at the code either, I'll point out that running `guix
gc -C 10G` will clear 10G of items from the store, but will return
somewhere between 2G and 10G of real space for future use on the hard
drive.  Thinking across my various machines, this is the case on my
desktop and laptop, which use btrfs.  On my other machines, which use
ext4, I think the space cleared and the space I expect to have free do
actually match up, but I don't remember paying that much attention to
the numbers on those machines.

-- 
Efraim Flashner   <efraim@flashner.co.il>   רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted


* Re: Are 'guix gc' stats exaggerated?
From: Andreas Enge @ 2024-06-09  9:30 UTC (permalink / raw)
  To: Felix Lechner, Ludovic Courtès,
	Felix Lechner via Development of GNU Guix and the GNU System distribution.

On Sun, Jun 09, 2024 at 12:19:55PM +0300, Efraim Flashner wrote:
> In my not having looked at the code, I'll point out that running `guix
> gc -C 10G` will clear 10G of items from the store, but will return
> between 2-10G of real space for future use on the hard drive.  Thinking
> across my various machines, on my desktop and laptop using btrfs this is
> the case, but on my other machines using ext4 I think the space cleared
> and what I'm expecting to have free to use do actually match up, but I
> don't remember paying that much attention to the numbers previously on
> those machines.

In my experience on ext4 (also not backed by looking at the code), "guix gc"
always deletes substantially less than what I ask for. I always thought it
just counted hard linked files even when the link count does not go to 0
and the file is not actually deleted.

For instance, I have tried it just now:
$ df -h .
/dev/mapper/cryptroot  468G    427G   18G   97% /

$ guix gc -F 20G
guix gc: 2,931.84 MiB will be freed
...
deleted or invalidated more than 3074252800 bytes; stopping

$ df -h .
/dev/mapper/cryptroot  468G    427G   18G   96% /
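
The suspicion above (counting hard-linked files whose link count does
not drop to 0) can be illustrated with a sketch.  This is NOT the
daemon's code: both functions are hypothetical and only contrast a
naive estimate with one that looks at st_nlink.

```python
import os
import tempfile


def naive_freed_bytes(paths):
    """Sum the apparent sizes of all paths, ignoring link counts."""
    return sum(os.lstat(p).st_size for p in paths)


def link_aware_freed_bytes(paths):
    """Count an inode only if every one of its links is in `paths`."""
    links_seen = {}  # (device, inode) -> links about to be removed
    info = {}        # (device, inode) -> (total link count, size)
    for p in paths:
        st = os.lstat(p)
        key = (st.st_dev, st.st_ino)
        links_seen[key] = links_seen.get(key, 0) + 1
        info[key] = (st.st_nlink, st.st_size)
    return sum(size for key, (nlink, size) in info.items()
               if links_seen[key] == nlink)


with tempfile.TemporaryDirectory() as tmp:
    kept = os.path.join(tmp, "kept")
    doomed_link = os.path.join(tmp, "doomed-link")
    doomed_unique = os.path.join(tmp, "doomed-unique")

    with open(kept, "wb") as f:
        f.write(b"a" * 1000)
    os.link(kept, doomed_link)      # shares its inode with `kept`
    with open(doomed_unique, "wb") as f:
        f.write(b"b" * 500)

    to_delete = [doomed_link, doomed_unique]
    naive = naive_freed_bytes(to_delete)
    link_aware = link_aware_freed_bytes(to_delete)

print(naive, link_aware)
```

This prints "1500 500": the naive sum counts the 1000 bytes that stay
reachable through the surviving link.  The sketch uses apparent sizes
(st_size) rather than allocated blocks, to keep it short.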

Andreas




* Re: Are 'guix gc' stats exaggerated?
From: Ludovic Courtès @ 2024-06-17 11:24 UTC (permalink / raw)
  To: Andreas Enge
  Cc: Felix Lechner,
	Felix Lechner via Development of GNU Guix and the GNU System distribution.

Andreas Enge <andreas@enge.fr> writes:

> In my experience on ext4 (also not backed by looking at the code), "guix gc"
> always deletes substantially less than what I ask for. I always thought it
> just counted hard linked files even when the link count does not go to 0
> and the file is not actually deleted.

Yes, that’s also my experience.  I did look at the code several times; I
even thought commit 7033c7692ccbbbad8f7b9952015de071a5588e87 in 2020
would fix that estimate, but it didn’t.  I guess I’m bad at maths and
logic; we should give another look at that part of the code!

(Note that creation of sparse files will be another source of
discrepancy, though there will be few of them.)
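
A sketch of why sparse files matter here: their apparent size
(st_size) can exceed the space actually allocated (st_blocks, in
512-byte units on Linux).  It assumes a POSIX system; how much is
really allocated depends on the filesystem.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparse")
    with open(path, "wb") as f:
        # Extend the file to 1 MiB without writing any data.
        f.truncate(1024 * 1024)

    st = os.stat(path)
    apparent_bytes = st.st_size           # what a size-based sum would count
    allocated_bytes = st.st_blocks * 512  # what the filesystem really holds

print(apparent_bytes, allocated_bytes)
```

On filesystems that support holes the second number is typically much
smaller than the first, so any estimate based on st_size overstates the
space that deleting such a file gives back.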

Ludo’.



Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git
