* guix gc takes long in "deleting unused links ..."
@ 2019-02-01 5:53 Björn Höfling
2019-02-01 22:22 ` Caleb Ristvedt
0 siblings, 1 reply; 6+ messages in thread
From: Björn Höfling @ 2019-02-01 5:53 UTC (permalink / raw)
To: guix-devel
When I run guix gc -d <store-entry>, it is usually quite fast, until I
hit:
$ guix gc -d /gnu/store/apnk0ibj6axl9f0x5qa7ixpfvqww77rv-ruby-contracts-0.16.0
finding garbage collector roots...
deleting `/gnu/store/apnk0ibj6axl9f0x5qa7ixpfvqww77rv-ruby-contracts-0.16.0'
deleting `/gnu/store/trash'
deleting unused links...
Why does that take so long?
Or better: is it safe here to just hit CTRL-C (and let the daemon work
in the background, or whatever)?
Björn
* Re: guix gc takes long in "deleting unused links ..."
2019-02-01 5:53 guix gc takes long in "deleting unused links ..." Björn Höfling
@ 2019-02-01 22:22 ` Caleb Ristvedt
2019-02-02 6:25 ` Björn Höfling
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Caleb Ristvedt @ 2019-02-01 22:22 UTC (permalink / raw)
To: guix-devel
Björn Höfling <bjoern.hoefling@bjoernhoefling.de> writes:
> Why does that take so long?
Warning: technical overview follows.
It takes so long because, after the garbage-collection pass, it does a
*full* pass over the /gnu/store/.links directory. Which is huge: it
contains an entry for every unique file in the store (not just every
store item, but everything inside those items, recursively). The
per-entry work is low - just a readdir(), an lstat() to see whether the
link is still in use anywhere, and an unlink() if it isn't. But my
store, for example, has 998536 entries in there. I got that number with
a combination of ls and wc, and it took around 4 minutes to get it.
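In outline, that pass boils down to the following. (The daemon implements
it in C++; this Python sketch is only an illustration of the logic, and
relies on the deduplication invariant that a .links entry with a link
count of 1 is referenced by nothing else in the store.)

```python
import os

def delete_unused_links(links_dir="/gnu/store/.links"):
    """Sketch of the final GC phase: scan every entry in .links and
    unlink those whose content no longer appears anywhere in the store."""
    removed = 0
    with os.scandir(links_dir) as entries:          # readdir()
        for entry in entries:
            st = entry.stat(follow_symlinks=False)  # lstat()
            # st_nlink == 1 means the .links entry itself is the only
            # remaining hard link: no store file shares this content.
            if st.st_nlink == 1:
                os.unlink(entry.path)               # unlink()
                removed += 1
    return removed
```

Cheap per entry, but it runs over every entry regardless of how little
was actually collected.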
Ideally, the reference-counting approach to removing files would work
the same as in programming languages: as soon as a reference is removed,
check whether the reference count is now 0 (in our case 1, since an
entry would still exist in .links). In our case, we'd actually have to
check prior to losing the reference whether the count *would become* 1,
that is, whether it is currently 2. But unlike in programming languages,
we can't just "free a file" (more specifically, an inode). We have to
delete the last existing reference, in .links. The only way to find that
is by hashing the file prior to deleting it, which could be quite
expensive, but for any garbage collection targeting a small subset of
store items it would likely still be much faster. A potential fix there
would be to augment the store database with a table mapping store paths
to hashes (hashes already get computed when store items are
registered). Or we could switch between the full-pass and incremental
approaches based on characteristics of the request.
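A sketch of that incremental idea, assuming a hypothetical
`hash_for_path` lookup backed by such a path-to-hash table (nothing like
it exists in the daemon today):

```python
import os

def delete_store_file(path, links_dir, hash_for_path):
    """Incremental variant: before dropping a reference, check whether
    the link count *would become* 1 (i.e. is currently 2); if so, also
    drop the last reference, the entry in .links."""
    st = os.lstat(path)
    # Count of 2 means: this file plus its .links twin, nothing else.
    last_user = (st.st_nlink == 2)
    link_name = hash_for_path(path) if last_user else None
    os.unlink(path)
    if last_user:
        os.unlink(os.path.join(links_dir, link_name))
```

With this, link deletion costs O(files deleted) instead of a full scan
of .links.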
> Or better: is it safe here to just hit CTRL-C (and let the daemon work
> in the background, or whatever)?
I expect that CTRL-C at that point would cause the guix process to
terminate, closing its connection to the daemon. I don't believe the
daemon uses asynchronous I/O, so it wouldn't be affected until it tried
reading or writing from/to that socket. So yeah, if you do that at that
point it would probably work, but you may as well just start it in the
background in that case ("guix gc ... &") or put it in the background
with CTRL-Z followed by the 'bg' command.
- reepca
* Re: guix gc takes long in "deleting unused links ..."
2019-02-01 22:22 ` Caleb Ristvedt
@ 2019-02-02 6:25 ` Björn Höfling
2019-02-02 10:38 ` Giovanni Biscuolo
2019-02-04 21:11 ` Ludovic Courtès
2 siblings, 0 replies; 6+ messages in thread
From: Björn Höfling @ 2019-02-02 6:25 UTC (permalink / raw)
To: Caleb Ristvedt; +Cc: guix-devel
On Fri, 01 Feb 2019 16:22:21 -0600
Caleb Ristvedt <caleb.ristvedt@cune.org> wrote:
> Björn Höfling <bjoern.hoefling@bjoernhoefling.de> writes:
>
> > Why does that take soo long?
>
> Warning: technical overview follows.
Thank you for that technical answer; just yesterday I heard your name
mentioned in connection with the build daemon :-)
Björn
* Re: guix gc takes long in "deleting unused links ..."
2019-02-01 22:22 ` Caleb Ristvedt
2019-02-02 6:25 ` Björn Höfling
@ 2019-02-02 10:38 ` Giovanni Biscuolo
2019-02-04 21:11 ` Ludovic Courtès
2 siblings, 0 replies; 6+ messages in thread
From: Giovanni Biscuolo @ 2019-02-02 10:38 UTC (permalink / raw)
To: Caleb Ristvedt, guix-devel
Hi Caleb,
thank you very much for the technical details of the garbage-collection
process!
Caleb Ristvedt <caleb.ristvedt@cune.org> writes:
> Björn Höfling <bjoern.hoefling@bjoernhoefling.de> writes:
>
>> Why does that take so long?
>
> Warning: technical overview follows.
>
> It takes so long because after the garbage collection pass it then does
> a *full* pass over the /gnu/store/.links directory. Which is huge.
[...]
>> Or better: is it safe here to just hit CTRL-C (and let the daemon work
>> in the background, or whatever)?
>
> I expect that CTRL-C at that point would cause the guix process to
> terminate, closing its connection to the daemon. I don't believe the
> daemon uses asynchronous I/O
[...]
IMHO both these questions/answers should seed a brand-new "Guix FAQ"
texinfo document :-)
I'll try to bootstrap one if I can.
Thanks!
Giovanni
--
Giovanni Biscuolo
Xelera IT Infrastructures
* Re: guix gc takes long in "deleting unused links ..."
2019-02-01 22:22 ` Caleb Ristvedt
2019-02-02 6:25 ` Björn Höfling
2019-02-02 10:38 ` Giovanni Biscuolo
@ 2019-02-04 21:11 ` Ludovic Courtès
2019-02-06 21:32 ` Caleb Ristvedt
2 siblings, 1 reply; 6+ messages in thread
From: Ludovic Courtès @ 2019-02-04 21:11 UTC (permalink / raw)
To: Caleb Ristvedt; +Cc: guix-devel
Hi!
Caleb Ristvedt <caleb.ristvedt@cune.org> skribis:
[...]
> Ideally, the reference-counting approach to removing files would work
> the same as in programming languages: as soon as a reference is removed,
> check whether the reference count is now 0 (in our case 1, since an
> entry would still exist in .links). In our case, we'd actually have to
> check prior to losing the reference whether the count *would become* 1,
> that is, whether it is currently 2. But unlike in programming languages,
> we can't just "free a file" (more specifically, an inode). We have to
> delete the last existing reference, in .links. The only way to find that
> is by hashing the file prior to deleting it, which could be quite
> expensive, but for any garbage collection targeting a small subset of
> store items it would likely still be much faster. A potential fix there
> would be to augment the store database with a table mapping store paths
> to hashes (hashes already get computed when store items are
> registered). Or we could switch between the full-pass and incremental
> approaches based on characteristics of the request.
Note that the database would need to contain hashes of individual files,
not just store items (it already contains hashes of store item nars).
This issue was discussed a while back at
<https://issues.guix.info/issue/24937>. Back then we couldn’t agree on
a solution, but it’d be good to have your opinion with your fresh mind!
>> Or better: is it safe here to just hit CTRL-C (and let the daemon work
>> in the background, or whatever)?
>
> I expect that CTRL-C at that point would cause the guix process to
> terminate, closing its connection to the daemon. I don't believe the
> daemon uses asynchronous I/O, so it wouldn't be affected until it tried
> reading or writing from/to that socket.
The guix-daemon child that handles the session would immediately get
SIGHUP and terminate (I think), but that’s fine: it’s just that files
that could have been removed from .links will still be there.
Ludo’.
* Re: guix gc takes long in "deleting unused links ..."
2019-02-04 21:11 ` Ludovic Courtès
@ 2019-02-06 21:32 ` Caleb Ristvedt
0 siblings, 0 replies; 6+ messages in thread
From: Caleb Ristvedt @ 2019-02-06 21:32 UTC (permalink / raw)
To: guix-devel, ludo
Ludovic Courtès <ludo@gnu.org> writes:
> Note that the database would need to contain hashes of individual files,
> not just store items (it already contains hashes of store item nars).
Indeed! Although last night I thought of a way for it to need hashes
only of *unique* individual files; see below.
> This issue was discussed a while back at
> <https://issues.guix.info/issue/24937>. Back then we couldn’t agree on
> a solution, but it’d be good to have your opinion with your fresh mind!
The main thing that comes to mind is making the amount of time required
for deleting links scale with the number of things being deleted rather
than the number of "things" in total - O(m) instead of O(n), so to
speak. I actually hadn't even considered things like disk access
patterns.
In my mind, the ideal situation is like this: we get rid of .links, and
instead keep a mapping from hashes to inodes in the
database. Deduplication would then involve just creating a hardlink to
the corresponding inode. The link-deleting phase then becomes entirely
unnecessary, as when the last hardlink is deleted the refcount becomes 0
automatically. Unfortunately, this isn't possible, because AFAIK there
is no way to create a hardlink to an inode directly; it always has to go
through another hardlink. Presumably the necessary system call doesn't
exist because there would be permissions / validation issues (if anyone
happens to know of a function that does something like this, I'd love to
hear about it!).
So the second-most-ideal situation would, to me, be to keep a mapping
from inodes to hashes in the database. Then, when it becomes known
during garbage collection that a file is to be deleted and the refcount
for its inode is 2, the file's inode can be obtained from stat(), from
that the hash can be looked up, and from that the corresponding link
filename can be obtained and removed. After that the inode->hash
association can be removed from the database.
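A sketch of that delete path, assuming a hypothetical
`LinkHashes(inode, hash)` table in the store database (the table name
and layout here are made up for illustration):

```python
import os
import sqlite3

def delete_with_inode_table(path, links_dir, db):
    """When a file slated for deletion has a link count of 2, its only
    other reference is the .links entry; look up the entry's name (the
    content hash) by inode instead of re-hashing the file."""
    st = os.lstat(path)
    if st.st_nlink == 2:
        row = db.execute("SELECT hash FROM LinkHashes WHERE inode = ?",
                         (st.st_ino,)).fetchone()
        if row:
            os.unlink(os.path.join(links_dir, row[0]))   # last reference
            db.execute("DELETE FROM LinkHashes WHERE inode = ?",
                       (st.st_ino,))
    os.unlink(path)
```

The point of the inode key is that it is exactly what lstat() already
gives the GC for free at delete time.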
I think this is a reasonable approach, as such a table in the database
shouldn't take up much more disk space than .links does: 8 bytes for an
inode plus 32 bytes for the hash (or 52 if we keep the hash in text
form), for a total of 40 or 60 bytes per entry. Based on the numbers
from the linked discussion (10M entries), that's around 400MB or 600MB,
plus whatever extra space sqlite uses, kept on disk. If that's
considered too high, we could store only the hashes of relatively large
files in the database and fall back to hashing at delete time for the
others.
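The back-of-the-envelope sizing works out like this:

```python
entries = 10_000_000              # .links entries, per the linked discussion
binary = entries * (8 + 32)      # inode + binary hash, bytes
text = entries * (8 + 52)        # inode + text-form hash, bytes
print(binary // 10**6, "MB /", text // 10**6, "MB")  # 400 MB / 600 MB
```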
The main limitation is the lack of portability of inodes. That is, when
copying a store across filesystems, said table would need to be
updated. Also, it requires that everything in the store is on the same
filesystem, though this could be fixed by looking up the hash through
(inode, device number) pairs instead of just inode. In that case, it
would work to copy across filesystems, though I think it still wouldn't
work for copying across systems.
How does that sound?
> The guix-daemon child that handles the session would immediately get
> SIGHUP and terminate (I think), but that’s fine: it’s just that files
> that could have been removed from .links will still be there.
Turns out it's SIGPOLL, actually, but yep. There's a checkInterrupt()
that gets run before each attempt to delete a link, and that triggers
the exit.
- reepca