* guix gc takes long in "deleting unused links ..."
@ 2019-02-01  5:53 Björn Höfling
  2019-02-01 22:22 ` Caleb Ristvedt
  0 siblings, 1 reply; 6+ messages in thread
From: Björn Höfling @ 2019-02-01  5:53 UTC (permalink / raw)
To: guix-devel

When I do a guix gc -d <store-entry>, it is usually quite fast, until I hit:

$ guix gc -d /gnu/store/apnk0ibj6axl9f0x5qa7ixpfvqww77rv-ruby-contracts-0.16.0
finding garbage collector roots...
deleting `/gnu/store/apnk0ibj6axl9f0x5qa7ixpfvqww77rv-ruby-contracts-0.16.0'
deleting `/gnu/store/trash'
deleting unused links...

Why does that take so long? Or better: is it safe here to just hit
CTRL-C (and let the daemon work in the background, or whatever)?

Björn

^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: guix gc takes long in "deleting unused links ..."
  2019-02-01  5:53 guix gc takes long in "deleting unused links ..." Björn Höfling
@ 2019-02-01 22:22 ` Caleb Ristvedt
  2019-02-02  6:25   ` Björn Höfling
                     ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Caleb Ristvedt @ 2019-02-01 22:22 UTC (permalink / raw)
To: guix-devel

Björn Höfling <bjoern.hoefling@bjoernhoefling.de> writes:

> Why does that take so long?

Warning: technical overview follows.

It takes so long because after the garbage collection pass it does a
*full* pass over the /gnu/store/.links directory, which is huge: it
contains an entry for every unique file in the store (not just every
store item, but everything inside those items, recursively). The
individual work for each entry is low: just a readdir(), an lstat() to
see whether the link is still in use anywhere, and an unlink() if it
isn't. But my store, for example, has 998536 entries in there. I got
that number with a combination of ls and wc, and it took probably
around 4 minutes.

Ideally, the reference-counting approach to removing files would work
the same as in programming languages: as soon as a reference is
removed, check whether the reference count is now 0 (in our case 1,
since an entry would still exist in .links). In our case, we'd actually
have to check prior to losing the reference whether the count *would
become* 1, that is, whether it is currently 2. But unlike in
programming languages, we can't just "free a file" (more specifically,
an inode); we have to delete the last existing reference, in .links.
The only way to find that reference is by hashing the file prior to
deleting it, which could be quite expensive, but for any garbage
collection targeting a small subset of store items it would likely
still be much faster. A potential fix would be to augment the store
database with a table mapping store paths to hashes (hashes already get
computed when store items are registered).
Or we could switch between the full-pass and incremental approaches
based on characteristics of the request.

> Or better: Is it safe here to just hit CTRL-C (and let the daemon work
> in background, or whatever)?

I expect that CTRL-C at that point would cause the guix process to
terminate, closing its connection to the daemon. I don't believe the
daemon uses asynchronous I/O, so it wouldn't be affected until it tried
reading from or writing to that socket. So yes, doing that at that
point would probably work, but you may as well just start the command
in the background ("guix gc ... &") or put it in the background with
CTRL-Z followed by the 'bg' command.

- reepca
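The full-pass phase described above (one readdir()/lstat()/unlink() sweep over .links) can be sketched in Python. This is a hypothetical illustration of the idea, not the daemon's actual C++ code, and it assumes a flat .links directory:

```python
import os

def delete_unused_links(links_dir):
    """Scan a .links-style directory and unlink every entry whose
    link count has dropped to 1, i.e. the only remaining reference
    to the inode is the .links entry itself."""
    removed = 0
    with os.scandir(links_dir) as entries:   # one readdir() pass over the whole directory
        for entry in entries:
            st = entry.stat(follow_symlinks=False)  # the lstat() step
            if st.st_nlink == 1:             # no store file points at this inode any more
                os.unlink(entry.path)        # the unlink() step
                removed += 1
    return removed
```

Note that the cost scales with the total number of .links entries, regardless of how few store items the collection actually deleted; that is exactly why the "deleting unused links ..." phase dominates small collections.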
* Re: guix gc takes long in "deleting unused links ..."
  2019-02-01 22:22 ` Caleb Ristvedt
@ 2019-02-02  6:25   ` Björn Höfling
  2019-02-02 10:38   ` Giovanni Biscuolo
  2019-02-04 21:11   ` Ludovic Courtès
  2 siblings, 0 replies; 6+ messages in thread
From: Björn Höfling @ 2019-02-02  6:25 UTC (permalink / raw)
To: Caleb Ristvedt; +Cc: guix-devel

On Fri, 01 Feb 2019 16:22:21 -0600
Caleb Ristvedt <caleb.ristvedt@cune.org> wrote:

> Björn Höfling <bjoern.hoefling@bjoernhoefling.de> writes:
>
> > Why does that take so long?
>
> Warning: technical overview follows.

Thank you for that technical answer. Just yesterday I heard your name
in connection with the build daemon :-)

Björn
* Re: guix gc takes long in "deleting unused links ..."
  2019-02-01 22:22 ` Caleb Ristvedt
  2019-02-02  6:25   ` Björn Höfling
@ 2019-02-02 10:38   ` Giovanni Biscuolo
  2019-02-04 21:11   ` Ludovic Courtès
  2 siblings, 0 replies; 6+ messages in thread
From: Giovanni Biscuolo @ 2019-02-02 10:38 UTC (permalink / raw)
To: Caleb Ristvedt, guix-devel

[-- Attachment #1: Type: text/plain, Size: 954 bytes --]

Hi Caleb,

thank you very much for the technical details of the garbage
collection process!

Caleb Ristvedt <caleb.ristvedt@cune.org> writes:

> Björn Höfling <bjoern.hoefling@bjoernhoefling.de> writes:
>
>> Why does that take so long?
>
> Warning: technical overview follows.
>
> It takes so long because after the garbage collection pass it then does
> a *full* pass over the /gnu/store/.links directory. Which is huge.

[...]

>> Or better: Is it safe here to just hit CTRL-C (and let the daemon work
>> in background, or whatever)?
>
> I expect that CTRL-C at that point would cause the guix process to
> terminate, closing its connection to the daemon. I don't believe the
> daemon uses asynchronous I/O

[...]

IMHO both these questions/answers should start a brand-new "Guix FAQ"
texinfo document :-)

I'll try to bootstrap one if I can.

Thanks!
Giovanni

--
Giovanni Biscuolo

Xelera IT Infrastructures

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
* Re: guix gc takes long in "deleting unused links ..."
  2019-02-01 22:22 ` Caleb Ristvedt
  2019-02-02  6:25   ` Björn Höfling
  2019-02-02 10:38   ` Giovanni Biscuolo
@ 2019-02-04 21:11   ` Ludovic Courtès
  2019-02-06 21:32     ` Caleb Ristvedt
  2 siblings, 1 reply; 6+ messages in thread
From: Ludovic Courtès @ 2019-02-04 21:11 UTC (permalink / raw)
To: Caleb Ristvedt; +Cc: guix-devel

Hi!

Caleb Ristvedt <caleb.ristvedt@cune.org> skribis:

[...]

> Ideally, the reference-counting approach to removing files would work
> the same as in programming languages: as soon as a reference is removed,
> check whether the reference count is now 0 (in our case 1, since an
> entry would still exist in .links). In our case, we'd actually have to
> check prior to losing the reference whether the count *would become* 1,
> that is, whether it is currently 2. But unlike in programming languages,
> we can't just "free a file" (more specifically, an inode). We have to
> delete the last existing reference, in .links. The only way to find that
> is by hashing the file prior to deleting it, which could be quite
> expensive, but for any garbage collection targeting a small subset of
> store items it would likely still be much faster. A potential fix there
> would be to augment the store database with a table mapping store paths
> to hashes (hashes already get computed when store items are
> registered). Or we could switch between the full-pass and incremental
> approaches based on characteristics of the request.

Note that the database would need to contain hashes of individual
files, not just store items (it already contains hashes of store item
nars).

This issue was discussed a while back at
<https://issues.guix.info/issue/24937>. Back then we couldn’t agree on
a solution, but it’d be good to have your opinion with your fresh mind!

>> Or better: Is it safe here to just hit CTRL-C (and let the daemon work
>> in background, or whatever)?
> I expect that CTRL-C at that point would cause the guix process to
> terminate, closing its connection to the daemon. I don't believe the
> daemon uses asynchronous I/O, so it wouldn't be affected until it tried
> reading or writing from/to that socket.

The guix-daemon child that handles the session would immediately get
SIGHUP and terminate (I think), but that’s fine: it’s just that files
that could have been removed from .links will still be there.

Ludo’.
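The incremental scheme quoted above (when deleting a store file whose link count shows that only the .links entry would remain, hash it to find and remove that entry) might look like the following. This is a hypothetical Python sketch: it assumes links are named by a hex SHA-256 of the content, whereas Guix really uses a nix-base32-encoded SHA-256 and implements all of this inside the C++ daemon:

```python
import hashlib
import os

def delete_store_file(path, links_dir):
    """Delete one store file; if its only other hard link is the
    .links entry, remove that entry too so the inode is freed."""
    st = os.lstat(path)
    if st.st_nlink == 2:                  # remaining refs: this file + its .links entry
        with open(path, "rb") as f:       # hash the content to recover the link name
            digest = hashlib.sha256(f.read()).hexdigest()
        link = os.path.join(links_dir, digest)
        if os.path.exists(link):
            os.unlink(link)               # drop the last duplicate-tracking reference
    os.unlink(path)                       # now delete the store file itself
```

Here the cost is proportional to the size of the files actually deleted (they must be re-hashed), not to the total number of .links entries, which is what makes it attractive for small collections.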
* Re: guix gc takes long in "deleting unused links ..."
  2019-02-04 21:11 ` Ludovic Courtès
@ 2019-02-06 21:32   ` Caleb Ristvedt
  0 siblings, 0 replies; 6+ messages in thread
From: Caleb Ristvedt @ 2019-02-06 21:32 UTC (permalink / raw)
To: guix-devel, ludo

Ludovic Courtès <ludo@gnu.org> writes:

> Note that the database would need to contain hashes of individual files,
> not just store items (it already contains hashes of store item nars).

Indeed! Although last night I thought of a way that it would only need
to contain hashes of *unique* individual files, see below.

> This issue was discussed a while back at
> <https://issues.guix.info/issue/24937>. Back then we couldn’t agree on
> a solution, but it’d be good to have your opinion with your fresh mind!

The main thing that comes to mind is making the amount of time required
for deleting links scale with the number of things being deleted rather
than the number of "things" in total - O(m) instead of O(n), so to
speak. I actually hadn't even considered things like disk access
patterns.

In my mind, the ideal situation is like this: we get rid of .links, and
instead keep a mapping from hashes to inodes in the database.
Deduplication would then involve just creating a hardlink to the
corresponding inode. The link-deleting phase then becomes entirely
unnecessary, as when the last hardlink is deleted the refcount becomes
0 automatically. Unfortunately, this isn't possible, because AFAIK
there is no way to create a hardlink to an inode directly; it always
has to go through another hardlink. Presumably the necessary system
call doesn't exist because there would be permissions / validation
issues (if anyone happens to know of a function that does something
like this, I'd love to hear about it!). So the second-most-ideal
situation would, to me, be to keep a mapping from inodes to hashes in
the database.
Then, when it becomes known during garbage collection that a file is to
be deleted and the refcount for its inode is 2, the file's inode can be
obtained from stat(), from that the hash can be looked up, and from
that the corresponding link filename can be obtained and removed. After
that the inode->hash association can be removed from the database.

I think this is a reasonable approach, as such a table in the database
shouldn't take up much more disk space than .links does: 8 bytes for an
inode and 32 bytes for the hash (or 52 if we keep the hash in text
form), for a total of 40 (or 60) bytes per entry. Based on the numbers
from the linked discussion (10M entries), that's around 400MB or 600MB,
plus whatever extra space sqlite uses, kept on the disk. If that's
considered too high, we could store only the hashes of relatively large
files in the database and fall back to hashing at delete-time for the
others.

The main limitation is the lack of portability of inodes. That is, when
copying a store across filesystems, said table would need to be
updated. Also, it requires that everything in the store is on the same
filesystem, though this could be fixed by looking up the hash through
(inode, device number) pairs instead of just the inode. In that case it
would work to copy across filesystems, though I think it still wouldn't
work for copying across systems.

How does that sound?

> The guix-daemon child that handles the session would immediately get
> SIGHUP and terminate (I think), but that’s fine: it’s just that files
> that could have been removed from .links will still be there.

Turns out it's SIGPOLL, actually, but yep. There's a checkInterrupt()
that gets run before each attempt to delete a link, and that triggers
the exit.

- reepca
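The proposed (device, inode) -> hash table could be prototyped with SQLite along these lines. This is a hypothetical sketch, not the actual store database schema; the table and function names are invented:

```python
import os
import sqlite3

def open_link_db(path=":memory:"):
    """Open (or create) a database with an InodeHashes table keyed on
    (device, inode) pairs, so lookups survive multi-filesystem stores."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS InodeHashes (
                    dev  INTEGER NOT NULL,
                    ino  INTEGER NOT NULL,
                    hash TEXT    NOT NULL,
                    PRIMARY KEY (dev, ino))""")
    return db

def record_link(db, path, content_hash):
    """Remember the content hash for this file's inode (done at
    registration time, when the hash is computed anyway)."""
    st = os.lstat(path)
    db.execute("INSERT OR REPLACE INTO InodeHashes VALUES (?, ?, ?)",
               (st.st_dev, st.st_ino, content_hash))

def lookup_hash(db, path):
    """Recover the hash (and hence the .links name) without re-hashing."""
    st = os.lstat(path)
    row = db.execute("SELECT hash FROM InodeHashes WHERE dev = ? AND ino = ?",
                     (st.st_dev, st.st_ino)).fetchone()
    return row[0] if row else None

def forget_link(db, path):
    """Drop the association once the .links entry has been deleted."""
    st = os.lstat(path)
    db.execute("DELETE FROM InodeHashes WHERE dev = ? AND ino = ?",
               (st.st_dev, st.st_ino))
```

At delete time, when lstat() reports st_nlink == 2, lookup_hash() yields the .links filename without reading the file's contents, and forget_link() removes the row once the link is gone.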
end of thread, other threads: [~2019-02-06 21:33 UTC | newest]

Thread overview: 6+ messages
2019-02-01  5:53 guix gc takes long in "deleting unused links ..." Björn Höfling
2019-02-01 22:22 ` Caleb Ristvedt
2019-02-02  6:25   ` Björn Höfling
2019-02-02 10:38   ` Giovanni Biscuolo
2019-02-04 21:11   ` Ludovic Courtès
2019-02-06 21:32     ` Caleb Ristvedt
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox; see mirroring instructions for how to clone and
mirror all data and code used for this inbox, as well as URLs for
read-only IMAP folder(s) and NNTP newsgroup(s).