From: ludo@gnu.org (Ludovic Courtès)
To: 24937@debbugs.gnu.org
Subject: bug#24937: "deleting unused links" GC phase is too slow
Date: Fri, 09 Dec 2016 23:43:57 +0100
Message-ID: <87wpf867v6.fsf@gnu.org>
In-Reply-To: <87wpg7ffbm.fsf@gnu.org> ("Ludovic Courtès"'s message of "Sun, 13 Nov 2016 18:41:01 +0100")
ludo@gnu.org (Ludovic Courtès) wrote:
> ‘LocalStore::removeUnusedLinks’ traverses all the entries in
> /gnu/store/.links and calls lstat(2) on each one of them and checks
> ‘st_nlink’ to determine whether they can be deleted.
>
> There are two problems: lstat(2) can be slow on spinning disks as found
> on hydra.gnu.org, and the algorithm is proportional in the number of
> entries in /gnu/store/.links, which is a lot on hydra.gnu.org.
On Dec. 2, on guix-sysadmin@gnu.org, Mark described a change that
noticeably improved performance:
The idea is to read the entire /gnu/store/.links directory, sort the
entries by inode number, and then iterate over the entries by inode
number, calling 'lstat' on each one and deleting the ones with a link
count of 1.
The reason this is so much faster is because the inodes are stored on
disk in order of inode number, so this leads to a sequential access
pattern on disk instead of a random access pattern.
The difficulty is that the directory is too large to comfortably store
all of the entries in virtual memory. Instead, the entries should be
written to temporary files on disk, and then sorted using merge sort to
ensure sequential access patterns during sorting. Fortunately, this is
exactly what 'sort' does from GNU coreutils.
So, for now, I've implemented this as a pair of small C programs that is
used in a pipeline with GNU sort. The first program simply reads a
directory and writes lines of the form "<inode> <name>" to stdout.
(Unfortunately, "ls -i" calls stat on each entry, so it can't be used).
This is piped through 'sort -n' and then into another small C program
that reads these lines, calls 'lstat' on each one, and deletes the
non-directories with link count 1.
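To make the idea above concrete, here is a minimal Python sketch, not Mark's C programs. It assumes the entry list fits in memory, so it sorts in-process rather than through temporary files and GNU sort. The function names are hypothetical. It relies on os.scandir(), which on Unix takes the inode number from the directory entry itself, so listing does not stat each file.

```python
import os
import stat


def list_entries_by_inode(links_dir):
    """Return (inode, name) pairs for links_dir, sorted by inode number.

    On Unix, DirEntry.inode() comes from the directory entry itself,
    so no stat call is made while listing.
    """
    with os.scandir(links_dir) as it:
        entries = [(entry.inode(), entry.name) for entry in it]
    entries.sort()
    return entries


def remove_unused_links(links_dir):
    """Delete non-directories whose link count is 1; return their names.

    Entries are visited in inode order, which is the access pattern
    described above.
    """
    removed = []
    for _inode, name in list_entries_by_inode(links_dir):
        path = os.path.join(links_dir, name)
        st = os.lstat(path)
        if not stat.S_ISDIR(st.st_mode) and st.st_nlink == 1:
            os.unlink(path)
            removed.append(name)
    return removed
```

Calling remove_unused_links("/gnu/store/.links") would be the equivalent of the whole pipeline, under the assumption that memory is not a constraint.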
Regarding memory usage, I replied:
Really?
For each entry, we have to store roughly 70 bytes for the file name (or
52 if we consider only the basename), plus 8 bytes for the inode number;
let’s say 64 bytes.
If we have 10 M entries, that’s 700 MB (or 520 MB), which is a lot, but
maybe acceptable?
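The arithmetic behind these figures, as a small Python check. The per-entry sizes are the ones quoted above; the helper name is hypothetical. These are raw byte counts, so an actual in-memory representation would carry extra per-entry overhead.

```python
ENTRIES = 10_000_000  # 10 M directory entries, as in the estimate above


def total_megabytes(bytes_per_entry, entries=ENTRIES):
    """Total size in MB (10**6 bytes) for a given per-entry cost."""
    return entries * bytes_per_entry / 1_000_000


full_name_mb = total_megabytes(70)  # full file name
basename_mb = total_megabytes(52)   # basename only
inode_mb = total_megabytes(8)       # 8-byte inode number per entry

print(full_name_mb, basename_mb, inode_mb)  # prints: 700.0 520.0 80.0
```

The 700 MB and 520 MB figures count the names alone; storing the inode numbers as well adds about 80 MB to either one.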
At worst, we may still see an improvement if we proceed by batches: we
read 10000 directory entries (7 MB), sort them, and stat them, then read
the next 10000 entries. WDYT?
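A rough Python sketch of the batched variant suggested above, to show the shape of the loop. The function name and the default batch size parameter are assumptions for illustration, and whether batches of this size keep most of the benefit of a full sort is untested here.

```python
import os
import stat
from itertools import islice


def remove_unused_links_batched(links_dir, batch_size=10_000):
    """Process links_dir in batches; return the number of files deleted.

    Each batch of directory entries is sorted by inode number before
    being stat'ed, so memory use is bounded by batch_size entries.
    """
    removed = 0
    with os.scandir(links_dir) as it:
        while True:
            # Read at most batch_size entries; inode() needs no stat on Unix.
            batch = [(e.inode(), e.name) for e in islice(it, batch_size)]
            if not batch:
                break
            batch.sort()
            for _inode, name in batch:
                path = os.path.join(links_dir, name)
                try:
                    st = os.lstat(path)
                except FileNotFoundError:
                    # POSIX leaves it unspecified whether entries removed
                    # during iteration are still returned, so tolerate a
                    # file that has already vanished.
                    continue
                if not stat.S_ISDIR(st.st_mode) and st.st_nlink == 1:
                    os.unlink(path)
                    removed += 1
    return removed
```

With 70-byte names, a batch of 10000 entries is the roughly 7 MB mentioned above, and each batch is still visited in inode order.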
Ludo’.