unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
From: Mark H Weaver <mhw@netris.org>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 24937@debbugs.gnu.org
Subject: bug#24937: "deleting unused links" GC phase is too slow
Date: Tue, 13 Dec 2016 07:48:19 -0500	[thread overview]
Message-ID: <87wpf4yoz0.fsf@netris.org> (raw)
In-Reply-To: <87d1gwvgu0.fsf@gnu.org> ("Ludovic \=\?utf-8\?Q\?Court\=C3\=A8s\=22'\?\= \=\?utf-8\?Q\?s\?\= message of "Tue, 13 Dec 2016 01:00:07 +0100")

ludo@gnu.org (Ludovic Courtès) writes:

> I did some measurements with the attached program on chapters, which is
> a Xen VM with spinning disks underneath, similar to hydra.gnu.org.  It
> has 600k entries in /gnu/store/.links.

I just want to point out that 600k inodes use 150 megabytes of disk
space on ext4, which is small enough to fit in the cache, so the disk
I/O will not be multiplied for such a small test case.

> Here’s a comparison of the “optimal” mode (bulk stats after we’ve
> fetched all the dirents) vs. the “semi-interleaved” mode (doing bulk
> stats every 100,000 dirents):
>
> ludo@guix:~$ gcc -std=gnu99 -Wall links-traversal.c  -DMODE=3
> ludo@guix:~$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
> ludo@guix:~$ time ./a.out
> 603858 dir_entries, 157 seconds
> stat took 1 seconds
>
> real    2m38.508s
> user    0m0.324s
> sys     0m1.824s
> ludo@guix:~$ gcc -std=gnu99 -Wall links-traversal.c  -DMODE=2
> ludo@guix:~$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
> ludo@guix:~$ time ./a.out 
> 3852 dir_entries, 172 seconds (including stat)
>
> real    2m51.827s
> user    0m0.312s
> sys     0m1.808s
>
> Semi-interleaved is ~12% slower here (not sure how reproducible that is
> though).

This directory you're testing on is more than an order of magnitude
smaller than Hydra's when it's full.  Unlike in your test above, all of
the inodes in Hydra's store won't fit in the cache.

In my opinion, the reason Hydra performs so poorly is because efficiency
and scalability are apparently very low priorities in the design of the
software running on it.  Unfortunately, I feel that my advice in this
area is discarded more often than not.

>>>> Why not just use GNU sort?  It already exists, and does exactly what we
>>>> need.
>>>
>>> Does ‘sort’ manage to avoid reading whole files in memory?
>>
>> Yes, it does.  I monitored the 'sort' process when I first ran my
>> optimized pipeline.  It created about 10 files in /tmp, approximately 70
>> megabytes each as I recall, and then read them all concurrently while
>> writing the sorted output.
>>
>> My guess is that it reads a manageable chunk of the input, sorts it in
>> memory, and writes it to a temporary file.  I guess it repeats this
>> process, writing multiple temporary files, until the entire input is
>> consumed, and then reads all of those temporary files, merging them
>> together into the output stream.
>
> OK.  That seems to be that the comment above ‘sortlines’ in sort.c
> describes.

Also, see <https://en.wikipedia.org/wiki/External_sorting>.  This is a
well-studied problem with a long history.

>>>> If you object to using an external program for some reason, I would
>>>> prefer to re-implement a similar algorithm in the daemon.
>>>
>>> Yeah, I’d rather avoid serializing the list of file names/inode number
>>> pairs just to invoke ‘sort’ on that.

I'm fairly sure that the overhead of serializing the file names and
inode numbers is *far* less than the overhead you would add by iterating
over the inodes in multiple passes.

>> Sure, I agree that it would be better to avoid that, but IMO not at the
>> cost of using O(N) memory instead of O(1) memory, nor at the cost of
>> multiplying the amount of disk I/O by a non-trivial factor.
>
> Understood.
>
> sort.c in Coreutils is very big, and we surely don’t want to duplicate
> all that.  Yet, I’d rather not shell out to ‘sort’.

The "shell" would not be involved here at all, just the "sort" program.
I guess you dislike launching external processes?  Can you explain why?

Guix-daemon launches external processes for building derivations, so why
is using one for garbage collection a problem?  Emacs, a program that
you cite in your talks as having many qualities that we seek to emulate,
does not shy away from using external programs.

> Do you know how many entries are in .links on hydra.gnu.org?

"df -i /gnu" indicates that it currently has about 5.5M inodes, but
that's with only 29% of the disk in use.  A few days ago, when the disk
was full, assuming that the average file size is the same, it may have
had closer to 5.5M / 0.29 ~= 19M inodes, which is over 30 times as many
as used in your measurements above.  On ext4, which uses 256-byte
inodes, that's about 5 gigabytes of inodes.

      Mark

  reply	other threads:[~2016-12-13 12:49 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-13 17:41 bug#24937: "deleting unused links" GC phase is too slow Ludovic Courtès
2016-12-09 22:43 ` Ludovic Courtès
2016-12-11 13:46 ` Ludovic Courtès
2016-12-11 14:23   ` Mark H Weaver
2016-12-11 18:02     ` Ludovic Courtès
2016-12-11 19:27       ` Mark H Weaver
2016-12-13  0:00         ` Ludovic Courtès
2016-12-13 12:48           ` Mark H Weaver [this message]
2016-12-13 17:02             ` Ludovic Courtès
2016-12-13 17:18               ` Ricardo Wurmus
2020-04-16 13:26                 ` Ricardo Wurmus
2020-04-16 14:27                   ` Ricardo Wurmus
2020-04-17  8:16                     ` Ludovic Courtès
2020-04-17  8:28                       ` Ricardo Wurmus
2016-12-13  4:09         ` Mark H Weaver
2016-12-15  1:19           ` Mark H Weaver
2021-11-09 14:44 ` Ludovic Courtès
2021-11-09 15:00   ` Ludovic Courtès
2021-11-11 20:59   ` Maxim Cournoyer
2021-11-13 16:56     ` Ludovic Courtès
2021-11-13 21:37       ` bug#24937: [PATCH 1/2] tests: Factorize 'file=?' Ludovic Courtès
2021-11-13 21:37         ` bug#24937: [PATCH 2/2] daemon: Do not deduplicate files smaller than 4 KiB Ludovic Courtès
2021-11-16 13:54           ` bug#24937: "deleting unused links" GC phase is too slow Ludovic Courtès
2021-11-13 21:45       ` Ludovic Courtès
2021-11-22  2:30 ` John Kehayias via Bug reports for GNU Guix

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87wpf4yoz0.fsf@netris.org \
    --to=mhw@netris.org \
    --cc=24937@debbugs.gnu.org \
    --cc=ludo@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).