* bug#51787: Disk performance on ci.guix.gnu.org
From: Leo Famulari @ 2021-12-21 17:51 UTC
To: Ricardo Wurmus; +Cc: Mathieu Othacehe, 51787
On Tue, Dec 21, 2021 at 06:26:03PM +0100, Ricardo Wurmus wrote:
> We could take this opportunity to reformat /gnu with btrfs, which
> performs quite a bit more poorly than ext4 but would be immune to
> fragmentation. It’s not clear that fragmentation matters here. It
> could just be that the problem is exclusively caused by having these
> incredibly large, flat /gnu/store, /gnu/store/.links, and
> /gnu/store/trash directories.
My impression was that btrfs could also become fragmented. At least,
btrfs-progs includes a command for defragmenting. Or do I
misunderstand?
* bug#51787: Disk performance on ci.guix.gnu.org
From: Mathieu Othacehe @ 2021-12-21 18:23 UTC
To: Ricardo Wurmus; +Cc: 51787
Hey,
> Today we discovered a few more things and discussed them on IRC. Here’s
> a summary.
Nice summary :)
> We could take this opportunity to reformat /gnu with btrfs, which
> performs quite a bit more poorly than ext4 but would be immune to
> fragmentation. It’s not clear that fragmentation matters here. It
> could just be that the problem is exclusively caused by having these
> incredibly large, flat /gnu/store, /gnu/store/.links, and
> /gnu/store/trash directories.
>
> A possible alternative for this file system might also be XFS, which
> performs well when presented with unreasonably large directories.
>
> It may be a good idea to come up with realistic test scenarios that we
> could test with each of these three file systems at scale.
We could compare xfs, btrfs, and ext4 performance on a store subset,
say 1 TiB, that we would create on the SAN. A realistic test scenario
could be:
- Time the copy of new items to the test store.
- Time the removal of randomly picked items from the test store.
- Time the creation of nar archives from the test store.
That would allow us to choose the file system with the best performance
for our use case, regardless of fragmentation.
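To make this concrete, here is a rough sketch of what such a test
script could look like. It is only a sketch under assumptions of mine:
the candidate file system is mounted at /mnt_test/store, the sample
items come from the live /gnu/store, the item counts are arbitrary, and
tar is used as a crude stand-in for nar creation.

--8<---------------cut here---------------start------------->8---
#!/bin/bash
# Timing sketch for a candidate file system mounted at /mnt_test/store.
set -e

TEST_STORE=/mnt_test/store
SAMPLE=$(ls /gnu/store | shuf -n 100)

# 1. Time the copy of new items to the test store.
time for item in $SAMPLE; do
    cp -a "/gnu/store/$item" "$TEST_STORE/"
done

# 2. Time the removal of randomly picked items from the test store.
time for item in $(echo "$SAMPLE" | shuf -n 50); do
    rm -rf "$TEST_STORE/$item"
done

# 3. Time archiving the remaining items, as a stand-in for nar creation.
time for item in $(ls "$TEST_STORE"); do
    tar -cf /dev/null -C "$TEST_STORE" "$item"
done
--8<---------------cut here---------------end--------------->8---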
Now, fragmentation may or may not be a problem, as you mentioned. What
we could do is repeat the same tests on a test store that has been
created and removed N times, to simulate file-system aging.
This is more or less what is done in this article[1], by "git pulling"
a repository N times and testing read performance. For them, btrfs >
xfs > ext4 in terms of performance, but we might draw different
conclusions for our specific use case.
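Here too, only a sketch with arbitrary paths and counts of how the
churn could be done before re-running the timing tests:

--8<---------------cut here---------------start------------->8---
#!/bin/bash
# Churn the test store N times before benchmarking, so that the
# measurements are taken on an aged, potentially fragmented file system.
set -e

TEST_STORE=/mnt_test/store
N=20

for n in $(seq 1 $N); do
    # Fill the test store with a fresh random sample of store items...
    for item in $(ls /gnu/store | shuf -n 500); do
        cp -a "/gnu/store/$item" "$TEST_STORE/" 2>/dev/null || true
    done
    # ...then delete most of it again to punch holes in the free space.
    for item in $(ls "$TEST_STORE" | shuf -n 400); do
        rm -rf "$TEST_STORE/$item"
    done
done

# Then run the timing tests from the previous sketch on the aged store.
--8<---------------cut here---------------end--------------->8---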
Do you think it is realistic? If so, we can start working on some test
scripts.
Thanks,
Mathieu
[1]: https://www.usenix.org/system/files/hotstorage19-paper-conway.pdf
* bug#51787: Disk performance on ci.guix.gnu.org
From: Bengt Richter @ 2021-12-21 23:20 UTC
To: Ricardo Wurmus; +Cc: Mathieu Othacehe, 51787
Hi Ricardo,
TL;DR: re: "Any ideas?" :)
Read this [0], and consider how file systems may be
interacting with SSD wear-leveling algorithms.
Are some file systems dependent on successful speculative
transaction continuations, while others might slow down
waiting for signs that an SSD controller has committed one
of ITS transactions, e.g. in special cases where the user or
kernel file system wants to be sure metadata is
written/journaled for fs structural integrity, but maybe
cares less about data?
I guess this difference might show up in copying a large
file over-writing the same target file (slower) vs copying
to a series of new files (faster).
What happens if you use a contiguous file as swap space?
Or, if you use anonymous files as user data space buffers,
passing them to wayland as file handles, per its protocol,
can you do better than ignoring SSD controllers and/or
storage hardware altogether?
Reference [0] is from 2013, so much has probably happened
since then, but the paper mentions the following (which has
probably not gotten better), referring to trade secrets that
give one manufacturer the ability to produce longer-lasting
SSDs more cheaply and better than others ...
--8<---------------cut here---------------start------------->8---
This means that the SSD controller is dedicated to a
single brand of NAND, and it means that the SSD maker
can’t shop around among NAND suppliers for the best price.
Furthermore, the NAND supplier won’t share this
information unless it believes that there is some compelling
reason to work with the SSD manufacturer. Since there are
hundreds of SSD makers it’s really difficult to get these
companies to pay attention to you! The SSD manufacturers
that have this kind of relationship with their flash
suppliers are very rare and very special.
--8<---------------cut here---------------end--------------->8---
Well, maybe you will have to parameterize your file system
tuning with manufacturer ID and SSD controller firmware
version ;/
Mvh, Bengt
[0] https://www.snia.org/sites/default/files/SSSITECHNOTES_HowControllersMaximizeSSDLife.pdf
On +2021-12-21 18:26:03 +0100, Ricardo Wurmus wrote:
> Today we discovered a few more things and discussed them on IRC. Here’s
> a summary.
>
> /var/cache sits on the same storage as /gnu. We mounted the 5TB ext4
> file system that’s hosted by the SAN at /mnt_test and started copying
> over /var/cache to /mnt_test/var/cache. Transfer speed was considerably
> faster (not *great*, but reasonably fast) than the copy of
> /gnu/store/trash to the same target.
>
> This confirmed our suspicions that the problem is not with the storage
> array but due to the fact that /gnu/store/trash (and also /gnu/store)
> is an extremely large, flat directory. /var/cache is not.
>
> Here’s what we do now: continue copying /var/cache to the SAN, then
> remount to serve substitutes from there. This removes some pressure
> from the file system as it will only be used for /gnu. We’re
> considering to dump the file system completely (i.e. reinstall the
> server), thereby emptying /gnu, but leaving the stash of built
> substitutes in /var/cache (hosted from the faster SAN).
>
> We could take this opportunity to reformat /gnu with btrfs, which
> performs quite a bit more poorly than ext4 but would be immune to
> fragmentation. It’s not clear that fragmentation matters here. It
> could just be that the problem is exclusively caused by having these
> incredibly large, flat /gnu/store, /gnu/store/.links, and
> /gnu/store/trash directories.
>
> A possible alternative for this file system might also be XFS, which
> performs well when presented with unreasonably large directories.
>
> It may be a good idea to come up with realistic test scenarios that we
> could test with each of these three file systems at scale.
>
> Any ideas?
>
> --
> Ricardo
>
>
>
(sorry, the top-post grew)
--
Regards,
Bengt Richter
* bug#51787: Disk performance on ci.guix.gnu.org
From: Thiago Jung Bauermann via Bug reports for GNU Guix @ 2021-12-22 0:27 UTC
To: 51787; +Cc: Ricardo Wurmus, Mathieu Othacehe, Bengt Richter, Leo Famulari
Hello,
Ricardo Wurmus <rekado@elephly.net> writes:
> Today we discovered a few more things and discussed them on IRC. Here’s
> a summary.
>
> /var/cache sits on the same storage as /gnu. We mounted the 5TB ext4
> file system that’s hosted by the SAN at /mnt_test and started copying
> over /var/cache to /mnt_test/var/cache. Transfer speed was considerably
> faster (not *great*, but reasonably fast) than the copy of
> /gnu/store/trash to the same target.
>
> This confirmed our suspicions that the problem is not with the storage
> array but due to the fact that /gnu/store/trash (and also /gnu/store)
> is an extremely large, flat directory. /var/cache is not.
There was an interesting thread in the Linux kernel mailing lists about this
very issue earlier this year:
https://lore.kernel.org/linux-fsdevel/206078.1621264018@warthog.procyon.org.uk/
I’m not sure I completely understood all of the concerns discussed there, but
my understanding is that for workloads which don’t concurrently modify
the huge directory, its size isn’t a problem for btrfs and XFS, and in fact
it’s even more efficient to have one big directory rather than
subdirectories¹. It should also be handled well even by ext4, IIUC².
The problem for all filesystems is concurrently modifying the directory
(e.g., adding or removing files), because the kernel serializes directory
operations at the VFS layer.
In that case, XFS can also have allocation issues when adding new files
if one isn’t careful.³
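Just to illustrate the serialization point, here is a sketch of how one
could observe it; the paths and counts are arbitrary, not something I
have run on berlin:

--8<---------------cut here---------------start------------->8---
#!/bin/bash
# Compare parallel file creation in a single flat directory vs. in
# per-worker subdirectories, to see the effect of the VFS-level
# serialization of directory modifications.
set -e

FLAT=/mnt_test/flat
SHARDED=/mnt_test/sharded
mkdir -p "$FLAT" "$SHARDED"

# Flat: 8 workers all create files in the same directory.
time seq 1 8 | xargs -P 8 -I{} bash -c \
    'for i in $(seq 1 10000); do : > '"$FLAT"'/worker{}-$i; done'

# Sharded: each worker creates its files in its own subdirectory.
time seq 1 8 | xargs -P 8 -I{} bash -c \
    'mkdir -p '"$SHARDED"'/{}; for i in $(seq 1 10000); do : > '"$SHARDED"'/{}/$i; done'
--8<---------------cut here---------------end--------------->8---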
--
Thanks
Thiago
¹ https://lore.kernel.org/linux-fsdevel/20210517232237.GE2893@dread.disaster.area/
² https://lore.kernel.org/linux-fsdevel/6E4DE257-4220-4B5B-B3D0-B67C7BC69BB5@dilger.ca/
³ https://lore.kernel.org/linux-fsdevel/20210519125743.GP2893@dread.disaster.area/
* bug#51787: Disk performance on ci.guix.gnu.org
From: Ricardo Wurmus @ 2021-12-25 22:19 UTC
To: Mathieu Othacehe; +Cc: 51787
Ricardo Wurmus <rekado@elephly.net> writes:
> Today we discovered a few more things and discussed them on IRC. Here’s
> a summary.
>
> /var/cache sits on the same storage as /gnu. We mounted the 5TB ext4
> file system that’s hosted by the SAN at /mnt_test and started copying
> over /var/cache to /mnt_test/var/cache. Transfer speed was considerably
> faster (not *great*, but reasonably fast) than the copy of
> /gnu/store/trash to the same target.
Turns out that space on the SAN is insufficient for a full copy of
/var/cache. We’ve hit ENOSPC after 4.2TB. The SAN enforces some
headroom to remain free, so it denies us full access to the 5TB slice.
Bummer.
I guess we’ll have to wait for the SAN extension, sometime in early
2022, before we can relocate the substitute cache.
Should we attempt to overwrite /gnu/store and rely exclusively on
substitutes from the cache?
No matter how we look at it, the huge store is a performance problem for
us. Today I had to kill ’guix gc’ after the GC lock had been held for
about 24 hours. We will keep having this problem.
--
Ricardo
* bug#51787: Disk performance on ci.guix.gnu.org
From: Mathieu Othacehe @ 2021-12-26 8:53 UTC
To: Ricardo Wurmus; +Cc: 51787
Hello Ricardo,
> Should we attempt to overwrite /gnu/store and rely exclusively on
> substitutes from the cache?
Yes, I don't see any other option. Before that, what might be nice
would be:
1. Ensure that all of Berlin's /var/cache/guix/publish directory is
synchronized to Bordeaux (see the rsync sketch below). We are now at
117G out of X. We could then start a publish server on Bordeaux. As
Bordeaux is already part of the default substitute server list, the
transition could be smooth, I guess.
2. Determine which file system out of ext4, btrfs, and xfs would be the
most suitable for Berlin's /gnu/store. I'm running some tests on an old
HDD to try to determine the impact of fragmentation on those file
systems. We can of course choose to be conservative and stick with
ext4, which has done the job until now.
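For item 1, I imagine something along these lines; the host name,
destination path, and options are guesses on my part, not a tested
invocation:

--8<---------------cut here---------------start------------->8---
# On Berlin: push the substitute cache to Bordeaux (hypothetical host
# name and destination path); -a/-H preserve attributes and hard links,
# --partial lets an interrupted transfer resume.
rsync -aH --partial --info=progress2 \
    /var/cache/guix/publish/ \
    bordeaux.guix.gnu.org:/var/cache/guix/publish/

# On Bordeaux: serve the synchronized cache; exact invocation to be
# confirmed, something like:
guix publish --port=8080 --cache=/var/cache/guix/publish
--8<---------------cut here---------------end--------------->8---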
Regarding the /gnu/store re-creation, I wonder how we can do it without
completely reinstalling Berlin. Maybe we could save the system store
closure somewhere and restore it on the shiny new file system?
Thanks,
Mathieu
* bug#51787: Disk performance on ci.guix.gnu.org
From: Ricardo Wurmus @ 2021-12-30 10:44 UTC
To: Mathieu Othacehe; +Cc: 51787
Mathieu Othacehe <othacehe@gnu.org> writes:
> Hello Ricardo,
>
>> Should we attempt to overwrite /gnu/store and rely exclusively on
>> substitutes from the cache?
>
> Yes, I don't see any other options. Before that, what might be nice
> could be:
>
> 1. Ensure that all of Berlin's /var/cache/guix/publish directory is
> synchronized to Bordeaux. We are now at 117G out of X. We could then
> start a publish server on Bordeaux. As Bordeaux is already part of the
> default substitute server list, the transition could be smooth, I guess.
I had the SAN slice extended from 5TB to 10TB. This is now also full
(at 9.2TB due to SAN configuration). I suggest doing the rsync to
Bordeaux from /mnt_test/var/cache/guix/publish instead of the much
slower /var/cache/guix/publish. It doesn’t hold *all* files, but 9+TB
should be enough to fuel the transfer to Bordeaux for a while.
> Regarding the /gnu/store re-creation, I wonder how can we do it without
> reinstalling completely Berlin. Maybe we could save the system store
> closure somewhere and restore it on the shining new file-system?
I don’t know. I would want to take a copy of the root file system as a
backup of state (like the Let’s Encrypt certs), and copy the closure of
the current operating system configuration somewhere. We could copy it
to a dedicated build node (after stopping the GC cron job) and set it up
as an internal substitute server. Then run “guix system init” while
fetching the substitutes from that server.
But I guess we’d have to boot the installer image anyway so that we can
safely erase /gnu/store, or else we’d erase files that are currently in
use.
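Roughly, the steps I have in mind; the host name, paths, and exact
options below are assumptions of mine and would need to be checked:

--8<---------------cut here---------------start------------->8---
# On berlin: build the current system closure and copy it to a build
# node (hypothetical name) that will act as internal substitute server.
system=$(guix system build /etc/config.scm)
guix copy --to=build-node.guix "$system"

# On the build node: serve its store over HTTP (its signing key must be
# authorized on the machine doing the reinstall).
guix publish --port=8080

# From the installer image booted on berlin, with the new file systems
# created and mounted under /mnt:
guix system init \
    --substitute-urls="http://build-node.guix:8080" \
    /mnt/etc/config.scm /mnt
--8<---------------cut here---------------end--------------->8---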
--
Ricardo