all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
* bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock
@ 2018-06-11 14:06 Ludovic Courtès
  2019-09-06  9:04 ` Ludovic Courtès
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Ludovic Courtès @ 2018-06-11 14:06 UTC (permalink / raw)
  To: 31785

Hello,

I tried running this:

  guix build --max-jobs=200 $(guix gc -R $(guix build -d inkscape --no-grafts) | sort) & \
  guix build --max-jobs=200 $(guix gc -R $(guix build -d inkscape --no-grafts) | sort -r)

… also in parallel with this (for good measure):

  guix build --max-jobs=200 $(guix gc -R $(guix build -d inkscape --no-grafts) | sort -R)

Since we have 3 clients, that leads to 3 guix-daemon processes, and
those are stuck in a deadlock:

--8<---------------cut here---------------start------------->8---
$ ps aux | grep guix-daemon
[…]
root     20501  0.6  0.0  39292 12012 ?        Ss   15:51   0:01 guix-daemon 20455
root     20503  4.3  0.0  39420 12096 ?        Ss   15:51   0:12 guix-daemon 20491
root     22154  0.0  0.0  39028 11016 ?        Ss   15:52   0:00 guix-daemon 22148
$ sudo strace -p 22154 -t
strace: Process 22154 attached
15:57:03 select(0, 0x7ffdca30b610, NULL, NULL, {tv_sec=1, tv_usec=314189}) = 0 (Timeout)
15:57:04 openat(AT_FDCWD, "/gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binutils-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-coreutils-8.29-debug.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glibc-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnutls-3.5.18-doc.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-groff-1.22.3-doc.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libstdc++-5.5.0.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-compiled.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) = 8
^C
$ sudo strace -p 20503 -t
strace: Process 20503 attached
15:58:23 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec=1, tv_usec=595302}) = 0 (Timeout)
15:58:25 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 14
15:58:25 fcntl(14, F_GETFD)             = 0
15:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
15:58:25 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:58:25 close(14)                      = 0
15:58:25 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-compiled.lock", O_RDWR|O_CREAT, 0600) = 14
15:58:25 fcntl(14, F_GETFD)             = 0
15:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
15:58:25 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:58:25 close(14)                      = 0
15:58:25 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) = 14
15:58:25 fcntl(14, F_GETFD)             = 0
15:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
15:58:25 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:58:25 close(14)                      = 0
15:58:25 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
15:58:30 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 14
15:58:30 fcntl(14, F_GETFD)             = 0
15:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
15:58:30 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:58:30 close(14)                      = 0
15:58:30 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-compiled.lock", O_RDWR|O_CREAT, 0600) = 14
15:58:30 fcntl(14, F_GETFD)             = 0
15:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
15:58:30 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:58:30 close(14)                      = 0
15:58:30 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) = 14
15:58:30 fcntl(14, F_GETFD)             = 0
15:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
15:58:30 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:58:30 close(14)                      = 0
15:58:30 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec=5, tv_usec=0}^Cstrace: Process 20503 detached
$ sudo strace -p 20501 -t
strace: Process 20501 attached
15:59:09 select(121, [20 30 120], NULL, NULL, {tv_sec=0, tv_usec=535316}) = 0 (Timeout)
15:59:10 openat(AT_FDCWD, "/gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binutils-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-coreutils-8.29-debug.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glibc-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnutls-3.5.18-doc.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-groff-1.22.3-doc.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libstdc++-5.5.0.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 select(121, [20 30 120], NULL, NULL, {tv_sec=5, tv_usec=0}^Cstrace: Process 20501 detached
$ sudo lsof | grep '/gnu/store/.*\.lock'
guix-daem 20501             root   14uW     REG              253,0          0    3150879 /gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-compiled.lock
guix-daem 20501             root   26uW     REG              253,0          0    3212618 /gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-doc.lock
guix-daem 20501             root  116uW     REG              253,0          0    3212684 /gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock
guix-daem 20503             root    8uW     REG              253,0          0    3212520 /gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-groff-1.22.3-doc.lock
guix-daem 20503             root   50uW     REG              253,0          0    3212514 /gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binutils-bootstrap-0.lock
guix-daem 20503             root  155uW     REG              253,0          0    3212547 /gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glibc-bootstrap-0.lock
guix-daem 20503             root  173uW     REG              253,0          0    3212552 /gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnutls-3.5.18-doc.lock
guix-daem 20503             root  317uW     REG              253,0          0    3212602 /gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libstdc++-5.5.0.lock
guix-daem 20503             root  407uW     REG              253,0          0    3212655 /gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-coreutils-8.29-debug.lock
guix-daem 20503             root  410uW     REG              253,0          0    3212659 /gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock
guix-daem 20503             root  440uW     REG              253,0          0    3212686 /gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1.4.18.lock
--8<---------------cut here---------------end--------------->8---

So for example:

  • 20501 tries to acquire
    aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock,
    which is held by 20503;

  • 20503 tries to acquire
    amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock, which is held by
    20501.

This comes from the fact that ‘LocalStore::buildPaths’ takes the
user-supplied derivation list as is, without sorting it, and then
acquires locks in that order in ‘Worker::run’.

A topological sort (or maybe an alphanumeric sort?) should allow us to
guarantee that guix-daemon processes take locks in the same order, and
then don’t end up in a deadlock.

I discovered this bug while monitoring Cuirass on berlin: several
sessions submit batches of 200 derivations in ‘build-paths’ RPCs, and
sometimes most of the corresponding guix-daemon processes would end up
being stuck in a lock-acquiring loop.

Ludo’.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock
  2018-06-11 14:06 bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock Ludovic Courtès
@ 2019-09-06  9:04 ` Ludovic Courtès
  2020-11-04 15:21 ` Ludovic Courtès
  2024-12-21 16:22 ` Ludovic Courtès
  2 siblings, 0 replies; 6+ messages in thread
From: Ludovic Courtès @ 2019-09-06  9:04 UTC (permalink / raw)
  To: 31785

Here’s another example from berlin today:

--8<---------------cut here---------------start------------->8---
ludo@berlin$ sudo guix processes
SessionPID: 38649
ClientPID: 38611
ClientCommand: /gnu/store/7x9lrzs2l00mfcr3ya7hg1i06i7lq9pd-profile/bin/guile \ /home/mbakke/guix/master/scripts/guix build --no-grafts --keep-going --max-silent-time=36000 --timeout=216000 -s armhf-linux nmoldyn r-msnid r-yamss netcdf-fortran python-netcdf4
LockHeld: /gnu/store/v6hcm1gvv3gdfbnq6cd4kn1i6ip3y74s-netcdf-fortran-4.4.4.lock
LockHeld: /gnu/store/ydxl0ws9kk6ix58m99ib7qhi157yq49h-r-msnid-1.18.1.lock
ChildProcess: 33181: /gnu/store/9alic3caqhay3h8mx4iihpmyj6ymqpcx-guile-2.2.4/bin/guile \ /gnu/store/vyhfp0gj4rx74yw0ybalzj8bvmqgp9pr-guix-1.0.1-4.c902458/bin/.guix-real offload x86_64-linux 36000 1 216000

[...]

SessionPID: 44043
ClientPID: 44023
ClientCommand: /gnu/store/7x9lrzs2l00mfcr3ya7hg1i06i7lq9pd-profile/bin/guile \ /home/mbakke/guix/master/scripts/guix build --no-grafts --keep-going --max-silent-time=36000 --timeout=216000 -s armhf-linux netcdf-fortran r-msnid nmoldyn r-yamss
LockHeld: /gnu/store/69jjabypcpx83482rh3yp4a9ncjimw6b-r-yamss-1.10.0.lock
LockHeld: /gnu/store/ibdbwxb1i8yal8ilqxxgnzh6bxpinic1-nmoldyn-3.0.11.lock
ChildProcess: 48997: /gnu/store/9alic3caqhay3h8mx4iihpmyj6ymqpcx-guile-2.2.4/bin/guile \ /gnu/store/vyhfp0gj4rx74yw0ybalzj8bvmqgp9pr-guix-1.0.1-4.c902458/bin/.guix-real offload x86_64-linux 36000 1 216000

ludo@berlin ~$ sudo strace -p 38649 -e openat,fcntl
strace: Process 38649 attached
openat(AT_FDCWD, "/gnu/store/ibdbwxb1i8yal8ilqxxgnzh6bxpinic1-nmoldyn-3.0.11.lock", O_RDWR|O_CREAT, 0600) = 10
fcntl(10, F_GETFD)                      = 0
fcntl(10, F_SETFD, FD_CLOEXEC)          = 0
fcntl(10, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Resource temporarily unavailable)
openat(AT_FDCWD, "/gnu/store/69jjabypcpx83482rh3yp4a9ncjimw6b-r-yamss-1.10.0.lock", O_RDWR|O_CREAT, 0600) = 10
fcntl(10, F_GETFD)                      = 0
fcntl(10, F_SETFD, FD_CLOEXEC)          = 0
fcntl(10, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Resource temporarily unavailable)
^Cstrace: Process 38649 detached

ludo@berlin ~$ sudo strace -p 44043 -e openat,fcntl
strace: Process 44043 attached
openat(AT_FDCWD, "/gnu/store/ydxl0ws9kk6ix58m99ib7qhi157yq49h-r-msnid-1.18.1.lock", O_RDWR|O_CREAT, 0600) = 10
fcntl(10, F_GETFD)                      = 0
fcntl(10, F_SETFD, FD_CLOEXEC)          = 0
fcntl(10, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Resource temporarily unavailable)
^Cstrace: Process 44043 detached
--8<---------------cut here---------------end--------------->8---

A very palpable deadlock.

Ludo’.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock
  2018-06-11 14:06 bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock Ludovic Courtès
  2019-09-06  9:04 ` Ludovic Courtès
@ 2020-11-04 15:21 ` Ludovic Courtès
  2024-12-21 16:22 ` Ludovic Courtès
  2 siblings, 0 replies; 6+ messages in thread
From: Ludovic Courtès @ 2020-11-04 15:21 UTC (permalink / raw)
  To: 31785; +Cc: Mathieu Othacehe

Hi,

ludo@gnu.org (Ludovic Courtès) skribis:

> This comes from the fact that ‘LocalStore::buildPaths’ takes the
> user-supplied derivation list as is, without sorting it, and then
> acquires locks in that order in ‘Worker::run’.

This diagnostic is incorrect: ‘Goals’ is a set sorted according to
‘CompareGoalPtrs’, which is lexical sort that arranges so substitution
goals come before derivation goals.  Thus, ‘_topGoals’ and ‘awake2’ in
Worker::run are sorted in a deterministic fashion.


The problem is that ‘Worker::waitForAWhile’ reshuffles the order of
goals by temporarily moving goals out of the way.  This can happen when
offloading replies “postpone”, which is inherently non-deterministic
(which goals are put to sleep will vary from one session to another
session.)

When those goals are eventually woken up from ‘Worker::waitForInput’,
they’re reprocessed, in sorted order, but potentially with “holes”
compared to other ‘guix-daemon’ processes.

That’s only a partial explanation; we need to go further to come up with
an actual deadlock scenario.

Ludo’.




^ permalink raw reply	[flat|nested] 6+ messages in thread

* bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock
  2018-06-11 14:06 bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock Ludovic Courtès
  2019-09-06  9:04 ` Ludovic Courtès
  2020-11-04 15:21 ` Ludovic Courtès
@ 2024-12-21 16:22 ` Ludovic Courtès
  2024-12-21 17:08   ` Ludovic Courtès
  2 siblings, 1 reply; 6+ messages in thread
From: Ludovic Courtès @ 2024-12-21 16:22 UTC (permalink / raw)
  To: 31785

ludo@gnu.org (Ludovic Courtès) skribis:

> I tried running this:
>
>   guix build --max-jobs=200 $(guix gc -R $(guix build -d inkscape --no-grafts) | sort) & \
>   guix build --max-jobs=200 $(guix gc -R $(guix build -d inkscape --no-grafts) | sort -r)
>
> … also in parallel with this (for good measure):
>
>   guix build --max-jobs=200 $(guix gc -R $(guix build -d inkscape --no-grafts) | sort -R)
>
> Since we have 3 clients, that leads to 3 guix-daemon processes, and
> those are stuck in a deadlock:

This strikes again: ‘cuirass remote-worker’ processes on berlin
occasionally end up deadlocking in the exact same way.

When running ‘current remote-worker --workers=4’, 4 sessions (4 clients)
are used, which can lead to that situation, as in this example:

--8<---------------cut here---------------start------------->8---
root@hydra-guix-126 ~# guix processes |guix shell recutils -- recsel -p 'SessionPID,ClientCommand,LockHeld'
SessionPID: 27250
ClientCommand: /gnu/store/mfkz7fvlfpv3ppwbkv0imb19nrf95akf-guile-3.0.9/bin/guile --no-auto-compile -e main -s /gnu/store/ll18sc406b5cqapmvz17v22gh4sryb24-cuirass-1.2.0-11.e96f088/bin/.cuirass-real remote-worker --user=cuirass-worker --workers=4 --systems=x86_64-linux,i686-linux --publish-port=5558 --substitute-urls=http://141.80.167.131

SessionPID: 27269
ClientCommand: /gnu/store/mfkz7fvlfpv3ppwbkv0imb19nrf95akf-guile-3.0.9/bin/guile --no-auto-compile -e main -s /gnu/store/ll18sc406b5cqapmvz17v22gh4sryb24-cuirass-1.2.0-11.e96f088/bin/.cuirass-real remote-worker --user=cuirass-worker --workers=4 --systems=x86_64-linux,i686-linux --publish-port=5558 --substitute-urls=http://141.80.167.131
LockHeld: /gnu/store/72s7500g3zg2p6fjdc1paazvm1w2xdr2-libva-2.19.0.lock
LockHeld: /gnu/store/0bbnhq7bagn6sbj2lmapmdiiw50v3dgz-rav1e-0.7.1.lock

SessionPID: 27308
ClientCommand: /gnu/store/mfkz7fvlfpv3ppwbkv0imb19nrf95akf-guile-3.0.9/bin/guile --no-auto-compile -e main -s /gnu/store/ll18sc406b5cqapmvz17v22gh4sryb24-cuirass-1.2.0-11.e96f088/bin/.cuirass-real remote-worker --user=cuirass-worker --workers=4 --systems=x86_64-linux,i686-linux --publish-port=5558 --substitute-urls=http://141.80.167.131
LockHeld: /gnu/store/zf5w9ypk8il0i9y22n81aamypr2qgsmm-dav1d-1.5.0.lock

SessionPID: 27345
ClientCommand: /gnu/store/mfkz7fvlfpv3ppwbkv0imb19nrf95akf-guile-3.0.9/bin/guile --no-auto-compile -e main -s /gnu/store/ll18sc406b5cqapmvz17v22gh4sryb24-cuirass-1.2.0-11.e96f088/bin/.cuirass-real remote-worker --user=cuirass-worker --workers=4 --systems=x86_64-linux,i686-linux --publish-port=5558 --substitute-urls=http://141.80.167.131
LockHeld: /gnu/store/0xbi2bgq34yyx2fqjjwpgdv4gkfyaf60-gst-plugins-bad-minimal-1.22.3.lock
LockHeld: /gnu/store/ij5igi5xrp4sx6c78nbvg24lb4ma2f4l-libcbor-0.11.0.lock
LockHeld: /gnu/store/czfvm14yy517vb8w2hpp46nyrdrymqyp-libfido2-1.12.0.lock
LockHeld: /gnu/store/1ldcq0p20nqy7d3mxdy4yra1ax5ik3xc-mpg123-1.31.2.lock
LockHeld: /gnu/store/sadbf1fmb0n9k754x5jbbdklcxbjqlhx-openssh-9.9p1.lock
LockHeld: /gnu/store/86rl29llmb7s4sl3bx0vl465mmq7nk6f-gcr-3.41.2.lock

SessionPID: 27382
ClientCommand: /gnu/store/mfkz7fvlfpv3ppwbkv0imb19nrf95akf-guile-3.0.9/bin/guile --no-auto-compile -e main -s /gnu/store/ll18sc406b5cqapmvz17v22gh4sryb24-cuirass-1.2.0-11.e96f088/bin/.cuirass-real remote-worker --user=cuirass-worker --workers=4 --systems=x86_64-linux,i686-linux --publish-port=5558 --substitute-urls=http://141.80.167.131
--8<---------------cut here---------------end--------------->8---

Here process 27269 holds locks on libva and rav1e and waits forever
trying to get the dav1d lock, held by 27308; process 27308 tries to get
the rav1e lock; process 27345 tries to get the libva lock.

FWIW, each of them is trying to substitute (not build) those things, via
the ‘build-things’ call made after the “substituting ~a inputs for ~a”
message in remote-worker.

Ludo’.




^ permalink raw reply	[flat|nested] 6+ messages in thread

* bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock
  2024-12-21 16:22 ` Ludovic Courtès
@ 2024-12-21 17:08   ` Ludovic Courtès
  2024-12-21 22:43     ` Ludovic Courtès
  0 siblings, 1 reply; 6+ messages in thread
From: Ludovic Courtès @ 2024-12-21 17:08 UTC (permalink / raw)
  To: 31785; +Cc: Christopher Baines, Reepca Russelstein

(Cc’ing Reepca + Chris for insight!  See
<https://issues.guix.gnu.org/31785> for context.)

Ludovic Courtès <ludo@gnu.org> skribis:

> Here process 27269 holds locks on libva and rav1e and waits forever
> trying to get the dav1d lock, held by 27308; process 27308 tries to get
> the rav1e lock; process 27345 tries to get the libva lock.

Additional data points:

  • All three store items are valid, yet client sessions are still stuck
    with locks held and trying to acquire the other locks;

  • The associated timestamp of these three store items in
    /var/guix/db/db.sqlite is the same (one-second accuracy);

  • The timestamps corresponds exactly to that of the “fetching path”
    messages in the log:

--8<---------------cut here---------------start------------->8---
root@hydra-guix-126 ~# grep -C3 -E "fetching.*(rav1e-0.7.1|libva-2.19.0|dav1d-1.5.0)'" /var/log/cuirass-remote-worker.log
2024-12-21 13:27:21  libgdata-0.18.1.tar.xz  832KiB    133.9MiB/s 00:00 ???????????????????? 100.0%
2024-12-21 13:27:21
2024-12-21 13:27:21 @ substituter-succeeded /gnu/store/9zbfmr41v0g4a5wm5s4yzwn8hg8l50b0-libgdata-0.18.1.tar.xz
2024-12-21 13:27:21 fetching path `/gnu/store/zf5w9ypk8il0i9y22n81aamypr2qgsmm-dav1d-1.5.0'...
2024-12-21 13:27:21 @ substituter-started /gnu/store/zf5w9ypk8il0i9y22n81aamypr2qgsmm-dav1d-1.5.0 substitute
2024-12-21 13:27:21 GmMxDR0c: substituting 14 inputs for /gnu/store/21m6i110abpzdjsb2kbz2c6sm8zy3zpx-sugar-cellgame-activity-5-1.4a22fd1.drv
2024-12-21 13:27:21 Downloading http://141.80.167.131/nar/lzip/zf5w9ypk8il0i9y22n81aamypr2qgsmm-dav1d-1.5.0...
--
2024-12-21 13:27:21  guile_gi-0.3.2.tar.gz  876KiB     124.3MiB/s 00:00 ???????????????????? 100.0%
2024-12-21 13:27:21
2024-12-21 13:27:21 @ substituter-succeeded /gnu/store/db700nxijpqn34a22nxpajbw2pwffkpv-guile_gi-0.3.2.tar.gz
2024-12-21 13:27:21 fetching path `/gnu/store/72s7500g3zg2p6fjdc1paazvm1w2xdr2-libva-2.19.0'...
2024-12-21 13:27:21 @ substituter-started /gnu/store/72s7500g3zg2p6fjdc1paazvm1w2xdr2-libva-2.19.0 substitute
2024-12-21 13:27:21 Downloading http://141.80.167.131/nar/zstd/ij5igi5xrp4sx6c78nbvg24lb4ma2f4l-libcbor-0.11.0...
2024-12-21 13:27:21
--
2024-12-21 13:27:21
2024-12-21 13:27:21 @ substituter-succeeded /gnu/store/ij5igi5xrp4sx6c78nbvg24lb4ma2f4l-libcbor-0.11.0
2024-12-21 13:27:21 Downloading http://141.80.167.131/nar/zstd/72s7500g3zg2p6fjdc1paazvm1w2xdr2-libva-2.19.0...
2024-12-21 13:27:21 fetching path `/gnu/store/0bbnhq7bagn6sbj2lmapmdiiw50v3dgz-rav1e-0.7.1'...
2024-12-21 13:27:21 @ substituter-started /gnu/store/0bbnhq7bagn6sbj2lmapmdiiw50v3dgz-rav1e-0.7.1 substitute
2024-12-21 13:27:21 waiting for locks or build slots...
2024-12-21 13:27:21
--8<---------------cut here---------------end--------------->8---

It would seem that the root cause is that locks aren’t released even
though substitution succeeded:

--8<---------------cut here---------------start------------->8---
root@hydra-guix-126 ~# grep -E '(zf5w9ypk8il0i9y22n81aamypr2qgsmm|72s7500g3zg2p6fjdc1paazvm1w2xdr2|0bbnhq7bagn6sbj2lmapmdiiw50v3dgz)' /var/log/cuirass-remote-worker.log
2024-12-21 13:27:21 fetching path `/gnu/store/zf5w9ypk8il0i9y22n81aamypr2qgsmm-dav1d-1.5.0'...
2024-12-21 13:27:21 @ substituter-started /gnu/store/zf5w9ypk8il0i9y22n81aamypr2qgsmm-dav1d-1.5.0 substitute
2024-12-21 13:27:21 Downloading http://141.80.167.131/nar/lzip/zf5w9ypk8il0i9y22n81aamypr2qgsm -dav1d-1.5.0...
2024-12-21 13:27:21 @ substituter-succeeded /gnu/store/zf5w9ypk8il0i9y22n81aamypr2qgsmm-dav1d-1.5.0
2024-12-21 13:27:21 fetching path `/gnu/store/72s7500g3zg2p6fjdc1paazvm1w2xdr2-libva-2.19.0'...
2024-12-21 13:27:21 @ substituter-started /gnu/store/72s7500g3zg2p6fjdc1paazvm1w2xdr2-libva-2.19.0 substitute
2024-12-21 13:27:21 Downloading http://141.80.167.131/nar/zstd/72s7500g3zg2p6fjdc1paazvm1w2xdr -libva-2.19.0...
2024-12-21 13:27:21 fetching path `/gnu/store/0bbnhq7bagn6sbj2lmapmdiiw50v3dgz-rav1e-0.7.1'...
2024-12-21 13:27:21 @ substituter-started /gnu/store/0bbnhq7bagn6sbj2lmapmdiiw50v3dgz-rav1e-0.7.1 substitute
2024-12-21 13:27:21 @ substituter-succeeded /gnu/store/72s7500g3zg2p6fjdc1paazvm1w2xdr2-libva-2.19.0
2024-12-21 13:27:21  gst-plugins-espeak-0.5.0-1.7f6e412  19KiB 4.5MiB/s 00:00 ???????????????????? 100.0%Downloading http://141.80.167.131/nar/zstd/0bbnhq7bagn6sbj2lmapmdiiw50v3dgz-rav1e-0.7.1...
2024-12-21 13:27:21 @ substituter-succeeded /gnu/store/0bbnhq7bagn6sbj2lmapmdiiw50v3dgz-rav1e-0.7.1
--8<---------------cut here---------------end--------------->8---

Thoughts?

Ludo’.




^ permalink raw reply	[flat|nested] 6+ messages in thread

* bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock
  2024-12-21 17:08   ` Ludovic Courtès
@ 2024-12-21 22:43     ` Ludovic Courtès
  0 siblings, 0 replies; 6+ messages in thread
From: Ludovic Courtès @ 2024-12-21 22:43 UTC (permalink / raw)
  To: 31785; +Cc: Christopher Baines, Reepca Russelstein

Ludovic Courtès <ludo@gnu.org> skribis:

> It would seem that the root cause is that locks aren’t released even
> though substitution succeeded:

<https://github.com/NixOS/nix/issues/178> was about an issue that looks
exactly the same, but the fix is already included in our copy of
‘build.cc’.

Ludo’.




^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-12-21 22:47 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-06-11 14:06 bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock Ludovic Courtès
2019-09-06  9:04 ` Ludovic Courtès
2020-11-04 15:21 ` Ludovic Courtès
2024-12-21 16:22 ` Ludovic Courtès
2024-12-21 17:08   ` Ludovic Courtès
2024-12-21 22:43     ` Ludovic Courtès

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.