From mboxrd@z Thu Jan 1 00:00:00 1970 From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) Subject: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock Date: Mon, 11 Jun 2018 16:06:16 +0200 Message-ID: <87602ph0yv.fsf@gnu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Return-path: Received: from eggs.gnu.org ([2001:4830:134:3::10]:47656) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fSNTP-0003PN-M2 for bug-guix@gnu.org; Mon, 11 Jun 2018 10:07:09 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fSNTL-0001pe-Km for bug-guix@gnu.org; Mon, 11 Jun 2018 10:07:07 -0400 Received: from debbugs.gnu.org ([208.118.235.43]:35617) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1fSNTL-0001pT-Fy for bug-guix@gnu.org; Mon, 11 Jun 2018 10:07:03 -0400 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1fSNTK-0008Is-Kg for bug-guix@gnu.org; Mon, 11 Jun 2018 10:07:03 -0400 Sender: "Debbugs-submit" Resent-Message-ID: Received: from eggs.gnu.org ([2001:4830:134:3::10]:47293) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fSNSg-0002pO-VQ for bug-guix@gnu.org; Mon, 11 Jun 2018 10:06:31 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fSNSc-0001TT-TN for bug-guix@gnu.org; Mon, 11 Jun 2018 10:06:22 -0400 Received: from fencepost.gnu.org ([2001:4830:134:3::e]:44274) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fSNSc-0001TH-PY for bug-guix@gnu.org; Mon, 11 Jun 2018 10:06:18 -0400 Received: from [193.50.110.75] (port=57708 helo=ribbon) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1fSNSc-0003FS-9d for bug-guix@gnu.org; Mon, 11 Jun 2018 10:06:18 -0400 List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org Sender: "bug-Guix" To: 31785@debbugs.gnu.org Hello, I tried running this: guix build --max-jobs=3D200 $(guix gc -R $(guix build -d inkscape --no-gr= afts) | sort) & \ guix build --max-jobs=3D200 $(guix gc -R $(guix build -d inkscape --no-gr= afts) | sort -r) =E2=80=A6 also in parallel with this (for good measure): guix build --max-jobs=3D200 $(guix gc -R $(guix build -d inkscape --no-gr= afts) | sort -R) Since we have 3 clients, that leads to 3 guix-daemon processes, and those are stuck in a deadlock: --8<---------------cut here---------------start------------->8--- $ ps aux | grep guix-daemon [=E2=80=A6] root 20501 0.6 0.0 39292 12012 ? Ss 15:51 0:01 guix-daemo= n 20455 root 20503 4.3 0.0 39420 12096 ? Ss 15:51 0:12 guix-daemo= n 20491 root 22154 0.0 0.0 39028 11016 ? Ss 15:52 0:00 guix-daemo= n 22148 $ sudo strace -p 22154 -t strace: Process 22154 attached 15:57:03 select(0, 0x7ffdca30b610, NULL, NULL, {tv_sec=3D1, tv_usec=3D31418= 9}) =3D 0 (Timeout) 15:57:04 openat(AT_FDCWD, "/gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binu= tils-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-core= utils-8.29-debug.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gett= ext-minimal-0.19.8.1.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glib= c-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnut= ls-3.5.18-doc.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-grof= f-1.22.3-doc.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libs= tdc++-5.5.0.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1= .4.18.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1= .4.18.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-modu= le-import-compiled.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-open= ssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) =3D 8 ^C $ sudo strace -p 20503 -t strace: Process 20503 attached 15:58:23 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec= =3D1, tv_usec=3D595302}) =3D 0 (Timeout) 15:58:25 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1= .4.18.lock", O_RDWR|O_CREAT, 0600) =3D 14 15:58:25 fcntl(14, F_GETFD) =3D 0 15:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) =3D 0 15:58:25 fcntl(14, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:58:25 close(14) =3D 0 15:58:25 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-modu= le-import-compiled.lock", O_RDWR|O_CREAT, 0600) =3D 14 15:58:25 fcntl(14, F_GETFD) =3D 0 15:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) =3D 0 15:58:25 fcntl(14, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:58:25 close(14) =3D 0 15:58:25 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-open= ssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) =3D 14 15:58:25 fcntl(14, F_GETFD) =3D 0 15:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) =3D 0 15:58:25 fcntl(14, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:58:25 close(14) =3D 0 15:58:25 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec= =3D5, tv_usec=3D0}) =3D 0 (Timeout) 15:58:30 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1= .4.18.lock", O_RDWR|O_CREAT, 0600) =3D 14 15:58:30 fcntl(14, F_GETFD) =3D 0 15:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) =3D 0 15:58:30 fcntl(14, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:58:30 close(14) =3D 0 15:58:30 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-modu= le-import-compiled.lock", O_RDWR|O_CREAT, 0600) =3D 14 15:58:30 fcntl(14, F_GETFD) =3D 0 15:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) =3D 0 15:58:30 fcntl(14, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:58:30 close(14) =3D 0 15:58:30 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-open= ssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) =3D 14 15:58:30 fcntl(14, F_GETFD) =3D 0 15:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) =3D 0 15:58:30 fcntl(14, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:58:30 close(14) =3D 0 15:58:30 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec= =3D5, tv_usec=3D0}^Cstrace: Process 20503 detached $ sudo strace -p 20501 -t strace: Process 20501 attached 15:59:09 select(121, [20 30 120], NULL, NULL, {tv_sec=3D0, tv_usec=3D535316= }) =3D 0 (Timeout) 15:59:10 openat(AT_FDCWD, "/gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binu= tils-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-core= utils-8.29-debug.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gett= ext-minimal-0.19.8.1.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glib= c-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnut= ls-3.5.18-doc.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-grof= f-1.22.3-doc.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libs= tdc++-5.5.0.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1= .4.18.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 select(121, [20 30 120], NULL, NULL, {tv_sec=3D5, tv_usec=3D0}^Cst= race: Process 20501 detached $ sudo lsof | grep '/gnu/store/.*\.lock' guix-daem 20501 root 14uW REG 253,0 = 0 3150879 /gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-com= piled.lock guix-daem 20501 root 26uW REG 253,0 = 0 3212618 /gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-do= c.lock guix-daem 20501 root 116uW REG 253,0 = 0 3212684 /gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock guix-daem 20503 root 8uW REG 253,0 = 0 3212520 /gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-groff-1.22.3-doc.= lock guix-daem 20503 root 50uW REG 253,0 = 0 3212514 /gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binutils-bootstra= p-0.lock guix-daem 20503 root 155uW REG 253,0 = 0 3212547 /gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glibc-bootstrap-0= .lock guix-daem 20503 root 173uW REG 253,0 = 0 3212552 /gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnutls-3.5.18-doc= .lock guix-daem 20503 root 317uW REG 253,0 = 0 3212602 /gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libstdc++-5.5.0.l= ock guix-daem 20503 root 407uW REG 253,0 = 0 3212655 /gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-coreutils-8.29-de= bug.lock guix-daem 20503 root 410uW REG 253,0 = 0 3212659 /gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0= .19.8.1.lock guix-daem 20503 root 440uW REG 253,0 = 0 3212686 /gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1.4.18.lock --8<---------------cut here---------------end--------------->8--- So for example: =E2=80=A2 20501 tries to acquire aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock, which is held by 20503; =E2=80=A2 20503 tries to acquire amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock, which is held by 20501. This comes from the fact that =E2=80=98LocalStore::buildPaths=E2=80=99 take= s the user-supplied derivation list as is, without sorting it, and then acquires locks in that order in =E2=80=98Worker::run=E2=80=99. A topological sort (or maybe an alphanumeric sort?) should allow us to guarantee that guix-daemon processes take locks in the same order, and then don=E2=80=99t end up in a deadlock. I discovered this bug while monitoring Cuirass on berlin: several sessions submit batches of 200 derivations in =E2=80=98build-paths=E2=80=99= RPCs, and sometimes most of the corresponding guix-daemon processes would end up being stuck in a lock-acquiring loop. Ludo=E2=80=99.