From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp1 ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms11 with LMTPS id OCJ3EAfLkl8gKgAA0tVLHw (envelope-from ) for ; Fri, 23 Oct 2020 12:22:31 +0000 Received: from aspmx1.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp1 with LMTPS id oOCwCgfLkl8AOwAAbx9fmQ (envelope-from ) for ; Fri, 23 Oct 2020 12:22:31 +0000 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id 264DD9402A8 for ; Fri, 23 Oct 2020 12:22:29 +0000 (UTC) Received: from localhost ([::1]:46942 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1kVw5U-00055q-GV for larch@yhetil.org; Fri, 23 Oct 2020 08:22:28 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:37802) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kVw56-00052p-2v for bug-guix@gnu.org; Fri, 23 Oct 2020 08:22:04 -0400 Received: from debbugs.gnu.org ([209.51.188.43]:44712) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1kVw54-0001OD-6M for bug-guix@gnu.org; Fri, 23 Oct 2020 08:22:03 -0400 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1kVw54-0000IQ-0F for bug-guix@gnu.org; Fri, 23 Oct 2020 08:22:02 -0400 X-Loop: help-debbugs@gnu.org Subject: bug#43565: cuirass: Fibers scheduling blocked. Resent-From: Ludovic =?UTF-8?Q?Court=C3=A8s?= Original-Sender: "Debbugs-submit" Resent-CC: bug-guix@gnu.org Resent-Date: Fri, 23 Oct 2020 12:22:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 43565 X-GNU-PR-Package: guix X-GNU-PR-Keywords: To: Mathieu Othacehe Received: via spool by 43565-submit@debbugs.gnu.org id=B43565.16034556791074 (code B ref 43565); Fri, 23 Oct 2020 12:22:01 +0000 Received: (at 43565) by debbugs.gnu.org; 23 Oct 2020 12:21:19 +0000 Received: from localhost ([127.0.0.1]:56258 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1kVw4M-0000HG-FI for submit@debbugs.gnu.org; Fri, 23 Oct 2020 08:21:18 -0400 Received: from eggs.gnu.org ([209.51.188.92]:37490) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1kVw4J-0000Gy-1q for 43565@debbugs.gnu.org; Fri, 23 Oct 2020 08:21:17 -0400 Received: from fencepost.gnu.org ([2001:470:142:3::e]:48171) by eggs.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1kVw4D-0001Ke-PV for 43565@debbugs.gnu.org; Fri, 23 Oct 2020 08:21:09 -0400 Received: from [2a01:e0a:1d:7270:af76:b9b:ca24:c465] (port=55132 helo=ribbon) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1kVw4B-0003Jj-Vh; Fri, 23 Oct 2020 08:21:08 -0400 From: Ludovic =?UTF-8?Q?Court=C3=A8s?= References: <87eemtzr1q.fsf@gnu.org> <87r1qc27mo.fsf@gnu.org> <874kmmzd92.fsf@gnu.org> X-URL: http://www.fdn.fr/~lcourtes/ X-Revolutionary-Date: 2 Brumaire an 229 de la =?UTF-8?Q?R=C3=A9volution?= X-PGP-Key-ID: 0x090B11993D9AEBB5 X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5 X-OS: x86_64-pc-linux-gnu Date: Fri, 23 Oct 2020 14:21:05 +0200 In-Reply-To: <874kmmzd92.fsf@gnu.org> (Mathieu Othacehe's message of "Thu, 22 Oct 2020 13:55:21 +0200") Message-ID: <871rhpqgjy.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Spam-Score: -2.3 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-Spam-Score: -3.3 (---) X-BeenThere: bug-guix@gnu.org List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: 43565@debbugs.gnu.org Errors-To: bug-guix-bounces+larch=yhetil.org@gnu.org Sender: "bug-Guix" X-Scanner: scn0 Authentication-Results: aspmx1.migadu.com; dkim=none; dmarc=pass (policy=none) header.from=gnu.org; spf=pass (aspmx1.migadu.com: domain of bug-guix-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=bug-guix-bounces@gnu.org X-Spam-Score: -1.51 X-TUID: zqHSIXMVpH9U --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Good afternoon fearless hacker! Mathieu Othacehe skribis: >> =E2=80=98process-build-log=E2=80=99 in Cuirass uses =E2=80=98read-line/n= on-blocking=E2=80=99 to read a >> line from the log port of =E2=80=98build-derivations&=E2=80=99. If that= really is >> non-blocking (and I think it is), then we should be fine? >> >> We should attach GDB to Cuirass next time to see what=E2=80=99s blocking. > > Cuirass is currently hanging probably due to the same issue. I saved a > GDB core dump in /home/mathieu/core.76483. For those following along at home, we have 60 threads in there. A couple of threads are blocked in =E2=80=98clock_nanosleep=E2=80=99, which= I considered fishy at first: --8<---------------cut here---------------start------------->8--- (gdb) bt #0 0x00007fe26752f7a1 in __GI___clock_nanosleep (clock_id=3D-612010, flags= =3D0, req=3D0x7fdf6b40d140, rem=3D0x7fdf6b40d140) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48 #1 0x00007fe267a0166d in ffi_call_unix64 () from /gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.s= o.7 #2 0x00007fe2679ffac0 in ffi_call_int () from /gnu/store/bw15z9kh9c65ycc2v= bhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7 #3 0x00007fe267af5f2e in scm_i_foreign_call (cif_scm=3D, po= inter_scm=3D,=20 errno_ret=3Derrno_ret@entry=3D0x7fe25a8e86cc, argv=3D0x7fe25b955df0) at= foreign.c:1073 #4 0x00007fe267b64a84 in foreign_call (thread=3D0x7fe26741e480, cif=3D, pointer=3D) at vm.c:1282 #5 0x00007fe2505253e0 in ?? () #6 0x00007fe26741e480 in ?? () #7 0x00007fe267bd7620 in ?? () from /gnu/store/0w76khfspfy8qmcpjya41chj3bg= fcy0k-guile-3.0.4/lib/libguile-3.0.so.1 #8 0x00007fe26741e480 in ?? () #9 0x00007fe267b1043b in scm_jit_enter_mcode (thread=3D0x7fe26741e480, thr= ead@entry=3D0x7fe2505253b0,=20 mcode=3D0x7fe25052627c "L\215\243\210") at jit.c:5852 #10 0x00007fe267b6bc24 in vm_regular_engine (thread=3D0x7fe2505253b0) at vm= -engine.c:415 #11 0x00007fe267b6c5b5 in scm_call_n (proc=3Dproc@entry=3D#, argv=3Dargv@entry=3D0x0,=20 nargs=3Dnargs@entry=3D0) at vm.c:1608 #12 0x00007fe267ae8ae9 in scm_call_0 (proc=3Dproc@entry=3D#) at eval.c:490 #13 0x00007fe267adb138 in scm_call_with_unblocked_asyncs (proc=3D#) at async.c:406 --8<---------------cut here---------------end--------------->8--- This can only come from (fibers posix-clocks) via =E2=80=98with-interrupts=E2=80=99=E2=80=94probably OK. Then there=E2=80=99s a couple of threads block in =E2=80=98pthread_cond_wai= t=E2=80=99, but that=E2=80=99s presumably also Fibers internals. Then there=E2=80=99s a whole bunch of threads stuck in =E2=80=98read=E2=80= =99: --8<---------------cut here---------------start------------->8--- (gdb) bt #0 0x00007fe267a180a4 in __libc_read (fd=3D80, buf=3Dbuf@entry=3D0x7fe22b0= bb8f0, nbytes=3Dnbytes@entry=3D8) at ../sysdeps/unix/sysv/linux/read.c:26 #1 0x00007fe267af69c7 in fport_read (port=3D, dst=3D, start=3D, count=3D8) at fports.c:597 #2 0x00007fe267b30542 in trampoline_to_c_read (port=3D# 7fe22b7b9880>,=20 dst=3D"#" =3D {...}, start=3D0, count=3D8) at ports.c:266 #3 0x00007fe2580cb5fe in ?? () #4 0x00007fe267431d80 in ?? () #5 0x00007fe267bd7620 in ?? () from /gnu/store/0w76khfspfy8qmcpjya41chj3bg= fcy0k-guile-3.0.4/lib/libguile-3.0.so.1 #6 0x00007fe267431d80 in ?? () #7 0x00007fe267b1043b in scm_jit_enter_mcode (thread=3D0x7fe267431d80, thr= ead@entry=3D0x7fe2580cb5d0,=20 mcode=3D0x7fe229340690 "H\203\350(I\211\314I)\304I\203\374\060\017\205T= \003") at jit.c:5852 #8 0x00007fe267b6b8e9 in vm_regular_engine (thread=3D0x7fe2580cb5d0) at vm= -engine.c:360 #9 0x00007fe267b6c5b5 in scm_call_n (proc=3Dproc@entry=3D#, argv=3Dargv@entry=3D0x0,=20 nargs=3Dnargs@entry=3D0) at vm.c:1608 #10 0x00007fe267ae8ae9 in scm_call_0 (proc=3Dproc@entry=3D#) at eval.c:490 #11 0x00007fe267adb138 in scm_call_with_unblocked_asyncs (proc=3D#) at async.c:406 --8<---------------cut here---------------end--------------->8--- =E2=80=98trampoline_to_c_read=E2=80=99 is known as =E2=80=98port-read=E2=80= =99 in Scheme, so I think the call above comes from =E2=80=98read-bytes=E2=80=99 in (ice-9 suspendable-po= rts). Normally, this file descriptor is O_NONBLOCK, and thus =E2=80=98fport_read= =E2=80=99 immediately returns EAGAIN, so =E2=80=98trampoline_to_c_read=E2=80=99 retur= ns #false. But does Cuirass create file descriptors as O_NONBLOCK? This has to be done explicitly, Fibers won=E2=80=99t do it for us. As it turns out, the a= nswer is no, in at least one important case: the connection to the daemon (untested patch below). While GC is running, Cuirass typically sends =E2=80=98build-derivations=E2= =80=99 RPCs and they block until the GC lock is released. That can lead to the situation above: a bunch of threads blocked in =E2=80=98read=E2=80=99 from = their daemon socket, waiting for the RPC reply. OTOH, =E2=80=98build-derivations=E2=80= =99 RPCs are made from a fresh thread created by =E2=80=98build-derivations&=E2=80=99. There are probably other situations where the daemon replies slowly. For instance, =E2=80=98fetch-input=E2=80=99 can remain stuck until GC is ov= er. WDYT? Thanks for investigating! Ludo=E2=80=99. --=-=-= Content-Type: text/x-patch Content-Disposition: inline diff --git a/src/cuirass/base.scm b/src/cuirass/base.scm index 5a0c826..6db43c4 100644 --- a/src/cuirass/base.scm +++ b/src/cuirass/base.scm @@ -36,6 +36,9 @@ #:use-module ((guix config) #:select (%state-directory)) #:use-module (git) #:use-module (ice-9 binary-ports) + #:use-module ((ice-9 suspendable-ports) + #:select (current-read-waiter + current-write-waiter)) #:use-module (ice-9 format) #:use-module (ice-9 match) #:use-module (ice-9 popen) @@ -79,7 +82,12 @@ ;; currently closes in a 'dynamic-wind' handler, which means it would close ;; the store at each context switch. Remove this when the real 'with-store' ;; has been fixed. - (let ((store (open-connection))) + (let* ((store (open-connection)) + (socket (store-connection-socket store))) + ;; Mark SOCKET as non-blocking so Fibers can schedule the way it wants. + (let ((flags (fcntl socket F_GETFL))) + (fcntl socket F_SETFL (logior O_NONBLOCK flags))) + (unwind-protect ;; Always set #:keep-going? so we don't stop on the first build failure. ;; Set #:print-build-trace explicitly to make sure 'process-build-log' @@ -422,7 +430,12 @@ Essentially this procedure inverts the inversion-of-control that (lambda () (guard (c ((store-error? c) (atomic-box-set! result c))) - (parameterize ((current-build-output-port output)) + (parameterize ((current-build-output-port output) + + ;; STORE's socket is O_NONBLOCK but since we're + ;; not in a fiber, disable Fiber's handlers. + (current-read-waiter #f) + (current-write-waiter #f)) (let ((x (build-derivations store lst))) (atomic-box-set! result x)))) (close-port output)) --=-=-=--