Good afternoon fearless hacker!

Mathieu Othacehe wrote:

>> ‘process-build-log’ in Cuirass uses ‘read-line/non-blocking’ to read a
>> line from the log port of ‘build-derivations&’.  If that really is
>> non-blocking (and I think it is), then we should be fine?
>>
>> We should attach GDB to Cuirass next time to see what’s blocking.
>
> Cuirass is currently hanging probably due to the same issue.  I saved a
> GDB core dump in /home/mathieu/core.76483.

For those following along at home, we have 60 threads in there.

A couple of threads are blocked in ‘clock_nanosleep’, which I considered
fishy at first:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007fe26752f7a1 in __GI___clock_nanosleep (clock_id=-612010, flags=0, req=0x7fdf6b40d140, rem=0x7fdf6b40d140) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007fe267a0166d in ffi_call_unix64 () from /gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7
#2  0x00007fe2679ffac0 in ffi_call_int () from /gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7
#3  0x00007fe267af5f2e in scm_i_foreign_call (cif_scm=, pointer_scm=, errno_ret=errno_ret@entry=0x7fe25a8e86cc, argv=0x7fe25b955df0) at foreign.c:1073
#4  0x00007fe267b64a84 in foreign_call (thread=0x7fe26741e480, cif=, pointer=) at vm.c:1282
#5  0x00007fe2505253e0 in ?? ()
#6  0x00007fe26741e480 in ?? ()
#7  0x00007fe267bd7620 in ?? () from /gnu/store/0w76khfspfy8qmcpjya41chj3bgfcy0k-guile-3.0.4/lib/libguile-3.0.so.1
#8  0x00007fe26741e480 in ?? ()
#9  0x00007fe267b1043b in scm_jit_enter_mcode (thread=0x7fe26741e480, thread@entry=0x7fe2505253b0, mcode=0x7fe25052627c "L\215\243\210") at jit.c:5852
#10 0x00007fe267b6bc24 in vm_regular_engine (thread=0x7fe2505253b0) at vm-engine.c:415
#11 0x00007fe267b6c5b5 in scm_call_n (proc=proc@entry=#, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1608
#12 0x00007fe267ae8ae9 in scm_call_0 (proc=proc@entry=#) at eval.c:490
#13 0x00007fe267adb138 in scm_call_with_unblocked_asyncs (proc=#) at async.c:406
--8<---------------cut here---------------end--------------->8---

This can only come from (fibers posix-clocks) via ‘with-interrupts’, so
it is probably OK.

Then there’s a couple of threads blocked in ‘pthread_cond_wait’, but
that’s presumably also Fibers internals.

Then there’s a whole bunch of threads stuck in ‘read’:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007fe267a180a4 in __libc_read (fd=80, buf=buf@entry=0x7fe22b0bb8f0, nbytes=nbytes@entry=8) at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x00007fe267af69c7 in fport_read (port=, dst=, start=, count=8) at fports.c:597
#2  0x00007fe267b30542 in trampoline_to_c_read (port=# 7fe22b7b9880>, dst="#" = {...}, start=0, count=8) at ports.c:266
#3  0x00007fe2580cb5fe in ?? ()
#4  0x00007fe267431d80 in ?? ()
#5  0x00007fe267bd7620 in ?? () from /gnu/store/0w76khfspfy8qmcpjya41chj3bgfcy0k-guile-3.0.4/lib/libguile-3.0.so.1
#6  0x00007fe267431d80 in ?? ()
#7  0x00007fe267b1043b in scm_jit_enter_mcode (thread=0x7fe267431d80, thread@entry=0x7fe2580cb5d0, mcode=0x7fe229340690 "H\203\350(I\211\314I)\304I\203\374\060\017\205T\003") at jit.c:5852
#8  0x00007fe267b6b8e9 in vm_regular_engine (thread=0x7fe2580cb5d0) at vm-engine.c:360
#9  0x00007fe267b6c5b5 in scm_call_n (proc=proc@entry=#, argv=argv@entry=0x0, nargs=nargs@entry=0) at vm.c:1608
#10 0x00007fe267ae8ae9 in scm_call_0 (proc=proc@entry=#) at eval.c:490
#11 0x00007fe267adb138 in scm_call_with_unblocked_asyncs (proc=#) at async.c:406
--8<---------------cut here---------------end--------------->8---

‘trampoline_to_c_read’ is known as ‘port-read’ in Scheme, so I think the
call above comes from ‘read-bytes’ in (ice-9 suspendable-ports).

Normally, this file descriptor is O_NONBLOCK, and thus ‘fport_read’
immediately returns EAGAIN, so ‘trampoline_to_c_read’ returns #false.

But does Cuirass create file descriptors as O_NONBLOCK?  This has to be
done explicitly; Fibers won’t do it for us.

As it turns out, the answer is no, in at least one important case: the
connection to the daemon (untested patch below).

While GC is running, Cuirass typically sends ‘build-derivations’ RPCs
and they block until the GC lock is released.  That can lead to the
situation above: a bunch of threads blocked in ‘read’ from their daemon
socket, waiting for the RPC reply.

OTOH, ‘build-derivations’ RPCs are made from a fresh thread created by
‘build-derivations&’.

There are probably other situations where the daemon replies slowly.
For instance, ‘fetch-input’ can remain stuck until GC is over.

WDYT?

Thanks for investigating!

Ludo’.
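
PS: For readers who want to see the O_NONBLOCK behaviour described above in isolation, here is a minimal sketch in Python.  It is not the patch referred to in this message and it is not Cuirass or Guile code.  The socket pair is only a stand-in for the daemon connection, and the names ‘set_nonblocking’, ‘try_read’ and ‘demo’ are made up for illustration.  It shows that the flag must be set explicitly with fcntl, and that a read on an empty non-blocking descriptor fails with EAGAIN right away instead of parking the thread in ‘read’.

```python
import errno
import fcntl
import os
import socket


def set_nonblocking(fd):
    """Add O_NONBLOCK to the status flags of FD, keeping the other flags."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)


def try_read(fd, count=8):
    """Read up to COUNT bytes from FD.

    Return None when no data is ready (EAGAIN/EWOULDBLOCK) rather than
    blocking; a cooperative scheduler could switch tasks at that point.
    """
    try:
        return os.read(fd, count)
    except BlockingIOError as err:
        if err.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return None
        raise


def demo():
    # Hypothetical stand-in for the client <-> daemon connection.
    client, daemon = socket.socketpair()
    try:
        # Without this call, the first try_read below would block forever.
        set_nonblocking(client.fileno())
        before = try_read(client.fileno())   # the "daemon" has not replied yet
        daemon.sendall(b"reply!!!")          # the reply eventually arrives
        after = try_read(client.fileno())
        return before, after
    finally:
        client.close()
        daemon.close()


if __name__ == "__main__":
    print(demo())
```

In Guile the same flag can be set on a port with ‘fcntl’, along the lines of (fcntl port F_SETFL (logior O_NONBLOCK (fcntl port F_GETFL))).  Where exactly that belongs for the daemon connection is left open here.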