unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
From: "Ludovic Courtès" <ludo@gnu.org>
To: Mathieu Othacehe <othacehe@gnu.org>
Cc: 43565@debbugs.gnu.org
Subject: bug#43565: cuirass: Fibers scheduling blocked.
Date: Fri, 23 Oct 2020 14:21:05 +0200	[thread overview]
Message-ID: <871rhpqgjy.fsf@gnu.org> (raw)
In-Reply-To: <874kmmzd92.fsf@gnu.org> (Mathieu Othacehe's message of "Thu, 22 Oct 2020 13:55:21 +0200")

[-- Attachment #1: Type: text/plain, Size: 5157 bytes --]

Good afternoon fearless hacker!

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> ‘process-build-log’ in Cuirass uses ‘read-line/non-blocking’ to read a
>> line from the log port of ‘build-derivations&’.  If that really is
>> non-blocking (and I think it is), then we should be fine?
>>
>> We should attach GDB to Cuirass next time to see what’s blocking.
>
> Cuirass is currently hanging probably due to the same issue. I saved a
> GDB core dump in /home/mathieu/core.76483.

For those following along at home, we have 60 threads in there.

A couple of threads are blocked in ‘clock_nanosleep’, which I considered
fishy at first:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007fe26752f7a1 in __GI___clock_nanosleep (clock_id=-612010, flags=0, req=0x7fdf6b40d140, rem=0x7fdf6b40d140)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007fe267a0166d in ffi_call_unix64 ()
   from /gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7
#2  0x00007fe2679ffac0 in ffi_call_int () from /gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7
#3  0x00007fe267af5f2e in scm_i_foreign_call (cif_scm=<optimized out>, pointer_scm=<optimized out>, 
    errno_ret=errno_ret@entry=0x7fe25a8e86cc, argv=0x7fe25b955df0) at foreign.c:1073
#4  0x00007fe267b64a84 in foreign_call (thread=0x7fe26741e480, cif=<optimized out>, pointer=<optimized out>)
    at vm.c:1282
#5  0x00007fe2505253e0 in ?? ()
#6  0x00007fe26741e480 in ?? ()
#7  0x00007fe267bd7620 in ?? () from /gnu/store/0w76khfspfy8qmcpjya41chj3bgfcy0k-guile-3.0.4/lib/libguile-3.0.so.1
#8  0x00007fe26741e480 in ?? ()
#9  0x00007fe267b1043b in scm_jit_enter_mcode (thread=0x7fe26741e480, thread@entry=0x7fe2505253b0, 
    mcode=0x7fe25052627c "L\215\243\210") at jit.c:5852
#10 0x00007fe267b6bc24 in vm_regular_engine (thread=0x7fe2505253b0) at vm-engine.c:415
#11 0x00007fe267b6c5b5 in scm_call_n (proc=proc@entry=#<unmatched-tag 20045>, argv=argv@entry=0x0, 
    nargs=nargs@entry=0) at vm.c:1608
#12 0x00007fe267ae8ae9 in scm_call_0 (proc=proc@entry=#<unmatched-tag 20045>) at eval.c:490
#13 0x00007fe267adb138 in scm_call_with_unblocked_asyncs (proc=#<unmatched-tag 20045>) at async.c:406
--8<---------------cut here---------------end--------------->8---

This can only come from (fibers posix-clocks) via
‘with-interrupts’—probably OK.

Then there’s a couple of threads block in ‘pthread_cond_wait’, but
that’s presumably also Fibers internals.

Then there’s a whole bunch of threads stuck in ‘read’:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007fe267a180a4 in __libc_read (fd=80, buf=buf@entry=0x7fe22b0bb8f0, nbytes=nbytes@entry=8)
    at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x00007fe267af69c7 in fport_read (port=<optimized out>, dst=<optimized out>, start=<optimized out>, count=8)
    at fports.c:597
#2  0x00007fe267b30542 in trampoline_to_c_read (port=#<port #<port-type file 7fe25fb4db40> 7fe22b7b9880>, 
    dst="#<vu8vector>" = {...}, start=0, count=8) at ports.c:266
#3  0x00007fe2580cb5fe in ?? ()
#4  0x00007fe267431d80 in ?? ()
#5  0x00007fe267bd7620 in ?? () from /gnu/store/0w76khfspfy8qmcpjya41chj3bgfcy0k-guile-3.0.4/lib/libguile-3.0.so.1
#6  0x00007fe267431d80 in ?? ()
#7  0x00007fe267b1043b in scm_jit_enter_mcode (thread=0x7fe267431d80, thread@entry=0x7fe2580cb5d0, 
    mcode=0x7fe229340690 "H\203\350(I\211\314I)\304I\203\374\060\017\205T\003") at jit.c:5852
#8  0x00007fe267b6b8e9 in vm_regular_engine (thread=0x7fe2580cb5d0) at vm-engine.c:360
#9  0x00007fe267b6c5b5 in scm_call_n (proc=proc@entry=#<unmatched-tag 20045>, argv=argv@entry=0x0, 
    nargs=nargs@entry=0) at vm.c:1608
#10 0x00007fe267ae8ae9 in scm_call_0 (proc=proc@entry=#<unmatched-tag 20045>) at eval.c:490
#11 0x00007fe267adb138 in scm_call_with_unblocked_asyncs (proc=#<unmatched-tag 20045>) at async.c:406
--8<---------------cut here---------------end--------------->8---

‘trampoline_to_c_read’ is known as ‘port-read’ in Scheme, so I think the
call above comes from ‘read-bytes’ in (ice-9 suspendable-ports).

Normally, this file descriptor is O_NONBLOCK, and thus ‘fport_read’
immediately returns EAGAIN, so ‘trampoline_to_c_read’ returns #false.

But does Cuirass create file descriptors as O_NONBLOCK?  This has to be
done explicitly, Fibers won’t do it for us.  As it turns out, the answer
is no, in at least one important case: the connection to the daemon
(untested patch below).

While GC is running, Cuirass typically sends ‘build-derivations’ RPCs
and they block until the GC lock is released.  That can lead to the
situation above: a bunch of threads blocked in ‘read’ from their daemon
socket, waiting for the RPC reply.  OTOH, ‘build-derivations’ RPCs are
made from a fresh thread created by ‘build-derivations&’.

There are probably other situations where the daemon replies slowly.
For instance, ‘fetch-input’ can remain stuck until GC is over.

WDYT?

Thanks for investigating!

Ludo’.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-patch, Size: 1931 bytes --]

diff --git a/src/cuirass/base.scm b/src/cuirass/base.scm
index 5a0c826..6db43c4 100644
--- a/src/cuirass/base.scm
+++ b/src/cuirass/base.scm
@@ -36,6 +36,9 @@
   #:use-module ((guix config) #:select (%state-directory))
   #:use-module (git)
   #:use-module (ice-9 binary-ports)
+  #:use-module ((ice-9 suspendable-ports)
+                #:select (current-read-waiter
+                          current-write-waiter))
   #:use-module (ice-9 format)
   #:use-module (ice-9 match)
   #:use-module (ice-9 popen)
@@ -79,7 +82,12 @@
   ;; currently closes in a 'dynamic-wind' handler, which means it would close
   ;; the store at each context switch.  Remove this when the real 'with-store'
   ;; has been fixed.
-  (let ((store (open-connection)))
+  (let* ((store  (open-connection))
+         (socket (store-connection-socket store)))
+    ;; Mark SOCKET as non-blocking so Fibers can schedule the way it wants.
+    (let ((flags (fcntl socket F_GETFL)))
+      (fcntl socket F_SETFL (logior O_NONBLOCK flags)))
+
     (unwind-protect
      ;; Always set #:keep-going? so we don't stop on the first build failure.
      ;; Set #:print-build-trace explicitly to make sure 'process-build-log'
@@ -422,7 +430,12 @@ Essentially this procedure inverts the inversion-of-control that
           (lambda ()
             (guard (c ((store-error? c)
                        (atomic-box-set! result c)))
-              (parameterize ((current-build-output-port output))
+              (parameterize ((current-build-output-port output)
+
+                             ;; STORE's socket is O_NONBLOCK but since we're
+                             ;; not in a fiber, disable Fiber's handlers.
+                             (current-read-waiter #f)
+                             (current-write-waiter #f))
                 (let ((x (build-derivations store lst)))
                   (atomic-box-set! result x))))
             (close-port output))

  reply	other threads:[~2020-10-23 12:22 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-22 16:58 bug#43565: cuirass: Fibers scheduling blocked Mathieu Othacehe
2020-10-05 12:13 ` Ludovic Courtès
2020-10-22 11:55   ` Mathieu Othacehe
2020-10-23 12:21     ` Ludovic Courtès [this message]
2020-10-26 14:22       ` Mathieu Othacehe
2020-10-26 16:20         ` Ludovic Courtès
2020-10-27 18:03           ` Mathieu Othacehe
2020-11-02 10:09           ` Mathieu Othacehe
2020-11-19 10:56             ` Mathieu Othacehe
2020-11-20  8:37               ` Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871rhpqgjy.fsf@gnu.org \
    --to=ludo@gnu.org \
    --cc=43565@debbugs.gnu.org \
    --cc=othacehe@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).