unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
* bug#41948: Shepherd deadlocks
@ 2020-06-19  8:41 Mathieu Othacehe
  2020-06-19 12:10 ` Mathieu Othacehe
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Mathieu Othacehe @ 2020-06-19  8:41 UTC (permalink / raw)
  To: 41948

[-- Attachment #1: Type: text/plain, Size: 2414 bytes --]


Hello,

When running "gui-installed-desktop-os-encrypted" test, Shepherd seems
to deadlock when restarting "guix-daemon". This can happen at different
stages:

* In "umount-cow-store" procedure, just before finishing the install.

* During "set-http-proxy" tests inside the marionette.
  
This is not always reproducible. In order to gather some information, I
created a Shepherd "strace" service that logs what's happening in
Shepherd itself (patch attached).

It seems that, just after blocking signals, in "fork+exec-command", I
guess, Shepherd is taking a lock:

--8<---------------cut here---------------start------------->8---
183553:1     chdir("/")                        = 0
183554:1     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0
183555:1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
183560:1     madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0
183561:1     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0
183562:1     clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271
183563:1     stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
183564:1     write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 <unfinished ...>
183566:1     <... write resumed>)              = 86
183575:1     getpgid(169)                      = 169
183576:1     getpgid(0)                        = 0
183577:1     kill(-169, SIGTERM <unfinished ...>
183579:1     <... kill resumed>)               = 0
183582:1     stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
183583:1     write(15, "shepherd[1]: Service guix-daemon"..., 51 <unfinished ...>
183585:1     <... write resumed>)              = 51
183594:1     getuid()                          = 0
183595:1     rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0
183596:1     write(6, "\1", 1 <unfinished ...>
183598:1     <... write resumed>)              = 1
183605:1     futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>
--8<---------------cut here---------------end--------------->8---

and is then blocking forever.

I cannot find any trace of locking in Shepherd, except for
"without-automatic-finalization" procedure.

Thanks,

Mathieu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-strace-service.patch --]
[-- Type: text/x-diff, Size: 3177 bytes --]

From 13ac475ad953509cbbf88547e4f72efc08d31055 Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <othacehe@gnu.org>
Date: Thu, 18 Jun 2020 14:13:15 +0200
Subject: [PATCH] strace service

---
 gnu/services/base.scm  |  3 +++
 gnu/services/linux.scm | 25 ++++++++++++++++++++++++-
 gnu/system/install.scm |  3 +++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/gnu/services/base.scm b/gnu/services/base.scm
index 6ea7ef8e7e..d80c6cca6d 100644
--- a/gnu/services/base.scm
+++ b/gnu/services/base.scm
@@ -34,6 +34,7 @@
   #:use-module (guix deprecation)
   #:use-module (gnu services)
   #:use-module (gnu services admin)
+  #:use-module (gnu services linux)
   #:use-module (gnu services shepherd)
   #:use-module (gnu system pam)
   #:use-module (gnu system shadow)                ; 'user-account', etc.
@@ -2386,6 +2387,8 @@ to handle."
         (service guix-service-type)
         (service nscd-service-type)
 
+        (service strace-service-type 1)
+
         (service rottlog-service-type)
 
         ;; The LVM2 rules are needed as soon as LVM2 or the device-mapper is
diff --git a/gnu/services/linux.scm b/gnu/services/linux.scm
index 12934c2084..96889dfb7f 100644
--- a/gnu/services/linux.scm
+++ b/gnu/services/linux.scm
@@ -42,7 +42,9 @@
             earlyoom-configuration-send-notification-command
             earlyoom-service-type
 
-            kernel-module-loader-service-type))
+            kernel-module-loader-service-type
+
+            strace-service-type))
 
 \f
 ;;;
@@ -177,3 +179,24 @@ representation."
    (compose concatenate)
    (extend append)
    (default-value '())))
+
+\f
+;;;
+;;; Strace service.
+;;;
+
+(define strace-service-type
+  (shepherd-service-type
+   'strace
+   (lambda (pid)
+     (shepherd-service
+      (requirement '(root-file-system user-processes))
+      (provision (list (string->symbol
+                        (string-append "strace-" (number->string pid)))))
+      (documentation "Strace a running process.")
+      (start #~(make-forkexec-constructor
+                (list #$(file-append strace "/bin/strace")
+                      "-f" "-p" (number->string #$pid)
+                      "-o" (string-append "/var/log/strace-"
+                                          (number->string #$pid) ".log"))))
+      (stop   #~(make-kill-destructor))))))
diff --git a/gnu/system/install.scm b/gnu/system/install.scm
index d0ff2e7c52..f2f1e9414c 100644
--- a/gnu/system/install.scm
+++ b/gnu/system/install.scm
@@ -34,6 +34,7 @@
   #:use-module (gnu installer)
   #:use-module (gnu system locale)
   #:use-module (gnu services dbus)
+  #:use-module (gnu services linux)
   #:use-module (gnu services networking)
   #:use-module (gnu services shepherd)
   #:use-module (gnu services ssh)
@@ -378,6 +379,8 @@ Access documentation at any time by pressing Alt-F2.\x1b[0m
           ;; regulations-compliant WiFi access.
           (udev-service #:rules (list lvm2 crda))
 
+          (service strace-service-type 1)
+
           ;; Add the 'cow-store' service, which users have to start manually
           ;; since it takes the installation directory as an argument.
           (cow-store-service)
-- 
2.26.2


[-- Attachment #3: strace-1.log.xz --]
[-- Type: application/x-xz, Size: 622104 bytes --]

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* bug#41948: Shepherd deadlocks
  2020-06-19  8:41 bug#41948: Shepherd deadlocks Mathieu Othacehe
@ 2020-06-19 12:10 ` Mathieu Othacehe
  2020-06-20  0:16 ` Michael Rohleder
  2020-06-20 10:31 ` Ludovic Courtès
  2 siblings, 0 replies; 12+ messages in thread
From: Mathieu Othacehe @ 2020-06-19 12:10 UTC (permalink / raw)
  To: 41948


> 183605:1     futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>

I think this is caused by a "pthread_join", most probably the one in
"stop_finalization_thread" that is called right before forking a new
process. The fact that we hang here probably means that the finalizer
thread itself is hanging, not sure why.

It looks like what was reported by Ludo here:
https://issues.guix.info/31925.

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 12+ messages in thread

* bug#41948: Shepherd deadlocks
  2020-06-19  8:41 bug#41948: Shepherd deadlocks Mathieu Othacehe
  2020-06-19 12:10 ` Mathieu Othacehe
@ 2020-06-20  0:16 ` Michael Rohleder
  2020-06-20 10:31 ` Ludovic Courtès
  2 siblings, 0 replies; 12+ messages in thread
From: Michael Rohleder @ 2020-06-20  0:16 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 41948

[-- Attachment #1: Type: text/plain, Size: 1270 bytes --]

Hi Mathieu,

Mathieu Othacehe <othacehe@gnu.org> writes:
> This is not always reproducible. In order to gather some information, I
> created a Shepherd "strace" service that logs what's happening in
> Shepherd itself (patch attached).

> +(define strace-service-type
> +  (shepherd-service-type
> +   'strace
> +   (lambda (pid)
> +     (shepherd-service
> +      (requirement '(root-file-system user-processes))
> +      (provision (list (string->symbol
> +                        (string-append "strace-" (number->string pid)))))
> +      (documentation "Strace a running process.")
> +      (start #~(make-forkexec-constructor
> +                (list #$(file-append strace "/bin/strace")
> +                      "-f" "-p" (number->string #$pid)
> +                      "-o" (string-append "/var/log/strace-"
> +                                          (number->string #$pid) ".log"))))
> +      (stop   #~(make-kill-destructor))))))

Thank you very much for this nice strace service!

I'am debugging a "shepherd hangs" problem in the context of lvm and
lvmcache...
This service helps me a lot, ty!


-- 
>Ever heard of .cshrc?
That's a city in Bosnia.  Right?
(Discussion in comp.os.linux.misc on the intuitiveness of commands.)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* bug#41948: Shepherd deadlocks
  2020-06-19  8:41 bug#41948: Shepherd deadlocks Mathieu Othacehe
  2020-06-19 12:10 ` Mathieu Othacehe
  2020-06-20  0:16 ` Michael Rohleder
@ 2020-06-20 10:31 ` Ludovic Courtès
  2020-08-16  9:56   ` Mathieu Othacehe
  2 siblings, 1 reply; 12+ messages in thread
From: Ludovic Courtès @ 2020-06-20 10:31 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 41948

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

> When running "gui-installed-desktop-os-encrypted" test, Shepherd seems
> to deadlock when restarting "guix-daemon". This can happen at different
> stages:
>
> * In "umount-cow-store" procedure, just before finishing the install.
>
> * During "set-http-proxy" tests inside the marionette.
>   
> This is not always reproducible. In order to gather some information, I
> created a Shepherd "strace" service that logs what's happening in
> Shepherd itself (patch attached).

We should be able to reproduce it with much simpler tests then, right?
Like maybe “while : ; do herd restart guix-daemon ; done” or similar?

> It seems that, just after blocking signals, in "fork+exec-command", I
> guess, Shepherd is taking a lock:
>
> 183553:1     chdir("/")                        = 0
> 183554:1     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0
> 183555:1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
> 183560:1     madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0
> 183561:1     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0
> 183562:1     clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271
> 183563:1     stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
> 183564:1     write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 <unfinished ...>
> 183566:1     <... write resumed>)              = 86
> 183575:1     getpgid(169)                      = 169
> 183576:1     getpgid(0)                        = 0
> 183577:1     kill(-169, SIGTERM <unfinished ...>
> 183579:1     <... kill resumed>)               = 0
> 183582:1     stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
> 183583:1     write(15, "shepherd[1]: Service guix-daemon"..., 51 <unfinished ...>
> 183585:1     <... write resumed>)              = 51
> 183594:1     getuid()                          = 0
> 183595:1     rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0
> 183596:1     write(6, "\1", 1 <unfinished ...>
> 183598:1     <... write resumed>)              = 1
> 183605:1     futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>
>
> and is then blocking forever.

When that happens, we should check how many threads exist in PID 1.
There should be the finalization thread and the main thread, plus the
signal thread (because there are still ‘sigaction’ calls in the ‘main’
procedure), plus the GC marker threads.

In <https://issues.guix.gnu.org/31925#6>, Andy suggests that the signal
thread is not properly handled; indeed it takes locks and we don’t try
to shut it down upon fork.  However, when using signalfd, the signal
thread must be stuck in its ‘read’ call in ‘read_signal_pipe_data’, so I
don’t see how it could cause problems.

The GC threads are presumably taken care of by the atfork handler in
libgc.

Thoughts?

Ludo’.




^ permalink raw reply	[flat|nested] 12+ messages in thread

* bug#41948: Shepherd deadlocks
  2020-06-20 10:31 ` Ludovic Courtès
@ 2020-08-16  9:56   ` Mathieu Othacehe
  2021-05-07 21:49     ` Ludovic Courtès
  2021-05-08  9:43     ` Ludovic Courtès
  0 siblings, 2 replies; 12+ messages in thread
From: Mathieu Othacehe @ 2020-08-16  9:56 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 41948

[-- Attachment #1: Type: text/plain, Size: 2301 bytes --]


Hey Ludo,

> We should be able to reproduce it with much simpler tests then, right?
> Like maybe “while : ; do herd restart guix-daemon ; done” or similar?

Well I tried that without success. Then I had a closer look to the
strace log.

Turns out there are two concurrent "finalizer" threads:

--8<---------------cut here---------------start------------->8---
1     clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271
--8<---------------cut here---------------end--------------->8---

and this one,

--8<---------------cut here---------------start------------->8---
217   <... clone resumed>, parent_tid=[253], tls=0x7f1799309700, child_tidptr=0x7f17993099d0) = 253
--8<---------------cut here---------------end--------------->8---

The first one is spawned from Shepherd directly. The other one is
spawned from the forked process in "marionette-shepherd-service".

Those two finalizer threads share the same pipe. When we try to
stop the finalizer thread in Shepherd, right before forking a new
process, we send a '\1' byte to the finalizer pipe.

--8<---------------cut here---------------start------------->8---
1     write(6, "\1", 1 <unfinished ...>
--8<---------------cut here---------------end--------------->8---

which is received by (line 183597): 

--8<---------------cut here---------------start------------->8---
253   <... read resumed>"\1", 1)        = 1
--8<---------------cut here---------------end--------------->8---

the marionette finalizer thread. Then, we pthread_join the Shepherd
finalizer thread, which never stops! Quite unfortunate.

Here's a small reproducer attached. So unless I'm wrong this is a Guile
issue, that will cause any program that uses at least two primitive-fork
calls to possibly hang.

I'm quite convinced that those two bugs are directly related:

* https://issues.guix.info/31925
* https://issues.guix.gnu.org/42353

Now regarding the fix of this issue, I guess that a process forked with
"primitive-fork" in Guile should close it's parent finalizer pipe and
open a new one.

WDYT?

Thanks,

Mathieu


[-- Attachment #2: t.scm --]
[-- Type: application/octet-stream, Size: 315 bytes --]

(use-modules (shepherd service)
             (ice-9 match))

(match (primitive-fork)
  (0
   (while #t
     (gc)
     (usleep 200000)))
  (pid
   (let loop ((count 0))
     (format #t "Forking ~a~%" count)
     (fork+exec-command '("/bin/sh" "-c" "sleep 1"))
     (usleep (random 200000))
     (loop (1+ count)))))

^ permalink raw reply	[flat|nested] 12+ messages in thread

* bug#41948: Shepherd deadlocks
  2020-08-16  9:56   ` Mathieu Othacehe
@ 2021-05-07 21:49     ` Ludovic Courtès
  2021-05-07 22:07       ` Ludovic Courtès
  2021-05-08  9:43     ` Ludovic Courtès
  1 sibling, 1 reply; 12+ messages in thread
From: Ludovic Courtès @ 2021-05-07 21:49 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 41948

[-- Attachment #1: Type: text/plain, Size: 3057 bytes --]

Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

> Those two finalizer threads share the same pipe. When we try to
> stop the finalizer thread in Shepherd, right before forking a new
> process, we send a '\1' byte to the finalizer pipe.
>
> 1     write(6, "\1", 1 <unfinished ...>
>
>
> which is received by (line 183597): 
>
> 253   <... read resumed>"\1", 1)        = 1
>
> the marionette finalizer thread. Then, we pthread_join the Shepherd
> finalizer thread, which never stops! Quite unfortunate.

While working on a fix for this issue (finalizer pipe shared between
parent and child process), I found the ‘sleep_pipe’ of the main thread
is also shared between the parent and its child.

Attached is a reproducer.  It prints something like this before hanging:

--8<---------------cut here---------------start------------->8---
$ guile ~/src/guile-debugging/signal-pipe.scm
parent: 25947
child: 25953
alarm in parent!
alarm in parent!
alarm in parent!

[...]

alarm in parent!
alarm in parent!
child woken up!
--8<---------------cut here---------------end--------------->8---

“child woken up” means that it’s the child process that won the race
reading on the sleep pipe (from ‘scm_std_select’).

The parent process then hangs because, in ‘scm_std_select’, it did:

  1. select(1), which returned due to available data on ‘wakeup_fd’;

  2. ‘full_read (wakeup_fd, &dummy, 1)’ gets stuck forever in read(2)
     because the child process read that byte in the meantime so
     there’s nothing left to read.

Here’s the sequence:

--8<---------------cut here---------------start------------->8---
25947 select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
25947 getpid()                          = 25947
25947 kill(25947, SIGALRM)              = 0
25947 --- SIGALRM {si_signo=SIGALRM, si_code=SI_USER, si_pid=25947, si_uid=1000} ---
25947 write(8, "\16", 1)                = 1
25947 rt_sigreturn({mask=[]})           = 0
25952 <... read resumed>"\16", 1)       = 1
25947 rt_sigprocmask(SIG_BLOCK, NULL,  <unfinished ...>
25952 write(4, "\0", 1 <unfinished ...>
25947 <... rt_sigprocmask resumed>[], 8) = 0
25953 <... select resumed>)             = 1 (in [3], left {tv_sec=0, tv_usec=346370})
25952 <... write resumed>)              = 1
25947 select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000} <unfinished ...>
25953 read(3,  <unfinished ...>
25952 rt_sigprocmask(SIG_BLOCK, NULL,  <unfinished ...>
25947 <... select resumed>)             = 1 (in [3], left {tv_sec=0, tv_usec=99999})
25953 <... read resumed>"\0", 1)        = 1
25947 read(3,  <unfinished ...>
25952 <... rt_sigprocmask resumed>~[KILL STOP PWR RTMIN RT_1], 8) = 0
25953 write(1, "child woken up!\n", 16 <unfinished ...>
25952 read(7,  <unfinished ...>
--8<---------------cut here---------------end--------------->8---

Notice how “\16” (= SIGALRM) is written by the parent’s main thread and
read by the child’s main thread.

Thoughts?

Ludo’.


[-- Attachment #2: the reproducer --]
[-- Type: text/plain, Size: 760 bytes --]

;; https://issues.guix.gnu.org/41948

(use-modules (ice-9 match))

(setvbuf (current-output-port) 'line)
(sigaction SIGCHLD pk)                            ;start signal thread

(match (primitive-fork)
  (0
   (format #t "child: ~a~%" (getpid))
   (let loop ()
     (unless (zero? (usleep 500000))
       ;; If this happens, it means the select(2) call in 'scm_std_select'
       ;; returned because one of our file descriptors had input data
       ;; available (which shouldn't happen).
       (format #t "child woken up!~%"))
     (loop)))
  (pid
   (format #t "parent: ~a~%" (getpid))
   (sigaction SIGALRM (lambda _
                        (format #t "alarm in parent!~%")))
   (let loop ()
     (kill (getpid) SIGALRM)
     (usleep 100000)
     (loop))))

^ permalink raw reply	[flat|nested] 12+ messages in thread

* bug#41948: Shepherd deadlocks
  2021-05-07 21:49     ` Ludovic Courtès
@ 2021-05-07 22:07       ` Ludovic Courtès
  2021-05-08 20:52         ` Ludovic Courtès
  0 siblings, 1 reply; 12+ messages in thread
From: Ludovic Courtès @ 2021-05-07 22:07 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 41948

[-- Attachment #1: Type: text/plain, Size: 420 bytes --]

Ludovic Courtès <ludo@gnu.org> skribis:

> While working on a fix for this issue (finalizer pipe shared between
> parent and child process), I found the ‘sleep_pipe’ of the main thread
> is also shared between the parent and its child.

Here’s a patch that fixes the problem as exposed by the reproducer.

Thoughts?

(We’d need these pipes to be “O_CLOFORK” rather than O_CLOEXEC.)

Ludo’.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-patch, Size: 1207 bytes --]

diff --git a/libguile/posix.c b/libguile/posix.c
index eaf12de32..e0c157c20 100644
--- a/libguile/posix.c
+++ b/libguile/posix.c
@@ -1217,6 +1217,29 @@ SCM_DEFINE (scm_execle, "execle", 2, 0, 1,
 #undef FUNC_NAME
 
 #ifdef HAVE_FORK
+static void *
+do_fork (void *ret)
+{
+  pid_t pid = fork ();
+
+  if (pid == 0)
+    {
+      /* The child process must not share its sleep pipe with the
+         parent.  */
+      int err;
+      scm_thread *t = SCM_I_CURRENT_THREAD;
+
+      close (t->sleep_pipe[0]);
+      close (t->sleep_pipe[1]);
+      err = pipe2 (t->sleep_pipe, O_CLOEXEC);
+      if (err != 0)
+        abort ();
+    }
+
+  * (pid_t *) ret = pid;
+  return NULL;
+}
+
 SCM_DEFINE (scm_fork, "primitive-fork", 0, 0, 0,
             (),
 	    "Creates a new \"child\" process by duplicating the current \"parent\" process.\n"
@@ -1244,7 +1267,9 @@ SCM_DEFINE (scm_fork, "primitive-fork", 0, 0, 0,
         "         further behavior unspecified.  See \"Processes\" in the\n"
         "         manual, for more information.\n"),
        scm_current_warning_port ());
-  pid = fork ();
+
+  scm_without_guile (do_fork, &pid);
+
   if (pid == -1)
     SCM_SYSERROR;
   return scm_from_int (pid);

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* bug#41948: Shepherd deadlocks
  2020-08-16  9:56   ` Mathieu Othacehe
  2021-05-07 21:49     ` Ludovic Courtès
@ 2021-05-08  9:43     ` Ludovic Courtès
  2021-05-08 13:49       ` Andrew Whatson
  1 sibling, 1 reply; 12+ messages in thread
From: Ludovic Courtès @ 2021-05-08  9:43 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 41948

[-- Attachment #1: Type: text/plain, Size: 738 bytes --]

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

> Those two finalizer threads share the same pipe. When we try to
> stop the finalizer thread in Shepherd, right before forking a new
> process, we send a '\1' byte to the finalizer pipe.
>
> 1     write(6, "\1", 1 <unfinished ...>
>
>
> which is received by (line 183597): 
>
> 253   <... read resumed>"\1", 1)        = 1
>
> the marionette finalizer thread. Then, we pthread_join the Shepherd
> finalizer thread, which never stops! Quite unfortunate.

And here’s the patch for this pipe: it closes the finalizer pipe
pre-fork, and it gets reopened lazily.

Together with the ‘sleep_pipe’ patch, it appears to fix the problems
described here.

Ludo’.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-patch, Size: 2701 bytes --]

diff --git a/libguile/finalizers.c b/libguile/finalizers.c
index 0ae165fd1..5ddc7dbef 100644
--- a/libguile/finalizers.c
+++ b/libguile/finalizers.c
@@ -24,6 +24,7 @@
 # include <config.h>
 #endif
 
+#include <assert.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <full-write.h>
@@ -170,7 +171,7 @@ queue_finalizer_async (void)
 
 #if SCM_USE_PTHREAD_THREADS
 
-static int finalization_pipe[2];
+static int finalization_pipe[2] = { -1, -1 };
 static scm_i_pthread_mutex_t finalization_thread_lock =
   SCM_I_PTHREAD_MUTEX_INITIALIZER;
 static pthread_t finalization_thread;
@@ -254,6 +255,13 @@ start_finalization_thread (void)
   scm_i_pthread_mutex_lock (&finalization_thread_lock);
   if (!finalization_thread_is_running)
     {
+      assert (finalization_pipe[0] == -1);
+
+      if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)
+        scm_syserror (NULL);
+
+      GC_set_finalizer_notifier (notify_finalizers_to_run);
+
       /* Use the raw pthread API and scm_with_guile, because we don't want
 	 to block on any lock that scm_spawn_thread might want to take,
 	 and we don't want to inherit the dynamic state (fluids) of the
@@ -276,6 +284,12 @@ stop_finalization_thread (void)
       notify_about_to_fork ();
       if (pthread_join (finalization_thread, NULL))
         perror ("joining finalization thread");
+
+      close (finalization_pipe[0]);
+      close (finalization_pipe[1]);
+      finalization_pipe[0] = -1;
+      finalization_pipe[1] = -1;
+
       finalization_thread_is_running = 0;
     }
   scm_i_pthread_mutex_unlock (&finalization_thread_lock);
@@ -284,7 +298,6 @@ stop_finalization_thread (void)
 static void
 spawn_finalizer_thread (void)
 {
-  GC_set_finalizer_notifier (notify_finalizers_to_run);
   start_finalization_thread ();
 }
 
@@ -368,8 +381,6 @@ scm_set_automatic_finalization_enabled (int enabled_p)
   if (enabled_p)
     {
 #if SCM_USE_PTHREAD_THREADS
-      if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)
-        scm_syserror (NULL);
       GC_set_finalizer_notifier (spawn_finalizer_thread);
 #else
       GC_set_finalizer_notifier (queue_finalizer_async);
@@ -381,10 +392,6 @@ scm_set_automatic_finalization_enabled (int enabled_p)
 
 #if SCM_USE_PTHREAD_THREADS
       stop_finalization_thread ();
-      close (finalization_pipe[0]);
-      close (finalization_pipe[1]);
-      finalization_pipe[0] = -1;
-      finalization_pipe[1] = -1;
 #endif
     }
 
@@ -423,10 +430,6 @@ scm_init_finalizer_thread (void)
 {
 #if SCM_USE_PTHREAD_THREADS
   if (automatic_finalization_p)
-    {
-      if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)
-        scm_syserror (NULL);
     GC_set_finalizer_notifier (spawn_finalizer_thread);
-    }
 #endif
 }

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* bug#41948: Shepherd deadlocks
  2021-05-08  9:43     ` Ludovic Courtès
@ 2021-05-08 13:49       ` Andrew Whatson
  2021-05-08 13:49         ` bug#41948: [PATCH] Fix some finalizer thread race conditions Andrew Whatson
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Whatson @ 2021-05-08 13:49 UTC (permalink / raw)
  To: 41948

Hi,

I've reviewed the finalizer patch and made some changes to ensure that
it works correctly if pipe creation or thread creation fail.

Thread creation fails in an out-of-memory scenario, so this part can be
verified by running Guile's test-out-of-memory test case.  You'll need a
libgc built with --disable-munmap for the test to survive long enough to
retry launching the finalizer thread.

Cheers!






^ permalink raw reply	[flat|nested] 12+ messages in thread

* bug#41948: [PATCH] Fix some finalizer thread race conditions
  2021-05-08 13:49       ` Andrew Whatson
@ 2021-05-08 13:49         ` Andrew Whatson
  2021-05-08 20:50           ` bug#41948: Shepherd deadlocks Ludovic Courtès
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Whatson @ 2021-05-08 13:49 UTC (permalink / raw)
  To: 41948; +Cc: Andrew Whatson

* libguile/finalizers.c (finalization_pipe): Initialize.
(reset_finalization_pipe): Factored out.
(start_finalization_thread): Create the pipe immediately before
launching the thread.  Ensure the pipe is cleaned up if thread creation
fails.  Update the finalizer callback if thread creation succeeds.
(stop_finalization_thread): Clean up the pipe after stopping the thread.
(spawn_finalizer_thread): Remove finalizer callback logic.
(scm_set_automatic_finalization_enabled): Remove pipe management.
(scm_init_finalizer_thread): Remove pipe management.
---
 libguile/finalizers.c | 45 +++++++++++++++++++++++++++----------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/libguile/finalizers.c b/libguile/finalizers.c
index 0ae165fd1..5122e5fe3 100644
--- a/libguile/finalizers.c
+++ b/libguile/finalizers.c
@@ -24,6 +24,7 @@
 # include <config.h>
 #endif
 
+#include <assert.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <full-write.h>
@@ -170,7 +171,7 @@ queue_finalizer_async (void)
 
 #if SCM_USE_PTHREAD_THREADS
 
-static int finalization_pipe[2];
+static int finalization_pipe[2] = { -1, -1 };
 static scm_i_pthread_mutex_t finalization_thread_lock =
   SCM_I_PTHREAD_MUTEX_INITIALIZER;
 static pthread_t finalization_thread;
@@ -190,6 +191,15 @@ notify_about_to_fork (void)
   full_write (finalization_pipe[1], &byte, 1);
 }
 
+static void
+reset_finalization_pipe (void)
+{
+  close (finalization_pipe[0]);
+  close (finalization_pipe[1]);
+  finalization_pipe[0] = -1;
+  finalization_pipe[1] = -1;
+}
+
 struct finalization_pipe_data
 {
   char byte;
@@ -254,15 +264,25 @@ start_finalization_thread (void)
   scm_i_pthread_mutex_lock (&finalization_thread_lock);
   if (!finalization_thread_is_running)
     {
+      assert (finalization_pipe[0] == -1);
+
       /* Use the raw pthread API and scm_with_guile, because we don't want
 	 to block on any lock that scm_spawn_thread might want to take,
 	 and we don't want to inherit the dynamic state (fluids) of the
 	 caller.  */
-      if (pthread_create (&finalization_thread, NULL,
-			  run_finalization_thread, NULL))
-	perror ("error creating finalization thread");
+      if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)
+	perror ("error creating finalization pipe");
+      else if (pthread_create (&finalization_thread, NULL,
+			       run_finalization_thread, NULL))
+	{
+	  reset_finalization_pipe ();
+	  perror ("error creating finalization thread");
+	}
       else
-	finalization_thread_is_running = 1;
+	{
+	  GC_set_finalizer_notifier (notify_finalizers_to_run);
+	  finalization_thread_is_running = 1;
+	}
     }
   scm_i_pthread_mutex_unlock (&finalization_thread_lock);
 }
@@ -276,6 +296,8 @@ stop_finalization_thread (void)
       notify_about_to_fork ();
       if (pthread_join (finalization_thread, NULL))
         perror ("joining finalization thread");
+
+      reset_finalization_pipe ();
       finalization_thread_is_running = 0;
     }
   scm_i_pthread_mutex_unlock (&finalization_thread_lock);
@@ -284,7 +306,6 @@ stop_finalization_thread (void)
 static void
 spawn_finalizer_thread (void)
 {
-  GC_set_finalizer_notifier (notify_finalizers_to_run);
   start_finalization_thread ();
 }
 
@@ -368,8 +389,6 @@ scm_set_automatic_finalization_enabled (int enabled_p)
   if (enabled_p)
     {
 #if SCM_USE_PTHREAD_THREADS
-      if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)
-        scm_syserror (NULL);
       GC_set_finalizer_notifier (spawn_finalizer_thread);
 #else
       GC_set_finalizer_notifier (queue_finalizer_async);
@@ -381,10 +400,6 @@ scm_set_automatic_finalization_enabled (int enabled_p)
 
 #if SCM_USE_PTHREAD_THREADS
       stop_finalization_thread ();
-      close (finalization_pipe[0]);
-      close (finalization_pipe[1]);
-      finalization_pipe[0] = -1;
-      finalization_pipe[1] = -1;
 #endif
     }
 
@@ -423,10 +438,6 @@ scm_init_finalizer_thread (void)
 {
 #if SCM_USE_PTHREAD_THREADS
   if (automatic_finalization_p)
-    {
-      if (pipe2 (finalization_pipe, O_CLOEXEC) != 0)
-        scm_syserror (NULL);
-      GC_set_finalizer_notifier (spawn_finalizer_thread);
-    }
+    GC_set_finalizer_notifier (spawn_finalizer_thread);
 #endif
 }
-- 
2.31.1





^ permalink raw reply related	[flat|nested] 12+ messages in thread

* bug#41948: Shepherd deadlocks
  2021-05-08 13:49         ` bug#41948: [PATCH] Fix some finalizer thread race conditions Andrew Whatson
@ 2021-05-08 20:50           ` Ludovic Courtès
  0 siblings, 0 replies; 12+ messages in thread
From: Ludovic Courtès @ 2021-05-08 20:50 UTC (permalink / raw)
  To: Andrew Whatson; +Cc: 41948

Hi Andrew,

Andrew Whatson <whatson@gmail.com> skribis:

> * libguile/finalizers.c (finalization_pipe): Initialize.
> (reset_finalization_pipe): Factored out.
> (start_finalization_thread): Create the pipe immediately before
> launching the thread.  Ensure the pipe is cleaned up if thread creation
> fails.  Update the finalizer callback if thread creation succeeds.
> (stop_finalization_thread): Clean up the pipe after stopping the thread.
> (spawn_finalizer_thread): Remove finalizer callback logic.
> (scm_set_automatic_finalization_enabled): Remove pipe management.
> (scm_init_finalizer_thread): Remove pipe management.

I tweaked the commit log and pushed as
5a281e35f4a5ae78fbcf10591d9358bec8f0bee0.

Thanks!

Ludo’.




^ permalink raw reply	[flat|nested] 12+ messages in thread

* bug#41948: Shepherd deadlocks
  2021-05-07 22:07       ` Ludovic Courtès
@ 2021-05-08 20:52         ` Ludovic Courtès
  0 siblings, 0 replies; 12+ messages in thread
From: Ludovic Courtès @ 2021-05-08 20:52 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 41948-done

Ludovic Courtès <ludo@gnu.org> skribis:

> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> While working on a fix for this issue (finalizer pipe shared between
>> parent and child process), I found the ‘sleep_pipe’ of the main thread
>> is also shared between the parent and its child.
>
> Here’s a patch that fixes the problem as exposed by the reproducer.

I went ahead and pushed as 381291f5ff4480afbb197bf5e5a2272cfe54a386,
together with a test derived from the one I posted earlier.

With this and Andrew’s finalizer pipe patch, we should be good!

This will be in the upcoming Guile 3.0.7 release.

Ludo’.




^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-05-08 21:52 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-19  8:41 bug#41948: Shepherd deadlocks Mathieu Othacehe
2020-06-19 12:10 ` Mathieu Othacehe
2020-06-20  0:16 ` Michael Rohleder
2020-06-20 10:31 ` Ludovic Courtès
2020-08-16  9:56   ` Mathieu Othacehe
2021-05-07 21:49     ` Ludovic Courtès
2021-05-07 22:07       ` Ludovic Courtès
2021-05-08 20:52         ` Ludovic Courtès
2021-05-08  9:43     ` Ludovic Courtès
2021-05-08 13:49       ` Andrew Whatson
2021-05-08 13:49         ` bug#41948: [PATCH] Fix some finalizer thread race conditions Andrew Whatson
2021-05-08 20:50           ` bug#41948: Shepherd deadlocks Ludovic Courtès

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).