unofficial mirror of bug-guix@gnu.org 
 help / color / Atom feed
* bug#41948: Shepherd deadlocks
@ 2020-06-19  8:41 Mathieu Othacehe
  2020-06-19 12:10 ` Mathieu Othacehe
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Mathieu Othacehe @ 2020-06-19  8:41 UTC (permalink / raw)
  To: 41948

[-- Attachment #1: Type: text/plain, Size: 2414 bytes --]


Hello,

When running "gui-installed-desktop-os-encrypted" test, Shepherd seems
to deadlock when restarting "guix-daemon". This can happen at different
stages:

* In "umount-cow-store" procedure, just before finishing the install.

* During "set-http-proxy" tests inside the marionette.
  
This is not always reproducible. In order to gather some information, I
created a Shepherd "strace" service that logs what's happening in
Shepherd itself (patch attached).

It seems that, just after blocking signals, in "fork+exec-command", I
guess, Shepherd is taking a lock:

--8<---------------cut here---------------start------------->8---
183553:1     chdir("/")                        = 0
183554:1     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0
183555:1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
183560:1     madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0
183561:1     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0
183562:1     clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271
183563:1     stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
183564:1     write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 <unfinished ...>
183566:1     <... write resumed>)              = 86
183575:1     getpgid(169)                      = 169
183576:1     getpgid(0)                        = 0
183577:1     kill(-169, SIGTERM <unfinished ...>
183579:1     <... kill resumed>)               = 0
183582:1     stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
183583:1     write(15, "shepherd[1]: Service guix-daemon"..., 51 <unfinished ...>
183585:1     <... write resumed>)              = 51
183594:1     getuid()                          = 0
183595:1     rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0
183596:1     write(6, "\1", 1 <unfinished ...>
183598:1     <... write resumed>)              = 1
183605:1     futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>
--8<---------------cut here---------------end--------------->8---

and is then blocking forever.

I cannot find any trace of locking in Shepherd, except for
"without-automatic-finalization" procedure.

Thanks,

Mathieu

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-strace-service.patch --]
[-- Type: text/x-diff, Size: 3177 bytes --]

From 13ac475ad953509cbbf88547e4f72efc08d31055 Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <othacehe@gnu.org>
Date: Thu, 18 Jun 2020 14:13:15 +0200
Subject: [PATCH] strace service

---
 gnu/services/base.scm  |  3 +++
 gnu/services/linux.scm | 25 ++++++++++++++++++++++++-
 gnu/system/install.scm |  3 +++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/gnu/services/base.scm b/gnu/services/base.scm
index 6ea7ef8e7e..d80c6cca6d 100644
--- a/gnu/services/base.scm
+++ b/gnu/services/base.scm
@@ -34,6 +34,7 @@
   #:use-module (guix deprecation)
   #:use-module (gnu services)
   #:use-module (gnu services admin)
+  #:use-module (gnu services linux)
   #:use-module (gnu services shepherd)
   #:use-module (gnu system pam)
   #:use-module (gnu system shadow)                ; 'user-account', etc.
@@ -2386,6 +2387,8 @@ to handle."
         (service guix-service-type)
         (service nscd-service-type)
 
+        (service strace-service-type 1)
+
         (service rottlog-service-type)
 
         ;; The LVM2 rules are needed as soon as LVM2 or the device-mapper is
diff --git a/gnu/services/linux.scm b/gnu/services/linux.scm
index 12934c2084..96889dfb7f 100644
--- a/gnu/services/linux.scm
+++ b/gnu/services/linux.scm
@@ -42,7 +42,9 @@
             earlyoom-configuration-send-notification-command
             earlyoom-service-type
 
-            kernel-module-loader-service-type))
+            kernel-module-loader-service-type
+
+            strace-service-type))
 
 \f
 ;;;
@@ -177,3 +179,24 @@ representation."
    (compose concatenate)
    (extend append)
    (default-value '())))
+
+\f
+;;;
+;;; Strace service.
+;;;
+
+(define strace-service-type
+  (shepherd-service-type
+   'strace
+   (lambda (pid)
+     (shepherd-service
+      (requirement '(root-file-system user-processes))
+      (provision (list (string->symbol
+                        (string-append "strace-" (number->string pid)))))
+      (documentation "Strace a running process.")
+      (start #~(make-forkexec-constructor
+                (list #$(file-append strace "/bin/strace")
+                      "-f" "-p" (number->string #$pid)
+                      "-o" (string-append "/var/log/strace-"
+                                          (number->string #$pid) ".log"))))
+      (stop   #~(make-kill-destructor))))))
diff --git a/gnu/system/install.scm b/gnu/system/install.scm
index d0ff2e7c52..f2f1e9414c 100644
--- a/gnu/system/install.scm
+++ b/gnu/system/install.scm
@@ -34,6 +34,7 @@
   #:use-module (gnu installer)
   #:use-module (gnu system locale)
   #:use-module (gnu services dbus)
+  #:use-module (gnu services linux)
   #:use-module (gnu services networking)
   #:use-module (gnu services shepherd)
   #:use-module (gnu services ssh)
@@ -378,6 +379,8 @@ Access documentation at any time by pressing Alt-F2.\x1b[0m
           ;; regulations-compliant WiFi access.
           (udev-service #:rules (list lvm2 crda))
 
+          (service strace-service-type 1)
+
           ;; Add the 'cow-store' service, which users have to start manually
           ;; since it takes the installation directory as an argument.
           (cow-store-service)
-- 
2.26.2


[-- Attachment #3: strace-1.log.xz --]
[-- Type: application/x-xz, Size: 622104 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* bug#41948: Shepherd deadlocks
  2020-06-19  8:41 bug#41948: Shepherd deadlocks Mathieu Othacehe
@ 2020-06-19 12:10 ` Mathieu Othacehe
  2020-06-20  0:16 ` Michael Rohleder
  2020-06-20 10:31 ` Ludovic Courtès
  2 siblings, 0 replies; 4+ messages in thread
From: Mathieu Othacehe @ 2020-06-19 12:10 UTC (permalink / raw)
  To: 41948


> 183605:1     futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>

I think this is caused by a "pthread_join", most probably the one in
"stop_finalization_thread" that is called right before forking a new
process. The fact that we hang here probably means that the finalizer
thread itself is hanging, not sure why.

It looks like what was reported by Ludo here:
https://issues.guix.info/31925.

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 4+ messages in thread

* bug#41948: Shepherd deadlocks
  2020-06-19  8:41 bug#41948: Shepherd deadlocks Mathieu Othacehe
  2020-06-19 12:10 ` Mathieu Othacehe
@ 2020-06-20  0:16 ` Michael Rohleder
  2020-06-20 10:31 ` Ludovic Courtès
  2 siblings, 0 replies; 4+ messages in thread
From: Michael Rohleder @ 2020-06-20  0:16 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 41948

[-- Attachment #1: Type: text/plain, Size: 1270 bytes --]

Hi Mathieu,

Mathieu Othacehe <othacehe@gnu.org> writes:
> This is not always reproducible. In order to gather some information, I
> created a Shepherd "strace" service that logs what's happening in
> Shepherd itself (patch attached).

> +(define strace-service-type
> +  (shepherd-service-type
> +   'strace
> +   (lambda (pid)
> +     (shepherd-service
> +      (requirement '(root-file-system user-processes))
> +      (provision (list (string->symbol
> +                        (string-append "strace-" (number->string pid)))))
> +      (documentation "Strace a running process.")
> +      (start #~(make-forkexec-constructor
> +                (list #$(file-append strace "/bin/strace")
> +                      "-f" "-p" (number->string #$pid)
> +                      "-o" (string-append "/var/log/strace-"
> +                                          (number->string #$pid) ".log"))))
> +      (stop   #~(make-kill-destructor))))))

Thank you very much for this nice strace service!

I'am debugging a "shepherd hangs" problem in the context of lvm and
lvmcache...
This service helps me a lot, ty!


-- 
>Ever heard of .cshrc?
That's a city in Bosnia.  Right?
(Discussion in comp.os.linux.misc on the intuitiveness of commands.)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* bug#41948: Shepherd deadlocks
  2020-06-19  8:41 bug#41948: Shepherd deadlocks Mathieu Othacehe
  2020-06-19 12:10 ` Mathieu Othacehe
  2020-06-20  0:16 ` Michael Rohleder
@ 2020-06-20 10:31 ` Ludovic Courtès
  2 siblings, 0 replies; 4+ messages in thread
From: Ludovic Courtès @ 2020-06-20 10:31 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 41948

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

> When running "gui-installed-desktop-os-encrypted" test, Shepherd seems
> to deadlock when restarting "guix-daemon". This can happen at different
> stages:
>
> * In "umount-cow-store" procedure, just before finishing the install.
>
> * During "set-http-proxy" tests inside the marionette.
>   
> This is not always reproducible. In order to gather some information, I
> created a Shepherd "strace" service that logs what's happening in
> Shepherd itself (patch attached).

We should be able to reproduce it with much simpler tests then, right?
Like maybe “while : ; do herd restart guix-daemon ; done” or similar?

> It seems that, just after blocking signals, in "fork+exec-command", I
> guess, Shepherd is taking a lock:
>
> 183553:1     chdir("/")                        = 0
> 183554:1     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0
> 183555:1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
> 183560:1     madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0
> 183561:1     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0
> 183562:1     clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271
> 183563:1     stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
> 183564:1     write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 <unfinished ...>
> 183566:1     <... write resumed>)              = 86
> 183575:1     getpgid(169)                      = 169
> 183576:1     getpgid(0)                        = 0
> 183577:1     kill(-169, SIGTERM <unfinished ...>
> 183579:1     <... kill resumed>)               = 0
> 183582:1     stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
> 183583:1     write(15, "shepherd[1]: Service guix-daemon"..., 51 <unfinished ...>
> 183585:1     <... write resumed>)              = 51
> 183594:1     getuid()                          = 0
> 183595:1     rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0
> 183596:1     write(6, "\1", 1 <unfinished ...>
> 183598:1     <... write resumed>)              = 1
> 183605:1     futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>
>
> and is then blocking forever.

When that happens, we should check how many threads exist in PID 1.
There should be the finalization thread and the main thread, plus the
signal thread (because there are still ‘sigaction’ calls in the ‘main’
procedure), plus the GC marker threads.

In <https://issues.guix.gnu.org/31925#6>, Andy suggests that the signal
thread is not properly handled; indeed it takes locks and we don’t try
to shut it down upon fork.  However, when using signalfd, the signal
thread must be stuck in its ‘read’ call in ‘read_signal_pipe_data’, so I
don’t see how it could cause problems.

The GC threads are presumably taken care of by the atfork handler in
libgc.

Thoughts?

Ludo’.




^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, back to index

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-19  8:41 bug#41948: Shepherd deadlocks Mathieu Othacehe
2020-06-19 12:10 ` Mathieu Othacehe
2020-06-20  0:16 ` Michael Rohleder
2020-06-20 10:31 ` Ludovic Courtès

unofficial mirror of bug-guix@gnu.org 

Archives are clonable:
	git clone --mirror https://yhetil.org/guix-bugs/0 guix-bugs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 guix-bugs guix-bugs/ https://yhetil.org/guix-bugs \
		bug-guix@gnu.org
	public-inbox-index guix-bugs

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.yhetil.org/yhetil.gnu.guix.bugs
	nntp://news.gmane.io/gmane.comp.gnu.guix.bugs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git