bug#63736: [Shepherd] Loss of SIGCHLD notifications

unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed

* bug#63736: [Shepherd] Loss of SIGCHLD notifications
@ 2023-05-26 12:14 Ludovic Courtès
  2023-05-27 17:00 ` Ludovic Courtès
  0 siblings, 1 reply; 3+ messages in thread
From: Ludovic Courtès @ 2023-05-26 12:14 UTC (permalink / raw)
  To: 63736

I experienced, with the Shepherd 0.10.0, a situation where ‘herd
restart’ would get stuck after the end-of-grace-period expiration
(restarting nscd in this case):

--8<---------------cut here---------------start------------->8---
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Status of nscd: 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   It is running since 08:43:57 (36 seconds ago). 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   Running value is 23753. 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   It is enabled. 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   Provides (nscd). 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   Requires (user-processes syslogd). 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   Will be respawned. 
May 26 08:44:33 localhost shepherd[1]: Stopping service nscd... 
[...]
May 26 08:44:38 localhost shepherd[1]: Grace period of 5 seconds is over; sending -23753 SIGKILL. 
--8<---------------cut here---------------end--------------->8---

The ‘herd restart’ process is indeed waiting, and the nscd process is
still around as zombie, meaning that shepherd didn’t call waitpid(2):

--8<---------------cut here---------------start------------->8---
$ sudo herd status nscd
Status of nscd:
  It is being stopped.
  It is enabled.
  Provides (nscd).
  Requires (user-processes syslogd).
  Will be respawned.
$ ps aux|grep nscd
root     23753  0.0  0.0      0     0 ?        Zs   08:43   0:00 [nscd] <defunct>
root     23870  0.0  0.1  49968 18088 ?        Sl   08:44   0:00 /gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/bin/guile --no-auto-compile /run/current-system/profile/bin/herd restart nscd
ludo     25280 33.3  0.0   6112  2304 pts/0    S+   09:32   0:00 grep --color=auto nscd
--8<---------------cut here---------------end--------------->8---

At that point, if I pick another service process and kill it, shepherd
again fails to react to SIGCHLD (or doesn’t receive it):

--8<---------------cut here---------------start------------->8---
$ sudo herd status ntpd
Status of ntpd:
  It is running since 08:44:31 AM (52 minutes ago).
  Running value is 23814.
  It is enabled.
  Provides (ntpd).
  Requires (user-processes networking).
  Will be respawned.
$ sudo kill 23814
$ sudo herd status ntpd
Status of ntpd:
  It is running since 08:44:31 AM (52 minutes ago).
  Running value is 23814.
  It is enabled.
  Provides (ntpd).
  Requires (user-processes networking).
  Will be respawned.
ludo@ribbon ~/src/guix$ ps 23814
  PID TTY      STAT   TIME COMMAND
23814 ?        Zs     0:00 [ntpd] <defunct>
--8<---------------cut here---------------end--------------->8---

If I repeat the same experiment while stracing shepherd, here’s what I
see:

--8<---------------cut here---------------start------------->8---
$ sudo herd status guix-publish
Status of guix-publish:
  It is running since 08:44:32 AM (55 minutes ago).
  Running value is 23822.
  It is enabled.
  Provides (guix-publish).
  Requires (user-processes guix-daemon avahi-daemon).
  Will be respawned.
$ sudo kill 23822
$ ps 23822
  PID TTY      STAT   TIME COMMAND
23822 ?        Zs     0:02 [guix publish] <defunct>
--8<---------------cut here---------------end--------------->8---

… and the strace:

--8<---------------cut here---------------start------------->8---
1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1     epoll_wait(13, [{events=EPOLLIN, data={u32=14, u64=14}}], 8, -1) = 1
1     accept4(14, {sa_family=AF_UNIX}, [112 => 2], SOCK_CLOEXEC|SOCK_NONBLOCK) = 25
1     accept4(14, 0x7ffe6f8f0be0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
1     epoll_ctl(13, EPOLL_CTL_MOD, 14, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=14, u64=14}}) = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     read(25, "(shepherd-command (version 0) (action status) (service guix-publish) (arguments ()) (directory \"/home/ludo/src/guix\"))", 1024) = 118
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     write(25, "(reply (version 0) (result ((service (version 0) (provides (guix-publish)) (requires (user-processes guix-daemon avahi-daemon)) (respawn? #t) (docstring \"[No documentation.]\") (enabled? #t) (running 23822) (conflicts ()) (last-respawns ()) (status-changes ((running . 1685083472) (starting . 1685083472) (stopped . 1685083469) (stopping . 1685083469) (running . 1684924319) (starting . 1684924319) (stopped . 1684924319) (stopping . 1684924319) (running . 1684853459) (starting . 1684853459) (stopped"..., 686) = 686
1     close(25)                         = 0
1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1     epoll_wait(13, [{events=EPOLLHUP, data={u32=30, u64=30}}], 8, -1) = 1
1     read(30, "", 4096)                = 0
1     close(30)                         = 0
1     close(36)                         = 0
1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1     epoll_wait(13, 
--8<---------------cut here---------------end--------------->8---

The signal FD still exists but we didn’t see any activity on it:

--8<---------------cut here---------------start------------->8---
$ sudo ls -lrt /proc/1/fd |grep signalfd
lrwx------ 1 root root 64 May 26 09:41 10 -> anon_inode:[signalfd]
--8<---------------cut here---------------end--------------->8---

Here’s the signal handling status (notice two signals queued):

--8<---------------cut here---------------start------------->8---
$ sudo cat /proc/1/status|grep -i Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: 0000000000014003
SigIgn: 0000000000001000
SigCgt: 0000000120814423
--8<---------------cut here---------------end--------------->8---

With:

  (define (signal-mask->list mask)
    (let loop ((sig 32) (lst '()))
      (if (zero? sig)
          lst
          (loop (- sig 1)
                (if (bit-set? (- sig 1) mask)
                    (cons sig lst)
                    lst)))))

… we see that the block mask is equal to ‘%precious-signals’ (good):

--8<---------------cut here---------------start------------->8---
scheme@(shepherd system)> (signal-mask->list #x0000000000014003)
$33 = (1 2 15 17)
--8<---------------cut here---------------end--------------->8---

… while the ignored mask lists SIGPIPE:

--8<---------------cut here---------------start------------->8---
scheme@(shepherd system)> (signal-mask->list #x0000000000001000)
$32 = (13)
--8<---------------cut here---------------end--------------->8---

However, the process also has 3 GC marker threads and 1 finalization
thread:

--8<---------------cut here---------------start------------->8---
(gdb) info threads
  Id   Target Id                                     Frame 
* 1    Thread 0x7fdbd530f380 (LWP 1) "shepherd"      0x00007fdbd5415626 in epoll_wait (epfd=13, 
    events=0x7fdbd203a1a0, maxevents=8, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  2    Thread 0x7fdbd4cd9640 (LWP 118) "GC-marker-0" __futex_abstimed_wait_common64 (private=0, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
    at futex-internal.c:57
  3    Thread 0x7fdbd44d8640 (LWP 119) "GC-marker-1" __futex_abstimed_wait_common64 (private=0, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
    at futex-internal.c:57
  4    Thread 0x7fdbd3cd7640 (LWP 120) "GC-marker-2" __futex_abstimed_wait_common64 (private=0, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 <mark_cv+40>)
    at futex-internal.c:57
  5    Thread 0x7fdbd3389640 (LWP 123) "shepherd"    __GI___libc_read (nbytes=1, 
    buf=0x7fdbd3388620, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26
--8<---------------cut here---------------end--------------->8---

These threads seem to be blocking every signal, which is good:

--8<---------------cut here---------------start------------->8---
$ sudo cat /proc/118/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/119/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/120/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/123/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5ffbfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
--8<---------------cut here---------------end--------------->8---

Then all of a sudden, as I’m conducting this experiment, I see:

--8<---------------cut here---------------start------------->8---
May 26 09:36:37 localhost ntpd[23814]: ntpd exiting on signal 15 (Terminated)
[…]
May 26 10:52:10 localhost shepherd[1]: Respawning nscd. 
May 26 10:52:10 localhost shepherd[1]: Service guix-publish (PID 23822) terminated with signal 15. 
May 26 10:52:10 localhost shepherd[1]: Respawning ntpd. 
May 26 10:52:10 localhost shepherd[1]: Respawning guix-publish. 
May 26 10:52:10 localhost shepherd[1]: Starting service nscd... 
May 26 10:52:10 localhost shepherd[1]: Starting service ntpd... 
May 26 10:52:10 localhost shepherd[1]: Service ntpd has been started. 
May 26 10:52:10 localhost shepherd[1]: Service ntpd started. 
May 26 10:52:10 localhost shepherd[1]: Service ntpd running with value 29148. 
--8<---------------cut here---------------end--------------->8---

10:52 is about the time I attached gdb to shepherd…

Interestingly, we get a “terminated with signal 15” message from
shepherd for guix-publish, but not for ntpd.

Long story short: there seems to be a problem with signal delivery.
Most likely, the initial grace period expiration above when stopping
nscd is a symptom of shepherd no longer receiving/processing SIGCHLD
rather than the cause.

To be continued…

Ludo’.




^ permalink raw reply	[flat|nested] 3+ messages in thread

* bug#63736: [Shepherd] Loss of SIGCHLD notifications
  2023-05-26 12:14 bug#63736: [Shepherd] Loss of SIGCHLD notifications Ludovic Courtès
@ 2023-05-27 17:00 ` Ludovic Courtès
  2023-11-23 20:46   ` Ludovic Courtès
  0 siblings, 1 reply; 3+ messages in thread
From: Ludovic Courtès @ 2023-05-27 17:00 UTC (permalink / raw)
  To: 63736

Ludovic Courtès <ludo@gnu.org> skribis:

> Long story short: there seems to be a problem with signal delivery.
> Most likely, the initial grace period expiration above when stopping
> nscd is a symptom of shepherd no longer receiving/processing SIGCHLD
> rather than the cause.

Another possibility is lockup: one of the relevant fibers is either gone
or stuck in ‘put-message’ or ‘get-message’.

I did two things:

  b9a37f3 shepherd: Make signal handling fiber an essential task.
  8ae2780 service: Do not attempt to restart transient services.

Commit 8ae2780 fixes a bug whereby ‘herd restart’ could end up
attempting to restart a transient service, which would lock up the
calling fiber because the service’s controlling fiber would first
receive the 'terminate message, so it would return and nobody would be
reading further messages send on its channel.

Commit b9a37f3 will allows us to ensure that the signal-handling fiber
never exits (and we’ll get a trace in the log if it tries to).

Ludo’.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* bug#63736: [Shepherd] Loss of SIGCHLD notifications
  2023-05-27 17:00 ` Ludovic Courtès
@ 2023-11-23 20:46   ` Ludovic Courtès
  0 siblings, 0 replies; 3+ messages in thread
From: Ludovic Courtès @ 2023-11-23 20:46 UTC (permalink / raw)
  To: 63736-done

Ludovic Courtès <ludo@gnu.org> skribis:

> Another possibility is lockup: one of the relevant fibers is either gone
> or stuck in ‘put-message’ or ‘get-message’.
>
> I did two things:
>
>   b9a37f3 shepherd: Make signal handling fiber an essential task.
>   8ae2780 service: Do not attempt to restart transient services.
>
> Commit 8ae2780 fixes a bug whereby ‘herd restart’ could end up
> attempting to restart a transient service, which would lock up the
> calling fiber because the service’s controlling fiber would first
> receive the 'terminate message, so it would return and nobody would be
> reading further messages send on its channel.
>
> Commit b9a37f3 will allows us to ensure that the signal-handling fiber
> never exits (and we’ll get a trace in the log if it tries to).

Apparently these commits, which made it in 0.10.1 months ago, fixed this
particular bug.  Closing!

Ludo’.




^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-11-23 20:48 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-26 12:14 bug#63736: [Shepherd] Loss of SIGCHLD notifications Ludovic Courtès
2023-05-27 17:00 ` Ludovic Courtès
2023-11-23 20:46   ` Ludovic Courtès

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).