unofficial mirror of guix-patches@gnu.org 
 help / color / mirror / code / Atom feed
From: Carlo Zancanaro <carlo@zancanaro.id.au>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 30637@debbugs.gnu.org
Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
Date: Sun, 04 Mar 2018 07:49:26 +1100	[thread overview]
Message-ID: <87woys9961.fsf@zancanaro.id.au> (raw)
In-Reply-To: <87h8px9od8.fsf@gnu.org>


[-- Attachment #1.1: Type: text/plain, Size: 1780 bytes --]

Hey Ludo,

On Sat, Mar 03 2018, Ludovic Courtès wrote:
> If they’re zombies, that means nobody called waitpid(2). 
> Presumably the
> polling code would need to do that.
>
> So I suppose ‘check-for-dead-services’ should do something like:
>
> [ ... ]
>
> Does that make sense?  Please check waitpid(2) carefully though, 
> because
> it’s pretty gnarly and I might have forgotten or misinterpreted
> something here.

Unfortunately we can't do that. We fall back to the polling 
approach to handle the fact that the processes that we care about 
aren't our children (hence we don't get SIGCHLD). The waitpid 
system call only waits for processes which are children of the 
calling process.

I looked into the zombie problem a bit more, and I found what the 
problem actually is. In the build environment a guile process is 
running as pid 1 (the *-guile-builder script for that job). This 
guile process never handles SIGCHLD, and never calls wait/waitpid, 
so any orphaned processes become zombies. I tried modifying 
derivations.scm, but it wanted to rebuild a lot of things, so I 
gave up. I think we need to add something like this to the 
*-guile-builder script:

  (sigaction SIGCHLD
    (lambda ()
      (let loop ()
        (match (waitpid WAIT_ANY WNOHANG)
          ((0 . _) #f)
          ((pid . _) (loop))
          (_ #f))))
    SA_NOCLDSTOP)

I've attached the output of `ps axjf` inside the build container, 
so you can see why I think that this is the problem. It's a bit of 
a shame that this is different to `guix environment --container`, 
where /bin/sh is pid 1, because it meant that it would build 
successfully in my container, but would fail in the build 
container (which is a confusing experience).


[-- Attachment #1.2: process-tree.txt --]
[-- Type: text/plain, Size: 8196 bytes --]

 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    0     1     1     1 ?           -1 Ssl  30001   0:00 guile --no-auto-compile -L /gnu/store/71d3rwa514j7vy5l4vfivf68g5yxibvl-module-import /gnu/store/nln71c8hv82c4vkrssi12qmapp1ryk58-shepherd-0.0.0-guile-builder
    1   989     1     1 ?           -1 S    30001   0:00 make check -j 4
  989   990     1     1 ?           -1 S    30001   0:00  \_ make check-recursive
  990   991     1     1 ?           -1 S    30001   0:00      \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -c fail=; \ if (target_option=k; case ${target_option-} in ?) ;; *) echo "am__make_running_with_option: internal error: invalid" "target option '${target_option-}' specified" >&2; exit 1;; esac; has_opt=no; sane_makeflags=$MAKEFLAGS; if { if test -z '1'; then false; elif test -n 'x86_64-unknown-linux-gnu'; then true; elif test -n '4.2.1' && test -n '/tmp/guix-build-shepherd-0.0.0.drv-0/source'; then true; else false; fi; }; then sane_makeflags=$MFLAGS; else case $MAKEFLAGS in *\\[\ \.]*) bs=\\; sane_makeflags=`printf '%s\n' "$MAKEFLAGS" | sed "s/$bs$bs[$bs $bs.]*//g"`;; esac; fi; skip_next=no; strip_trailopt () { flg=`printf '%s\n' "$flg" | sed "s/$1.*$//"`; }; for flg in $sane_makeflags; do test $skip_next = yes && { skip_next=no; continue; }; case $flg in *=*|--*) continue;; -*I) strip_trailopt 'I'; skip_next=yes;; -*I?*) strip_trailopt 'I';; -*O) strip_trailopt 'O'; skip_next=yes;; -*O?*) strip_trailopt 'O';; -*l) strip_trailopt 'l'; skip_next=yes;; -*l?*) strip_trailopt 'l';; -[dEDm]) skip_next=yes;; -[JT]) skip_next=yes;; esac; case $flg in *$target_option*) has_opt=yes; break;; esac; done; test $has_opt = yes); then \   failcom='fail=yes'; \ else \   failcom='exit 1'; \ fi; \ dot_seen=no; \ target=`echo check-recursive | sed s/-recursive//`; \ case "check-recursive" in \   distclean-* | maintainer-clean-*) list='po' ;; \   *) list='po' ;; \ esac; \ for subdir in $list; do \   echo "Making $target in $subdir"; \   if test "$subdir" = "."; then \     dot_seen=yes; \     local_target="$target-am"; \   else \     local_target="$target"; \   fi; \   (CDPATH="${ZSH_VERSION+.}:" && cd $subdir && make  $local_target) \   || eval $failcom; \ done; \ if test "$dot_seen" = "no"; then \   make  "$target-am" || exit 1; \ fi; test -z "$fail"
  991   998     1     1 ?           -1 S    30001   0:00          \_ make check-am
  998   999     1     1 ?           -1 S    30001   0:00              \_ make check-TESTS
  999  1005     1     1 ?           -1 S    30001   0:00                  \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -c set +e; bases='tests/basic.log tests/respawn.log tests/respawn-throttling.log tests/misbehaved-client.log tests/no-home.log tests/pid-file.log tests/status-sexp.log tests/forking-service.log tests/signals.log'; bases=`for i in $bases; do echo $i; done | sed 's/\.log$//'`; bases=`echo $bases`; \ log_list=`for i in $bases; do echo $i.log; done`; \ trs_list=`for i in $bases; do echo $i.trs; done`; \ log_list=`echo $log_list`; trs_list=`echo $trs_list`; \ make  test-suite.log TEST_LOGS="$log_list"; \ exit $?;
 1005  1014     1     1 ?           -1 S    30001   0:00                      \_ make test-suite.log TEST_LOGS=tests/basic.log tests/respawn.log tests/respawn-throttling.log tests/misbehaved-client.log tests/no-home.log tests/pid-file.log tests/status-sexp.log tests/forking-service.log tests/signals.log
 1014  1554     1     1 ?           -1 S    30001   0:00                          \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -c p='tests/forking-service.sh'; \ case 'tests/forking-service.log' in */*) case 'tests/forking-service' in */*) b='tests/forking-service';; *) b=`echo 'tests/forking-service.log' | sed 's/\.log$//'`; esac;; *) b='tests/forking-service';; esac; \ case $- in *e*) set +e;; esac; srcdirstrip=`echo "." | sed 's|.|.|g'`; case $p in ./*) f=`echo "$p" | sed "s|^$srcdirstrip/||"`;; *) f=$p;; esac; { mgn= red= grn= lgn= blu= brg= std=; am__color_tests=no; if test "X" = Xno; then am__color_tests=no; elif test "X" = Xalways; then am__color_tests=yes; elif test "X$TERM" != Xdumb && { test -t 1; } 2>/dev/null; then am__color_tests=yes; fi; if test $am__color_tests = yes; then red='.[0;31m'; grn='.[0;32m'; lgn='.[1;32m'; blu='.[1;34m'; mgn='.[0;35m'; brg='.[1m'; std='.[m'; fi; }; srcdir=.; export srcdir; case "tests/forking-service.log" in */*) am__odir=`echo "./tests/forking-service.log" | sed 's|/[^/]*$||'`;; *) am__odir=.;; esac; test "x$am__odir" = x"." || test -d "$am__odir" || /gnu/store/6i33ik7haav0hd5a797l3llkq04ghx6g-coreutils-8.28/bin/mkdir -p "$am__odir" || exit $?; if test -f "./$f"; then dir=./; elif test -f "$f"; then dir=; else dir="./"; fi; tst=$dir$f; log='tests/forking-service.log'; if test -n ''; then am__enable_hard_errors=no; else am__enable_hard_errors=yes; fi; case "  " in *[\ \.]$f[\ \.]* | *[\ \.]$dir$f[\ \.]*) am__expect_failure=yes;; *) am__expect_failure=no;; esac; unset XDG_CONFIG_HOME; unset LANGUAGE; LC_ALL=C LC_MESSAGES=C PATH="/tmp/guix-build-shepherd-0.0.0.drv-0/source:$PATH" SHELL="/gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash" GUILE="/gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/bin/guile" GUILE_LOAD_PATH="/tmp/guix-build-shepherd-0.0.0.drv-0/source/modules:/tmp/guix-build-shepherd-0.0.0.drv-0/source/modules:$GUILE_LOAD_PATH" GUILE_LOAD_COMPILED_PATH="/tmp/guix-build-shepherd-0.0.0.drv-0/source/modules:/tmp/guix-build-shepherd-0.0.0.drv-0/source/modules:$GUILE_LOAD_COMPILED_PATH"  /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash ./build-aux/test-driver --test-name "$f" \ --log-file $b.log --trs-file $b.trs \ --color-tests "$am__color_tests" --enable-hard-errors "$am__enable_hard_errors" --expect-failure "$am__expect_failure"   -- /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -x -e  \ "$tst" 
 1554  1561     1     1 ?           -1 S    30001   0:00                              \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash ./build-aux/test-driver --test-name tests/forking-service.sh --log-file tests/forking-service.log --trs-file tests/forking-service.trs --color-tests no --enable-hard-errors yes --expect-failure no -- /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -x -e ./tests/forking-service.sh
 1561  1562     1     1 ?           -1 S    30001   0:00                                  \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -x -e ./tests/forking-service.sh
 1562  1594     1     1 ?           -1 Sl   30001   0:00                                      \_ /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/bin/guile --no-auto-compile /tmp/guix-build-shepherd-0.0.0.drv-0/source/shepherd -I -s t-socket-1562 -c t-conf-1562 -l t-log-1562 --pid=t-pid-1562
 1562  2005     1     1 ?           -1 R    30001   0:00                                      \_ ps axjf
    1  1091     1     1 ?           -1 Z    30001   0:00 [shepherd] <defunct>
    1  1114     1     1 ?           -1 Z    30001   0:00 [shepherd] <defunct>
    1  1403  1123  1123 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1415     1     1 ?           -1 Z    30001   0:00 [shepherd] <defunct>
    1  1433  1193  1193 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1480  1479  1479 ?           -1 S    30001   0:00 sleep 600
    1  1541  1427  1427 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1544  1471  1471 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1644  1638  1638 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1659  1658  1658 ?           -1 S    30001   0:00 sleep 600
    1  1705  1577  1577 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1707  1618  1618 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1770  1770  1770 ?           -1 Zs   30001   0:00 [bash] <defunct>
    1  1826  1736  1736 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1888  1770  1770 ?           -1 Z    30001   0:00 [sleep] <defunct>

[-- Attachment #1.3: Type: text/plain, Size: 1361 bytes --]


> Please add a comment in the handler saying that we resort to 
> polling on
> OSes that do not support ‘prctl’.
>
> However, perhaps we should do:
>
>   (lambda args
>     (let ((errno (system-error-errno args)))
>       (if (= ENOSYS errno)
>           check-for-dead-services
>           (apply throw args))))
>
> so that important/unexpected errors are not silently ignored.

I had quite liked the idea that it would just ignore any error and 
do the fallback, because really all we care about is "prctl 
failed" when deciding on our fallback logic. I've decided to just 
handle ENOSYS (prctl not available) and EINVAL (which is returned 
when PR_SET_CHILD_SUBREAPER not available), and throw for 
everything else. I'd love to be able to test this on platforms 
where prctl will actually fail, though, because I don't like the 
idea of committing code that I haven't actually been able to run.

> If not, we should add in shepherd.texi, under “Slots of 
> services”, a few
> words saying that when ‘running’ is an integer it is assumed to 
> be a
> PID.

I've done this, but while doing it I realised that this has always 
been true. The SIGCHLD handler has always assumed that a number 
indicates a running process, my modifications haven't changed the 
assumption, they've just widened its scope.

Carlo


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.4: 0001-Add-prctl-syscall-wrapper-along-with-with-PR_SET_CHI.patch --]
[-- Type: text/x-patch, Size: 2521 bytes --]

From e529e4035eec147f448804dd10fdbca13a17f523 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Sun, 4 Mar 2018 07:01:30 +1100
Subject: [PATCH 1/3] Add prctl syscall wrapper along with with
 PR_SET_CHILD_SUBREAPER.

* configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
* modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
  and export it.
  (prctl): Add new procedure and export it.
---
 configure.ac                   |  4 ++++
 modules/shepherd/system.scm.in | 21 ++++++++++++++++++++-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/configure.ac b/configure.ac
index bb5058d..fbe16f4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -72,7 +72,11 @@ esac
 AC_SUBST([RB_AUTOBOOT])
 AC_SUBST([RB_HALT_SYSTEM])
 AC_SUBST([RB_POWER_OFF])
+AC_MSG_RESULT([done])
 
+AC_MSG_CHECKING([<sys/prctl.h> constants])
+AC_COMPUTE_INT([PR_SET_CHILD_SUBREAPER], [PR_SET_CHILD_SUBREAPER], [#include <sys/prctl.h>])
+AC_SUBST([PR_SET_CHILD_SUBREAPER])
 AC_MSG_RESULT([done])
 
 dnl Manual pages.
diff --git a/modules/shepherd/system.scm.in b/modules/shepherd/system.scm.in
index a54dca7..09d45bf 100644
--- a/modules/shepherd/system.scm.in
+++ b/modules/shepherd/system.scm.in
@@ -23,7 +23,9 @@
   #:export (reboot
             halt
             power-off
-            max-file-descriptors))
+            max-file-descriptors
+            prctl
+            PR_SET_CHILD_SUBREAPER))
 
 ;; The <sys/reboot.h> constants.
 (define RB_AUTOBOOT @RB_AUTOBOOT@)
@@ -130,6 +132,23 @@ the returned procedure is called."
                    (list err))
             result)))))
 
+(define PR_SET_CHILD_SUBREAPER @PR_SET_CHILD_SUBREAPER@)
+
+(define prctl
+  (if (dynamic-func "prctl" (dynamic-link))
+      (let ((proc (syscall->procedure long "prctl" (list int int))))
+        (lambda (process operation)
+          "Perform an operation on the given process"
+          (let-values (((result err) (proc process operation)))
+            (if (= -1 result)
+                (throw 'system-error "prctl" "~A: ~S"
+                       (list (strerror err) name)
+                       (list err))
+                result))))
+      (lambda (process operation)
+        (throw 'system-error "prctl" "~A" (list strerror ENOSYS)
+               (list ENOSYS)))))
+
 (define (max-file-descriptors)
   "Return the maximum number of open file descriptors allowed."
   (sysconf _SC_OPEN_MAX))
-- 
2.16.1


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.5: 0002-Handle-forked-process-SIGCHLD-signals.patch --]
[-- Type: text/x-patch, Size: 2325 bytes --]

From b43c128d8a175a9a123eb7b1af465fb3747a5393 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Sun, 4 Mar 2018 07:46:13 +1100
Subject: [PATCH 2/3] Handle forked process SIGCHLD signals

* Makefile.am (TESTS): Add tests/forking-service.sh.
* modules/shepherd.scm: Set the child subreaper attribute of main shepherd
  process (as long as we're not pid 1).
* modules/shepherd/service.scm (root-service)[daemonize]: Set the child
  subreaper attribute of newly forked shepherd process.
---
 Makefile.am                  | 1 +
 modules/shepherd.scm         | 7 +++++++
 modules/shepherd/service.scm | 4 +++-
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/Makefile.am b/Makefile.am
index eafa308..8dad006 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -190,6 +190,7 @@ TESTS =						\
   tests/no-home.sh				\
   tests/pid-file.sh				\
   tests/status-sexp.sh				\
+  tests/forking-service.sh			\
   tests/signals.sh
 
 TEST_EXTENSIONS = .sh
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index df5420f..faa1e47 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -51,6 +51,13 @@
 (define (main . args)
   (initialize-cli)
 
+  (when (not (= 1 (getpid)))
+    ;; Register for orphaned processes to be reparented onto us when their
+    ;; original parent dies. This lets us handle SIGCHLD from daemon processes
+    ;; that would otherwise have been reparented under pid 1. This is
+    ;; unnecessary when we are pid 1.
+    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
+
   (let ((config-file #f)
 	(socket-file default-socket-file)
         (pid-file    #f)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 2224932..b6394f2 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -1274,7 +1274,9 @@ we want to receive these signals."
            (local-output "Running as PID 1, so not daemonizing."))
           (else
            (if (zero? (primitive-fork))
-               #t
+               (begin
+                 (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))
+                 #t)
                (primitive-exit 0))))))
      (persistency
       "Safe the current state of running and non-running services.
-- 
2.16.1


[-- Attachment #1.6: 0003-Poll-every-0.5s-to-find-dead-forked-services.patch --]
[-- Type: text/x-patch, Size: 12416 bytes --]

From 3d3c091660bbbd529af0058b0ba9b5ddbfc6b481 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Wed, 21 Feb 2018 22:57:59 +1100
Subject: [PATCH 3/3] Poll every 0.5s to find dead forked services

* modules/shepherd.scm (open-server-socket): Set socket to be
  non-blocking.
  (main): Use select with a timeout. If prctl failed when shepherd started
  then call check-for-dead-services between connections/timeouts.
* modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
  signal handler.
  (respawn-service): Separate logic for respawning services from handling
  SIGCHLD.
  (handle-SIGCHLD, check-for-dead-services): New exported procedures.
* tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
  symbols.
* doc/shepherd.texi (Slots of services): Add note about service running slot
  being a process id.
---
 doc/shepherd.texi            |  4 ++-
 modules/shepherd.scm         | 47 ++++++++++++++++++-------
 modules/shepherd/service.scm | 82 ++++++++++++++++++++++++++++----------------
 tests/basic.sh               |  4 +--
 tests/status-sexp.sh         |  4 +--
 5 files changed, 95 insertions(+), 46 deletions(-)

diff --git a/doc/shepherd.texi b/doc/shepherd.texi
index 815091f..47005d5 100644
--- a/doc/shepherd.texi
+++ b/doc/shepherd.texi
@@ -608,7 +608,9 @@ way.  The default value is @code{#f}, which indicates that the service
 is not running. When an attempt is made to start the service, it will
 be set to the return value of the procedure in the @code{start} slot.
 It will also be passed as an argument to the procedure in the
-@code{stop} slot.  This slot can not be initialized with a keyword.
+@code{stop} slot.  If it is set a value that is an integer, it is
+assumed to be a process id, and shepherd will monitor the process for
+unexpected exits.  This slot can not be initialized with a keyword.
 
 @item
 @vindex respawn? (slot of <service>)
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index faa1e47..e912d21 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -42,6 +42,8 @@
   (with-fluids ((%default-port-encoding "UTF-8"))
     (let ((sock    (socket PF_UNIX SOCK_STREAM 0))
           (address (make-socket-address AF_UNIX file-name)))
+      (fcntl sock F_SETFL (logior O_NONBLOCK
+                                  (fcntl sock F_GETFL)))
       (bind sock address)
       (listen sock 10)
       sock)))
@@ -49,14 +51,28 @@
 \f
 ;; Main program.
 (define (main . args)
-  (initialize-cli)
+  (define poll-services
+    (if (= 1 (getpid))
+        (lambda () #f) ;; If we're pid 1 then we don't need to set
+                       ;; PR_SET_CHILD_SUBREAPER
+        (catch 'system-error
+          (lambda ()
+            ;; Register for orphaned processes to be reparented onto us when
+            ;; their original parent dies. This lets us handle SIGCHLD from
+            ;; daemon processes that would otherwise have been reparented
+            ;; under pid 1. This is unnecessary when we are pid 1.
+            (prctl PR_SET_CHILD_SUBREAPER 1)
+            (lambda () #f))
+          (lambda args
+            ;; We fall back to polling for services on systems that don't
+            ;; support prctl/PR_SET_CHILD_SUBREAPER
+            (let ((errno (system-error-errno args)))
+              (if (or (= ENOSYS errno) ;; prctl not available
+                      (= EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not available
+                  check-for-dead-services ;; poll
+                  (apply throw args)))))))
 
-  (when (not (= 1 (getpid)))
-    ;; Register for orphaned processes to be reparented onto us when their
-    ;; original parent dies. This lets us handle SIGCHLD from daemon processes
-    ;; that would otherwise have been reparented under pid 1. This is
-    ;; unnecessary when we are pid 1.
-    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
+  (initialize-cli)
 
   (let ((config-file #f)
 	(socket-file default-socket-file)
@@ -225,11 +241,18 @@
             (_  #t))
 
           (let next-command ()
-            (match (accept sock)
-              ((command-source . client-address)
-               (setvbuf command-source _IOFBF 1024)
-               (process-connection command-source))
-              (_ #f))
+            (define (read-from sock)
+              (match (accept sock)
+                ((command-source . client-address)
+                 (setvbuf command-source _IOFBF 1024)
+                 (process-connection command-source))
+                (_ #f)))
+            (match (select (list sock) (list) (list) 0.5)
+              (((sock) _ _)
+               (read-from sock))
+              (_
+               #f))
+            (poll-services)
             (next-command))))))
 
 (define (process-connection sock)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index b6394f2..056483a 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -3,6 +3,7 @@
 ;; Copyright (C) 2002, 2003 Wolfgang Järling <wolfgang@pro-linux.de>
 ;; Copyright (C) 2014 Alex Sassmannshausen <alex.sassmannshausen@gmail.com>
 ;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
+;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -64,6 +65,7 @@
             for-each-service
             lookup-services
             respawn-service
+            handle-SIGCHLD
             register-services
             provided-by
             required-by
@@ -77,6 +79,7 @@
             make-system-destructor
             make-init.d-service
 
+            check-for-dead-services
             root-service
             make-actions
 
@@ -800,7 +803,7 @@ false."
 its PID."
   ;; Install the SIGCHLD handler if this is the first fork+exec-command call
   (unless %sigchld-handler-installed?
-    (sigaction SIGCHLD respawn-service SA_NOCLDSTOP)
+    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
     (set! %sigchld-handler-installed? #t))
   (let ((pid (primitive-fork)))
     (if (zero? pid)
@@ -991,7 +994,7 @@ child left."
                           what (strerror errno))
             '(0 . #f)))))))
 
-(define (respawn-service signum)
+(define (handle-SIGCHLD signum)
   "Handle SIGCHLD, possibly by respawning the service that just died, or
 otherwise by updating its state."
   (let loop ()
@@ -1010,38 +1013,44 @@ otherwise by updating its state."
          ;; SERV can be #f for instance when this code runs just after a
          ;; service's 'stop' method killed its process and completed.
          (when serv
-           (slot-set! serv 'running #f)
-           (if (and (respawn? serv)
-                    (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
-                                             (car respawn-limit)
-                                             (cdr respawn-limit))))
-               (if (not (slot-ref serv 'waiting-for-termination?))
-                   (begin
-                     ;; Everything is okay, start it.
-                     (local-output "Respawning ~a."
-                                   (canonical-name serv))
-                     (slot-set! serv 'last-respawns
-                                (cons (current-time)
-                                      (slot-ref serv 'last-respawns)))
-                     (start serv))
-                   ;; We have just been waiting for the
-                   ;; termination.  The `running' slot has already
-                   ;; been set to `#f' by `stop'.
-                   (begin
-                     (local-output "Service ~a terminated."
-                                   (canonical-name serv))
-                     (slot-set! serv 'waiting-for-termination? #f)))
-               (begin
-                 (local-output "Service ~a has been disabled."
-                               (canonical-name serv))
-                 (when (respawn? serv)
-                   (local-output "  (Respawning too fast.)"))
-                 (slot-set! serv 'enabled? #f))))
+           (respawn-service serv))
 
          ;; As noted in libc's manual (info "(libc) Process Completion"),
          ;; loop so we don't miss any terminated child process.
          (loop))))))
 
+(define (respawn-service serv)
+  "Respawn a service that has stopped running unexpectedly. If we have
+attempted to respawn the service a number of times already and it keeps dying,
+then disable it."
+  (slot-set! serv 'running #f)
+  (if (and (respawn? serv)
+           (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
+                                    (car respawn-limit)
+                                    (cdr respawn-limit))))
+      (if (not (slot-ref serv 'waiting-for-termination?))
+          (begin
+            ;; Everything is okay, start it.
+            (local-output "Respawning ~a."
+                          (canonical-name serv))
+            (slot-set! serv 'last-respawns
+                       (cons (current-time)
+                             (slot-ref serv 'last-respawns)))
+            (start serv))
+          ;; We have just been waiting for the
+          ;; termination.  The `running' slot has already
+          ;; been set to `#f' by `stop'.
+          (begin
+            (local-output "Service ~a terminated."
+                          (canonical-name serv))
+            (slot-set! serv 'waiting-for-termination? #f)))
+      (begin
+        (local-output "Service ~a has been disabled."
+                      (canonical-name serv))
+        (when (respawn? serv)
+          (local-output "  (Respawning too fast.)"))
+        (slot-set! serv 'enabled? #f))))
+
 ;; Add NEW-SERVICES to the list of known services.
 (define (register-services . new-services)
   (define (register-single-service new)
@@ -1171,6 +1180,21 @@ file when persistence is enabled."
         (lambda (p)
           (format p "~{~a ~}~%" running-services))))))
 
+(define (check-for-dead-services)
+  "Poll each process that we expect to be running, and respawn any which have
+unexpectedly stopped running. This procedure is used as a fallback on systems
+where prctl/PR_SET_CHILD_SUBREAPER is unsupported."
+  (define (process-exists? pid)
+    (catch #t
+      (lambda () (kill pid 0) #t)
+      (lambda _ #f)))
+  (for-each-service (lambda (service)
+                      (let ((running (slot-ref service 'running)))
+                        (when (and (integer? running)
+                                   (not (process-exists? running)))
+                            (local-output "PID ~a (~a) is dead!" running (canonical-name service))
+                            (respawn-service service))))))
+
 (define root-service
   (make <service>
     #:docstring "The root service is used to operate on shepherd itself."
diff --git a/tests/basic.sh b/tests/basic.sh
index 1ddb334..2ecd8fb 100644
--- a/tests/basic.sh
+++ b/tests/basic.sh
@@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" <<EOF
 (register-services
  (make <service>
    #:provides '(test-loaded)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop (const #f)))
 EOF
 
@@ -166,7 +166,7 @@ $herd status test-loaded
 $herd status test-loaded | grep stopped
 
 $herd start test-loaded
-$herd status test-loaded | grep -i 'running.*42'
+$herd status test-loaded | grep -i 'running.*abc'
 $herd stop test-loaded
 $herd unload root test-loaded
 
diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh
index b7c8cb4..11b967e 100644
--- a/tests/status-sexp.sh
+++ b/tests/status-sexp.sh
@@ -33,7 +33,7 @@ cat > "$conf"<<EOF
 (register-services
  (make <service>
    #:provides '(foo)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop  (const #f)
    #:docstring "Foo!"
    #:respawn? #t)
@@ -85,7 +85,7 @@ root_service_sexp="
 	     (service (version 0)
 	       (provides (foo)) (requires ())
 	       (respawn? #t) (docstring \"Foo!\")
-	       (enabled? #t) (running 42) (conflicts ())
+	       (enabled? #t) (running abc) (conflicts ())
 	       (last-respawns ()))
 	     (service (version 0)
 	       (provides (bar)) (requires (foo))
-- 
2.16.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

  reply	other threads:[~2018-03-03 20:50 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-27 21:56 [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Carlo Zancanaro
2018-02-28 22:06 ` Ludovic Courtès
2018-03-01 22:37   ` Carlo Zancanaro
2018-03-02  9:44     ` Ludovic Courtès
2018-03-02 10:13       ` Carlo Zancanaro
2018-03-02 12:42         ` Ludovic Courtès
2018-03-03  7:58           ` Carlo Zancanaro
2018-03-03 15:21             ` Ludovic Courtès
2018-03-03 20:49               ` Carlo Zancanaro [this message]
2018-03-04 22:11                 ` Ludovic Courtès
2018-03-04 22:35                   ` Carlo Zancanaro
2018-03-04 22:49                     ` Ludovic Courtès
2018-03-04 23:08                       ` Carlo Zancanaro
2018-03-05 14:15                         ` bug#30637: " Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87woys9961.fsf@zancanaro.id.au \
    --to=carlo@zancanaro.id.au \
    --cc=30637@debbugs.gnu.org \
    --cc=ludo@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).