unofficial mirror of guix-patches@gnu.org 
 help / color / mirror / code / Atom feed
From: Carlo Zancanaro <carlo@zancanaro.id.au>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 30637@debbugs.gnu.org
Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
Date: Sat, 03 Mar 2018 18:58:50 +1100	[thread overview]
Message-ID: <87inadr3np.fsf@zancanaro.id.au> (raw)
In-Reply-To: <87po4mhcn2.fsf@gnu.org>


[-- Attachment #1.1: Type: text/plain, Size: 2198 bytes --]

Hey Ludo,

I've re-written my patch, and it's attached in two commits. The 
first one adds the necessary calls to prctl, and the second adds 
the fallback to polling.

On Fri, Mar 02 2018, Ludovic Courtès wrote:
> The ‘prctl’ procedure itself should simply throw to 
> 'system-error on GNU/Hurd. But then, in (shepherd), we could add 
> the polling thing when (not (string-contains %host-type 
> "linux")).
>
> WDYT?

I don't like the idea of doing this based on the host type. In my 
patch I've done it based on whether the prctl call succeeded. If 
the prctl call throws a system-error then we poll, otherwise we 
rely on SIGCHLD. I don't have a system set up with another kernel, 
though, so I don't know how I can easily test whether the fallback 
logic is working properly. When I replaced the prctl call with 
(throw 'system-error) it seemed to work.

The fallback code still fails in the guix build environment (as my 
previous patch did), but when it's using prctl it works properly. 
This means that a build on Linux pre-3.4, or on Hurd, will fail, 
which probably isn't acceptable given that shepherd is a hard 
dependency for starting a GuixSD system. As far as I can tell the 
test fails because the processes stick around as zombies, so I 
assume that pid 1 in the build container isn't properly reaping 
zombie processes. Do you have any ideas how I can force it to do 
so?

> We want to set the “reaper” of child processes to Shepherd 
> itself, so we must do that in child processes. The shepherd 
> process cannot be its own reaper I suppose.

Reading the manpage, and then running some code, I think you're 
wrong about this. Using prctl with PR_SET_CHILD_SUBREAPER marks 
the calling process as a child subreaper. That means that any 
processes that are orphaned below the current process get 
reparented under the current process (or a closer child subreaper, 
if there's one further down). If there are no processes marked as 
child subreapers, then the orphaned process is reparented under 
pid 1. We should only need to call prctl in two places: when 
shepherd initially starts, or when we fork for daemonize.

Carlo


[-- Attachment #1.2: 0001-Handle-forked-process-SIGCHLD-signals.patch --]
[-- Type: text/x-patch, Size: 7908 bytes --]

From 5f26da2ce6a26c8412368900987ac5438f3e70cd Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Sat, 3 Mar 2018 17:26:05 +1100
Subject: [PATCH 1/2] Handle forked process SIGCHLD signals

* Makefile.am (TESTS): Add tests/forking-service.sh.
* configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
* modules/shepherd.scm: Set the child subreaper attribute of main shepherd
  process (as long as we're not pid 1).
* modules/shepherd/service.scm (root-service)[daemonize]: Set the child
  subreaper attribute of newly forked shepherd process.
* modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
  and export it.
  (prctl): Add new procedure and export it.
---
 Makefile.am                    |   1 +
 configure.ac                   |   4 ++
 modules/shepherd.scm           |   2 +
 modules/shepherd/service.scm   |   4 +-
 modules/shepherd/system.scm.in |  17 ++++++-
 tests/forking-service.sh       | 111 +++++++++++++++++++++++++++++++++++++++++
 6 files changed, 137 insertions(+), 2 deletions(-)
 create mode 100644 tests/forking-service.sh

diff --git a/Makefile.am b/Makefile.am
index eafa308..8dad006 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -190,6 +190,7 @@ TESTS =						\
   tests/no-home.sh				\
   tests/pid-file.sh				\
   tests/status-sexp.sh				\
+  tests/forking-service.sh			\
   tests/signals.sh
 
 TEST_EXTENSIONS = .sh
diff --git a/configure.ac b/configure.ac
index bb5058d..fbe16f4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -72,7 +72,11 @@ esac
 AC_SUBST([RB_AUTOBOOT])
 AC_SUBST([RB_HALT_SYSTEM])
 AC_SUBST([RB_POWER_OFF])
+AC_MSG_RESULT([done])
 
+AC_MSG_CHECKING([<sys/prctl.h> constants])
+AC_COMPUTE_INT([PR_SET_CHILD_SUBREAPER], [PR_SET_CHILD_SUBREAPER], [#include <sys/prctl.h>])
+AC_SUBST([PR_SET_CHILD_SUBREAPER])
 AC_MSG_RESULT([done])
 
 dnl Manual pages.
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index df5420f..ab59e08 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -50,6 +50,8 @@
 ;; Main program.
 (define (main . args)
   (initialize-cli)
+  (when (not (= 1 (getpid)))
+    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
 
   (let ((config-file #f)
 	(socket-file default-socket-file)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 2224932..b6394f2 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -1274,7 +1274,9 @@ we want to receive these signals."
            (local-output "Running as PID 1, so not daemonizing."))
           (else
            (if (zero? (primitive-fork))
-               #t
+               (begin
+                 (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))
+                 #t)
                (primitive-exit 0))))))
      (persistency
       "Safe the current state of running and non-running services.
diff --git a/modules/shepherd/system.scm.in b/modules/shepherd/system.scm.in
index a54dca7..55806cb 100644
--- a/modules/shepherd/system.scm.in
+++ b/modules/shepherd/system.scm.in
@@ -23,7 +23,9 @@
   #:export (reboot
             halt
             power-off
-            max-file-descriptors))
+            max-file-descriptors
+            prctl
+            PR_SET_CHILD_SUBREAPER))
 
 ;; The <sys/reboot.h> constants.
 (define RB_AUTOBOOT @RB_AUTOBOOT@)
@@ -130,6 +132,19 @@ the returned procedure is called."
                    (list err))
             result)))))
 
+(define PR_SET_CHILD_SUBREAPER @PR_SET_CHILD_SUBREAPER@)
+
+(define prctl
+  (let ((proc (syscall->procedure long "prctl" (list int int))))
+    (lambda (process operation)
+      "Perform an operation on the given process"
+      (let-values (((result err) (proc process operation)))
+        (if (= -1 result)
+            (throw 'system-error "prctl" "~A: ~S"
+                   (list (strerror err) name)
+                   (list err))
+            result)))))
+
 (define (max-file-descriptors)
   "Return the maximum number of open file descriptors allowed."
   (sysconf _SC_OPEN_MAX))
diff --git a/tests/forking-service.sh b/tests/forking-service.sh
new file mode 100644
index 0000000..90c684a
--- /dev/null
+++ b/tests/forking-service.sh
@@ -0,0 +1,111 @@
+# GNU Shepherd --- Test detecting a forked process' termination
+# Copyright © 2016 Ludovic Courtès <ludo@gnu.org>
+# Copyright © 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
+#
+# This file is part of the GNU Shepherd.
+#
+# The GNU Shepherd is free software; you can redistribute it and/or modify it
+# under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or (at
+# your option) any later version.
+#
+# The GNU Shepherd is distributed in the hope that it will be useful, but
+# WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with the GNU Shepherd.  If not, see <http://www.gnu.org/licenses/>.
+
+shepherd --version
+herd --version
+
+socket="t-socket-$$"
+conf="t-conf-$$"
+log="t-log-$$"
+pid="t-pid-$$"
+service_pid="t-service-pid-$$"
+service2_pid="t-service2-pid-$$"
+service2_started="t-service2-starts-$$"
+
+herd="herd -s $socket"
+
+function cleanup() {
+    cat $log || true
+    rm -f $socket $conf $log $service2_started
+    test -f $pid && kill "$(cat $pid)" || true
+    rm -f $pid
+    test -f $service_pid && kill "$(cat $service_pid)" || true
+    rm -f $service_pid
+    test -f $service2_pid && kill "$(cat $service2_pid)" || true
+    rm -f $service2_pid
+}
+
+trap cleanup EXIT
+
+cat > "$conf"<<EOF
+(define %command
+  '("$SHELL" "-c" "sleep 600 & echo \$! > $PWD/$service_pid"))
+
+(register-services
+ (make <service>
+   ;; A service that forks into a different process.
+   #:provides '(test)
+   #:start (make-forkexec-constructor %command
+                                      #:pid-file "$PWD/$service_pid")
+   #:stop  (make-kill-destructor)
+   #:respawn? #f))
+
+(define %command2
+  '("$SHELL" "-c" "echo started >> $PWD/$service2_started; sleep 600 & echo \$! > $PWD/$service2_pid"))
+
+(register-services
+ (make <service>
+   ;; A service that forks into a different process.
+   #:provides '(test2)
+   #:start (make-forkexec-constructor %command2
+                                      #:pid-file "$PWD/$service2_pid")
+   #:stop  (make-kill-destructor)
+   #:respawn? #t))
+EOF
+cat $conf
+
+rm -f "$pid"
+shepherd -I -s "$socket" -c "$conf" -l "$log" --pid="$pid" &
+
+# Wait till it's ready.
+while ! test -f "$pid" ; do sleep 0.3 ; done
+
+shepherd_pid="$(cat $pid)"
+
+# start both of the services
+$herd start test
+$herd start test2
+
+# make sure "test" is started
+until $herd status test | grep started; do sleep 0.3; done
+test -f "$service_pid"
+service_pid_value="$(cat $service_pid)"
+# now kill it
+kill "$service_pid_value"
+while kill -0 "$service_pid_value"; do sleep 0.3; done
+# shepherd should notice that the service has stopped within one second
+sleep 1
+$herd status test | grep stopped
+
+
+
+# make sure "test2" has started
+until $herd status test2 | grep started; do sleep 0.3; done
+test -f "$service2_pid"
+service2_pid_value="$(cat $service2_pid)"
+test "$(cat $PWD/$service2_started)" = "started"
+# now kill it
+rm -f "$service2_pid"
+kill $service2_pid_value
+while kill -0 "$service2_pid_value"; do sleep 0.3; done
+# shepherd should notice that the service has stopped, and restart it, within one second
+sleep 1;
+$herd status test2 | grep started
+test "$(cat $PWD/$service2_started)" = "started
+started"
-- 
2.16.1


[-- Attachment #1.3: 0002-Poll-every-0.5s-to-find-dead-forked-services.patch --]
[-- Type: text/x-patch, Size: 10120 bytes --]

From ec47fa189c7d47f1d9444d939b084850f0a7186c Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Wed, 21 Feb 2018 22:57:59 +1100
Subject: [PATCH 2/2] Poll every 0.5s to find dead forked services

* modules/shepherd.scm (open-server-socket): Set socket to be
  non-blocking.
  (main): Use select with a timeout. If prctl failed when shepherd started
  then call check-for-dead-services between connections/timeouts.
* modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
  signal handler.
  (respawn-service): Separate logic for respawning services from handling
  SIGCHLD.
  (handle-SIGCHLD, check-for-dead-services): New exported procedures.
* tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
  symbols.
---
 modules/shepherd.scm         | 31 ++++++++++++++----
 modules/shepherd/service.scm | 78 +++++++++++++++++++++++++++-----------------
 tests/basic.sh               |  4 +--
 tests/status-sexp.sh         |  4 +--
 4 files changed, 76 insertions(+), 41 deletions(-)

diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index ab59e08..b824546 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -42,6 +42,8 @@
   (with-fluids ((%default-port-encoding "UTF-8"))
     (let ((sock    (socket PF_UNIX SOCK_STREAM 0))
           (address (make-socket-address AF_UNIX file-name)))
+      (fcntl sock F_SETFL (logior O_NONBLOCK
+                                  (fcntl sock F_GETFL)))
       (bind sock address)
       (listen sock 10)
       sock)))
@@ -49,9 +51,17 @@
 \f
 ;; Main program.
 (define (main . args)
+  (define poll-services
+    (if (= 1 (getpid))
+        (lambda () #f)
+        (catch 'system-error
+          (lambda ()
+            (prctl PR_SET_CHILD_SUBREAPER 1)
+            (lambda () #f))
+          (lambda (key . args)
+            check-for-dead-services))))
+
   (initialize-cli)
-  (when (not (= 1 (getpid)))
-    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
 
   (let ((config-file #f)
 	(socket-file default-socket-file)
@@ -220,11 +230,18 @@
             (_  #t))
 
           (let next-command ()
-            (match (accept sock)
-              ((command-source . client-address)
-               (setvbuf command-source _IOFBF 1024)
-               (process-connection command-source))
-              (_ #f))
+            (define (read-from sock)
+              (match (accept sock)
+                ((command-source . client-address)
+                 (setvbuf command-source _IOFBF 1024)
+                 (process-connection command-source))
+                (_ #f)))
+            (match (select (list sock) (list) (list) 0.5)
+              (((sock) _ _)
+               (read-from sock))
+              (_
+               #f))
+            (poll-services)
             (next-command))))))
 
 (define (process-connection sock)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index b6394f2..fc53d76 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -3,6 +3,7 @@
 ;; Copyright (C) 2002, 2003 Wolfgang Järling <wolfgang@pro-linux.de>
 ;; Copyright (C) 2014 Alex Sassmannshausen <alex.sassmannshausen@gmail.com>
 ;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
+;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -64,6 +65,7 @@
             for-each-service
             lookup-services
             respawn-service
+            handle-SIGCHLD
             register-services
             provided-by
             required-by
@@ -77,6 +79,7 @@
             make-system-destructor
             make-init.d-service
 
+            check-for-dead-services
             root-service
             make-actions
 
@@ -800,7 +803,7 @@ false."
 its PID."
   ;; Install the SIGCHLD handler if this is the first fork+exec-command call
   (unless %sigchld-handler-installed?
-    (sigaction SIGCHLD respawn-service SA_NOCLDSTOP)
+    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
     (set! %sigchld-handler-installed? #t))
   (let ((pid (primitive-fork)))
     (if (zero? pid)
@@ -991,7 +994,7 @@ child left."
                           what (strerror errno))
             '(0 . #f)))))))
 
-(define (respawn-service signum)
+(define (handle-SIGCHLD signum)
   "Handle SIGCHLD, possibly by respawning the service that just died, or
 otherwise by updating its state."
   (let loop ()
@@ -1009,39 +1012,42 @@ otherwise by updating its state."
 
          ;; SERV can be #f for instance when this code runs just after a
          ;; service's 'stop' method killed its process and completed.
-         (when serv
-           (slot-set! serv 'running #f)
-           (if (and (respawn? serv)
-                    (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
-                                             (car respawn-limit)
-                                             (cdr respawn-limit))))
-               (if (not (slot-ref serv 'waiting-for-termination?))
-                   (begin
-                     ;; Everything is okay, start it.
-                     (local-output "Respawning ~a."
-                                   (canonical-name serv))
-                     (slot-set! serv 'last-respawns
-                                (cons (current-time)
-                                      (slot-ref serv 'last-respawns)))
-                     (start serv))
-                   ;; We have just been waiting for the
-                   ;; termination.  The `running' slot has already
-                   ;; been set to `#f' by `stop'.
-                   (begin
-                     (local-output "Service ~a terminated."
-                                   (canonical-name serv))
-                     (slot-set! serv 'waiting-for-termination? #f)))
-               (begin
-                 (local-output "Service ~a has been disabled."
-                               (canonical-name serv))
-                 (when (respawn? serv)
-                   (local-output "  (Respawning too fast.)"))
-                 (slot-set! serv 'enabled? #f))))
+         (respawn-service serv)
 
          ;; As noted in libc's manual (info "(libc) Process Completion"),
          ;; loop so we don't miss any terminated child process.
          (loop))))))
 
+(define (respawn-service serv)
+  (when serv
+    (slot-set! serv 'running #f)
+    (if (and (respawn? serv)
+             (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
+                                      (car respawn-limit)
+                                      (cdr respawn-limit))))
+        (if (not (slot-ref serv 'waiting-for-termination?))
+            (begin
+              ;; Everything is okay, start it.
+              (local-output "Respawning ~a."
+                            (canonical-name serv))
+              (slot-set! serv 'last-respawns
+                         (cons (current-time)
+                               (slot-ref serv 'last-respawns)))
+              (start serv))
+            ;; We have just been waiting for the
+            ;; termination.  The `running' slot has already
+            ;; been set to `#f' by `stop'.
+            (begin
+              (local-output "Service ~a terminated."
+                            (canonical-name serv))
+              (slot-set! serv 'waiting-for-termination? #f)))
+        (begin
+          (local-output "Service ~a has been disabled."
+                        (canonical-name serv))
+          (when (respawn? serv)
+            (local-output "  (Respawning too fast.)"))
+          (slot-set! serv 'enabled? #f)))))
+
 ;; Add NEW-SERVICES to the list of known services.
 (define (register-services . new-services)
   (define (register-single-service new)
@@ -1171,6 +1177,18 @@ file when persistence is enabled."
         (lambda (p)
           (format p "~{~a ~}~%" running-services))))))
 
+(define (check-for-dead-services)
+  (define (process-exists? pid)
+    (catch #t
+      (lambda () (kill pid 0) #t)
+      (lambda _ #f)))
+  (for-each-service (lambda (service)
+                      (let ((running (slot-ref service 'running)))
+                        (when (and (integer? running)
+                                   (not (process-exists? running)))
+                            (local-output "PID ~a (~a) is dead!" running (canonical-name service))
+                            (respawn-service service))))))
+
 (define root-service
   (make <service>
     #:docstring "The root service is used to operate on shepherd itself."
diff --git a/tests/basic.sh b/tests/basic.sh
index 1ddb334..2ecd8fb 100644
--- a/tests/basic.sh
+++ b/tests/basic.sh
@@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" <<EOF
 (register-services
  (make <service>
    #:provides '(test-loaded)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop (const #f)))
 EOF
 
@@ -166,7 +166,7 @@ $herd status test-loaded
 $herd status test-loaded | grep stopped
 
 $herd start test-loaded
-$herd status test-loaded | grep -i 'running.*42'
+$herd status test-loaded | grep -i 'running.*abc'
 $herd stop test-loaded
 $herd unload root test-loaded
 
diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh
index b7c8cb4..11b967e 100644
--- a/tests/status-sexp.sh
+++ b/tests/status-sexp.sh
@@ -33,7 +33,7 @@ cat > "$conf"<<EOF
 (register-services
  (make <service>
    #:provides '(foo)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop  (const #f)
    #:docstring "Foo!"
    #:respawn? #t)
@@ -85,7 +85,7 @@ root_service_sexp="
 	     (service (version 0)
 	       (provides (foo)) (requires ())
 	       (respawn? #t) (docstring \"Foo!\")
-	       (enabled? #t) (running 42) (conflicts ())
+	       (enabled? #t) (running abc) (conflicts ())
 	       (last-respawns ()))
 	     (service (version 0)
 	       (provides (bar)) (requires (foo))
-- 
2.16.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

  reply	other threads:[~2018-03-03  8:00 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-27 21:56 [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Carlo Zancanaro
2018-02-28 22:06 ` Ludovic Courtès
2018-03-01 22:37   ` Carlo Zancanaro
2018-03-02  9:44     ` Ludovic Courtès
2018-03-02 10:13       ` Carlo Zancanaro
2018-03-02 12:42         ` Ludovic Courtès
2018-03-03  7:58           ` Carlo Zancanaro [this message]
2018-03-03 15:21             ` Ludovic Courtès
2018-03-03 20:49               ` Carlo Zancanaro
2018-03-04 22:11                 ` Ludovic Courtès
2018-03-04 22:35                   ` Carlo Zancanaro
2018-03-04 22:49                     ` Ludovic Courtès
2018-03-04 23:08                       ` Carlo Zancanaro
2018-03-05 14:15                         ` bug#30637: " Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87inadr3np.fsf@zancanaro.id.au \
    --to=carlo@zancanaro.id.au \
    --cc=30637@debbugs.gnu.org \
    --cc=ludo@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).