unofficial mirror of guix-patches@gnu.org 
 help / color / mirror / code / Atom feed
* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
@ 2018-02-27 21:56 Carlo Zancanaro
  2018-02-28 22:06 ` Ludovic Courtès
  0 siblings, 1 reply; 14+ messages in thread
From: Carlo Zancanaro @ 2018-02-27 21:56 UTC (permalink / raw)
  To: 30637


[-- Attachment #1.1: Type: text/plain, Size: 1604 bytes --]

Another patch for shepherd!

This one I'm much less happy about, but I'm sending it for 
feedback. I'll explain the problem it solves, then the (hacky) way 
that it solves it for now. I'm sure there's a better way to solve 
this, but it was annoying me enough for me to do this now.

The problem is that shepherd, when run as a user process, can 
"lose" services which fork away. Shepherd can still kill them, but 
a SIGCHLD won't be delivered if they die, so shepherd can't 
restart/disable them. My prime example is emacs, which I run with 
--daemon. If I then kill emacs, shepherd will still think that it 
is running.

To solve this problem, I have modified shepherd to poll each 
process that it thinks is running. I think this is approximately 
every 0.5s, but I don't quite understand guile's behaviour when it 
comes to socket timeouts.

This involves a subtle change in the meaning of the running field 
in <service> objects. My patch treats any `integer?` as a pid, and 
if a corresponding process is not found it will attempt to restart 
the service. I considered creating a new "pid" datatype to make it 
clear when a number represents a pid, but didn't want to go 
overboard without further feedback. So, thoughts? Can anyone 
suggest a better way to solve this problem?

I'm also confused by my new test. If I run it myself then it 
passes fine, but in a guix build container there is something 
different which means that the processes don't get reaped 
properly, so the test doesn't terminate. I'll keep trying to work 
it out, but if anyone else has an idea I'd love to hear it.

Carlo


[-- Attachment #1.2: 0001-shepherd-Poll-every-0.5s-to-find-dead-forked-service.patch --]
[-- Type: text/x-patch, Size: 13915 bytes --]

From ac87535ebdcc7490d9bbd552c99337917e42a6e8 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Wed, 21 Feb 2018 22:57:59 +1100
Subject: [PATCH] shepherd: Poll every 0.5s to find dead forked services

* modules/shepherd.scm (open-server-socket): Set socket to be non-blocking.
  (main): Use select with a timeout, and call check-for-dead-services between
  connections/timeouts.
* modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
  signal handler.
  (respawn-service): Separate logic for respawning services from handling
  SIGCHLD.
  (handle-SIGCHLD, check-for-dead-services): New exported procedures.
* tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
  symbols.
* tests/forking-service.sh: New file.
* Makefile.am: Add tests/forking-service.sh.
---
 Makefile.am                  |   3 +-
 modules/shepherd.scm         |  19 ++++++--
 modules/shepherd/service.scm |  78 ++++++++++++++++++------------
 tests/basic.sh               |   4 +-
 tests/forking-service.sh     | 111 +++++++++++++++++++++++++++++++++++++++++++
 tests/status-sexp.sh         |   4 +-
 6 files changed, 179 insertions(+), 40 deletions(-)
 create mode 100644 tests/forking-service.sh

diff --git a/Makefile.am b/Makefile.am
index 1c394e1..5eb058f 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -189,7 +189,8 @@ TESTS =						\
   tests/no-home.sh				\
   tests/pid-file.sh				\
   tests/status-sexp.sh				\
-  tests/signals.sh
+  tests/signals.sh				\
+  tests/forking-service.sh
 
 TEST_EXTENSIONS = .sh
 EXTRA_DIST += $(TESTS)
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index 650ba63..cdcd328 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -42,6 +42,8 @@
   (with-fluids ((%default-port-encoding "UTF-8"))
     (let ((sock    (socket PF_UNIX SOCK_STREAM 0))
           (address (make-socket-address AF_UNIX file-name)))
+      (fcntl sock F_SETFL (logior O_NONBLOCK
+                                  (fcntl sock F_GETFL)))
       (bind sock address)
       (listen sock 10)
       sock)))
@@ -218,11 +220,18 @@
             (_  #t))
 
           (let next-command ()
-            (match (accept sock)
-              ((command-source . client-address)
-               (setvbuf command-source _IOFBF 1024)
-               (process-connection command-source))
-              (_ #f))
+            (define (read-from sock)
+              (match (accept sock)
+                ((command-source . client-address)
+                 (setvbuf command-source _IOFBF 1024)
+                 (process-connection command-source))
+                (_ #f)))
+            (match (select (list sock) (list) (list) 0.5)
+              (((sock) _ _)
+               (read-from sock))
+              (_
+               #f))
+            (check-for-dead-services)
             (next-command))))))
 
 (define (process-connection sock)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 83600e4..22972c5 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -3,6 +3,7 @@
 ;; Copyright (C) 2002, 2003 Wolfgang Järling <wolfgang@pro-linux.de>
 ;; Copyright (C) 2014 Alex Sassmannshausen <alex.sassmannshausen@gmail.com>
 ;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
+;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -64,6 +65,7 @@
             for-each-service
             lookup-services
             respawn-service
+            handle-SIGCHLD
             register-services
             provided-by
             required-by
@@ -77,6 +79,7 @@
             make-system-destructor
             make-init.d-service
 
+            check-for-dead-services
             root-service
             make-actions
 
@@ -800,7 +803,7 @@ false."
 its PID."
   ;; Install the SIGCHLD handler if this is the first fork+exec-command call
   (unless %sigchld-handler-installed?
-    (sigaction SIGCHLD respawn-service SA_NOCLDSTOP)
+    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
     (set! %sigchld-handler-installed? #t))
   (let ((pid (primitive-fork)))
     (if (zero? pid)
@@ -991,7 +994,7 @@ child left."
                           what (strerror errno))
             '(0 . #f)))))))
 
-(define (respawn-service signum)
+(define (handle-SIGCHLD signum)
   "Handle SIGCHLD, possibly by respawning the service that just died, or
 otherwise by updating its state."
   (let loop ()
@@ -1009,39 +1012,42 @@ otherwise by updating its state."
 
          ;; SERV can be #f for instance when this code runs just after a
          ;; service's 'stop' method killed its process and completed.
-         (when serv
-           (slot-set! serv 'running #f)
-           (if (and (respawn? serv)
-                    (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
-                                             (car respawn-limit)
-                                             (cdr respawn-limit))))
-               (if (not (slot-ref serv 'waiting-for-termination?))
-                   (begin
-                     ;; Everything is okay, start it.
-                     (local-output "Respawning ~a."
-                                   (canonical-name serv))
-                     (slot-set! serv 'last-respawns
-                                (cons (current-time)
-                                      (slot-ref serv 'last-respawns)))
-                     (start serv))
-                   ;; We have just been waiting for the
-                   ;; termination.  The `running' slot has already
-                   ;; been set to `#f' by `stop'.
-                   (begin
-                     (local-output "Service ~a terminated."
-                                   (canonical-name serv))
-                     (slot-set! serv 'waiting-for-termination? #f)))
-               (begin
-                 (local-output "Service ~a has been disabled."
-                               (canonical-name serv))
-                 (when (respawn? serv)
-                   (local-output "  (Respawning too fast.)"))
-                 (slot-set! serv 'enabled? #f))))
+         (respawn-service serv)
 
          ;; As noted in libc's manual (info "(libc) Process Completion"),
          ;; loop so we don't miss any terminated child process.
          (loop))))))
 
+(define (respawn-service serv)
+  (when serv
+    (slot-set! serv 'running #f)
+    (if (and (respawn? serv)
+             (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
+                                      (car respawn-limit)
+                                      (cdr respawn-limit))))
+        (if (not (slot-ref serv 'waiting-for-termination?))
+            (begin
+              ;; Everything is okay, start it.
+              (local-output "Respawning ~a."
+                            (canonical-name serv))
+              (slot-set! serv 'last-respawns
+                         (cons (current-time)
+                               (slot-ref serv 'last-respawns)))
+              (start serv))
+            ;; We have just been waiting for the
+            ;; termination.  The `running' slot has already
+            ;; been set to `#f' by `stop'.
+            (begin
+              (local-output "Service ~a terminated."
+                            (canonical-name serv))
+              (slot-set! serv 'waiting-for-termination? #f)))
+        (begin
+          (local-output "Service ~a has been disabled."
+                        (canonical-name serv))
+          (when (respawn? serv)
+            (local-output "  (Respawning too fast.)"))
+          (slot-set! serv 'enabled? #f)))))
+
 ;; Add NEW-SERVICES to the list of known services.
 (define (register-services . new-services)
   (define (register-single-service new)
@@ -1171,6 +1177,18 @@ file when persistence is enabled."
         (lambda (p)
           (format p "~{~a ~}~%" running-services))))))
 
+(define (check-for-dead-services)
+  (define (process-exists? pid)
+    (catch #t
+      (lambda () (kill pid 0) #t)
+      (lambda _ #f)))
+  (for-each-service (lambda (service)
+                      (let ((running (slot-ref service 'running)))
+                        (when (and (integer? running)
+                                   (not (process-exists? running)))
+                            (local-output "PID ~a (~a) is dead!" running (canonical-name service))
+                            (respawn-service service))))))
+
 (define root-service
   (make <service>
     #:docstring "The root service is used to operate on shepherd itself."
diff --git a/tests/basic.sh b/tests/basic.sh
index 1ddb334..2ecd8fb 100644
--- a/tests/basic.sh
+++ b/tests/basic.sh
@@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" <<EOF
 (register-services
  (make <service>
    #:provides '(test-loaded)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop (const #f)))
 EOF
 
@@ -166,7 +166,7 @@ $herd status test-loaded
 $herd status test-loaded | grep stopped
 
 $herd start test-loaded
-$herd status test-loaded | grep -i 'running.*42'
+$herd status test-loaded | grep -i 'running.*abc'
 $herd stop test-loaded
 $herd unload root test-loaded
 
diff --git a/tests/forking-service.sh b/tests/forking-service.sh
new file mode 100644
index 0000000..90c684a
--- /dev/null
+++ b/tests/forking-service.sh
@@ -0,0 +1,111 @@
+# GNU Shepherd --- Test detecting a forked process' termination
+# Copyright © 2016 Ludovic Courtès <ludo@gnu.org>
+# Copyright © 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
+#
+# This file is part of the GNU Shepherd.
+#
+# The GNU Shepherd is free software; you can redistribute it and/or modify it
+# under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or (at
+# your option) any later version.
+#
+# The GNU Shepherd is distributed in the hope that it will be useful, but
+# WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with the GNU Shepherd.  If not, see <http://www.gnu.org/licenses/>.
+
+shepherd --version
+herd --version
+
+socket="t-socket-$$"
+conf="t-conf-$$"
+log="t-log-$$"
+pid="t-pid-$$"
+service_pid="t-service-pid-$$"
+service2_pid="t-service2-pid-$$"
+service2_started="t-service2-starts-$$"
+
+herd="herd -s $socket"
+
+function cleanup() {
+    cat $log || true
+    rm -f $socket $conf $log $service2_started
+    test -f $pid && kill "$(cat $pid)" || true
+    rm -f $pid
+    test -f $service_pid && kill "$(cat $service_pid)" || true
+    rm -f $service_pid
+    test -f $service2_pid && kill "$(cat $service2_pid)" || true
+    rm -f $service2_pid
+}
+
+trap cleanup EXIT
+
+cat > "$conf"<<EOF
+(define %command
+  '("$SHELL" "-c" "sleep 600 & echo \$! > $PWD/$service_pid"))
+
+(register-services
+ (make <service>
+   ;; A service that forks into a different process.
+   #:provides '(test)
+   #:start (make-forkexec-constructor %command
+                                      #:pid-file "$PWD/$service_pid")
+   #:stop  (make-kill-destructor)
+   #:respawn? #f))
+
+(define %command2
+  '("$SHELL" "-c" "echo started >> $PWD/$service2_started; sleep 600 & echo \$! > $PWD/$service2_pid"))
+
+(register-services
+ (make <service>
+   ;; A service that forks into a different process.
+   #:provides '(test2)
+   #:start (make-forkexec-constructor %command2
+                                      #:pid-file "$PWD/$service2_pid")
+   #:stop  (make-kill-destructor)
+   #:respawn? #t))
+EOF
+cat $conf
+
+rm -f "$pid"
+shepherd -I -s "$socket" -c "$conf" -l "$log" --pid="$pid" &
+
+# Wait till it's ready.
+while ! test -f "$pid" ; do sleep 0.3 ; done
+
+shepherd_pid="$(cat $pid)"
+
+# start both of the services
+$herd start test
+$herd start test2
+
+# make sure "test" is started
+until $herd status test | grep started; do sleep 0.3; done
+test -f "$service_pid"
+service_pid_value="$(cat $service_pid)"
+# now kill it
+kill "$service_pid_value"
+while kill -0 "$service_pid_value"; do sleep 0.3; done
+# shepherd should notice that the service has stopped within one second
+sleep 1
+$herd status test | grep stopped
+
+
+
+# make sure "test2" has started
+until $herd status test2 | grep started; do sleep 0.3; done
+test -f "$service2_pid"
+service2_pid_value="$(cat $service2_pid)"
+test "$(cat $PWD/$service2_started)" = "started"
+# now kill it
+rm -f "$service2_pid"
+kill $service2_pid_value
+while kill -0 "$service2_pid_value"; do sleep 0.3; done
+# shepherd should notice that the service has stopped, and restart it, within one second
+sleep 1;
+$herd status test2 | grep started
+test "$(cat $PWD/$service2_started)" = "started
+started"
diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh
index b7c8cb4..11b967e 100644
--- a/tests/status-sexp.sh
+++ b/tests/status-sexp.sh
@@ -33,7 +33,7 @@ cat > "$conf"<<EOF
 (register-services
  (make <service>
    #:provides '(foo)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop  (const #f)
    #:docstring "Foo!"
    #:respawn? #t)
@@ -85,7 +85,7 @@ root_service_sexp="
 	     (service (version 0)
 	       (provides (foo)) (requires ())
 	       (respawn? #t) (docstring \"Foo!\")
-	       (enabled? #t) (running 42) (conflicts ())
+	       (enabled? #t) (running abc) (conflicts ())
 	       (last-respawns ()))
 	     (service (version 0)
 	       (provides (bar)) (requires (foo))
-- 
2.16.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-02-27 21:56 [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Carlo Zancanaro
@ 2018-02-28 22:06 ` Ludovic Courtès
  2018-03-01 22:37   ` Carlo Zancanaro
  0 siblings, 1 reply; 14+ messages in thread
From: Ludovic Courtès @ 2018-02-28 22:06 UTC (permalink / raw)
  To: Carlo Zancanaro; +Cc: 30637

Hi Carlo,

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

> Another patch for shepherd!

Always welcome!  :-)

> The problem is that shepherd, when run as a user process, can "lose"
> services which fork away. Shepherd can still kill them, but a SIGCHLD
> won't be delivered if they die, so shepherd can't restart/disable
> them. My prime example is emacs, which I run with --daemon. If I then
> kill emacs, shepherd will still think that it is running.

There are two issues here, I think.

  1. shepherd cannot lose SIGCHLD: if a process dies immediately once
     it’s been spawned, as is the case with “emacs --daemon” or any
     other daemon-style program, it should receive SIGCHLD and process
     it.

     However, there could be a race condition: if SIGCHLD is handled
     before the ‘running’ value has been set, then we’ll still get the
     non-#f ‘running’ value even though the process died in the
     meantime.

     The code tries to prevent that (see (shepherd service) around line
     320), but looking more closely, I think the race is still there.
     Namely, the whole ‘let’ block, including the call to ‘start’,
     should be in ‘call-with-blocked-asyncs’, I think.  Could you check
     if that helps for you?

  2. shepherd currently can’t do much with real daemons.  So what we do
     in GuixSD is to either start programs in non-daemon mode, when
     that’s an option, or pass #:pid-file to retrieve the forked process
     PID.  I think you should do one of these as well.

WDYT?

Thanks,
Ludo’.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-02-28 22:06 ` Ludovic Courtès
@ 2018-03-01 22:37   ` Carlo Zancanaro
  2018-03-02  9:44     ` Ludovic Courtès
  0 siblings, 1 reply; 14+ messages in thread
From: Carlo Zancanaro @ 2018-03-01 22:37 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 30637

[-- Attachment #1: Type: text/plain, Size: 2241 bytes --]

Hey Ludo,

On Wed, Feb 28 2018, Ludovic Courtès wrote:
>> The problem is that shepherd, when run as a user process, can 
>> "lose"
>> services which fork away. Shepherd can still kill them, but a 
>> SIGCHLD
>> won't be delivered if they die, so shepherd can't 
>> restart/disable
>> them. My prime example is emacs, which I run with --daemon. If 
>> I then
>> kill emacs, shepherd will still think that it is running.
>
> There are two issues here, I think.
>
>   1. shepherd cannot lose SIGCHLD: if a process dies immediately 
>   once
>      it’s been spawned, as is the case with “emacs --daemon” or 
>      any
>      other daemon-style program, it should receive SIGCHLD and 
>      process
>      it.

Yeah, that's true, but the problem is that shepherd only processes 
the SIGCHLD if there is a service with its `running` slot set to 
the pid. When emacs forks, the original process may have its 
SIGCHLD handled, but that doesn't affect shepherd's service state 
(as it shouldn't, because it's using #:pid-file to track the 
forked process).

>   2. shepherd currently can’t do much with real daemons.  So 
>   what we do
>      in GuixSD is to either start programs in non-daemon mode, 
>      when
>      that’s an option, or pass #:pid-file to retrieve the forked 
>      process
>      PID.  I think you should do one of these as well.

I am doing that. The problem is that when a service dies (crashes, 
quits, etc.) the `respawn?` option cannot be honoured because 
shepherd is not notified that the process has terminated (because 
it never receives a SIGCHLD for the forked pid). My patch polls 
for the processes we expect, to make up for the lack of 
notification. I would much rather it receive an event/signal to 
notify that the forked process has died, but I don't know how to 
do that in a robust, portable way so I chose to poll instead.

If you look at my test case in tests/respawn-service.sh (which can 
be read in its entirety in the diff attached to my previous email) 
you can see the problem that this patch solves. The test will fail 
without the rest of my patch, but will pass with them (guix build 
container issue notwithstanding).

Carlo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-03-01 22:37   ` Carlo Zancanaro
@ 2018-03-02  9:44     ` Ludovic Courtès
  2018-03-02 10:13       ` Carlo Zancanaro
  0 siblings, 1 reply; 14+ messages in thread
From: Ludovic Courtès @ 2018-03-02  9:44 UTC (permalink / raw)
  To: Carlo Zancanaro; +Cc: 30637

Hi Carlo,

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

> On Wed, Feb 28 2018, Ludovic Courtès wrote:
>>> The problem is that shepherd, when run as a user process, can
>>> "lose"
>>> services which fork away. Shepherd can still kill them, but a
>>> SIGCHLD
>>> won't be delivered if they die, so shepherd can't restart/disable
>>> them. My prime example is emacs, which I run with --daemon. If I
>>> then
>>> kill emacs, shepherd will still think that it is running.
>>
>> There are two issues here, I think.
>>
>>   1. shepherd cannot lose SIGCHLD: if a process dies immediately
>> once
>>      it’s been spawned, as is the case with “emacs --daemon” or
>> any
>>      other daemon-style program, it should receive SIGCHLD and
>> process
>>      it.
>
> Yeah, that's true, but the problem is that shepherd only processes the
> SIGCHLD if there is a service with its `running` slot set to the
> pid.

Well, it does call ‘waitpid’ every time it gets a SIGCHLD, but it’s true
that it doesn’t do anything beyond that if it doesn’t know what service
a PID corresponds to.

> When emacs forks, the original process may have its SIGCHLD handled,
> but that doesn't affect shepherd's service state (as it shouldn't,
> because it's using #:pid-file to track the forked process).
>
>>   2. shepherd currently can’t do much with real daemons.  So   what
>> we do
>>      in GuixSD is to either start programs in non-daemon mode,
>> when
>>      that’s an option, or pass #:pid-file to retrieve the forked
>> process
>>      PID.  I think you should do one of these as well.
>
> I am doing that. The problem is that when a service dies (crashes,
> quits, etc.) the `respawn?` option cannot be honoured because shepherd
> is not notified that the process has terminated (because it never
> receives a SIGCHLD for the forked pid). My patch polls for the
> processes we expect, to make up for the lack of notification.

I see.

Actually, thinking more about it, we should be using
PR_SET_CHILD_SUBREAPER from prctl(2), which is designed exactly for
that.

So what about this plan:

  1. Add FFI bindings in (shepherd system) for prctl(2).  We should
     arrange for it to throw to 'system-error when the ‘prctl’ symbol is
     missing, as is the case on GNU/Hurd.

  2. Use prctl/PR_SET_CHILD_SUBREAPER in ‘exec-command’.  Here we must
     ‘catch-system-error’ around that call to cater to GNU/Hurd.

That would address the main issue without having to resort to polling.
Respawning will work only when #:pid-file is used though, but that’s
already an improvement.

Thoughts?

Thanks,
Ludo’.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-03-02  9:44     ` Ludovic Courtès
@ 2018-03-02 10:13       ` Carlo Zancanaro
  2018-03-02 12:42         ` Ludovic Courtès
  0 siblings, 1 reply; 14+ messages in thread
From: Carlo Zancanaro @ 2018-03-02 10:13 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 30637

[-- Attachment #1: Type: text/plain, Size: 1962 bytes --]

Hey Ludo,

On Fri, Mar 02 2018, Ludovic Courtès wrote:
>> I am doing that. The problem is that when a service dies 
>> (crashes, quits, etc.) the `respawn?` option cannot be honoured 
>> because shepherd is not notified that the process has 
>> terminated (because it never receives a SIGCHLD for the forked 
>> pid). My patch polls for the processes we expect, to make up 
>> for the lack of notification.
>
> I see.
>
> Actually, thinking more about it, we should be using 
> PR_SET_CHILD_SUBREAPER from prctl(2), which is designed exactly 
> for that.

Excellent! This is exactly the information that I needed. This is 
what I've been looking for, but without enough knowledge to be 
able to find it. Thanks!

> So what about this plan:
>
>   1. Add FFI bindings in (shepherd system) for prctl(2). We 
>   should arrange for it to throw to 'system-error when the 
>   ‘prctl’ symbol is missing, as is the case on GNU/Hurd.

Are we okay with having this just not work on GNU/Hurd (or kernels 
older than 3.4, according to the prctl manpage)? We could fall 
back to a polling approach if prctl isn't available? I don't 
really like the idea of this working on some kernels but not 
others, given that process supervision is one of the main jobs of 
shepherd.

>   2. Use prctl/PR_SET_CHILD_SUBREAPER in ‘exec-command’. Here we 
>   must ‘catch-system-error’ around that call to cater to 
>   GNU/Hurd.

Why would we need to set it in exec-command? It looks like it 
modifies the state of the calling process, which means we'd want 
to set it in the shepherd service, not in each of the child 
processes.

> That would address the main issue without having to resort to 
> polling. Respawning will work only when #:pid-file is used 
> though, but that’s already an improvement.
>
> Thoughts?

I'll try to get this working in the next few days. Hopefully 
you'll see a patch from me soon.

Carlo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-03-02 10:13       ` Carlo Zancanaro
@ 2018-03-02 12:42         ` Ludovic Courtès
  2018-03-03  7:58           ` Carlo Zancanaro
  0 siblings, 1 reply; 14+ messages in thread
From: Ludovic Courtès @ 2018-03-02 12:42 UTC (permalink / raw)
  To: Carlo Zancanaro; +Cc: 30637

Hello!

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

> On Fri, Mar 02 2018, Ludovic Courtès wrote:

[...]

>> So what about this plan:
>>
>>   1. Add FFI bindings in (shepherd system) for prctl(2). We   should
>> arrange for it to throw to 'system-error when the   ‘prctl’ symbol
>> is missing, as is the case on GNU/Hurd.
>
> Are we okay with having this just not work on GNU/Hurd (or kernels
> older than 3.4, according to the prctl manpage)? We could fall back to
> a polling approach if prctl isn't available? I don't really like the
> idea of this working on some kernels but not others, given that
> process supervision is one of the main jobs of shepherd.

Yeah, I agree.

The ‘prctl’ procedure itself should simply throw to 'system-error on
GNU/Hurd.  But then, in (shepherd), we could add the polling thing when
(not (string-contains %host-type "linux")).

WDYT?

>>   2. Use prctl/PR_SET_CHILD_SUBREAPER in ‘exec-command’. Here we
>> must ‘catch-system-error’ around that call to cater to   GNU/Hurd.

Actually this should be done in ‘fork+exec-command’ in the child
process.

> Why would we need to set it in exec-command? It looks like it modifies
> the state of the calling process, which means we'd want to set it in
> the shepherd service, not in each of the child processes.

We want to set the “reaper” of child processes to Shepherd itself, so we
must do that in child processes.  The shepherd process cannot be its own
reaper I suppose.

BTW, we should do PR_SET_CHILD_SUBREAPER only when (not (= 1 (getpid))).

> I'll try to get this working in the next few days. Hopefully you'll
> see a patch from me soon.

Awesome, thank you!

Ludo’.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-03-02 12:42         ` Ludovic Courtès
@ 2018-03-03  7:58           ` Carlo Zancanaro
  2018-03-03 15:21             ` Ludovic Courtès
  0 siblings, 1 reply; 14+ messages in thread
From: Carlo Zancanaro @ 2018-03-03  7:58 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 30637


[-- Attachment #1.1: Type: text/plain, Size: 2198 bytes --]

Hey Ludo,

I've re-written my patch, and it's attached in two commits. The 
first one adds the necessary calls to prctl, and the second adds 
the fallback to polling.

On Fri, Mar 02 2018, Ludovic Courtès wrote:
> The ‘prctl’ procedure itself should simply throw to 
> 'system-error on GNU/Hurd. But then, in (shepherd), we could add 
> the polling thing when (not (string-contains %host-type 
> "linux")).
>
> WDYT?

I don't like the idea of doing this based on the host type. In my 
patch I've done it based on whether the prctl call succeeded. If 
the prctl call throws a system-error then we poll, otherwise we 
rely on SIGCHLD. I don't have a system set up with another kernel, 
though, so I don't know how I can easily test whether the fallback 
logic is working properly. When I replaced the prctl call with 
(throw 'system-error) it seemed to work.

The fallback code still fails in the guix build environment (as my 
previous patch did), but when it's using prctl it works properly. 
This means that a build on Linux pre-3.4, or on Hurd, will fail, 
which probably isn't acceptable given that shepherd is a hard 
dependency for starting a GuixSD system. As far as I can tell the 
test fails because the processes stick around as zombies, so I 
assume that pid 1 in the build container isn't properly reaping 
zombie processes. Do you have any ideas how I can force it to do 
so?

> We want to set the “reaper” of child processes to Shepherd 
> itself, so we must do that in child processes. The shepherd 
> process cannot be its own reaper I suppose.

Reading the manpage, and then running some code, I think you're 
wrong about this. Using prctl with PR_SET_CHILD_SUBREAPER marks 
the calling process as a child subreaper. That means that any 
processes that are orphaned below the current process get 
reparented under the current process (or a closer child subreaper, 
if there's one further down). If there are no processes marked as 
child subreapers, then the orphaned process is reparented under 
pid 1. We should only need to call prctl in two places: when 
shepherd initially starts, or when we fork for daemonize.

Carlo


[-- Attachment #1.2: 0001-Handle-forked-process-SIGCHLD-signals.patch --]
[-- Type: text/x-patch, Size: 7908 bytes --]

From 5f26da2ce6a26c8412368900987ac5438f3e70cd Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Sat, 3 Mar 2018 17:26:05 +1100
Subject: [PATCH 1/2] Handle forked process SIGCHLD signals

* Makefile.am (TESTS): Add tests/forking-service.sh.
* configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
* modules/shepherd.scm: Set the child subreaper attribute of main shepherd
  process (as long as we're not pid 1).
* modules/shepherd/service.scm (root-service)[daemonize]: Set the child
  subreaper attribute of newly forked shepherd process.
* modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
  and export it.
  (prctl): Add new procedure and export it.
---
 Makefile.am                    |   1 +
 configure.ac                   |   4 ++
 modules/shepherd.scm           |   2 +
 modules/shepherd/service.scm   |   4 +-
 modules/shepherd/system.scm.in |  17 ++++++-
 tests/forking-service.sh       | 111 +++++++++++++++++++++++++++++++++++++++++
 6 files changed, 137 insertions(+), 2 deletions(-)
 create mode 100644 tests/forking-service.sh

diff --git a/Makefile.am b/Makefile.am
index eafa308..8dad006 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -190,6 +190,7 @@ TESTS =						\
   tests/no-home.sh				\
   tests/pid-file.sh				\
   tests/status-sexp.sh				\
+  tests/forking-service.sh			\
   tests/signals.sh
 
 TEST_EXTENSIONS = .sh
diff --git a/configure.ac b/configure.ac
index bb5058d..fbe16f4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -72,7 +72,11 @@ esac
 AC_SUBST([RB_AUTOBOOT])
 AC_SUBST([RB_HALT_SYSTEM])
 AC_SUBST([RB_POWER_OFF])
+AC_MSG_RESULT([done])
 
+AC_MSG_CHECKING([<sys/prctl.h> constants])
+AC_COMPUTE_INT([PR_SET_CHILD_SUBREAPER], [PR_SET_CHILD_SUBREAPER], [#include <sys/prctl.h>])
+AC_SUBST([PR_SET_CHILD_SUBREAPER])
 AC_MSG_RESULT([done])
 
 dnl Manual pages.
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index df5420f..ab59e08 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -50,6 +50,8 @@
 ;; Main program.
 (define (main . args)
   (initialize-cli)
+  (when (not (= 1 (getpid)))
+    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
 
   (let ((config-file #f)
 	(socket-file default-socket-file)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 2224932..b6394f2 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -1274,7 +1274,9 @@ we want to receive these signals."
            (local-output "Running as PID 1, so not daemonizing."))
           (else
            (if (zero? (primitive-fork))
-               #t
+               (begin
+                 (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))
+                 #t)
                (primitive-exit 0))))))
      (persistency
       "Safe the current state of running and non-running services.
diff --git a/modules/shepherd/system.scm.in b/modules/shepherd/system.scm.in
index a54dca7..55806cb 100644
--- a/modules/shepherd/system.scm.in
+++ b/modules/shepherd/system.scm.in
@@ -23,7 +23,9 @@
   #:export (reboot
             halt
             power-off
-            max-file-descriptors))
+            max-file-descriptors
+            prctl
+            PR_SET_CHILD_SUBREAPER))
 
 ;; The <sys/reboot.h> constants.
 (define RB_AUTOBOOT @RB_AUTOBOOT@)
@@ -130,6 +132,19 @@ the returned procedure is called."
                    (list err))
             result)))))
 
+(define PR_SET_CHILD_SUBREAPER @PR_SET_CHILD_SUBREAPER@)
+
+(define prctl
+  (let ((proc (syscall->procedure long "prctl" (list int int))))
+    (lambda (process operation)
+      "Perform an operation on the given process"
+      (let-values (((result err) (proc process operation)))
+        (if (= -1 result)
+            (throw 'system-error "prctl" "~A: ~S"
+                   (list (strerror err) name)
+                   (list err))
+            result)))))
+
 (define (max-file-descriptors)
   "Return the maximum number of open file descriptors allowed."
   (sysconf _SC_OPEN_MAX))
diff --git a/tests/forking-service.sh b/tests/forking-service.sh
new file mode 100644
index 0000000..90c684a
--- /dev/null
+++ b/tests/forking-service.sh
@@ -0,0 +1,111 @@
+# GNU Shepherd --- Test detecting a forked process' termination
+# Copyright © 2016 Ludovic Courtès <ludo@gnu.org>
+# Copyright © 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
+#
+# This file is part of the GNU Shepherd.
+#
+# The GNU Shepherd is free software; you can redistribute it and/or modify it
+# under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or (at
+# your option) any later version.
+#
+# The GNU Shepherd is distributed in the hope that it will be useful, but
+# WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with the GNU Shepherd.  If not, see <http://www.gnu.org/licenses/>.
+
+shepherd --version
+herd --version
+
+socket="t-socket-$$"
+conf="t-conf-$$"
+log="t-log-$$"
+pid="t-pid-$$"
+service_pid="t-service-pid-$$"
+service2_pid="t-service2-pid-$$"
+service2_started="t-service2-starts-$$"
+
+herd="herd -s $socket"
+
+function cleanup() {
+    cat $log || true
+    rm -f $socket $conf $log $service2_started
+    test -f $pid && kill "$(cat $pid)" || true
+    rm -f $pid
+    test -f $service_pid && kill "$(cat $service_pid)" || true
+    rm -f $service_pid
+    test -f $service2_pid && kill "$(cat $service2_pid)" || true
+    rm -f $service2_pid
+}
+
+trap cleanup EXIT
+
+cat > "$conf"<<EOF
+(define %command
+  '("$SHELL" "-c" "sleep 600 & echo \$! > $PWD/$service_pid"))
+
+(register-services
+ (make <service>
+   ;; A service that forks into a different process.
+   #:provides '(test)
+   #:start (make-forkexec-constructor %command
+                                      #:pid-file "$PWD/$service_pid")
+   #:stop  (make-kill-destructor)
+   #:respawn? #f))
+
+(define %command2
+  '("$SHELL" "-c" "echo started >> $PWD/$service2_started; sleep 600 & echo \$! > $PWD/$service2_pid"))
+
+(register-services
+ (make <service>
+   ;; A service that forks into a different process.
+   #:provides '(test2)
+   #:start (make-forkexec-constructor %command2
+                                      #:pid-file "$PWD/$service2_pid")
+   #:stop  (make-kill-destructor)
+   #:respawn? #t))
+EOF
+cat $conf
+
+rm -f "$pid"
+shepherd -I -s "$socket" -c "$conf" -l "$log" --pid="$pid" &
+
+# Wait till it's ready.
+while ! test -f "$pid" ; do sleep 0.3 ; done
+
+shepherd_pid="$(cat $pid)"
+
+# start both of the services
+$herd start test
+$herd start test2
+
+# make sure "test" is started
+until $herd status test | grep started; do sleep 0.3; done
+test -f "$service_pid"
+service_pid_value="$(cat $service_pid)"
+# now kill it
+kill "$service_pid_value"
+while kill -0 "$service_pid_value"; do sleep 0.3; done
+# shepherd should notice that the service has stopped within one second
+sleep 1
+$herd status test | grep stopped
+
+
+
+# make sure "test2" has started
+until $herd status test2 | grep started; do sleep 0.3; done
+test -f "$service2_pid"
+service2_pid_value="$(cat $service2_pid)"
+test "$(cat $PWD/$service2_started)" = "started"
+# now kill it
+rm -f "$service2_pid"
+kill $service2_pid_value
+while kill -0 "$service2_pid_value"; do sleep 0.3; done
+# shepherd should notice that the service has stopped, and restart it, within one second
+sleep 1;
+$herd status test2 | grep started
+test "$(cat $PWD/$service2_started)" = "started
+started"
-- 
2.16.1


[-- Attachment #1.3: 0002-Poll-every-0.5s-to-find-dead-forked-services.patch --]
[-- Type: text/x-patch, Size: 10120 bytes --]

From ec47fa189c7d47f1d9444d939b084850f0a7186c Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Wed, 21 Feb 2018 22:57:59 +1100
Subject: [PATCH 2/2] Poll every 0.5s to find dead forked services

* modules/shepherd.scm (open-server-socket): Set socket to be
  non-blocking.
  (main): Use select with a timeout. If prctl failed when shepherd started
  then call check-for-dead-services between connections/timeouts.
* modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
  signal handler.
  (respawn-service): Separate logic for respawning services from handling
  SIGCHLD.
  (handle-SIGCHLD, check-for-dead-services): New exported procedures.
* tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
  symbols.
---
 modules/shepherd.scm         | 31 ++++++++++++++----
 modules/shepherd/service.scm | 78 +++++++++++++++++++++++++++-----------------
 tests/basic.sh               |  4 +--
 tests/status-sexp.sh         |  4 +--
 4 files changed, 76 insertions(+), 41 deletions(-)

diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index ab59e08..b824546 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -42,6 +42,8 @@
   (with-fluids ((%default-port-encoding "UTF-8"))
     (let ((sock    (socket PF_UNIX SOCK_STREAM 0))
           (address (make-socket-address AF_UNIX file-name)))
+      (fcntl sock F_SETFL (logior O_NONBLOCK
+                                  (fcntl sock F_GETFL)))
       (bind sock address)
       (listen sock 10)
       sock)))
@@ -49,9 +51,17 @@
 \f
 ;; Main program.
 (define (main . args)
+  (define poll-services
+    (if (= 1 (getpid))
+        (lambda () #f)
+        (catch 'system-error
+          (lambda ()
+            (prctl PR_SET_CHILD_SUBREAPER 1)
+            (lambda () #f))
+          (lambda (key . args)
+            check-for-dead-services))))
+
   (initialize-cli)
-  (when (not (= 1 (getpid)))
-    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
 
   (let ((config-file #f)
 	(socket-file default-socket-file)
@@ -220,11 +230,18 @@
             (_  #t))
 
           (let next-command ()
-            (match (accept sock)
-              ((command-source . client-address)
-               (setvbuf command-source _IOFBF 1024)
-               (process-connection command-source))
-              (_ #f))
+            (define (read-from sock)
+              (match (accept sock)
+                ((command-source . client-address)
+                 (setvbuf command-source _IOFBF 1024)
+                 (process-connection command-source))
+                (_ #f)))
+            (match (select (list sock) (list) (list) 0.5)
+              (((sock) _ _)
+               (read-from sock))
+              (_
+               #f))
+            (poll-services)
             (next-command))))))
 
 (define (process-connection sock)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index b6394f2..fc53d76 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -3,6 +3,7 @@
 ;; Copyright (C) 2002, 2003 Wolfgang Järling <wolfgang@pro-linux.de>
 ;; Copyright (C) 2014 Alex Sassmannshausen <alex.sassmannshausen@gmail.com>
 ;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
+;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -64,6 +65,7 @@
             for-each-service
             lookup-services
             respawn-service
+            handle-SIGCHLD
             register-services
             provided-by
             required-by
@@ -77,6 +79,7 @@
             make-system-destructor
             make-init.d-service
 
+            check-for-dead-services
             root-service
             make-actions
 
@@ -800,7 +803,7 @@ false."
 its PID."
   ;; Install the SIGCHLD handler if this is the first fork+exec-command call
   (unless %sigchld-handler-installed?
-    (sigaction SIGCHLD respawn-service SA_NOCLDSTOP)
+    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
     (set! %sigchld-handler-installed? #t))
   (let ((pid (primitive-fork)))
     (if (zero? pid)
@@ -991,7 +994,7 @@ child left."
                           what (strerror errno))
             '(0 . #f)))))))
 
-(define (respawn-service signum)
+(define (handle-SIGCHLD signum)
   "Handle SIGCHLD, possibly by respawning the service that just died, or
 otherwise by updating its state."
   (let loop ()
@@ -1009,39 +1012,42 @@ otherwise by updating its state."
 
          ;; SERV can be #f for instance when this code runs just after a
          ;; service's 'stop' method killed its process and completed.
-         (when serv
-           (slot-set! serv 'running #f)
-           (if (and (respawn? serv)
-                    (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
-                                             (car respawn-limit)
-                                             (cdr respawn-limit))))
-               (if (not (slot-ref serv 'waiting-for-termination?))
-                   (begin
-                     ;; Everything is okay, start it.
-                     (local-output "Respawning ~a."
-                                   (canonical-name serv))
-                     (slot-set! serv 'last-respawns
-                                (cons (current-time)
-                                      (slot-ref serv 'last-respawns)))
-                     (start serv))
-                   ;; We have just been waiting for the
-                   ;; termination.  The `running' slot has already
-                   ;; been set to `#f' by `stop'.
-                   (begin
-                     (local-output "Service ~a terminated."
-                                   (canonical-name serv))
-                     (slot-set! serv 'waiting-for-termination? #f)))
-               (begin
-                 (local-output "Service ~a has been disabled."
-                               (canonical-name serv))
-                 (when (respawn? serv)
-                   (local-output "  (Respawning too fast.)"))
-                 (slot-set! serv 'enabled? #f))))
+         (respawn-service serv)
 
          ;; As noted in libc's manual (info "(libc) Process Completion"),
          ;; loop so we don't miss any terminated child process.
          (loop))))))
 
+(define (respawn-service serv)
+  (when serv
+    (slot-set! serv 'running #f)
+    (if (and (respawn? serv)
+             (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
+                                      (car respawn-limit)
+                                      (cdr respawn-limit))))
+        (if (not (slot-ref serv 'waiting-for-termination?))
+            (begin
+              ;; Everything is okay, start it.
+              (local-output "Respawning ~a."
+                            (canonical-name serv))
+              (slot-set! serv 'last-respawns
+                         (cons (current-time)
+                               (slot-ref serv 'last-respawns)))
+              (start serv))
+            ;; We have just been waiting for the
+            ;; termination.  The `running' slot has already
+            ;; been set to `#f' by `stop'.
+            (begin
+              (local-output "Service ~a terminated."
+                            (canonical-name serv))
+              (slot-set! serv 'waiting-for-termination? #f)))
+        (begin
+          (local-output "Service ~a has been disabled."
+                        (canonical-name serv))
+          (when (respawn? serv)
+            (local-output "  (Respawning too fast.)"))
+          (slot-set! serv 'enabled? #f)))))
+
 ;; Add NEW-SERVICES to the list of known services.
 (define (register-services . new-services)
   (define (register-single-service new)
@@ -1171,6 +1177,18 @@ file when persistence is enabled."
         (lambda (p)
           (format p "~{~a ~}~%" running-services))))))
 
+(define (check-for-dead-services)
+  (define (process-exists? pid)
+    (catch #t
+      (lambda () (kill pid 0) #t)
+      (lambda _ #f)))
+  (for-each-service (lambda (service)
+                      (let ((running (slot-ref service 'running)))
+                        (when (and (integer? running)
+                                   (not (process-exists? running)))
+                            (local-output "PID ~a (~a) is dead!" running (canonical-name service))
+                            (respawn-service service))))))
+
 (define root-service
   (make <service>
     #:docstring "The root service is used to operate on shepherd itself."
diff --git a/tests/basic.sh b/tests/basic.sh
index 1ddb334..2ecd8fb 100644
--- a/tests/basic.sh
+++ b/tests/basic.sh
@@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" <<EOF
 (register-services
  (make <service>
    #:provides '(test-loaded)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop (const #f)))
 EOF
 
@@ -166,7 +166,7 @@ $herd status test-loaded
 $herd status test-loaded | grep stopped
 
 $herd start test-loaded
-$herd status test-loaded | grep -i 'running.*42'
+$herd status test-loaded | grep -i 'running.*abc'
 $herd stop test-loaded
 $herd unload root test-loaded
 
diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh
index b7c8cb4..11b967e 100644
--- a/tests/status-sexp.sh
+++ b/tests/status-sexp.sh
@@ -33,7 +33,7 @@ cat > "$conf"<<EOF
 (register-services
  (make <service>
    #:provides '(foo)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop  (const #f)
    #:docstring "Foo!"
    #:respawn? #t)
@@ -85,7 +85,7 @@ root_service_sexp="
 	     (service (version 0)
 	       (provides (foo)) (requires ())
 	       (respawn? #t) (docstring \"Foo!\")
-	       (enabled? #t) (running 42) (conflicts ())
+	       (enabled? #t) (running abc) (conflicts ())
 	       (last-respawns ()))
 	     (service (version 0)
 	       (provides (bar)) (requires (foo))
-- 
2.16.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-03-03  7:58           ` Carlo Zancanaro
@ 2018-03-03 15:21             ` Ludovic Courtès
  2018-03-03 20:49               ` Carlo Zancanaro
  0 siblings, 1 reply; 14+ messages in thread
From: Ludovic Courtès @ 2018-03-03 15:21 UTC (permalink / raw)
  To: Carlo Zancanaro; +Cc: 30637

Hi Carlo,

Overall LGTM!  It’s a long reply though, but that’s because there are
lots of details to pay attention to in this Unix quagmire.  :-)

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

> I've re-written my patch, and it's attached in two commits. The first
> one adds the necessary calls to prctl, and the second adds the
> fallback to polling.

If possible I would prefer a commit that only adds prctl, and the next
one would actually use it.  I find it clearer and more convenient if we
need to bisect or revert.

> On Fri, Mar 02 2018, Ludovic Courtès wrote:
>> The ‘prctl’ procedure itself should simply throw to 'system-error on
>> GNU/Hurd. But then, in (shepherd), we could add the polling thing
>> when (not (string-contains %host-type "linux")).
>>
>> WDYT?
>
> I don't like the idea of doing this based on the host type. In my
> patch I've done it based on whether the prctl call succeeded.

Right, I actually agree with feature-based checks.  :-)

More on that inline below.

> The fallback code still fails in the guix build environment (as my
> previous patch did), but when it's using prctl it works properly. This
> means that a build on Linux pre-3.4, or on Hurd, will fail, which
> probably isn't acceptable given that shepherd is a hard dependency for
> starting a GuixSD system. As far as I can tell the test fails because
> the processes stick around as zombies,

If they’re zombies, that means nobody called waitpid(2).  Presumably the
polling code would need to do that.

So I suppose ‘check-for-dead-services’ should do something like:

          (when (integer? running)
            (catch 'system-error
              (lambda ()
                (match (waitpid* running WNOHANG)
                  (#f #f)  ;uh, not a PID?
                  ((0 . _) #f) ;ditto?
                  ((pid . _)
                   (local-output "PID ~a (~a) is dead" running (canonical-name service))
                   (respawn-service service))))
              (lambda args
                (or (= ECHILD (system-error-errno args))  ;wrong PID?
                    (= EPERM (system-error-errno args))   ;not a child
                    (apply throw args)))))

Does that make sense?  Please check waitpid(2) carefully though, because
it’s pretty gnarly and I might have forgotten or misinterpreted
something here.

>> We want to set the “reaper” of child processes to Shepherd itself,
>> so we must do that in child processes. The shepherd process cannot
>> be its own reaper I suppose.
>
> Reading the manpage, and then running some code, I think you're wrong
> about this. Using prctl with PR_SET_CHILD_SUBREAPER marks the calling
> process as a child subreaper. That means that any processes that are
> orphaned below the current process get reparented under the current
> process (or a closer child subreaper, if there's one further down). If
> there are no processes marked as child subreapers, then the orphaned
> process is reparented under pid 1. We should only need to call prctl
> in two places: when shepherd initially starts, or when we fork for
> daemonize.

Oh you’re right, sorry for the confusion!

> From 5f26da2ce6a26c8412368900987ac5438f3e70cd Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Sat, 3 Mar 2018 17:26:05 +1100
> Subject: [PATCH 1/2] Handle forked process SIGCHLD signals
>
> * Makefile.am (TESTS): Add tests/forking-service.sh.
> * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
> * modules/shepherd.scm: Set the child subreaper attribute of main shepherd
>   process (as long as we're not pid 1).
> * modules/shepherd/service.scm (root-service)[daemonize]: Set the child
>   subreaper attribute of newly forked shepherd process.
> * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
>   and export it.
>   (prctl): Add new procedure and export it.

[...]

> --- a/modules/shepherd.scm
> +++ b/modules/shepherd.scm
> @@ -50,6 +50,8 @@
>  ;; Main program.
>  (define (main . args)
>    (initialize-cli)
> +  (when (not (= 1 (getpid)))
> +    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))

I think it’s a good idea to add a comment, like:

  ;; Register ourselves to get SIGCHLD when child processes terminate.
  ;; This is necessary for daemons for which we’d otherwise never get
  ;; SIGCHLD.

> +(define prctl
> +  (let ((proc (syscall->procedure long "prctl" (list int int))))

On GNU/Hurd libc doesn’t have a “prctl” symbol.  So you need something
like:

  (if (dynamic-func "prctl" (dynamic-link))
      (let ((proc …)) …)
      (lambda (process operation)
        ;; We’re running on an OS that lacks ‘prctl’, such as GNU/Hurd.
        (throw 'system-error "prctl" "~A" (list (strerror ENOSYS))
               (list ENOSYS))))

> +function cleanup() {

You need either () or “function” but not both (shells other than Bash
might complain…).

> +shepherd_pid="$(cat $pid)"

Likewise, we should use `foo` instead of $(foo) to suppose non-Bash
shells (there are several occurrences of $(foo) here.)

> From ec47fa189c7d47f1d9444d939b084850f0a7186c Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Wed, 21 Feb 2018 22:57:59 +1100
> Subject: [PATCH 2/2] Poll every 0.5s to find dead forked services
>
> * modules/shepherd.scm (open-server-socket): Set socket to be
>   non-blocking.
>   (main): Use select with a timeout. If prctl failed when shepherd started
>   then call check-for-dead-services between connections/timeouts.
> * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
>   signal handler.
>   (respawn-service): Separate logic for respawning services from handling
>   SIGCHLD.
>   (handle-SIGCHLD, check-for-dead-services): New exported procedures.
> * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
>   symbols.

[...]

> +  (define poll-services
> +    (if (= 1 (getpid))
> +        (lambda () #f)
> +        (catch 'system-error
> +          (lambda ()
> +            (prctl PR_SET_CHILD_SUBREAPER 1)
> +            (lambda () #f))
> +          (lambda (key . args)
> +            check-for-dead-services))))

Please add a comment in the handler saying that we resort to polling on
OSes that do not support ‘prctl’.

However, perhaps we should do:

  (lambda args
    (let ((errno (system-error-errno args)))
      (if (= ENOSYS errno)
          check-for-dead-services
          (apply throw args))))

so that important/unexpected errors are not silently ignored.

> +(define (respawn-service serv)
> +  (when serv

Please add a docstring and move ‘when’ to the caller.

> +(define (check-for-dead-services)

Docstring as well :-), and also a comment explaining that this is a last
resort for prctl-less OSes.

>  (register-services
>   (make <service>
>     #:provides '(test-loaded)
> -   #:start (const 42)
> +   #:start (const 'abc)

Perhaps with the ‘check-for-dead-services’ use of ‘waitpid’ I outlined
above we can even keep 42 here?

If not, we should add in shepherd.texi, under “Slots of services”, a few
words saying that when ‘running’ is an integer it is assumed to be a
PID.

Could you send an updated patch?

Thanks for working on this!

Ludo’.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-03-03 15:21             ` Ludovic Courtès
@ 2018-03-03 20:49               ` Carlo Zancanaro
  2018-03-04 22:11                 ` Ludovic Courtès
  0 siblings, 1 reply; 14+ messages in thread
From: Carlo Zancanaro @ 2018-03-03 20:49 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 30637


[-- Attachment #1.1: Type: text/plain, Size: 1780 bytes --]

Hey Ludo,

On Sat, Mar 03 2018, Ludovic Courtès wrote:
> If they’re zombies, that means nobody called waitpid(2). 
> Presumably the
> polling code would need to do that.
>
> So I suppose ‘check-for-dead-services’ should do something like:
>
> [ ... ]
>
> Does that make sense?  Please check waitpid(2) carefully though, 
> because
> it’s pretty gnarly and I might have forgotten or misinterpreted
> something here.

Unfortunately we can't do that. We fall back to the polling 
approach to handle the fact that the processes that we care about 
aren't our children (hence we don't get SIGCHLD). The waitpid 
system call only waits for processes which are children of the 
calling process.

I looked into the zombie problem a bit more, and I found what the 
problem actually is. In the build environment a guile process is 
running as pid 1 (the *-guile-builder script for that job). This 
guile process never handles SIGCHLD, and never calls wait/waitpid, 
so any orphaned processes become zombies. I tried modifying 
derivations.scm, but it wanted to rebuild a lot of things, so I 
gave up. I think we need to add something like this to the 
*-guile-builder script:

  (sigaction SIGCHLD
    (lambda ()
      (let loop ()
        (match (waitpid WAIT_ANY WNOHANG)
          ((0 . _) #f)
          ((pid . _) (loop))
          (_ #f))))
    SA_NOCLDSTOP)

I've attached the output of `ps axjf` inside the build container, 
so you can see why I think that this is the problem. It's a bit of 
a shame that this is different to `guix environment --container`, 
where /bin/sh is pid 1, because it meant that it would build 
successfully in my container, but would fail in the build 
container (which is a confusing experience).


[-- Attachment #1.2: process-tree.txt --]
[-- Type: text/plain, Size: 8196 bytes --]

 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    0     1     1     1 ?           -1 Ssl  30001   0:00 guile --no-auto-compile -L /gnu/store/71d3rwa514j7vy5l4vfivf68g5yxibvl-module-import /gnu/store/nln71c8hv82c4vkrssi12qmapp1ryk58-shepherd-0.0.0-guile-builder
    1   989     1     1 ?           -1 S    30001   0:00 make check -j 4
  989   990     1     1 ?           -1 S    30001   0:00  \_ make check-recursive
  990   991     1     1 ?           -1 S    30001   0:00      \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -c fail=; \ if (target_option=k; case ${target_option-} in ?) ;; *) echo "am__make_running_with_option: internal error: invalid" "target option '${target_option-}' specified" >&2; exit 1;; esac; has_opt=no; sane_makeflags=$MAKEFLAGS; if { if test -z '1'; then false; elif test -n 'x86_64-unknown-linux-gnu'; then true; elif test -n '4.2.1' && test -n '/tmp/guix-build-shepherd-0.0.0.drv-0/source'; then true; else false; fi; }; then sane_makeflags=$MFLAGS; else case $MAKEFLAGS in *\\[\ \.]*) bs=\\; sane_makeflags=`printf '%s\n' "$MAKEFLAGS" | sed "s/$bs$bs[$bs $bs.]*//g"`;; esac; fi; skip_next=no; strip_trailopt () { flg=`printf '%s\n' "$flg" | sed "s/$1.*$//"`; }; for flg in $sane_makeflags; do test $skip_next = yes && { skip_next=no; continue; }; case $flg in *=*|--*) continue;; -*I) strip_trailopt 'I'; skip_next=yes;; -*I?*) strip_trailopt 'I';; -*O) strip_trailopt 'O'; skip_next=yes;; -*O?*) strip_trailopt 'O';; -*l) strip_trailopt 'l'; skip_next=yes;; -*l?*) strip_trailopt 'l';; -[dEDm]) skip_next=yes;; -[JT]) skip_next=yes;; esac; case $flg in *$target_option*) has_opt=yes; break;; esac; done; test $has_opt = yes); then \   failcom='fail=yes'; \ else \   failcom='exit 1'; \ fi; \ dot_seen=no; \ target=`echo check-recursive | sed s/-recursive//`; \ case "check-recursive" in \   distclean-* | maintainer-clean-*) list='po' ;; \   *) list='po' ;; \ esac; \ for subdir in $list; do \   echo "Making $target in $subdir"; \   if test "$subdir" = "."; then \     dot_seen=yes; \     local_target="$target-am"; \   else \     local_target="$target"; \   fi; \   (CDPATH="${ZSH_VERSION+.}:" && cd $subdir && make  $local_target) \   || eval $failcom; \ done; \ if test "$dot_seen" = "no"; then \   make  "$target-am" || exit 1; \ fi; test -z "$fail"
  991   998     1     1 ?           -1 S    30001   0:00          \_ make check-am
  998   999     1     1 ?           -1 S    30001   0:00              \_ make check-TESTS
  999  1005     1     1 ?           -1 S    30001   0:00                  \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -c set +e; bases='tests/basic.log tests/respawn.log tests/respawn-throttling.log tests/misbehaved-client.log tests/no-home.log tests/pid-file.log tests/status-sexp.log tests/forking-service.log tests/signals.log'; bases=`for i in $bases; do echo $i; done | sed 's/\.log$//'`; bases=`echo $bases`; \ log_list=`for i in $bases; do echo $i.log; done`; \ trs_list=`for i in $bases; do echo $i.trs; done`; \ log_list=`echo $log_list`; trs_list=`echo $trs_list`; \ make  test-suite.log TEST_LOGS="$log_list"; \ exit $?;
 1005  1014     1     1 ?           -1 S    30001   0:00                      \_ make test-suite.log TEST_LOGS=tests/basic.log tests/respawn.log tests/respawn-throttling.log tests/misbehaved-client.log tests/no-home.log tests/pid-file.log tests/status-sexp.log tests/forking-service.log tests/signals.log
 1014  1554     1     1 ?           -1 S    30001   0:00                          \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -c p='tests/forking-service.sh'; \ case 'tests/forking-service.log' in */*) case 'tests/forking-service' in */*) b='tests/forking-service';; *) b=`echo 'tests/forking-service.log' | sed 's/\.log$//'`; esac;; *) b='tests/forking-service';; esac; \ case $- in *e*) set +e;; esac; srcdirstrip=`echo "." | sed 's|.|.|g'`; case $p in ./*) f=`echo "$p" | sed "s|^$srcdirstrip/||"`;; *) f=$p;; esac; { mgn= red= grn= lgn= blu= brg= std=; am__color_tests=no; if test "X" = Xno; then am__color_tests=no; elif test "X" = Xalways; then am__color_tests=yes; elif test "X$TERM" != Xdumb && { test -t 1; } 2>/dev/null; then am__color_tests=yes; fi; if test $am__color_tests = yes; then red='.[0;31m'; grn='.[0;32m'; lgn='.[1;32m'; blu='.[1;34m'; mgn='.[0;35m'; brg='.[1m'; std='.[m'; fi; }; srcdir=.; export srcdir; case "tests/forking-service.log" in */*) am__odir=`echo "./tests/forking-service.log" | sed 's|/[^/]*$||'`;; *) am__odir=.;; esac; test "x$am__odir" = x"." || test -d "$am__odir" || /gnu/store/6i33ik7haav0hd5a797l3llkq04ghx6g-coreutils-8.28/bin/mkdir -p "$am__odir" || exit $?; if test -f "./$f"; then dir=./; elif test -f "$f"; then dir=; else dir="./"; fi; tst=$dir$f; log='tests/forking-service.log'; if test -n ''; then am__enable_hard_errors=no; else am__enable_hard_errors=yes; fi; case "  " in *[\ \.]$f[\ \.]* | *[\ \.]$dir$f[\ \.]*) am__expect_failure=yes;; *) am__expect_failure=no;; esac; unset XDG_CONFIG_HOME; unset LANGUAGE; LC_ALL=C LC_MESSAGES=C PATH="/tmp/guix-build-shepherd-0.0.0.drv-0/source:$PATH" SHELL="/gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash" GUILE="/gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/bin/guile" GUILE_LOAD_PATH="/tmp/guix-build-shepherd-0.0.0.drv-0/source/modules:/tmp/guix-build-shepherd-0.0.0.drv-0/source/modules:$GUILE_LOAD_PATH" GUILE_LOAD_COMPILED_PATH="/tmp/guix-build-shepherd-0.0.0.drv-0/source/modules:/tmp/guix-build-shepherd-0.0.0.drv-0/source/modules:$GUILE_LOAD_COMPILED_PATH"  /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash ./build-aux/test-driver --test-name "$f" \ --log-file $b.log --trs-file $b.trs \ --color-tests "$am__color_tests" --enable-hard-errors "$am__enable_hard_errors" --expect-failure "$am__expect_failure"   -- /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -x -e  \ "$tst" 
 1554  1561     1     1 ?           -1 S    30001   0:00                              \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash ./build-aux/test-driver --test-name tests/forking-service.sh --log-file tests/forking-service.log --trs-file tests/forking-service.trs --color-tests no --enable-hard-errors yes --expect-failure no -- /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -x -e ./tests/forking-service.sh
 1561  1562     1     1 ?           -1 S    30001   0:00                                  \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -x -e ./tests/forking-service.sh
 1562  1594     1     1 ?           -1 Sl   30001   0:00                                      \_ /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/bin/guile --no-auto-compile /tmp/guix-build-shepherd-0.0.0.drv-0/source/shepherd -I -s t-socket-1562 -c t-conf-1562 -l t-log-1562 --pid=t-pid-1562
 1562  2005     1     1 ?           -1 R    30001   0:00                                      \_ ps axjf
    1  1091     1     1 ?           -1 Z    30001   0:00 [shepherd] <defunct>
    1  1114     1     1 ?           -1 Z    30001   0:00 [shepherd] <defunct>
    1  1403  1123  1123 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1415     1     1 ?           -1 Z    30001   0:00 [shepherd] <defunct>
    1  1433  1193  1193 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1480  1479  1479 ?           -1 S    30001   0:00 sleep 600
    1  1541  1427  1427 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1544  1471  1471 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1644  1638  1638 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1659  1658  1658 ?           -1 S    30001   0:00 sleep 600
    1  1705  1577  1577 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1707  1618  1618 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1770  1770  1770 ?           -1 Zs   30001   0:00 [bash] <defunct>
    1  1826  1736  1736 ?           -1 Z    30001   0:00 [sleep] <defunct>
    1  1888  1770  1770 ?           -1 Z    30001   0:00 [sleep] <defunct>

[-- Attachment #1.3: Type: text/plain, Size: 1361 bytes --]


> Please add a comment in the handler saying that we resort to 
> polling on
> OSes that do not support ‘prctl’.
>
> However, perhaps we should do:
>
>   (lambda args
>     (let ((errno (system-error-errno args)))
>       (if (= ENOSYS errno)
>           check-for-dead-services
>           (apply throw args))))
>
> so that important/unexpected errors are not silently ignored.

I had quite liked the idea that it would just ignore any error and 
do the fallback, because really all we care about is "prctl 
failed" when deciding on our fallback logic. I've decided to just 
handle ENOSYS (prctl not available) and EINVAL (which is returned 
when PR_SET_CHILD_SUBREAPER not available), and throw for 
everything else. I'd love to be able to test this on platforms 
where prctl will actually fail, though, because I don't like the 
idea of committing code that I haven't actually been able to run.

> If not, we should add in shepherd.texi, under “Slots of 
> services”, a few
> words saying that when ‘running’ is an integer it is assumed to 
> be a
> PID.

I've done this, but while doing it I realised that this has always 
been true. The SIGCHLD handler has always assumed that a number 
indicates a running process, my modifications haven't changed the 
assumption, they've just widened its scope.

Carlo


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.4: 0001-Add-prctl-syscall-wrapper-along-with-with-PR_SET_CHI.patch --]
[-- Type: text/x-patch, Size: 2521 bytes --]

From e529e4035eec147f448804dd10fdbca13a17f523 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Sun, 4 Mar 2018 07:01:30 +1100
Subject: [PATCH 1/3] Add prctl syscall wrapper along with with
 PR_SET_CHILD_SUBREAPER.

* configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
* modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
  and export it.
  (prctl): Add new procedure and export it.
---
 configure.ac                   |  4 ++++
 modules/shepherd/system.scm.in | 21 ++++++++++++++++++++-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/configure.ac b/configure.ac
index bb5058d..fbe16f4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -72,7 +72,11 @@ esac
 AC_SUBST([RB_AUTOBOOT])
 AC_SUBST([RB_HALT_SYSTEM])
 AC_SUBST([RB_POWER_OFF])
+AC_MSG_RESULT([done])
 
+AC_MSG_CHECKING([<sys/prctl.h> constants])
+AC_COMPUTE_INT([PR_SET_CHILD_SUBREAPER], [PR_SET_CHILD_SUBREAPER], [#include <sys/prctl.h>])
+AC_SUBST([PR_SET_CHILD_SUBREAPER])
 AC_MSG_RESULT([done])
 
 dnl Manual pages.
diff --git a/modules/shepherd/system.scm.in b/modules/shepherd/system.scm.in
index a54dca7..09d45bf 100644
--- a/modules/shepherd/system.scm.in
+++ b/modules/shepherd/system.scm.in
@@ -23,7 +23,9 @@
   #:export (reboot
             halt
             power-off
-            max-file-descriptors))
+            max-file-descriptors
+            prctl
+            PR_SET_CHILD_SUBREAPER))
 
 ;; The <sys/reboot.h> constants.
 (define RB_AUTOBOOT @RB_AUTOBOOT@)
@@ -130,6 +132,23 @@ the returned procedure is called."
                    (list err))
             result)))))
 
+(define PR_SET_CHILD_SUBREAPER @PR_SET_CHILD_SUBREAPER@)
+
+(define prctl
+  (if (dynamic-func "prctl" (dynamic-link))
+      (let ((proc (syscall->procedure long "prctl" (list int int))))
+        (lambda (process operation)
+          "Perform an operation on the given process"
+          (let-values (((result err) (proc process operation)))
+            (if (= -1 result)
+                (throw 'system-error "prctl" "~A: ~S"
+                       (list (strerror err) name)
+                       (list err))
+                result))))
+      (lambda (process operation)
+        (throw 'system-error "prctl" "~A" (list strerror ENOSYS)
+               (list ENOSYS)))))
+
 (define (max-file-descriptors)
   "Return the maximum number of open file descriptors allowed."
   (sysconf _SC_OPEN_MAX))
-- 
2.16.1


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.5: 0002-Handle-forked-process-SIGCHLD-signals.patch --]
[-- Type: text/x-patch, Size: 2325 bytes --]

From b43c128d8a175a9a123eb7b1af465fb3747a5393 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Sun, 4 Mar 2018 07:46:13 +1100
Subject: [PATCH 2/3] Handle forked process SIGCHLD signals

* Makefile.am (TESTS): Add tests/forking-service.sh.
* modules/shepherd.scm: Set the child subreaper attribute of main shepherd
  process (as long as we're not pid 1).
* modules/shepherd/service.scm (root-service)[daemonize]: Set the child
  subreaper attribute of newly forked shepherd process.
---
 Makefile.am                  | 1 +
 modules/shepherd.scm         | 7 +++++++
 modules/shepherd/service.scm | 4 +++-
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/Makefile.am b/Makefile.am
index eafa308..8dad006 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -190,6 +190,7 @@ TESTS =						\
   tests/no-home.sh				\
   tests/pid-file.sh				\
   tests/status-sexp.sh				\
+  tests/forking-service.sh			\
   tests/signals.sh
 
 TEST_EXTENSIONS = .sh
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index df5420f..faa1e47 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -51,6 +51,13 @@
 (define (main . args)
   (initialize-cli)
 
+  (when (not (= 1 (getpid)))
+    ;; Register for orphaned processes to be reparented onto us when their
+    ;; original parent dies. This lets us handle SIGCHLD from daemon processes
+    ;; that would otherwise have been reparented under pid 1. This is
+    ;; unnecessary when we are pid 1.
+    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
+
   (let ((config-file #f)
 	(socket-file default-socket-file)
         (pid-file    #f)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 2224932..b6394f2 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -1274,7 +1274,9 @@ we want to receive these signals."
            (local-output "Running as PID 1, so not daemonizing."))
           (else
            (if (zero? (primitive-fork))
-               #t
+               (begin
+                 (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))
+                 #t)
                (primitive-exit 0))))))
      (persistency
       "Safe the current state of running and non-running services.
-- 
2.16.1


[-- Attachment #1.6: 0003-Poll-every-0.5s-to-find-dead-forked-services.patch --]
[-- Type: text/x-patch, Size: 12416 bytes --]

From 3d3c091660bbbd529af0058b0ba9b5ddbfc6b481 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Wed, 21 Feb 2018 22:57:59 +1100
Subject: [PATCH 3/3] Poll every 0.5s to find dead forked services

* modules/shepherd.scm (open-server-socket): Set socket to be
  non-blocking.
  (main): Use select with a timeout. If prctl failed when shepherd started
  then call check-for-dead-services between connections/timeouts.
* modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
  signal handler.
  (respawn-service): Separate logic for respawning services from handling
  SIGCHLD.
  (handle-SIGCHLD, check-for-dead-services): New exported procedures.
* tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
  symbols.
* doc/shepherd.texi (Slots of services): Add note about service running slot
  being a process id.
---
 doc/shepherd.texi            |  4 ++-
 modules/shepherd.scm         | 47 ++++++++++++++++++-------
 modules/shepherd/service.scm | 82 ++++++++++++++++++++++++++++----------------
 tests/basic.sh               |  4 +--
 tests/status-sexp.sh         |  4 +--
 5 files changed, 95 insertions(+), 46 deletions(-)

diff --git a/doc/shepherd.texi b/doc/shepherd.texi
index 815091f..47005d5 100644
--- a/doc/shepherd.texi
+++ b/doc/shepherd.texi
@@ -608,7 +608,9 @@ way.  The default value is @code{#f}, which indicates that the service
 is not running. When an attempt is made to start the service, it will
 be set to the return value of the procedure in the @code{start} slot.
 It will also be passed as an argument to the procedure in the
-@code{stop} slot.  This slot can not be initialized with a keyword.
+@code{stop} slot.  If it is set a value that is an integer, it is
+assumed to be a process id, and shepherd will monitor the process for
+unexpected exits.  This slot can not be initialized with a keyword.
 
 @item
 @vindex respawn? (slot of <service>)
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index faa1e47..e912d21 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -42,6 +42,8 @@
   (with-fluids ((%default-port-encoding "UTF-8"))
     (let ((sock    (socket PF_UNIX SOCK_STREAM 0))
           (address (make-socket-address AF_UNIX file-name)))
+      (fcntl sock F_SETFL (logior O_NONBLOCK
+                                  (fcntl sock F_GETFL)))
       (bind sock address)
       (listen sock 10)
       sock)))
@@ -49,14 +51,28 @@
 \f
 ;; Main program.
 (define (main . args)
-  (initialize-cli)
+  (define poll-services
+    (if (= 1 (getpid))
+        (lambda () #f) ;; If we're pid 1 then we don't need to set
+                       ;; PR_SET_CHILD_SUBREAPER
+        (catch 'system-error
+          (lambda ()
+            ;; Register for orphaned processes to be reparented onto us when
+            ;; their original parent dies. This lets us handle SIGCHLD from
+            ;; daemon processes that would otherwise have been reparented
+            ;; under pid 1. This is unnecessary when we are pid 1.
+            (prctl PR_SET_CHILD_SUBREAPER 1)
+            (lambda () #f))
+          (lambda args
+            ;; We fall back to polling for services on systems that don't
+            ;; support prctl/PR_SET_CHILD_SUBREAPER
+            (let ((errno (system-error-errno args)))
+              (if (or (= ENOSYS errno) ;; prctl not available
+                      (= EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not available
+                  check-for-dead-services ;; poll
+                  (apply throw args)))))))
 
-  (when (not (= 1 (getpid)))
-    ;; Register for orphaned processes to be reparented onto us when their
-    ;; original parent dies. This lets us handle SIGCHLD from daemon processes
-    ;; that would otherwise have been reparented under pid 1. This is
-    ;; unnecessary when we are pid 1.
-    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
+  (initialize-cli)
 
   (let ((config-file #f)
 	(socket-file default-socket-file)
@@ -225,11 +241,18 @@
             (_  #t))
 
           (let next-command ()
-            (match (accept sock)
-              ((command-source . client-address)
-               (setvbuf command-source _IOFBF 1024)
-               (process-connection command-source))
-              (_ #f))
+            (define (read-from sock)
+              (match (accept sock)
+                ((command-source . client-address)
+                 (setvbuf command-source _IOFBF 1024)
+                 (process-connection command-source))
+                (_ #f)))
+            (match (select (list sock) (list) (list) 0.5)
+              (((sock) _ _)
+               (read-from sock))
+              (_
+               #f))
+            (poll-services)
             (next-command))))))
 
 (define (process-connection sock)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index b6394f2..056483a 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -3,6 +3,7 @@
 ;; Copyright (C) 2002, 2003 Wolfgang Järling <wolfgang@pro-linux.de>
 ;; Copyright (C) 2014 Alex Sassmannshausen <alex.sassmannshausen@gmail.com>
 ;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
+;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -64,6 +65,7 @@
             for-each-service
             lookup-services
             respawn-service
+            handle-SIGCHLD
             register-services
             provided-by
             required-by
@@ -77,6 +79,7 @@
             make-system-destructor
             make-init.d-service
 
+            check-for-dead-services
             root-service
             make-actions
 
@@ -800,7 +803,7 @@ false."
 its PID."
   ;; Install the SIGCHLD handler if this is the first fork+exec-command call
   (unless %sigchld-handler-installed?
-    (sigaction SIGCHLD respawn-service SA_NOCLDSTOP)
+    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
     (set! %sigchld-handler-installed? #t))
   (let ((pid (primitive-fork)))
     (if (zero? pid)
@@ -991,7 +994,7 @@ child left."
                           what (strerror errno))
             '(0 . #f)))))))
 
-(define (respawn-service signum)
+(define (handle-SIGCHLD signum)
   "Handle SIGCHLD, possibly by respawning the service that just died, or
 otherwise by updating its state."
   (let loop ()
@@ -1010,38 +1013,44 @@ otherwise by updating its state."
          ;; SERV can be #f for instance when this code runs just after a
          ;; service's 'stop' method killed its process and completed.
          (when serv
-           (slot-set! serv 'running #f)
-           (if (and (respawn? serv)
-                    (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
-                                             (car respawn-limit)
-                                             (cdr respawn-limit))))
-               (if (not (slot-ref serv 'waiting-for-termination?))
-                   (begin
-                     ;; Everything is okay, start it.
-                     (local-output "Respawning ~a."
-                                   (canonical-name serv))
-                     (slot-set! serv 'last-respawns
-                                (cons (current-time)
-                                      (slot-ref serv 'last-respawns)))
-                     (start serv))
-                   ;; We have just been waiting for the
-                   ;; termination.  The `running' slot has already
-                   ;; been set to `#f' by `stop'.
-                   (begin
-                     (local-output "Service ~a terminated."
-                                   (canonical-name serv))
-                     (slot-set! serv 'waiting-for-termination? #f)))
-               (begin
-                 (local-output "Service ~a has been disabled."
-                               (canonical-name serv))
-                 (when (respawn? serv)
-                   (local-output "  (Respawning too fast.)"))
-                 (slot-set! serv 'enabled? #f))))
+           (respawn-service serv))
 
          ;; As noted in libc's manual (info "(libc) Process Completion"),
          ;; loop so we don't miss any terminated child process.
          (loop))))))
 
+(define (respawn-service serv)
+  "Respawn a service that has stopped running unexpectedly. If we have
+attempted to respawn the service a number of times already and it keeps dying,
+then disable it."
+  (slot-set! serv 'running #f)
+  (if (and (respawn? serv)
+           (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
+                                    (car respawn-limit)
+                                    (cdr respawn-limit))))
+      (if (not (slot-ref serv 'waiting-for-termination?))
+          (begin
+            ;; Everything is okay, start it.
+            (local-output "Respawning ~a."
+                          (canonical-name serv))
+            (slot-set! serv 'last-respawns
+                       (cons (current-time)
+                             (slot-ref serv 'last-respawns)))
+            (start serv))
+          ;; We have just been waiting for the
+          ;; termination.  The `running' slot has already
+          ;; been set to `#f' by `stop'.
+          (begin
+            (local-output "Service ~a terminated."
+                          (canonical-name serv))
+            (slot-set! serv 'waiting-for-termination? #f)))
+      (begin
+        (local-output "Service ~a has been disabled."
+                      (canonical-name serv))
+        (when (respawn? serv)
+          (local-output "  (Respawning too fast.)"))
+        (slot-set! serv 'enabled? #f))))
+
 ;; Add NEW-SERVICES to the list of known services.
 (define (register-services . new-services)
   (define (register-single-service new)
@@ -1171,6 +1180,21 @@ file when persistence is enabled."
         (lambda (p)
           (format p "~{~a ~}~%" running-services))))))
 
+(define (check-for-dead-services)
+  "Poll each process that we expect to be running, and respawn any which have
+unexpectedly stopped running. This procedure is used as a fallback on systems
+where prctl/PR_SET_CHILD_SUBREAPER is unsupported."
+  (define (process-exists? pid)
+    (catch #t
+      (lambda () (kill pid 0) #t)
+      (lambda _ #f)))
+  (for-each-service (lambda (service)
+                      (let ((running (slot-ref service 'running)))
+                        (when (and (integer? running)
+                                   (not (process-exists? running)))
+                            (local-output "PID ~a (~a) is dead!" running (canonical-name service))
+                            (respawn-service service))))))
+
 (define root-service
   (make <service>
     #:docstring "The root service is used to operate on shepherd itself."
diff --git a/tests/basic.sh b/tests/basic.sh
index 1ddb334..2ecd8fb 100644
--- a/tests/basic.sh
+++ b/tests/basic.sh
@@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" <<EOF
 (register-services
  (make <service>
    #:provides '(test-loaded)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop (const #f)))
 EOF
 
@@ -166,7 +166,7 @@ $herd status test-loaded
 $herd status test-loaded | grep stopped
 
 $herd start test-loaded
-$herd status test-loaded | grep -i 'running.*42'
+$herd status test-loaded | grep -i 'running.*abc'
 $herd stop test-loaded
 $herd unload root test-loaded
 
diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh
index b7c8cb4..11b967e 100644
--- a/tests/status-sexp.sh
+++ b/tests/status-sexp.sh
@@ -33,7 +33,7 @@ cat > "$conf"<<EOF
 (register-services
  (make <service>
    #:provides '(foo)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop  (const #f)
    #:docstring "Foo!"
    #:respawn? #t)
@@ -85,7 +85,7 @@ root_service_sexp="
 	     (service (version 0)
 	       (provides (foo)) (requires ())
 	       (respawn? #t) (docstring \"Foo!\")
-	       (enabled? #t) (running 42) (conflicts ())
+	       (enabled? #t) (running abc) (conflicts ())
 	       (last-respawns ()))
 	     (service (version 0)
 	       (provides (bar)) (requires (foo))
-- 
2.16.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-03-03 20:49               ` Carlo Zancanaro
@ 2018-03-04 22:11                 ` Ludovic Courtès
  2018-03-04 22:35                   ` Carlo Zancanaro
  0 siblings, 1 reply; 14+ messages in thread
From: Ludovic Courtès @ 2018-03-04 22:11 UTC (permalink / raw)
  To: Carlo Zancanaro; +Cc: 30637

Hello,

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

> On Sat, Mar 03 2018, Ludovic Courtès wrote:
>> If they’re zombies, that means nobody called waitpid(2). Presumably
>> the
>> polling code would need to do that.
>>
>> So I suppose ‘check-for-dead-services’ should do something like:
>>
>> [ ... ]
>>
>> Does that make sense?  Please check waitpid(2) carefully though,
>> because
>> it’s pretty gnarly and I might have forgotten or misinterpreted
>> something here.
>
> Unfortunately we can't do that. We fall back to the polling approach
> to handle the fact that the processes that we care about aren't our
> children (hence we don't get SIGCHLD). The waitpid system call only
> waits for processes which are children of the calling process.

Oh right!

> I looked into the zombie problem a bit more, and I found what the
> problem actually is. In the build environment a guile process is
> running as pid 1 (the *-guile-builder script for that job). This guile
> process never handles SIGCHLD, and never calls wait/waitpid, so any
> orphaned processes become zombies. I tried modifying derivations.scm,
> but it wanted to rebuild a lot of things, so I gave up. I think we
> need to add something like this to the *-guile-builder script:
>
>  (sigaction SIGCHLD
>    (lambda ()
>      (let loop ()
>        (match (waitpid WAIT_ANY WNOHANG)
>          ((0 . _) #f)
>          ((pid . _) (loop))
>          (_ #f))))
>    SA_NOCLDSTOP)

Good catch.  We could add this in gnu-build-system.scm in core-updates,
though it’s no big deal anyway since these are throw-away environments.

Thoughts?

> From e529e4035eec147f448804dd10fdbca13a17f523 Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Sun, 4 Mar 2018 07:01:30 +1100
> Subject: [PATCH 1/3] Add prctl syscall wrapper along with with
>  PR_SET_CHILD_SUBREAPER.
>
> * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
> * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
>   and export it.
>   (prctl): Add new procedure and export it.

Applied with a copyright line for you.

> From b43c128d8a175a9a123eb7b1af465fb3747a5393 Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Sun, 4 Mar 2018 07:46:13 +1100
> Subject: [PATCH 2/3] Handle forked process SIGCHLD signals
>
> * Makefile.am (TESTS): Add tests/forking-service.sh.
> * modules/shepherd.scm: Set the child subreaper attribute of main shepherd
>   process (as long as we're not pid 1).
> * modules/shepherd/service.scm (root-service)[daemonize]: Set the child
>   subreaper attribute of newly forked shepherd process.

Applied.  However tests/forking-service.sh was missing, so I took it
from the previous version of this patch and added it.  I also changed
the “Bashishms” that were in that file, as discussed earlier.  Let me
know if anything’s wrong!

> From 3d3c091660bbbd529af0058b0ba9b5ddbfc6b481 Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Wed, 21 Feb 2018 22:57:59 +1100
> Subject: [PATCH 3/3] Poll every 0.5s to find dead forked services
>
> * modules/shepherd.scm (open-server-socket): Set socket to be
>   non-blocking.
>   (main): Use select with a timeout. If prctl failed when shepherd started
>   then call check-for-dead-services between connections/timeouts.
> * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
>   signal handler.
>   (respawn-service): Separate logic for respawning services from handling
>   SIGCHLD.
>   (handle-SIGCHLD, check-for-dead-services): New exported procedures.
> * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
>   symbols.
> * doc/shepherd.texi (Slots of services): Add note about service running slot
>   being a process id.

[...]

> --- a/modules/shepherd.scm
> +++ b/modules/shepherd.scm
> @@ -42,6 +42,8 @@
>    (with-fluids ((%default-port-encoding "UTF-8"))
>      (let ((sock    (socket PF_UNIX SOCK_STREAM 0))
>            (address (make-socket-address AF_UNIX file-name)))
> +      (fcntl sock F_SETFL (logior O_NONBLOCK
> +                                  (fcntl sock F_GETFL)))
>        (bind sock address)
>        (listen sock 10)
>        sock)))
> @@ -49,14 +51,28 @@
>  \f
>  ;; Main program.
>  (define (main . args)
> -  (initialize-cli)
> +  (define poll-services
> +    (if (= 1 (getpid))
> +        (lambda () #f) ;; If we're pid 1 then we don't need to set
> +                       ;; PR_SET_CHILD_SUBREAPER

[...]

> +            (match (select (list sock) (list) (list) 0.5)
> +              (((sock) _ _)
> +               (read-from sock))
> +              (_
> +               #f))
> +            (poll-services)

Here everyone ends up paying some overhead (the 0.5 second timeout),
which isn’t great.

How about something like:

  (define poll-services
    (and (not (= 1 (getpid)))
         …))

  (match (select (list sock) '() '() (if poll-services 0.5 0))
    …)

?

Thanks,
Ludo’.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-03-04 22:11                 ` Ludovic Courtès
@ 2018-03-04 22:35                   ` Carlo Zancanaro
  2018-03-04 22:49                     ` Ludovic Courtès
  0 siblings, 1 reply; 14+ messages in thread
From: Carlo Zancanaro @ 2018-03-04 22:35 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 30637


[-- Attachment #1.1: Type: text/plain, Size: 2143 bytes --]


On Sun, Mar 04 2018, Ludovic Courtès wrote:
> Good catch.  We could add this in gnu-build-system.scm in 
> core-updates, though it’s no big deal anyway since these are 
> throw-away environments.
>
> Thoughts?

The current forking-service.sh test fails in that environment, so 
we won't be able to build shepherd on Hurd, or systems with Linux 
pre 3.4. This is already the case without my third commit, though, 
because the prctl fallback logic isn't in place yet.

I think we should add it in core-updates. It does affect the 
behaviour of processes within the build environment, and can lead 
to test failures if people rely on pid 1 to reap zombie processes 
(which, from what I understand, they should be able to). This 
could even be leading to test failures in other packages which we 
have just disabled.

>> +            (match (select (list sock) (list) (list) 0.5)
>> +              (((sock) _ _)
>> +               (read-from sock))
>> +              (_
>> +               #f))
>> +            (poll-services)
>
> Here everyone ends up paying some overhead (the 0.5 second 
> timeout),
> which isn’t great.
>
> How about something like:
>
>   (define poll-services
>     (and (not (= 1 (getpid)))
>          …))
>
>   (match (select (list sock) '() '() (if poll-services 0.5 0))
>     …)

The wait for 0.5 seconds is only an upper-bound for the timeout. 
Changing it to a 0 would actually be worse, because it would spend 
longer polling for running services. The `select` procedure waits 
for `sock` to be ready to read from. When it's ready it returns 
immediately, but if `sock` takes more than 0.5 seconds to be ready 
then it will return anyway (and take the second branch in the 
match, which does nothing).

This should incur no (or extremely minuscule) overhead in how long 
it takes to respond to a socket, but provides an opportunity every 
half a second (at most) for shepherd to poll the running services.

On reflection, we should also change the commit message for this 
commit. I have attached a patch with a more accurate commit 
message.

Carlo


[-- Attachment #1.2: 0001-Poll-every-0.5s-to-find-dead-forked-services-if-prct.patch --]
[-- Type: text/x-patch, Size: 12428 bytes --]

From 5b01f79522c815dd8277298e87eef0506c2e8612 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Wed, 21 Feb 2018 22:57:59 +1100
Subject: [PATCH] Poll every 0.5s to find dead forked services if prctl fails.

* modules/shepherd.scm (open-server-socket): Set socket to be
  non-blocking.
  (main): Use select with a timeout. If prctl failed when shepherd started
  then call check-for-dead-services between connections/timeouts.
* modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
  signal handler.
  (respawn-service): Separate logic for respawning services from handling
  SIGCHLD.
  (handle-SIGCHLD, check-for-dead-services): New exported procedures.
* tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
  symbols.
* doc/shepherd.texi (Slots of services): Add note about service running slot
  being a process id.
---
 doc/shepherd.texi            |  4 ++-
 modules/shepherd.scm         | 47 ++++++++++++++++++-------
 modules/shepherd/service.scm | 82 ++++++++++++++++++++++++++++----------------
 tests/basic.sh               |  4 +--
 tests/status-sexp.sh         |  4 +--
 5 files changed, 95 insertions(+), 46 deletions(-)

diff --git a/doc/shepherd.texi b/doc/shepherd.texi
index 815091f..47005d5 100644
--- a/doc/shepherd.texi
+++ b/doc/shepherd.texi
@@ -608,7 +608,9 @@ way.  The default value is @code{#f}, which indicates that the service
 is not running. When an attempt is made to start the service, it will
 be set to the return value of the procedure in the @code{start} slot.
 It will also be passed as an argument to the procedure in the
-@code{stop} slot.  This slot can not be initialized with a keyword.
+@code{stop} slot.  If it is set a value that is an integer, it is
+assumed to be a process id, and shepherd will monitor the process for
+unexpected exits.  This slot can not be initialized with a keyword.
 
 @item
 @vindex respawn? (slot of <service>)
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index faa1e47..e912d21 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -42,6 +42,8 @@
   (with-fluids ((%default-port-encoding "UTF-8"))
     (let ((sock    (socket PF_UNIX SOCK_STREAM 0))
           (address (make-socket-address AF_UNIX file-name)))
+      (fcntl sock F_SETFL (logior O_NONBLOCK
+                                  (fcntl sock F_GETFL)))
       (bind sock address)
       (listen sock 10)
       sock)))
@@ -49,14 +51,28 @@
 \f
 ;; Main program.
 (define (main . args)
-  (initialize-cli)
+  (define poll-services
+    (if (= 1 (getpid))
+        (lambda () #f) ;; If we're pid 1 then we don't need to set
+                       ;; PR_SET_CHILD_SUBREAPER
+        (catch 'system-error
+          (lambda ()
+            ;; Register for orphaned processes to be reparented onto us when
+            ;; their original parent dies. This lets us handle SIGCHLD from
+            ;; daemon processes that would otherwise have been reparented
+            ;; under pid 1. This is unnecessary when we are pid 1.
+            (prctl PR_SET_CHILD_SUBREAPER 1)
+            (lambda () #f))
+          (lambda args
+            ;; We fall back to polling for services on systems that don't
+            ;; support prctl/PR_SET_CHILD_SUBREAPER
+            (let ((errno (system-error-errno args)))
+              (if (or (= ENOSYS errno) ;; prctl not available
+                      (= EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not available
+                  check-for-dead-services ;; poll
+                  (apply throw args)))))))
 
-  (when (not (= 1 (getpid)))
-    ;; Register for orphaned processes to be reparented onto us when their
-    ;; original parent dies. This lets us handle SIGCHLD from daemon processes
-    ;; that would otherwise have been reparented under pid 1. This is
-    ;; unnecessary when we are pid 1.
-    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
+  (initialize-cli)
 
   (let ((config-file #f)
 	(socket-file default-socket-file)
@@ -225,11 +241,18 @@
             (_  #t))
 
           (let next-command ()
-            (match (accept sock)
-              ((command-source . client-address)
-               (setvbuf command-source _IOFBF 1024)
-               (process-connection command-source))
-              (_ #f))
+            (define (read-from sock)
+              (match (accept sock)
+                ((command-source . client-address)
+                 (setvbuf command-source _IOFBF 1024)
+                 (process-connection command-source))
+                (_ #f)))
+            (match (select (list sock) (list) (list) 0.5)
+              (((sock) _ _)
+               (read-from sock))
+              (_
+               #f))
+            (poll-services)
             (next-command))))))
 
 (define (process-connection sock)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index b6394f2..056483a 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -3,6 +3,7 @@
 ;; Copyright (C) 2002, 2003 Wolfgang Järling <wolfgang@pro-linux.de>
 ;; Copyright (C) 2014 Alex Sassmannshausen <alex.sassmannshausen@gmail.com>
 ;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
+;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -64,6 +65,7 @@
             for-each-service
             lookup-services
             respawn-service
+            handle-SIGCHLD
             register-services
             provided-by
             required-by
@@ -77,6 +79,7 @@
             make-system-destructor
             make-init.d-service
 
+            check-for-dead-services
             root-service
             make-actions
 
@@ -800,7 +803,7 @@ false."
 its PID."
   ;; Install the SIGCHLD handler if this is the first fork+exec-command call
   (unless %sigchld-handler-installed?
-    (sigaction SIGCHLD respawn-service SA_NOCLDSTOP)
+    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
     (set! %sigchld-handler-installed? #t))
   (let ((pid (primitive-fork)))
     (if (zero? pid)
@@ -991,7 +994,7 @@ child left."
                           what (strerror errno))
             '(0 . #f)))))))
 
-(define (respawn-service signum)
+(define (handle-SIGCHLD signum)
   "Handle SIGCHLD, possibly by respawning the service that just died, or
 otherwise by updating its state."
   (let loop ()
@@ -1010,38 +1013,44 @@ otherwise by updating its state."
          ;; SERV can be #f for instance when this code runs just after a
          ;; service's 'stop' method killed its process and completed.
          (when serv
-           (slot-set! serv 'running #f)
-           (if (and (respawn? serv)
-                    (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
-                                             (car respawn-limit)
-                                             (cdr respawn-limit))))
-               (if (not (slot-ref serv 'waiting-for-termination?))
-                   (begin
-                     ;; Everything is okay, start it.
-                     (local-output "Respawning ~a."
-                                   (canonical-name serv))
-                     (slot-set! serv 'last-respawns
-                                (cons (current-time)
-                                      (slot-ref serv 'last-respawns)))
-                     (start serv))
-                   ;; We have just been waiting for the
-                   ;; termination.  The `running' slot has already
-                   ;; been set to `#f' by `stop'.
-                   (begin
-                     (local-output "Service ~a terminated."
-                                   (canonical-name serv))
-                     (slot-set! serv 'waiting-for-termination? #f)))
-               (begin
-                 (local-output "Service ~a has been disabled."
-                               (canonical-name serv))
-                 (when (respawn? serv)
-                   (local-output "  (Respawning too fast.)"))
-                 (slot-set! serv 'enabled? #f))))
+           (respawn-service serv))
 
          ;; As noted in libc's manual (info "(libc) Process Completion"),
          ;; loop so we don't miss any terminated child process.
          (loop))))))
 
+(define (respawn-service serv)
+  "Respawn a service that has stopped running unexpectedly. If we have
+attempted to respawn the service a number of times already and it keeps dying,
+then disable it."
+  (slot-set! serv 'running #f)
+  (if (and (respawn? serv)
+           (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
+                                    (car respawn-limit)
+                                    (cdr respawn-limit))))
+      (if (not (slot-ref serv 'waiting-for-termination?))
+          (begin
+            ;; Everything is okay, start it.
+            (local-output "Respawning ~a."
+                          (canonical-name serv))
+            (slot-set! serv 'last-respawns
+                       (cons (current-time)
+                             (slot-ref serv 'last-respawns)))
+            (start serv))
+          ;; We have just been waiting for the
+          ;; termination.  The `running' slot has already
+          ;; been set to `#f' by `stop'.
+          (begin
+            (local-output "Service ~a terminated."
+                          (canonical-name serv))
+            (slot-set! serv 'waiting-for-termination? #f)))
+      (begin
+        (local-output "Service ~a has been disabled."
+                      (canonical-name serv))
+        (when (respawn? serv)
+          (local-output "  (Respawning too fast.)"))
+        (slot-set! serv 'enabled? #f))))
+
 ;; Add NEW-SERVICES to the list of known services.
 (define (register-services . new-services)
   (define (register-single-service new)
@@ -1171,6 +1180,21 @@ file when persistence is enabled."
         (lambda (p)
           (format p "~{~a ~}~%" running-services))))))
 
+(define (check-for-dead-services)
+  "Poll each process that we expect to be running, and respawn any which have
+unexpectedly stopped running. This procedure is used as a fallback on systems
+where prctl/PR_SET_CHILD_SUBREAPER is unsupported."
+  (define (process-exists? pid)
+    (catch #t
+      (lambda () (kill pid 0) #t)
+      (lambda _ #f)))
+  (for-each-service (lambda (service)
+                      (let ((running (slot-ref service 'running)))
+                        (when (and (integer? running)
+                                   (not (process-exists? running)))
+                            (local-output "PID ~a (~a) is dead!" running (canonical-name service))
+                            (respawn-service service))))))
+
 (define root-service
   (make <service>
     #:docstring "The root service is used to operate on shepherd itself."
diff --git a/tests/basic.sh b/tests/basic.sh
index 1ddb334..2ecd8fb 100644
--- a/tests/basic.sh
+++ b/tests/basic.sh
@@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" <<EOF
 (register-services
  (make <service>
    #:provides '(test-loaded)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop (const #f)))
 EOF
 
@@ -166,7 +166,7 @@ $herd status test-loaded
 $herd status test-loaded | grep stopped
 
 $herd start test-loaded
-$herd status test-loaded | grep -i 'running.*42'
+$herd status test-loaded | grep -i 'running.*abc'
 $herd stop test-loaded
 $herd unload root test-loaded
 
diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh
index b7c8cb4..11b967e 100644
--- a/tests/status-sexp.sh
+++ b/tests/status-sexp.sh
@@ -33,7 +33,7 @@ cat > "$conf"<<EOF
 (register-services
  (make <service>
    #:provides '(foo)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop  (const #f)
    #:docstring "Foo!"
    #:respawn? #t)
@@ -85,7 +85,7 @@ root_service_sexp="
 	     (service (version 0)
 	       (provides (foo)) (requires ())
 	       (respawn? #t) (docstring \"Foo!\")
-	       (enabled? #t) (running 42) (conflicts ())
+	       (enabled? #t) (running abc) (conflicts ())
 	       (last-respawns ()))
 	     (service (version 0)
 	       (provides (bar)) (requires (foo))
-- 
2.16.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-03-04 22:35                   ` Carlo Zancanaro
@ 2018-03-04 22:49                     ` Ludovic Courtès
  2018-03-04 23:08                       ` Carlo Zancanaro
  0 siblings, 1 reply; 14+ messages in thread
From: Ludovic Courtès @ 2018-03-04 22:49 UTC (permalink / raw)
  To: Carlo Zancanaro; +Cc: 30637

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

> On Sun, Mar 04 2018, Ludovic Courtès wrote:
>> Good catch.  We could add this in gnu-build-system.scm in
>> core-updates, though it’s no big deal anyway since these are
>> throw-away environments.
>>
>> Thoughts?
>
> The current forking-service.sh test fails in that environment, so we
> won't be able to build shepherd on Hurd, or systems with Linux pre
> 3.4. This is already the case without my third commit, though, because
> the prctl fallback logic isn't in place yet.
>
> I think we should add it in core-updates. It does affect the behaviour
> of processes within the build environment, and can lead to test
> failures if people rely on pid 1 to reap zombie processes (which, from
> what I understand, they should be able to). This could even be leading
> to test failures in other packages which we have just disabled.

Yeah, makes sense.

>>> +            (match (select (list sock) (list) (list) 0.5)
>>> +              (((sock) _ _)
>>> +               (read-from sock))
>>> +              (_
>>> +               #f))
>>> +            (poll-services)
>>
>> Here everyone ends up paying some overhead (the 0.5 second timeout),
>> which isn’t great.
>>
>> How about something like:
>>
>>   (define poll-services
>>     (and (not (= 1 (getpid)))
>>          …))
>>
>>   (match (select (list sock) '() '() (if poll-services 0.5 0))
>>     …)
>
> The wait for 0.5 seconds is only an upper-bound for the
> timeout. Changing it to a 0 would actually be worse, because it would
> spend longer polling for running services. The `select` procedure
> waits for `sock` to be ready to read from. When it's ready it returns
> immediately, but if `sock` takes more than 0.5 seconds to be ready
> then it will return anyway (and take the second branch in the match,
> which does nothing).

Sorry, I didn’t mean 0 but rather #f (indefinite wait).

My point is: we shouldn’t wake up every 0.5 seconds for no reason.  IOW,
we should wake up periodically only in the non-pid-1-no-prctl case.

Does that make sense?

Ludo’.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-03-04 22:49                     ` Ludovic Courtès
@ 2018-03-04 23:08                       ` Carlo Zancanaro
  2018-03-05 14:15                         ` bug#30637: " Ludovic Courtès
  0 siblings, 1 reply; 14+ messages in thread
From: Carlo Zancanaro @ 2018-03-04 23:08 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 30637


[-- Attachment #1.1: Type: text/plain, Size: 361 bytes --]


On Sun, Mar 04 2018, Ludovic Courtès wrote:
> Sorry, I didn’t mean 0 but rather #f (indefinite wait).
>
> My point is: we shouldn’t wake up every 0.5 seconds for no 
> reason.  IOW,
> we should wake up periodically only in the non-pid-1-no-prctl 
> case.
>
> Does that make sense?

Yep! That makes a lot more sense. Patch attached.

Carlo


[-- Attachment #1.2: 0001-Poll-every-0.5s-to-find-dead-forked-services-if-prct.patch --]
[-- Type: text/x-patch, Size: 12448 bytes --]

From be442ea64e4fd8e235378a5f04d38296c0af9cf3 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Wed, 21 Feb 2018 22:57:59 +1100
Subject: [PATCH] Poll every 0.5s to find dead forked services if prctl fails.

* modules/shepherd.scm (open-server-socket): Set socket to be
  non-blocking.
  (main): If we are unable to use prctl/PR_SET_CHILD_SUBREAPER, then poll for
  service processes between client connections, or every 0.5 seconds.
* modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
  signal handler.
  (respawn-service): Separate logic for respawning services from handling
  SIGCHLD.
  (handle-SIGCHLD, check-for-dead-services): New exported procedures.
* tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
  symbols.
* doc/shepherd.texi (Slots of services): Add note about service running slot
  being a process id.
---
 doc/shepherd.texi            |  4 ++-
 modules/shepherd.scm         | 46 ++++++++++++++++++-------
 modules/shepherd/service.scm | 82 ++++++++++++++++++++++++++++----------------
 tests/basic.sh               |  4 +--
 tests/status-sexp.sh         |  4 +--
 5 files changed, 94 insertions(+), 46 deletions(-)

diff --git a/doc/shepherd.texi b/doc/shepherd.texi
index 815091f..47005d5 100644
--- a/doc/shepherd.texi
+++ b/doc/shepherd.texi
@@ -608,7 +608,9 @@ way.  The default value is @code{#f}, which indicates that the service
 is not running. When an attempt is made to start the service, it will
 be set to the return value of the procedure in the @code{start} slot.
 It will also be passed as an argument to the procedure in the
-@code{stop} slot.  This slot can not be initialized with a keyword.
+@code{stop} slot.  If it is set a value that is an integer, it is
+assumed to be a process id, and shepherd will monitor the process for
+unexpected exits.  This slot can not be initialized with a keyword.
 
 @item
 @vindex respawn? (slot of <service>)
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index faa1e47..9d94881 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -42,6 +42,8 @@
   (with-fluids ((%default-port-encoding "UTF-8"))
     (let ((sock    (socket PF_UNIX SOCK_STREAM 0))
           (address (make-socket-address AF_UNIX file-name)))
+      (fcntl sock F_SETFL (logior O_NONBLOCK
+                                  (fcntl sock F_GETFL)))
       (bind sock address)
       (listen sock 10)
       sock)))
@@ -49,14 +51,26 @@
 \f
 ;; Main program.
 (define (main . args)
-  (initialize-cli)
+  (define poll-services?
+    (and (not (= 1 (getpid))) ;; if we're pid 1 we don't need to do anything
+         (catch 'system-error
+           (lambda ()
+             ;; Register for orphaned processes to be reparented onto us when
+             ;; their original parent dies. This lets us handle SIGCHLD from
+             ;; daemon processes that would otherwise have been reparented
+             ;; under pid 1. Obviously this is unnecessary when we are pid 1.
+             (prctl PR_SET_CHILD_SUBREAPER 1)
+             #f) ;; don't poll
+           (lambda args
+             ;; We fall back to polling for services on systems that don't
+             ;; support prctl/PR_SET_CHILD_SUBREAPER
+             (let ((errno (system-error-errno args)))
+               (if (or (= ENOSYS errno) ;; prctl not available
+                       (= EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not available
+                   #t ;; poll
+                   (apply throw args)))))))
 
-  (when (not (= 1 (getpid)))
-    ;; Register for orphaned processes to be reparented onto us when their
-    ;; original parent dies. This lets us handle SIGCHLD from daemon processes
-    ;; that would otherwise have been reparented under pid 1. This is
-    ;; unnecessary when we are pid 1.
-    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
+  (initialize-cli)
 
   (let ((config-file #f)
 	(socket-file default-socket-file)
@@ -225,11 +239,19 @@
             (_  #t))
 
           (let next-command ()
-            (match (accept sock)
-              ((command-source . client-address)
-               (setvbuf command-source _IOFBF 1024)
-               (process-connection command-source))
-              (_ #f))
+            (define (read-from sock)
+              (match (accept sock)
+                ((command-source . client-address)
+                 (setvbuf command-source _IOFBF 1024)
+                 (process-connection command-source))
+                (_ #f)))
+            (match (select (list sock) (list) (list) (if poll-services? 0.5 #f))
+              (((sock) _ _)
+               (read-from sock))
+              (_
+               #f))
+            (when poll-services?
+              (check-for-dead-services))
             (next-command))))))
 
 (define (process-connection sock)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index b6394f2..056483a 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -3,6 +3,7 @@
 ;; Copyright (C) 2002, 2003 Wolfgang Järling <wolfgang@pro-linux.de>
 ;; Copyright (C) 2014 Alex Sassmannshausen <alex.sassmannshausen@gmail.com>
 ;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
+;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -64,6 +65,7 @@
             for-each-service
             lookup-services
             respawn-service
+            handle-SIGCHLD
             register-services
             provided-by
             required-by
@@ -77,6 +79,7 @@
             make-system-destructor
             make-init.d-service
 
+            check-for-dead-services
             root-service
             make-actions
 
@@ -800,7 +803,7 @@ false."
 its PID."
   ;; Install the SIGCHLD handler if this is the first fork+exec-command call
   (unless %sigchld-handler-installed?
-    (sigaction SIGCHLD respawn-service SA_NOCLDSTOP)
+    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
     (set! %sigchld-handler-installed? #t))
   (let ((pid (primitive-fork)))
     (if (zero? pid)
@@ -991,7 +994,7 @@ child left."
                           what (strerror errno))
             '(0 . #f)))))))
 
-(define (respawn-service signum)
+(define (handle-SIGCHLD signum)
   "Handle SIGCHLD, possibly by respawning the service that just died, or
 otherwise by updating its state."
   (let loop ()
@@ -1010,38 +1013,44 @@ otherwise by updating its state."
          ;; SERV can be #f for instance when this code runs just after a
          ;; service's 'stop' method killed its process and completed.
          (when serv
-           (slot-set! serv 'running #f)
-           (if (and (respawn? serv)
-                    (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
-                                             (car respawn-limit)
-                                             (cdr respawn-limit))))
-               (if (not (slot-ref serv 'waiting-for-termination?))
-                   (begin
-                     ;; Everything is okay, start it.
-                     (local-output "Respawning ~a."
-                                   (canonical-name serv))
-                     (slot-set! serv 'last-respawns
-                                (cons (current-time)
-                                      (slot-ref serv 'last-respawns)))
-                     (start serv))
-                   ;; We have just been waiting for the
-                   ;; termination.  The `running' slot has already
-                   ;; been set to `#f' by `stop'.
-                   (begin
-                     (local-output "Service ~a terminated."
-                                   (canonical-name serv))
-                     (slot-set! serv 'waiting-for-termination? #f)))
-               (begin
-                 (local-output "Service ~a has been disabled."
-                               (canonical-name serv))
-                 (when (respawn? serv)
-                   (local-output "  (Respawning too fast.)"))
-                 (slot-set! serv 'enabled? #f))))
+           (respawn-service serv))
 
          ;; As noted in libc's manual (info "(libc) Process Completion"),
          ;; loop so we don't miss any terminated child process.
          (loop))))))
 
+(define (respawn-service serv)
+  "Respawn a service that has stopped running unexpectedly. If we have
+attempted to respawn the service a number of times already and it keeps dying,
+then disable it."
+  (slot-set! serv 'running #f)
+  (if (and (respawn? serv)
+           (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
+                                    (car respawn-limit)
+                                    (cdr respawn-limit))))
+      (if (not (slot-ref serv 'waiting-for-termination?))
+          (begin
+            ;; Everything is okay, start it.
+            (local-output "Respawning ~a."
+                          (canonical-name serv))
+            (slot-set! serv 'last-respawns
+                       (cons (current-time)
+                             (slot-ref serv 'last-respawns)))
+            (start serv))
+          ;; We have just been waiting for the
+          ;; termination.  The `running' slot has already
+          ;; been set to `#f' by `stop'.
+          (begin
+            (local-output "Service ~a terminated."
+                          (canonical-name serv))
+            (slot-set! serv 'waiting-for-termination? #f)))
+      (begin
+        (local-output "Service ~a has been disabled."
+                      (canonical-name serv))
+        (when (respawn? serv)
+          (local-output "  (Respawning too fast.)"))
+        (slot-set! serv 'enabled? #f))))
+
 ;; Add NEW-SERVICES to the list of known services.
 (define (register-services . new-services)
   (define (register-single-service new)
@@ -1171,6 +1180,21 @@ file when persistence is enabled."
         (lambda (p)
           (format p "~{~a ~}~%" running-services))))))
 
+(define (check-for-dead-services)
+  "Poll each process that we expect to be running, and respawn any which have
+unexpectedly stopped running. This procedure is used as a fallback on systems
+where prctl/PR_SET_CHILD_SUBREAPER is unsupported."
+  (define (process-exists? pid)
+    (catch #t
+      (lambda () (kill pid 0) #t)
+      (lambda _ #f)))
+  (for-each-service (lambda (service)
+                      (let ((running (slot-ref service 'running)))
+                        (when (and (integer? running)
+                                   (not (process-exists? running)))
+                            (local-output "PID ~a (~a) is dead!" running (canonical-name service))
+                            (respawn-service service))))))
+
 (define root-service
   (make <service>
     #:docstring "The root service is used to operate on shepherd itself."
diff --git a/tests/basic.sh b/tests/basic.sh
index 1ddb334..2ecd8fb 100644
--- a/tests/basic.sh
+++ b/tests/basic.sh
@@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" <<EOF
 (register-services
  (make <service>
    #:provides '(test-loaded)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop (const #f)))
 EOF
 
@@ -166,7 +166,7 @@ $herd status test-loaded
 $herd status test-loaded | grep stopped
 
 $herd start test-loaded
-$herd status test-loaded | grep -i 'running.*42'
+$herd status test-loaded | grep -i 'running.*abc'
 $herd stop test-loaded
 $herd unload root test-loaded
 
diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh
index b7c8cb4..11b967e 100644
--- a/tests/status-sexp.sh
+++ b/tests/status-sexp.sh
@@ -33,7 +33,7 @@ cat > "$conf"<<EOF
 (register-services
  (make <service>
    #:provides '(foo)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop  (const #f)
    #:docstring "Foo!"
    #:respawn? #t)
@@ -85,7 +85,7 @@ root_service_sexp="
 	     (service (version 0)
 	       (provides (foo)) (requires ())
 	       (respawn? #t) (docstring \"Foo!\")
-	       (enabled? #t) (running 42) (conflicts ())
+	       (enabled? #t) (running abc) (conflicts ())
 	       (last-respawns ()))
 	     (service (version 0)
 	       (provides (bar)) (requires (foo))
-- 
2.16.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* bug#30637: [WIP] shepherd: Poll every 0.5s to find dead forked services
  2018-03-04 23:08                       ` Carlo Zancanaro
@ 2018-03-05 14:15                         ` Ludovic Courtès
  0 siblings, 0 replies; 14+ messages in thread
From: Ludovic Courtès @ 2018-03-05 14:15 UTC (permalink / raw)
  To: Carlo Zancanaro; +Cc: 30637-done

[-- Attachment #1: Type: text/plain, Size: 1087 bytes --]

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

> From be442ea64e4fd8e235378a5f04d38296c0af9cf3 Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Wed, 21 Feb 2018 22:57:59 +1100
> Subject: [PATCH] Poll every 0.5s to find dead forked services if prctl fails.
>
> * modules/shepherd.scm (open-server-socket): Set socket to be
>   non-blocking.
>   (main): If we are unable to use prctl/PR_SET_CHILD_SUBREAPER, then poll for
>   service processes between client connections, or every 0.5 seconds.
> * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
>   signal handler.
>   (respawn-service): Separate logic for respawning services from handling
>   SIGCHLD.
>   (handle-SIGCHLD, check-for-dead-services): New exported procedures.
> * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
>   symbols.
> * doc/shepherd.texi (Slots of services): Add note about service running slot
>   being a process id.

Awesome.  Applied with minor cosmetic changes (see below).

Thank you!

Ludo’.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-patch, Size: 1545 bytes --]

diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index 9d94881..2b4a7b5 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -52,7 +52,8 @@
 ;; Main program.
 (define (main . args)
   (define poll-services?
-    (and (not (= 1 (getpid))) ;; if we're pid 1 we don't need to do anything
+    ;; Do we need polling to find out whether services died?
+    (and (not (= 1 (getpid)))                     ;if we're pid 1, we don't
          (catch 'system-error
            (lambda ()
              ;; Register for orphaned processes to be reparented onto us when
@@ -60,14 +61,13 @@
              ;; daemon processes that would otherwise have been reparented
              ;; under pid 1. Obviously this is unnecessary when we are pid 1.
              (prctl PR_SET_CHILD_SUBREAPER 1)
-             #f) ;; don't poll
+             #f)                                  ;don't poll
            (lambda args
              ;; We fall back to polling for services on systems that don't
-             ;; support prctl/PR_SET_CHILD_SUBREAPER
+             ;; support prctl/PR_SET_CHILD_SUBREAPER.
              (let ((errno (system-error-errno args)))
-               (if (or (= ENOSYS errno) ;; prctl not available
-                       (= EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not available
-                   #t ;; poll
+               (or (= ENOSYS errno)        ;prctl unavailable
+                   (= EINVAL errno)        ;PR_SET_CHILD_SUBREAPER unavailable
                    (apply throw args)))))))
 
   (initialize-cli)

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-03-05 14:16 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-27 21:56 [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Carlo Zancanaro
2018-02-28 22:06 ` Ludovic Courtès
2018-03-01 22:37   ` Carlo Zancanaro
2018-03-02  9:44     ` Ludovic Courtès
2018-03-02 10:13       ` Carlo Zancanaro
2018-03-02 12:42         ` Ludovic Courtès
2018-03-03  7:58           ` Carlo Zancanaro
2018-03-03 15:21             ` Ludovic Courtès
2018-03-03 20:49               ` Carlo Zancanaro
2018-03-04 22:11                 ` Ludovic Courtès
2018-03-04 22:35                   ` Carlo Zancanaro
2018-03-04 22:49                     ` Ludovic Courtès
2018-03-04 23:08                       ` Carlo Zancanaro
2018-03-05 14:15                         ` bug#30637: " Ludovic Courtès

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).