unofficial mirror of guix-patches@gnu.org 
 help / color / mirror / code / Atom feed
From: Carlo Zancanaro <carlo@zancanaro.id.au>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 30637@debbugs.gnu.org
Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
Date: Mon, 05 Mar 2018 09:35:58 +1100	[thread overview]
Message-ID: <871sgzpiy9.fsf@zancanaro.id.au> (raw)
In-Reply-To: <87371f4hkf.fsf@gnu.org>


[-- Attachment #1.1: Type: text/plain, Size: 2143 bytes --]


On Sun, Mar 04 2018, Ludovic Courtès wrote:
> Good catch.  We could add this in gnu-build-system.scm in 
> core-updates, though it’s no big deal anyway since these are 
> throw-away environments.
>
> Thoughts?

The current forking-service.sh test fails in that environment, so 
we won't be able to build shepherd on Hurd, or systems with Linux 
pre 3.4. This is already the case without my third commit, though, 
because the prctl fallback logic isn't in place yet.

I think we should add it in core-updates. It does affect the 
behaviour of processes within the build environment, and can lead 
to test failures if people rely on pid 1 to reap zombie processes 
(which, from what I understand, they should be able to). This 
could even be leading to test failures in other packages which we 
have just disabled.

>> +            (match (select (list sock) (list) (list) 0.5)
>> +              (((sock) _ _)
>> +               (read-from sock))
>> +              (_
>> +               #f))
>> +            (poll-services)
>
> Here everyone ends up paying some overhead (the 0.5 second 
> timeout),
> which isn’t great.
>
> How about something like:
>
>   (define poll-services
>     (and (not (= 1 (getpid)))
>          …))
>
>   (match (select (list sock) '() '() (if poll-services 0.5 0))
>     …)

The wait for 0.5 seconds is only an upper-bound for the timeout. 
Changing it to a 0 would actually be worse, because it would spend 
longer polling for running services. The `select` procedure waits 
for `sock` to be ready to read from. When it's ready it returns 
immediately, but if `sock` takes more than 0.5 seconds to be ready 
then it will return anyway (and take the second branch in the 
match, which does nothing).

This should incur no (or extremely minuscule) overhead in how long 
it takes to respond to a socket, but provides an opportunity every 
half a second (at most) for shepherd to poll the running services.

On reflection, we should also change the commit message for this 
commit. I have attached a patch with a more accurate commit 
message.

Carlo


[-- Attachment #1.2: 0001-Poll-every-0.5s-to-find-dead-forked-services-if-prct.patch --]
[-- Type: text/x-patch, Size: 12428 bytes --]

From 5b01f79522c815dd8277298e87eef0506c2e8612 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Wed, 21 Feb 2018 22:57:59 +1100
Subject: [PATCH] Poll every 0.5s to find dead forked services if prctl fails.

* modules/shepherd.scm (open-server-socket): Set socket to be
  non-blocking.
  (main): Use select with a timeout. If prctl failed when shepherd started
  then call check-for-dead-services between connections/timeouts.
* modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
  signal handler.
  (respawn-service): Separate logic for respawning services from handling
  SIGCHLD.
  (handle-SIGCHLD, check-for-dead-services): New exported procedures.
* tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
  symbols.
* doc/shepherd.texi (Slots of services): Add note about service running slot
  being a process id.
---
 doc/shepherd.texi            |  4 ++-
 modules/shepherd.scm         | 47 ++++++++++++++++++-------
 modules/shepherd/service.scm | 82 ++++++++++++++++++++++++++++----------------
 tests/basic.sh               |  4 +--
 tests/status-sexp.sh         |  4 +--
 5 files changed, 95 insertions(+), 46 deletions(-)

diff --git a/doc/shepherd.texi b/doc/shepherd.texi
index 815091f..47005d5 100644
--- a/doc/shepherd.texi
+++ b/doc/shepherd.texi
@@ -608,7 +608,9 @@ way.  The default value is @code{#f}, which indicates that the service
 is not running. When an attempt is made to start the service, it will
 be set to the return value of the procedure in the @code{start} slot.
 It will also be passed as an argument to the procedure in the
-@code{stop} slot.  This slot can not be initialized with a keyword.
+@code{stop} slot.  If it is set a value that is an integer, it is
+assumed to be a process id, and shepherd will monitor the process for
+unexpected exits.  This slot can not be initialized with a keyword.
 
 @item
 @vindex respawn? (slot of <service>)
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index faa1e47..e912d21 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -42,6 +42,8 @@
   (with-fluids ((%default-port-encoding "UTF-8"))
     (let ((sock    (socket PF_UNIX SOCK_STREAM 0))
           (address (make-socket-address AF_UNIX file-name)))
+      (fcntl sock F_SETFL (logior O_NONBLOCK
+                                  (fcntl sock F_GETFL)))
       (bind sock address)
       (listen sock 10)
       sock)))
@@ -49,14 +51,28 @@
 \f
 ;; Main program.
 (define (main . args)
-  (initialize-cli)
+  (define poll-services
+    (if (= 1 (getpid))
+        (lambda () #f) ;; If we're pid 1 then we don't need to set
+                       ;; PR_SET_CHILD_SUBREAPER
+        (catch 'system-error
+          (lambda ()
+            ;; Register for orphaned processes to be reparented onto us when
+            ;; their original parent dies. This lets us handle SIGCHLD from
+            ;; daemon processes that would otherwise have been reparented
+            ;; under pid 1. This is unnecessary when we are pid 1.
+            (prctl PR_SET_CHILD_SUBREAPER 1)
+            (lambda () #f))
+          (lambda args
+            ;; We fall back to polling for services on systems that don't
+            ;; support prctl/PR_SET_CHILD_SUBREAPER
+            (let ((errno (system-error-errno args)))
+              (if (or (= ENOSYS errno) ;; prctl not available
+                      (= EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not available
+                  check-for-dead-services ;; poll
+                  (apply throw args)))))))
 
-  (when (not (= 1 (getpid)))
-    ;; Register for orphaned processes to be reparented onto us when their
-    ;; original parent dies. This lets us handle SIGCHLD from daemon processes
-    ;; that would otherwise have been reparented under pid 1. This is
-    ;; unnecessary when we are pid 1.
-    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
+  (initialize-cli)
 
   (let ((config-file #f)
 	(socket-file default-socket-file)
@@ -225,11 +241,18 @@
             (_  #t))
 
           (let next-command ()
-            (match (accept sock)
-              ((command-source . client-address)
-               (setvbuf command-source _IOFBF 1024)
-               (process-connection command-source))
-              (_ #f))
+            (define (read-from sock)
+              (match (accept sock)
+                ((command-source . client-address)
+                 (setvbuf command-source _IOFBF 1024)
+                 (process-connection command-source))
+                (_ #f)))
+            (match (select (list sock) (list) (list) 0.5)
+              (((sock) _ _)
+               (read-from sock))
+              (_
+               #f))
+            (poll-services)
             (next-command))))))
 
 (define (process-connection sock)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index b6394f2..056483a 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -3,6 +3,7 @@
 ;; Copyright (C) 2002, 2003 Wolfgang Järling <wolfgang@pro-linux.de>
 ;; Copyright (C) 2014 Alex Sassmannshausen <alex.sassmannshausen@gmail.com>
 ;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
+;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
 ;;
 ;; This file is part of the GNU Shepherd.
 ;;
@@ -64,6 +65,7 @@
             for-each-service
             lookup-services
             respawn-service
+            handle-SIGCHLD
             register-services
             provided-by
             required-by
@@ -77,6 +79,7 @@
             make-system-destructor
             make-init.d-service
 
+            check-for-dead-services
             root-service
             make-actions
 
@@ -800,7 +803,7 @@ false."
 its PID."
   ;; Install the SIGCHLD handler if this is the first fork+exec-command call
   (unless %sigchld-handler-installed?
-    (sigaction SIGCHLD respawn-service SA_NOCLDSTOP)
+    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
     (set! %sigchld-handler-installed? #t))
   (let ((pid (primitive-fork)))
     (if (zero? pid)
@@ -991,7 +994,7 @@ child left."
                           what (strerror errno))
             '(0 . #f)))))))
 
-(define (respawn-service signum)
+(define (handle-SIGCHLD signum)
   "Handle SIGCHLD, possibly by respawning the service that just died, or
 otherwise by updating its state."
   (let loop ()
@@ -1010,38 +1013,44 @@ otherwise by updating its state."
          ;; SERV can be #f for instance when this code runs just after a
          ;; service's 'stop' method killed its process and completed.
          (when serv
-           (slot-set! serv 'running #f)
-           (if (and (respawn? serv)
-                    (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
-                                             (car respawn-limit)
-                                             (cdr respawn-limit))))
-               (if (not (slot-ref serv 'waiting-for-termination?))
-                   (begin
-                     ;; Everything is okay, start it.
-                     (local-output "Respawning ~a."
-                                   (canonical-name serv))
-                     (slot-set! serv 'last-respawns
-                                (cons (current-time)
-                                      (slot-ref serv 'last-respawns)))
-                     (start serv))
-                   ;; We have just been waiting for the
-                   ;; termination.  The `running' slot has already
-                   ;; been set to `#f' by `stop'.
-                   (begin
-                     (local-output "Service ~a terminated."
-                                   (canonical-name serv))
-                     (slot-set! serv 'waiting-for-termination? #f)))
-               (begin
-                 (local-output "Service ~a has been disabled."
-                               (canonical-name serv))
-                 (when (respawn? serv)
-                   (local-output "  (Respawning too fast.)"))
-                 (slot-set! serv 'enabled? #f))))
+           (respawn-service serv))
 
          ;; As noted in libc's manual (info "(libc) Process Completion"),
          ;; loop so we don't miss any terminated child process.
          (loop))))))
 
+(define (respawn-service serv)
+  "Respawn a service that has stopped running unexpectedly. If we have
+attempted to respawn the service a number of times already and it keeps dying,
+then disable it."
+  (slot-set! serv 'running #f)
+  (if (and (respawn? serv)
+           (not (respawn-limit-hit? (slot-ref serv 'last-respawns)
+                                    (car respawn-limit)
+                                    (cdr respawn-limit))))
+      (if (not (slot-ref serv 'waiting-for-termination?))
+          (begin
+            ;; Everything is okay, start it.
+            (local-output "Respawning ~a."
+                          (canonical-name serv))
+            (slot-set! serv 'last-respawns
+                       (cons (current-time)
+                             (slot-ref serv 'last-respawns)))
+            (start serv))
+          ;; We have just been waiting for the
+          ;; termination.  The `running' slot has already
+          ;; been set to `#f' by `stop'.
+          (begin
+            (local-output "Service ~a terminated."
+                          (canonical-name serv))
+            (slot-set! serv 'waiting-for-termination? #f)))
+      (begin
+        (local-output "Service ~a has been disabled."
+                      (canonical-name serv))
+        (when (respawn? serv)
+          (local-output "  (Respawning too fast.)"))
+        (slot-set! serv 'enabled? #f))))
+
 ;; Add NEW-SERVICES to the list of known services.
 (define (register-services . new-services)
   (define (register-single-service new)
@@ -1171,6 +1180,21 @@ file when persistence is enabled."
         (lambda (p)
           (format p "~{~a ~}~%" running-services))))))
 
+(define (check-for-dead-services)
+  "Poll each process that we expect to be running, and respawn any which have
+unexpectedly stopped running. This procedure is used as a fallback on systems
+where prctl/PR_SET_CHILD_SUBREAPER is unsupported."
+  (define (process-exists? pid)
+    (catch #t
+      (lambda () (kill pid 0) #t)
+      (lambda _ #f)))
+  (for-each-service (lambda (service)
+                      (let ((running (slot-ref service 'running)))
+                        (when (and (integer? running)
+                                   (not (process-exists? running)))
+                            (local-output "PID ~a (~a) is dead!" running (canonical-name service))
+                            (respawn-service service))))))
+
 (define root-service
   (make <service>
     #:docstring "The root service is used to operate on shepherd itself."
diff --git a/tests/basic.sh b/tests/basic.sh
index 1ddb334..2ecd8fb 100644
--- a/tests/basic.sh
+++ b/tests/basic.sh
@@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" <<EOF
 (register-services
  (make <service>
    #:provides '(test-loaded)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop (const #f)))
 EOF
 
@@ -166,7 +166,7 @@ $herd status test-loaded
 $herd status test-loaded | grep stopped
 
 $herd start test-loaded
-$herd status test-loaded | grep -i 'running.*42'
+$herd status test-loaded | grep -i 'running.*abc'
 $herd stop test-loaded
 $herd unload root test-loaded
 
diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh
index b7c8cb4..11b967e 100644
--- a/tests/status-sexp.sh
+++ b/tests/status-sexp.sh
@@ -33,7 +33,7 @@ cat > "$conf"<<EOF
 (register-services
  (make <service>
    #:provides '(foo)
-   #:start (const 42)
+   #:start (const 'abc)
    #:stop  (const #f)
    #:docstring "Foo!"
    #:respawn? #t)
@@ -85,7 +85,7 @@ root_service_sexp="
 	     (service (version 0)
 	       (provides (foo)) (requires ())
 	       (respawn? #t) (docstring \"Foo!\")
-	       (enabled? #t) (running 42) (conflicts ())
+	       (enabled? #t) (running abc) (conflicts ())
 	       (last-respawns ()))
 	     (service (version 0)
 	       (provides (bar)) (requires (foo))
-- 
2.16.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

  reply	other threads:[~2018-03-04 22:37 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-27 21:56 [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Carlo Zancanaro
2018-02-28 22:06 ` Ludovic Courtès
2018-03-01 22:37   ` Carlo Zancanaro
2018-03-02  9:44     ` Ludovic Courtès
2018-03-02 10:13       ` Carlo Zancanaro
2018-03-02 12:42         ` Ludovic Courtès
2018-03-03  7:58           ` Carlo Zancanaro
2018-03-03 15:21             ` Ludovic Courtès
2018-03-03 20:49               ` Carlo Zancanaro
2018-03-04 22:11                 ` Ludovic Courtès
2018-03-04 22:35                   ` Carlo Zancanaro [this message]
2018-03-04 22:49                     ` Ludovic Courtès
2018-03-04 23:08                       ` Carlo Zancanaro
2018-03-05 14:15                         ` bug#30637: " Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871sgzpiy9.fsf@zancanaro.id.au \
    --to=carlo@zancanaro.id.au \
    --cc=30637@debbugs.gnu.org \
    --cc=ludo@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).