From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:40299) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1escFf-0002CK-Oe for guix-patches@gnu.org; Sun, 04 Mar 2018 17:37:09 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1escFa-0006SJ-Qz for guix-patches@gnu.org; Sun, 04 Mar 2018 17:37:07 -0500 Received: from debbugs.gnu.org ([208.118.235.43]:36901) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1escFa-0006Rs-LR for guix-patches@gnu.org; Sun, 04 Mar 2018 17:37:02 -0500 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1escFa-0000oN-AJ for guix-patches@gnu.org; Sun, 04 Mar 2018 17:37:02 -0500 Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-Message-ID: References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> <87r2p2izgz.fsf@gnu.org> <87371ihjj2.fsf@zancanaro.id.au> <87po4mhcn2.fsf@gnu.org> <87inadr3np.fsf@zancanaro.id.au> <87h8px9od8.fsf@gnu.org> <87woys9961.fsf@zancanaro.id.au> <87371f4hkf.fsf@gnu.org> From: Carlo Zancanaro In-reply-to: <87371f4hkf.fsf@gnu.org> Date: Mon, 05 Mar 2018 09:35:58 +1100 Message-ID: <871sgzpiy9.fsf@zancanaro.id.au> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-patches-bounces+kyle=kyleam.com@gnu.org Sender: "Guix-patches" To: Ludovic =?UTF-8?Q?Court=C3=A8s?= Cc: 30637@debbugs.gnu.org --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable On Sun, Mar 04 2018, Ludovic Court=C3=A8s wrote: > Good catch. We could add this in gnu-build-system.scm in=20 > core-updates, though it=E2=80=99s no big deal anyway since these are=20 > throw-away environments. > > Thoughts? The current forking-service.sh test fails in that environment, so=20 we won't be able to build shepherd on Hurd, or systems with Linux=20 pre 3.4. This is already the case without my third commit, though,=20 because the prctl fallback logic isn't in place yet. I think we should add it in core-updates. It does affect the=20 behaviour of processes within the build environment, and can lead=20 to test failures if people rely on pid 1 to reap zombie processes=20 (which, from what I understand, they should be able to). This=20 could even be leading to test failures in other packages which we=20 have just disabled. >> + (match (select (list sock) (list) (list) 0.5) >> + (((sock) _ _) >> + (read-from sock)) >> + (_ >> + #f)) >> + (poll-services) > > Here everyone ends up paying some overhead (the 0.5 second=20 > timeout), > which isn=E2=80=99t great. > > How about something like: > > (define poll-services > (and (not (=3D 1 (getpid))) > =E2=80=A6)) > > (match (select (list sock) '() '() (if poll-services 0.5 0)) > =E2=80=A6) The wait for 0.5 seconds is only an upper-bound for the timeout.=20 Changing it to a 0 would actually be worse, because it would spend=20 longer polling for running services. The `select` procedure waits=20 for `sock` to be ready to read from. When it's ready it returns=20 immediately, but if `sock` takes more than 0.5 seconds to be ready=20 then it will return anyway (and take the second branch in the=20 match, which does nothing). This should incur no (or extremely minuscule) overhead in how long=20 it takes to respond to a socket, but provides an opportunity every=20 half a second (at most) for shepherd to poll the running services. On reflection, we should also change the commit message for this=20 commit. I have attached a patch with a more accurate commit=20 message. Carlo --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: inline; filename=0001-Poll-every-0.5s-to-find-dead-forked-services-if-prct.patch Content-Transfer-Encoding: quoted-printable From=205b01f79522c815dd8277298e87eef0506c2e8612 Mon Sep 17 00:00:00 2001 From: Carlo Zancanaro Date: Wed, 21 Feb 2018 22:57:59 +1100 Subject: [PATCH] Poll every 0.5s to find dead forked services if prctl fail= s. * modules/shepherd.scm (open-server-socket): Set socket to be non-blocking. (main): Use select with a timeout. If prctl failed when shepherd started then call check-for-dead-services between connections/timeouts. * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD = as signal handler. (respawn-service): Separate logic for respawning services from handling SIGCHLD. (handle-SIGCHLD, check-for-dead-services): New exported procedures. * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with symbols. * doc/shepherd.texi (Slots of services): Add note about service running slot being a process id. =2D-- doc/shepherd.texi | 4 ++- modules/shepherd.scm | 47 ++++++++++++++++++------- modules/shepherd/service.scm | 82 ++++++++++++++++++++++++++++------------= ---- tests/basic.sh | 4 +-- tests/status-sexp.sh | 4 +-- 5 files changed, 95 insertions(+), 46 deletions(-) diff --git a/doc/shepherd.texi b/doc/shepherd.texi index 815091f..47005d5 100644 =2D-- a/doc/shepherd.texi +++ b/doc/shepherd.texi @@ -608,7 +608,9 @@ way. The default value is @code{#f}, which indicates t= hat the service is not running. When an attempt is made to start the service, it will be set to the return value of the procedure in the @code{start} slot. It will also be passed as an argument to the procedure in the =2D@code{stop} slot. This slot can not be initialized with a keyword. +@code{stop} slot. If it is set a value that is an integer, it is +assumed to be a process id, and shepherd will monitor the process for +unexpected exits. This slot can not be initialized with a keyword. =20 @item @vindex respawn? (slot of ) diff --git a/modules/shepherd.scm b/modules/shepherd.scm index faa1e47..e912d21 100644 =2D-- a/modules/shepherd.scm +++ b/modules/shepherd.scm @@ -42,6 +42,8 @@ (with-fluids ((%default-port-encoding "UTF-8")) (let ((sock (socket PF_UNIX SOCK_STREAM 0)) (address (make-socket-address AF_UNIX file-name))) + (fcntl sock F_SETFL (logior O_NONBLOCK + (fcntl sock F_GETFL))) (bind sock address) (listen sock 10) sock))) @@ -49,14 +51,28 @@ ;; Main program. (define (main . args) =2D (initialize-cli) + (define poll-services + (if (=3D 1 (getpid)) + (lambda () #f) ;; If we're pid 1 then we don't need to set + ;; PR_SET_CHILD_SUBREAPER + (catch 'system-error + (lambda () + ;; Register for orphaned processes to be reparented onto us wh= en + ;; their original parent dies. This lets us handle SIGCHLD from + ;; daemon processes that would otherwise have been reparented + ;; under pid 1. This is unnecessary when we are pid 1. + (prctl PR_SET_CHILD_SUBREAPER 1) + (lambda () #f)) + (lambda args + ;; We fall back to polling for services on systems that don't + ;; support prctl/PR_SET_CHILD_SUBREAPER + (let ((errno (system-error-errno args))) + (if (or (=3D ENOSYS errno) ;; prctl not available + (=3D EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not av= ailable + check-for-dead-services ;; poll + (apply throw args))))))) =20 =2D (when (not (=3D 1 (getpid))) =2D ;; Register for orphaned processes to be reparented onto us when the= ir =2D ;; original parent dies. This lets us handle SIGCHLD from daemon pro= cesses =2D ;; that would otherwise have been reparented under pid 1. This is =2D ;; unnecessary when we are pid 1. =2D (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))) + (initialize-cli) =20 (let ((config-file #f) (socket-file default-socket-file) @@ -225,11 +241,18 @@ (_ #t)) =20 (let next-command () =2D (match (accept sock) =2D ((command-source . client-address) =2D (setvbuf command-source _IOFBF 1024) =2D (process-connection command-source)) =2D (_ #f)) + (define (read-from sock) + (match (accept sock) + ((command-source . client-address) + (setvbuf command-source _IOFBF 1024) + (process-connection command-source)) + (_ #f))) + (match (select (list sock) (list) (list) 0.5) + (((sock) _ _) + (read-from sock)) + (_ + #f)) + (poll-services) (next-command)))))) =20 (define (process-connection sock) diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm index b6394f2..056483a 100644 =2D-- a/modules/shepherd/service.scm +++ b/modules/shepherd/service.scm @@ -3,6 +3,7 @@ ;; Copyright (C) 2002, 2003 Wolfgang J=C3=A4rling ;; Copyright (C) 2014 Alex Sassmannshausen ;; Copyright (C) 2016 Alex Kost +;; Copyright (C) 2018 Carlo Zancanaro ;; ;; This file is part of the GNU Shepherd. ;; @@ -64,6 +65,7 @@ for-each-service lookup-services respawn-service + handle-SIGCHLD register-services provided-by required-by @@ -77,6 +79,7 @@ make-system-destructor make-init.d-service =20 + check-for-dead-services root-service make-actions =20 @@ -800,7 +803,7 @@ false." its PID." ;; Install the SIGCHLD handler if this is the first fork+exec-command ca= ll (unless %sigchld-handler-installed? =2D (sigaction SIGCHLD respawn-service SA_NOCLDSTOP) + (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP) (set! %sigchld-handler-installed? #t)) (let ((pid (primitive-fork))) (if (zero? pid) @@ -991,7 +994,7 @@ child left." what (strerror errno)) '(0 . #f))))))) =20 =2D(define (respawn-service signum) +(define (handle-SIGCHLD signum) "Handle SIGCHLD, possibly by respawning the service that just died, or otherwise by updating its state." (let loop () @@ -1010,38 +1013,44 @@ otherwise by updating its state." ;; SERV can be #f for instance when this code runs just after a ;; service's 'stop' method killed its process and completed. (when serv =2D (slot-set! serv 'running #f) =2D (if (and (respawn? serv) =2D (not (respawn-limit-hit? (slot-ref serv 'last-respaw= ns) =2D (car respawn-limit) =2D (cdr respawn-limit)))) =2D (if (not (slot-ref serv 'waiting-for-termination?)) =2D (begin =2D ;; Everything is okay, start it. =2D (local-output "Respawning ~a." =2D (canonical-name serv)) =2D (slot-set! serv 'last-respawns =2D (cons (current-time) =2D (slot-ref serv 'last-respawns))) =2D (start serv)) =2D ;; We have just been waiting for the =2D ;; termination. The `running' slot has already =2D ;; been set to `#f' by `stop'. =2D (begin =2D (local-output "Service ~a terminated." =2D (canonical-name serv)) =2D (slot-set! serv 'waiting-for-termination? #f))) =2D (begin =2D (local-output "Service ~a has been disabled." =2D (canonical-name serv)) =2D (when (respawn? serv) =2D (local-output " (Respawning too fast.)")) =2D (slot-set! serv 'enabled? #f)))) + (respawn-service serv)) =20 ;; As noted in libc's manual (info "(libc) Process Completion"), ;; loop so we don't miss any terminated child process. (loop)))))) =20 +(define (respawn-service serv) + "Respawn a service that has stopped running unexpectedly. If we have +attempted to respawn the service a number of times already and it keeps dy= ing, +then disable it." + (slot-set! serv 'running #f) + (if (and (respawn? serv) + (not (respawn-limit-hit? (slot-ref serv 'last-respawns) + (car respawn-limit) + (cdr respawn-limit)))) + (if (not (slot-ref serv 'waiting-for-termination?)) + (begin + ;; Everything is okay, start it. + (local-output "Respawning ~a." + (canonical-name serv)) + (slot-set! serv 'last-respawns + (cons (current-time) + (slot-ref serv 'last-respawns))) + (start serv)) + ;; We have just been waiting for the + ;; termination. The `running' slot has already + ;; been set to `#f' by `stop'. + (begin + (local-output "Service ~a terminated." + (canonical-name serv)) + (slot-set! serv 'waiting-for-termination? #f))) + (begin + (local-output "Service ~a has been disabled." + (canonical-name serv)) + (when (respawn? serv) + (local-output " (Respawning too fast.)")) + (slot-set! serv 'enabled? #f)))) + ;; Add NEW-SERVICES to the list of known services. (define (register-services . new-services) (define (register-single-service new) @@ -1171,6 +1180,21 @@ file when persistence is enabled." (lambda (p) (format p "~{~a ~}~%" running-services)))))) =20 +(define (check-for-dead-services) + "Poll each process that we expect to be running, and respawn any which h= ave +unexpectedly stopped running. This procedure is used as a fallback on syst= ems +where prctl/PR_SET_CHILD_SUBREAPER is unsupported." + (define (process-exists? pid) + (catch #t + (lambda () (kill pid 0) #t) + (lambda _ #f))) + (for-each-service (lambda (service) + (let ((running (slot-ref service 'running))) + (when (and (integer? running) + (not (process-exists? running))) + (local-output "PID ~a (~a) is dead!" running (= canonical-name service)) + (respawn-service service)))))) + (define root-service (make #:docstring "The root service is used to operate on shepherd itself." diff --git a/tests/basic.sh b/tests/basic.sh index 1ddb334..2ecd8fb 100644 =2D-- a/tests/basic.sh +++ b/tests/basic.sh @@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" < #:provides '(test-loaded) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f))) EOF =20 @@ -166,7 +166,7 @@ $herd status test-loaded $herd status test-loaded | grep stopped =20 $herd start test-loaded =2D$herd status test-loaded | grep -i 'running.*42' +$herd status test-loaded | grep -i 'running.*abc' $herd stop test-loaded $herd unload root test-loaded =20 diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh index b7c8cb4..11b967e 100644 =2D-- a/tests/status-sexp.sh +++ b/tests/status-sexp.sh @@ -33,7 +33,7 @@ cat > "$conf"< #:provides '(foo) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f) #:docstring "Foo!" #:respawn? #t) @@ -85,7 +85,7 @@ root_service_sexp=3D" (service (version 0) (provides (foo)) (requires ()) (respawn? #t) (docstring \"Foo!\") =2D (enabled? #t) (running 42) (conflicts ()) + (enabled? #t) (running abc) (conflicts ()) (last-respawns ())) (service (version 0) (provides (bar)) (requires (foo)) =2D-=20 2.16.1 --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqcdM4ACgkQqdyPv9aw Ibwy9BAAqFvXHBS+P7tUIBC+Espr9Sk3pc3M4Aw4jTxSmtCeE9W6RJnBEu/gRpf8 q8f7mHaWKDaJ4yUvrfudO/OqVVgbuacuqHxaZg5/hVV9O7vw45wnVE/OkgdJNILT hrMGiETzy8nOperMcGgo9Al5zxxO19vKUtXWCi8J11+y57xhlcgvGb+xlNX/rcNu hpnR0leS7zmepkQ6BzNEe5Byn+YqqnUxk7MM4pE/hUlTzQiJW4RR/6lUvEkke1ya KBZp8GRCkpwNSs6OdnU3i0XYhicVZWoLjBHS43vE8xLtdgJcEh8iQQgAqLF6nPvx lYInJKjAyUYuN4eBcdcZ7tMY19brp1aud5X/AW9YQmJZSpHyVMNOP8FTwKfYJ//N 04h1JTqE9YP5Uu8asGeGcllSvxF4po2TCELAj6e2DZ2+IX+4WQ4LWGo+pF1Uifr4 ut5nREw5tvWA9Q6Bmexkvw5jvUpVjk/xYXtp4+m2mXWN9NBlnW6Z/f1sTWQucnbN u7YCET2Jf8oDy7VF5w1iL8OXfQIunpy+1Qf4GVADV9zgVvvOa7Eugd5oG+WGeLYA 44MpMqFRjl9ByjF1O5BhqDa9m/JaDwIG3pJyiYq0Ynn/Py6yPzsHkXIDs5y/HLGU AOtVMrpQX0ngETtsDeq8ZGuJpd44F6Xn2pV70ZTlnb5ciUZBj7k= =zUD8 -----END PGP SIGNATURE----- --==-=-=--