From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:34968) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1es8z7-00056T-Va for guix-patches@gnu.org; Sat, 03 Mar 2018 10:22:07 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1es8z4-0002ez-Gh for guix-patches@gnu.org; Sat, 03 Mar 2018 10:22:05 -0500 Received: from debbugs.gnu.org ([208.118.235.43]:35049) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1es8z4-0002et-Bp for guix-patches@gnu.org; Sat, 03 Mar 2018 10:22:02 -0500 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1es8z4-00006J-2v for guix-patches@gnu.org; Sat, 03 Mar 2018 10:22:02 -0500 Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-Message-ID: From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> <87r2p2izgz.fsf@gnu.org> <87371ihjj2.fsf@zancanaro.id.au> <87po4mhcn2.fsf@gnu.org> <87inadr3np.fsf@zancanaro.id.au> Date: Sat, 03 Mar 2018 16:21:07 +0100 In-Reply-To: <87inadr3np.fsf@zancanaro.id.au> (Carlo Zancanaro's message of "Sat, 03 Mar 2018 18:58:50 +1100") Message-ID: <87h8px9od8.fsf@gnu.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-patches-bounces+kyle=kyleam.com@gnu.org Sender: "Guix-patches" To: Carlo Zancanaro Cc: 30637@debbugs.gnu.org Hi Carlo, Overall LGTM! It=E2=80=99s a long reply though, but that=E2=80=99s because= there are lots of details to pay attention to in this Unix quagmire. :-) Carlo Zancanaro skribis: > I've re-written my patch, and it's attached in two commits. The first > one adds the necessary calls to prctl, and the second adds the > fallback to polling. If possible I would prefer a commit that only adds prctl, and the next one would actually use it. I find it clearer and more convenient if we need to bisect or revert. > On Fri, Mar 02 2018, Ludovic Court=C3=A8s wrote: >> The =E2=80=98prctl=E2=80=99 procedure itself should simply throw to 'sys= tem-error on >> GNU/Hurd. But then, in (shepherd), we could add the polling thing >> when (not (string-contains %host-type "linux")). >> >> WDYT? > > I don't like the idea of doing this based on the host type. In my > patch I've done it based on whether the prctl call succeeded. Right, I actually agree with feature-based checks. :-) More on that inline below. > The fallback code still fails in the guix build environment (as my > previous patch did), but when it's using prctl it works properly. This > means that a build on Linux pre-3.4, or on Hurd, will fail, which > probably isn't acceptable given that shepherd is a hard dependency for > starting a GuixSD system. As far as I can tell the test fails because > the processes stick around as zombies, If they=E2=80=99re zombies, that means nobody called waitpid(2). Presumabl= y the polling code would need to do that. So I suppose =E2=80=98check-for-dead-services=E2=80=99 should do something = like: (when (integer? running) (catch 'system-error (lambda () (match (waitpid* running WNOHANG) (#f #f) ;uh, not a PID? ((0 . _) #f) ;ditto? ((pid . _) (local-output "PID ~a (~a) is dead" running (canonical-n= ame service)) (respawn-service service)))) (lambda args (or (=3D ECHILD (system-error-errno args)) ;wrong PID? (=3D EPERM (system-error-errno args)) ;not a child (apply throw args))))) Does that make sense? Please check waitpid(2) carefully though, because it=E2=80=99s pretty gnarly and I might have forgotten or misinterpreted something here. >> We want to set the =E2=80=9Creaper=E2=80=9D of child processes to Shephe= rd itself, >> so we must do that in child processes. The shepherd process cannot >> be its own reaper I suppose. > > Reading the manpage, and then running some code, I think you're wrong > about this. Using prctl with PR_SET_CHILD_SUBREAPER marks the calling > process as a child subreaper. That means that any processes that are > orphaned below the current process get reparented under the current > process (or a closer child subreaper, if there's one further down). If > there are no processes marked as child subreapers, then the orphaned > process is reparented under pid 1. We should only need to call prctl > in two places: when shepherd initially starts, or when we fork for > daemonize. Oh you=E2=80=99re right, sorry for the confusion! > From 5f26da2ce6a26c8412368900987ac5438f3e70cd Mon Sep 17 00:00:00 2001 > From: Carlo Zancanaro > Date: Sat, 3 Mar 2018 17:26:05 +1100 > Subject: [PATCH 1/2] Handle forked process SIGCHLD signals > > * Makefile.am (TESTS): Add tests/forking-service.sh. > * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER. > * modules/shepherd.scm: Set the child subreaper attribute of main shepherd > process (as long as we're not pid 1). > * modules/shepherd/service.scm (root-service)[daemonize]: Set the child > subreaper attribute of newly forked shepherd process. > * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variab= le > and export it. > (prctl): Add new procedure and export it. [...] > --- a/modules/shepherd.scm > +++ b/modules/shepherd.scm > @@ -50,6 +50,8 @@ > ;; Main program. > (define (main . args) > (initialize-cli) > + (when (not (=3D 1 (getpid))) > + (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))) I think it=E2=80=99s a good idea to add a comment, like: ;; Register ourselves to get SIGCHLD when child processes terminate. ;; This is necessary for daemons for which we=E2=80=99d otherwise never g= et ;; SIGCHLD. > +(define prctl > + (let ((proc (syscall->procedure long "prctl" (list int int)))) On GNU/Hurd libc doesn=E2=80=99t have a =E2=80=9Cprctl=E2=80=9D symbol. So= you need something like: (if (dynamic-func "prctl" (dynamic-link)) (let ((proc =E2=80=A6)) =E2=80=A6) (lambda (process operation) ;; We=E2=80=99re running on an OS that lacks =E2=80=98prctl=E2=80= =99, such as GNU/Hurd. (throw 'system-error "prctl" "~A" (list (strerror ENOSYS)) (list ENOSYS)))) > +function cleanup() { You need either () or =E2=80=9Cfunction=E2=80=9D but not both (shells other= than Bash might complain=E2=80=A6). > +shepherd_pid=3D"$(cat $pid)" Likewise, we should use `foo` instead of $(foo) to suppose non-Bash shells (there are several occurrences of $(foo) here.) > From ec47fa189c7d47f1d9444d939b084850f0a7186c Mon Sep 17 00:00:00 2001 > From: Carlo Zancanaro > Date: Wed, 21 Feb 2018 22:57:59 +1100 > Subject: [PATCH 2/2] Poll every 0.5s to find dead forked services > > * modules/shepherd.scm (open-server-socket): Set socket to be > non-blocking. > (main): Use select with a timeout. If prctl failed when shepherd started > then call check-for-dead-services between connections/timeouts. > * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHL= D as > signal handler. > (respawn-service): Separate logic for respawning services from handling > SIGCHLD. > (handle-SIGCHLD, check-for-dead-services): New exported procedures. > * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with > symbols. [...] > + (define poll-services > + (if (=3D 1 (getpid)) > + (lambda () #f) > + (catch 'system-error > + (lambda () > + (prctl PR_SET_CHILD_SUBREAPER 1) > + (lambda () #f)) > + (lambda (key . args) > + check-for-dead-services)))) Please add a comment in the handler saying that we resort to polling on OSes that do not support =E2=80=98prctl=E2=80=99. However, perhaps we should do: (lambda args (let ((errno (system-error-errno args))) (if (=3D ENOSYS errno) check-for-dead-services (apply throw args)))) so that important/unexpected errors are not silently ignored. > +(define (respawn-service serv) > + (when serv Please add a docstring and move =E2=80=98when=E2=80=99 to the caller. > +(define (check-for-dead-services) Docstring as well :-), and also a comment explaining that this is a last resort for prctl-less OSes. > (register-services > (make > #:provides '(test-loaded) > - #:start (const 42) > + #:start (const 'abc) Perhaps with the =E2=80=98check-for-dead-services=E2=80=99 use of =E2=80=98= waitpid=E2=80=99 I outlined above we can even keep 42 here? If not, we should add in shepherd.texi, under =E2=80=9CSlots of services=E2= =80=9D, a few words saying that when =E2=80=98running=E2=80=99 is an integer it is assume= d to be a PID. Could you send an updated patch? Thanks for working on this! Ludo=E2=80=99.