From: ludo@gnu.org (Ludovic Courtès)
To: Carlo Zancanaro <carlo@zancanaro.id.au>
Cc: 30637@debbugs.gnu.org
Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
Date: Sat, 03 Mar 2018 16:21:07 +0100 [thread overview]
Message-ID: <87h8px9od8.fsf@gnu.org> (raw)
In-Reply-To: <87inadr3np.fsf@zancanaro.id.au> (Carlo Zancanaro's message of "Sat, 03 Mar 2018 18:58:50 +1100")
Hi Carlo,
Overall LGTM! It’s a long reply though, but that’s because there are
lots of details to pay attention to in this Unix quagmire. :-)
Carlo Zancanaro <carlo@zancanaro.id.au> skribis:
> I've re-written my patch, and it's attached in two commits. The first
> one adds the necessary calls to prctl, and the second adds the
> fallback to polling.
If possible I would prefer a commit that only adds prctl, and the next
one would actually use it. I find it clearer and more convenient if we
need to bisect or revert.
> On Fri, Mar 02 2018, Ludovic Courtès wrote:
>> The ‘prctl’ procedure itself should simply throw to 'system-error on
>> GNU/Hurd. But then, in (shepherd), we could add the polling thing
>> when (not (string-contains %host-type "linux")).
>>
>> WDYT?
>
> I don't like the idea of doing this based on the host type. In my
> patch I've done it based on whether the prctl call succeeded.
Right, I actually agree with feature-based checks. :-)
More on that inline below.
> The fallback code still fails in the guix build environment (as my
> previous patch did), but when it's using prctl it works properly. This
> means that a build on Linux pre-3.4, or on Hurd, will fail, which
> probably isn't acceptable given that shepherd is a hard dependency for
> starting a GuixSD system. As far as I can tell the test fails because
> the processes stick around as zombies,
If they’re zombies, that means nobody called waitpid(2). Presumably the
polling code would need to do that.
So I suppose ‘check-for-dead-services’ should do something like:
(when (integer? running)
(catch 'system-error
(lambda ()
(match (waitpid* running WNOHANG)
(#f #f) ;uh, not a PID?
((0 . _) #f) ;ditto?
((pid . _)
(local-output "PID ~a (~a) is dead" running (canonical-name service))
(respawn-service service))))
(lambda args
(or (= ECHILD (system-error-errno args)) ;wrong PID?
(= EPERM (system-error-errno args)) ;not a child
(apply throw args)))))
Does that make sense? Please check waitpid(2) carefully though, because
it’s pretty gnarly and I might have forgotten or misinterpreted
something here.
>> We want to set the “reaper” of child processes to Shepherd itself,
>> so we must do that in child processes. The shepherd process cannot
>> be its own reaper I suppose.
>
> Reading the manpage, and then running some code, I think you're wrong
> about this. Using prctl with PR_SET_CHILD_SUBREAPER marks the calling
> process as a child subreaper. That means that any processes that are
> orphaned below the current process get reparented under the current
> process (or a closer child subreaper, if there's one further down). If
> there are no processes marked as child subreapers, then the orphaned
> process is reparented under pid 1. We should only need to call prctl
> in two places: when shepherd initially starts, or when we fork for
> daemonize.
Oh you’re right, sorry for the confusion!
> From 5f26da2ce6a26c8412368900987ac5438f3e70cd Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Sat, 3 Mar 2018 17:26:05 +1100
> Subject: [PATCH 1/2] Handle forked process SIGCHLD signals
>
> * Makefile.am (TESTS): Add tests/forking-service.sh.
> * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
> * modules/shepherd.scm: Set the child subreaper attribute of main shepherd
> process (as long as we're not pid 1).
> * modules/shepherd/service.scm (root-service)[daemonize]: Set the child
> subreaper attribute of newly forked shepherd process.
> * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
> and export it.
> (prctl): Add new procedure and export it.
[...]
> --- a/modules/shepherd.scm
> +++ b/modules/shepherd.scm
> @@ -50,6 +50,8 @@
> ;; Main program.
> (define (main . args)
> (initialize-cli)
> + (when (not (= 1 (getpid)))
> + (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
I think it’s a good idea to add a comment, like:
;; Register ourselves to get SIGCHLD when child processes terminate.
;; This is necessary for daemons for which we’d otherwise never get
;; SIGCHLD.
> +(define prctl
> + (let ((proc (syscall->procedure long "prctl" (list int int))))
On GNU/Hurd libc doesn’t have a “prctl” symbol. So you need something
like:
(if (dynamic-func "prctl" (dynamic-link))
(let ((proc …)) …)
(lambda (process operation)
;; We’re running on an OS that lacks ‘prctl’, such as GNU/Hurd.
(throw 'system-error "prctl" "~A" (list (strerror ENOSYS))
(list ENOSYS))))
> +function cleanup() {
You need either () or “function” but not both (shells other than Bash
might complain…).
> +shepherd_pid="$(cat $pid)"
Likewise, we should use `foo` instead of $(foo) to suppose non-Bash
shells (there are several occurrences of $(foo) here.)
> From ec47fa189c7d47f1d9444d939b084850f0a7186c Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Wed, 21 Feb 2018 22:57:59 +1100
> Subject: [PATCH 2/2] Poll every 0.5s to find dead forked services
>
> * modules/shepherd.scm (open-server-socket): Set socket to be
> non-blocking.
> (main): Use select with a timeout. If prctl failed when shepherd started
> then call check-for-dead-services between connections/timeouts.
> * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
> signal handler.
> (respawn-service): Separate logic for respawning services from handling
> SIGCHLD.
> (handle-SIGCHLD, check-for-dead-services): New exported procedures.
> * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
> symbols.
[...]
> + (define poll-services
> + (if (= 1 (getpid))
> + (lambda () #f)
> + (catch 'system-error
> + (lambda ()
> + (prctl PR_SET_CHILD_SUBREAPER 1)
> + (lambda () #f))
> + (lambda (key . args)
> + check-for-dead-services))))
Please add a comment in the handler saying that we resort to polling on
OSes that do not support ‘prctl’.
However, perhaps we should do:
(lambda args
(let ((errno (system-error-errno args)))
(if (= ENOSYS errno)
check-for-dead-services
(apply throw args))))
so that important/unexpected errors are not silently ignored.
> +(define (respawn-service serv)
> + (when serv
Please add a docstring and move ‘when’ to the caller.
> +(define (check-for-dead-services)
Docstring as well :-), and also a comment explaining that this is a last
resort for prctl-less OSes.
> (register-services
> (make <service>
> #:provides '(test-loaded)
> - #:start (const 42)
> + #:start (const 'abc)
Perhaps with the ‘check-for-dead-services’ use of ‘waitpid’ I outlined
above we can even keep 42 here?
If not, we should add in shepherd.texi, under “Slots of services”, a few
words saying that when ‘running’ is an integer it is assumed to be a
PID.
Could you send an updated patch?
Thanks for working on this!
Ludo’.
next prev parent reply other threads:[~2018-03-03 15:22 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-27 21:56 [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Carlo Zancanaro
2018-02-28 22:06 ` Ludovic Courtès
2018-03-01 22:37 ` Carlo Zancanaro
2018-03-02 9:44 ` Ludovic Courtès
2018-03-02 10:13 ` Carlo Zancanaro
2018-03-02 12:42 ` Ludovic Courtès
2018-03-03 7:58 ` Carlo Zancanaro
2018-03-03 15:21 ` Ludovic Courtès [this message]
2018-03-03 20:49 ` Carlo Zancanaro
2018-03-04 22:11 ` Ludovic Courtès
2018-03-04 22:35 ` Carlo Zancanaro
2018-03-04 22:49 ` Ludovic Courtès
2018-03-04 23:08 ` Carlo Zancanaro
2018-03-05 14:15 ` bug#30637: " Ludovic Courtès
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://guix.gnu.org/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87h8px9od8.fsf@gnu.org \
--to=ludo@gnu.org \
--cc=30637@debbugs.gnu.org \
--cc=carlo@zancanaro.id.au \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://git.savannah.gnu.org/cgit/guix.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).