all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: ludo@gnu.org (Ludovic Courtès)
To: Carlo Zancanaro <carlo@zancanaro.id.au>
Cc: 30637@debbugs.gnu.org
Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
Date: Sat, 03 Mar 2018 16:21:07 +0100	[thread overview]
Message-ID: <87h8px9od8.fsf@gnu.org> (raw)
In-Reply-To: <87inadr3np.fsf@zancanaro.id.au> (Carlo Zancanaro's message of "Sat, 03 Mar 2018 18:58:50 +1100")

Hi Carlo,

Overall LGTM!  It’s a long reply though, but that’s because there are
lots of details to pay attention to in this Unix quagmire.  :-)

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

> I've re-written my patch, and it's attached in two commits. The first
> one adds the necessary calls to prctl, and the second adds the
> fallback to polling.

If possible I would prefer a commit that only adds prctl, and the next
one would actually use it.  I find it clearer and more convenient if we
need to bisect or revert.

> On Fri, Mar 02 2018, Ludovic Courtès wrote:
>> The ‘prctl’ procedure itself should simply throw to 'system-error on
>> GNU/Hurd. But then, in (shepherd), we could add the polling thing
>> when (not (string-contains %host-type "linux")).
>>
>> WDYT?
>
> I don't like the idea of doing this based on the host type. In my
> patch I've done it based on whether the prctl call succeeded.

Right, I actually agree with feature-based checks.  :-)

More on that inline below.

> The fallback code still fails in the guix build environment (as my
> previous patch did), but when it's using prctl it works properly. This
> means that a build on Linux pre-3.4, or on Hurd, will fail, which
> probably isn't acceptable given that shepherd is a hard dependency for
> starting a GuixSD system. As far as I can tell the test fails because
> the processes stick around as zombies,

If they’re zombies, that means nobody called waitpid(2).  Presumably the
polling code would need to do that.

So I suppose ‘check-for-dead-services’ should do something like:

          (when (integer? running)
            (catch 'system-error
              (lambda ()
                (match (waitpid* running WNOHANG)
                  (#f #f)  ;uh, not a PID?
                  ((0 . _) #f) ;ditto?
                  ((pid . _)
                   (local-output "PID ~a (~a) is dead" running (canonical-name service))
                   (respawn-service service))))
              (lambda args
                (or (= ECHILD (system-error-errno args))  ;wrong PID?
                    (= EPERM (system-error-errno args))   ;not a child
                    (apply throw args)))))

Does that make sense?  Please check waitpid(2) carefully though, because
it’s pretty gnarly and I might have forgotten or misinterpreted
something here.

>> We want to set the “reaper” of child processes to Shepherd itself,
>> so we must do that in child processes. The shepherd process cannot
>> be its own reaper I suppose.
>
> Reading the manpage, and then running some code, I think you're wrong
> about this. Using prctl with PR_SET_CHILD_SUBREAPER marks the calling
> process as a child subreaper. That means that any processes that are
> orphaned below the current process get reparented under the current
> process (or a closer child subreaper, if there's one further down). If
> there are no processes marked as child subreapers, then the orphaned
> process is reparented under pid 1. We should only need to call prctl
> in two places: when shepherd initially starts, or when we fork for
> daemonize.

Oh you’re right, sorry for the confusion!

> From 5f26da2ce6a26c8412368900987ac5438f3e70cd Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Sat, 3 Mar 2018 17:26:05 +1100
> Subject: [PATCH 1/2] Handle forked process SIGCHLD signals
>
> * Makefile.am (TESTS): Add tests/forking-service.sh.
> * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
> * modules/shepherd.scm: Set the child subreaper attribute of main shepherd
>   process (as long as we're not pid 1).
> * modules/shepherd/service.scm (root-service)[daemonize]: Set the child
>   subreaper attribute of newly forked shepherd process.
> * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
>   and export it.
>   (prctl): Add new procedure and export it.

[...]

> --- a/modules/shepherd.scm
> +++ b/modules/shepherd.scm
> @@ -50,6 +50,8 @@
>  ;; Main program.
>  (define (main . args)
>    (initialize-cli)
> +  (when (not (= 1 (getpid)))
> +    (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))

I think it’s a good idea to add a comment, like:

  ;; Register ourselves to get SIGCHLD when child processes terminate.
  ;; This is necessary for daemons for which we’d otherwise never get
  ;; SIGCHLD.

> +(define prctl
> +  (let ((proc (syscall->procedure long "prctl" (list int int))))

On GNU/Hurd libc doesn’t have a “prctl” symbol.  So you need something
like:

  (if (dynamic-func "prctl" (dynamic-link))
      (let ((proc …)) …)
      (lambda (process operation)
        ;; We’re running on an OS that lacks ‘prctl’, such as GNU/Hurd.
        (throw 'system-error "prctl" "~A" (list (strerror ENOSYS))
               (list ENOSYS))))

> +function cleanup() {

You need either () or “function” but not both (shells other than Bash
might complain…).

> +shepherd_pid="$(cat $pid)"

Likewise, we should use `foo` instead of $(foo) to suppose non-Bash
shells (there are several occurrences of $(foo) here.)

> From ec47fa189c7d47f1d9444d939b084850f0a7186c Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Wed, 21 Feb 2018 22:57:59 +1100
> Subject: [PATCH 2/2] Poll every 0.5s to find dead forked services
>
> * modules/shepherd.scm (open-server-socket): Set socket to be
>   non-blocking.
>   (main): Use select with a timeout. If prctl failed when shepherd started
>   then call check-for-dead-services between connections/timeouts.
> * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
>   signal handler.
>   (respawn-service): Separate logic for respawning services from handling
>   SIGCHLD.
>   (handle-SIGCHLD, check-for-dead-services): New exported procedures.
> * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
>   symbols.

[...]

> +  (define poll-services
> +    (if (= 1 (getpid))
> +        (lambda () #f)
> +        (catch 'system-error
> +          (lambda ()
> +            (prctl PR_SET_CHILD_SUBREAPER 1)
> +            (lambda () #f))
> +          (lambda (key . args)
> +            check-for-dead-services))))

Please add a comment in the handler saying that we resort to polling on
OSes that do not support ‘prctl’.

However, perhaps we should do:

  (lambda args
    (let ((errno (system-error-errno args)))
      (if (= ENOSYS errno)
          check-for-dead-services
          (apply throw args))))

so that important/unexpected errors are not silently ignored.

> +(define (respawn-service serv)
> +  (when serv

Please add a docstring and move ‘when’ to the caller.

> +(define (check-for-dead-services)

Docstring as well :-), and also a comment explaining that this is a last
resort for prctl-less OSes.

>  (register-services
>   (make <service>
>     #:provides '(test-loaded)
> -   #:start (const 42)
> +   #:start (const 'abc)

Perhaps with the ‘check-for-dead-services’ use of ‘waitpid’ I outlined
above we can even keep 42 here?

If not, we should add in shepherd.texi, under “Slots of services”, a few
words saying that when ‘running’ is an integer it is assumed to be a
PID.

Could you send an updated patch?

Thanks for working on this!

Ludo’.

  reply	other threads:[~2018-03-03 15:22 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-27 21:56 [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Carlo Zancanaro
2018-02-28 22:06 ` Ludovic Courtès
2018-03-01 22:37   ` Carlo Zancanaro
2018-03-02  9:44     ` Ludovic Courtès
2018-03-02 10:13       ` Carlo Zancanaro
2018-03-02 12:42         ` Ludovic Courtès
2018-03-03  7:58           ` Carlo Zancanaro
2018-03-03 15:21             ` Ludovic Courtès [this message]
2018-03-03 20:49               ` Carlo Zancanaro
2018-03-04 22:11                 ` Ludovic Courtès
2018-03-04 22:35                   ` Carlo Zancanaro
2018-03-04 22:49                     ` Ludovic Courtès
2018-03-04 23:08                       ` Carlo Zancanaro
2018-03-05 14:15                         ` bug#30637: " Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87h8px9od8.fsf@gnu.org \
    --to=ludo@gnu.org \
    --cc=30637@debbugs.gnu.org \
    --cc=carlo@zancanaro.id.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.