From: ludo@gnu.org (Ludovic Courtès)
To: Carlo Zancanaro
Cc: 30637@debbugs.gnu.org
Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services
Date: Fri, 02 Mar 2018 10:44:12 +0100
Message-ID: <87r2p2izgz.fsf@gnu.org>
In-Reply-To: <87d10nwhfl.fsf@zancanaro.id.au>
References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au>

Hi Carlo,

Carlo Zancanaro wrote:

> On Wed, Feb 28 2018, Ludovic Courtès wrote:
>>> The problem is that shepherd, when run as a user process, can "lose"
>>> services which fork away. Shepherd can still kill them, but a SIGCHLD
>>> won't be delivered if they die, so shepherd can't restart/disable
>>> them. My prime example is emacs, which I run with --daemon. If I then
>>> kill emacs, shepherd will still think that it is running.
>>
>> There are two issues here, I think.
>>
>>  1. shepherd cannot lose SIGCHLD: if a process dies immediately once
>>     it’s been spawned, as is the case with “emacs --daemon” or any
>>     other daemon-style program, it should receive SIGCHLD and process
>>     it.
>
> Yeah, that's true, but the problem is that shepherd only processes the
> SIGCHLD if there is a service with its `running` slot set to the
> pid.

Well, it does call ‘waitpid’ every time it gets a SIGCHLD, but it’s true
that it doesn’t do anything beyond that if it doesn’t know what service
a PID corresponds to.

> When emacs forks, the original process may have its SIGCHLD handled,
> but that doesn't affect shepherd's service state (as it shouldn't,
> because it's using #:pid-file to track the forked process).
>
>>  2. shepherd currently can’t do much with real daemons.  So what we do
>>     in GuixSD is to either start programs in non-daemon mode, when
>>     that’s an option, or pass #:pid-file to retrieve the forked process
>>     PID.  I think you should do one of these as well.
>
> I am doing that. The problem is that when a service dies (crashes,
> quits, etc.) the `respawn?` option cannot be honoured because shepherd
> is not notified that the process has terminated (because it never
> receives a SIGCHLD for the forked pid). My patch polls for the
> processes we expect, to make up for the lack of notification.

I see.

Actually, thinking more about it, we should be using
PR_SET_CHILD_SUBREAPER from prctl(2), which is designed exactly for
that.  So what about this plan:

  1. Add FFI bindings in (shepherd system) for prctl(2).  We should
     arrange for it to throw to 'system-error when the ‘prctl’ symbol is
     missing, as is the case on GNU/Hurd.

  2. Use prctl/PR_SET_CHILD_SUBREAPER in ‘exec-command’.  Here we must
     ‘catch-system-error’ around that call to cater to GNU/Hurd.

That would address the main issue without having to resort to polling.
Respawning will work only when #:pid-file is used though, but that’s
already an improvement.

Thoughts?

Thanks,
Ludo’.
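
The subreaper plan above can be illustrated with a rough sketch. This is not Shepherd code (that would be Guile, in (shepherd system)); it is a small Python example that uses ctypes as the FFI. The names set_child_subreaper and spawn_daemon_and_reap are made up for this illustration, and the value 36 for PR_SET_CHILD_SUBREAPER is the one defined in <linux/prctl.h>. It assumes Linux 3.4 or later. When the prctl symbol is missing, the sketch raises OSError with ENOSYS, loosely mirroring the idea of throwing to 'system-error.

```python
import ctypes
import errno
import os
import time

# Value from <linux/prctl.h>; available since Linux 3.4.
PR_SET_CHILD_SUBREAPER = 36


def set_child_subreaper(enable=True):
    """Mark the calling process as a child subreaper (hypothetical helper).

    Raises OSError if the 'prctl' symbol is missing or the call fails.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    try:
        prctl = libc.prctl
    except AttributeError:
        # No 'prctl' symbol in the C library: report it as ENOSYS.
        raise OSError(errno.ENOSYS, os.strerror(errno.ENOSYS))
    # prctl(2) is variadic and takes unsigned long arguments.
    args = [ctypes.c_ulong(int(enable))] + [ctypes.c_ulong(0)] * 3
    if prctl(PR_SET_CHILD_SUBREAPER, *args) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def spawn_daemon_and_reap():
    """Start a process that forks away like a daemon, then reap everything.

    Returns (intermediate_pid, daemon_pid, reaped_pids).
    """
    r, w = os.pipe()
    intermediate = os.fork()
    if intermediate == 0:
        # Intermediate process: fork the "daemon" and exit at once.
        os.close(r)
        daemon = os.fork()
        if daemon == 0:
            # The "daemon": outlive its parent briefly, then exit.
            os.close(w)
            time.sleep(0.2)
            os._exit(0)
        # Report the daemon PID, as a PID file would.
        os.write(w, str(daemon).encode())
        os._exit(0)

    os.close(w)
    daemon_pid = int(os.read(r, 32).decode())
    os.close(r)

    # Wait for every child; with the subreaper flag set, the orphaned
    # daemon is reparented to us and shows up here too.
    reaped = set()
    while True:
        try:
            pid, _status = os.waitpid(-1, 0)
        except ChildProcessError:
            break
        reaped.add(pid)
    return intermediate, daemon_pid, reaped


if __name__ == "__main__":
    set_child_subreaper()
    intermediate, daemon_pid, reaped = spawn_daemon_and_reap()
    print("daemon reaped by us:", daemon_pid in reaped)
```

Without the prctl call, the forked-away process would be reparented to PID 1 (or to the nearest subreaper above us), and waitpid would never report its death, which is the situation described in this thread.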