From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp12.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms5.migadu.com with LMTPS id mNTxGbdsKWPUXwAAbAwnHQ (envelope-from ) for ; Tue, 20 Sep 2022 09:33:11 +0200 Received: from aspmx1.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp12.migadu.com with LMTPS id ODH6GbdsKWPyuQAAauVa8A (envelope-from ) for ; Tue, 20 Sep 2022 09:33:11 +0200 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id D24D22FBD1 for ; Tue, 20 Sep 2022 09:33:10 +0200 (CEST) Received: from localhost ([::1]:43880 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1oaXkj-00059P-Tm for larch@yhetil.org; Tue, 20 Sep 2022 03:33:09 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:54372) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1oaXkd-00057z-2c for bug-guix@gnu.org; Tue, 20 Sep 2022 03:33:03 -0400 Received: from debbugs.gnu.org ([209.51.188.43]:56889) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1oaXkc-0008OJ-Q3 for bug-guix@gnu.org; Tue, 20 Sep 2022 03:33:02 -0400 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1oaXkc-0000Fw-Eu for bug-guix@gnu.org; Tue, 20 Sep 2022 03:33:02 -0400 X-Loop: help-debbugs@gnu.org Subject: bug#57922: Shepherd doesn't seem to correctly handle waitpid itself Resent-From: Josselin Poiret Original-Sender: "Debbugs-submit" Resent-CC: bug-guix@gnu.org Resent-Date: Tue, 20 Sep 2022 07:33:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 57922 X-GNU-PR-Package: guix X-GNU-PR-Keywords: To: Maxim Cournoyer , 57922@debbugs.gnu.org Received: via spool by 57922-submit@debbugs.gnu.org id=B57922.1663659122920 (code B ref 57922); Tue, 20 Sep 2022 07:33:02 +0000 Received: (at 57922) by debbugs.gnu.org; 20 Sep 2022 07:32:02 +0000 Received: from localhost ([127.0.0.1]:55967 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1oaXje-0000EY-0y for submit@debbugs.gnu.org; Tue, 20 Sep 2022 03:32:02 -0400 Received: from jpoiret.xyz ([206.189.101.64]:54904) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1oaXjb-0000EL-Mb for 57922@debbugs.gnu.org; Tue, 20 Sep 2022 03:32:00 -0400 Received: from authenticated-user (jpoiret.xyz [206.189.101.64]) by jpoiret.xyz (Postfix) with ESMTPA id 225D6184F2B; Tue, 20 Sep 2022 07:31:58 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=jpoiret.xyz; s=dkim; t=1663659118; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=mT1YjZpcqepo2dF2FR0yNfjgn5RDzMuu0yfYCsoJb48=; b=aRMkyh37qQdSRt8ZUj5JuMGFuaF/+XFfUW98xBKHCcmFasFbwTJxrQaBS9KF5Yver7X4um 9WSK+DxWgXTrKrCqGlUaXaNswGgv+DFKNgRkRHdYQHwXjIgnmLdg/bEFzx09yQzRL6wwMM sb1kYwvNPNMFn3gM7J/3qx+eFoGuqYo8etgzWSJEMUmzkrfBAZOTH7OtSQyPJhJ06d4Wdx QLBNiIoUCaRl4+9XX1MdMTJSCyY5bK6NXlqg3skXMfOXRK153KrMlmIkm7GMWsHPdGfP3H XMuAEpo6+h0gjw9yeQ74espP1QvzwLBkVVItioX4uD/649IRFF7wQej7CYKxeg== In-Reply-To: <874jx4q953.fsf@gmail.com> References: <874jx4q953.fsf@gmail.com> Date: Tue, 20 Sep 2022 09:31:57 +0200 Message-ID: <87o7va33iq.fsf@jpoiret.xyz> MIME-Version: 1.0 Content-Type: text/plain X-Spamd-Bar: / X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-guix@gnu.org List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-guix-bounces+larch=yhetil.org@gnu.org Sender: "bug-Guix" Reply-to: Josselin Poiret From: Josselin Poiret via Bug reports for GNU Guix X-Migadu-Flow: FLOW_IN X-Migadu-To: larch@yhetil.org X-Migadu-Country: US ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=yhetil.org; s=key1; t=1663659190; h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:mime-version:mime-version: content-type:content-type:resent-cc:resent-from:resent-sender: resent-message-id:in-reply-to:in-reply-to:references:references: list-id:list-help:list-unsubscribe:list-subscribe:list-post: dkim-signature; bh=mT1YjZpcqepo2dF2FR0yNfjgn5RDzMuu0yfYCsoJb48=; b=jsyK6ycLpxaJb9leZLwn07zTELh0BShIRD2zE6pWT/rW6Agu8n4327vxcuZkwueQQYF0ft C3cSRQBnDeLO6chVDT2wGaMVHOzZHIdf5CYqYRLGlPVv/JWeU/oh4Oj6on6H3O2muU+Yru elAsp8YvWXpvEPTaCO/y9e92wMFCWNKfuHffLv8xi7WiQaIVbxSABZAPe8DgN8prtC8jbB fSu7zlAWkBGTdbJ2vxu7Z5MmS4qbowbD/7tgZFDajXnaygSd2yKSv1mCwc/zN6CtVoQjyv zqg0b1jhE6KaWrJovjiVuIxTC2qIsp6yVkbTALna0LhaYpihqAxntzYktVG+yQ== ARC-Seal: i=1; s=key1; d=yhetil.org; t=1663659190; a=rsa-sha256; cv=none; b=BqRo1GysOJQyTddLwYxVATPT1hk5AreUASA5Fw5LTmObEqNT2wrCqlLo+DT4q7UcUKLZac x7Ij/LrWEyhd1LilX/P+Nhsw2Pu536Bfcy9HFs4tZpiM8UN8tPwJIxfD2YP9pXVz9ChOz9 /UxS1jEo6N9dXsFsx6Y2nJGUn2RprT+nG40TplSFs4uvtdmNbpQ8DN4nCWjly2J+MGWY0I QljpdoR+NRFWW5e2RxiHWxrSKt4A14l01J5QrJPdN1dkPzkDj4QtxzjefUJBrNNcFbbqWp hVwb2uAWh78dP9z61/w//Ywtq20f2drAO7bBOmDMGJ2slfw94oo8DaTbWCY/cw== ARC-Authentication-Results: i=1; aspmx1.migadu.com; dkim=fail ("headers rsa verify failed") header.d=jpoiret.xyz header.s=dkim header.b=aRMkyh37; dmarc=pass (policy=none) header.from=gnu.org; spf=pass (aspmx1.migadu.com: domain of "bug-guix-bounces+larch=yhetil.org@gnu.org" designates 209.51.188.17 as permitted sender) smtp.mailfrom="bug-guix-bounces+larch=yhetil.org@gnu.org" X-Migadu-Spam-Score: -1.83 Authentication-Results: aspmx1.migadu.com; dkim=fail ("headers rsa verify failed") header.d=jpoiret.xyz header.s=dkim header.b=aRMkyh37; dmarc=pass (policy=none) header.from=gnu.org; spf=pass (aspmx1.migadu.com: domain of "bug-guix-bounces+larch=yhetil.org@gnu.org" designates 209.51.188.17 as permitted sender) smtp.mailfrom="bug-guix-bounces+larch=yhetil.org@gnu.org" X-Migadu-Queue-Id: D24D22FBD1 X-Spam-Score: -1.83 X-Migadu-Scanner: scn1.migadu.com X-TUID: +R0aXVlpuEFC Hi Maxim, Maxim Cournoyer writes: > Hi, > > I've tried to determine why a workaround in the jami-service-type is > required in the 'stop' slot to avoid failures in 'herd restart jami', > and haven't quite found the culprit, but it appears to me that: > > 1. waipid is only called in one place in Shepherd, which is in the > handle-SIGCHLD procedure in (shepherd service), which does not > specifically wait for an exact PID but rather does: > > (waitpid* WAIT_ANY WNOHANG), which is waitpid with some special handling > in the case a system-error exception is thrown with an ECHILD or EINTR > error number. > > This doesn't strike me as a strong guarantee that waitpid occurs when > stop is called, because: > > 1. It requires to be installed in the signal handlers for each > processes, with something like: > > --8<---------------cut here---------------start------------->8--- > (unless %sigchld-handler-installed? > (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP) > (set! %sigchld-handler-installed? #t)) > --8<---------------cut here---------------end--------------->8--- > > Done for fork+exec-command and make-inetd-forkexec-constructor, but not > for make-forkexec-constructor/container, AFAICT; The signal handler is only installed once in PID 1 (in fact, you haven't forked yet here), since it's the one that receives the SIGCHLD. What I don't understand that well is that this signal handler could be installed only once when shepherd starts, right? That way, it wouldn't need to depend on specific start actions being chosen. > 2. it has the WNOHANG flag, which means the stop simply does a kill the > the signal handling weakly (because of WNOHANG) waits on it, which means > the start may begin before the process was actually completely > terminated. > > Here's a small reproducer to apply on our code base: > > --8<---------------cut here---------------start------------->8--- > modified gnu/services/telephony.scm > @@ -685,13 +685,7 @@ (define (archive-name->username archive) > > ;; Finally, return the PID of the daemon process. > daemon-pid)) > - (stop > - #~(lambda (pid . args) > - (kill pid SIGKILL) > - ;; Wait for the process to exit; this prevents overlapping > - ;; processes when issuing 'herd restart'. > - (waitpid pid) > - #f)))))))) > + (stop #~(make-kill-destructor)))))))) > > (define jami-service-type > (service-type > --8<---------------cut here---------------end--------------->8--- The real problem here is not really the WNOHANG flag (you could remove that and still get issues) but rather that the waitpid is run inside a signal handler, which in Guile means that it's run through asyncs. You have no guarantees wrt. when asyncs run, so they could run after or in the middle of the next action. I also think make-kill-destructor should waitpid the processes it's killing, as you're implying, and leave the signal handler only for unexpected service crashes. Best, -- Josselin Poiret