From: Ludovic Courtès
To: Mathieu Othacehe
Cc: 41948@debbugs.gnu.org
Subject: bug#41948: Shepherd deadlocks
Date: Fri, 07 May 2021 23:49:42 +0200
Message-ID: <87v97u9pu1.fsf@gnu.org>
In-Reply-To: <87k0xyhq22.fsf@gnu.org> (Mathieu Othacehe's message of "Sun, 16 Aug 2020 11:56:37 +0200")
References: <87h7v75txx.fsf@gnu.org> <87a70yc9kj.fsf@gnu.org> <87k0xyhq22.fsf@gnu.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
List-Id: Bug reports for GNU Guix
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Hi!

Mathieu Othacehe wrote:

> Those two finalizer threads share the same pipe. When we try to
> stop the finalizer thread in Shepherd, right before forking a new
> process, we send a '\1' byte to the finalizer pipe.
>
> 1 write(6, "\1", 1 <unfinished ...>
>
> which is received by (line 183597):
>
> 253 <... read resumed>"\1", 1) = 1
>
> the marionette finalizer thread. Then, we pthread_join the Shepherd
> finalizer thread, which never stops! Quite unfortunate.

While working on a fix for this issue (finalizer pipe shared between
parent and child process), I found the ‘sleep_pipe’ of the main thread
is also shared between the parent and its child.

Attached is a reproducer.  It prints something like this before hanging:

--8<---------------cut here---------------start------------->8---
$ guile ~/src/guile-debugging/signal-pipe.scm
parent: 25947
child: 25953
alarm in parent!
alarm in parent!
alarm in parent!
[...]
alarm in parent!
alarm in parent!
child woken up!
--8<---------------cut here---------------end--------------->8---

“child woken up” means that it’s the child process that won the race
reading on the sleep pipe (from ‘scm_std_select’).

The parent process then hangs because, in ‘scm_std_select’, it did:

  1. select(2), which returned due to available data on ‘wakeup_fd’;

  2. ‘full_read (wakeup_fd, &dummy, 1)’ gets stuck forever in read(2)
     because the child process read that byte in the meantime so
     there’s nothing left to read.

Here’s the sequence:

--8<---------------cut here---------------start------------->8---
25947 select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
25947 getpid()                          = 25947
25947 kill(25947, SIGALRM)              = 0
25947 --- SIGALRM {si_signo=SIGALRM, si_code=SI_USER, si_pid=25947, si_uid=1000} ---
25947 write(8, "\16", 1)                = 1
25947 rt_sigreturn({mask=[]})           = 0
25952 <... read resumed>"\16", 1)       = 1
25947 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...>
25952 write(4, "\0", 1 <unfinished ...>
25947 <... rt_sigprocmask resumed>[], 8) = 0
25953 <... select resumed>)             = 1 (in [3], left {tv_sec=0, tv_usec=346370})
25952 <... write resumed>)              = 1
25947 select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000} <unfinished ...>
25953 read(3, <unfinished ...>
25952 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...>
25947 <... select resumed>)             = 1 (in [3], left {tv_sec=0, tv_usec=99999})
25953 <... read resumed>"\0", 1)        = 1
25947 read(3, <unfinished ...>
25952 <... rt_sigprocmask resumed>~[KILL STOP PWR RTMIN RT_1], 8) = 0
25953 write(1, "child woken up!\n", 16 <unfinished ...>
25952 read(7, <unfinished ...>
--8<---------------cut here---------------end--------------->8---

Notice how “\16” (= SIGALRM) is written by the parent’s main thread and
read by the child’s main thread.

Thoughts?

Ludo’.

--=-=-=
Content-Type: text/plain
Content-Disposition: inline; filename=signal-pipe.scm
Content-Description: the reproducer

;; https://issues.guix.gnu.org/41948
(use-modules (ice-9 match))

(setvbuf (current-output-port) 'line)
(sigaction SIGCHLD pk)                            ;start signal thread
(match (primitive-fork)
  (0
   (format #t "child: ~a~%" (getpid))
   (let loop ()
     (unless (zero? (usleep 500000))
       ;; If this happens, it means the select(2) call in 'scm_std_select'
       ;; returned because one of our file descriptors had input data
       ;; available (which shouldn't happen).
       (format #t "child woken up!~%"))
     (loop)))
  (pid
   (format #t "parent: ~a~%" (getpid))
   (sigaction SIGALRM (lambda _ (format #t "alarm in parent!~%")))
   (let loop ()
     (kill (getpid) SIGALRM)
     (usleep 100000)
     (loop))))

--=-=-=--
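
The two steps listed above for ‘scm_std_select’ (select reports the
wake-up descriptor readable, then the read finds nothing because another
process sharing the pipe took the byte) can be shown outside Guile.  What
follows is only a rough Python sketch of that system-call pattern, not
Guile's code: the pipe is a stand-in for the sleep pipe, and
‘steal_then_poll’ is a hypothetical helper name.  One deliberate
difference, so that the demo cannot hang: the read end is made
non-blocking, so the lost race surfaces as BlockingIOError where a
blocking read(2) would wait forever.

```python
import os
import select


def steal_then_poll():
    """Show a byte on a pipe shared across fork() being consumed by the
    child after the parent's select() already reported it readable."""
    rfd, wfd = os.pipe()          # stand-in for the shared sleep pipe
    os.set_blocking(rfd, False)   # assumption: non-blocking, so the demo cannot hang

    os.write(wfd, b"\0")          # the wake-up byte

    # Step 1: the parent's select() reports the read end as readable.
    readable, _, _ = select.select([rfd], [], [], 1.0)
    parent_saw_data = rfd in readable

    # Between select() and read(), a child sharing the pipe takes the byte.
    pid = os.fork()
    if pid == 0:
        try:
            stolen = os.read(rfd, 1)
            os._exit(0 if stolen == b"\0" else 1)
        except BaseException:
            os._exit(2)
    _, status = os.waitpid(pid, 0)
    child_got_byte = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0

    # Step 2: the parent's read() finds nothing left.  With a blocking
    # descriptor this call would never return.
    try:
        os.read(rfd, 1)
        parent_would_block = False
    except BlockingIOError:
        parent_would_block = True

    os.close(rfd)
    os.close(wfd)
    return parent_saw_data, child_got_byte, parent_would_block


if __name__ == "__main__":
    print(steal_then_poll())
```

On a POSIX system all three returned values should be true: the parent
saw data, the child got the byte, and the parent's read would have
blocked.  The fork is placed between the two steps only to make the
ordering deterministic; in the reproducer the same interleaving happens
by chance.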