From: Ludovic Courtès
To: Mathieu Othacehe
Cc: 41948@debbugs.gnu.org
Subject: bug#41948: Shepherd deadlocks
Date: Sat, 20 Jun 2020 12:31:40 +0200
Message-ID: <87a70yc9kj.fsf@gnu.org>
In-Reply-To: <87h7v75txx.fsf@gnu.org> (Mathieu Othacehe's message of "Fri, 19 Jun 2020 10:41:14 +0200")
References: <87h7v75txx.fsf@gnu.org>
List-Id: Bug reports for GNU Guix

Hi,

Mathieu Othacehe skribis:

> When running "gui-installed-desktop-os-encrypted" test, Shepherd seems
> to deadlock when restarting "guix-daemon". This can happen at different
> stages:
>
> * In "umount-cow-store" procedure, just before finishing the install.
>
> * During "set-http-proxy" tests inside the marionette.
>
> This is not always reproducible. In order to gather some information, I
> created a Shepherd "strace" service that logs what's happening in
> Shepherd itself (patch attached).

We should be able to reproduce it with much simpler tests then, right?
Like maybe “while : ; do herd restart guix-daemon ; done” or similar?

> It seems that, just after blocking signals, in "fork+exec-command", I
> guess, Shepherd is taking a lock:
>
> 183553:1 chdir("/") = 0
> 183554:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0
> 183555:1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
> 183560:1 madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0
> 183561:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0
> 183562:1 clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271
> 183563:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
> 183564:1 write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86
> 183566:1 <... write resumed>) = 86
> 183575:1 getpgid(169) = 169
> 183576:1 getpgid(0) = 0
> 183577:1 kill(-169, SIGTERM
> 183579:1 <... kill resumed>) = 0
> 183582:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
> 183583:1 write(15, "shepherd[1]: Service guix-daemon"..., 51
> 183585:1 <... write resumed>) = 51
> 183594:1 getuid() = 0
> 183595:1 rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0
> 183596:1 write(6, "\1", 1
> 183598:1 <... write resumed>) = 1
> 183605:1 futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL
>
> and is then blocking forever.

When that happens, we should check how many threads exist in PID 1.
There should be the finalization thread and the main thread, plus the
signal thread (because there are still ‘sigaction’ calls in the ‘main’
procedure), plus the GC marker threads.

In , Andy suggests that the signal thread is not properly handled;
indeed it takes locks and we don’t try to shut it down upon fork.
However, when using signalfd, the signal thread must be stuck in its
‘read’ call in ‘read_signal_pipe_data’, so I don’t see how it could cause
problems.

The GC threads are presumably taken care of by the atfork handler in
libgc.

Thoughts?

Ludo’.
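
For the thread check suggested above (how many threads exist in PID 1 when
the hang occurs), here is a minimal sketch, assuming a Linux system with
/proc mounted.  The function names ‘list_threads’ and ‘count_threads’ and
the ‘PROC_ROOT’ variable are made up for illustration; they are not part
of Shepherd or Guix.  It only reads /proc/PID/task, so it reports the
thread names as the kernel sees them and says nothing about which thread
holds which lock.

```shell
# Sketch: enumerate the threads of a process through Linux's /proc.
# PROC_ROOT is a hypothetical override (not part of any tool), present
# only so the functions can be pointed at a test directory.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Print "TID<TAB>NAME" for every thread of PID (default: 1).
list_threads () {
    pid="${1:-1}"
    for task in "$PROC_ROOT/$pid"/task/*; do
        # Skip the unexpanded glob, or threads that exited meanwhile.
        [ -r "$task/comm" ] || continue
        printf '%s\t%s\n' "${task##*/}" "$(cat "$task/comm")"
    done
}

# Print the number of threads of PID (default: 1).
count_threads () {
    list_threads "${1:-1}" | wc -l | tr -d ' '
}
```

Running ‘list_threads 1’ once while the system is healthy and again once
‘herd’ stops responding would show whether the set of threads differs
from the expected one (main, finalization, signal, and GC marker threads).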