From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp1.migadu.com ([2001:41d0:403:4876::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms13.migadu.com with LMTPS id kIS2B9mzamdwPgEA62LTzQ:P1 (envelope-from ) for ; Tue, 24 Dec 2024 13:15:05 +0000 Received: from aspmx1.migadu.com ([2001:41d0:403:4876::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp1.migadu.com with LMTPS id kIS2B9mzamdwPgEA62LTzQ (envelope-from ) for ; Tue, 24 Dec 2024 14:15:05 +0100 X-Envelope-To: larch@yhetil.org Authentication-Results: aspmx1.migadu.com; dkim=fail ("headers rsa verify failed") header.d=debbugs.gnu.org header.s=debbugs-gnu-org header.b=Kj34theX; dkim=fail ("headers eddsa verify failed") header.d=russelstein.xyz header.s=ed25519 header.b=JyEMuF0L; dkim=fail ("headers rsa verify failed") header.d=russelstein.xyz header.s=rsa header.b=OPJLZgP7; spf=pass (aspmx1.migadu.com: domain of "bug-guix-bounces+larch=yhetil.org@gnu.org" designates 209.51.188.17 as permitted sender) smtp.mailfrom="bug-guix-bounces+larch=yhetil.org@gnu.org"; dmarc=pass (policy=none) header.from=gnu.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=yhetil.org; s=key1; t=1735046105; h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:resent-cc:resent-from:resent-sender: resent-message-id:in-reply-to:in-reply-to:references:references: list-id:list-help:list-unsubscribe:list-subscribe:list-post: dkim-signature; bh=ghIxgPpzvvRz9XVJd2/eSY3NQo2J+enLsBb2FP63JQQ=; b=qNydzR+NNt7A6WDKcJG+3LrHBVRs69bD1fo0nwcUDnpdVjNcr31/ASIC+7WsJy3jJZIxou BCi76a6P9rTkN38yvJlsuHWMmuWBTf8psqBQVAmNMcpZx6Lu593cTnHgcTWla+gW6CPb/o yj3b6sqnvtQer/JqTSrMc43JTjNC+BLlCuHklTxTNyoAcbwgRyu5820op1z4SUt7FiTDBe 2BzNobUkjAKhI/W5V+0FsZ1ICWSJYSgN2mazSBliNpVOPkvd/BrWXjnAzdIrsxhf1m8hOi aP7XCySF6Ccx/UkFhvHB57Y2RvNhrYrvNZpyMD04Ktg38pNW/Zex5ZQn9FV17w== ARC-Authentication-Results: i=1; aspmx1.migadu.com; dkim=fail ("headers rsa verify failed") header.d=debbugs.gnu.org header.s=debbugs-gnu-org header.b=Kj34theX; dkim=fail ("headers eddsa verify failed") header.d=russelstein.xyz header.s=ed25519 header.b=JyEMuF0L; dkim=fail ("headers rsa verify failed") header.d=russelstein.xyz header.s=rsa header.b=OPJLZgP7; spf=pass (aspmx1.migadu.com: domain of "bug-guix-bounces+larch=yhetil.org@gnu.org" designates 209.51.188.17 as permitted sender) smtp.mailfrom="bug-guix-bounces+larch=yhetil.org@gnu.org"; dmarc=pass (policy=none) header.from=gnu.org ARC-Seal: i=1; s=key1; d=yhetil.org; t=1735046105; a=rsa-sha256; cv=none; b=Jjy3EkPwwvlN5M2DB8JnI+X2EkleESzmRg1wq+0YOYjljnHLiOK5aJhwpvBMJ3/EknFDdA QmvZp6RY1tzwVASvb4zmL9woV+pSpu1pcUF7pLeY3CVUkVWrgmQbvS45iaynAgdTdjzFNt rPNLJdLRjCoXARQGqr7lngneWzTi7sFjE8oCns/Sn48JxV2OQdt/U20RbD/L1g98RkZrZz nTh92cQjqPMzZKzhjacZKSgp3C4yWysJ775CtX2d864LhC4FlrJjGnxYnLpOce2HFFQnx+ DVAwLEjBViaEQGPphu0huP/CCuJMAzWhdthoSewWr7H2TYf3iVGPFuVFW8LR5Q== Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id 0702586261 for ; Tue, 24 Dec 2024 14:15:04 +0100 (CET) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1tQ4kK-0003kM-WB; Tue, 24 Dec 2024 08:14:49 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tQ4ie-0003cT-By for bug-guix@gnu.org; Tue, 24 Dec 2024 08:13:04 -0500 Received: from debbugs.gnu.org ([2001:470:142:5::43]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1tQ4ic-0004L1-H0 for bug-guix@gnu.org; Tue, 24 Dec 2024 08:13:03 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=debbugs.gnu.org; s=debbugs-gnu-org; h=MIME-Version:Date:References:In-Reply-To:From:To:Subject; bh=ghIxgPpzvvRz9XVJd2/eSY3NQo2J+enLsBb2FP63JQQ=; b=Kj34theXcJBey3jyVB3lJZ78PiafOk6qRC7UjLUvx2/pvCPuShbBNxu9cZGAVgQ9NocLbV2opKIiyImeakqjIqrqth4czIoKdzSTDcx2e0k8DXRERZ6pUcfYd/nD2ExKU5HFnX2+yk4fNs2JJH66j3eTrLaR06sekIujpcpBAGq9KMojRJNJiAp+BwRo2N3AQ6MBSuqWM+rXEjyHAaiPFaaptqqIpV564vMkSBm5z0rQ3+QJUMKCn9fafA7wgXFV93/X9C7ZJMW4a6lUJUkZZpWlcmQnkiPxF8bU+HHx02TdONGkFmWsbAHARj2URb0g7IqFeGvr8mEmHJ3mW0hxSw==; Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1tQ4ic-00087k-3c for bug-guix@gnu.org; Tue, 24 Dec 2024 08:13:02 -0500 X-Loop: help-debbugs@gnu.org Subject: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock Resent-From: Reepca Russelstein Original-Sender: "Debbugs-submit" Resent-CC: bug-guix@gnu.org Resent-Date: Tue, 24 Dec 2024 13:13:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 31785 X-GNU-PR-Package: guix X-GNU-PR-Keywords: To: Ludovic =?UTF-8?Q?Court=C3=A8s?= Cc: Christopher Baines , 31785@debbugs.gnu.org Received: via spool by 31785-submit@debbugs.gnu.org id=B31785.173504593831130 (code B ref 31785); Tue, 24 Dec 2024 13:13:02 +0000 Received: (at 31785) by debbugs.gnu.org; 24 Dec 2024 13:12:18 +0000 Received: from localhost ([127.0.0.1]:59920 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tQ4hs-00085z-J9 for submit@debbugs.gnu.org; Tue, 24 Dec 2024 08:12:17 -0500 Received: from mailout.russelstein.xyz ([209.141.47.21]:45088) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tQ4hn-00085d-T6 for 31785@debbugs.gnu.org; Tue, 24 Dec 2024 08:12:14 -0500 DKIM-Signature: v=1; a=ed25519-sha256; q=dns/txt; c=relaxed/relaxed; d=russelstein.xyz; s=ed25519; h=Content-Type:MIME-Version:Message-ID:Date: References:In-Reply-To:Subject:Cc:To:From:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=ghIxgPpzvvRz9XVJd2/eSY3NQo2J+enLsBb2FP63JQQ=; b=JyEMuF0LAD/dNc87W/dkDBv8oa tzcl1TD4c8btRmIWOeFyCrbOJGhL4Wcr0/AVnhN02sUM/u60F83tbMpaZqDA==; DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=russelstein.xyz; s=rsa; h=Content-Type:MIME-Version:Message-ID:Date: References:In-Reply-To:Subject:Cc:To:From:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=ghIxgPpzvvRz9XVJd2/eSY3NQo2J+enLsBb2FP63JQQ=; b=OPJLZgP7ZTbE6tCwPEfWsSYFIq PPYIDUSDQtb9P0V0KQWEqtSCmmpSZMnwXf3sSQQYOTnlZcABTJH5y1bmSE1UzWwxkTgX3ZO+WY6xf sn+KjodqMlWf7mMXlMjlqjfN9SF7lV6h9u0ji0inFXnXZzbKCk+o+9QDrBzqHHda6LcquR8pQMHW1 gMCImYudn8qTF4vr2KMs47zU8yR89q1mjfaOQC5YaLqAP7q+5AorMs5Unbpz3G1tcFglgwCLK59YM zFkUKv0z8EZUSsvF18A/x/T9LMcl22pbRc6Gr0OAhWwt83bz6Lm0CUsP7hhNjpSt2kOBSPgPZtzv/ WgwhSpKqv2n8EBUXMi+IkOA2V0T1CMK5R+ZbcyHtVimpAboD3kaP1bd/ZiJpjbYc+eWEHMpIlkXG+ LOYPQ6TGCNKtnx9nO8GmTN90rrhU3BQzTYegiDTz/D/pw8PfOVftZ5zKs1XWjaBkhCbGF61irz+Aj H2D2lrgAyZTaNnjDtLbgFiIwzwncC/vCZ+gb0geOOVPWLE3F+7p73CGUheV35TwFgsCgTOYVY6Nak 3LIxpTrZ4KOjO7D6gkXakviqNRkTpzD69/e8iOLXqjqNXMRri6cTUx3Ed9I3aPrh6OtAFGIMViWUg MT7awrFt9SLl6usu0nX/r7n+AtKgSCJ6RY5i4tID0=; Received: by russelstein.xyz with esmtpsa (TLS1.3:ECDHE_SECP256R1__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim 4.98) (envelope-from ) id 1tQ4fc-0000000070K-1fVo; Tue, 24 Dec 2024 07:09:58 -0600 In-Reply-To: <874j2xge0y.fsf@gnu.org> ("Ludovic =?UTF-8?Q?Court=C3=A8s?="'s message of "Sat, 21 Dec 2024 18:08:13 +0100") References: <87602ph0yv.fsf@gnu.org> <878qs9gg5k.fsf@gnu.org> <874j2xge0y.fsf@gnu.org> Date: Tue, 24 Dec 2024 07:08:52 -0600 Message-ID: <87bjx16xej.fsf@russelstein.xyz> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-Mailman-Approved-At: Tue, 24 Dec 2024 08:14:47 -0500 X-BeenThere: bug-guix@gnu.org List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-to: Reepca Russelstein From: Reepca Russelstein via Bug reports for GNU Guix Errors-To: bug-guix-bounces+larch=yhetil.org@gnu.org Sender: bug-guix-bounces+larch=yhetil.org@gnu.org X-Migadu-Country: US X-Migadu-Flow: FLOW_IN X-Migadu-Spam-Score: -8.81 X-Spam-Score: -8.81 X-Migadu-Queue-Id: 0702586261 X-Migadu-Scanner: mx10.migadu.com X-TUID: rECMYWSRY4du --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable (note: the following includes some "thinking out loud", skip to the patch if you're in a hurry) I'm a bit puzzled how exactly a true "forward-progress-not-possible" deadlock can even occur, especially in the manner you describe. The derivation graph is inherently a directed acyclic graph, so as long as locks are only acquired for a given derivation after all of its inputs (or references, for substitutions) are present, and are released once the desired outputs are present, it should only even theoretically be possible to have a deadlock between different outputs of the same derivation. These requirements /seem/ to be satisfied by build.cc. Path locks are only acquired in DerivationGoal::tryToBuild and SubstitutionGoal::tryToRun, and the former is only reachable through DerivationGoal::inputsRealised (and DerivationGoal::buildDone in case of multiple rounds), and the latter is only reachable through SubstitutionGoal::referencesValid. As their names imply, having made it through them should suffice to prove that the inputs and references are present. That leaves the requirement that the locks are released once the desired outputs are present. It /is/ possible for a lock to stick around longer than desired (along with many other bad things) in DerivationGoal::tryToBuild, but only if a cached build failure is encountered, to my knowledge (notice how there is no call to done, amDone, or any rescheduling procedure in that case; the running DerivationGoal just drops off the face of the earth and tells nobody, including any goal waiting on it). This also happens in DerivationGoal::haveDerivation. I assume you're not using cached build failures, right? It defaults to off, so it should only be in use if you explicitly passed =2D-cache-failures to guix-daemon (or had your client pass a value of "true" for "build-cache-failure", but I don't see that string anywhere in the guile side of guix). One detail that worries me is that SubstitutionGoal.outputLock is a std::shared_ptr instead of a PathLocks. I cannot for the life of me figure out why this is the case, since no sharing of ownership of the underlying PathLocks seems to ever be done. outputLock.reset() seems to be used as if it were a synonym for outputLock.unlock() in many places, so whoever wrote those uses probably thought the same. As far as I can tell, it *should* always cause PathLocks::unlock to be called, so it shouldn't be an issue, it just worries me that I don't understand why it's done that way. Ahh, I think I may have just found another place in DerivationGoal::tryToBuild that might be the source of your issues. First, note the comment: /* Obtain locks on all output paths. The locks are automatically released when we exit this function or the client crashes. ... */ Well, as I noticed earlier in the cached build failures case, this isn't exactly accurate, and there's probably a reason why. The comment seems to believe specifically that leaving the scope of tryToBuild will automatically release the locks, which suggests to me that at one point outputLocks was a local variable, rather than a field of DerivationGoal. This theorized older version would be consistent with what we see in this (current) snippet: if (buildMode !=3D bmCheck && validPaths.size() =3D=3D drv.outputs.size= ()) { debug(format("skipping build of derivation `%1%', someone beat us t= o it") % drvPath); outputLocks.setDeletion(true); done(BuildResult::AlreadyValid); return; } If leaving the scope of tryToBuild would cause outputLocks.unlock to run, then outputLocks.setDeletion is all that would need to be called. But since that isn't the case, this will cause the lock to be held for as long as the DerivationGoal exists. (note that I haven't actually checked the commit history to see whether outputLocks was or wasn't at some point a local variable - I have no idea why the commenter thought returning would automatically release it) Perhaps this same error was made elsewhere. C-s to see where else setDeletion is used without an accompanying unlock and... sure enough, in SubstitutionGoal::tryToRun, we see: /* Check again whether the path is invalid. */ if (!repair && worker.store.isValidPath(storePath)) { debug(format("store path `%1%' has become valid") % storePath); outputLock->setDeletion(true); amDone(ecSuccess); return; } Now outputLock will be held for as long as this SubstitutionGoal exists. The Nix issue that you identified as closely resembling what you were encountering was resolved with this commit message (https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c83= b8): =2D------------------------------------------------------------------ Fix deadlock in SubstitutionGoal We were relying on SubstitutionGoal's destructor releasing the lock, but if a goal is a top-level goal, the destructor won't run in a timely manner since its reference count won't drop to zero. So release it explicitly. =2D------------------------------------------------------------------ Hmmm. Destructors not running in a timely manner, and in this particular case - which is guaranteed to occur when one guix-daemon process gets blocked by another holding a lock and the one initially holding the lock produces the output path - the only way the locks get released is when the destructors are run. I do believe this may be related! One other possible reason for destructors not running in as timely a manner as one might hope (though probably not as severely as for top-level goals) would be that Worker::run uses a ready queue, awake2, of type Goals, which contains strong references. This ensures that any Goal on this scheduler turn will not be destroyed, even if it has already run, until the end of this scheduler turn is reached. This probably isn't the cause for what you're seeing, but this kind of detail matters a lot when so much is implicitly tied to the lifetime of references. This could be corrected by popping off Goals one at a time from awake2. Patch attached for DerivationGoal::tryToBuild and SubstitutionGoal::tryToRun. I haven't tested it beyond verifying that it compiles, but I've got a pretty good feeling=E2=84=A2 about it. The com= mit message includes discussion of a concrete scenario in which this could cause a deadlock. Interestingly enough, it seems to require at least 3 guix-daemon processes running during the "inciting event", though now that I think about it, the third doesn't need to be a persistent part of the deadlock. I did a brief glance over what I could find of Nix's corresponding code (of course much has changed), and it looks like there is a chance that at least their DerivationGoal::tryToBuild in src/libstore/build/derivation-goal.cc is affected by the same issue (you see the same pattern of calling outputLocks.setDeletion but not outputLocks.unlock in the same case). I couldn't quickly determine whether it was possible their substitutes were affected, as that code has more significantly diverged and I see no mention of locks in it. You may want to give them a nudge and see if this information is of use to them. Merry Christmas. =2D reepca --=-=-= Content-Type: text/x-patch Content-Disposition: attachment; filename=0001-nix-build.cc-explicitly-unlock-in-the-has-become-val.patch Content-Transfer-Encoding: quoted-printable From=202030f198892f972ee6bc2fb8529cbd6afb6bc9ad Mon Sep 17 00:00:00 2001 From: Reepca Russelstein Date: Tue, 24 Dec 2024 05:40:58 -0600 Subject: [PATCH] nix: build.cc: explicitly unlock in the has-become-valid case. (hopefully) fixes #31785. Similar to Nix issue #178, fixed in https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c83b= 8. We can't rely on Goal deletion to release our locks in a timely manner. In the case in which multiple guix-daemon processes simultaneously try produci= ng an output path path1, the one that gets there first (P1) will get the lock, and the second one (P2) will continue trying to acquire the lock until it is released. Once it has acquired the lock, it checks to see whether the path has already become valid in the meantime, and if so it reports success to those Goals waiting on its completion and finishes. Unfortunately, it fails to release the locks it holds first, so those stay held until that Goal gets deleted. Suppose we have the following store path dependency graph: path4 / | \ path1 path2 path3 P2 is now sitting on path1's lock, and will continue to do so until path4 is completed. Suppose there is also a P3, and it has been blocked while P1 builds path2. Now P3 is sitting on path2's lock, and can't acquire path1's lock to determine that it has been completed. Likewise P2 is sitting on path1's lock, and now can't acquire path2's lock to determine that it has b= een completed. Finally, P3 completes path3 while P2 is blocked. Now: =2D P1 knows that path1 and path2 are complete, and holds no locks, but can= 't determine that path3 is complete =2D P2 knows that path1 and path3 are complete, and holds locks on path1 and path3, but can't determine that path2 is complete =2D P3 knows that path2 and path3 are complete, and holds a lock on path2, = but can't determine that path1 is complete And none of these locks will be released until path4 is complete. Thus, we have a deadlock. To resolve this, we should explicitly release these locks as soon as they should be released. * nix/libstore/build.cc (DerivationGoal::tryToBuild, SubstitutionGoal::tryToRun): Explicitly release locks in the has-become-valid case. Change-Id: Ie510f84828892315fe6776c830db33d0f70bcef8 =2D-- nix/libstore/build.cc | 2 ++ 1 file changed, 2 insertions(+) diff --git a/nix/libstore/build.cc b/nix/libstore/build.cc index 43a8a371846..edd01bab34d 100644 =2D-- a/nix/libstore/build.cc +++ b/nix/libstore/build.cc @@ -1200,6 +1200,7 @@ void DerivationGoal::tryToBuild() if (buildMode !=3D bmCheck && validPaths.size() =3D=3D drv.outputs.siz= e()) { debug(format("skipping build of derivation `%1%', someone beat us = to it") % drvPath); outputLocks.setDeletion(true); + outputLocks.unlock(); done(BuildResult::AlreadyValid); return; } @@ -3070,6 +3071,7 @@ void SubstitutionGoal::tryToRun() if (!repair && worker.store.isValidPath(storePath)) { debug(format("store path `%1%' has become valid") % storePath); outputLock->setDeletion(true); + outputLock.reset(); amDone(ecSuccess); return; } =2D-=20 2.45.2 --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQFLBAEBCAA1FiEEdNapMPRLm4SepVYGwWaqSV9/GJwFAmdqsmUXHHJlZXBjYUBy dXNzZWxzdGVpbi54eXoACgkQwWaqSV9/GJyVJgf/TOqLfl5jKX9madCj9VGdaidy gMFJdgr1c9BhGzYFt+FCkBeftN+Dhas4ywtTgN+nmNCAAyMWQSxHK9VvZZFmMxsc sI5qBy+unAvEanR0dn0K02twtY1Zw+rvyZsLB8ECaxGSgBcH8/oalIm6nq/inM2G AHaI5JvPcmZZS1mDJ4FegWaGsxSFo+NdTOKMrHesByL1CqD8YtzVwGiD/tIu1rgT aBuaLWHVSWaLLn84mzqohlcTKlSPTyjnIZgwrBOftR+yhG6X5/LOJskhmjlPlN91 3KJTGunnyDI55ymJw26YDIrYi1aYQcyFS8Zz538kOjzha7nfS5budXHxUViZVA== =OikB -----END PGP SIGNATURE----- --==-=-=--