unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
From: Reepca Russelstein via Bug reports for GNU Guix <bug-guix@gnu.org>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: Christopher Baines <guix@cbaines.net>, 31785@debbugs.gnu.org
Subject: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock
Date: Tue, 24 Dec 2024 07:08:52 -0600	[thread overview]
Message-ID: <87bjx16xej.fsf@russelstein.xyz> (raw)
In-Reply-To: <874j2xge0y.fsf@gnu.org> ("Ludovic Courtès"'s message of "Sat, 21 Dec 2024 18:08:13 +0100")


[-- Attachment #1.1: Type: text/plain, Size: 7370 bytes --]


(note: the following includes some "thinking out loud", skip to the
patch if you're in a hurry)

I'm a bit puzzled how exactly a true "forward-progress-not-possible"
deadlock can even occur, especially in the manner you describe.  The
derivation graph is inherently a directed acyclic graph, so as long as
locks are only acquired for a given derivation after all of its inputs
(or references, for substitutions) are present, and are released once
the desired outputs are present, it should only even theoretically be
possible to have a deadlock between different outputs of the same
derivation.

These requirements /seem/ to be satisfied by build.cc.  Path locks are
only acquired in DerivationGoal::tryToBuild and
SubstitutionGoal::tryToRun, and the former is only reachable through
DerivationGoal::inputsRealised (and DerivationGoal::buildDone in case of
multiple rounds), and the latter is only reachable through
SubstitutionGoal::referencesValid.  As their names imply, having made it
through them should suffice to prove that the inputs and references are
present.

That leaves the requirement that the locks are released once the desired
outputs are present.  It /is/ possible for a lock to stick around longer
than desired (along with many other bad things) in
DerivationGoal::tryToBuild, but only if a cached build failure is
encountered, to my knowledge (notice how there is no call to done,
amDone, or any rescheduling procedure in that case; the running
DerivationGoal just drops off the face of the earth and tells nobody,
including any goal waiting on it).  This also happens in
DerivationGoal::haveDerivation.

I assume you're not using cached build failures, right?  It defaults to
off, so it should only be in use if you explicitly passed
--cache-failures to guix-daemon (or had your client pass a value of
"true" for "build-cache-failure", but I don't see that string anywhere
in the guile side of guix).

One detail that worries me is that SubstitutionGoal.outputLock is a
std::shared_ptr<PathLocks> instead of a PathLocks.  I cannot for the
life of me figure out why this is the case, since no sharing of
ownership of the underlying PathLocks seems to ever be done.
outputLock.reset() seems to be used as if it were a synonym for
outputLock.unlock() in many places, so whoever wrote those uses probably
thought the same.  As far as I can tell, it *should* always cause
PathLocks::unlock to be called, so it shouldn't be an issue, it just
worries me that I don't understand why it's done that way.

Ahh, I think I may have just found another place in
DerivationGoal::tryToBuild that might be the source of your issues.
First, note the comment:

/* Obtain locks on all output paths.  The locks are automatically
   released when we exit this function or the client crashes. ... */

Well, as I noticed earlier in the cached build failures case, this isn't
exactly accurate, and there's probably a reason why.  The comment seems
to believe specifically that leaving the scope of tryToBuild will
automatically release the locks, which suggests to me that at one point
outputLocks was a local variable, rather than a field of DerivationGoal.
This theorized older version would be consistent with what we see in
this (current) snippet:

    if (buildMode != bmCheck && validPaths.size() == drv.outputs.size()) {
        debug(format("skipping build of derivation `%1%', someone beat us to it") % drvPath);
        outputLocks.setDeletion(true);
        done(BuildResult::AlreadyValid);
        return;
    }

If leaving the scope of tryToBuild would cause outputLocks.unlock to
run, then outputLocks.setDeletion is all that would need to be called.
But since that isn't the case, this will cause the lock to be held for
as long as the DerivationGoal exists.

(note that I haven't actually checked the commit history to see whether
outputLocks was or wasn't at some point a local variable - I have no
idea why the commenter thought returning would automatically release it)

Perhaps this same error was made elsewhere.  C-s to see where else
setDeletion is used without an accompanying unlock and... sure enough,
in SubstitutionGoal::tryToRun, we see:

    /* Check again whether the path is invalid. */
    if (!repair && worker.store.isValidPath(storePath)) {
        debug(format("store path `%1%' has become valid") % storePath);
        outputLock->setDeletion(true);
        amDone(ecSuccess);
        return;
    }

Now outputLock will be held for as long as this SubstitutionGoal exists.

The Nix issue that you identified as closely resembling what you were
encountering was resolved with this commit message
(https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c83b8):

-------------------------------------------------------------------
Fix deadlock in SubstitutionGoal

We were relying on SubstitutionGoal's destructor releasing the lock,
but if a goal is a top-level goal, the destructor won't run in a
timely manner since its reference count won't drop to zero.  So
release it explicitly.
-------------------------------------------------------------------

Hmmm.  Destructors not running in a timely manner, and in this
particular case - which is guaranteed to occur when one guix-daemon
process gets blocked by another holding a lock and the one initially
holding the lock produces the output path - the only way the locks get
released is when the destructors are run.  I do believe this may be
related!

One other possible reason for destructors not running in as timely a
manner as one might hope (though probably not as severely as for
top-level goals) would be that Worker::run uses a ready queue, awake2,
of type Goals, which contains strong references.  This ensures that any
Goal on this scheduler turn will not be destroyed, even if it has
already run, until the end of this scheduler turn is reached.  This
probably isn't the cause for what you're seeing, but this kind of detail
matters a lot when so much is implicitly tied to the lifetime of
references.  This could be corrected by popping off Goals one at a time
from awake2.

Patch attached for DerivationGoal::tryToBuild and
SubstitutionGoal::tryToRun.  I haven't tested it beyond verifying that
it compiles, but I've got a pretty good feeling™ about it.  The commit
message includes discussion of a concrete scenario in which this could
cause a deadlock.  Interestingly enough, it seems to require at least 3
guix-daemon processes running during the "inciting event", though now
that I think about it, the third doesn't need to be a persistent part of
the deadlock.

I did a brief glance over what I could find of Nix's corresponding code
(of course much has changed), and it looks like there is a chance that
at least their DerivationGoal::tryToBuild in
src/libstore/build/derivation-goal.cc is affected by the same issue (you
see the same pattern of calling outputLocks.setDeletion but not
outputLocks.unlock in the same case).  I couldn't quickly determine
whether it was possible their substitutes were affected, as that code
has more significantly diverged and I see no mention of locks in it.
You may want to give them a nudge and see if this information is of use
to them.

Merry Christmas.

- reepca

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: 0001-nix-build.cc-explicitly-unlock-in-the-has-become-val.patch --]
[-- Type: text/x-patch, Size: 3298 bytes --]

From 2030f198892f972ee6bc2fb8529cbd6afb6bc9ad Mon Sep 17 00:00:00 2001
From: Reepca Russelstein <reepca@russelstein.xyz>
Date: Tue, 24 Dec 2024 05:40:58 -0600
Subject: [PATCH] nix: build.cc: explicitly unlock in the has-become-valid
 case.

(hopefully) fixes #31785.  Similar to Nix issue #178, fixed in
https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c83b8.

We can't rely on Goal deletion to release our locks in a timely manner.  In
the case in which multiple guix-daemon processes simultaneously try producing
an output path path1, the one that gets there first (P1) will get the lock,
and the second one (P2) will continue trying to acquire the lock until it is
released.  Once it has acquired the lock, it checks to see whether the path
has already become valid in the meantime, and if so it reports success to
those Goals waiting on its completion and finishes.  Unfortunately, it fails
to release the locks it holds first, so those stay held until that Goal gets
deleted.

Suppose we have the following store path dependency graph:

          path4
      /     |     \
   path1   path2   path3

P2 is now sitting on path1's lock, and will continue to do so until path4 is
completed.  Suppose there is also a P3, and it has been blocked while P1
builds path2.  Now P3 is sitting on path2's lock, and can't acquire path1's
lock to determine that it has been completed.  Likewise P2 is sitting on
path1's lock, and now can't acquire path2's lock to determine that it has been
completed.  Finally, P3 completes path3 while P2 is blocked.

Now:

- P1 knows that path1 and path2 are complete, and holds no locks, but can't
  determine that path3 is complete
- P2 knows that path1 and path3 are complete, and holds locks on path1 and
  path3, but can't determine that path2 is complete
- P3 knows that path2 and path3 are complete, and holds a lock on path2, but
  can't determine that path1 is complete

And none of these locks will be released until path4 is complete.  Thus, we
have a deadlock.

To resolve this, we should explicitly release these locks as soon as they
should be released.

* nix/libstore/build.cc
  (DerivationGoal::tryToBuild, SubstitutionGoal::tryToRun):
  Explicitly release locks in the has-become-valid case.

Change-Id: Ie510f84828892315fe6776c830db33d0f70bcef8
---
 nix/libstore/build.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/nix/libstore/build.cc b/nix/libstore/build.cc
index 43a8a371846..edd01bab34d 100644
--- a/nix/libstore/build.cc
+++ b/nix/libstore/build.cc
@@ -1200,6 +1200,7 @@ void DerivationGoal::tryToBuild()
     if (buildMode != bmCheck && validPaths.size() == drv.outputs.size()) {
         debug(format("skipping build of derivation `%1%', someone beat us to it") % drvPath);
         outputLocks.setDeletion(true);
+        outputLocks.unlock();
         done(BuildResult::AlreadyValid);
         return;
     }
@@ -3070,6 +3071,7 @@ void SubstitutionGoal::tryToRun()
     if (!repair && worker.store.isValidPath(storePath)) {
         debug(format("store path `%1%' has become valid") % storePath);
         outputLock->setDeletion(true);
+        outputLock.reset();
         amDone(ecSuccess);
         return;
     }
-- 
2.45.2


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 519 bytes --]

  parent reply	other threads:[~2024-12-24 13:15 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-11 14:06 bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock Ludovic Courtès
2019-09-06  9:04 ` Ludovic Courtès
2020-11-04 15:21 ` Ludovic Courtès
2024-12-21 16:22 ` Ludovic Courtès
2024-12-21 17:08   ` Ludovic Courtès
2024-12-21 22:43     ` Ludovic Courtès
2024-12-24 13:08     ` Reepca Russelstein via Bug reports for GNU Guix [this message]
2024-12-27 22:45       ` Ludovic Courtès
2024-12-28  7:06         ` Reepca Russelstein via Bug reports for GNU Guix
2024-12-29 23:57           ` Ludovic Courtès
2024-12-30 12:52             ` Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87bjx16xej.fsf@russelstein.xyz \
    --to=bug-guix@gnu.org \
    --cc=31785@debbugs.gnu.org \
    --cc=guix@cbaines.net \
    --cc=ludo@gnu.org \
    --cc=reepca@russelstein.xyz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).