(note: the following includes some "thinking out loud", skip to the patch if you're in a hurry)

I'm a bit puzzled as to how a true "forward-progress-not-possible" deadlock can even occur, especially in the manner you describe. The derivation graph is inherently a directed acyclic graph, so as long as locks are only acquired for a given derivation after all of its inputs (or references, for substitutions) are present, and are released once the desired outputs are present, a deadlock should only even theoretically be possible between different outputs of the same derivation.

These requirements /seem/ to be satisfied by build.cc. Path locks are only acquired in DerivationGoal::tryToBuild and SubstitutionGoal::tryToRun; the former is only reachable through DerivationGoal::inputsRealised (and DerivationGoal::buildDone in the case of multiple rounds), and the latter is only reachable through SubstitutionGoal::referencesValid. As their names imply, having made it through them should suffice to prove that the inputs and references are present.

That leaves the requirement that the locks are released once the desired outputs are present. It /is/ possible for a lock to stick around longer than desired (along with many other bad things) in DerivationGoal::tryToBuild, but to my knowledge only if a cached build failure is encountered (notice how there is no call to done, amDone, or any rescheduling procedure in that case; the running DerivationGoal just drops off the face of the earth and tells nobody, including any goal waiting on it). The same happens in DerivationGoal::haveDerivation. I assume you're not using cached build failures, right? The feature defaults to off, so it should only be in use if you explicitly passed --cache-failures to guix-daemon (or had your client pass a value of "true" for "build-cache-failure", but I don't see that string anywhere on the guile side of guix).

One detail that worries me is that SubstitutionGoal::outputLock is a std::shared_ptr<PathLocks> instead of a plain PathLocks. I cannot for the life of me figure out why this is the case, since no sharing of ownership of the underlying PathLocks ever seems to happen. outputLock.reset() is used as if it were a synonym for outputLock->unlock() in many places, so whoever wrote those uses probably thought the same. As far as I can tell, reset() *should* always end up calling PathLocks::unlock (a tiny standalone demo of this is below), so it shouldn't be an issue in itself; it just worries me that I don't understand why it's done that way.

Ahh, I think I may have just found another place in DerivationGoal::tryToBuild that might be the source of your issues. First, note the comment:

    /* Obtain locks on all output paths.  The locks are automatically
       released when we exit this function or the client crashes. ... */

Well, as I noticed earlier in the cached build failures case, this isn't exactly accurate, and there's probably a reason why. The comment seems to believe specifically that leaving the scope of tryToBuild will automatically release the locks, which suggests to me that at one point outputLocks was a local variable, rather than a field of DerivationGoal.
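As an aside, here is the promised demo of the shared_ptr point, using a made-up FakePathLocks class rather than the real PathLocks: reset()ing the sole owner destroys the pointee, and the destructor is where the unlocking happens, so the net effect should be the same as calling unlock() explicitly.

    #include <cstdio>
    #include <memory>

    struct FakePathLocks {              // stand-in for PathLocks, not the real thing
        bool held = true;
        void unlock() {
            if (held) { std::puts("locks released"); held = false; }
        }
        ~FakePathLocks() { unlock(); }  // destructor also releases, RAII-style
    };

    int main() {
        auto outputLock = std::make_shared<FakePathLocks>();

        // Either of these releases the (fake) locks, provided outputLock
        // is the only owner:
        //   outputLock->unlock();   // explicit, what the name says
        outputLock.reset();          // last reference dropped -> destructor -> unlock()

        std::puts("after reset (locks were already released above)");
    }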
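And to make the lifetime difference the comment glosses over concrete (again with made-up stand-in classes, not the real DerivationGoal or PathLocks): a scoped local would indeed be released on return from tryToBuild, exactly as the comment promises, whereas a field of the goal keeps its locks until the goal object itself is destroyed.

    #include <cstdio>

    struct FakePathLocks {                      // stand-in: releases in its destructor
        const char *name;
        explicit FakePathLocks(const char *n) : name(n) {}
        ~FakePathLocks() { std::printf("%s released\n", name); }
    };

    // Hypothetical "old" shape: a scoped local really is released on return.
    void tryToBuildWithLocalLock() {
        FakePathLocks outputLocks("local lock");
        // ... early return, normal return or exception: the lock dies here
    }

    // Current shape: the lock is a field, so an early return without an
    // explicit unlock keeps it held until the goal itself is destroyed.
    struct FakeGoal {
        FakePathLocks outputLocks{"member lock"};
        void tryToBuild() { /* setDeletion(true); done(...); return; -- lock still held */ }
    };

    int main() {
        tryToBuildWithLocalLock();              // prints "local lock released" right away
        {
            FakeGoal goal;
            goal.tryToBuild();
            std::puts("goal still alive, member lock still held");
        }                                       // only now: "member lock released"
    }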
This theorized older version would be consistent with what we see in this (current) snippet:

    if (buildMode != bmCheck && validPaths.size() == drv.outputs.size()) {
        debug(format("skipping build of derivation `%1%', someone beat us to it")
              % drvPath);
        outputLocks.setDeletion(true);
        done(BuildResult::AlreadyValid);
        return;
    }

If leaving the scope of tryToBuild caused outputLocks.unlock to run, then outputLocks.setDeletion would be all that needed to be called here. But since that isn't the case, this causes the lock to be held for as long as the DerivationGoal exists. (Note that I haven't actually checked the commit history to see whether outputLocks was or wasn't at some point a local variable - otherwise I have no idea why the commenter thought returning would automatically release it.)

Perhaps this same error was made elsewhere. C-s to see where else setDeletion is used without an accompanying unlock and... sure enough, in SubstitutionGoal::tryToRun, we see:

    /* Check again whether the path is invalid. */
    if (!repair && worker.store.isValidPath(storePath)) {
        debug(format("store path `%1%' has become valid") % storePath);
        outputLock->setDeletion(true);
        amDone(ecSuccess);
        return;
    }

Now outputLock will be held for as long as this SubstitutionGoal exists.

The Nix issue that you identified as closely resembling what you were encountering was resolved with this commit message
(https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c83b8):

-------------------------------------------------------------------
Fix deadlock in SubstitutionGoal

We were relying on SubstitutionGoal's destructor releasing the
lock, but if a goal is a top-level goal, the destructor won't run
in a timely manner since its reference count won't drop to zero.
So release it explicitly.
-------------------------------------------------------------------

Hmmm. Destructors not running in a timely manner, and in this particular case - which is guaranteed to occur when one guix-daemon process gets blocked by another holding a lock and the one initially holding the lock produces the output path - the only way the locks get released is when the destructors are run. I do believe this may be related!

One other possible reason for destructors not running in as timely a manner as one might hope (though probably not as severely as for top-level goals) is that Worker::run uses a ready queue, awake2, of type Goals, which holds strong references. This ensures that any Goal scheduled on a given scheduler turn will not be destroyed - even if it has already run - until the end of that scheduler turn is reached. This probably isn't the cause of what you're seeing, but this kind of detail matters a lot when so much is implicitly tied to the lifetime of references. It could be corrected by popping Goals off awake2 one at a time (a small illustration of what I mean is below).

Patch attached for DerivationGoal::tryToBuild and SubstitutionGoal::tryToRun. I haven't tested it beyond verifying that it compiles, but I've got a pretty good feeling™ about it. The commit message includes discussion of a concrete scenario in which this could cause a deadlock. Interestingly enough, it seems to require at least 3 guix-daemon processes running during the "inciting event", though now that I think about it, the third doesn't need to be a persistent part of the deadlock.
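For the record, the gist of the change is just to release the locks explicitly before signalling completion in those two early-return paths - the same remedy as the Nix commit quoted above. Roughly (this is a paraphrase in context, not the literal text of the attached patch):

    /* In DerivationGoal::tryToBuild (sketch): */
    if (buildMode != bmCheck && validPaths.size() == drv.outputs.size()) {
        debug(format("skipping build of derivation `%1%', someone beat us to it")
              % drvPath);
        outputLocks.setDeletion(true);
        outputLocks.unlock();       /* don't wait for the goal's destructor */
        done(BuildResult::AlreadyValid);
        return;
    }

    /* In SubstitutionGoal::tryToRun (likewise a sketch): */
    if (!repair && worker.store.isValidPath(storePath)) {
        debug(format("store path `%1%' has become valid") % storePath);
        outputLock->setDeletion(true);
        outputLock.reset();         /* drops the sole reference -> unlocks */
        amDone(ecSuccess);
        return;
    }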
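And since "popping Goals off one at a time" is a bit hand-wavy, here is a tiny self-contained illustration of the pattern I mean, with std::set<std::shared_ptr<...>> standing in for the real Goals type and a FakeGoal standing in for Goal; the actual Worker code is of course more involved.

    #include <cstdio>
    #include <memory>
    #include <set>

    struct FakeGoal {
        int id;
        explicit FakeGoal(int i) : id(i) {}
        ~FakeGoal() { std::printf("goal %d destroyed\n", id); }
        void work() { std::printf("goal %d ran\n", id); }
    };

    using GoalPtr = std::shared_ptr<FakeGoal>;
    using Goals = std::set<GoalPtr>;      // stand-in for the real Goals type

    int main() {
        Goals awake = { std::make_shared<FakeGoal>(1), std::make_shared<FakeGoal>(2) };

        /* Copying the whole set (as Worker::run's awake2 does) would keep a
           strong reference to every goal until the end of the scheduler turn,
           even for goals that finished early in that turn.  Popping one goal
           at a time drops our reference as soon as that goal has run: */
        while (!awake.empty()) {
            GoalPtr goal = *awake.begin();
            awake.erase(awake.begin());
            goal->work();
        }   // `goal` goes out of scope each iteration; if we held the last
            // reference, the goal (and whatever its destructor releases) dies now

        std::puts("end of scheduler turn");
    }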
I took a brief glance over what I could find of Nix's corresponding code (of course much has changed), and it looks like there is a chance that at least their DerivationGoal::tryToBuild in src/libstore/build/derivation-goal.cc is affected by the same issue (the same pattern of calling outputLocks.setDeletion without outputLocks.unlock appears in the same case). I couldn't quickly determine whether their substitution code might also be affected, as it has diverged more significantly and I see no mention of locks in it. You may want to give them a nudge and see if this information is of use to them.

Merry Christmas.

- reepca