From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp1.migadu.com ([2001:41d0:303:e16b::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms13.migadu.com with LMTPS id +NDoL8Iub2fltQAA62LTzQ:P1 (envelope-from ) for ; Fri, 27 Dec 2024 22:48:34 +0000 Received: from aspmx1.migadu.com ([2001:41d0:303:e16b::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp1.migadu.com with LMTPS id +NDoL8Iub2fltQAA62LTzQ (envelope-from ) for ; Fri, 27 Dec 2024 23:48:34 +0100 X-Envelope-To: larch@yhetil.org Authentication-Results: aspmx1.migadu.com; dkim=pass header.d=debbugs.gnu.org header.s=debbugs-gnu-org header.b="MQS/GO2h"; dkim=fail ("headers rsa verify failed") header.d=gnu.org header.s=fencepost-gnu-org header.b=ZA0SWuWG; spf=pass (aspmx1.migadu.com: domain of "bug-guix-bounces+larch=yhetil.org@gnu.org" designates 209.51.188.17 as permitted sender) smtp.mailfrom="bug-guix-bounces+larch=yhetil.org@gnu.org"; dmarc=pass (policy=none) header.from=gnu.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=yhetil.org; s=key1; t=1735339714; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:resent-cc:resent-from:resent-sender: resent-message-id:in-reply-to:in-reply-to:references:references: list-id:list-help:list-unsubscribe:list-subscribe:list-post: dkim-signature; bh=Yt2BTIX7uu98trjgCi89w8SYMRRpCg28ysSgIg/EDY0=; b=h2PRzg6eTLRP1rJd9LQq346N/vg/j81iRJUgaED5mVMLzVnErAygSjzUF4/ZsI7cDgZ4Ve qMysPEOENPLO9W+ZG0cXKDBOblncQoSreG4c60h9Qb7MDXa86cXxJBy2g7lQlWZF/v/yfv /5YI0jGgXKTQMiF7XlW2bGUQUXqHGxUAMbGPUdCHvz8kzCakJkRUA/hz673Uw1K5wBJG3Q DB5kuwblh+asb1p0wf5iYoC7TKe1NmrnXN9QdFOWNt8YcPC5NcaFWsU/Om+jG3b2NDNrt2 HPYE93Xnx3lYE1NgotSCLoonkctEy4t3BZFsWS7EkhvFz0BYG9wad5y2zbDsaA== ARC-Authentication-Results: i=1; aspmx1.migadu.com; dkim=pass header.d=debbugs.gnu.org header.s=debbugs-gnu-org header.b="MQS/GO2h"; dkim=fail ("headers rsa verify failed") header.d=gnu.org header.s=fencepost-gnu-org header.b=ZA0SWuWG; spf=pass (aspmx1.migadu.com: domain of "bug-guix-bounces+larch=yhetil.org@gnu.org" designates 209.51.188.17 as permitted sender) smtp.mailfrom="bug-guix-bounces+larch=yhetil.org@gnu.org"; dmarc=pass (policy=none) header.from=gnu.org ARC-Seal: i=1; s=key1; d=yhetil.org; t=1735339714; a=rsa-sha256; cv=none; b=GWUnYZGM46cHJHMzzCpumRW7MVZ8JD/1fHp31tMQS1LyhiNVzipbma0Y0KEovsS/OCh7B8 rhcR8xdRwyxeB+BamRb10Hdmbzii4Q7P22oGCW1mNnE7B6ElxOZRzC3fiFmeQ6LLmDZbOy Tdib5gqmdSDyv+ro17ls1nkiQaiS5bQFlTeevJaBr3Q+TwDIeFPb8KKr8RJxV9lxHJvuSt 8LJU+9agcbss8/6fojMBhax8IGq1oNbqMMWoqmvYLsXZyTqUTGg11VKUt+gEZRZcIxIh5P GtoqVmJQiPmaHXSpOdcE/hzzYLKpjYgmminyuIFubEwHz/lXUANJRKzfLf0Lcg== Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id 4CA827157 for ; Fri, 27 Dec 2024 23:48:33 +0100 (CET) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1tRJ7l-0002ub-6D; Fri, 27 Dec 2024 17:48:05 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tRJ7i-0002u3-II for bug-guix@gnu.org; Fri, 27 Dec 2024 17:48:03 -0500 Received: from debbugs.gnu.org ([2001:470:142:5::43]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1tRJ7i-0007GZ-A1 for bug-guix@gnu.org; Fri, 27 Dec 2024 17:48:02 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=debbugs.gnu.org; s=debbugs-gnu-org; h=MIME-Version:Date:References:In-Reply-To:From:To:Subject; bh=Yt2BTIX7uu98trjgCi89w8SYMRRpCg28ysSgIg/EDY0=; b=MQS/GO2hBARMje3ohrSv4HYYADgaxBMe8jD74VLubaCG/SUol1fnVVexxuIgGMP4yTegawFg+qDFVu0K0iVnKSXJ/ZXq5VR8JnpC25WmYupZIwl3+i5TeiUDFUrP+gEg8MoYx+z1YosyInz5eioXU1ScJVqFAuvNa33zavO8UG3uwgr9N7xaPxw8z0T/ExgRWBL11VPilyZzzBZE9RLlkg/+8mOwcPVV8yXTYKb3LbEF82VBOAetTFaGCXVtWfr0tru9ecB0orglgtgivF0c9ufZ09mV0p0zQg3xG3L4JjrWr6KuxIB7Hvey0+aOHSGbBmRr4jEjXyF8dBVMv9I2jg==; Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1tRJ7h-0004at-VL for bug-guix@gnu.org; Fri, 27 Dec 2024 17:48:02 -0500 X-Loop: help-debbugs@gnu.org Subject: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock Resent-From: Ludovic =?UTF-8?Q?Court=C3=A8s?= Original-Sender: "Debbugs-submit" Resent-CC: bug-guix@gnu.org Resent-Date: Fri, 27 Dec 2024 22:48:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 31785 X-GNU-PR-Package: guix X-GNU-PR-Keywords: To: Reepca Russelstein Cc: Christopher Baines , 31785@debbugs.gnu.org Received: via spool by 31785-submit@debbugs.gnu.org id=B31785.173533966017627 (code B ref 31785); Fri, 27 Dec 2024 22:48:01 +0000 Received: (at 31785) by debbugs.gnu.org; 27 Dec 2024 22:47:40 +0000 Received: from localhost ([127.0.0.1]:47915 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tRJ7L-0004aE-3Y for submit@debbugs.gnu.org; Fri, 27 Dec 2024 17:47:39 -0500 Received: from eggs.gnu.org ([209.51.188.92]:32896) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tRJ7I-0004a1-Af for 31785@debbugs.gnu.org; Fri, 27 Dec 2024 17:47:37 -0500 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tRJ54-0006vn-R4; Fri, 27 Dec 2024 17:45:18 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-Version:Date:References:In-Reply-To:Subject:To: From; bh=Yt2BTIX7uu98trjgCi89w8SYMRRpCg28ysSgIg/EDY0=; b=ZA0SWuWGJTuCvv2CTjd3 sbUtnpHdNfVLM2QRmdnj7e+tdZPETX88ldxPcuxC2cWfzxpiW0HBPCgzWGqaLJORtI/cPvTRdIOPs N/VtVK7KMgoDFFS06zdFY8w4C2X8IlIwM/Z5CoVX9AKZSsQEU2BILmPYaAvObSONQtWf62OAOYVDU wprLPLK/74d3t75oJp/HKoFtnUQSB7O6rhWPekQHrxzSWzEV1n9Wa9HnMec54IPM4+W6vTMfea3El 9jFrs15AcCZpfLdAwnrOwHSDyFnJxaNSRpPeGGufpilTSXL86F1XzkrMwzTTj4Dn31ia43ZKTVgwA ZiNIPYSr9CPY1g==; From: Ludovic =?UTF-8?Q?Court=C3=A8s?= In-Reply-To: <87bjx16xej.fsf@russelstein.xyz> (Reepca Russelstein's message of "Tue, 24 Dec 2024 07:08:52 -0600") References: <87602ph0yv.fsf@gnu.org> <878qs9gg5k.fsf@gnu.org> <874j2xge0y.fsf@gnu.org> <87bjx16xej.fsf@russelstein.xyz> Date: Fri, 27 Dec 2024 23:45:10 +0100 Message-ID: <87bjwwyccp.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-guix@gnu.org List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-guix-bounces+larch=yhetil.org@gnu.org Sender: bug-guix-bounces+larch=yhetil.org@gnu.org X-Migadu-Country: US X-Migadu-Flow: FLOW_IN X-Migadu-Queue-Id: 4CA827157 X-Migadu-Scanner: mx13.migadu.com X-Migadu-Spam-Score: -5.39 X-Spam-Score: -5.39 X-TUID: QsFR6d76LTcB --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Hello Reepca, So glad to get your message. :-) Reepca Russelstein skribis: > I assume you're not using cached build failures, right? It defaults to > off, so it should only be in use if you explicitly passed > --cache-failures to guix-daemon (or had your client pass a value of > "true" for "build-cache-failure", but I don't see that string anywhere > in the guile side of guix). Right, not using cached build failures on these machines. > One detail that worries me is that SubstitutionGoal.outputLock is a > std::shared_ptr instead of a PathLocks. I cannot for the > life of me figure out why this is the case, since no sharing of > ownership of the underlying PathLocks seems to ever be done. > outputLock.reset() seems to be used as if it were a synonym for > outputLock.unlock() in many places, so whoever wrote those uses probably > thought the same. As far as I can tell, it *should* always cause > PathLocks::unlock to be called, so it shouldn't be an issue, it just > worries me that I don't understand why it's done that way. Yes, I had the same thoughts. I checked the =E2=80=9Cshared_ptr outputLock;=E2=80=9D member is here since at least 2005, when the file was renamed to build.cc. > Ahh, I think I may have just found another place in > DerivationGoal::tryToBuild that might be the source of your issues. > First, note the comment: > > /* Obtain locks on all output paths. The locks are automatically > released when we exit this function or the client crashes. ... */ > > Well, as I noticed earlier in the cached build failures case, this isn't > exactly accurate, and there's probably a reason why. The comment seems > to believe specifically that leaving the scope of tryToBuild will > automatically release the locks, which suggests to me that at one point > outputLocks was a local variable, rather than a field of DerivationGoal. > This theorized older version would be consistent with what we see in > this (current) snippet: > > if (buildMode !=3D bmCheck && validPaths.size() =3D=3D drv.outputs.si= ze()) { > debug(format("skipping build of derivation `%1%', someone beat us= to it") % drvPath); > outputLocks.setDeletion(true); > done(BuildResult::AlreadyValid); > return; > } Uh! Sounds like a plausible problem. > If leaving the scope of tryToBuild would cause outputLocks.unlock to > run, then outputLocks.setDeletion is all that would need to be called. > But since that isn't the case, this will cause the lock to be held for > as long as the DerivationGoal exists. Yep. [...] > From 2030f198892f972ee6bc2fb8529cbd6afb6bc9ad Mon Sep 17 00:00:00 2001 > From: Reepca Russelstein > Date: Tue, 24 Dec 2024 05:40:58 -0600 > Subject: [PATCH] nix: build.cc: explicitly unlock in the has-become-valid > case. > > (hopefully) fixes #31785. Similar to Nix issue #178, fixed in > https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c8= 3b8. > > We can't rely on Goal deletion to release our locks in a timely manner. = In > the case in which multiple guix-daemon processes simultaneously try produ= cing > an output path path1, the one that gets there first (P1) will get the loc= k, > and the second one (P2) will continue trying to acquire the lock until it= is > released. Once it has acquired the lock, it checks to see whether the pa= th > has already become valid in the meantime, and if so it reports success to > those Goals waiting on its completion and finishes. Unfortunately, it fa= ils > to release the locks it holds first, so those stay held until that Goal g= ets > deleted. > > Suppose we have the following store path dependency graph: > > path4 > / | \ > path1 path2 path3 > > P2 is now sitting on path1's lock, and will continue to do so until path4= is > completed. Suppose there is also a P3, and it has been blocked while P1 > builds path2. Now P3 is sitting on path2's lock, and can't acquire path1= 's > lock to determine that it has been completed. Likewise P2 is sitting on > path1's lock, and now can't acquire path2's lock to determine that it has= been > completed. Finally, P3 completes path3 while P2 is blocked. > > Now: > > - P1 knows that path1 and path2 are complete, and holds no locks, but can= 't > determine that path3 is complete > - P2 knows that path1 and path3 are complete, and holds locks on path1 and > path3, but can't determine that path2 is complete > - P3 knows that path2 and path3 are complete, and holds a lock on path2, = but > can't determine that path1 is complete > > And none of these locks will be released until path4 is complete. Thus, = we > have a deadlock. > > To resolve this, we should explicitly release these locks as soon as they > should be released. > > * nix/libstore/build.cc > (DerivationGoal::tryToBuild, SubstitutionGoal::tryToRun): > Explicitly release locks in the has-become-valid case. > > Change-Id: Ie510f84828892315fe6776c830db33d0f70bcef8 I=E2=80=99m very much convinced by the patch. Yet it bothers me that I can= not reproduce the problem. I tried first with this test, which attempts to reproduce what you describe in the commit log above: --8<---------------cut here---------------start------------->8--- ;; https://issues.guix.gnu.org/31785 (use-modules (guix) ((gnu packages) #:select (specification->package)) (srfi srfi-1) (ice-9 threads) (ice-9 match) (ice-9 textual-ports)) (define (nonce) (logxor (car (gettimeofday)) (cdr (gettimeofday)) (getpid))) (define drv1 (computed-file "drv1" #~(begin #$(nonce) (sleep 3) (mkdir #$output)))) (define drv2 (computed-file "drv2" #~(begin #$(nonce) (sleep 3) (mkdir #$output)))) (define drv3 (computed-file "drv3" #~(begin #$(nonce) (sleep 3) (mkdir #$output)))) (define drv4 (computed-file "drv4" #~(begin #$(nonce) (sleep 3) (pk 'deps: #$drv1 #$drv2 #$drv3) (mkdir #$output)))) (%graft? #f) (define (log-port prefix) "Return a logging port that prefixes each line with PREFIX." (make-custom-textual-output-port "log" (lambda (str start count) (let ((str (substring str start (+ start count)))) (format (current-error-port) "~a: ~a" prefix str) count)) #f #f #f)) (setvbuf (current-error-port) 'line) (set-port-encoding! (current-error-port) "UTF-8") (let* ((builder (lambda (name lst) (call-with-new-thread (lambda () (parameterize ((current-build-output-port (log-port name))) (with-store store (set-build-options store #:verbosity 4) (run-with-store store (mlet %store-monad ((lst (mapm %store-monad lower-object lst))) (built-derivations lst))))))))) (thread1 (builder 'thread1 (list drv4))) (thread2 (builder 'thread2 (list drv4 drv1 drv2))) (thread3 (builder 'thread3 (list drv4 drv3 drv2)))) (join-thread thread1) (join-thread thread2) (join-thread thread3)) --8<---------------cut here---------------end--------------->8--- But there=E2=80=99s no deadlock, and I think that=E2=80=99s because the pro= blem we=E2=80=99re seeing has to do with substitute goals, and there=E2=80=99s no such goal he= re. So then I tried this: --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable diff --git a/tests/store.scm b/tests/store.scm index 45948f4f43..224b9867c4 100644 --- a/tests/store.scm +++ b/tests/store.scm @@ -1,5 +1,5 @@ ;;; GNU Guix --- Functional package management for GNU -;;; Copyright =C2=A9 2012-2021, 2023 Ludovic Court=C3=A8s +;;; Copyright =C2=A9 2012-2021, 2023, 2024 Ludovic Court=C3=A8s ;;; ;;; This file is part of GNU Guix. ;;; @@ -35,6 +35,7 @@ (define-module (test-store) #:use-module (gnu packages bootstrap) #:use-module (ice-9 match) #:use-module (ice-9 regex) + #:use-module (ice-9 threads) #:use-module (rnrs bytevectors) #:use-module (rnrs io ports) #:use-module (web uri) @@ -1042,6 +1043,52 @@ (define %shell (ensure-path s item) (path-info-nar-size (query-path-info s item))))) =20 +(test-assert "substitute deadlock" + (with-store s + (let* ((guile (package-derivation s %bootstrap-guile (%current-system)= )) + (c (random-text)) ;contents of the output + (drv1 (build-expression->derivation + s "substitute-me1" + `(begin ,c (exit 1)) ;would actually fail + #:guile-for-build guile)) + (drv2 (build-expression->derivation + s "substitute-me2" + `(begin ,c (exit 1)) ;would actually fail + #:guile-for-build guile)) + (drv3 (build-expression->derivation + s "depends-on-substitutable1" + `(call-with-output-file %output + (lambda (p) + (display ,c p))) + #:inputs `(("drv1" ,drv1)) + #:guile-for-build guile)) + (drv4 (build-expression->derivation + s "depends-on-substitutable2" + `(call-with-output-file %output + (lambda (p) + (display ,c p))) + #:inputs `(("drv2" ,drv2)) + #:guile-for-build guile))) + (with-derivation-substitute drv1 c + (with-derivation-substitute drv2 c + (system* "ls" "-l" "/tmp/subst") + (let* ((builder (lambda (drv) + (call-with-new-thread + (lambda () + (with-store store + (set-build-options store + #:use-substitutes? #t + #:substitute-urls + (%test-substitute-urls= )) + (build-things store + (list (derivation-file-name= drv)))))))) + (thread1 (builder drv3)) + (thread2 (builder drv4))) + (join-thread thread1) + (join-thread thread2) + (and (valid-path? s (derivation->output-path drv3)) + (valid-path? s (derivation->output-path drv4))))))))) + (test-assert "export/import several paths" (let* ((texts (unfold (cut >=3D <> 10) (lambda _ (random-text)) --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable But I think we=E2=80=99d have to be lucky for it to trigger the deadlock (l= ucky because the two substitutes should be fetched at the same instant, roughly). Given that it=E2=80=99s a nasty issue, I=E2=80=99d prefer to have an (ideal= ly automated) test. Thoughts? Ideas? Anyway, thanks for thinking through it and for explaining your reasoning and coming up with a patch! And: merry end-of-year holidays too. :-) Ludo=E2=80=99. --=-=-=--