From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp2 ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms11 with LMTPS id sH5RL6KTz16lawAA0tVLHw (envelope-from ) for ; Thu, 28 May 2020 10:34:10 +0000 Received: from aspmx1.migadu.com ([2001:41d0:2:4a6f::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp2 with LMTPS id wA0vK6KTz14VZAAAB5/wlQ (envelope-from ) for ; Thu, 28 May 2020 10:34:10 +0000 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id 1729D940C04 for ; Thu, 28 May 2020 10:34:10 +0000 (UTC) Received: from localhost ([::1]:37214 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jeFrT-0006jL-Rj for larch@yhetil.org; Thu, 28 May 2020 06:34:07 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:46360) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1jeFrO-0006jF-BB for guix-patches@gnu.org; Thu, 28 May 2020 06:34:02 -0400 Received: from debbugs.gnu.org ([209.51.188.43]:39527) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1jeFrO-0008HC-1g for guix-patches@gnu.org; Thu, 28 May 2020 06:34:02 -0400 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1jeFrN-0000YX-Uz for guix-patches@gnu.org; Thu, 28 May 2020 06:34:01 -0400 X-Loop: help-debbugs@gnu.org Subject: [bug#41119] [PATCH] fix some issues with (guix nar) Resent-From: Ludovic =?UTF-8?Q?Court=C3=A8s?= Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Thu, 28 May 2020 10:34:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 41119 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Caleb Ristvedt Cc: guix-sysadmin@gnu.org, 41119@debbugs.gnu.org Received: via spool by 41119-submit@debbugs.gnu.org id=B41119.15906619882068 (code B ref 41119); Thu, 28 May 2020 10:34:01 +0000 Received: (at 41119) by debbugs.gnu.org; 28 May 2020 10:33:08 +0000 Received: from localhost ([127.0.0.1]:51073 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1jeFqP-0000XB-TF for submit@debbugs.gnu.org; Thu, 28 May 2020 06:33:08 -0400 Received: from eggs.gnu.org ([209.51.188.92]:38806) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1jeFqO-0000Wg-4a for 41119@debbugs.gnu.org; Thu, 28 May 2020 06:33:00 -0400 Received: from fencepost.gnu.org ([2001:470:142:3::e]:58433) by eggs.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jeFqI-00082s-Ki; Thu, 28 May 2020 06:32:54 -0400 Received: from [2a01:e0a:1d:7270:af76:b9b:ca24:c465] (port=38122 helo=ribbon) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1jeFqH-0004ae-MX; Thu, 28 May 2020 06:32:54 -0400 From: Ludovic =?UTF-8?Q?Court=C3=A8s?= References: <87h7wsqu50.fsf@cune.org> <87k10xgi1t.fsf@gnu.org> <87mu5se943.fsf@cune.org> Date: Thu, 28 May 2020 12:32:51 +0200 In-Reply-To: <87mu5se943.fsf@cune.org> (Caleb Ristvedt's message of "Thu, 28 May 2020 03:50:36 -0500") Message-ID: <87imgge4do.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-Spam-Score: -3.3 (---) X-BeenThere: guix-patches@gnu.org List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guix-patches-bounces+larch=yhetil.org@gnu.org Sender: "Guix-patches" X-Scanner: scn0 Authentication-Results: aspmx1.migadu.com; dkim=none; dmarc=none; spf=pass (aspmx1.migadu.com: domain of guix-patches-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=guix-patches-bounces@gnu.org X-Spam-Score: -1.01 X-TUID: gfaViCR48s6P Hi! Caleb Ristvedt skribis: >> This change had an undesirable effect: the connection would be kept for >> the body of =E2=80=98with-temporary-store-file=E2=80=99, during which we= =E2=80=99d call: >> >> finalize-store-file -> register-path >> >> which accesses the database. At this point, for each =E2=80=98guix offl= oad=E2=80=99 >> process, we=E2=80=99d thus have the database open twice: once for the se= ssion=E2=80=99s >> guix-daemon, and once for that =E2=80=98register-path=E2=80=99 call. > > If the connection wasn't kept for the body of with-temporary-store-file, > the temporary store file wouldn't be protected from GC during the body > (the daemon treats unlocked temproots files as "stale"), thus rather > defeating the purpose. It makes sense, then, that the connection was > also kept for the body prior to this patch - indeed, unless emacs's > parenthesis-matching capabilities are failing me, it appears that the > body is solidly within the 'with-store' form in > 37edbc91e34fb5658261e637e6ffccdb381e5271. Oh you=E2=80=99re right, sorry for the confusion. >> On berlin, the effect is that we see many =E2=80=98guix offload=E2=80=99= processes >> stalled because the SQLite database is busy: > > ... which makes this quite the mystery indeed. I assume you've tested > with the patch reverted and found that this issue goes away? No. I observed the behavior and looked for recent changes that could cause the problem. But I guess I was tired and jumped to silly conclusions. > > > AFAIK just having the database open doesn't by itself impose any > locks. The daemon process we're connected to should have it open, but > should just be blocked waiting for our next RPC. Database locks happen > when transactions are started (either explicitly or implicitly), and > implicitly-started transactions are automatically committed by sqlite > (specifically when the statement that started the transaction is either > reset or finalized). The only loose end I can think of right now is that > call-with-transaction only catches exceptions of type 'sqlite-error, so > in theory if a different type of exception were to be thrown, it could > exit in a broken state where neither a commit nor a rollback has been > performed. Really it should catch all exception types, and use match in > the handler to pick out the sqlite-errors. If that were causing the > problems, though, we'd expect to see some errors appearing in the > offload output. Good point but yes, we=E2=80=99d see an error, and =E2=80=98guix offload=E2= =80=99 would probably exit right away. > Actually, come to think of it, there could be another issue with > call-with-transaction: if somehow it's possible for SQLITE_BUSY errors > to occur despite the connection having succeeded with a 'begin > immediate;' (which immediately starts a write transaction), then the > rollback wouldn't occur, and what should be a failed transaction > followed by a successful transaction becomes one long, > restarted-in-the-middle transaction. I'm not sure if that's a problem in > practice, though. > > And now that I look at it again, it turns out that most of our database > query procedures in (guix store database) aren't finalizing their > statements in case of a nonlocal exit... which would tend to happen if, > for example, an SQLITE_BUSY error occurred. Which would cause the > statement to not be finalized until the garbage collector got ahold of > it. But due to statement caching the garbage collector likely won't get > ahold of it until the database connection itself is destroyed. The > wording at https://www.sqlite.org/lang_transaction.html makes me think > that this shouldn't be an issue because the errors we'd expect all seem > to roll back automatically, but if we somehow got one that didn't roll > back automatically, there would potentially be an extended amount of > time before the statement was finalized and the implicit transaction > committed. > > Also, I've noticed that with the way that finalize-store-file is > written, we actually already have a database open when we call > register-path. This is because it's needed in order to call path-id, but > the scope of that with-database form is rather larger than it needs to > be. > > We may have a situation here where things go fine until a single > SQLITE_BUSY error is produced by chance, and that causes more > SQLITE_BUSY errors, and so on. Hmm, sounds plausible. > In summary, there are many things I could imagine going wrong to cause / > contribute to the observed behavior, but the patch, barring some absurd > guile compilation bug, is not one of them. I do, however, think that > (guix store database) needs some love. Yeah. >> They loop pretty much indefinitely on this and nothing (or very little) >> happens on the system. > > To be clear, the nothing-happening status is common to all processes > that use the database, including daemon processes? That's quite severe. I just did a random sample, but several offload processes were stuck like the one I showed, and clients would usually get =E2=80=9Cdatabase is locked=E2=80=9D messages from the daemon. >> I=E2=80=99ll revert this patch but I=E2=80=99m happy to hear what you th= ink, Caleb. > > If the data says it's causing those problems, I'd tend to agree with > that. I would really like to understand how, though, because even after > a few hours of brainstorming bizarre edge cases I still can't come up > with a satisfying explanation. No you=E2=80=99re right, my analysis was wrong. Further investigation need= ed=E2=80=A6 >> Another reason to implement temp roots in Scheme, as it would allow us >> to not open a connection to the daemon from =E2=80=98guix offload=E2=80= =99! > > Soon=E2=84=A2. Conceptually the code is there, I'm working towards a reba= se that > tries to first make the rest of daemon-side guix compatible with fibers > - thread pools=E2=9C=93, eval-with-container=E2=9C=93, fibers-friendly wa= itpid=E2=9C=93, etc. Neat! For master we could do with a simpler implementation, but we=E2=80= =99ll see. Thanks, Ludo=E2=80=99.