[bug#41119] [PATCH] fix some issues with (guix nar)

all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed

From: "Ludovic Courtès" <ludo@gnu.org>
To: Caleb Ristvedt <caleb.ristvedt@cune.org>
Cc: guix-sysadmin@gnu.org, 41119@debbugs.gnu.org
Subject: [bug#41119] [PATCH] fix some issues with (guix nar)
Date: Thu, 28 May 2020 12:32:51 +0200	[thread overview]
Message-ID: <87imgge4do.fsf@gnu.org> (raw)
In-Reply-To: <87mu5se943.fsf@cune.org> (Caleb Ristvedt's message of "Thu, 28 May 2020 03:50:36 -0500")

Hi!

Caleb Ristvedt <caleb.ristvedt@cune.org> skribis:

>> This change had an undesirable effect: the connection would be kept for
>> the body of ‘with-temporary-store-file’, during which we’d call:
>>
>>   finalize-store-file -> register-path
>>
>> which accesses the database.  At this point, for each ‘guix offload’
>> process, we’d thus have the database open twice: once for the session’s
>> guix-daemon, and once for that ‘register-path’ call.
>
> If the connection wasn't kept for the body of with-temporary-store-file,
> the temporary store file wouldn't be protected from GC during the body
> (the daemon treats unlocked temproots files as "stale"), thus rather
> defeating the purpose. It makes sense, then, that the connection was
> also kept for the body prior to this patch - indeed, unless emacs's
> parenthesis-matching capabilities are failing me, it appears that the
> body is solidly within the 'with-store' form in
> 37edbc91e34fb5658261e637e6ffccdb381e5271.

Oh you’re right, sorry for the confusion.

>> On berlin, the effect is that we see many ‘guix offload’ processes
>> stalled because the SQLite database is busy:
>
> ... which makes this quite the mystery indeed. I assume you've tested
> with the patch reverted and found that this issue goes away?

No.  I observed the behavior and looked for recent changes that could
cause the problem.  But I guess I was tired and jumped to silly
conclusions.

> <wild brainstorming starts here>
>
> AFAIK just having the database open doesn't by itself impose any
> locks. The daemon process we're connected to should have it open, but
> should just be blocked waiting for our next RPC. Database locks happen
> when transactions are started (either explicitly or implicitly), and
> implicitly-started transactions are automatically committed by sqlite
> (specifically when the statement that started the transaction is either
> reset or finalized). The only loose end I can think of right now is that
> call-with-transaction only catches exceptions of type 'sqlite-error, so
> in theory if a different type of exception were to be thrown, it could
> exit in a broken state where neither a commit nor a rollback has been
> performed. Really it should catch all exception types, and use match in
> the handler to pick out the sqlite-errors. If that were causing the
> problems, though, we'd expect to see some errors appearing in the
> offload output.

Good point but yes, we’d see an error, and ‘guix offload’ would probably
exit right away.

> Actually, come to think of it, there could be another issue with
> call-with-transaction: if somehow it's possible for SQLITE_BUSY errors
> to occur despite the connection having succeeded with a 'begin
> immediate;' (which immediately starts a write transaction), then the
> rollback wouldn't occur, and what should be a failed transaction
> followed by a successful transaction becomes one long,
> restarted-in-the-middle transaction. I'm not sure if that's a problem in
> practice, though.
>
> And now that I look at it again, it turns out that most of our database
> query procedures in (guix store database) aren't finalizing their
> statements in case of a nonlocal exit... which would tend to happen if,
> for example, an SQLITE_BUSY error occurred. Which would cause the
> statement to not be finalized until the garbage collector got ahold of
> it. But due to statement caching the garbage collector likely won't get
> ahold of it until the database connection itself is destroyed. The
> wording at https://www.sqlite.org/lang_transaction.html makes me think
> that this shouldn't be an issue because the errors we'd expect all seem
> to roll back automatically, but if we somehow got one that didn't roll
> back automatically, there would potentially be an extended amount of
> time before the statement was finalized and the implicit transaction
> committed.
>
> Also, I've noticed that with the way that finalize-store-file is
> written, we actually already have a database open when we call
> register-path. This is because it's needed in order to call path-id, but
> the scope of that with-database form is rather larger than it needs to
> be.
>
> We may have a situation here where things go fine until a single
> SQLITE_BUSY error is produced by chance, and that causes more
> SQLITE_BUSY errors, and so on.

Hmm, sounds plausible.

> In summary, there are many things I could imagine going wrong to cause /
> contribute to the observed behavior, but the patch, barring some absurd
> guile compilation bug, is not one of them. I do, however, think that
> (guix store database) needs some love.

Yeah.

>> They loop pretty much indefinitely on this and nothing (or very little)
>> happens on the system.
>
> To be clear, the nothing-happening status is common to all processes
> that use the database, including daemon processes? That's quite severe.

I just did a random sample, but several offload processes were stuck
like the one I showed, and clients would usually get “database is
locked” messages from the daemon.

>> I’ll revert this patch but I’m happy to hear what you think, Caleb.
>
> If the data says it's causing those problems, I'd tend to agree with
> that. I would really like to understand how, though, because even after
> a few hours of brainstorming bizarre edge cases I still can't come up
> with a satisfying explanation.

No you’re right, my analysis was wrong.  Further investigation needed…

>> Another reason to implement temp roots in Scheme, as it would allow us
>> to not open a connection to the daemon from ‘guix offload’!
>
> Soon™. Conceptually the code is there, I'm working towards a rebase that
> tries to first make the rest of daemon-side guix compatible with fibers
> - thread pools✓, eval-with-container✓, fibers-friendly waitpid✓, etc.

Neat!  For master we could do with a simpler implementation, but we’ll
see.

Thanks,
Ludo’.

     prev parent reply	other threads:[~2020-05-28 10:34 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-07  3:52 [bug#41119] [PATCH] fix some issues with (guix nar) Caleb Ristvedt
2020-05-07  8:05 ` Ludovic Courtès
2020-05-11 21:39   ` bug#41119: " Ludovic Courtès
2020-05-27 21:54 ` [bug#41119] " Ludovic Courtès
2020-05-28  8:50   ` Caleb Ristvedt
2020-05-28 10:32     ` Ludovic Courtès [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87imgge4do.fsf@gnu.org \
    --to=ludo@gnu.org \
    --cc=41119@debbugs.gnu.org \
    --cc=caleb.ristvedt@cune.org \
    --cc=guix-sysadmin@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.