unofficial mirror of guix-patches@gnu.org 
 help / color / mirror / code / Atom feed
* [bug#41119] [PATCH] fix some issues with (guix nar)
@ 2020-05-07  3:52 Caleb Ristvedt
  2020-05-07  8:05 ` Ludovic Courtès
  2020-05-27 21:54 ` [bug#41119] " Ludovic Courtès
  0 siblings, 2 replies; 6+ messages in thread
From: Caleb Ristvedt @ 2020-05-07  3:52 UTC (permalink / raw)
  To: 41119


[-- Attachment #1.1: Type: text/plain, Size: 825 bytes --]

I noticed two issues while looking at (guix nar):

1. The proper store-lock-handling protocol isn't used in
   FINALIZE-STORE-FILE. Lock acquisition needs to check for a deletion
   token, retrying if it exists, and lock release needs to delete the
   lock file and write the deletion token.

2. WITH-TEMPORARY-STORE-FILE opens a new daemon connection every time it
   retries with a new filename, and only closes any of them after the
   body has completed. So if we retry 20 times, we get 20 concurrent
   daemon connections. This also prevents the call to LOOP from being a
   tail call.

The attached patches resolve these issues. There are of course going to
be more places we need to (properly) acquire and release store locks as
guile-daemon code gets merged, but for now this should work as a bandaid
fix.

- reepca


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: FINALIZE-STORE-FILE fix --]
[-- Type: text/x-patch, Size: 1739 bytes --]

From b2c66b443bd42e05820cfb3920c96f1894820587 Mon Sep 17 00:00:00 2001
From: Caleb Ristvedt <caleb.ristvedt@cune.org>
Date: Wed, 6 May 2020 11:48:21 -0500
Subject: [PATCH 1/2] nar: 'finalize-store-file' follows proper store lock
 protocol.

* guix/nar.scm (finalize-store-file): check for deletion token when acquiring
  lock, write deletion token and delete lock file before releasing lock.
---
 guix/nar.scm | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/guix/nar.scm b/guix/nar.scm
index 29636aa0f8..f91af72879 100644
--- a/guix/nar.scm
+++ b/guix/nar.scm
@@ -82,10 +82,19 @@
 REFERENCES and DERIVER.  When LOCK? is true, acquire exclusive locks on TARGET
 before attempting to register it; otherwise, assume TARGET's locks are already
 held."
+  ;; TODO: make this reusable
+  (define (acquire-lock filename)
+    (let ((port (lock-file filename)))
+      (if (zero? (stat:size (stat port)))
+          port
+          (begin
+            (close port)
+            (acquire-lock filename)))))
+
   (with-database %default-database-file db
     (unless (path-id db target)
       (let ((lock (and lock?
-                       (lock-file (string-append target ".lock")))))
+                       (acquire-lock (string-append target ".lock")))))
 
         (unless (path-id db target)
           ;; If FILE already exists, delete it (it's invalid anyway.)
@@ -102,6 +111,9 @@ held."
                          #:deriver deriver))
 
         (when lock?
+          (delete-file (string-append target ".lock"))
+          (display "d" lock)
+          (force-output lock)
           (unlock-file lock))))))
 
 (define (temporary-store-file)
-- 
2.26.2


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.3: WITH-TEMPORARY-STORE-FILE fix --]
[-- Type: text/x-patch, Size: 1340 bytes --]

From 43ee61b405b01038b3e7c84aba64521ab8a62236 Mon Sep 17 00:00:00 2001
From: Caleb Ristvedt <caleb.ristvedt@cune.org>
Date: Wed, 6 May 2020 11:52:16 -0500
Subject: [PATCH 2/2] nar: 'with-temporary-store-file' uses a single connection

Previously the 'with-store' form was entered every time a different temporary
file was tried.  This caused there to be as many simultaneous open connections
as there were attempts, and prevented the (loop ...) call from being a tail
call.  This change fixes that.

* guix/nar.scm (with-temporary-store-file): open connection once prior to
  entering the loop.
---
 guix/nar.scm | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/guix/nar.scm b/guix/nar.scm
index f91af72879..404cef8b97 100644
--- a/guix/nar.scm
+++ b/guix/nar.scm
@@ -126,8 +126,8 @@ held."
 (define-syntax-rule (with-temporary-store-file name body ...)
   "Evaluate BODY with NAME bound to the file name of a temporary store item
 protected from GC."
-  (let loop ((name (temporary-store-file)))
-    (with-store store
+  (with-store store
+    (let loop ((name (temporary-store-file)))
       ;; Add NAME to the current process' roots.  (Opening this connection to
       ;; the daemon allows us to reuse its code that deals with the
       ;; per-process roots file.)
-- 
2.26.2


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [bug#41119] [PATCH] fix some issues with (guix nar)
  2020-05-07  3:52 [bug#41119] [PATCH] fix some issues with (guix nar) Caleb Ristvedt
@ 2020-05-07  8:05 ` Ludovic Courtès
  2020-05-11 21:39   ` bug#41119: " Ludovic Courtès
  2020-05-27 21:54 ` [bug#41119] " Ludovic Courtès
  1 sibling, 1 reply; 6+ messages in thread
From: Ludovic Courtès @ 2020-05-07  8:05 UTC (permalink / raw)
  To: Caleb Ristvedt; +Cc: 41119

Hi!

Caleb Ristvedt <caleb.ristvedt@cune.org> skribis:

> From b2c66b443bd42e05820cfb3920c96f1894820587 Mon Sep 17 00:00:00 2001
> From: Caleb Ristvedt <caleb.ristvedt@cune.org>
> Date: Wed, 6 May 2020 11:48:21 -0500
> Subject: [PATCH 1/2] nar: 'finalize-store-file' follows proper store lock
>  protocol.
>
> * guix/nar.scm (finalize-store-file): check for deletion token when acquiring
>   lock, write deletion token and delete lock file before releasing lock.

[...]

> +  ;; TODO: make this reusable
> +  (define (acquire-lock filename)

For consistency, s/filename/file/ please.  :-)

> +    (let ((port (lock-file filename)))
> +      (if (zero? (stat:size (stat port)))
> +          port
> +          (begin

Could you add a comment, like: “If FILE is non-empty, that’s because it
contains the deletion token, so try again.”

>          (when lock?
> +          (delete-file (string-append target ".lock"))
> +          (display "d" lock)
> +          (force-output lock)

Also a comment explaining why we’re writing a deletion token.

It’s a fine point of the daemon that I had totally overlooked.  I wonder
what the implications might have been.

> From 43ee61b405b01038b3e7c84aba64521ab8a62236 Mon Sep 17 00:00:00 2001
> From: Caleb Ristvedt <caleb.ristvedt@cune.org>
> Date: Wed, 6 May 2020 11:52:16 -0500
> Subject: [PATCH 2/2] nar: 'with-temporary-store-file' uses a single connection
>
> Previously the 'with-store' form was entered every time a different temporary
> file was tried.  This caused there to be as many simultaneous open connections
> as there were attempts, and prevented the (loop ...) call from being a tail
> call.  This change fixes that.
>
> * guix/nar.scm (with-temporary-store-file): open connection once prior to
>   entering the loop.

LGTM!

You can push both patches to ‘master’ (make sure “make authenticate”
passes before you do).

Thanks a lot for the quick fixes!

Ludo’.




^ permalink raw reply	[flat|nested] 6+ messages in thread

* bug#41119: [PATCH] fix some issues with (guix nar)
  2020-05-07  8:05 ` Ludovic Courtès
@ 2020-05-11 21:39   ` Ludovic Courtès
  0 siblings, 0 replies; 6+ messages in thread
From: Ludovic Courtès @ 2020-05-11 21:39 UTC (permalink / raw)
  To: Caleb Ristvedt; +Cc: 41119-done

This was pushed a couple of days ago:

  b338c41c82 nar: 'with-temporary-store-file' uses a single connection
  37edbc91e3 nar: 'finalize-store-file' follows proper store lock protocol.

Closing, and thanks again!

Ludo’.




^ permalink raw reply	[flat|nested] 6+ messages in thread

* [bug#41119] [PATCH] fix some issues with (guix nar)
  2020-05-07  3:52 [bug#41119] [PATCH] fix some issues with (guix nar) Caleb Ristvedt
  2020-05-07  8:05 ` Ludovic Courtès
@ 2020-05-27 21:54 ` Ludovic Courtès
  2020-05-28  8:50   ` Caleb Ristvedt
  1 sibling, 1 reply; 6+ messages in thread
From: Ludovic Courtès @ 2020-05-27 21:54 UTC (permalink / raw)
  To: Caleb Ristvedt; +Cc: guix-sysadmin, 41119

Hi,

Caleb Ristvedt <caleb.ristvedt@cune.org> skribis:

> From 43ee61b405b01038b3e7c84aba64521ab8a62236 Mon Sep 17 00:00:00 2001
> From: Caleb Ristvedt <caleb.ristvedt@cune.org>
> Date: Wed, 6 May 2020 11:52:16 -0500
> Subject: [PATCH 2/2] nar: 'with-temporary-store-file' uses a single connection
>
> Previously the 'with-store' form was entered every time a different temporary
> file was tried.  This caused there to be as many simultaneous open connections
> as there were attempts, and prevented the (loop ...) call from being a tail
> call.  This change fixes that.
>
> * guix/nar.scm (with-temporary-store-file): open connection once prior to
>   entering the loop.
> ---
>  guix/nar.scm | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/guix/nar.scm b/guix/nar.scm
> index f91af72879..404cef8b97 100644
> --- a/guix/nar.scm
> +++ b/guix/nar.scm
> @@ -126,8 +126,8 @@ held."
>  (define-syntax-rule (with-temporary-store-file name body ...)
>    "Evaluate BODY with NAME bound to the file name of a temporary store item
>  protected from GC."
> -  (let loop ((name (temporary-store-file)))
> -    (with-store store
> +  (with-store store
> +    (let loop ((name (temporary-store-file)))
>        ;; Add NAME to the current process' roots.  (Opening this connection to
>        ;; the daemon allows us to reuse its code that deals with the
>        ;; per-process roots file.)

This change had an undesirable effect: the connection would be kept for
the body of ‘with-temporary-store-file’, during which we’d call:

  finalize-store-file -> register-path

which accesses the database.  At this point, for each ‘guix offload’
process, we’d thus have the database open twice: once for the session’s
guix-daemon, and once for that ‘register-path’ call.

On berlin, the effect is that we see many ‘guix offload’ processes
stalled because the SQLite database is busy:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ guix processes |grep ^SessionPID|wc -l
104
ludo@berlin ~$ guix processes |recsel -e 'ClientCommand ~ "offload"'|grep ^SessionPID |wc -l
69
ludo@berlin ~$ guix processes |recsel -e 'ClientCommand ~ "offload"'|head 
SessionPID: 10916
ClientPID: 7408
ClientCommand: /gnu/store/18hp7flyb3yid3yp49i6qcdq0sbi5l1n-guile-3.0.2/bin/guile \ /gnu/store/abiva5ivq99x30r2s9pa3jj0pv9g16sv-guix-1.1.0-4.bdc801e/bin/.guix-real offload x86_64-linux 3600 1 21600

SessionPID: 11333
ClientPID: 9505
ClientCommand: /gnu/store/18hp7flyb3yid3yp49i6qcdq0sbi5l1n-guile-3.0.2/bin/guile \ /gnu/store/abiva5ivq99x30r2s9pa3jj0pv9g16sv-guix-1.1.0-4.bdc801e/bin/.guix-real offload x86_64-linux 3600 1 21600

SessionPID: 16277
ClientPID: 9179
ludo@berlin ~$ sudo strace -p 7408
strace: Process 7408 attached
restart_syscall(<... resuming interrupted read ...>) = 0
fcntl(19, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=125, l_len=1}) = 0
fcntl(19, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=120, l_len=1}) = -1 EAGAIN (Resource temporarily unavailable)
fcntl(19, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=125, l_len=1}) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000000}, NULL) = 0
fcntl(19, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=125, l_len=1}) = 0
fcntl(19, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=120, l_len=1}) = -1 EAGAIN (Resource temporarily unavailable)
fcntl(19, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=125, l_len=1}) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000000}, NULL) = 0
fcntl(19, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=125, l_len=1}) = 0
fcntl(19, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=120, l_len=1}) = -1 EAGAIN (Resource temporarily unavailable)
fcntl(19, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=125, l_len=1}) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000000}, ^Cstrace: Process 7408 detached
 <detached ...>
ludo@berlin ~$ sudo gdb -p 7408
[…]
(gdb) bt
#0  0x00007f2e2aa327a1 in clock_nanosleep@GLIBC_2.2.5 () from target:/gnu/store/fa6wj5bxkj5ll1d7292a70knmyl7a0cr-glibc-2.31/lib/libc.so.6
#1  0x00007f2e2aa37c03 in nanosleep () from target:/gnu/store/fa6wj5bxkj5ll1d7292a70knmyl7a0cr-glibc-2.31/lib/libc.so.6
#2  0x00007f2e2aa611a4 in usleep () from target:/gnu/store/fa6wj5bxkj5ll1d7292a70knmyl7a0cr-glibc-2.31/lib/libc.so.6
#3  0x00007f2e1e8245ea in unixSleep () from target:/gnu/store/807c6g9xqrxdjyhm8wm1r6jjjmc8q4vs-sqlite-3.31.1/lib/libsqlite3.so.0
#4  0x00007f2e1e81f56e in sqliteDefaultBusyCallback () from target:/gnu/store/807c6g9xqrxdjyhm8wm1r6jjjmc8q4vs-sqlite-3.31.1/lib/libsqlite3.so.0
#5  0x00007f2e1e81f5d9 in sqlite3InvokeBusyHandler () from target:/gnu/store/807c6g9xqrxdjyhm8wm1r6jjjmc8q4vs-sqlite-3.31.1/lib/libsqlite3.so.0
#6  0x00007f2e1e877ec1 in sqlite3BtreeBeginTrans () from target:/gnu/store/807c6g9xqrxdjyhm8wm1r6jjjmc8q4vs-sqlite-3.31.1/lib/libsqlite3.so.0
#7  0x00007f2e1e89fc64 in sqlite3VdbeExec () from target:/gnu/store/807c6g9xqrxdjyhm8wm1r6jjjmc8q4vs-sqlite-3.31.1/lib/libsqlite3.so.0
#8  0x00007f2e1e8a6d09 in sqlite3_step () from target:/gnu/store/807c6g9xqrxdjyhm8wm1r6jjjmc8q4vs-sqlite-3.31.1/lib/libsqlite3.so.0
#9  0x00007f2e1e8a7add in sqlite3_exec () from target:/gnu/store/807c6g9xqrxdjyhm8wm1r6jjjmc8q4vs-sqlite-3.31.1/lib/libsqlite3.so.0
#10 0x00007f2e2af0466d in ffi_call_unix64 () from target:/gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7
#11 0x00007f2e2af02ac0 in ffi_call_int () from target:/gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7
#12 0x00007f2e2aff148e in scm_i_foreign_call () from target:/gnu/store/18hp7flyb3yid3yp49i6qcdq0sbi5l1n-guile-3.0.2/lib/libguile-3.0.so.1
--8<---------------cut here---------------end--------------->8---

They loop pretty much indefinitely on this and nothing (or very little)
happens on the system.

I’ll revert this patch but I’m happy to hear what you think, Caleb.

Another reason to implement temp roots in Scheme, as it would allow us
to not open a connection to the daemon from ‘guix offload’!

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 6+ messages in thread

* [bug#41119] [PATCH] fix some issues with (guix nar)
  2020-05-27 21:54 ` [bug#41119] " Ludovic Courtès
@ 2020-05-28  8:50   ` Caleb Ristvedt
  2020-05-28 10:32     ` Ludovic Courtès
  0 siblings, 1 reply; 6+ messages in thread
From: Caleb Ristvedt @ 2020-05-28  8:50 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guix-sysadmin, 41119

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> This change had an undesirable effect: the connection would be kept for
> the body of ‘with-temporary-store-file’, during which we’d call:
>
>   finalize-store-file -> register-path
>
> which accesses the database.  At this point, for each ‘guix offload’
> process, we’d thus have the database open twice: once for the session’s
> guix-daemon, and once for that ‘register-path’ call.

If the connection wasn't kept for the body of with-temporary-store-file,
the temporary store file wouldn't be protected from GC during the body
(the daemon treats unlocked temproots files as "stale"), thus rather
defeating the purpose. It makes sense, then, that the connection was
also kept for the body prior to this patch - indeed, unless emacs's
parenthesis-matching capabilities are failing me, it appears that the
body is solidly within the 'with-store' form in
37edbc91e34fb5658261e637e6ffccdb381e5271.

> On berlin, the effect is that we see many ‘guix offload’ processes
> stalled because the SQLite database is busy:

... which makes this quite the mystery indeed. I assume you've tested
with the patch reverted and found that this issue goes away? If so, I am
very puzzled. One would expect that "database open twice" would tend to
have *less* contention issues than "database open at least twice".

<wild brainstorming starts here>

AFAIK just having the database open doesn't by itself impose any
locks. The daemon process we're connected to should have it open, but
should just be blocked waiting for our next RPC. Database locks happen
when transactions are started (either explicitly or implicitly), and
implicitly-started transactions are automatically committed by sqlite
(specifically when the statement that started the transaction is either
reset or finalized). The only loose end I can think of right now is that
call-with-transaction only catches exceptions of type 'sqlite-error, so
in theory if a different type of exception were to be thrown, it could
exit in a broken state where neither a commit nor a rollback has been
performed. Really it should catch all exception types, and use match in
the handler to pick out the sqlite-errors. If that were causing the
problems, though, we'd expect to see some errors appearing in the
offload output.

Actually, come to think of it, there could be another issue with
call-with-transaction: if somehow it's possible for SQLITE_BUSY errors
to occur despite the connection having succeeded with a 'begin
immediate;' (which immediately starts a write transaction), then the
rollback wouldn't occur, and what should be a failed transaction
followed by a successful transaction becomes one long,
restarted-in-the-middle transaction. I'm not sure if that's a problem in
practice, though.

And now that I look at it again, it turns out that most of our database
query procedures in (guix store database) aren't finalizing their
statements in case of a nonlocal exit... which would tend to happen if,
for example, an SQLITE_BUSY error occurred. Which would cause the
statement to not be finalized until the garbage collector got ahold of
it. But due to statement caching the garbage collector likely won't get
ahold of it until the database connection itself is destroyed. The
wording at https://www.sqlite.org/lang_transaction.html makes me think
that this shouldn't be an issue because the errors we'd expect all seem
to roll back automatically, but if we somehow got one that didn't roll
back automatically, there would potentially be an extended amount of
time before the statement was finalized and the implicit transaction
committed.

Also, I've noticed that with the way that finalize-store-file is
written, we actually already have a database open when we call
register-path. This is because it's needed in order to call path-id, but
the scope of that with-database form is rather larger than it needs to
be.

We may have a situation here where things go fine until a single
SQLITE_BUSY error is produced by chance, and that causes more
SQLITE_BUSY errors, and so on.

</wild brainstorming>

In summary, there are many things I could imagine going wrong to cause /
contribute to the observed behavior, but the patch, barring some absurd
guile compilation bug, is not one of them. I do, however, think that
(guix store database) needs some love.

> They loop pretty much indefinitely on this and nothing (or very little)
> happens on the system.

To be clear, the nothing-happening status is common to all processes
that use the database, including daemon processes? That's quite severe.

> I’ll revert this patch but I’m happy to hear what you think, Caleb.

If the data says it's causing those problems, I'd tend to agree with
that. I would really like to understand how, though, because even after
a few hours of brainstorming bizarre edge cases I still can't come up
with a satisfying explanation.

> Another reason to implement temp roots in Scheme, as it would allow us
> to not open a connection to the daemon from ‘guix offload’!

Soon™. Conceptually the code is there, I'm working towards a rebase that
tries to first make the rest of daemon-side guix compatible with fibers
- thread pools✓, eval-with-container✓, fibers-friendly waitpid✓, etc.

- reepca




^ permalink raw reply	[flat|nested] 6+ messages in thread

* [bug#41119] [PATCH] fix some issues with (guix nar)
  2020-05-28  8:50   ` Caleb Ristvedt
@ 2020-05-28 10:32     ` Ludovic Courtès
  0 siblings, 0 replies; 6+ messages in thread
From: Ludovic Courtès @ 2020-05-28 10:32 UTC (permalink / raw)
  To: Caleb Ristvedt; +Cc: guix-sysadmin, 41119

Hi!

Caleb Ristvedt <caleb.ristvedt@cune.org> skribis:

>> This change had an undesirable effect: the connection would be kept for
>> the body of ‘with-temporary-store-file’, during which we’d call:
>>
>>   finalize-store-file -> register-path
>>
>> which accesses the database.  At this point, for each ‘guix offload’
>> process, we’d thus have the database open twice: once for the session’s
>> guix-daemon, and once for that ‘register-path’ call.
>
> If the connection wasn't kept for the body of with-temporary-store-file,
> the temporary store file wouldn't be protected from GC during the body
> (the daemon treats unlocked temproots files as "stale"), thus rather
> defeating the purpose. It makes sense, then, that the connection was
> also kept for the body prior to this patch - indeed, unless emacs's
> parenthesis-matching capabilities are failing me, it appears that the
> body is solidly within the 'with-store' form in
> 37edbc91e34fb5658261e637e6ffccdb381e5271.

Oh you’re right, sorry for the confusion.

>> On berlin, the effect is that we see many ‘guix offload’ processes
>> stalled because the SQLite database is busy:
>
> ... which makes this quite the mystery indeed. I assume you've tested
> with the patch reverted and found that this issue goes away?

No.  I observed the behavior and looked for recent changes that could
cause the problem.  But I guess I was tired and jumped to silly
conclusions.

> <wild brainstorming starts here>
>
> AFAIK just having the database open doesn't by itself impose any
> locks. The daemon process we're connected to should have it open, but
> should just be blocked waiting for our next RPC. Database locks happen
> when transactions are started (either explicitly or implicitly), and
> implicitly-started transactions are automatically committed by sqlite
> (specifically when the statement that started the transaction is either
> reset or finalized). The only loose end I can think of right now is that
> call-with-transaction only catches exceptions of type 'sqlite-error, so
> in theory if a different type of exception were to be thrown, it could
> exit in a broken state where neither a commit nor a rollback has been
> performed. Really it should catch all exception types, and use match in
> the handler to pick out the sqlite-errors. If that were causing the
> problems, though, we'd expect to see some errors appearing in the
> offload output.

Good point but yes, we’d see an error, and ‘guix offload’ would probably
exit right away.

> Actually, come to think of it, there could be another issue with
> call-with-transaction: if somehow it's possible for SQLITE_BUSY errors
> to occur despite the connection having succeeded with a 'begin
> immediate;' (which immediately starts a write transaction), then the
> rollback wouldn't occur, and what should be a failed transaction
> followed by a successful transaction becomes one long,
> restarted-in-the-middle transaction. I'm not sure if that's a problem in
> practice, though.
>
> And now that I look at it again, it turns out that most of our database
> query procedures in (guix store database) aren't finalizing their
> statements in case of a nonlocal exit... which would tend to happen if,
> for example, an SQLITE_BUSY error occurred. Which would cause the
> statement to not be finalized until the garbage collector got ahold of
> it. But due to statement caching the garbage collector likely won't get
> ahold of it until the database connection itself is destroyed. The
> wording at https://www.sqlite.org/lang_transaction.html makes me think
> that this shouldn't be an issue because the errors we'd expect all seem
> to roll back automatically, but if we somehow got one that didn't roll
> back automatically, there would potentially be an extended amount of
> time before the statement was finalized and the implicit transaction
> committed.
>
> Also, I've noticed that with the way that finalize-store-file is
> written, we actually already have a database open when we call
> register-path. This is because it's needed in order to call path-id, but
> the scope of that with-database form is rather larger than it needs to
> be.
>
> We may have a situation here where things go fine until a single
> SQLITE_BUSY error is produced by chance, and that causes more
> SQLITE_BUSY errors, and so on.

Hmm, sounds plausible.

> In summary, there are many things I could imagine going wrong to cause /
> contribute to the observed behavior, but the patch, barring some absurd
> guile compilation bug, is not one of them. I do, however, think that
> (guix store database) needs some love.

Yeah.

>> They loop pretty much indefinitely on this and nothing (or very little)
>> happens on the system.
>
> To be clear, the nothing-happening status is common to all processes
> that use the database, including daemon processes? That's quite severe.

I just did a random sample, but several offload processes were stuck
like the one I showed, and clients would usually get “database is
locked” messages from the daemon.

>> I’ll revert this patch but I’m happy to hear what you think, Caleb.
>
> If the data says it's causing those problems, I'd tend to agree with
> that. I would really like to understand how, though, because even after
> a few hours of brainstorming bizarre edge cases I still can't come up
> with a satisfying explanation.

No you’re right, my analysis was wrong.  Further investigation needed…

>> Another reason to implement temp roots in Scheme, as it would allow us
>> to not open a connection to the daemon from ‘guix offload’!
>
> Soon™. Conceptually the code is there, I'm working towards a rebase that
> tries to first make the rest of daemon-side guix compatible with fibers
> - thread pools✓, eval-with-container✓, fibers-friendly waitpid✓, etc.

Neat!  For master we could do with a simpler implementation, but we’ll
see.

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-05-28 10:34 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-07  3:52 [bug#41119] [PATCH] fix some issues with (guix nar) Caleb Ristvedt
2020-05-07  8:05 ` Ludovic Courtès
2020-05-11 21:39   ` bug#41119: " Ludovic Courtès
2020-05-27 21:54 ` [bug#41119] " Ludovic Courtès
2020-05-28  8:50   ` Caleb Ristvedt
2020-05-28 10:32     ` Ludovic Courtès

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).