unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / Atom feed
* bug#41625: Sporadic guix-offload crashes due to EOF errors
@ 2020-05-31  9:51 Marius Bakke
  2020-05-31 10:12 ` Marius Bakke
  0 siblings, 1 reply; 15+ messages in thread
From: Marius Bakke @ 2020-05-31  9:51 UTC (permalink / raw)
  To: 41625

[-- Attachment #1: Type: text/plain, Size: 2571 bytes --]

Hello,

During 'guix build -s aarch64-linux dolphin' on Berlin, I got this crash:

--8<---------------cut here---------------start------------->8---
building /gnu/store/87655bh9rqcr29qasl1c4yj3skmxkyiz-kfilemetadata-5.70.0.drv...
process 12989 acquired build slot '/var/guix/offload/overdrive1.guixsd.org:52522/1'
process 12989 acquired build slot '/var/guix/offload/dover.guix.info:9023/1'
process 12989 acquired build slot '/var/guix/offload/141.80.167.167:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.163:22/0'
process 12989 acquired build slot '/var/guix/offload/localhost:2223/1'
process 12989 acquired build slot '/var/guix/offload/141.80.167.168:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.173:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.176:22/0'
process 12989 acquired build slot '/var/guix/offload/localhost:2222/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.165:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.169:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.181:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.170:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.174:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.180:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.161:22/0'
Backtrace:
In ice-9/boot-9.scm:
  1736:10  5 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
           4 (apply-smob/0 #<thunk 7f3344d296c0>)
In ice-9/boot-9.scm:
    718:2  3 (call-with-prompt _ _ #<procedure default-prompt-handle…>)
In ice-9/eval.scm:
    619:8  2 (_ #(#(#<directory (guile-user) 7f3344933f00>)))
In guix/ui.scm:
  1936:12  1 (run-guix-command _ . _)
In guix/scripts/offload.scm:
   742:22  0 (guix-offload . _)

guix/scripts/offload.scm:742:22: In procedure guix-offload:
Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.
guix build: error: unexpected EOF reading a line
--8<---------------cut here---------------end--------------->8---

Which is strange because guix/scripts/offload.scm:742 is wrapped in a
(unless (eof-object? ...)) block.

When this happens, the build command terminates, along with any other
builds that it had started concurrently.  Builds from other clients
were unaffected, of course.

I have also seen this occur on my personal offloading setup once every
blue moon, but don't know what could have caused it.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: Sporadic guix-offload crashes due to EOF errors
  2020-05-31  9:51 bug#41625: Sporadic guix-offload crashes due to EOF errors Marius Bakke
@ 2020-05-31 10:12 ` Marius Bakke
  2020-05-31 11:21   ` Marius Bakke
  0 siblings, 1 reply; 15+ messages in thread
From: Marius Bakke @ 2020-05-31 10:12 UTC (permalink / raw)
  To: 41625

[-- Attachment #1: Type: text/plain, Size: 2446 bytes --]

Marius Bakke <marius@gnu.org> writes:

> During 'guix build -s aarch64-linux dolphin' on Berlin, I got this crash:

Funny, I just got it _again_ building the same derivation.  It seems to
happen while checking load on the selected offload machine, which is
'overdrive1.guixsd.org' in this and the previous case:

--8<---------------cut here---------------start------------->8---
successfully built /gnu/store/qwscp6h2zsxr0knaizn1fn0saw1pfimi-kidletime-5.70.0.drv
process 51308 acquired build slot '/var/guix/offload/141.80.167.174:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.181:22/0'
process 51308 acquired build slot '/var/guix/offload/overdrive1.guixsd.org:52522/1'
process 51308 acquired build slot '/var/guix/offload/localhost:2223/1'
process 51308 acquired build slot '/var/guix/offload/141.80.167.162:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.180:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.161:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.170:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.163:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.166:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.177:22/0'
process 51308 acquired build slot '/var/guix/offload/localhost:2222/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.173:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.168:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.169:22/0'
process 51308 acquired build slot '/var/guix/offload/141.80.167.176:22/0'
Backtrace:
In ice-9/boot-9.scm:
  1736:10  5 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
           4 (apply-smob/0 #<thunk 7f3e4487b1c0>)
In ice-9/boot-9.scm:
    718:2  3 (call-with-prompt _ _ #<procedure default-prompt-handle…>)
In ice-9/eval.scm:
    619:8  2 (_ #(#(#<directory (guile-user) 7f3e4445ff00>)))
In guix/ui.scm:
  1936:12  1 (run-guix-command _ . _)
In guix/scripts/offload.scm:
   742:22  0 (guix-offload . _)

guix/scripts/offload.scm:742:22: In procedure guix-offload:
Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.
guix build: error: unexpected EOF reading a line
--8<---------------cut here---------------end--------------->8---

'guix offload test' passes without problems.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: Sporadic guix-offload crashes due to EOF errors
  2020-05-31 10:12 ` Marius Bakke
@ 2020-05-31 11:21   ` Marius Bakke
  2020-06-04 12:05     ` Ludovic Courtès
  0 siblings, 1 reply; 15+ messages in thread
From: Marius Bakke @ 2020-05-31 11:21 UTC (permalink / raw)
  To: 41625

[-- Attachment #1: Type: text/plain, Size: 750 bytes --]

Marius Bakke <marius@gnu.org> writes:

> 'guix offload test' passes without problems.

Not so fast, running it in a loop reveals the crash.

There is a trace file in /root/offloadtest.trace on Berlin with such an
occurence.  It looks like a timeout is reached shortly before the EOF
error:

10139 poll([{fd=14, events=POLLIN|POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])
10139 poll([{fd=14, events=POLLIN}], 1, 15000) = 0 (Timeout)
10139 write(2, "Backtrace:\n", 11)      = 11

This seems to be from a different node than the one reported previously,
as the preceding connect() was to this machine:

10139 connect(44, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("141.80.167.186")}, 16) = -1 EINPROGRESS (Operation now in progress)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: Sporadic guix-offload crashes due to EOF errors
  2020-05-31 11:21   ` Marius Bakke
@ 2020-06-04 12:05     ` Ludovic Courtès
  2021-05-24  5:33       ` Maxim Cournoyer
  0 siblings, 1 reply; 15+ messages in thread
From: Ludovic Courtès @ 2020-06-04 12:05 UTC (permalink / raw)
  To: Marius Bakke; +Cc: 41625

Hi,

Marius Bakke <marius@gnu.org> skribis:

> Marius Bakke <marius@gnu.org> writes:
>
>> 'guix offload test' passes without problems.
>
> Not so fast, running it in a loop reveals the crash.
>
> There is a trace file in /root/offloadtest.trace on Berlin with such an
> occurence.  It looks like a timeout is reached shortly before the EOF
> error:
>
> 10139 poll([{fd=14, events=POLLIN|POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])
> 10139 poll([{fd=14, events=POLLIN}], 1, 15000) = 0 (Timeout)
> 10139 write(2, "Backtrace:\n", 11)      = 11
>
> This seems to be from a different node than the one reported previously,
> as the preceding connect() was to this machine:
>
> 10139 connect(44, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("141.80.167.186")}, 16) = -1 EINPROGRESS (Operation now in progress)

So it looks like ‘connect’ fails and eventually we get an EOF object.
However, I don’t see where that EOF comes from because the return value
of ‘connect!’ (the Guile-SSH procedure) is properly checked.

Ludo’.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: Sporadic guix-offload crashes due to EOF errors
  2020-06-04 12:05     ` Ludovic Courtès
@ 2021-05-24  5:33       ` Maxim Cournoyer
  2021-05-25 15:50         ` bug#41625: [PATCH] offload: Handle a possible EOF response from read-repl-response Maxim Cournoyer
  0 siblings, 1 reply; 15+ messages in thread
From: Maxim Cournoyer @ 2021-05-24  5:33 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 41625

Hi,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Marius Bakke <marius@gnu.org> skribis:
>
>> Marius Bakke <marius@gnu.org> writes:
>>
>>> 'guix offload test' passes without problems.
>>
>> Not so fast, running it in a loop reveals the crash.
>>
>> There is a trace file in /root/offloadtest.trace on Berlin with such an
>> occurence.  It looks like a timeout is reached shortly before the EOF
>> error:
>>
>> 10139 poll([{fd=14, events=POLLIN|POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT}])
>> 10139 poll([{fd=14, events=POLLIN}], 1, 15000) = 0 (Timeout)
>> 10139 write(2, "Backtrace:\n", 11)      = 11
>>
>> This seems to be from a different node than the one reported previously,
>> as the preceding connect() was to this machine:
>>
>> 10139 connect(44, {sa_family=AF_INET, sin_port=htons(22),
>> sin_addr=inet_addr("141.80.167.186")}, 16) = -1 EINPROGRESS
>> (Operation now in progress)
>
> So it looks like ‘connect’ fails and eventually we get an EOF object.
> However, I don’t see where that EOF comes from because the return value
> of ‘connect!’ (the Guile-SSH procedure) is properly checked.
>
> Ludo’.

I got a slightly different backtrace that suggests making the connection
is not at fault, rather it occurs during the read-repl-response call:

--8<---------------cut here---------------start------------->8---
guix offload: testing 1 build machines defined in '/etc/guix/machines.scm'...
Backtrace:
           8 (primitive-load "/home/maxim/.config/guix/current/bin/guix")
In guix/ui.scm:
  2165:12  7 (run-guix-command _ . _)
In ice-9/boot-9.scm:
  1752:10  6 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
  1747:15  5 (with-exception-handler #<procedure 7f2caf885780 at ice-9/boot-9.scm:1831:7 (exn)> _ # _ # …)
In guix/scripts/offload.scm:
   704:21  4 (check-machine-availability _ _)
In srfi/srfi-1.scm:
   586:17  3 (map1 (#<session maxim@overdrive1.guix.gnu.org:52522 (connected) 7f2cae396fc0>))
In guix/inferior.scm:
    258:2  2 (port->inferior _ _)
    240:2  1 (read-repl-response _ _)
In ice-9/boot-9.scm:
  1685:16  0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.
--8<---------------cut here---------------end--------------->8---

I seem to get this more often than not with the overdrive1 offload
machine.

Maxim




^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: [PATCH] offload: Handle a possible EOF response from read-repl-response.
  2021-05-24  5:33       ` Maxim Cournoyer
@ 2021-05-25 15:50         ` Maxim Cournoyer
  2021-05-25 20:27           ` Ludovic Courtès
  0 siblings, 1 reply; 15+ messages in thread
From: Maxim Cournoyer @ 2021-05-25 15:50 UTC (permalink / raw)
  To: 41625; +Cc: Maxim Cournoyer

Fixes <https://issues.guix.gnu.org/41625>.

* guix/scripts/offload.scm (check-machine-availability): Refactor so that it
takes a single machine object, to allow for retrying a single machine.  Handle
the case where the checks raised an exception due to the connection to the
build machine having been lost, and retry up to 3 times.  Ensure the cleanup
code is run in all situations.
(check-machines-availability): New procedure.  Call
CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
twice as fast with 4 build machines, from ~30 s to ~15 s).
* guix/inferior.scm (&inferior-connection-lost): New condition type.
(read-repl-response): Raise a condition of the above type when reading EOF
from the build machine's port.
---
 guix/inferior.scm        |  9 ++++++++
 guix/scripts/offload.scm | 50 +++++++++++++++++++++++++++++-----------
 2 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/guix/inferior.scm b/guix/inferior.scm
index 7c8e478f2a..4ac1ea3484 100644
--- a/guix/inferior.scm
+++ b/guix/inferior.scm
@@ -1,5 +1,6 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2018, 2019, 2020, 2021 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -70,6 +71,7 @@
             inferior-exception-arguments
             inferior-exception-inferior
             inferior-exception-stack
+            inferior-connection-lost?
             read-repl-response
 
             inferior-packages
@@ -228,6 +230,9 @@ equivalent.  Return #f if the inferior could not be launched."
   (inferior   inferior-exception-inferior)        ;<inferior> | #f
   (stack      inferior-exception-stack))          ;list of (FILE COLUMN LINE)
 
+(define-condition-type &inferior-connection-lost &error
+  inferior-connection-lost?)
+
 (define* (read-repl-response port #:optional inferior)
   "Read a (guix repl) response from PORT and return it as a Scheme object.
 Raise '&inferior-exception' when an exception is read from PORT."
@@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
   (match (read port)
     (('values objects ...)
      (apply values (map sexp->object objects)))
+    ;; Unexpectedly read EOF from the port.  This can happen for example when
+    ;; the underlying connection for PORT was lost with Guile-SSH.
+    (? eof-object?
+       (raise (condition (&inferior-connection-lost))))
     (('exception ('arguments key objects ...)
                  ('stack frames ...))
      ;; Protocol (0 1 1) and later.
diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scm
index 835078cb97..7c6d38b218 100644
--- a/guix/scripts/offload.scm
+++ b/guix/scripts/offload.scm
@@ -1,7 +1,7 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2014, 2015, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
 ;;; Copyright © 2017 Ricardo Wurmus <rekado@elephly.net>
-;;; Copyright © 2020 Maxim Cournoyer <maxim.cournoyer@gmail.com>
+;;; Copyright © 2020, 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com>
 ;;; Copyright © 2020 Julien Lepiller <julien@lepiller.eu>
 ;;;
 ;;; This file is part of GNU Guix.
@@ -53,6 +53,7 @@
   #:use-module (ice-9 regex)
   #:use-module (ice-9 format)
   #:use-module (ice-9 binary-ports)
+  #:use-module (ice-9 threads)
   #:export (build-machine
             build-machine?
             build-machine-name
@@ -684,7 +685,7 @@ daemon is not running."
           (leave (G_ "failed to import '~a' from '~a'~%")
                  item name)))))
 
-(define (check-machine-availability machine-file pred)
+(define (check-machines-availability machine-file pred)
   "Check that each machine matching PRED in MACHINE-FILE is usable as a build
 machine."
   (define (build-machine=? m1 m2)
@@ -696,18 +697,39 @@ machine."
   (let ((machines (filter pred
                           (delete-duplicates (build-machines machine-file)
                                              build-machine=?))))
-    (info (G_ "testing ~a build machines defined in '~a'...~%")
+    (info (G_ "Testing ~a build machines defined in '~a'...~%")
           (length machines) machine-file)
-    (let* ((names    (map build-machine-name machines))
-           (sockets  (map build-machine-daemon-socket machines))
-           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
-           (nodes    (map remote-inferior sessions)))
-      (for-each assert-node-has-guix nodes names)
-      (for-each assert-node-repl nodes names)
-      (for-each assert-node-can-import sessions nodes names sockets)
-      (for-each assert-node-can-export sessions nodes names sockets)
-      (for-each close-inferior nodes)
-      (for-each disconnect! sessions))))
+    (par-for-each check-machine-availability machines)))
+
+(define (check-machine-availability machine)
+  "Check whether MACHINE is available.  Exit with an error upon failure."
+  ;; Sometimes, the machine remote port may return EOF, presumably because the
+  ;; connection was lost.  Retry up to 3 times.
+  (let loop ((retries 3))
+    (guard (c ((inferior-connection-lost? c)
+               (let ((retries-left (1- retries)))
+                 (if (> retries-left 0)
+                     (begin
+                       (format (current-error-port)
+                               (G_ "connection to machine ~s lost; retrying~%")
+                               (build-machine-name machine))
+                       (loop (retries-left)))
+                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
+                            (build-machine-name machine))))))
+      (let* ((name    (build-machine-name machine))
+             (socket  (build-machine-daemon-socket machine))
+             (session (open-ssh-session machine %short-timeout))
+             (node    (remote-inferior session)))
+        (dynamic-wind
+          (lambda () #t)
+          (lambda ()
+            (assert-node-has-guix node name)
+            (assert-node-repl node name)
+            (assert-node-can-import session node name socket)
+            (assert-node-can-export session node name socket))
+          (lambda ()
+            (close-inferior node)
+            (disconnect! session)))))))
 
 (define (check-machine-status machine-file pred)
   "Print the load of each machine matching PRED in MACHINE-FILE."
@@ -824,7 +846,7 @@ machine."
                        ((file) (values file (const #t)))
                        (()     (values %machine-file (const #t)))
                        (x      (leave (G_ "wrong number of arguments~%"))))))
-         (check-machine-availability (or file %machine-file) pred))))
+         (check-machines-availability (or file %machine-file) pred))))
     (("status" rest ...)
      (with-error-handling
        (let-values (((file pred)
-- 
2.31.1





^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: [PATCH] offload: Handle a possible EOF response from read-repl-response.
  2021-05-25 15:50         ` bug#41625: [PATCH] offload: Handle a possible EOF response from read-repl-response Maxim Cournoyer
@ 2021-05-25 20:27           ` Ludovic Courtès
  2021-05-26  3:18             ` bug#41625: [PATCH v2] " Maxim Cournoyer
  0 siblings, 1 reply; 15+ messages in thread
From: Ludovic Courtès @ 2021-05-25 20:27 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: 41625

Hi!

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> Fixes <https://issues.guix.gnu.org/41625>.
>
> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it
> takes a single machine object, to allow for retrying a single machine.  Handle
> the case where the checks raised an exception due to the connection to the
> build machine having been lost, and retry up to 3 times.  Ensure the cleanup
> code is run in all situations.
> (check-machines-availability): New procedure.  Call
> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
> twice as fast with 4 build machines, from ~30 s to ~15 s).
> * guix/inferior.scm (&inferior-connection-lost): New condition type.
> (read-repl-response): Raise a condition of the above type when reading EOF
> from the build machine's port.

[...]

> +(define-condition-type &inferior-connection-lost &error
> +  inferior-connection-lost?)
> +
>  (define* (read-repl-response port #:optional inferior)
>    "Read a (guix repl) response from PORT and return it as a Scheme object.
>  Raise '&inferior-exception' when an exception is read from PORT."
> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
>    (match (read port)
>      (('values objects ...)
>       (apply values (map sexp->object objects)))
> +    ;; Unexpectedly read EOF from the port.  This can happen for example when
> +    ;; the underlying connection for PORT was lost with Guile-SSH.
> +    (? eof-object?
> +       (raise (condition (&inferior-connection-lost))))

The match clause syntax is incorrect; should be:

 ((? eof-object?)
  (raise …))

> +    (info (G_ "Testing ~a build machines defined in '~a'...~%")
>            (length machines) machine-file)
> -    (let* ((names    (map build-machine-name machines))
> -           (sockets  (map build-machine-daemon-socket machines))
> -           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
> -           (nodes    (map remote-inferior sessions)))
> -      (for-each assert-node-has-guix nodes names)
> -      (for-each assert-node-repl nodes names)
> -      (for-each assert-node-can-import sessions nodes names sockets)
> -      (for-each assert-node-can-export sessions nodes names sockets)
> -      (for-each close-inferior nodes)
> -      (for-each disconnect! sessions))))
> +    (par-for-each check-machine-availability machines)))

Why not!  IMO this should go in a separate patch, though, since it’s not
related.

> +(define (check-machine-availability machine)
> +  "Check whether MACHINE is available.  Exit with an error upon failure."
> +  ;; Sometimes, the machine remote port may return EOF, presumably because the
> +  ;; connection was lost.  Retry up to 3 times.
> +  (let loop ((retries 3))
> +    (guard (c ((inferior-connection-lost? c)
> +               (let ((retries-left (1- retries)))
> +                 (if (> retries-left 0)
> +                     (begin
> +                       (format (current-error-port)
> +                               (G_ "connection to machine ~s lost; retrying~%")
> +                               (build-machine-name machine))
> +                       (loop (retries-left)))
> +                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
> +                            (build-machine-name machine))))))

I’m afraid we’re papering over problems here.

Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
berlin enough to reproduce the issue?  If so, we could monitor/strace
sshd on overdrive1 to get a better understanding of what’s going on.

WDYT?

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
  2021-05-25 20:27           ` Ludovic Courtès
@ 2021-05-26  3:18             ` Maxim Cournoyer
  2021-05-26  9:14               ` Ludovic Courtès
  2021-05-26 15:48               ` Marius Bakke
  0 siblings, 2 replies; 15+ messages in thread
From: Maxim Cournoyer @ 2021-05-26  3:18 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 41625

[-- Attachment #1: Type: text/plain, Size: 4337 bytes --]

Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

[...]

>>  (define* (read-repl-response port #:optional inferior)
>>    "Read a (guix repl) response from PORT and return it as a Scheme object.
>>  Raise '&inferior-exception' when an exception is read from PORT."
>> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
>>    (match (read port)
>>      (('values objects ...)
>>       (apply values (map sexp->object objects)))
>> +    ;; Unexpectedly read EOF from the port.  This can happen for example when
>> +    ;; the underlying connection for PORT was lost with Guile-SSH.
>> +    (? eof-object?
>> +       (raise (condition (&inferior-connection-lost))))
>
> The match clause syntax is incorrect; should be:
>
>  ((? eof-object?)
>   (raise …))

Good catch, fixed.

>> +    (info (G_ "Testing ~a build machines defined in '~a'...~%")
>>            (length machines) machine-file)
>> -    (let* ((names    (map build-machine-name machines))
>> -           (sockets  (map build-machine-daemon-socket machines))
>> -           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
>> -           (nodes    (map remote-inferior sessions)))
>> -      (for-each assert-node-has-guix nodes names)
>> -      (for-each assert-node-repl nodes names)
>> -      (for-each assert-node-can-import sessions nodes names sockets)
>> -      (for-each assert-node-can-export sessions nodes names sockets)
>> -      (for-each close-inferior nodes)
>> -      (for-each disconnect! sessions))))
>> +    (par-for-each check-machine-availability machines)))
>
> Why not!  IMO this should go in a separate patch, though, since it’s not
> related.

For me, it is related in that retrying all the checks of *every* build
offload machine would be too expensive; it already takes 32 s for my 4
offload machines; retrying this for up to 3 times would mean waiting for
a minute and half, which I don't find reasonable (imagine on berlin!).

>> +(define (check-machine-availability machine)
>> +  "Check whether MACHINE is available.  Exit with an error upon failure."
>> +  ;; Sometimes, the machine remote port may return EOF, presumably because the
>> +  ;; connection was lost.  Retry up to 3 times.
>> +  (let loop ((retries 3))
>> +    (guard (c ((inferior-connection-lost? c)
>> +               (let ((retries-left (1- retries)))
>> +                 (if (> retries-left 0)
>> +                     (begin
>> +                       (format (current-error-port)
>> +                               (G_ "connection to machine ~s lost; retrying~%")
>> +                               (build-machine-name machine))
>> +                       (loop (retries-left)))
>> +                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
>> +                            (build-machine-name machine))))))
>
> I’m afraid we’re papering over problems here.

I had that thought too, but then also realized that even if this was
papering over a problem, it'd be a good one to paper over as this
problem can legitimately happen in practice, due to the network's
inherently shaky nature.  It seems better to be ready for it.  Also, my
hopes in being able to troubleshoot such a difficult to reproduce
networking issue are rather low.

> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
> berlin enough to reproduce the issue?  If so, we could monitor/strace
> sshd on overdrive1 to get a better understanding of what’s going on.

It's actually difficult to trigger it; it seems to happen mostly on the
first try after a long time without connecting to the machine; on the
2nd and later tries, everything is smooth.  Waiting a few minutes is not
enough to re-trigger the problem.

I've managed to see the problem a few lucky times with:

--8<---------------cut here---------------start------------->8---
while true; do guix offload test /etc/guix/machines.scm overdrive1; done
--8<---------------cut here---------------end--------------->8---

I don't have a password set for my user on overdrive1, so can't attach
strace to sshd, but yeah, we could try to capture it and see if we can
understand what's going on.

Attached is v2 of the patch, with the match clause fixed.


[-- Attachment #2: 0001-offload-Handle-a-possible-EOF-response-from-read-rep.patch --]
[-- Type: text/x-patch, Size: 7300 bytes --]

From c52172502749a4d194dc51db9d2c394cb15e8d07 Mon Sep 17 00:00:00 2001
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Date: Tue, 25 May 2021 08:42:06 -0400
Subject: [PATCH] offload: Handle a possible EOF response from
 read-repl-response.

Fixes <https://issues.guix.gnu.org/41625>.

* guix/scripts/offload.scm (check-machine-availability): Refactor so that it
takes a single machine object, to allow for retrying a single machine.  Handle
the case where the checks raised an exception due to the connection to the
build machine having been lost, and retry up to 3 times.  Ensure the cleanup
code is run in all situations.
(check-machines-availability): New procedure.  Call
CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
twice as fast with 4 build machines, from ~30 s to ~15 s).
* guix/inferior.scm (&inferior-connection-lost): New condition type.
(read-repl-response): Raise a condition of the above type when reading EOF
from the build machine's port.
---
 guix/inferior.scm        |  9 ++++++++
 guix/scripts/offload.scm | 50 +++++++++++++++++++++++++++++-----------
 2 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/guix/inferior.scm b/guix/inferior.scm
index 7c8e478f2a..45d9d843e3 100644
--- a/guix/inferior.scm
+++ b/guix/inferior.scm
@@ -1,5 +1,6 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2018, 2019, 2020, 2021 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -70,6 +71,7 @@
             inferior-exception-arguments
             inferior-exception-inferior
             inferior-exception-stack
+            inferior-connection-lost?
             read-repl-response
 
             inferior-packages
@@ -228,6 +230,9 @@ equivalent.  Return #f if the inferior could not be launched."
   (inferior   inferior-exception-inferior)        ;<inferior> | #f
   (stack      inferior-exception-stack))          ;list of (FILE COLUMN LINE)
 
+(define-condition-type &inferior-connection-lost &error
+  inferior-connection-lost?)
+
 (define* (read-repl-response port #:optional inferior)
   "Read a (guix repl) response from PORT and return it as a Scheme object.
 Raise '&inferior-exception' when an exception is read from PORT."
@@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
   (match (read port)
     (('values objects ...)
      (apply values (map sexp->object objects)))
+    ;; Unexpectedly read EOF from the port.  This can happen for example when
+    ;; the underlying connection for PORT was lost with Guile-SSH.
+    ((? eof-object?)
+     (raise (condition (&inferior-connection-lost))))
     (('exception ('arguments key objects ...)
                  ('stack frames ...))
      ;; Protocol (0 1 1) and later.
diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scm
index 835078cb97..0271874f6a 100644
--- a/guix/scripts/offload.scm
+++ b/guix/scripts/offload.scm
@@ -1,7 +1,7 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2014, 2015, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
 ;;; Copyright © 2017 Ricardo Wurmus <rekado@elephly.net>
-;;; Copyright © 2020 Maxim Cournoyer <maxim.cournoyer@gmail.com>
+;;; Copyright © 2020, 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com>
 ;;; Copyright © 2020 Julien Lepiller <julien@lepiller.eu>
 ;;;
 ;;; This file is part of GNU Guix.
@@ -53,6 +53,7 @@
   #:use-module (ice-9 regex)
   #:use-module (ice-9 format)
   #:use-module (ice-9 binary-ports)
+  #:use-module (ice-9 threads)
   #:export (build-machine
             build-machine?
             build-machine-name
@@ -684,7 +685,7 @@ daemon is not running."
           (leave (G_ "failed to import '~a' from '~a'~%")
                  item name)))))
 
-(define (check-machine-availability machine-file pred)
+(define (check-machines-availability machine-file pred)
   "Check that each machine matching PRED in MACHINE-FILE is usable as a build
 machine."
   (define (build-machine=? m1 m2)
@@ -696,18 +697,39 @@ machine."
   (let ((machines (filter pred
                           (delete-duplicates (build-machines machine-file)
                                              build-machine=?))))
-    (info (G_ "testing ~a build machines defined in '~a'...~%")
+    (info (G_ "Testing ~a build machines defined in '~a'...~%")
           (length machines) machine-file)
-    (let* ((names    (map build-machine-name machines))
-           (sockets  (map build-machine-daemon-socket machines))
-           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
-           (nodes    (map remote-inferior sessions)))
-      (for-each assert-node-has-guix nodes names)
-      (for-each assert-node-repl nodes names)
-      (for-each assert-node-can-import sessions nodes names sockets)
-      (for-each assert-node-can-export sessions nodes names sockets)
-      (for-each close-inferior nodes)
-      (for-each disconnect! sessions))))
+    (par-for-each check-machine-availability machines)))
+
+(define (check-machine-availability machine)
+  "Check whether MACHINE is available.  Exit with an error upon failure."
+  ;; Sometimes, the machine remote port may return EOF, presumably because the
+  ;; connection was lost.  Retry up to 3 times.
+  (let loop ((retries 3))
+    (guard (c ((inferior-connection-lost? c)
+               (let ((retries-left (1- retries)))
+                 (if (> retries-left 0)
+                     (begin
+                       (format (current-error-port)
+                               (G_ "connection to machine '~a' lost; retrying~%")
+                               (build-machine-name machine))
+                       (loop (retries-left)))
+                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
+                            (build-machine-name machine))))))
+      (let* ((name    (build-machine-name machine))
+             (socket  (build-machine-daemon-socket machine))
+             (session (open-ssh-session machine %short-timeout))
+             (node    (remote-inferior session)))
+        (dynamic-wind
+          (lambda () #t)
+          (lambda ()
+            (assert-node-has-guix node name)
+            (assert-node-repl node name)
+            (assert-node-can-import session node name socket)
+            (assert-node-can-export session node name socket))
+          (lambda ()
+            (close-inferior node)
+            (disconnect! session)))))))
 
 (define (check-machine-status machine-file pred)
   "Print the load of each machine matching PRED in MACHINE-FILE."
@@ -824,7 +846,7 @@ machine."
                        ((file) (values file (const #t)))
                        (()     (values %machine-file (const #t)))
                        (x      (leave (G_ "wrong number of arguments~%"))))))
-         (check-machine-availability (or file %machine-file) pred))))
+         (check-machines-availability (or file %machine-file) pred))))
     (("status" rest ...)
      (with-error-handling
        (let-values (((file pred)
-- 
2.31.1


[-- Attachment #3: Type: text/plain, Size: 16 bytes --]


Thanks!

Maxim

^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
  2021-05-26  3:18             ` bug#41625: [PATCH v2] " Maxim Cournoyer
@ 2021-05-26  9:14               ` Ludovic Courtès
  2021-05-27 11:49                 ` Maxim Cournoyer
                                   ` (2 more replies)
  2021-05-26 15:48               ` Marius Bakke
  1 sibling, 3 replies; 15+ messages in thread
From: Ludovic Courtès @ 2021-05-26  9:14 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: 41625

Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

>>> +    (info (G_ "Testing ~a build machines defined in '~a'...~%")
>>>            (length machines) machine-file)
>>> -    (let* ((names    (map build-machine-name machines))
>>> -           (sockets  (map build-machine-daemon-socket machines))
>>> -           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
>>> -           (nodes    (map remote-inferior sessions)))
>>> -      (for-each assert-node-has-guix nodes names)
>>> -      (for-each assert-node-repl nodes names)
>>> -      (for-each assert-node-can-import sessions nodes names sockets)
>>> -      (for-each assert-node-can-export sessions nodes names sockets)
>>> -      (for-each close-inferior nodes)
>>> -      (for-each disconnect! sessions))))
>>> +    (par-for-each check-machine-availability machines)))
>>
>> Why not!  IMO this should go in a separate patch, though, since it’s not
>> related.
>
> For me, it is related in that retrying all the checks of *every* build
> offload machine would be too expensive; it already takes 32 s for my 4
> offload machines; retrying this for up to 3 times would mean waiting for
> a minute and half, which I don't find reasonable (imagine on berlin!).

I see.  So I’d say it’s a prerequisite (a patch that must come before)
but not entirely the same thing.  I’m nitpicking!

We should make sure it doesn’t trigger thread-safety issues in libssh or
anything like that (running it repeatedly on a large machines.scm should
give us some confidence).

>>> +(define (check-machine-availability machine)
>>> +  "Check whether MACHINE is available.  Exit with an error upon failure."
>>> +  ;; Sometimes, the machine remote port may return EOF, presumably because the
>>> +  ;; connection was lost.  Retry up to 3 times.
>>> +  (let loop ((retries 3))
>>> +    (guard (c ((inferior-connection-lost? c)
>>> +               (let ((retries-left (1- retries)))
>>> +                 (if (> retries-left 0)
>>> +                     (begin
>>> +                       (format (current-error-port)
>>> +                               (G_ "connection to machine ~s lost; retrying~%")
>>> +                               (build-machine-name machine))
>>> +                       (loop (retries-left)))
>>> +                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
>>> +                            (build-machine-name machine))))))
>>
>> I’m afraid we’re papering over problems here.
>
> I had that thought too, but then also realized that even if this was
> papering over a problem, it'd be a good one to paper over as this
> problem can legitimately happen in practice, due to the network's
> inherently shaky nature.  It seems better to be ready for it.  Also, my
> hopes in being able to troubleshoot such a difficult to reproduce
> networking issue are rather low.

Yes, but note that this is just for ‘guix offload test’.  The actual
code run while offloading will still fail badly.

>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
>> berlin enough to reproduce the issue?  If so, we could monitor/strace
>> sshd on overdrive1 to get a better understanding of what’s going on.
>
> It's actually difficult to trigger it; it seems to happen mostly on the
> first try after a long time without connecting to the machine; on the
> 2nd and later tries, everything is smooth.  Waiting a few minutes is not
> enough to re-trigger the problem.
>
> I've managed to see the problem a few lucky times with:
>
> while true; do guix offload test /etc/guix/machines.scm overdrive1; done
>
> I don't have a password set for my user on overdrive1, so can't attach
> strace to sshd, but yeah, we could try to capture it and see if we can
> understand what's going on.

OK.

> From c52172502749a4d194dc51db9d2c394cb15e8d07 Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
> Date: Tue, 25 May 2021 08:42:06 -0400
> Subject: [PATCH] offload: Handle a possible EOF response from
>  read-repl-response.
>
> Fixes <https://issues.guix.gnu.org/41625>.
>
> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it
> takes a single machine object, to allow for retrying a single machine.  Handle
> the case where the checks raised an exception due to the connection to the
> build machine having been lost, and retry up to 3 times.  Ensure the cleanup
> code is run in all situations.
> (check-machines-availability): New procedure.  Call
> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
> twice as fast with 4 build machines, from ~30 s to ~15 s).
> * guix/inferior.scm (&inferior-connection-lost): New condition type.
> (read-repl-response): Raise a condition of the above type when reading EOF
> from the build machine's port.

[...]

> +(define-condition-type &inferior-connection-lost &error
> +  inferior-connection-lost?)

Perhaps worth adding an ‘inferior’ and/or ‘port’ field.  That would
allow the handler to present more information as to which inferior is
failing.

Maybe ‘premature-eof’ would be more accurate than ‘connection-lost’.

> +                       (format (current-error-port)
> +                               (G_ "connection to machine '~a' lost; retrying~%")
> +                               (build-machine-name machine))

You can use ‘info’ instead of ‘format’.

Otherwise LGTM, thanks!

Ludo’.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
  2021-05-26  3:18             ` bug#41625: [PATCH v2] " Maxim Cournoyer
  2021-05-26  9:14               ` Ludovic Courtès
@ 2021-05-26 15:48               ` Marius Bakke
  2021-05-27 11:51                 ` Maxim Cournoyer
  1 sibling, 1 reply; 15+ messages in thread
From: Marius Bakke @ 2021-05-26 15:48 UTC (permalink / raw)
  To: Maxim Cournoyer, Ludovic Courtès; +Cc: 41625

[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]

Maxim Cournoyer <maxim.cournoyer@gmail.com> skriver:

>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
>> berlin enough to reproduce the issue?  If so, we could monitor/strace
>> sshd on overdrive1 to get a better understanding of what’s going on.
>
> It's actually difficult to trigger it; it seems to happen mostly on the
> first try after a long time without connecting to the machine; on the
> 2nd and later tries, everything is smooth.  Waiting a few minutes is not
> enough to re-trigger the problem.
>
> I've managed to see the problem a few lucky times with:
>
> --8<---------------cut here---------------start------------->8---
> while true; do guix offload test /etc/guix/machines.scm overdrive1; done
> --8<---------------cut here---------------end--------------->8---

I used to be able to reproduce it by inducing a high load on the target
machine and just let Guix keep trying to connect.  But now I did that,
and set overload threshold to 0.0 for good measure, and Guix has been
waiting patiently for two hours without failure.

So AFAICT this bug has been fixed.  Perhaps Berlin or the Overdrive
simply needs to be updated?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 247 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
  2021-05-26  9:14               ` Ludovic Courtès
@ 2021-05-27 11:49                 ` Maxim Cournoyer
  2021-05-27 14:57                 ` bug#41625: [PATCH v3] " Maxim Cournoyer
  2021-05-27 17:20                 ` bug#41625: [PATCH v2] " Maxim Cournoyer
  2 siblings, 0 replies; 15+ messages in thread
From: Maxim Cournoyer @ 2021-05-27 11:49 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 41625

Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

[...]

> I see.  So I’d say it’s a prerequisite (a patch that must come before)
> but not entirely the same thing.  I’m nitpicking!

Eh, it's okay :-).  Splitting changes into the right unit is a problem
that is akin to naming things; it's hard!  I welcome your suggestion.

> We should make sure it doesn’t trigger thread-safety issues in libssh or
> anything like that (running it repeatedly on a large machines.scm should
> give us some confidence).

It seems fine so far, but I've only tested in a loop with 4 build
machines.  When it nears completion I'll give it a shot on berlin.

[...]

> Yes, but note that this is just for ‘guix offload test’.  The actual
> code run while offloading will still fail badly.

Ah, thanks for pointing that; I somehow thought that this machine status
checking code was a prelude to every offloaded build.

[...]

>> I don't have a password set for my user on overdrive1, so can't attach
>> strace to sshd, but yeah, we could try to capture it and see if we can
>> understand what's going on.
>
> OK.

I'd be happy to try strace when your are available.  You can ping me on
the chat.  It's been more than 8 hours since I tried, so I should be
able to trigger the problem :-).

[...]

> Perhaps worth adding an ‘inferior’ and/or ‘port’ field.  That would
> allow the handler to present more information as to which inferior is
> failing.
>
> Maybe ‘premature-eof’ would be more accurate than ‘connection-lost’.

Good suggestions.  I'll implement them.

>> +                       (format (current-error-port)
>> +                               (G_ "connection to machine '~a' lost; retrying~%")
>> +                               (build-machine-name machine))
>
> You can use ‘info’ instead of ‘format’.

That also.  Thanks!

On another note, I was able to 'exercise' the fix, and the exception is
raised but something fails with the following backtrace instead of being
retried:

--8<---------------cut here---------------start------------->8---
guix offload: Testing 1 build machines defined in '/etc/guix/machines.scm'...
connection to machine 'overdrive1.guix.gnu.org' lost; retrying
Backtrace:
In ice-9/boot-9.scm:
  1752:10 10 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In unknown file:
           9 (apply-smob/0 #<thunk 7f915c028f60>)
In ice-9/boot-9.scm:
    724:2  8 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>)
In ice-9/eval.scm:
    619:8  7 (_ #(#(#<directory (guile-user) 7f915c022c80>)))
In guix/ui.scm:
  2161:12  6 (run-guix-command _ . _)
In ice-9/boot-9.scm:
  1752:10  5 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
  1747:15  4 (with-exception-handler #<procedure 7f91576bf0c0 at ice-9/boot-9.scm:1831:7 (exn)> _ # _ # …)
In srfi/srfi-1.scm:
    634:9  3 (for-each #<procedure check-machine-availability (a)> (#<<build-machine> name: "overdriv…>))
In ice-9/eval.scm:
   191:35  2 (_ #(#(#(#<directory (guix scripts offload) 7f9159852780> 3 #<<build-machine> na…> …) …) …))
Exception thrown while printing backtrace:
In procedure frame-local-ref: Argument 2 out of range: 1

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
Wrong type to apply: 2
--8<---------------cut here---------------end--------------->8---

I haven't been able to pinpoint what yet.  Notice that in the above code
I've changed par-for-each by just for-each, doubting it might have
something to do with it, but it appears unrelated.

Thanks,

Maxim




^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
  2021-05-26 15:48               ` Marius Bakke
@ 2021-05-27 11:51                 ` Maxim Cournoyer
  0 siblings, 0 replies; 15+ messages in thread
From: Maxim Cournoyer @ 2021-05-27 11:51 UTC (permalink / raw)
  To: Marius Bakke; +Cc: 41625

Hi Marius,

Marius Bakke <marius@gnu.org> writes:

> Maxim Cournoyer <maxim.cournoyer@gmail.com> skriver:
>
>>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
>>> berlin enough to reproduce the issue?  If so, we could monitor/strace
>>> sshd on overdrive1 to get a better understanding of what’s going on.
>>
>> It's actually difficult to trigger it; it seems to happen mostly on the
>> first try after a long time without connecting to the machine; on the
>> 2nd and later tries, everything is smooth.  Waiting a few minutes is not
>> enough to re-trigger the problem.
>>
>> I've managed to see the problem a few lucky times with:
>>
>> --8<---------------cut here---------------start------------->8---
>> while true; do guix offload test /etc/guix/machines.scm overdrive1; done
>> --8<---------------cut here---------------end--------------->8---
>
> I used to be able to reproduce it by inducing a high load on the target
> machine and just let Guix keep trying to connect.  But now I did that,
> and set overload threshold to 0.0 for good measure, and Guix has been
> waiting patiently for two hours without failure.
>
> So AFAICT this bug has been fixed.  Perhaps Berlin or the Overdrive
> simply needs to be updated?

Ah!  Do you have root access to overdrive1?  It'd be interesting to
reconfigure it to update the guix-daemon and see if the problem
vanishes.

Thanks for the information!

Maxim




^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: [PATCH v3] offload: Handle a possible EOF response from read-repl-response.
  2021-05-26  9:14               ` Ludovic Courtès
  2021-05-27 11:49                 ` Maxim Cournoyer
@ 2021-05-27 14:57                 ` Maxim Cournoyer
  2021-05-27 17:20                 ` bug#41625: [PATCH v2] " Maxim Cournoyer
  2 siblings, 0 replies; 15+ messages in thread
From: Maxim Cournoyer @ 2021-05-27 14:57 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 41625

[-- Attachment #1: Type: text/plain, Size: 3635 bytes --]

Hi Ludo,

Now that I have root access to overdrive1, I could strace the sshd
process (I just did 'strace -p340', noting the process of sshd displayed
with 'herd status sshd'):

--8<---------------cut here---------------start------------->8---
pselect6(87, [3 4], NULL, NULL, NULL, NULL) = 1 (in [3])
accept(3, {sa_family=AF_INET, sin_port=htons(33262), sin_addr=inet_addr("66.158.152.121")}, [128->16]) = 5
fcntl(5, F_GETFL)                       = 0x2 (flags O_RDWR)
pipe2([6, 7], 0)                        = 0
socketpair(AF_UNIX, SOCK_STREAM, 0, [8, 9]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xffff8e0ef0e0) = 644
close(7)                                = 0
close(9)                                = 0
write(8, "\0\0\1\245\0", 5)             = 5
write(8, "\0\0\1\234\nPort 22\nPermitRootLogin no\n"..., 420) = 420
close(8)                                = 0
close(5)                                = 0
getpid()                                = 340
getpid()                                = 340
getpid()                                = 340
getpid()                                = 340
getpid()                                = 340
getpid()                                = 340
getpid()                                = 340
pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
read(6, "\0", 1)                        = 1
pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
read(6, "", 1)                          = 0
close(6)                                = 0
pselect6(87, [3 4], NULL, NULL, NULL, NULL) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=644, si_uid=0, si_status=255, si_utime=1, si_stime=0} ---
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], WNOHANG, NULL) = 644
wait4(-1, 0xfffffa4d90e4, WNOHANG, NULL) = 0
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
pselect6(87, [3 4], NULL, NULL, NULL, NULL <detached ...>
--8<---------------cut here---------------end--------------->8---

With the attached v3 patch, the corresponding output (still problematic)
was:

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix offload test /etc/guix/machines.scm overdrive1
guix offload: Testing 1 build machines defined in '/etc/guix/machines.scm'...
guix offload: got premature EOF from machine 'overdrive1.guix.gnu.org' from inferior '#<inferior pipe (0 1 1) 7f6ee10d3400>' on port '#<input-output: channel (open) 7f6ee10c5500>'; retrying connection
Backtrace:
In ice-9/boot-9.scm:
  1752:10 10 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In unknown file:
           9 (apply-smob/0 #<thunk 7f6ee5ae9f60>)
In ice-9/boot-9.scm:
    724:2  8 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>)
In ice-9/eval.scm:
    619:8  7 (_ #(#(#<directory (guile-user) 7f6ee5ae3c80>)))
In guix/ui.scm:
  2161:12  6 (run-guix-command _ . _)
In ice-9/boot-9.scm:
  1752:10  5 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
  1747:15  4 (with-exception-handler #<procedure 7f6ee25e1480 at ice-9/boot-9.scm:1831:7 (exn)> _ # _ # …)
In ice-9/threads.scm:
   288:21  3 (loop _)
In guix/scripts/offload.scm:
   719:29  2 (_ _)
   719:29  1 (_ _)
In ice-9/boot-9.scm:
  1685:16  0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
Wrong type to apply: 2
--8<---------------cut here---------------end--------------->8---

I haven't tried analyzing the strace output yet.

Maxim


[-- Attachment #2: 0001-offload-Parallelize-machine-check-in-offload-test.patch --]
[-- Type: text/x-patch, Size: 4445 bytes --]

From c7b2ec1c58adf8c795df0a6aaf075dbc331f41e8 Mon Sep 17 00:00:00 2001
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Date: Thu, 27 May 2021 08:44:44 -0400
Subject: [PATCH 1/2] offload: Parallelize machine check in offload test.

* guix/scripts/offload.scm (check-machine-availability): Refactor so that it
takes a single machine object.  Ensure the cleanup code is always run.
(check-machines-availability): New procedure.  Call
CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
twice as fast with 4 build machines, from ~30 s to ~15 s).
---
 guix/scripts/offload.scm | 39 +++++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scm
index 835078cb97..b0fd20e158 100644
--- a/guix/scripts/offload.scm
+++ b/guix/scripts/offload.scm
@@ -1,7 +1,7 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2014, 2015, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
 ;;; Copyright © 2017 Ricardo Wurmus <rekado@elephly.net>
-;;; Copyright © 2020 Maxim Cournoyer <maxim.cournoyer@gmail.com>
+;;; Copyright © 2020, 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com>
 ;;; Copyright © 2020 Julien Lepiller <julien@lepiller.eu>
 ;;;
 ;;; This file is part of GNU Guix.
@@ -53,6 +53,7 @@
   #:use-module (ice-9 regex)
   #:use-module (ice-9 format)
   #:use-module (ice-9 binary-ports)
+  #:use-module (ice-9 threads)
   #:export (build-machine
             build-machine?
             build-machine-name
@@ -684,7 +685,7 @@ daemon is not running."
           (leave (G_ "failed to import '~a' from '~a'~%")
                  item name)))))
 
-(define (check-machine-availability machine-file pred)
+(define (check-machines-availability machine-file pred)
   "Check that each machine matching PRED in MACHINE-FILE is usable as a build
 machine."
   (define (build-machine=? m1 m2)
@@ -696,18 +697,28 @@ machine."
   (let ((machines (filter pred
                           (delete-duplicates (build-machines machine-file)
                                              build-machine=?))))
-    (info (G_ "testing ~a build machines defined in '~a'...~%")
+    (info (G_ "Testing ~a build machines defined in '~a'...~%")
           (length machines) machine-file)
-    (let* ((names    (map build-machine-name machines))
-           (sockets  (map build-machine-daemon-socket machines))
-           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
-           (nodes    (map remote-inferior sessions)))
-      (for-each assert-node-has-guix nodes names)
-      (for-each assert-node-repl nodes names)
-      (for-each assert-node-can-import sessions nodes names sockets)
-      (for-each assert-node-can-export sessions nodes names sockets)
-      (for-each close-inferior nodes)
-      (for-each disconnect! sessions))))
+    (par-for-each check-machine-availability machines)))
+
+(define (check-machine-availability machine)
+  "Check whether MACHINE is available.  Exit with an error upon failure."
+  ;; Sometimes, the machine remote port may return EOF, presumably because the
+  ;; connection was lost.  Retry up to 3 times.
+  (let* ((name    (build-machine-name machine))
+         (socket  (build-machine-daemon-socket machine))
+         (session (open-ssh-session machine %short-timeout))
+         (node    (remote-inferior session)))
+    (dynamic-wind
+      (lambda () #t)
+      (lambda ()
+        (assert-node-has-guix node name)
+        (assert-node-repl node name)
+        (assert-node-can-import session node name socket)
+        (assert-node-can-export session node name socket))
+      (lambda ()
+        (close-inferior node)
+        (disconnect! session)))))
 
 (define (check-machine-status machine-file pred)
   "Print the load of each machine matching PRED in MACHINE-FILE."
@@ -824,7 +835,7 @@ machine."
                        ((file) (values file (const #t)))
                        (()     (values %machine-file (const #t)))
                        (x      (leave (G_ "wrong number of arguments~%"))))))
-         (check-machine-availability (or file %machine-file) pred))))
+         (check-machines-availability (or file %machine-file) pred))))
     (("status" rest ...)
      (with-error-handling
        (let-values (((file pred)
-- 
2.31.1


[-- Attachment #3: 0002-offload-Handle-a-possible-EOF-response-from-read-rep.patch --]
[-- Type: text/x-patch, Size: 5325 bytes --]

From b5558777617e4674a150895458d57d202de56120 Mon Sep 17 00:00:00 2001
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Date: Tue, 25 May 2021 08:42:06 -0400
Subject: [PATCH 2/2] offload: Handle a possible EOF response from
 read-repl-response.

Partially fixes <https://issues.guix.gnu.org/41625>.

* guix/scripts/offload.scm (check-machine-availability): Handle the case where
the checks raised an exception due to receiving EOF prematurely, and retry up
to 3 times.
* guix/inferior.scm (&inferior-premature-eof): New condition type.
(read-repl-response): Raise a condition of the above type when reading EOF
from the build machine's port.
---
 guix/inferior.scm        | 15 ++++++++++++++
 guix/scripts/offload.scm | 42 ++++++++++++++++++++++++++--------------
 2 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/guix/inferior.scm b/guix/inferior.scm
index 7c8e478f2a..e63b37a7dd 100644
--- a/guix/inferior.scm
+++ b/guix/inferior.scm
@@ -1,5 +1,6 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2018, 2019, 2020, 2021 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -70,6 +71,9 @@
             inferior-exception-arguments
             inferior-exception-inferior
             inferior-exception-stack
+            inferior-premature-eof?
+            inferior-premature-eof-port
+            inferior-premature-eof-inferior
             read-repl-response
 
             inferior-packages
@@ -228,6 +232,11 @@ equivalent.  Return #f if the inferior could not be launched."
   (inferior   inferior-exception-inferior)        ;<inferior> | #f
   (stack      inferior-exception-stack))          ;list of (FILE COLUMN LINE)
 
+(define-condition-type &inferior-premature-eof &error
+  inferior-premature-eof?
+  (port inferior-premature-eof-port)
+  (inferior inferior-premature-eof-inferior))
+
 (define* (read-repl-response port #:optional inferior)
   "Read a (guix repl) response from PORT and return it as a Scheme object.
 Raise '&inferior-exception' when an exception is read from PORT."
@@ -241,6 +250,12 @@ Raise '&inferior-exception' when an exception is read from PORT."
   (match (read port)
     (('values objects ...)
      (apply values (map sexp->object objects)))
+    ((? eof-object?)
+     ;; Unexpectedly read EOF from the port.  This can happen for example when
+     ;; the underlying connection for PORT was lost with Guile-SSH.
+     (raise (condition (&inferior-premature-eof
+                        (inferior inferior)
+                        (port port)))))
     (('exception ('arguments key objects ...)
                  ('stack frames ...))
      ;; Protocol (0 1 1) and later.
diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scm
index b0fd20e158..4312eb4e22 100644
--- a/guix/scripts/offload.scm
+++ b/guix/scripts/offload.scm
@@ -705,20 +705,34 @@ machine."
   "Check whether MACHINE is available.  Exit with an error upon failure."
   ;; Sometimes, the machine remote port may return EOF, presumably because the
   ;; connection was lost.  Retry up to 3 times.
-  (let* ((name    (build-machine-name machine))
-         (socket  (build-machine-daemon-socket machine))
-         (session (open-ssh-session machine %short-timeout))
-         (node    (remote-inferior session)))
-    (dynamic-wind
-      (lambda () #t)
-      (lambda ()
-        (assert-node-has-guix node name)
-        (assert-node-repl node name)
-        (assert-node-can-import session node name socket)
-        (assert-node-can-export session node name socket))
-      (lambda ()
-        (close-inferior node)
-        (disconnect! session)))))
+  (let loop ((retries 3))
+    (guard (c ((inferior-premature-eof? c)
+               (let ((retries-left (1- retries))
+                     (inferior (inferior-premature-eof-inferior c)))
+                 (if (> retries-left 0)
+                     (begin
+                       (info (G_ "got premature EOF from machine '~a' from \
+inferior '~a' on port '~a'; retrying connection~%")
+                             (build-machine-name machine)
+                             inferior
+                             (inferior-premature-eof-port c))
+                       (loop (retries-left)))
+                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
+                            (build-machine-name machine))))))
+      (let* ((name    (build-machine-name machine))
+             (socket  (build-machine-daemon-socket machine))
+             (session (open-ssh-session machine %short-timeout))
+             (node    (remote-inferior session)))
+        (dynamic-wind
+          (lambda () #t)
+          (lambda ()
+            (assert-node-has-guix node name)
+            (assert-node-repl node name)
+            (assert-node-can-import session node name socket)
+            (assert-node-can-export session node name socket))
+          (lambda ()
+            (close-inferior node)
+            (disconnect! session)))))))
 
 (define (check-machine-status machine-file pred)
   "Print the load of each machine matching PRED in MACHINE-FILE."
-- 
2.31.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
  2021-05-26  9:14               ` Ludovic Courtès
  2021-05-27 11:49                 ` Maxim Cournoyer
  2021-05-27 14:57                 ` bug#41625: [PATCH v3] " Maxim Cournoyer
@ 2021-05-27 17:20                 ` Maxim Cournoyer
  2021-05-29 19:24                   ` Ludovic Courtès
  2 siblings, 1 reply; 15+ messages in thread
From: Maxim Cournoyer @ 2021-05-27 17:20 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 41625

[-- Attachment #1: Type: text/plain, Size: 127 bytes --]

Hello again,

My previous strace output didn't follow the 'clone' syscall; the one
attached does.  It's compressed with lzip.


[-- Attachment #2: sshd-strace.out.lz --]
[-- Type: application/octet-stream, Size: 67057 bytes --]

[-- Attachment #3: Type: text/plain, Size: 16 bytes --]


Thanks!

Maxim

^ permalink raw reply	[flat|nested] 15+ messages in thread

* bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
  2021-05-27 17:20                 ` bug#41625: [PATCH v2] " Maxim Cournoyer
@ 2021-05-29 19:24                   ` Ludovic Courtès
  0 siblings, 0 replies; 15+ messages in thread
From: Ludovic Courtès @ 2021-05-29 19:24 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: 41625

Hi Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> My previous strace output didn't follow the 'clone' syscall; the one
> attached does.  It's compressed with lzip.

--8<---------------cut here---------------start------------->8---
11457 sendto(3, "<86>May 27 19:03:45 sshd[11457]: pam_unix(sshd:session): session opened for user maxim by (uid=0)", 97, MSG_NOSIGNAL, NULL, 0) = 97
11457 socketpair(AF_UNIX, SOCK_STREAM, 0, [5, 7]) = 0
11457 fcntl(5, F_SETFD, FD_CLOEXEC)     = 0
11457 fcntl(7, F_SETFD, FD_CLOEXEC)     = 0
11457 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD <unfinished ...>
11459 set_robust_list(0xffff9d2e10f0, 24 <unfinished ...>
11457 <... clone resumed>, child_tidptr=0xffff9d2e10e0) = 11459

[…]

11459 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD <unfinished ...>
11460 set_robust_list(0xffff9d2e10f0, 24 <unfinished ...>
11459 <... clone resumed>, child_tidptr=0xffff9d2e10e0) = 11460

[…]

11460 execve("/gnu/store/6b5gnpnbi73l0pasan0fip4w3f1afizi-bash-5.0.16/bin/bash", ["bash", "-c", "guix repl -t machine"], 0xaaaacfb78470 /* 16 vars */) = 0

[…]

11460 <... read resumed>"(use-modules (gnu))\n", 4096) = 20

[…]

11459 read(4, "", 16384)                = 0
11459 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
11459 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
11459 close(11)                         = 0
11459 close(10)                         = 0
11459 close(13)                         = 0
11459 close(4)                          = 0
11459 brk(0xaaaacfba7000)               = 0xaaaacfba7000
11459 write(5, "\0\0\0\0012", 5)        = 5
11457 <... ppoll resumed>)              = 1 ([{fd=7, revents=POLLIN}])
11459 exit_group(0 <unfinished ...>

[…]

11457 sendto(3, "<86>May 27 19:04:08 sshd[11457]: pam_unix(sshd:session): session closed for user maxim", 86, MSG_NOSIGNAL, NULL, 0) = 86

[…]

11460 openat(AT_FDCWD, "/gnu/store/83vzpl9pl5v86ncb8jy2gbfgsac0amjv-guix-module-union/lib/guile/3.0/site-ccache/gnu/packages/mtools.go", O_RDONLY|O_CLOEXEC) = 14
11460 lseek(14, 0, SEEK_END)            = 72707
11460 mmap(NULL, 72707, PROT_READ, MAP_PRIVATE, 14, 0) = 0xffff90194000
11460 close(14)                         = 0
11460 mprotect(0xffff901a4000, 3432, PROT_READ|PROT_WRITE) = 0
11460 newfstatat(AT_FDCWD, "/gnu/store/83vzpl9pl5v86ncb8jy2gbfgsac0amjv-guix-module-union/share/guile/site/3.0/gnu/packages/patches/u-boot-rockchip-inno-usb.patch", {st_mode=S_IFREG|0444, st_size=10450, ...}, 0) = 0
11460 write(1, "(values (non-self-quoting 2052 \"#<unspecified>\"))\n", 50) = -1 EPIPE (Broken pipe)
11460 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=11460, si_uid=30013} ---
--8<---------------cut here---------------end--------------->8---

(‘guix repl’ is PID 11460.)

It looks like the SSH session gets closed while ‘guix repl’ is still
running, and thus the client never sees the reply from ‘guix repl’
(‘guix repl’ itself eventually gets EPIPE because its parent process
terminated and there’s no one listening to it.)

So the bug may be on the client side, in ‘guix offload’, which for some
reason drops the session too early.  Can you enable libssh debugging
there, reproduce the problem, and share the libssh debug info from that
run?

Do enable libssh debugging, you can drop something like this in (guix
scripts offload) for instance:

  (use-modules (ssh log))
  (set-log-verbosity! 'protocol)

The available logging levels, from Guile-SSH’s log.c, are:

--8<---------------cut here---------------start------------->8---
struct symbol_mapping log_verbosity[] = {
  /* 0, No logging at all */
  { "nolog",              SSH_LOG_NOLOG     },
  /* 1, Only rare and noteworthy events */
  { "rare",               SSH_LOG_RARE      },
  /* 2, High level protocol information */
  { "protocol",           SSH_LOG_PROTOCOL  },
  /* 3, Lower level protocol infomations, packet level */
  { "packet",             SSH_LOG_PACKET    },
  /* 4, Every function path */
  { "functions",          SSH_LOG_FUNCTIONS },
  { NULL,                 -1                }
};
--8<---------------cut here---------------end--------------->8---

HTH!

Ludo’.




^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-05-29 19:25 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-31  9:51 bug#41625: Sporadic guix-offload crashes due to EOF errors Marius Bakke
2020-05-31 10:12 ` Marius Bakke
2020-05-31 11:21   ` Marius Bakke
2020-06-04 12:05     ` Ludovic Courtès
2021-05-24  5:33       ` Maxim Cournoyer
2021-05-25 15:50         ` bug#41625: [PATCH] offload: Handle a possible EOF response from read-repl-response Maxim Cournoyer
2021-05-25 20:27           ` Ludovic Courtès
2021-05-26  3:18             ` bug#41625: [PATCH v2] " Maxim Cournoyer
2021-05-26  9:14               ` Ludovic Courtès
2021-05-27 11:49                 ` Maxim Cournoyer
2021-05-27 14:57                 ` bug#41625: [PATCH v3] " Maxim Cournoyer
2021-05-27 17:20                 ` bug#41625: [PATCH v2] " Maxim Cournoyer
2021-05-29 19:24                   ` Ludovic Courtès
2021-05-26 15:48               ` Marius Bakke
2021-05-27 11:51                 ` Maxim Cournoyer

unofficial mirror of bug-guix@gnu.org 

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://yhetil.org/guix-bugs/0 guix-bugs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 guix-bugs guix-bugs/ https://yhetil.org/guix-bugs \
		bug-guix@gnu.org
	public-inbox-index guix-bugs

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.yhetil.org/yhetil.gnu.guix.bugs
	nntp://news.gmane.io/gmane.comp.gnu.guix.bugs


AGPL code for this site: git clone http://ou63pmih66umazou.onion/public-inbox.git