unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 41625@debbugs.gnu.org
Subject: bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
Date: Tue, 25 May 2021 23:18:17 -0400	[thread overview]
Message-ID: <87mtsikwsm.fsf_-_@gmail.com> (raw)
In-Reply-To: <875yz61rvt.fsf@gnu.org> ("Ludovic Courtès"'s message of "Tue, 25 May 2021 22:27:02 +0200")

[-- Attachment #1: Type: text/plain, Size: 4337 bytes --]

Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

[...]

>>  (define* (read-repl-response port #:optional inferior)
>>    "Read a (guix repl) response from PORT and return it as a Scheme object.
>>  Raise '&inferior-exception' when an exception is read from PORT."
>> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
>>    (match (read port)
>>      (('values objects ...)
>>       (apply values (map sexp->object objects)))
>> +    ;; Unexpectedly read EOF from the port.  This can happen for example when
>> +    ;; the underlying connection for PORT was lost with Guile-SSH.
>> +    (? eof-object?
>> +       (raise (condition (&inferior-connection-lost))))
>
> The match clause syntax is incorrect; should be:
>
>  ((? eof-object?)
>   (raise …))

Good catch, fixed.

>> +    (info (G_ "Testing ~a build machines defined in '~a'...~%")
>>            (length machines) machine-file)
>> -    (let* ((names    (map build-machine-name machines))
>> -           (sockets  (map build-machine-daemon-socket machines))
>> -           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
>> -           (nodes    (map remote-inferior sessions)))
>> -      (for-each assert-node-has-guix nodes names)
>> -      (for-each assert-node-repl nodes names)
>> -      (for-each assert-node-can-import sessions nodes names sockets)
>> -      (for-each assert-node-can-export sessions nodes names sockets)
>> -      (for-each close-inferior nodes)
>> -      (for-each disconnect! sessions))))
>> +    (par-for-each check-machine-availability machines)))
>
> Why not!  IMO this should go in a separate patch, though, since it’s not
> related.

For me, it is related in that retrying all the checks of *every* build
offload machine would be too expensive; it already takes 32 s for my 4
offload machines; retrying this for up to 3 times would mean waiting for
a minute and half, which I don't find reasonable (imagine on berlin!).

>> +(define (check-machine-availability machine)
>> +  "Check whether MACHINE is available.  Exit with an error upon failure."
>> +  ;; Sometimes, the machine remote port may return EOF, presumably because the
>> +  ;; connection was lost.  Retry up to 3 times.
>> +  (let loop ((retries 3))
>> +    (guard (c ((inferior-connection-lost? c)
>> +               (let ((retries-left (1- retries)))
>> +                 (if (> retries-left 0)
>> +                     (begin
>> +                       (format (current-error-port)
>> +                               (G_ "connection to machine ~s lost; retrying~%")
>> +                               (build-machine-name machine))
>> +                       (loop (retries-left)))
>> +                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
>> +                            (build-machine-name machine))))))
>
> I’m afraid we’re papering over problems here.

I had that thought too, but then also realized that even if this was
papering over a problem, it'd be a good one to paper over as this
problem can legitimately happen in practice, due to the network's
inherently shaky nature.  It seems better to be ready for it.  Also, my
hopes in being able to troubleshoot such a difficult to reproduce
networking issue are rather low.

> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
> berlin enough to reproduce the issue?  If so, we could monitor/strace
> sshd on overdrive1 to get a better understanding of what’s going on.

It's actually difficult to trigger it; it seems to happen mostly on the
first try after a long time without connecting to the machine; on the
2nd and later tries, everything is smooth.  Waiting a few minutes is not
enough to re-trigger the problem.

I've managed to see the problem a few lucky times with:

--8<---------------cut here---------------start------------->8---
while true; do guix offload test /etc/guix/machines.scm overdrive1; done
--8<---------------cut here---------------end--------------->8---

I don't have a password set for my user on overdrive1, so can't attach
strace to sshd, but yeah, we could try to capture it and see if we can
understand what's going on.

Attached is v2 of the patch, with the match clause fixed.


[-- Attachment #2: 0001-offload-Handle-a-possible-EOF-response-from-read-rep.patch --]
[-- Type: text/x-patch, Size: 7300 bytes --]

From c52172502749a4d194dc51db9d2c394cb15e8d07 Mon Sep 17 00:00:00 2001
From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Date: Tue, 25 May 2021 08:42:06 -0400
Subject: [PATCH] offload: Handle a possible EOF response from
 read-repl-response.

Fixes <https://issues.guix.gnu.org/41625>.

* guix/scripts/offload.scm (check-machine-availability): Refactor so that it
takes a single machine object, to allow for retrying a single machine.  Handle
the case where the checks raised an exception due to the connection to the
build machine having been lost, and retry up to 3 times.  Ensure the cleanup
code is run in all situations.
(check-machines-availability): New procedure.  Call
CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
twice as fast with 4 build machines, from ~30 s to ~15 s).
* guix/inferior.scm (&inferior-connection-lost): New condition type.
(read-repl-response): Raise a condition of the above type when reading EOF
from the build machine's port.
---
 guix/inferior.scm        |  9 ++++++++
 guix/scripts/offload.scm | 50 +++++++++++++++++++++++++++++-----------
 2 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/guix/inferior.scm b/guix/inferior.scm
index 7c8e478f2a..45d9d843e3 100644
--- a/guix/inferior.scm
+++ b/guix/inferior.scm
@@ -1,5 +1,6 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2018, 2019, 2020, 2021 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -70,6 +71,7 @@
             inferior-exception-arguments
             inferior-exception-inferior
             inferior-exception-stack
+            inferior-connection-lost?
             read-repl-response
 
             inferior-packages
@@ -228,6 +230,9 @@ equivalent.  Return #f if the inferior could not be launched."
   (inferior   inferior-exception-inferior)        ;<inferior> | #f
   (stack      inferior-exception-stack))          ;list of (FILE COLUMN LINE)
 
+(define-condition-type &inferior-connection-lost &error
+  inferior-connection-lost?)
+
 (define* (read-repl-response port #:optional inferior)
   "Read a (guix repl) response from PORT and return it as a Scheme object.
 Raise '&inferior-exception' when an exception is read from PORT."
@@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
   (match (read port)
     (('values objects ...)
      (apply values (map sexp->object objects)))
+    ;; Unexpectedly read EOF from the port.  This can happen for example when
+    ;; the underlying connection for PORT was lost with Guile-SSH.
+    ((? eof-object?)
+     (raise (condition (&inferior-connection-lost))))
     (('exception ('arguments key objects ...)
                  ('stack frames ...))
      ;; Protocol (0 1 1) and later.
diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scm
index 835078cb97..0271874f6a 100644
--- a/guix/scripts/offload.scm
+++ b/guix/scripts/offload.scm
@@ -1,7 +1,7 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2014, 2015, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
 ;;; Copyright © 2017 Ricardo Wurmus <rekado@elephly.net>
-;;; Copyright © 2020 Maxim Cournoyer <maxim.cournoyer@gmail.com>
+;;; Copyright © 2020, 2021 Maxim Cournoyer <maxim.cournoyer@gmail.com>
 ;;; Copyright © 2020 Julien Lepiller <julien@lepiller.eu>
 ;;;
 ;;; This file is part of GNU Guix.
@@ -53,6 +53,7 @@
   #:use-module (ice-9 regex)
   #:use-module (ice-9 format)
   #:use-module (ice-9 binary-ports)
+  #:use-module (ice-9 threads)
   #:export (build-machine
             build-machine?
             build-machine-name
@@ -684,7 +685,7 @@ daemon is not running."
           (leave (G_ "failed to import '~a' from '~a'~%")
                  item name)))))
 
-(define (check-machine-availability machine-file pred)
+(define (check-machines-availability machine-file pred)
   "Check that each machine matching PRED in MACHINE-FILE is usable as a build
 machine."
   (define (build-machine=? m1 m2)
@@ -696,18 +697,39 @@ machine."
   (let ((machines (filter pred
                           (delete-duplicates (build-machines machine-file)
                                              build-machine=?))))
-    (info (G_ "testing ~a build machines defined in '~a'...~%")
+    (info (G_ "Testing ~a build machines defined in '~a'...~%")
           (length machines) machine-file)
-    (let* ((names    (map build-machine-name machines))
-           (sockets  (map build-machine-daemon-socket machines))
-           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
-           (nodes    (map remote-inferior sessions)))
-      (for-each assert-node-has-guix nodes names)
-      (for-each assert-node-repl nodes names)
-      (for-each assert-node-can-import sessions nodes names sockets)
-      (for-each assert-node-can-export sessions nodes names sockets)
-      (for-each close-inferior nodes)
-      (for-each disconnect! sessions))))
+    (par-for-each check-machine-availability machines)))
+
+(define (check-machine-availability machine)
+  "Check whether MACHINE is available.  Exit with an error upon failure."
+  ;; Sometimes, the machine remote port may return EOF, presumably because the
+  ;; connection was lost.  Retry up to 3 times.
+  (let loop ((retries 3))
+    (guard (c ((inferior-connection-lost? c)
+               (let ((retries-left (1- retries)))
+                 (if (> retries-left 0)
+                     (begin
+                       (format (current-error-port)
+                               (G_ "connection to machine '~a' lost; retrying~%")
+                               (build-machine-name machine))
+                       (loop (retries-left)))
+                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
+                            (build-machine-name machine))))))
+      (let* ((name    (build-machine-name machine))
+             (socket  (build-machine-daemon-socket machine))
+             (session (open-ssh-session machine %short-timeout))
+             (node    (remote-inferior session)))
+        (dynamic-wind
+          (lambda () #t)
+          (lambda ()
+            (assert-node-has-guix node name)
+            (assert-node-repl node name)
+            (assert-node-can-import session node name socket)
+            (assert-node-can-export session node name socket))
+          (lambda ()
+            (close-inferior node)
+            (disconnect! session)))))))
 
 (define (check-machine-status machine-file pred)
   "Print the load of each machine matching PRED in MACHINE-FILE."
@@ -824,7 +846,7 @@ machine."
                        ((file) (values file (const #t)))
                        (()     (values %machine-file (const #t)))
                        (x      (leave (G_ "wrong number of arguments~%"))))))
-         (check-machine-availability (or file %machine-file) pred))))
+         (check-machines-availability (or file %machine-file) pred))))
     (("status" rest ...)
      (with-error-handling
        (let-values (((file pred)
-- 
2.31.1


[-- Attachment #3: Type: text/plain, Size: 16 bytes --]


Thanks!

Maxim

  reply	other threads:[~2021-05-26  3:19 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-31  9:51 bug#41625: Sporadic guix-offload crashes due to EOF errors Marius Bakke
2020-05-31 10:12 ` Marius Bakke
2020-05-31 11:21   ` Marius Bakke
2020-06-04 12:05     ` Ludovic Courtès
2021-05-24  5:33       ` Maxim Cournoyer
2021-05-25 15:50         ` bug#41625: [PATCH] offload: Handle a possible EOF response from read-repl-response Maxim Cournoyer
2021-05-25 20:27           ` Ludovic Courtès
2021-05-26  3:18             ` Maxim Cournoyer [this message]
2021-05-26  9:14               ` bug#41625: [PATCH v2] " Ludovic Courtès
2021-05-27 11:49                 ` Maxim Cournoyer
2021-05-27 14:57                 ` bug#41625: [PATCH v3] " Maxim Cournoyer
2021-07-05  8:57                   ` bug#41625: Sporadic guix-offload crashes due to EOF errors Ludovic Courtès
2021-09-24  4:53                     ` Maxim Cournoyer
2021-09-24  4:55                     ` Maxim Cournoyer
2021-05-27 17:20                 ` bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response Maxim Cournoyer
2021-05-29 19:24                   ` Ludovic Courtès
2021-05-26 15:48               ` Marius Bakke
2021-05-27 11:51                 ` Maxim Cournoyer
2022-03-26  5:03                   ` bug#41625: Sporadic guix-offload crashes due to EOF errors Maxim Cournoyer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87mtsikwsm.fsf_-_@gmail.com \
    --to=maxim.cournoyer@gmail.com \
    --cc=41625@debbugs.gnu.org \
    --cc=ludo@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).