Hi Ludovic,

Ludovic Courtès writes:

[...]

>>  (define* (read-repl-response port #:optional inferior)
>>    "Read a (guix repl) response from PORT and return it as a Scheme object.
>>  Raise '&inferior-exception' when an exception is read from PORT."
>> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
>>    (match (read port)
>>      (('values objects ...)
>>       (apply values (map sexp->object objects)))
>> +    ;; Unexpectedly read EOF from the port.  This can happen for example when
>> +    ;; the underlying connection for PORT was lost with Guile-SSH.
>> +    (? eof-object?
>> +       (raise (condition (&inferior-connection-lost))))
>
> The match clause syntax is incorrect; should be:
>
>   ((? eof-object?)
>    (raise …))

Good catch, fixed.

>> +  (info (G_ "Testing ~a build machines defined in '~a'...~%")
>>          (length machines) machine-file)
>> -  (let* ((names    (map build-machine-name machines))
>> -         (sockets  (map build-machine-daemon-socket machines))
>> -         (sessions (map (cut open-ssh-session <> %short-timeout) machines))
>> -         (nodes    (map remote-inferior sessions)))
>> -    (for-each assert-node-has-guix nodes names)
>> -    (for-each assert-node-repl nodes names)
>> -    (for-each assert-node-can-import sessions nodes names sockets)
>> -    (for-each assert-node-can-export sessions nodes names sockets)
>> -    (for-each close-inferior nodes)
>> -    (for-each disconnect! sessions))))
>> +  (par-for-each check-machine-availability machines)))
>
> Why not!  IMO this should go in a separate patch, though, since it’s not
> related.

For me, it is related, in that retrying all the checks of *every* build
offload machine would be too expensive: it already takes 32 s for my 4
offload machines, so retrying up to 3 times would mean waiting about a
minute and a half, which I don't find reasonable (imagine on berlin!).

>> +(define (check-machine-availability machine)
>> +  "Check whether MACHINE is available.  Exit with an error upon failure."
>> +  ;; Sometimes, the machine remote port may return EOF, presumably because the
>> +  ;; connection was lost.  Retry up to 3 times.
>> +  (let loop ((retries 3))
>> +    (guard (c ((inferior-connection-lost? c)
>> +               (let ((retries-left (1- retries)))
>> +                 (if (> retries-left 0)
>> +                     (begin
>> +                       (format (current-error-port)
>> +                               (G_ "connection to machine ~s lost; retrying~%")
>> +                               (build-machine-name machine))
>> +                       (loop retries-left))
>> +                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
>> +                            (build-machine-name machine))))))
>
> I’m afraid we’re papering over problems here.

I had that thought too, but then realized that even if this is papering
over a problem, it is a good one to paper over: it can legitimately
happen in practice, given the network's inherently shaky nature.  It
seems better to be ready for it.  Also, my hopes of being able to
troubleshoot such a difficult-to-reproduce networking issue are rather
low.

> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
> berlin enough to reproduce the issue?  If so, we could monitor/strace
> sshd on overdrive1 to get a better understanding of what’s going on.

It's actually difficult to trigger; it seems to happen mostly on the
first try after a long time without connecting to the machine.  On the
2nd and later tries, everything is smooth.  Waiting a few minutes is not
enough to re-trigger the problem.  I've managed to see the problem a few
lucky times with:

--8<---------------cut here---------------start------------->8---
while true; do guix offload test /etc/guix/machines.scm overdrive1; done
--8<---------------cut here---------------end--------------->8---

I don't have a password set for my user on overdrive1, so I can't attach
strace to sshd, but yeah, we could try to capture it and see if we can
understand what's going on.

Attached is v2 of the patch, with the match clause fixed.
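
For reference, here is a minimal, self-contained sketch of the two
pieces discussed above: a 'match' clause that detects EOF with the
parenthesized (? eof-object?) pattern, and a 'guard' loop that retries a
bounded number of times.  It is not the patch itself.  The names
'&connection-lost', 'read-response' and 'call-with-retries' are
hypothetical stand-ins chosen for illustration, and the sketch returns
the symbol 'gave-up where the patch calls 'leave'.

```scheme
(use-modules (ice-9 match)
             (srfi srfi-34)             ;guard, raise
             (srfi srfi-35))            ;condition types

;; Hypothetical stand-in for the patch's '&inferior-connection-lost'.
(define-condition-type &connection-lost &error
  connection-lost?)

(define (read-response port)
  "Read one response from PORT and return its objects as a list.
Raise '&connection-lost' when EOF is read instead."
  (match (read port)
    (('values objects ...)
     objects)
    ;; The predicate pattern needs its own parentheses: (? eof-object?).
    ((? eof-object?)
     (raise (condition (&connection-lost))))))

(define (call-with-retries thunk retries)
  "Call THUNK, making at most RETRIES attempts when it raises
'&connection-lost'.  Return 'gave-up if every attempt failed."
  (let loop ((retries retries))
    (guard (c ((connection-lost? c)
               (let ((retries-left (1- retries)))
                 (if (> retries-left 0)
                     (loop retries-left) ;pass the number, do not apply it
                     'gave-up))))
      (thunk))))
```

With these definitions, reading from a string port that holds
"(values 1 2)" yields the list (1 2), while wrapping a read from an
empty string port in 'call-with-retries' with a count of 3 makes three
attempts and then returns 'gave-up.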