unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
From: "Ludovic Courtès" <ludo@gnu.org>
To: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Cc: 41625@debbugs.gnu.org
Subject: bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
Date: Wed, 26 May 2021 11:14:32 +0200	[thread overview]
Message-ID: <87fsy9x3ev.fsf@gnu.org> (raw)
In-Reply-To: <87mtsikwsm.fsf_-_@gmail.com> (Maxim Cournoyer's message of "Tue,  25 May 2021 23:18:17 -0400")

Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

>>> +    (info (G_ "Testing ~a build machines defined in '~a'...~%")
>>>            (length machines) machine-file)
>>> -    (let* ((names    (map build-machine-name machines))
>>> -           (sockets  (map build-machine-daemon-socket machines))
>>> -           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
>>> -           (nodes    (map remote-inferior sessions)))
>>> -      (for-each assert-node-has-guix nodes names)
>>> -      (for-each assert-node-repl nodes names)
>>> -      (for-each assert-node-can-import sessions nodes names sockets)
>>> -      (for-each assert-node-can-export sessions nodes names sockets)
>>> -      (for-each close-inferior nodes)
>>> -      (for-each disconnect! sessions))))
>>> +    (par-for-each check-machine-availability machines)))
>>
>> Why not!  IMO this should go in a separate patch, though, since it’s not
>> related.
>
> For me, it is related in that retrying all the checks of *every* build
> offload machine would be too expensive; it already takes 32 s for my 4
> offload machines; retrying this for up to 3 times would mean waiting for
> a minute and half, which I don't find reasonable (imagine on berlin!).

I see.  So I’d say it’s a prerequisite (a patch that must come before)
but not entirely the same thing.  I’m nitpicking!

We should make sure it doesn’t trigger thread-safety issues in libssh or
anything like that (running it repeatedly on a large machines.scm should
give us some confidence).

>>> +(define (check-machine-availability machine)
>>> +  "Check whether MACHINE is available.  Exit with an error upon failure."
>>> +  ;; Sometimes, the machine remote port may return EOF, presumably because the
>>> +  ;; connection was lost.  Retry up to 3 times.
>>> +  (let loop ((retries 3))
>>> +    (guard (c ((inferior-connection-lost? c)
>>> +               (let ((retries-left (1- retries)))
>>> +                 (if (> retries-left 0)
>>> +                     (begin
>>> +                       (format (current-error-port)
>>> +                               (G_ "connection to machine ~s lost; retrying~%")
>>> +                               (build-machine-name machine))
>>> +                       (loop (retries-left)))
>>> +                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
>>> +                            (build-machine-name machine))))))
>>
>> I’m afraid we’re papering over problems here.
>
> I had that thought too, but then also realized that even if this was
> papering over a problem, it'd be a good one to paper over as this
> problem can legitimately happen in practice, due to the network's
> inherently shaky nature.  It seems better to be ready for it.  Also, my
> hopes in being able to troubleshoot such a difficult to reproduce
> networking issue are rather low.

Yes, but note that this is just for ‘guix offload test’.  The actual
code run while offloading will still fail badly.

>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
>> berlin enough to reproduce the issue?  If so, we could monitor/strace
>> sshd on overdrive1 to get a better understanding of what’s going on.
>
> It's actually difficult to trigger it; it seems to happen mostly on the
> first try after a long time without connecting to the machine; on the
> 2nd and later tries, everything is smooth.  Waiting a few minutes is not
> enough to re-trigger the problem.
>
> I've managed to see the problem a few lucky times with:
>
> while true; do guix offload test /etc/guix/machines.scm overdrive1; done
>
> I don't have a password set for my user on overdrive1, so can't attach
> strace to sshd, but yeah, we could try to capture it and see if we can
> understand what's going on.

OK.

> From c52172502749a4d194dc51db9d2c394cb15e8d07 Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
> Date: Tue, 25 May 2021 08:42:06 -0400
> Subject: [PATCH] offload: Handle a possible EOF response from
>  read-repl-response.
>
> Fixes <https://issues.guix.gnu.org/41625>.
>
> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it
> takes a single machine object, to allow for retrying a single machine.  Handle
> the case where the checks raised an exception due to the connection to the
> build machine having been lost, and retry up to 3 times.  Ensure the cleanup
> code is run in all situations.
> (check-machines-availability): New procedure.  Call
> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
> twice as fast with 4 build machines, from ~30 s to ~15 s).
> * guix/inferior.scm (&inferior-connection-lost): New condition type.
> (read-repl-response): Raise a condition of the above type when reading EOF
> from the build machine's port.

[...]

> +(define-condition-type &inferior-connection-lost &error
> +  inferior-connection-lost?)

Perhaps worth adding an ‘inferior’ and/or ‘port’ field.  That would
allow the handler to present more information as to which inferior is
failing.

Maybe ‘premature-eof’ would be more accurate than ‘connection-lost’.

> +                       (format (current-error-port)
> +                               (G_ "connection to machine '~a' lost; retrying~%")
> +                               (build-machine-name machine))

You can use ‘info’ instead of ‘format’.

Otherwise LGTM, thanks!

Ludo’.




  reply	other threads:[~2021-05-26  9:19 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-31  9:51 bug#41625: Sporadic guix-offload crashes due to EOF errors Marius Bakke
2020-05-31 10:12 ` Marius Bakke
2020-05-31 11:21   ` Marius Bakke
2020-06-04 12:05     ` Ludovic Courtès
2021-05-24  5:33       ` Maxim Cournoyer
2021-05-25 15:50         ` bug#41625: [PATCH] offload: Handle a possible EOF response from read-repl-response Maxim Cournoyer
2021-05-25 20:27           ` Ludovic Courtès
2021-05-26  3:18             ` bug#41625: [PATCH v2] " Maxim Cournoyer
2021-05-26  9:14               ` Ludovic Courtès [this message]
2021-05-27 11:49                 ` Maxim Cournoyer
2021-05-27 14:57                 ` bug#41625: [PATCH v3] " Maxim Cournoyer
2021-07-05  8:57                   ` bug#41625: Sporadic guix-offload crashes due to EOF errors Ludovic Courtès
2021-09-24  4:53                     ` Maxim Cournoyer
2021-09-24  4:55                     ` Maxim Cournoyer
2021-05-27 17:20                 ` bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response Maxim Cournoyer
2021-05-29 19:24                   ` Ludovic Courtès
2021-05-26 15:48               ` Marius Bakke
2021-05-27 11:51                 ` Maxim Cournoyer
2022-03-26  5:03                   ` bug#41625: Sporadic guix-offload crashes due to EOF errors Maxim Cournoyer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87fsy9x3ev.fsf@gnu.org \
    --to=ludo@gnu.org \
    --cc=41625@debbugs.gnu.org \
    --cc=maxim.cournoyer@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).