From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp2 ([2001:41d0:8:6d80::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms0.migadu.com with LMTPS id VsvzIjG+rWD44QAAgWs5BA (envelope-from ) for ; Wed, 26 May 2021 05:19:13 +0200 Received: from aspmx1.migadu.com ([2001:41d0:8:6d80::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp2 with LMTPS id iOYOHjG+rWAJRwAAB5/wlQ (envelope-from ) for ; Wed, 26 May 2021 03:19:13 +0000 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id 7E9B120163 for ; Wed, 26 May 2021 05:19:12 +0200 (CEST) Received: from localhost ([::1]:51400 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1llk4b-0001sT-No for larch@yhetil.org; Tue, 25 May 2021 23:19:09 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:53964) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1llk4U-0001rq-IZ for bug-guix@gnu.org; Tue, 25 May 2021 23:19:02 -0400 Received: from debbugs.gnu.org ([209.51.188.43]:35855) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1llk4U-0007m7-Ax for bug-guix@gnu.org; Tue, 25 May 2021 23:19:02 -0400 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1llk4U-0003DL-6B for bug-guix@gnu.org; Tue, 25 May 2021 23:19:02 -0400 X-Loop: help-debbugs@gnu.org Subject: bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response. Resent-From: Maxim Cournoyer Original-Sender: "Debbugs-submit" Resent-CC: bug-guix@gnu.org Resent-Date: Wed, 26 May 2021 03:19:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 41625 X-GNU-PR-Package: guix X-GNU-PR-Keywords: To: Ludovic =?UTF-8?Q?Court=C3=A8s?= Received: via spool by 41625-submit@debbugs.gnu.org id=B41625.162199911112318 (code B ref 41625); Wed, 26 May 2021 03:19:02 +0000 Received: (at 41625) by debbugs.gnu.org; 26 May 2021 03:18:31 +0000 Received: from localhost ([127.0.0.1]:47401 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1llk3y-0003Cc-QR for submit@debbugs.gnu.org; Tue, 25 May 2021 23:18:31 -0400 Received: from mail-qk1-f173.google.com ([209.85.222.173]:44702) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1llk3t-0003CL-Do for 41625@debbugs.gnu.org; Tue, 25 May 2021 23:18:28 -0400 Received: by mail-qk1-f173.google.com with SMTP id h20so17171995qko.11 for <41625@debbugs.gnu.org>; Tue, 25 May 2021 20:18:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:references:date:in-reply-to:message-id :user-agent:mime-version; bh=Bnp5qTbv1Ey6Nw+09iMMtxQO4ZjxTGise4hMO7B/N5A=; b=bUfANkc/MZ/5pFfyUxM17o6GofvIhMxUgMx9wFwgxdYx5WqKhD2TEC3Du+yfXD8P7t pOzurispLpRxnwbXlsq+mU1b28WsCqV96a0fSi3+HkSKSK746Dlc5dM4riBKMV9a+Y+X Zau/zk7+nXZWLAAU/3l4CJHW0wlfmHpxDtsSX9IzfwZ64sAgvw3cuq8OwhsSoRGSP7Be VQz9SyHVXoUFmW3pKNFcroLQvmrmq+aQPuj5edpl11VZ+RanSIKyO86xt7xjbGbZVoX8 uNip44Sn03s1bNCkY5RLxayOa5XEBd8unsgAPN6AydJfZn3Lb3cHOg4lkbGVWCbBp37R pCDw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:references:date:in-reply-to :message-id:user-agent:mime-version; bh=Bnp5qTbv1Ey6Nw+09iMMtxQO4ZjxTGise4hMO7B/N5A=; b=eyXpE3fEl3ebULFV2zSS6nch0jUUoJcopa35bHQ415iF9GGWZ/VIiXyakoiNqYAcj5 sV/MG4Cp53N4cbpVEGSxgeyOT3IT39DNVz1FCbK5NAFUXIml64D5M9GvT73TLb6zTzuL gPI9TmaHNGt/8lKg1z8b6DUWU5GIcDC0Z30dxEUlYZbel+zE2a7a/P0vPYt9/xyBFZ0R YroOO1qBGrAhtmbHRydYtSiZBawEgpoq31mUdtgN4l+jF+W0Zc5K7iU0z1wlfsh+jt1u 5mR8jaTK1CZi3ozojG19zwQvfZ1qrkwb4KG+irEhFpdSIaSfYb8Wx4onBFxa4SGr1c3Y YkHQ== X-Gm-Message-State: AOAM530PRdXq39XtH6dI2JXDeMabmH1IbvlhK9OSfwH3rGf4AuSFoRM1 SJcvVnRXrY99247VNTN2SWg= X-Google-Smtp-Source: ABdhPJwKao9MFUfpbnufe27L574H51KoJeqoHcoWqrLjhgJpkCUZ0TANU4BbsmOnwnHwnNbY6PvV1Q== X-Received: by 2002:a37:d4b:: with SMTP id 72mr34199567qkn.443.1621999099685; Tue, 25 May 2021 20:18:19 -0700 (PDT) Received: from hurd (dsl-10-130-162.b2b2c.ca. [72.10.130.162]) by smtp.gmail.com with ESMTPSA id v8sm757906qkg.102.2021.05.25.20.18.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 25 May 2021 20:18:19 -0700 (PDT) From: Maxim Cournoyer References: <87mtsky9um.fsf@gmail.com> <20210525155003.27590-1-maxim.cournoyer@gmail.com> <875yz61rvt.fsf@gnu.org> Date: Tue, 25 May 2021 23:18:17 -0400 In-Reply-To: <875yz61rvt.fsf@gnu.org> ("Ludovic =?UTF-8?Q?Court=C3=A8s?="'s message of "Tue, 25 May 2021 22:27:02 +0200") Message-ID: <87mtsikwsm.fsf_-_@gmail.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-guix@gnu.org List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: 41625@debbugs.gnu.org Errors-To: bug-guix-bounces+larch=yhetil.org@gnu.org Sender: "bug-Guix" X-Migadu-Flow: FLOW_IN ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=yhetil.org; s=key1; t=1621999152; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:resent-cc:resent-from:resent-sender: resent-message-id:in-reply-to:in-reply-to:references:references: list-id:list-help:list-unsubscribe:list-subscribe:list-post: dkim-signature; bh=Bnp5qTbv1Ey6Nw+09iMMtxQO4ZjxTGise4hMO7B/N5A=; b=qGw8XopP6salvV3oYbXJSm+ISuWYuK5NQE5P/z/0sL04haFBSdxkoywCGVffdhIBVYfTjq MKUvs0KJvQFMhsFDVWMaG9JEVtoIU5n5F11BNq/jbJBY22I3hm0LBgN0VgTsqaeJAg9wHG uEv3DAs+Lze9poph2doa2ooE9bHK4HcmKSgHYsgkXPKIEUEpnLGwy32k+UCtDxQvOR+/Uz URy2Iy7F8ACb65DanCQzxQio1U5CQai/T+5UlOq74eHa9KEze7FEcuVneLndOgcKAQ/vsc s0QFK95KPGItB5uNx98ZEsQLrwfukQ2/LgzNuU1WwuVKhHse0/EqUjZALcxy5g== ARC-Seal: i=1; s=key1; d=yhetil.org; t=1621999152; a=rsa-sha256; cv=none; b=DBfhE9M6fotow5K4qQBMpelw01/LfMrC+wjjGqILplpyjl8s7GJOwsG+rNM0pdzbw+9XwR Y82Z6lidAHXZzhhF3VxBIyegsh4NdhmD6GeZ0VkPf3NrAUbVbHVMfHoCWuL4mNPb9GhdqR jFDbM0wu/N+gRuyFcXrQYDEjLRIvnBO0RgkM+lqWWMtKygxwpkVJtxF6V+cHPRA2mI+RpK oGyifVuRMs0wslXcFJiDwC2HxdMiZS4ht6iWPkabiPABzkHJUSfLNMoMfKO9XzICwy4MnS aFzRziI5tvTaEPTUtLvulvNUzK2TwZ+GIlR1WMpIqGtDXlxUAvw30hcDsBvDkQ== ARC-Authentication-Results: i=1; aspmx1.migadu.com; dkim=fail ("headers rsa verify failed") header.d=gmail.com header.s=20161025 header.b="bUfANkc/"; dmarc=fail reason="SPF not aligned (relaxed)" header.from=gmail.com (policy=none); spf=pass (aspmx1.migadu.com: domain of bug-guix-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=bug-guix-bounces@gnu.org X-Migadu-Spam-Score: -1.33 Authentication-Results: aspmx1.migadu.com; dkim=fail ("headers rsa verify failed") header.d=gmail.com header.s=20161025 header.b="bUfANkc/"; dmarc=fail reason="SPF not aligned (relaxed)" header.from=gmail.com (policy=none); spf=pass (aspmx1.migadu.com: domain of bug-guix-bounces@gnu.org designates 209.51.188.17 as permitted sender) smtp.mailfrom=bug-guix-bounces@gnu.org X-Migadu-Queue-Id: 7E9B120163 X-Spam-Score: -1.33 X-Migadu-Scanner: scn0.migadu.com X-TUID: qpi7FEw6912+ --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Hi Ludovic, Ludovic Court=C3=A8s writes: [...] >> (define* (read-repl-response port #:optional inferior) >> "Read a (guix repl) response from PORT and return it as a Scheme obje= ct. >> Raise '&inferior-exception' when an exception is read from PORT." >> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is re= ad from PORT." >> (match (read port) >> (('values objects ...) >> (apply values (map sexp->object objects))) >> + ;; Unexpectedly read EOF from the port. This can happen for exampl= e when >> + ;; the underlying connection for PORT was lost with Guile-SSH. >> + (? eof-object? >> + (raise (condition (&inferior-connection-lost)))) > > The match clause syntax is incorrect; should be: > > ((? eof-object?) > (raise =E2=80=A6)) Good catch, fixed. >> + (info (G_ "Testing ~a build machines defined in '~a'...~%") >> (length machines) machine-file) >> - (let* ((names (map build-machine-name machines)) >> - (sockets (map build-machine-daemon-socket machines)) >> - (sessions (map (cut open-ssh-session <> %short-timeout) mach= ines)) >> - (nodes (map remote-inferior sessions))) >> - (for-each assert-node-has-guix nodes names) >> - (for-each assert-node-repl nodes names) >> - (for-each assert-node-can-import sessions nodes names sockets) >> - (for-each assert-node-can-export sessions nodes names sockets) >> - (for-each close-inferior nodes) >> - (for-each disconnect! sessions)))) >> + (par-for-each check-machine-availability machines))) > > Why not! IMO this should go in a separate patch, though, since it=E2=80= =99s not > related. For me, it is related in that retrying all the checks of *every* build offload machine would be too expensive; it already takes 32 s for my 4 offload machines; retrying this for up to 3 times would mean waiting for a minute and half, which I don't find reasonable (imagine on berlin!). >> +(define (check-machine-availability machine) >> + "Check whether MACHINE is available. Exit with an error upon failure= ." >> + ;; Sometimes, the machine remote port may return EOF, presumably beca= use the >> + ;; connection was lost. Retry up to 3 times. >> + (let loop ((retries 3)) >> + (guard (c ((inferior-connection-lost? c) >> + (let ((retries-left (1- retries))) >> + (if (> retries-left 0) >> + (begin >> + (format (current-error-port) >> + (G_ "connection to machine ~s lost; retr= ying~%") >> + (build-machine-name machine)) >> + (loop (retries-left))) >> + (leave (G_ "connection repeatedly lost with machin= e '~a'~%") >> + (build-machine-name machine)))))) > > I=E2=80=99m afraid we=E2=80=99re papering over problems here. I had that thought too, but then also realized that even if this was papering over a problem, it'd be a good one to paper over as this problem can legitimately happen in practice, due to the network's inherently shaky nature. It seems better to be ready for it. Also, my hopes in being able to troubleshoot such a difficult to reproduce networking issue are rather low. > Is running =E2=80=98guix offload test /etc/guix/machines.scm overdrive1= =E2=80=99 on > berlin enough to reproduce the issue? If so, we could monitor/strace > sshd on overdrive1 to get a better understanding of what=E2=80=99s going = on. It's actually difficult to trigger it; it seems to happen mostly on the first try after a long time without connecting to the machine; on the 2nd and later tries, everything is smooth. Waiting a few minutes is not enough to re-trigger the problem. I've managed to see the problem a few lucky times with: --8<---------------cut here---------------start------------->8--- while true; do guix offload test /etc/guix/machines.scm overdrive1; done --8<---------------cut here---------------end--------------->8--- I don't have a password set for my user on overdrive1, so can't attach strace to sshd, but yeah, we could try to capture it and see if we can understand what's going on. Attached is v2 of the patch, with the match clause fixed. --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: attachment; filename=0001-offload-Handle-a-possible-EOF-response-from-read-rep.patch Content-Transfer-Encoding: quoted-printable >From c52172502749a4d194dc51db9d2c394cb15e8d07 Mon Sep 17 00:00:00 2001 From: Maxim Cournoyer Date: Tue, 25 May 2021 08:42:06 -0400 Subject: [PATCH] offload: Handle a possible EOF response from read-repl-response. Fixes . * guix/scripts/offload.scm (check-machine-availability): Refactor so that it takes a single machine object, to allow for retrying a single machine. Han= dle the case where the checks raised an exception due to the connection to the build machine having been lost, and retry up to 3 times. Ensure the cleanup code is run in all situations. (check-machines-availability): New procedure. Call CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about twice as fast with 4 build machines, from ~30 s to ~15 s). * guix/inferior.scm (&inferior-connection-lost): New condition type. (read-repl-response): Raise a condition of the above type when reading EOF from the build machine's port. --- guix/inferior.scm | 9 ++++++++ guix/scripts/offload.scm | 50 +++++++++++++++++++++++++++++----------- 2 files changed, 45 insertions(+), 14 deletions(-) diff --git a/guix/inferior.scm b/guix/inferior.scm index 7c8e478f2a..45d9d843e3 100644 --- a/guix/inferior.scm +++ b/guix/inferior.scm @@ -1,5 +1,6 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright =C2=A9 2018, 2019, 2020, 2021 Ludovic Court=C3=A8s +;;; Copyright =C2=A9 2021 Maxim Cournoyer ;;; ;;; This file is part of GNU Guix. ;;; @@ -70,6 +71,7 @@ inferior-exception-arguments inferior-exception-inferior inferior-exception-stack + inferior-connection-lost? read-repl-response =20 inferior-packages @@ -228,6 +230,9 @@ equivalent. Return #f if the inferior could not be lau= nched." (inferior inferior-exception-inferior) ; | #f (stack inferior-exception-stack)) ;list of (FILE COLUMN LI= NE) =20 +(define-condition-type &inferior-connection-lost &error + inferior-connection-lost?) + (define* (read-repl-response port #:optional inferior) "Read a (guix repl) response from PORT and return it as a Scheme object. Raise '&inferior-exception' when an exception is read from PORT." @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read = from PORT." (match (read port) (('values objects ...) (apply values (map sexp->object objects))) + ;; Unexpectedly read EOF from the port. This can happen for example w= hen + ;; the underlying connection for PORT was lost with Guile-SSH. + ((? eof-object?) + (raise (condition (&inferior-connection-lost)))) (('exception ('arguments key objects ...) ('stack frames ...)) ;; Protocol (0 1 1) and later. diff --git a/guix/scripts/offload.scm b/guix/scripts/offload.scm index 835078cb97..0271874f6a 100644 --- a/guix/scripts/offload.scm +++ b/guix/scripts/offload.scm @@ -1,7 +1,7 @@ ;;; GNU Guix --- Functional package management for GNU ;;; Copyright =C2=A9 2014, 2015, 2016, 2017, 2018, 2019, 2020 Ludovic Cour= t=C3=A8s ;;; Copyright =C2=A9 2017 Ricardo Wurmus -;;; Copyright =C2=A9 2020 Maxim Cournoyer +;;; Copyright =C2=A9 2020, 2021 Maxim Cournoyer ;;; Copyright =C2=A9 2020 Julien Lepiller ;;; ;;; This file is part of GNU Guix. @@ -53,6 +53,7 @@ #:use-module (ice-9 regex) #:use-module (ice-9 format) #:use-module (ice-9 binary-ports) + #:use-module (ice-9 threads) #:export (build-machine build-machine? build-machine-name @@ -684,7 +685,7 @@ daemon is not running." (leave (G_ "failed to import '~a' from '~a'~%") item name))))) =20 -(define (check-machine-availability machine-file pred) +(define (check-machines-availability machine-file pred) "Check that each machine matching PRED in MACHINE-FILE is usable as a bu= ild machine." (define (build-machine=3D? m1 m2) @@ -696,18 +697,39 @@ machine." (let ((machines (filter pred (delete-duplicates (build-machines machine-file) build-machine=3D?)))) - (info (G_ "testing ~a build machines defined in '~a'...~%") + (info (G_ "Testing ~a build machines defined in '~a'...~%") (length machines) machine-file) - (let* ((names (map build-machine-name machines)) - (sockets (map build-machine-daemon-socket machines)) - (sessions (map (cut open-ssh-session <> %short-timeout) machine= s)) - (nodes (map remote-inferior sessions))) - (for-each assert-node-has-guix nodes names) - (for-each assert-node-repl nodes names) - (for-each assert-node-can-import sessions nodes names sockets) - (for-each assert-node-can-export sessions nodes names sockets) - (for-each close-inferior nodes) - (for-each disconnect! sessions)))) + (par-for-each check-machine-availability machines))) + +(define (check-machine-availability machine) + "Check whether MACHINE is available. Exit with an error upon failure." + ;; Sometimes, the machine remote port may return EOF, presumably because= the + ;; connection was lost. Retry up to 3 times. + (let loop ((retries 3)) + (guard (c ((inferior-connection-lost? c) + (let ((retries-left (1- retries))) + (if (> retries-left 0) + (begin + (format (current-error-port) + (G_ "connection to machine '~a' lost; retry= ing~%") + (build-machine-name machine)) + (loop (retries-left))) + (leave (G_ "connection repeatedly lost with machine '= ~a'~%") + (build-machine-name machine)))))) + (let* ((name (build-machine-name machine)) + (socket (build-machine-daemon-socket machine)) + (session (open-ssh-session machine %short-timeout)) + (node (remote-inferior session))) + (dynamic-wind + (lambda () #t) + (lambda () + (assert-node-has-guix node name) + (assert-node-repl node name) + (assert-node-can-import session node name socket) + (assert-node-can-export session node name socket)) + (lambda () + (close-inferior node) + (disconnect! session))))))) =20 (define (check-machine-status machine-file pred) "Print the load of each machine matching PRED in MACHINE-FILE." @@ -824,7 +846,7 @@ machine." ((file) (values file (const #t))) (() (values %machine-file (const #t))) (x (leave (G_ "wrong number of arguments~%")))= ))) - (check-machine-availability (or file %machine-file) pred)))) + (check-machines-availability (or file %machine-file) pred)))) (("status" rest ...) (with-error-handling (let-values (((file pred) --=20 2.31.1 --=-=-= Content-Type: text/plain Thanks! Maxim --=-=-=--