From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: bug#41625: [PATCH v2] offload: Handle a possible EOF response from read-repl-response.
From: Ludovic Courtès
To: Maxim Cournoyer
Cc: 41625@debbugs.gnu.org
Date: Wed, 26 May 2021 11:14:32 +0200
In-Reply-To: <87mtsikwsm.fsf_-_@gmail.com> (Maxim Cournoyer's message of "Tue, 25 May 2021 23:18:17 -0400")
References: <87mtsky9um.fsf@gmail.com> <20210525155003.27590-1-maxim.cournoyer@gmail.com> <875yz61rvt.fsf@gnu.org> <87mtsikwsm.fsf_-_@gmail.com>
Message-ID: <87fsy9x3ev.fsf@gnu.org>

Hi,

Maxim Cournoyer wrote:

>>> +    (info (G_ "Testing ~a build machines defined in '~a'...~%")
>>>            (length machines) machine-file)
>>> -    (let* ((names    (map build-machine-name machines))
>>> -           (sockets  (map build-machine-daemon-socket machines))
>>> -           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
>>> -           (nodes    (map remote-inferior sessions)))
>>> -      (for-each assert-node-has-guix nodes names)
>>> -      (for-each assert-node-repl nodes names)
>>> -      (for-each assert-node-can-import sessions nodes names sockets)
>>> -      (for-each assert-node-can-export sessions nodes names sockets)
>>> -      (for-each close-inferior nodes)
>>> -      (for-each disconnect! sessions))))
>>> +    (par-for-each check-machine-availability machines)))
>>
>> Why not!  IMO this should go in a separate patch, though, since it’s not
>> related.
>
> For me, it is related in that retrying all the checks of *every* build
> offload machine would be too expensive; it already takes 32 s for my 4
> offload machines; retrying this for up to 3 times would mean waiting for
> a minute and half, which I don't find reasonable (imagine on berlin!).

I see.  So I’d say it’s a prerequisite (a patch that must come before)
but not entirely the same thing.  I’m nitpicking!

We should make sure it doesn’t trigger thread-safety issues in libssh or
anything like that (running it repeatedly on a large machines.scm should
give us some confidence).

>>> +(define (check-machine-availability machine)
>>> +  "Check whether MACHINE is available.  Exit with an error upon failure."
>>> +  ;; Sometimes, the machine remote port may return EOF, presumably because the
>>> +  ;; connection was lost.  Retry up to 3 times.
>>> +  (let loop ((retries 3))
>>> +    (guard (c ((inferior-connection-lost? c)
>>> +               (let ((retries-left (1- retries)))
>>> +                 (if (> retries-left 0)
>>> +                     (begin
>>> +                       (format (current-error-port)
>>> +                               (G_ "connection to machine ~s lost; retrying~%")
>>> +                               (build-machine-name machine))
>>> +                       (loop (retries-left)))
>>> +                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
>>> +                            (build-machine-name machine))))))
>>
>> I’m afraid we’re papering over problems here.
>
> I had that thought too, but then also realized that even if this was
> papering over a problem, it'd be a good one to paper over as this
> problem can legitimately happen in practice, due to the network's
> inherently shaky nature.  It seems better to be ready for it.  Also, my
> hopes in being able to troubleshoot such a difficult to reproduce
> networking issue are rather low.

Yes, but note that this is just for ‘guix offload test’.  The actual
code run while offloading will still fail badly.

>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
>> berlin enough to reproduce the issue?  If so, we could monitor/strace
>> sshd on overdrive1 to get a better understanding of what’s going on.
>
> It's actually difficult to trigger it; it seems to happen mostly on the
> first try after a long time without connecting to the machine; on the
> 2nd and later tries, everything is smooth.  Waiting a few minutes is not
> enough to re-trigger the problem.
>
> I've managed to see the problem a few lucky times with:
>
>   while true; do guix offload test /etc/guix/machines.scm overdrive1; done
>
> I don't have a password set for my user on overdrive1, so can't attach
> strace to sshd, but yeah, we could try to capture it and see if we can
> understand what's going on.

OK.
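
For reference, the retry logic quoted above can be reduced to a small
standalone sketch.  This is only an illustration under assumptions, not
the code from the patch: ‘call-with-retries’ and ‘flaky’ are
hypothetical names, the condition type is redefined locally as a
stand-in, and the Guix-specific ‘leave’ and ‘G_’ calls are left out so
that it runs in plain Guile.  Note that the named-let counter is passed
to ‘loop’ as a plain value, (loop retries-left); written as
(loop (retries-left)), Guile would try to apply the number as a
procedure.

```scheme
(use-modules (srfi srfi-34)    ; guard, raise
             (srfi srfi-35))   ; define-condition-type, condition

;; Local stand-in for the condition type introduced by the patch.
(define-condition-type &inferior-connection-lost &error
  inferior-connection-lost?)

;; Hypothetical helper: call THUNK, and call it again each time it
;; raises &inferior-connection-lost, for at most MAX-ATTEMPTS calls in
;; total.  Once the attempts are used up, no guard clause matches and
;; the condition propagates to the caller.
(define (call-with-retries thunk max-attempts)
  (let loop ((attempts-left max-attempts))
    (guard (c ((and (inferior-connection-lost? c)
                    (> attempts-left 1))
               ;; The counter is passed as a value, not applied.
               (loop (- attempts-left 1))))
      (thunk))))

;; Hypothetical usage: a thunk that loses the connection twice and
;; then succeeds on the third call.
(define calls 0)

(define (flaky)
  (set! calls (+ calls 1))
  (if (< calls 3)
      (raise (condition (&inferior-connection-lost)))
      'ok))

(define outcome (call-with-retries flaky 3))   ; the symbol ok
```

Keeping the retry policy in one helper like this would also make it
reusable by the actual offloading code, should we decide it needs the
same treatment.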
> From c52172502749a4d194dc51db9d2c394cb15e8d07 Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer
> Date: Tue, 25 May 2021 08:42:06 -0400
> Subject: [PATCH] offload: Handle a possible EOF response from
>  read-repl-response.
>
> Fixes .
>
> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it
> takes a single machine object, to allow for retrying a single machine.  Handle
> the case where the checks raised an exception due to the connection to the
> build machine having been lost, and retry up to 3 times.  Ensure the cleanup
> code is run in all situations.
> (check-machines-availability): New procedure.  Call
> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
> twice as fast with 4 build machines, from ~30 s to ~15 s).
> * guix/inferior.scm (&inferior-connection-lost): New condition type.
> (read-repl-response): Raise a condition of the above type when reading EOF
> from the build machine's port.

[...]

> +(define-condition-type &inferior-connection-lost &error
> +  inferior-connection-lost?)

Perhaps worth adding an ‘inferior’ and/or ‘port’ field.  That would
allow the handler to present more information as to which inferior is
failing.

Maybe ‘premature-eof’ would be more accurate than ‘connection-lost’.

> +                       (format (current-error-port)
> +                               (G_ "connection to machine '~a' lost; retrying~%")
> +                               (build-machine-name machine))

You can use ‘info’ instead of ‘format’.

Otherwise LGTM, thanks!

Ludo’.
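
PS: To illustrate the field suggestion above, here is a minimal sketch
of what a condition type carrying the port could look like.  Everything
in it is an assumption made for the example: the
‘&inferior-premature-eof’ name, the ‘port’ field and its accessor, and
the ‘read-response’ procedure, which merely stands in for
‘read-repl-response’ and does not reproduce its actual logic.

```scheme
(use-modules (srfi srfi-34)    ; guard, raise
             (srfi srfi-35))   ; define-condition-type, condition

;; Hypothetical condition type with a 'port' field, so that a handler
;; can tell which connection returned EOF.
(define-condition-type &inferior-premature-eof &error
  inferior-premature-eof?
  (port inferior-premature-eof-port))

;; Stand-in for the reader: return the next datum read from PORT, or
;; raise the condition above when PORT is at end of file.
(define (read-response port)
  (let ((datum (read port)))
    (if (eof-object? datum)
        (raise (condition (&inferior-premature-eof (port port))))
        datum)))

;; A handler can now report the offending port.
(define (read-response/report port)
  (guard (c ((inferior-premature-eof? c)
             (format (current-error-port)
                     "premature EOF while reading from ~a~%"
                     (inferior-premature-eof-port c))
             #f))
    (read-response port)))
```

With an ‘inferior’ field instead of (or in addition to) the port, the
handler in ‘guix offload’ could map the failure back to a build machine
name rather than printing a raw port.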