From: Ludovic Courtès
To: Maxim Cournoyer
Cc: 41625@debbugs.gnu.org
Subject: bug#41625: [PATCH] offload: Handle a possible EOF response from read-repl-response.
Date: Tue, 25 May 2021 22:27:02 +0200
Message-ID: <875yz61rvt.fsf@gnu.org>
In-Reply-To: <20210525155003.27590-1-maxim.cournoyer@gmail.com> (Maxim Cournoyer's message of "Tue, 25 May 2021 11:50:03 -0400")
References: <87mtsky9um.fsf@gmail.com> <20210525155003.27590-1-maxim.cournoyer@gmail.com>

Hi!

Maxim Cournoyer skribis:

> Fixes .
>
> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it
> takes a single machine object, to allow for retrying a single machine.  Handle
> the case where the checks raised an exception due to the connection to the
> build machine having been lost, and retry up to 3 times.  Ensure the cleanup
> code is run in all situations.
> (check-machines-availability): New procedure.  Call
> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
> twice as fast with 4 build machines, from ~30 s to ~15 s).
> * guix/inferior.scm (&inferior-connection-lost): New condition type.
> (read-repl-response): Raise a condition of the above type when reading EOF
> from the build machine's port.

[...]

> +(define-condition-type &inferior-connection-lost &error
> +  inferior-connection-lost?)
> +
>  (define* (read-repl-response port #:optional inferior)
>    "Read a (guix repl) response from PORT and return it as a Scheme object.
>  Raise '&inferior-exception' when an exception is read from PORT."
> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
>    (match (read port)
>      (('values objects ...)
>       (apply values (map sexp->object objects)))
> +    ;; Unexpectedly read EOF from the port.  This can happen for example when
> +    ;; the underlying connection for PORT was lost with Guile-SSH.
> +    (? eof-object?
> +       (raise (condition (&inferior-connection-lost))))

The match clause syntax is incorrect; should be:

  ((? eof-object?)
   (raise …))

> +  (info (G_ "Testing ~a build machines defined in '~a'...~%")
>          (length machines) machine-file)
> -  (let* ((names    (map build-machine-name machines))
> -         (sockets  (map build-machine-daemon-socket machines))
> -         (sessions (map (cut open-ssh-session <> %short-timeout) machines))
> -         (nodes    (map remote-inferior sessions)))
> -    (for-each assert-node-has-guix nodes names)
> -    (for-each assert-node-repl nodes names)
> -    (for-each assert-node-can-import sessions nodes names sockets)
> -    (for-each assert-node-can-export sessions nodes names sockets)
> -    (for-each close-inferior nodes)
> -    (for-each disconnect! sessions))))
> +  (par-for-each check-machine-availability machines)))

Why not!  IMO this should go in a separate patch, though, since it’s not
related.

> +(define (check-machine-availability machine)
> +  "Check whether MACHINE is available.  Exit with an error upon failure."
> +  ;; Sometimes, the machine remote port may return EOF, presumably because the
> +  ;; connection was lost.  Retry up to 3 times.
> +  (let loop ((retries 3))
> +    (guard (c ((inferior-connection-lost? c)
> +               (let ((retries-left (1- retries)))
> +                 (if (> retries-left 0)
> +                     (begin
> +                       (format (current-error-port)
> +                               (G_ "connection to machine ~s lost; retrying~%")
> +                               (build-machine-name machine))
> +                       (loop (retries-left)))
> +                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
> +                            (build-machine-name machine))))))

I’m afraid we’re papering over problems here.

Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
berlin enough to reproduce the issue?  If so, we could monitor/strace
sshd on overdrive1 to get a better understanding of what’s going on.

WDYT?

Thanks,
Ludo’.
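
For illustration, here is a minimal, self-contained sketch of the
parenthesized ‘(? eof-object?)’ clause discussed above.  It is not taken
from the patch: ‘&connection-lost’, ‘read-response’, and ‘try-read’ are
hypothetical names, and a string port stands in for the Guile-SSH
channel.  As far as I understand (ice-9 match), the unparenthesized form
would be read as a clause whose pattern is the plain identifier ‘?’,
which would match any value rather than only EOF.

```scheme
(use-modules (ice-9 match)
             (srfi srfi-34)             ;raise, guard
             (srfi srfi-35))            ;define-condition-type, condition

;; Hypothetical stand-in for '&inferior-connection-lost'.
(define-condition-type &connection-lost &error
  connection-lost?)

(define (read-response port)
  "Read one response from PORT; raise '&connection-lost' upon EOF."
  (match (read port)
    (('values objects ...)
     objects)
    ;; The whole pattern must be wrapped in its own parentheses.
    ((? eof-object?)
     (raise (condition (&connection-lost))))))

(define (try-read str)
  "Read a response from STR, returning the symbol 'connection-lost' on EOF."
  (guard (c ((connection-lost? c) 'connection-lost))
    (call-with-input-string str read-response)))

(try-read "(values 1 2)")               ;the list (1 2)
(try-read "")                           ;the symbol connection-lost
```

An empty string makes ‘read’ return the EOF object right away, which is
roughly the situation one would expect when the remote end goes away.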