From: Ludovic Courtès
To: Maxim Cournoyer
Cc: 41625@debbugs.gnu.org
Subject: bug#41625: Sporadic guix-offload crashes due to EOF errors
Date: Mon, 05 Jul 2021 10:57:07 +0200
Message-ID: <878s2lw330.fsf_-_@gnu.org>
In-Reply-To: <87r1hsjkbv.fsf_-_@gmail.com> (Maxim Cournoyer's message of "Thu, 27 May 2021 10:57:24 -0400")
References: <87mtsky9um.fsf@gmail.com> <20210525155003.27590-1-maxim.cournoyer@gmail.com> <875yz61rvt.fsf@gnu.org> <87mtsikwsm.fsf_-_@gmail.com> <87fsy9x3ev.fsf@gnu.org> <87r1hsjkbv.fsf_-_@gmail.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
List-Id: Bug reports for GNU Guix

Hi,

Maxim Cournoyer skribis:

> Now that I have root access to overdrive1, I could strace the sshd
> process (I just did 'strace -p340', noting the process of sshd displayed
> with 'herd status sshd'):
>
> pselect6(87, [3 4], NULL, NULL, NULL, NULL) = 1 (in [3])
> accept(3, {sa_family=AF_INET, sin_port=htons(33262), sin_addr=inet_addr("66.158.152.121")}, [128->16]) = 5
> fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
> pipe2([6, 7], 0) = 0
> socketpair(AF_UNIX, SOCK_STREAM, 0, [8, 9]) = 0
> clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xffff8e0ef0e0) = 644
> close(7) = 0
> close(9) = 0
> write(8, "\0\0\1\245\0", 5) = 5
> write(8, "\0\0\1\234\nPort 22\nPermitRootLogin no\n"..., 420) = 420
> close(8) = 0
> close(5) = 0
> getpid() = 340
> getpid() = 340
> getpid() = 340
> getpid() = 340
> getpid() = 340
> getpid() = 340
> getpid() = 340
> pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
> read(6, "\0", 1) = 1
> pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
> read(6, "", 1) = 0

OK, so it looks as if the client disconnected right away.  Hard to tell
exactly what happened.  :-/

Perhaps turning on libssh debugging on the client side could help (by
uncommenting “#:log-verbosity 'protocol” in (guix ssh)).

> From c7b2ec1c58adf8c795df0a6aaf075dbc331f41e8 Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer
> Date: Thu, 27 May 2021 08:44:44 -0400
> Subject: [PATCH 1/2] offload: Parallelize machine check in offload test.
>
> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it
> takes a single machine object.  Ensure the cleanup code is always run.
> (check-machines-availability): New procedure.  Call
> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
> twice as fast with 4 build machines, from ~30 s to ~15 s).

I remain wary of this change, because that could lead to subtle
non-deterministic bugs (of the kind that keeps you busy for weeks) and
because I personally don’t mind if ‘guix offload test’ takes 30s on
berlin, and because the intermingled output may make diagnostics less
clear.

> From b5558777617e4674a150895458d57d202de56120 Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer
> Date: Tue, 25 May 2021 08:42:06 -0400
> Subject: [PATCH 2/2] offload: Handle a possible EOF response from
>  read-repl-response.
>
> Partially fixes .
>
> * guix/scripts/offload.scm (check-machine-availability): Handle the case where
> the checks raised an exception due to receiving EOF prematurely, and retry up
> to 3 times.
> * guix/inferior.scm (&inferior-premature-eof): New condition type.
> (read-repl-response): Raise a condition of the above type when reading EOF
> from the build machine's port.

LGTM!

Thanks,
Ludo’.
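
For readers following the thread, the behaviour described in the second commit log (raise a dedicated condition when EOF is read prematurely, and retry the check up to 3 times) can be illustrated roughly as below. This is only a sketch in Python, not the Guile code from the patch: the names `PrematureEOF`, `read_response` and `check_with_retries` are made up for the illustration, and the build machine is simulated with in-memory streams.

```python
import io


class PrematureEOF(Exception):
    """Hypothetical stand-in for a 'premature EOF' condition type."""


def read_response(port):
    """Return one reply line from PORT, or raise PrematureEOF at end of file."""
    line = port.readline()
    if line == "":  # readline() returns "" only when the stream is at EOF
        raise PrematureEOF("peer closed the connection before replying")
    return line.rstrip("\n")


def check_with_retries(connect, max_attempts=3):
    """Open a fresh port with CONNECT and read a reply, retrying on EOF.

    The last PrematureEOF is re-raised once MAX_ATTEMPTS is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return read_response(connect())
        except PrematureEOF:
            if attempt == max_attempts:
                raise


# Simulated machine: the first two connections hit EOF, the third one answers.
streams = iter([io.StringIO(""), io.StringIO(""), io.StringIO("ok\n")])
result = check_with_retries(lambda: next(streams))
```

Using a distinct exception type means only the premature-EOF case is retried; any other failure still propagates immediately, which keeps unrelated errors visible.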