From: ludo@gnu.org (Ludovic Courtès)
Subject: bug#33410: Offloaded builds can get stuck indefinitely due to network issues
Date: Sat, 17 Nov 2018 15:21:32 +0100
Message-ID: <87efbjokcz.fsf@gnu.org>
References: <87a7m8xs42.fsf@netris.org>
In-Reply-To: <87a7m8xs42.fsf@netris.org> (Mark H. Weaver's message of "Fri, 16 Nov 2018 23:08:50 -0500")
To: Mark H Weaver
Cc: 33410@debbugs.gnu.org

Hello,

Mark H Weaver writes:

> I just discovered that 4 out of 5 armhf build slots on Hydra have been
> stuck for 24 hours, apparently after the network connections to the
> build slaves were lost, possibly due to a temporary network outage.
>
> I've seen this kind of thing happen periodically since we switched to
> using guile-ssh for offloaded builds.

Which guix-daemon version is hydra running?
Commit a708de151c255712071e42e5c8284756b51768cd adds a safeguard to make
sure timeouts are honored, though there might be some cases where it
doesn’t quite work as expected (I suspect libssh handles EINTR
internally by looping, in which case our signal handling async doesn’t
get a chance to run.)

> On Hydra I can monitor the builds and investigate when a given build
> seems to be taking far too long, and I can kill those jobs to free up
> the build slots. There's no way to kill the builds from Hydra's web
> interface, but I can kill them manually by logging into Hydra.
>
> This might become a more serious problem on Berlin, as we add ARM build
> slaves that are not on the same local network as Berlin itself, until
> the web interface allows for this kind of monitoring and intervention.

The current situation on berlin is suboptimal: I run ‘guix processes’
when I suspect something is wrong, and that’s how I found about .

Thanks,
Ludo’.