From: Mark H Weaver
Subject: bug#33410: Offloaded builds can get stuck indefinitely due to network issues
Date: Fri, 16 Nov 2018 23:08:50 -0500
Message-ID: <87a7m8xs42.fsf@netris.org>
List-Id: Bug reports for GNU Guix
To: 33410@debbugs.gnu.org

I just discovered that 4 out of 5 armhf build slots on Hydra have been stuck for 24 hours, apparently after the network connections to the build slaves
were lost, possibly due to a temporary network outage. I've seen this kind of thing happen periodically since we switched to using guile-ssh for offloaded builds.

On Hydra I can monitor the builds and investigate when a given build seems to be taking far too long, and I can kill those jobs to free up the build slots. There's no way to kill the builds from Hydra's web interface, but I can kill them manually by logging into Hydra.

This might become a more serious problem on Berlin, as we add ARM build slaves that are not on the same local network as Berlin itself, until the web interface allows for this kind of monitoring and intervention.

      Mark