Subject: bug#34033: Offloading sometimes hangs
From: Ludovic Courtès
To: Mathieu Othacehe
Date: Fri, 03 Jul 2020 15:58:55 +0200
Message-ID: <871rlsog2o.fsf@gnu.org>
In-Reply-To: <877dvlkriv.fsf@gnu.org> (Mathieu Othacehe's message of "Fri, 03 Jul 2020 09:05:12 +0200")
References: <87o98obikk.fsf@gnu.org> <87fttuq2mz.fsf@gnu.org> <87pn9ec82g.fsf@gnu.org> <877dvlkriv.fsf@gnu.org>
List-Id: Bug reports for GNU Guix
Cc: 34033@debbugs.gnu.org

Hi!

Mathieu Othacehe wrote:

>> Something is going wrong here! I'll keep investigating.
>
> To help us investigate those issues I added a "/status" page, which is
> also accessible from a new drop-down menu in the Cuirass navigation bar.
>
> See, https://ci.guix.gnu.org/status.

Nice! So it’s roughly like the info at /api/queue, but filtered to
running builds, right?

> Hydra has the same interface, but also a "Machine status" page, that
> breaks down the running builds machine per machine. I plan to implement
> that one next. Reading Hydra code, I also discovered that some part of
> the offloading is directly done from Hydra, which talks with the
> nix-daemon of the connected build machines, interesting!

Yes, Hydra does most of the scheduling by itself. Since this is
redundant with what the daemon + offload do, I thought Cuirass shouldn’t
do any scheduling at all and instead let the daemon take care of it all.

This has advantages (the daemon has a global view and can achieve better
scheduling), and drawbacks (the protocol requires us to wait for
‘build-things’ completion before we can queue more builds, and
scheduling decisions are almost invisible to Cuirass).

> While I'm writing, we have 5 running builds for ~1 hour, and 76040 queued
> builds. Given the computing power of Berlin, there must be a bottleneck
> somewhere.

Yes! I’ve often run “guix processes” on berlin and then straced the
‘SessionPID’ process.
It’s insightful because you sometimes see that the daemon is stuck
waiting for a machine to offload to, and sometimes it’s stuck waiting
for a build that will perhaps just eventually time out…

Ludo’.
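
PS: The “guix processes” + strace routine described above can be
sketched in Python. This is only a sketch under assumptions: it assumes
the output is made of “Field: value” lines with one ‘SessionPID’ field
per session, the sample text is made up for illustration (it is not real
output from berlin), and the helper names session_pids and
strace_commands are hypothetical.

```python
def session_pids(processes_output):
    """Collect the SessionPID value of every record in 'Field: value' text."""
    pids = []
    for line in processes_output.splitlines():
        # Split on the first colon only, so values containing ':' are kept whole.
        field, sep, value = line.partition(":")
        if sep and field.strip() == "SessionPID":
            pids.append(int(value.strip()))
    return pids


def strace_commands(pids):
    """Build one 'strace -p PID' command line per daemon session."""
    return [f"strace -p {pid}" for pid in pids]


# Hypothetical sample in the assumed format; not captured from a real machine.
SAMPLE = """\
SessionPID: 4278
ClientPID: 4250
ClientCommand: guix build hello

SessionPID: 5012
ClientPID: 4990
ClientCommand: cuirass
"""

if __name__ == "__main__":
    # Print the commands instead of running them: attaching to the
    # daemon's session processes normally requires root.
    for command in strace_commands(session_pids(SAMPLE)):
        print(command)
```

On a real machine one would feed it the actual output of “guix
processes” instead of SAMPLE and run the printed commands by hand.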