From: Ludovic Courtès
To: Andreas Enge
Cc: guix-devel@gnu.org
Subject: Little progress on powerpc64le and aarch64 builds on ci.guix
In-Reply-To: (Andreas Enge's message of "Wed, 12 Jun 2024 11:47:32 +0200")
References: <87y17on4ao.fsf@gnu.org> <87bk4jjbdu.fsf@gnu.org> <87ed9a3ydf.fsf@gnu.org>
Date: Mon, 17 Jun 2024 14:07:39 +0200
Message-ID: <87iky7ai2c.fsf_-_@gnu.org>

Hello!

Andreas Enge writes:

> On Thu, Jun 06, 2024 at 07:48:27PM +0200, Andreas Enge wrote:
>> Could the graph on
>>    https://ci.guix.gnu.org/metrics
>> be augmented by the number of packages to be built for the different
>> architectures?

That would be nice, I agree (I haven’t looked much at that part of the
code).

> In that direction, the metrics now show that very few packages were built
> in the last 24 hours, except maybe for ARM (where we anyway build few
> packages). But the number of waiting builds stalls at around 280000.
>
> Are these all for ARM now? Should we cancel builds a bit more aggressively
> to make sure that recent packages are favoured?

In the meantime, here’s me doing stats-as-a-service:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo -u cuirass psql cuirass
cuirass=> select count(*) from builds where status = -2 ;
 count
--------
 284314
(1 row)

Time: 635.478 ms
cuirass=> select count(*) from builds where status = -2 and system = 'x86_64-linux';
 count
-------
     0
(1 row)

Time: 761.333 ms
cuirass=> select count(*) from builds where status = -2 and system = 'aarch64-linux';
 count
--------
 160847
(1 row)

Time: 661.968 ms
cuirass=> select count(*) from builds where status = -2 and system = 'powerpc64le-linux';
 count
--------
 119124
(1 row)

Time: 589.800 ms
cuirass=> select count(*) from builds where status = -2 and system = 'armhf-linux';
 count
-------
  4343
(1 row)

Time: 549.242 ms
cuirass=> select count(*) from builds where status = -2 and system = 'i686-linux';
 count
-------
     0
(1 row)

Time: 1088.130 ms (00:01.088)
--8<---------------cut here---------------end--------------->8---

So lots of AArch64 and POWER9 builds.

Executive summary:

  1. Of all the AArch64 build machines we have, only ‘overdrive1’ is
     currently contributing build power;

  2. AArch64 build machines ‘pankow’, ‘grunewald’, and ‘kreuzberg’
     (HoneyCombs) need on-site intervention so we can reconfigure and
     reboot them;

  3. Some other AArch64 build machines (‘lieserl’ and ‘monokuma’) have
     been off for months and we’re discussing on guix-sysadmin ways to
     turn them back on;

  4. POWER9, I’m not sure;

  5. ‘cuirass remote-server’ may be too slow at handling incoming
     messages from workers, leading to redundant builds and the
     impression on https://ci.guix.gnu.org/workers that workers are
     idle, even when they’re in fact busy building stuff.

Investigation details:

I noticed that ‘cuirass remote-server’ on berlin would all too often
consider workers as “unresponsive” (meaning that it hasn’t received a
‘ping’ message from them in the past 2 minutes):

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo grep unresponsive /var/log/cuirass-remote-server.log |tail -10
2024-06-17 12:44:02 restarted 1 builds that were on unresponsive workers
2024-06-17 12:50:03 restarted 1 builds that were on unresponsive workers
2024-06-17 12:55:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:01:03 restarted 3 builds that were on unresponsive workers
2024-06-17 13:08:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:20:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:22:03 restarted 4 builds that were on unresponsive workers
2024-06-17 13:24:03 restarted 2 builds that were on unresponsive workers
2024-06-17 13:29:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:33:03 restarted 3 builds that were on unresponsive workers
--8<---------------cut here---------------end--------------->8---

As shown in this log, the effect is that some builds get restarted, even
though they are still being built by a worker that was wrongfully
considered unresponsive.

This needs further investigation. The SQL query for
‘db-get-pending-build’ fixed by Cuirass commit
17338588d4862b04e9e405c1244a2ea703b50d98 is no longer at fault: it’s now
reasonably fast (there’s a warning in ‘cuirass-remote-server.log’ if it
ever takes more than 10s). It could be that the backlog of incoming
messages in ‘remote-server’ still keeps increasing though, since workers
send pings every minute no matter what.

A further problem is that we’re unable to retrieve binaries from a
couple of build machines:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo grep error: /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:17:29 error: failed to add /gnu/store/ljhvgbblb4y7554rg542vam5hp8rg9mg-ocaml-bos-0.2.1 to store: path `/gnu/store/ljhvgbblb4y7554rg542vam5hp8rg9mg-ocaml-bos-0.2.1' does not exist and cannot be created
2024-06-17 13:17:29 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:24:03 error: failed to add /gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0 to store: path `/gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0' does not exist and cannot be created
2024-06-17 13:24:03 error: The remote-worker signing key might be unauthorized.
--8<---------------cut here---------------end--------------->8---

By picking store items from these error messages, we can determine that
at least ‘pankow’ (10.0.0.8, AArch64) and ‘grunewald’ (10.0.0.10,
AArch64) are at fault:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ guix gc --derivers /gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0
/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv
ludo@berlin ~$ sudo grep 8yc7j6q169f8312wx6jxs7g0z4xy5l5l /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:21:50 10.0.0.8 (uUTl7MVR): build started: '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv'.
2024-06-17 13:24:03 fetching 1 outputs of '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv' from http://10.0.0.8:5558
2024-06-17 13:24:03 build succeeded: '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv'
ludo@berlin ~$ guix gc --derivers /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b
/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv
ludo@berlin ~$ sudo grep ygrgwp9jyksjpnd76b83ifdskbcdjbhh /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:05:21 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.8:5558
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:34:39 build failed: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:41:08 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.10:5558
2024-06-17 13:41:08 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.10:5558
2024-06-17 13:41:09 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:41:09 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
--8<---------------cut here---------------end--------------->8---

The signing key of ‘grunewald’ is definitely registered:

--8<---------------cut here---------------start------------->8---
$ ssh grunewald cat /etc/guix/signing-key.pub
(public-key
 (ecc
  (curve Ed25519)
  (q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)
  )
 )
$ grep -rl 370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481 ~/src/guix-maintenance/hydra/
$ ssh berlin grep 370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481 /etc/guix/acl
      (q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)
--8<---------------cut here---------------end--------------->8---

As for ‘pankow’, I can’t say because I cannot log in. Most likely it
rebooted and may have generated a new signing key, different from the
one that’s registered. In effect, ‘pankow’ is not contributing any
builds.

The third machine of the HoneyComb family is ‘kreuzberg’: it’s been off
for a few days, after I rebooted it and it didn’t come back.

Thanks,
Ludo’.

PS: I’m traveling this week so I won’t be very responsive.
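
The per-architecture counts in the psql session above come from six separate queries. They could probably be obtained in one pass with a GROUP BY. The following is only a sketch: the table and column names (builds, status, system) and the status value -2 are taken from those queries, but the database here is an in-memory SQLite stand-in rather than the Cuirass PostgreSQL database, the rows are made up for illustration, and the function names are hypothetical.

```python
import sqlite3

# One GROUP BY query instead of one query per architecture.
# `builds`, `status` and `system` are the names used in the psql session;
# -2 is the status value those queries filter on.
PENDING_BY_SYSTEM = """
    SELECT system, count(*) AS pending
    FROM builds
    WHERE status = -2
    GROUP BY system
    ORDER BY pending DESC, system
"""


def pending_by_system(conn):
    """Return [(system, count), ...] for builds whose status is -2."""
    return conn.execute(PENDING_BY_SYSTEM).fetchall()


def demo_connection():
    """In-memory SQLite stand-in filled with made-up rows (not Cuirass data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE builds (id INTEGER PRIMARY KEY, system TEXT, status INTEGER)"
    )
    rows = [
        ("aarch64-linux", -2), ("aarch64-linux", -2), ("aarch64-linux", -2),
        ("powerpc64le-linux", -2), ("powerpc64le-linux", -2),
        ("armhf-linux", -2),
        ("x86_64-linux", 0),  # a different status, so it is not counted
    ]
    conn.executemany("INSERT INTO builds (system, status) VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    for system, pending in pending_by_system(demo_connection()):
        print(f"{system:20} {pending}")
```

Unlike the individual queries, a GROUP BY returns no row for a system that has no matching builds, so x86_64-linux and i686-linux would simply be absent from the result instead of showing 0. As a sanity check on the figures above, 160847 + 119124 + 4343 = 284314, which matches the overall count.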