From: "Ludovic Courtès" <ludo@gnu.org>
To: Andreas Enge <andreas@enge.fr>
Cc: guix-devel@gnu.org
Subject: Little progress on powerpc64le and aarch64 builds on ci.guix
Date: Mon, 17 Jun 2024 14:07:39 +0200
Message-ID: <87iky7ai2c.fsf_-_@gnu.org>
In-Reply-To: <ZmlutEmsTuM9VoFz@jurong> (Andreas Enge's message of "Wed, 12 Jun 2024 11:47:32 +0200")
Hello!
Andreas Enge <andreas@enge.fr> writes:
> On Thu, Jun 06, 2024 at 07:48:27PM +0200, Andreas Enge wrote:
>> Could the graph on
>> https://ci.guix.gnu.org/metrics
>> be augmented by the number of packages to be built for the different
>> architectures?
That would be nice, I agree (I haven’t looked much at that part of the
code).
> In that direction, the metrics now show that very few packages were built
> in the last 24 hours, except maybe for ARM (where we anyway build few
> packages). But the number of waiting builds stalls at around 280000.
>
> Are these all for ARM now? Should we cancel builds a bit more aggressively
> to make sure that recent packages are favoured?
In the meantime, here’s me doing stats-as-a-service:
--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo -u cuirass psql cuirass
cuirass=> select count(*) from builds where status = -2 ;
count
--------
284314
(1 row)
Time: 635.478 ms
cuirass=> select count(*) from builds where status = -2 and system = 'x86_64-linux';
count
-------
0
(1 row)
Time: 761.333 ms
cuirass=> select count(*) from builds where status = -2 and system = 'aarch64-linux';
count
--------
160847
(1 row)
Time: 661.968 ms
cuirass=> select count(*) from builds where status = -2 and system = 'powerpc64le-linux';
count
--------
119124
(1 row)
Time: 589.800 ms
cuirass=> select count(*) from builds where status = -2 and system = 'armhf-linux';
count
-------
4343
(1 row)
Time: 549.242 ms
cuirass=> select count(*) from builds where status = -2 and system = 'i686-linux';
count
-------
0
(1 row)
Time: 1088.130 ms (00:01.088)
--8<---------------cut here---------------end--------------->8---
So lots of AArch64 and POWER9 builds.
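The per-system counts above add up to the total (160847 + 119124 + 4343 = 284314), so the same breakdown could be obtained with a single grouped query instead of one query per system. Below is a minimal sketch of that query, run against an in-memory SQLite stand-in so that it is self-contained. The table and column names (‘builds’, ‘status’, ‘system’) and the value -2 come from the session above; the sample rows are made up, and the real Cuirass schema has more columns than shown here.

```python
import sqlite3

# In-memory stand-in for the Cuirass database.  Assumption: only the
# 'status' and 'system' columns matter for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE builds (id INTEGER PRIMARY KEY, status INTEGER, system TEXT)"
)

# Made-up sample rows: six builds with status -2, four with status 0.
sample = (
    [(-2, "aarch64-linux")] * 3
    + [(-2, "powerpc64le-linux")] * 2
    + [(-2, "armhf-linux")]
    + [(0, "x86_64-linux")] * 4
)
conn.executemany("INSERT INTO builds (status, system) VALUES (?, ?)", sample)

# One grouped query replacing the per-system 'select count(*)' calls.
QUERY = """
SELECT system, count(*) AS pending
FROM builds
WHERE status = -2
GROUP BY system
ORDER BY pending DESC
"""


def pending_by_system(connection):
    """Return a list of (system, pending-count) pairs, largest first."""
    return connection.execute(QUERY).fetchall()


rows = pending_by_system(conn)
for system, pending in rows:
    print(f"{system:20} {pending}")
```

Unlike the separate queries, a grouped query returns no row at all for a system with zero matching builds (x86_64-linux in the sample), so a missing system should be read as a count of 0.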
Executive summary:
1. Of all the AArch64 build machines we have, only ‘overdrive1’ is
actually contributing build power right now;
2. AArch64 build machines ‘pankow’, ‘grunewald’, and ‘kreuzberg’
(HoneyCombs) need on-site intervention so we can reconfigure them
and reboot them;
3. Some other AArch64 build machines (‘lieserl’ and ‘monokuma’) have
been off for months and we’re discussing on guix-sysadmin ways to
turn them back on;
4. POWER9, I’m not sure;
5. ‘cuirass remote-server’ may be too slow at handling incoming
messages from workers, leading to redundant builds and the
impression on https://ci.guix.gnu.org/workers that workers are
idle, even when they’re in fact busy building stuff.
Investigation details:
I noticed that ‘cuirass remote-server’ on berlin would all too often
consider workers as “unresponsive” (meaning that it hasn’t received a
‘ping’ message from them in the past 2 minutes):
--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo grep unresponsive /var/log/cuirass-remote-server.log |tail -10
2024-06-17 12:44:02 restarted 1 builds that were on unresponsive workers
2024-06-17 12:50:03 restarted 1 builds that were on unresponsive workers
2024-06-17 12:55:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:01:03 restarted 3 builds that were on unresponsive workers
2024-06-17 13:08:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:20:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:22:03 restarted 4 builds that were on unresponsive workers
2024-06-17 13:24:03 restarted 2 builds that were on unresponsive workers
2024-06-17 13:29:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:33:03 restarted 3 builds that were on unresponsive workers
--8<---------------cut here---------------end--------------->8---
As shown in this log, the effect is that some builds get restarted even
though they are still being built by a worker that was wrongly
considered unresponsive.
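To put a number on the excerpt above, here is a small sketch that parses the ‘restarted N builds’ lines and sums them. The regular expression is written against the ten lines shown in this message only; other message formats in ‘cuirass-remote-server.log’ are not covered, and the function name is made up for illustration.

```python
import re

# Matches lines such as:
#   2024-06-17 12:44:02 restarted 1 builds that were on unresponsive workers
RESTART_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) restarted (\d+) builds? "
    r"that were on unresponsive workers$"
)


def restart_events(lines):
    """Return (timestamp, count) pairs for every 'restarted' log line."""
    events = []
    for line in lines:
        match = RESTART_RE.match(line.strip())
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


# The ten lines from the excerpt above.
LOG_EXCERPT = """\
2024-06-17 12:44:02 restarted 1 builds that were on unresponsive workers
2024-06-17 12:50:03 restarted 1 builds that were on unresponsive workers
2024-06-17 12:55:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:01:03 restarted 3 builds that were on unresponsive workers
2024-06-17 13:08:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:20:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:22:03 restarted 4 builds that were on unresponsive workers
2024-06-17 13:24:03 restarted 2 builds that were on unresponsive workers
2024-06-17 13:29:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:33:03 restarted 3 builds that were on unresponsive workers
"""

events = restart_events(LOG_EXCERPT.splitlines())
total = sum(count for _, count in events)
print(f"{len(events)} restart events, {total} builds restarted")
```

On this excerpt that comes to 18 restarted builds across 10 events, in the 49 minutes between 12:44 and 13:33.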
This needs further investigation. The SQL query for
‘db-get-pending-build’, fixed by Cuirass commit
17338588d4862b04e9e405c1244a2ea703b50d98, is no longer at fault: it’s now
reasonably fast (there’s a warning in ‘cuirass-remote-server.log’ if it
ever takes more than 10s). It could be, though, that the backlog of incoming
messages in ‘remote-server’ still keeps growing, since workers
send pings every minute no matter what.
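The following toy model illustrates why a message backlog could make healthy workers look unresponsive. It is not how ‘cuirass remote-server’ is implemented. The 60-second ping interval and the 2-minute threshold come from the text above; everything else is an assumption made for the sake of the example, in particular that the server compares the current time with the send time of the newest ping it has processed, and that every ping waits a fixed ‘delay’ in the queue.

```python
PING_INTERVAL = 60   # seconds; workers ping every minute (from the text)
TIMEOUT = 120        # seconds; "unresponsive" threshold (from the text)


def latest_processed_ping(now, delay):
    """Send time of the newest ping the server has processed by NOW.

    Hypothetical model: the worker sends pings at t = 0, 60, 120, ... and
    a ping sent at time t is only processed at t + delay.
    """
    if now < delay:
        return None  # nothing processed yet
    return ((now - delay) // PING_INTERVAL) * PING_INTERVAL


def looks_unresponsive(now, delay):
    """True if, at time NOW, the newest processed ping is older than TIMEOUT."""
    last = latest_processed_ping(now, delay)
    return last is None or now - last > TIMEOUT


# The worker is pinging on schedule in every case; only the queueing
# delay on the server side differs.
for delay in (5, 90, 150):
    print(f"delay {delay:3d}s -> unresponsive? {looks_unresponsive(600, delay)}")
```

In this model, a worker that never misses a ping is still flagged as soon as the queueing delay pushes the age of its newest processed ping past two minutes. If the server measured ping age some other way, the threshold at which this happens would differ.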
A further problem is that we’re unable to retrieve binaries from a
couple of build machines:
--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo grep error: /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:17:29 error: failed to add /gnu/store/ljhvgbblb4y7554rg542vam5hp8rg9mg-ocaml-bos-0.2.1 to store: path `/gnu/store/ljhvgbblb4y7554rg542vam5hp8rg9mg-ocaml-bos-0.2.1' does not exist and cannot be created
2024-06-17 13:17:29 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:24:03 error: failed to add /gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0 to store: path `/gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0' does not exist and cannot be created
2024-06-17 13:24:03 error: The remote-worker signing key might be unauthorized.
--8<---------------cut here---------------end--------------->8---
By picking store items from these error messages, we can determine that
at least ‘pankow’ (10.0.0.8, AArch64) and ‘grunewald’ (10.0.0.10,
AArch64) are at fault:
--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ guix gc --derivers /gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0
/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv
ludo@berlin ~$ sudo grep 8yc7j6q169f8312wx6jxs7g0z4xy5l5l /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:21:50 10.0.0.8 (uUTl7MVR): build started: '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv'.
2024-06-17 13:24:03 fetching 1 outputs of '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv' from http://10.0.0.8:5558
2024-06-17 13:24:03 build succeeded: '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv'
ludo@berlin ~$ guix gc --derivers /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b
/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv
ludo@berlin ~$ sudo grep ygrgwp9jyksjpnd76b83ifdskbcdjbhh /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:05:21 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.8:5558
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:34:39 build failed: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:41:08 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.10:5558
2024-06-17 13:41:08 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.10:5558
2024-06-17 13:41:09 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:41:09 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
--8<---------------cut here---------------end--------------->8---
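The manual lookup above (store item, then deriver, then grep in the log) could be partly automated. The sketch below covers only the last step: given log lines, it collects the worker addresses each derivation’s outputs were fetched from. The regular expression is written against the ‘fetching … from http://…’ lines shown above; the function name is hypothetical and other log formats are not handled.

```python
import re
from collections import defaultdict

# Matches the "fetching N outputs of '<drv>' from http://<ip>:<port>" lines.
FETCH_RE = re.compile(
    r"fetching \d+ outputs of '([^']+)' from http://([0-9.]+):\d+"
)


def workers_by_derivation(lines):
    """Map each derivation to the set of worker IPs its outputs came from."""
    result = defaultdict(set)
    for line in lines:
        match = FETCH_RE.search(line)
        if match:
            derivation, address = match.groups()
            result[derivation].add(address)
    return dict(result)


UUCP_DRV = "/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv"
GUIX_DRV = "/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv"

# Lines copied from the excerpt above.
LOG_LINES = [
    f"2024-06-17 13:24:03 fetching 1 outputs of '{UUCP_DRV}' from http://10.0.0.8:5558",
    f"2024-06-17 13:24:03 build succeeded: '{UUCP_DRV}'",
    f"2024-06-17 13:05:21 fetching 1 outputs of '{GUIX_DRV}' from http://10.0.0.8:5558",
    f"2024-06-17 13:34:39 build failed: '{GUIX_DRV}'",
    f"2024-06-17 13:41:08 fetching 1 outputs of '{GUIX_DRV}' from http://10.0.0.10:5558",
]

workers = workers_by_derivation(LOG_LINES)
for derivation, addresses in workers.items():
    print(derivation, sorted(addresses))
```

Mapping a failing store item to its ‘.drv’ still requires ‘guix gc --derivers’ as shown above; this sketch starts from the derivation name.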
The signing key of ‘grunewald’ is definitely registered:
--8<---------------cut here---------------start------------->8---
$ ssh grunewald cat /etc/guix/signing-key.pub
(public-key
(ecc
(curve Ed25519)
(q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)
)
)
$ grep -rl 370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481 ~/src/guix-maintenance/hydra/
$ ssh berlin grep 370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481 /etc/guix/acl
(q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)
--8<---------------cut here---------------end--------------->8---
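The check above boils down to extracting the ‘q’ value from the worker’s public key and looking for it in the ACL on berlin. Here is a minimal sketch of that comparison on plain text. The key is the one printed above; the ACL snippet is a simplified, hypothetical stand-in for ‘/etc/guix/acl’, and the helper names are made up. It matches text only and does not parse the s-expressions properly.

```python
import re

# Matches "(q #HEX#)" as it appears in the key and ACL shown above.
Q_RE = re.compile(r"\(q\s+#([0-9A-Fa-f]+)#\)")


def q_values(sexp_text):
    """Return the set of 'q' values (upper-cased hex) found in SEXP_TEXT."""
    return {value.upper() for value in Q_RE.findall(sexp_text)}


def is_authorized(public_key_text, acl_text):
    """True if every 'q' value of the public key also appears in the ACL."""
    keys = q_values(public_key_text)
    return bool(keys) and keys <= q_values(acl_text)


# Public key of 'grunewald', as printed above.
GRUNEWALD_KEY = """
(public-key
 (ecc
  (curve Ed25519)
  (q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)
  )
 )
"""

# Hypothetical, simplified ACL content containing that key.
SAMPLE_ACL = """
(acl
 (entry
  (public-key
   (ecc
    (curve Ed25519)
    (q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)))))
"""

# A made-up key that is not in the sample ACL, e.g. one regenerated at boot.
OTHER_KEY = "(public-key (ecc (curve Ed25519) (q #00FF#)))"

print("grunewald authorized:", is_authorized(GRUNEWALD_KEY, SAMPLE_ACL))
print("other key authorized:", is_authorized(OTHER_KEY, SAMPLE_ACL))
```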
That of ‘pankow’ I can’t say because I cannot log in. Most likely, it
rebooted and regenerated a signing key different from the
one that’s registered. In effect, ‘pankow’ is not
contributing any builds.
The third machine of the HoneyComb family is ‘kreuzberg’: it’s been off
for a few days, after I rebooted it and it didn’t come back.
Thanks,
Ludo’.
PS: I’m traveling this week so I won’t be very responsive.