From: "Ludovic Courtès" <ludo@gnu.org>
To: Andreas Enge <andreas@enge.fr>
Cc: guix-devel@gnu.org
Subject: Little progress on powerpc64le and aarch64 builds on ci.guix
Date: Mon, 17 Jun 2024 14:07:39 +0200
Message-ID: <87iky7ai2c.fsf_-_@gnu.org>
In-Reply-To: <ZmlutEmsTuM9VoFz@jurong> (Andreas Enge's message of "Wed, 12 Jun 2024 11:47:32 +0200")

Hello!

Andreas Enge <andreas@enge.fr> writes:

> On Thu, Jun 06, 2024 at 07:48:27PM +0200, Andreas Enge wrote:
>> Could the graph on
>>    https://ci.guix.gnu.org/metrics
>> be augmented by the number of packages to be built for the different
>> architectures?

That would be nice, I agree (I haven’t looked much at that part of the
code).

> In that direction, the metrics now show that very few packages were built
> in the last 24 hours, except maybe for ARM (where we anyway build few
> packages). But the number of waiting builds stalls at around 280000.
>
> Are these all for ARM now? Should we cancel builds a bit more aggressively
> to make sure that recent packages are favoured?

In the meantime, here’s me doing stats-as-a-service:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo -u cuirass psql cuirass
cuirass=> select count(*) from builds where status = -2 ;
 count  
--------
 284314
(1 row)

Time: 635.478 ms
cuirass=> select count(*) from builds where status = -2 and system = 'x86_64-linux';
 count 
-------
     0
(1 row)

Time: 761.333 ms
cuirass=> select count(*) from builds where status = -2 and system = 'aarch64-linux';
 count  
--------
 160847
(1 row)

Time: 661.968 ms
cuirass=> select count(*) from builds where status = -2 and system = 'powerpc64le-linux';
 count  
--------
 119124
(1 row)

Time: 589.800 ms
cuirass=> select count(*) from builds where status = -2 and system = 'armhf-linux';
 count 
-------
  4343
(1 row)

Time: 549.242 ms
cuirass=> select count(*) from builds where status = -2 and system = 'i686-linux';
 count 
-------
     0
(1 row)

Time: 1088.130 ms (00:01.088)
--8<---------------cut here---------------end--------------->8---

So lots of AArch64 and POWER9 builds: together with the 4,343 armhf
ones, they account for all 284,314 waiting builds.
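
The per-system counts above can also be obtained with a single grouped
query.  A minimal sketch, run here against an in-memory SQLite table
rather than the real Cuirass database: only the ‘status’ and ‘system’
columns and the ‘status = -2’ filter come from the queries above; the
sample rows are made up for illustration.

```python
import sqlite3

# Toy stand-in for the Cuirass 'builds' table: only the two columns used
# in the queries above ('status' and 'system') are modelled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE builds (status INTEGER, system TEXT)")
sample = (
    [(-2, "aarch64-linux")] * 3
    + [(-2, "powerpc64le-linux")] * 2
    + [(-2, "armhf-linux")]
    + [(0, "x86_64-linux")] * 4  # rows with another status are ignored
)
conn.executemany("INSERT INTO builds VALUES (?, ?)", sample)

# One pass over the table instead of one query per architecture.
QUERY = """
    SELECT system, count(*) AS waiting
    FROM builds
    WHERE status = -2
    GROUP BY system
    ORDER BY waiting DESC
"""
pending = dict(conn.execute(QUERY).fetchall())
for system, count in pending.items():
    print(f"{system:20} {count}")
```

Note that a system with no matching rows (x86_64-linux in this toy
data) simply does not appear in the result, whereas the individual
queries above report an explicit 0.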

Executive summary:

  1. Of all the AArch64 build machines we have, only ‘overdrive1’ is
     currently actually contributing build power;

  2. AArch64 build machines ‘pankow’, ‘grunewald’, and ‘kreuzberg’
     (HoneyCombs) need on-site intervention so we can reconfigure them
     and reboot them;

  3. Some other AArch64 build machines (‘lieserl’ and ‘monokuma’) have
     been off for months and we’re discussing on guix-sysadmin ways to
     turn them back on;

  4. POWER9, I’m not sure;

  5. ‘cuirass remote-server’ may be too slow at handling incoming
     messages from workers, leading to redundant builds and the
     impression on https://ci.guix.gnu.org/workers that workers are
     idle, even when they’re in fact busy building stuff.


Investigation details:

I noticed that ‘cuirass remote-server’ on berlin would all too often
consider workers as “unresponsive” (meaning that it hasn’t received a
‘ping’ message from them in the past 2 minutes):

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo grep unresponsive /var/log/cuirass-remote-server.log |tail -10
2024-06-17 12:44:02 restarted 1 builds that were on unresponsive workers
2024-06-17 12:50:03 restarted 1 builds that were on unresponsive workers
2024-06-17 12:55:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:01:03 restarted 3 builds that were on unresponsive workers
2024-06-17 13:08:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:20:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:22:03 restarted 4 builds that were on unresponsive workers
2024-06-17 13:24:03 restarted 2 builds that were on unresponsive workers
2024-06-17 13:29:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:33:03 restarted 3 builds that were on unresponsive workers
--8<---------------cut here---------------end--------------->8---

As shown in this log, the effect is that some builds get restarted even
though they are still being built by a worker that was wrongly
considered unresponsive.
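
To put a number on it, the restart messages can be tallied.  A minimal
sketch, assuming the log format is exactly that of the ten lines quoted
above (which are embedded as sample data); the function name is made up
for illustration.

```python
import re
from datetime import datetime

# Log lines copied from the excerpt above.
LOG = """\
2024-06-17 12:44:02 restarted 1 builds that were on unresponsive workers
2024-06-17 12:50:03 restarted 1 builds that were on unresponsive workers
2024-06-17 12:55:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:01:03 restarted 3 builds that were on unresponsive workers
2024-06-17 13:08:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:20:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:22:03 restarted 4 builds that were on unresponsive workers
2024-06-17 13:24:03 restarted 2 builds that were on unresponsive workers
2024-06-17 13:29:03 restarted 1 builds that were on unresponsive workers
2024-06-17 13:33:03 restarted 3 builds that were on unresponsive workers
"""

PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) restarted (\d+) builds "
    r"that were on unresponsive workers$"
)


def restart_stats(log_text):
    """Return (total restarted builds, seconds from first to last event)."""
    events = []
    for line in log_text.splitlines():
        match = PATTERN.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((when, int(match.group(2))))
    if not events:
        return 0, 0.0
    total = sum(count for _, count in events)
    span = (events[-1][0] - events[0][0]).total_seconds()
    return total, span


total, span = restart_stats(LOG)
print(f"{total} builds restarted over {span / 60:.0f} minutes")
```

On the excerpt above this gives 18 restarted builds in roughly 49
minutes.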

This needs further investigation.  The SQL query for
‘db-get-pending-build’ fixed by Cuirass commit
17338588d4862b04e9e405c1244a2ea703b50d98 is no longer at fault: it’s now
reasonably fast (there’s a warning in ‘cuirass-remote-server.log’ if it
ever takes more than 10s).  It could be that the backlog of incoming
messages in ‘remote-server’ still keeps increasing though, since workers
send pings every minute no matter what.
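
As a back-of-the-envelope illustration of that hypothesis: if pings
arrive at a fixed rate regardless of how far behind the server is, any
shortfall in processing rate accumulates linearly.  The sketch below is
a crude fluid model, not a description of how ‘remote-server’ works;
only the one-minute ping interval comes from the text above, and the
worker count and processing rate in the example are invented.

```python
def backlog_after(seconds, workers, handled_per_second, ping_interval=60.0):
    """Messages still queued after `seconds`, starting from an empty queue.

    Assumes every worker sends one ping per `ping_interval` seconds and
    the server handles a constant `handled_per_second` messages.
    """
    arrival_rate = workers / ping_interval      # pings per second
    growth = arrival_rate - handled_per_second  # net queue growth per second
    return max(0.0, growth * seconds)


def wait_for_new_message(backlog, handled_per_second):
    """Seconds a newly arrived message waits behind `backlog` messages."""
    return backlog / handled_per_second


# Hypothetical numbers: 120 workers, server handling 1.5 messages/s.
queued = backlog_after(600, workers=120, handled_per_second=1.5)
delay = wait_for_new_message(queued, handled_per_second=1.5)
print(f"after 10 minutes: {queued:.0f} queued, new messages wait {delay:.0f} s")
```

Whether a delay of that kind is what actually trips the two-minute
“unresponsive” check is exactly what remains to be investigated.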

A further problem is that we’re unable to retrieve binaries from a
couple of build machines:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo grep error: /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:05:21 error: failed to add /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b to store: path `/gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b' does not exist and cannot be created
2024-06-17 13:05:21 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:17:29 error: failed to add /gnu/store/ljhvgbblb4y7554rg542vam5hp8rg9mg-ocaml-bos-0.2.1 to store: path `/gnu/store/ljhvgbblb4y7554rg542vam5hp8rg9mg-ocaml-bos-0.2.1' does not exist and cannot be created
2024-06-17 13:17:29 error: The remote-worker signing key might be unauthorized.
2024-06-17 13:24:03 error: failed to add /gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0 to store: path `/gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0' does not exist and cannot be created
2024-06-17 13:24:03 error: The remote-worker signing key might be unauthorized.
--8<---------------cut here---------------end--------------->8---

By picking store items from these error messages, we can determine that
at least ‘pankow’ (10.0.0.8, AArch64) and ‘grunewald’ (10.0.0.10,
AArch64) are at fault:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ guix gc --derivers /gnu/store/vb57h47b5xpin1h0rrvh9qd2bxapy8f7-ocaml-uucp-15.0.0
/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv
ludo@berlin ~$ sudo grep 8yc7j6q169f8312wx6jxs7g0z4xy5l5l /var/log/cuirass-remote-server.log |tail -10
2024-06-17 13:21:50 10.0.0.8 (uUTl7MVR): build started: '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv'.
2024-06-17 13:24:03 fetching 1 outputs of '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv' from http://10.0.0.8:5558
2024-06-17 13:24:03 build succeeded: '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv'
ludo@berlin ~$ guix gc --derivers /gnu/store/f96ya7x7yjns39n8np16rmnhzarqcchd-guix-78d385a6b
/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv
ludo@berlin ~$ sudo grep ygrgwp9jyksjpnd76b83ifdskbcdjbhh /var/log/cuirass-remote-server.log  |tail -10
2024-06-17 13:05:21 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.8:5558
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:05:21 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:34:39 build failed: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:41:08 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.10:5558
2024-06-17 13:41:08 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.10:5558
2024-06-17 13:41:09 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
2024-06-17 13:41:09 build succeeded: '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv'
--8<---------------cut here---------------end--------------->8---
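
The manual grep above can be sketched as a small script that maps each
derivation to the worker addresses its outputs were fetched from.  It
assumes the “fetching N outputs of '…' from http://…” line format seen
in the excerpt; the three sample lines are copied from it, and the
address-to-name table only contains the two hosts named in this message.

```python
import re
from collections import defaultdict

# 'fetching' lines copied from the log excerpt above.
LOG = """\
2024-06-17 13:24:03 fetching 1 outputs of '/gnu/store/8yc7j6q169f8312wx6jxs7g0z4xy5l5l-ocaml-uucp-15.0.0.drv' from http://10.0.0.8:5558
2024-06-17 13:05:21 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.8:5558
2024-06-17 13:41:08 fetching 1 outputs of '/gnu/store/ygrgwp9jyksjpnd76b83ifdskbcdjbhh-guix-78d385a6b.drv' from http://10.0.0.10:5558
"""

# Host names as given in this message.
NAMES = {"10.0.0.8": "pankow", "10.0.0.10": "grunewald"}

FETCH = re.compile(r"fetching \d+ outputs of '([^']+)' from http://([0-9.]+):\d+")


def workers_by_derivation(log_text):
    """Map each derivation to the distinct worker IPs it was fetched from."""
    hosts = defaultdict(list)
    for drv, host in FETCH.findall(log_text):
        if host not in hosts[drv]:
            hosts[drv].append(host)
    return dict(hosts)


by_drv = workers_by_derivation(LOG)
for drv, hosts in by_drv.items():
    names = ", ".join(NAMES.get(h, h) for h in hosts)
    print(f"{drv}: {names}")
```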

The signing key of ‘grunewald’ is definitely registered:

--8<---------------cut here---------------start------------->8---
$ ssh grunewald cat /etc/guix/signing-key.pub
(public-key 
 (ecc 
  (curve Ed25519)
  (q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)
  )
 )
$ grep -rl 370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481 ~/src/guix-maintenance/hydra/
$ ssh berlin grep 370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481 /etc/guix/acl
    (q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)
--8<---------------cut here---------------end--------------->8---
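
The same comparison can be sketched as a script.  This is only a textual
check, assuming the ACL stores the key point in the same ‘(q #HEX#)’
form as the public key file, as the grep output above suggests; it does
not reproduce the daemon’s actual authorization logic, and the function
names are made up for illustration.

```python
import re

# Output of 'cat /etc/guix/signing-key.pub' on grunewald, as shown above.
PUBLIC_KEY = """\
(public-key
 (ecc
  (curve Ed25519)
  (q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)
  )
 )
"""

# Matching line from /etc/guix/acl on berlin, as shown above.
ACL_EXCERPT = (
    "    (q #370A0165E60213CA122E026402EE3DEA61FE4E4EE27D16DA44044AA49714D481#)\n"
)

Q_FIELD = re.compile(r"\(q\s+#([0-9A-Fa-f]+)#\)")


def key_points(text):
    """Return the set of hex-encoded 'q' values found in `text`."""
    return {q.upper() for q in Q_FIELD.findall(text)}


def is_listed(public_key_text, acl_text):
    """True if the key file holds a 'q' value and all of them are in the ACL."""
    keys = key_points(public_key_text)
    return bool(keys) and keys <= key_points(acl_text)


print("registered" if is_listed(PUBLIC_KEY, ACL_EXCERPT) else "NOT registered")
```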

That of ‘pankow’ I can’t say because I cannot log in.  Most likely, it
rebooted and generated a new signing key different from the one that’s
registered.  So in effect, ‘pankow’ is not contributing any builds.

The third machine of the HoneyComb family is ‘kreuzberg’: it’s been off
for a few days, after I rebooted it and it didn’t come back.

Thanks,
Ludo’.

PS: I’m traveling this week so I won’t be very responsive.

