* bug#53463: ci.guix.gnu.org not building the 'guix' job
From: Leo Famulari @ 2022-01-23 0:56 UTC
To: 53463; +Cc: guix-sysadmin

As far as I can tell, ci.guix.gnu.org stopped building the 'guix' job a couple of days ago:

https://ci.guix.gnu.org/jobset/guix
* bug#53463: ci.guix.gnu.org not building the 'guix' job
From: Leo Famulari @ 2022-01-23 23:00 UTC
To: 53463

Also, the 'master' job hasn't been run in ~2 days:

https://ci.guix.gnu.org/jobset/master

I think the build farm is waiting to finish collecting garbage.
* bug#53463: ci.guix.gnu.org not building the 'guix' job
From: Leo Famulari @ 2022-01-27 22:13 UTC
To: 53463

On Sun, Jan 23, 2022 at 06:00:40PM -0500, Leo Famulari wrote:
> Also, the 'master' job hasn't been run in ~2 days:
>
> https://ci.guix.gnu.org/jobset/master
>
> I think the build farm is waiting to finish collecting garbage.

Unfortunately, the 'master' jobset is broken again, and the 'guix' jobset is still broken.
* bug#53463: ci.guix.gnu.org not building the 'guix' job
From: Mathieu Othacehe @ 2022-02-02 18:41 UTC
To: Leo Famulari; +Cc: 53463

Hello,

The issue here seems to be that the evaluations of the 'guix' jobset never finish, even when the GC is not running.

I ran strace on one of the stuck evaluation processes; it repeatedly shows:

--8<---------------cut here---------------start------------->8---
[pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
[pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
[pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
[pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
[pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
[pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
--8<---------------cut here---------------end--------------->8---

To be continued,

Thanks,

Mathieu
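Editor's sketch: the buffers in the strace output above follow a simple framing that can be decoded by hand. The Python below is an illustration only, not code from Guix. It assumes each chunk is an 8-byte little-endian marker (whose bytes read "gmlo"), an 8-byte little-endian length, and a payload zero-padded to a multiple of 8 bytes, which is consistent with the byte counts strace reports (96 and 88). The names `STDERR_NEXT` and `decode_frames` are chosen for this example.

```python
import struct

# Assumption: 0x6f6c6d67 marks "next chunk of build output"; stored as a
# little-endian 64-bit integer, its first four bytes read "gmlo".
STDERR_NEXT = 0x6F6C6D67


def decode_frames(data: bytes) -> list[str]:
    """Decode consecutive marker/length/payload frames into text messages."""
    messages = []
    offset = 0
    while offset + 16 <= len(data):
        marker, length = struct.unpack_from("<QQ", data, offset)
        if marker != STDERR_NEXT:
            raise ValueError(f"unexpected marker {marker:#x} at offset {offset}")
        offset += 16
        payload = data[offset:offset + length]
        if len(payload) < length:
            raise ValueError("truncated payload")
        messages.append(payload.decode("utf-8"))
        # Skip the payload plus its zero padding (rounded up to 8 bytes).
        offset += (length + 7) // 8 * 8
    return messages


# First buffer from the trace: 'J' is 0x4a, i.e. a 74-byte message.
frame = (
    b"gmlo\0\0\0\0" b"J\0\0\0\0\0\0\0"
    b"guix offload: error: failed to connect to 'localhost': Connection refused\n"
    b"\0\0\0\0\0\0"
)
print(len(frame))  # 96, matching the read() return value in the trace
print(decode_frames(frame))
```

Read this way, the process is just relaying offload log messages between two file descriptors in a loop, which matches the observation that the evaluation never finishes.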
* bug#53463: ci.guix.gnu.org not building the 'guix' job
From: Ludovic Courtès @ 2022-02-04 8:58 UTC
To: Mathieu Othacehe; +Cc: 53463

Hello!

Mathieu Othacehe <othacehe@gnu.org> writes:

> I tried to strace one of the stuck evaluation process, it returns
> repeatedly:
>
> [pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
> [pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
> [pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
> [pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
> [pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
> [pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96

Oh! That indicates that it’s failing to offload to one of the ‘localhost’ build machines specified in /etc/guix/machines.scm. Normally there’s an SSH tunnel set up for those, but I guess it broke.

Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux machines by their WireGuard IP?

Thanks,
Ludo’.
* bug#53463: ci.guix.gnu.org not building the 'guix' job
From: Mathieu Othacehe @ 2022-02-04 9:54 UTC
To: Ludovic Courtès; +Cc: 53463

Hey,

> Oh! That indicates that it’s failing to offload to one of the
> ‘localhost’ build machines specified in /etc/guix/machines.scm.
> Normally there’s an SSH tunnel set up for those, but I guess it broke.
>
> Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
> machines by their WireGuard IP?

Seems like the right thing to do. This bit is also an unstaged change in the berlin maintenance repository; we should commit it. Tobias, could you have a look :) ?

--8<---------------cut here---------------start------------->8---
+(define powerpc64le
+  (list
+   ;; A VM donated/hosted by OSUOSL & administered by nckx.
+   ;; XXX: SSH tunnel via overdrive1:
+   ;; ssh -L 2224:p9.tobias.gr:22 hydra@10.0.0.3
+   #;(build-machine
+      ;;(name "p9.tobias.gr")
+      (name "localhost")
+      (port 2224)
+      (user "hydra")
+      (systems '("powerpc64le-linux"))
+      (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJEbRxJ6WqnNLYEMNDUKFcdMtyZ9V/6oEfBFSHY8xE6A nckx"))))
--8<---------------cut here---------------end--------------->8---

I also found that other machines were unreachable and commented them out:

--8<---------------cut here---------------start------------->8---
 ;; CPU: 16 ARM Cortex-A72 cores
 ;; RAM: 32 GB
-  (list (build-machine
+  (list #;(build-machine
          ;;kreuzberg
          (name "10.0.0.9")
          (user "hydra")
@@ -243,13 +256,13 @@
 ;; BeagleBoard X15 kindly hosted by Simon Josefsson.
 ;; CPU: Cortex A15 (2 cores)
 ;; RAM: 2 GB
-  (build-machine
+  #;(build-machine
    (name "10.0.0.5") ;guix-x15
    (user "hydra")
    (systems '("armhf-linux"))
    (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOfXjwCAFWeGiUoOVXEgtIeXxbtymjOTg7ph1ObMAcJ0 root@beaglebone"))
 
-  (build-machine
+  #;(build-machine
    (name "10.0.0.6") ;guix-x15b
    (user "hydra")
    (systems '("armhf-linux"))
--8<---------------cut here---------------end--------------->8---

Nevertheless, we are hitting an offload issue here, maybe an occurrence of #24496. The offload mechanism should time out when a machine is unreachable instead of retrying over and over, which causes all evaluation processes to hang.

Thanks,

Mathieu
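Editor's sketch: the behaviour asked for above (give up on an unreachable machine after a bounded wait rather than retrying forever) can be illustrated outside of Guix with a plain TCP probe. This is not how 'guix offload' is implemented. The helper names, the dictionary layout, and the 5-second default are assumptions made for the example; only the host names and system types come from the thread.

```python
import socket
from collections import defaultdict


def is_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def systems_without_machines(machines, probe=is_reachable):
    """Return the sorted system types for which no listed machine answers."""
    reachable = defaultdict(int)
    for machine in machines:
        up = probe(machine["name"], machine.get("port", 22))
        for system in machine["systems"]:
            reachable[system] += int(up)
    return sorted(s for s, count in reachable.items() if count == 0)


# Machine names and system types as they appear in the thread.
machines = [
    {"name": "localhost", "port": 2224, "systems": ["powerpc64le-linux"]},
    {"name": "10.0.0.5", "systems": ["armhf-linux"]},
    {"name": "10.0.0.6", "systems": ["armhf-linux"]},
]

# Hypothetical situation for the demo: only 10.0.0.6 answers.
fake_probe = lambda host, port: host == "10.0.0.6"
print(systems_without_machines(machines, probe=fake_probe))
# ['powerpc64le-linux']
```

Passing the probe as a parameter keeps the summary logic testable without network access. With the real `is_reachable`, each check is bounded by the timeout, so one dead machine delays the report by at most a few seconds.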
* bug#53463: ci.guix.gnu.org not building the 'guix' job
From: Ludovic Courtès @ 2022-02-08 10:22 UTC
To: Mathieu Othacehe; +Cc: Ricardo Wurmus, 53463

Hi,

Mathieu Othacehe <othacehe@gnu.org> writes:

>> Oh! That indicates that it’s failing to offload to one of the
>> ‘localhost’ build machines specified in /etc/guix/machines.scm.
>> Normally there’s an SSH tunnel set up for those, but I guess it broke.
>>
>> Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
>> machines by their WireGuard IP?
>
> Seems like the right thing to do. This bit is also an unstaged change in
> the berlin maintenance repository, we should commit it. Tobias, could
> you have a look :) ?
>
> +(define powerpc64le
> +  (list
> +   ;; A VM donated/hosted by OSUOSL & administered by nckx.
> +   ;; XXX: SSH tunnel via overdrive1:
> +   ;; ssh -L 2224:p9.tobias.gr:22 hydra@10.0.0.3
> +   #;(build-machine
> +      ;;(name "p9.tobias.gr")
> +      (name "localhost")
> +      (port 2224)
> +      (user "hydra")
> +      (systems '("powerpc64le-linux"))
> +      (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJEbRxJ6WqnNLYEMNDUKFcdMtyZ9V/6oEfBFSHY8xE6A nckx"))))

IIRC this machine is now running WireGuard, Tobias? If so, could you change this to refer to its WireGuard IP and commit it?

> I also found that other machines were unreachable and commented them:
>
>  ;; CPU: 16 ARM Cortex-A72 cores
>  ;; RAM: 32 GB
> -  (list (build-machine
> +  (list #;(build-machine
>           ;;kreuzberg
>           (name "10.0.0.9")
>           (user "hydra")

Ricardo, could you check what’s wrong with kreuzberg?

> @@ -243,13 +256,13 @@
>  ;; BeagleBoard X15 kindly hosted by Simon Josefsson.
>  ;; CPU: Cortex A15 (2 cores)
>  ;; RAM: 2 GB
> -  (build-machine
> +  #;(build-machine
>     (name "10.0.0.5") ;guix-x15
>     (user "hydra")
>     (systems '("armhf-linux"))
>     (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOfXjwCAFWeGiUoOVXEgtIeXxbtymjOTg7ph1ObMAcJ0 root@beaglebone"))
>
> -  (build-machine
> +  #;(build-machine
>     (name "10.0.0.6") ;guix-x15b
>     (user "hydra")
>     (systems '("armhf-linux"))

Oops. Note that it’s not necessary to comment them all out. As long as at least one machine is available for a given system type, we’re fine: ‘guix offload’ will pick it up.

> Nevertheless we are hitting an offload issue here, maybe an occurrence
> of #24496. The offload mechanism should timeout when a machine is
> unreachable instead of retrying over and over, causing all evaluation
> processes to hang.

Yes, though the problem here is that some architectures were left with zero machines IIRC, so it would have failed one way or another.

Thanks!

Ludo’.
* bug#53463: ci.guix.gnu.org not building the 'guix' job
From: Ricardo Wurmus @ 2022-02-08 12:52 UTC
To: Ludovic Courtès; +Cc: Mathieu Othacehe, 53463

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Mathieu Othacehe <othacehe@gnu.org> writes:
>
>>> Oh! That indicates that it’s failing to offload to one of the
>>> ‘localhost’ build machines specified in /etc/guix/machines.scm.
>>> Normally there’s an SSH tunnel set up for those, but I guess it broke.
>>>
>>> Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
>>> machines by their WireGuard IP?
>>
>> Seems like the right thing to do. This bit is also an unstaged change in
>> the berlin maintenance repository, we should commit it. Tobias, could
>> you have a look :) ?
>>
>> +(define powerpc64le
>> +  (list
>> +   ;; A VM donated/hosted by OSUOSL & administered by nckx.
>> +   ;; XXX: SSH tunnel via overdrive1:
>> +   ;; ssh -L 2224:p9.tobias.gr:22 hydra@10.0.0.3
>> +   #;(build-machine
>> +      ;;(name "p9.tobias.gr")
>> +      (name "localhost")
>> +      (port 2224)
>> +      (user "hydra")
>> +      (systems '("powerpc64le-linux"))
>> +      (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJEbRxJ6WqnNLYEMNDUKFcdMtyZ9V/6oEfBFSHY8xE6A nckx"))))
>
> IIRC this machine is now running WireGuard, Tobias? If so, could you
> change this to refer to its WireGuard IP and commit it?
>
>> I also found that other machines were unreachable and commented them:
>>
>>  ;; CPU: 16 ARM Cortex-A72 cores
>>  ;; RAM: 32 GB
>> -  (list (build-machine
>> +  (list #;(build-machine
>>           ;;kreuzberg
>>           (name "10.0.0.9")
>>           (user "hydra")
>
> Ricardo, could you check what’s wrong with kreuzberg?

Oh, the usual…

--8<---------------cut here---------------start------------->8---
root@kreuzberg ~# guix shell wireguard-tools -- wg
interface: wg0
  public key: f9WGJTXp8bozJb0KxePjkOclF5pJUy1AomHWJHy80y4=
  private key: (hidden)
  listening port: 51820

peer: wOIfhHqQ+JQmskRS2qSvNRgZGh33UxFDi8uuSXOltF0=
  endpoint: 141.80.181.40:51820
  allowed ips: 10.0.0.1/32
  latest handshake: 2 days, 2 hours, 11 minutes, 13 seconds ago
  transfer: 292.79 MiB received, 6.05 GiB sent
--8<---------------cut here---------------end--------------->8---

Whenever the build farm is awfully quiet (e.g. because of GC) the wireguard connection times out. I usually restart the cuirass-remote-worker and everything’s fine again.

Today I got some additional SD cards for these machines, so I’m going to reconfigure them (locally, because of the “guix deploy” bug) and then move them to the data centre. Once reconfigured they will keep the wireguard connection alive all by themselves, so no manual intervention will be necessary.

I didn’t reconfigure them locally earlier because I hoped we would be able to make time for the “guix deploy” bug, but things turned out differently.

--
Ricardo
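Editor's sketch: the "latest handshake" line in the `wg` output above is the quickest indicator that the tunnel has gone quiet. The Python below turns that line into a number of seconds so it can be compared against a threshold. It is an illustration, not part of the build farm's tooling; the function name and the 5-minute threshold are assumptions made for the example.

```python
import re

# Seconds per unit, for the units that appear in "latest handshake" lines.
UNIT_SECONDS = {
    "year": 365 * 86400,
    "day": 86400,
    "hour": 3600,
    "minute": 60,
    "second": 1,
}


def handshake_age_seconds(line: str) -> int:
    """Convert a 'latest handshake: ... ago' line into a number of seconds."""
    total = 0
    for amount, unit in re.findall(r"(\d+) (year|day|hour|minute|second)s?", line):
        total += int(amount) * UNIT_SECONDS[unit]
    return total


line = "latest handshake: 2 days, 2 hours, 11 minutes, 13 seconds ago"
age = handshake_age_seconds(line)
print(age)  # 180673

# Hypothetical threshold: flag the tunnel as stale after 5 minutes of silence.
STALE_AFTER = 300
print(age > STALE_AFTER)  # True
```

A check along these lines could run periodically and raise an alert before offloading starts failing, though machines that keep the connection alive by themselves, as planned above, make it unnecessary.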
* bug#53463: ci.guix.gnu.org not building the 'guix' job
From: Ludovic Courtès @ 2022-03-21 8:38 UTC
To: Mathieu Othacehe; +Cc: Ricardo Wurmus, 53463

Hi there!

Looks like this bug is solved: the ‘guix’ jobset is getting built.

However, evaluations are marked as “failed”, even though their build log shows they succeeded, and if you click on one of them, you see that all its builds are there:

https://ci.guix.gnu.org/eval/168652
https://ci.guix.gnu.org/eval/168652/log/raw
https://ci.guix.gnu.org/jobset/guix?border-high=169749

Any idea what could be wrong?

Thanks,
Ludo’.
* bug#53463: ci.guix.gnu.org not building the 'guix' job
From: Mathieu Othacehe @ 2022-03-21 8:55 UTC
To: Ludovic Courtès; +Cc: Ricardo Wurmus, 53463

Hey Ludo,

> However, evaluations are marked as “failed”, even though their build log
> shows they succeeded, and if you click on one of them, you see that all
> its builds are there:
>
> https://ci.guix.gnu.org/eval/168652
> https://ci.guix.gnu.org/eval/168652/log/raw
> https://ci.guix.gnu.org/jobset/guix?border-high=169749

This started at the time we enabled the armhf architecture, so I guess it is marked as failed because the guix specification could not be evaluated for this architecture.

Thanks,

Mathieu
* bug#53463: ci.guix.gnu.org not building the 'guix' job
From: Mathieu Othacehe @ 2022-08-16 7:57 UTC
To: Leo Famulari; +Cc: guix-sysadmin, 53463-done

Hello,

> https://ci.guix.gnu.org/jobset/guix

It is now fixed for the following architectures: x86_64-linux, i686-linux and aarch64-linux. I'll try to repair it for powerpc64le-linux soon. We can close this one I guess.

Thanks,

Mathieu