unofficial mirror of bug-guix@gnu.org 
* bug#53463: ci.guix.gnu.org not building the 'guix' job
@ 2022-01-23  0:56 Leo Famulari
  2022-01-23 23:00 ` Leo Famulari
  2022-08-16  7:57 ` Mathieu Othacehe
  0 siblings, 2 replies; 11+ messages in thread
From: Leo Famulari @ 2022-01-23  0:56 UTC
  To: 53463; +Cc: guix-sysadmin

As far as I can tell, ci.guix.gnu.org stopped building the 'guix'
job a couple of days ago:

https://ci.guix.gnu.org/jobset/guix





* bug#53463: ci.guix.gnu.org not building the 'guix' job
  2022-01-23  0:56 bug#53463: ci.guix.gnu.org not building the 'guix' job Leo Famulari
@ 2022-01-23 23:00 ` Leo Famulari
  2022-01-27 22:13   ` Leo Famulari
  2022-02-02 18:41   ` Mathieu Othacehe
  2022-08-16  7:57 ` Mathieu Othacehe
  1 sibling, 2 replies; 11+ messages in thread
From: Leo Famulari @ 2022-01-23 23:00 UTC
  To: 53463

Also, the 'master' job hasn't been run in ~2 days:

https://ci.guix.gnu.org/jobset/master

I think the build farm is waiting to finish collecting garbage.





* bug#53463: ci.guix.gnu.org not building the 'guix' job
  2022-01-23 23:00 ` Leo Famulari
@ 2022-01-27 22:13   ` Leo Famulari
  2022-02-02 18:41   ` Mathieu Othacehe
  1 sibling, 0 replies; 11+ messages in thread
From: Leo Famulari @ 2022-01-27 22:13 UTC
  To: 53463

On Sun, Jan 23, 2022 at 06:00:40PM -0500, Leo Famulari wrote:
> Also, the 'master' job hasn't been run in ~2 days:
> 
> https://ci.guix.gnu.org/jobset/master
> 
> I think the build farm is waiting to finish collecting garbage.

Unfortunately, the 'master' jobset is broken again, and the 'guix'
jobset is still broken.





* bug#53463: ci.guix.gnu.org not building the 'guix' job
  2022-01-23 23:00 ` Leo Famulari
  2022-01-27 22:13   ` Leo Famulari
@ 2022-02-02 18:41   ` Mathieu Othacehe
  2022-02-04  8:58     ` Ludovic Courtès
  1 sibling, 1 reply; 11+ messages in thread
From: Mathieu Othacehe @ 2022-02-02 18:41 UTC
  To: Leo Famulari; +Cc: 53463


Hello,

The issue here seems to be that the evaluations of the 'guix' jobset are
never finishing, even when the GC is not running.

I ran strace on one of the stuck evaluation processes; it repeatedly
shows:

--8<---------------cut here---------------start------------->8---
[pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
[pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
[pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
[pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
[pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
[pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
[pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
[pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
--8<---------------cut here---------------end--------------->8---
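
Each read in the trace looks like one framed daemon message: an 8-byte
little-endian tag whose low bytes spell "gmlo" (0x6f6c6d67), an 8-byte
little-endian payload length (0x4A = 74 or 0x47 = 71 here), then the
text, zero-padded to a multiple of 8 bytes. That accounts for the 96 and
88 byte totals. Below is a minimal Python sketch for decoding such
frames from a captured buffer, assuming that framing holds; the names
STDERR_NEXT and decode_frames are illustrative and are not taken from
the Guix sources.

```python
import struct

# Tag seen in the trace: the bytes "gmlo" read as a little-endian integer.
STDERR_NEXT = 0x6F6C6D67


def decode_frames(buf):
    """Decode consecutive log frames from a captured byte buffer.

    Assumed layout per frame: 8-byte tag, 8-byte length, payload,
    zero padding up to the next multiple of 8 bytes.
    """
    messages = []
    offset = 0
    while offset + 16 <= len(buf):
        tag, length = struct.unpack_from("<QQ", buf, offset)
        if tag != STDERR_NEXT:
            break  # some other message type; this sketch stops here
        start = offset + 16
        end = start + length
        if end > len(buf):
            break  # truncated frame
        messages.append(buf[start:end].decode("utf-8", errors="replace"))
        # Skip the payload plus its zero padding.
        offset = start + (length + 7) // 8 * 8
    return messages


# The first read() from the trace above, copied byte for byte.
sample = (
    b"gmlo\0\0\0\0J\0\0\0\0\0\0\0"
    b"guix offload: error: failed to connect to 'localhost': "
    b"Connection refused\n"
    b"\0\0\0\0\0\0"
)

for message in decode_frames(sample):
    print(message, end="")
```

Run against the alternating reads, this makes the loop easy to see: the
same "Connection refused" error and the same build-slot acquisition keep
repeating.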

To be continued,

Thanks,

Mathieu





* bug#53463: ci.guix.gnu.org not building the 'guix' job
  2022-02-02 18:41   ` Mathieu Othacehe
@ 2022-02-04  8:58     ` Ludovic Courtès
  2022-02-04  9:54       ` Mathieu Othacehe
  0 siblings, 1 reply; 11+ messages in thread
From: Ludovic Courtès @ 2022-02-04  8:58 UTC
  To: Mathieu Othacehe; +Cc: 53463

Hello!

Mathieu Othacehe <othacehe@gnu.org> skribis:

> I tried to strace one of the stuck evaluation process, it returns
> repeatedly:
>
> [pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
> [pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
> [pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
> [pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 65536) = 88
> [pid 36294] write(239, "gmlo\0\0\0\0G\0\0\0\0\0\0\0process 40190 acquired build slot '/var/guix/offload/localhost:2224/0'\n\0", 88) = 88
> [pid 36294] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
> [pid 36294] pselect6(240, [40 227 239], [], [], NULL, NULL) = 1 (in [227])
> [pid 36294] read(227, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 65536) = 96
> [pid 36294] write(239, "gmlo\0\0\0\0J\0\0\0\0\0\0\0guix offload: error: failed to connect to 'localhost': Connection refused\n\0\0\0\0\0\0", 96) = 96

Oh!  That indicates that it’s failing to offload to one of the
‘localhost’ build machines specified in /etc/guix/machines.scm.
Normally there’s an SSH tunnel set up for those, but I guess it broke.

Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
machines by their WireGuard IP?

Thanks,
Ludo’.





* bug#53463: ci.guix.gnu.org not building the 'guix' job
  2022-02-04  8:58     ` Ludovic Courtès
@ 2022-02-04  9:54       ` Mathieu Othacehe
  2022-02-08 10:22         ` Ludovic Courtès
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Othacehe @ 2022-02-04  9:54 UTC
  To: Ludovic Courtès; +Cc: 53463


Hey,

> Oh!  That indicates that it’s failing to offload to one of the
> ‘localhost’ build machines specified in /etc/guix/machines.scm.
> Normally there’s an SSH tunnel set up for those, but I guess it broke.
>
> Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
> machines by their WireGuard IP?

Seems like the right thing to do. This bit is also an unstaged change in
the berlin maintenance repository; we should commit it. Tobias, could
you have a look :) ?

--8<---------------cut here---------------start------------->8---
+(define powerpc64le
+  (list
+   ;; A VM donated/hosted by OSUOSL & administered by nckx.
+   ;; XXX: SSH tunnel via overdrive1:
+   ;; ssh -L 2224:p9.tobias.gr:22 hydra@10.0.0.3
+   #;(build-machine
+    ;;(name "p9.tobias.gr")
+    (name "localhost")
+    (port 2224)
+    (user "hydra")
+    (systems '("powerpc64le-linux"))
+    (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJEbRxJ6WqnNLYEMNDUKFcdMtyZ9V/6oEfBFSHY8xE6A nckx"))))
--8<---------------cut here---------------end--------------->8---

I also found that other machines were unreachable and commented them out:

--8<---------------cut here---------------start------------->8---
   ;; CPU: 16 ARM Cortex-A72 cores
   ;; RAM: 32 GB
-  (list (build-machine
+  (list #;(build-machine
          ;;kreuzberg
          (name "10.0.0.9")
          (user "hydra")
@@ -243,13 +256,13 @@
    ;; BeagleBoard X15 kindly hosted by Simon Josefsson.
    ;; CPU: Cortex A15 (2 cores)
    ;; RAM: 2 GB
-   (build-machine
+   #;(build-machine
     (name "10.0.0.5")                   ;guix-x15
     (user "hydra")
     (systems '("armhf-linux"))
     (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOfXjwCAFWeGiUoOVXEgtIeXxbtymjOTg7ph1ObMAcJ0 root@beaglebone"))
 
-   (build-machine
+   #;(build-machine
     (name "10.0.0.6")                   ;guix-x15b
     (user "hydra")
     (systems '("armhf-linux"))
--8<---------------cut here---------------end--------------->8---

Nevertheless, we are hitting an offload issue here, maybe an occurrence
of #24496. The offload mechanism should time out when a machine is
unreachable instead of retrying over and over, which causes all
evaluation processes to hang.
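
The behaviour asked for here can be sketched in Python as a connection
probe with a timeout plus a bounded number of rounds. This is only an
illustration of the idea, not how 'guix offload' is implemented; the
names can_connect and pick_machine, the 5 second timeout and the 3
rounds are all assumptions.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts and unreachable hosts.
        return False


def pick_machine(machines, probe=can_connect, attempts=3):
    """Return the first (host, port) that answers, or None.

    Every machine is probed at most `attempts` times, so an unreachable
    machine cannot keep the caller looping forever.
    """
    for _ in range(attempts):
        for host, port in machines:
            if probe(host, port):
                return (host, port)
    return None


if __name__ == "__main__":
    # The tunnelled machine from the trace; prints None if the tunnel is down.
    print(pick_machine([("localhost", 2224)]))
```

With a bound like this, the caller gets a definite "no machine
available" answer and can fail the build instead of hanging.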

Thanks,

Mathieu





* bug#53463: ci.guix.gnu.org not building the 'guix' job
  2022-02-04  9:54       ` Mathieu Othacehe
@ 2022-02-08 10:22         ` Ludovic Courtès
  2022-02-08 12:52           ` Ricardo Wurmus
  2022-03-21  8:38           ` Ludovic Courtès
  0 siblings, 2 replies; 11+ messages in thread
From: Ludovic Courtès @ 2022-02-08 10:22 UTC
  To: Mathieu Othacehe; +Cc: Ricardo Wurmus, 53463

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> Oh!  That indicates that it’s failing to offload to one of the
>> ‘localhost’ build machines specified in /etc/guix/machines.scm.
>> Normally there’s an SSH tunnel set up for those, but I guess it broke.
>>
>> Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
>> machines by their WireGuard IP?
>
> Seems like the right thing to do. This bit is also an unstaged change in
> the berlin maintenance repository, we should commit it. Tobias, could
> you have a look :) ?
>
> +(define powerpc64le
> +  (list
> +   ;; A VM donated/hosted by OSUOSL & administered by nckx.
> +   ;; XXX: SSH tunnel via overdrive1:
> +   ;; ssh -L 2224:p9.tobias.gr:22 hydra@10.0.0.3
> +   #;(build-machine
> +    ;;(name "p9.tobias.gr")
> +    (name "localhost")
> +    (port 2224)
> +    (user "hydra")
> +    (systems '("powerpc64le-linux"))
> +    (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJEbRxJ6WqnNLYEMNDUKFcdMtyZ9V/6oEfBFSHY8xE6A nckx"))))

IIRC this machine is now running WireGuard, Tobias?  If so, could you
change this to refer to its WireGuard IP and commit it?

> I also found that other machines were unreachable and commented them:
>
>    ;; CPU: 16 ARM Cortex-A72 cores
>    ;; RAM: 32 GB
> -  (list (build-machine
> +  (list #;(build-machine
>           ;;kreuzberg
>           (name "10.0.0.9")
>           (user "hydra")

Ricardo, could you check what’s wrong with kreuzberg?

> @@ -243,13 +256,13 @@
>     ;; BeagleBoard X15 kindly hosted by Simon Josefsson.
>     ;; CPU: Cortex A15 (2 cores)
>     ;; RAM: 2 GB
> -   (build-machine
> +   #;(build-machine
>      (name "10.0.0.5")                   ;guix-x15
>      (user "hydra")
>      (systems '("armhf-linux"))
>      (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOfXjwCAFWeGiUoOVXEgtIeXxbtymjOTg7ph1ObMAcJ0 root@beaglebone"))
>  
> -   (build-machine
> +   #;(build-machine
>      (name "10.0.0.6")                   ;guix-x15b
>      (user "hydra")
>      (systems '("armhf-linux"))

Oops.

Note that it’s not necessary to comment them all out.  As long as at
least one machine is available for a given system type, we’re fine:
‘guix offload’ will pick it up.

> Nevertheless we are hitting an offload issue here, maybe an occurrence
> of #24496. The offload mechanism should timeout when a machine is
> unreachable instead of retrying over and over, causing all evaluation
> processes to hang.

Yes, though the problem here is that some architectures were left with
zero machines IIRC, so it would have failed one way or another.
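
The "at least one machine per system type" condition can be checked
mechanically. Here is a minimal Python sketch, assuming a simplified
description of the machine list (dictionaries with "name" and "systems"
keys) and a caller-supplied reachability test; none of these names come
from Guix itself. The entries reuse the addresses and system types
quoted above.

```python
def systems_without_machines(machines, required_systems, reachable):
    """Return the required system types that no reachable machine covers."""
    covered = set()
    for machine in machines:
        if reachable(machine["name"]):
            covered.update(machine["systems"])
    return sorted(set(required_systems) - covered)


# Entries taken from the snippets quoted in this thread.
machines = [
    {"name": "10.0.0.5", "systems": ["armhf-linux"]},
    {"name": "10.0.0.6", "systems": ["armhf-linux"]},
    {"name": "localhost:2224", "systems": ["powerpc64le-linux"]},
]
required = ["armhf-linux", "powerpc64le-linux"]

# Hypothetical situation: only one of the armhf boards answers.
up = {"10.0.0.5"}
print(systems_without_machines(machines, required, lambda name: name in up))
```

Any system type in the result has zero usable machines, which is the
case where builds for it fail no matter how offloading retries.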

Thanks!

Ludo’.





* bug#53463: ci.guix.gnu.org not building the 'guix' job
  2022-02-08 10:22         ` Ludovic Courtès
@ 2022-02-08 12:52           ` Ricardo Wurmus
  2022-03-21  8:38           ` Ludovic Courtès
  1 sibling, 0 replies; 11+ messages in thread
From: Ricardo Wurmus @ 2022-02-08 12:52 UTC
  To: Ludovic Courtès; +Cc: Mathieu Othacehe, 53463


Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Mathieu Othacehe <othacehe@gnu.org> skribis:
>
>>> Oh!  That indicates that it’s failing to offload to one of the
>>> ‘localhost’ build machines specified in /etc/guix/machines.scm.
>>> Normally there’s an SSH tunnel set up for those, but I guess it broke.
>>>
>>> Perhaps we can update /etc/guix/machines.scm to refer to armhf-linux
>>> machines by their WireGuard IP?
>>
>> Seems like the right thing to do. This bit is also an unstaged change in
>> the berlin maintenance repository, we should commit it. Tobias, could
>> you have a look :) ?
>>
>> +(define powerpc64le
>> +  (list
>> +   ;; A VM donated/hosted by OSUOSL & administered by nckx.
>> +   ;; XXX: SSH tunnel via overdrive1:
>> +   ;; ssh -L 2224:p9.tobias.gr:22 hydra@10.0.0.3
>> +   #;(build-machine
>> +    ;;(name "p9.tobias.gr")
>> +    (name "localhost")
>> +    (port 2224)
>> +    (user "hydra")
>> +    (systems '("powerpc64le-linux"))
>> +    (host-key "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJEbRxJ6WqnNLYEMNDUKFcdMtyZ9V/6oEfBFSHY8xE6A nckx"))))
>
> IIRC this machine is now running WireGuard, Tobias?  If so, could you
> change this to refer to its WireGuard IP and commit it?
>
>> I also found that other machines were unreachable and commented them:
>>
>>    ;; CPU: 16 ARM Cortex-A72 cores
>>    ;; RAM: 32 GB
>> -  (list (build-machine
>> +  (list #;(build-machine
>>           ;;kreuzberg
>>           (name "10.0.0.9")
>>           (user "hydra")
>
> Ricardo, could you check what’s wrong with kreuzberg?

Oh, the usual…

--8<---------------cut here---------------start------------->8---
root@kreuzberg ~# guix shell wireguard-tools -- wg
interface: wg0
  public key: f9WGJTXp8bozJb0KxePjkOclF5pJUy1AomHWJHy80y4=
  private key: (hidden)
  listening port: 51820

peer: wOIfhHqQ+JQmskRS2qSvNRgZGh33UxFDi8uuSXOltF0=
  endpoint: 141.80.181.40:51820
  allowed ips: 10.0.0.1/32
  latest handshake: 2 days, 2 hours, 11 minutes, 13 seconds ago
  transfer: 292.79 MiB received, 6.05 GiB sent
--8<---------------cut here---------------end--------------->8---

Whenever the build farm is awfully quiet (e.g. because of GC) the
wireguard connection times out.  I usually restart the
cuirass-remote-worker and everything’s fine again.

Today I got some additional SD cards for these machines, so I’m going to
reconfigure them (locally, because of the “guix deploy” bug) and then
move them to the data centre.  Once reconfigured they will keep the
wireguard connection alive all by themselves, so no manual intervention
is necessary.
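
Until then, a stale tunnel can be spotted from the "latest handshake"
line in the output above. The following Python sketch parses that line
into seconds and flags the tunnel as stale past a threshold. The
function names and the 180 second threshold are assumptions chosen for
illustration; the parser only understands the format shown above.

```python
import re

# Seconds per unit for the human-readable age printed by "wg".
UNITS = {"year": 365 * 86400, "day": 86400, "hour": 3600,
         "minute": 60, "second": 1}


def handshake_age_seconds(wg_output):
    """Return the age of the latest handshake in seconds, or None if absent."""
    for line in wg_output.splitlines():
        line = line.strip()
        if line.startswith("latest handshake:"):
            parts = re.findall(r"(\d+) (year|day|hour|minute|second)s?", line)
            return sum(int(count) * UNITS[unit] for count, unit in parts)
    return None


def is_stale(wg_output, threshold=180):
    """Treat a missing or old handshake as a dead tunnel."""
    age = handshake_age_seconds(wg_output)
    return age is None or age > threshold


sample = """\
peer: wOIfhHqQ+JQmskRS2qSvNRgZGh33UxFDi8uuSXOltF0=
  endpoint: 141.80.181.40:51820
  allowed ips: 10.0.0.1/32
  latest handshake: 2 days, 2 hours, 11 minutes, 13 seconds ago
  transfer: 292.79 MiB received, 6.05 GiB sent
"""

print(handshake_age_seconds(sample), is_stale(sample))
```

A check like this could run periodically and trigger the manual restart
described above.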

I hadn’t reconfigured them locally before because I had hoped we would
find time to fix the “guix deploy” bug, but things turned out differently.

-- 
Ricardo





* bug#53463: ci.guix.gnu.org not building the 'guix' job
  2022-02-08 10:22         ` Ludovic Courtès
  2022-02-08 12:52           ` Ricardo Wurmus
@ 2022-03-21  8:38           ` Ludovic Courtès
  2022-03-21  8:55             ` Mathieu Othacehe
  1 sibling, 1 reply; 11+ messages in thread
From: Ludovic Courtès @ 2022-03-21  8:38 UTC
  To: Mathieu Othacehe; +Cc: Ricardo Wurmus, 53463

Hi there!

Looks like this bug is solved: the ‘guix’ jobset is getting built.

However, evaluations are marked as “failed”, even though their build log
shows they succeeded, and if you click on one of them, you see that all
its builds are there:

  https://ci.guix.gnu.org/eval/168652
  https://ci.guix.gnu.org/eval/168652/log/raw
  https://ci.guix.gnu.org/jobset/guix?border-high=169749

Any idea what could be wrong?

Thanks,
Ludo’.





* bug#53463: ci.guix.gnu.org not building the 'guix' job
  2022-03-21  8:38           ` Ludovic Courtès
@ 2022-03-21  8:55             ` Mathieu Othacehe
  0 siblings, 0 replies; 11+ messages in thread
From: Mathieu Othacehe @ 2022-03-21  8:55 UTC
  To: Ludovic Courtès; +Cc: Ricardo Wurmus, 53463


Hey Ludo,

> However, evaluations are marked as “failed”, even though their build log
> shows they succeeded, and if you click on one of them, you see that all
> its builds are there:
>
>   https://ci.guix.gnu.org/eval/168652
>   https://ci.guix.gnu.org/eval/168652/log/raw
>   https://ci.guix.gnu.org/jobset/guix?border-high=169749

This started at the time we enabled the armhf architecture, so I guess
it is marked as failed because the guix specification could not be
evaluated for this architecture.

Thanks,

Mathieu





* bug#53463: ci.guix.gnu.org not building the 'guix' job
  2022-01-23  0:56 bug#53463: ci.guix.gnu.org not building the 'guix' job Leo Famulari
  2022-01-23 23:00 ` Leo Famulari
@ 2022-08-16  7:57 ` Mathieu Othacehe
  1 sibling, 0 replies; 11+ messages in thread
From: Mathieu Othacehe @ 2022-08-16  7:57 UTC
  To: Leo Famulari; +Cc: guix-sysadmin, 53463-done


Hello,

> https://ci.guix.gnu.org/jobset/guix

It is now fixed for the following architectures: x86_64-linux,
i686-linux and aarch64-linux. I'll try to repair it for
powerpc64le-linux soon.

We can close this one I guess.

Thanks,

Mathieu




