unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
* bug#34033: Offloading sometimes hangs
@ 2019-01-10 16:09 Ludovic Courtès
  2019-01-14 22:45 ` Ludovic Courtès
  0 siblings, 1 reply; 9+ messages in thread
From: Ludovic Courtès @ 2019-01-10 16:09 UTC (permalink / raw)
  To: 34033

Hello,

So there’s another situation where offloading regularly hangs on
berlin.  The ‘guix offload’ process looks like this:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007f1f715686a1 in __GI___poll (fds=0x14e9b30, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f1f673b94e7 in ssh_poll (timeout=<optimized out>, nfds=<optimized out>, fds=<optimized out>)
    at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/poll.c:98
#2  ssh_poll_ctx_dopoll (ctx=ctx@entry=0x14ee2e0, timeout=timeout@entry=-1)
    at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/poll.c:612
#3  0x00007f1f673ba449 in ssh_handle_packets (session=session@entry=0x2249360, timeout=timeout@entry=-1)
    at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/session.c:634
#4  0x00007f1f673ba51d in ssh_handle_packets_termination (session=session@entry=0x2249360, timeout=<optimized out>,
    timeout@entry=-3, fct=fct@entry=0x7f1f673a4430 <ssh_channel_read_termination>, user=user@entry=0x7ffce23953f0)
    at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/session.c:696
#5  0x00007f1f673a6aaf in ssh_channel_read_timeout (channel=0x224e360, dest=dest@entry=0x18ef020,
    count=count@entry=8, is_stderr=<optimized out>, timeout=-3, timeout@entry=-1)
    at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/channels.c:2705
#6  0x00007f1f673a6bbb in ssh_channel_read (channel=<optimized out>, dest=dest@entry=0x18ef020, count=count@entry=8,
    is_stderr=<optimized out>) at /tmp/guix-build-libssh-0.7.7.drv-0/libssh-0.7.7-checkout/src/channels.c:2621
#7  0x00007f1f67413a23 in read_from_channel_port (
    channel=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=<optimized out>, start=0, count=8) at channel-type.c:161
#8  0x00007f1f71b65287 in scm_i_read_bytes (
    port=port@entry=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=dst@entry="#<vu8vector>" = {...}, start=start@entry=0, count=count@entry=8) at ports.c:1559
#9  0x00007f1f71b6996c in scm_c_read_bytes (
    port=port@entry=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0, dst=dst@entry="#<vu8vector>" = {...}, start=start@entry=0, count=count@entry=8) at ports.c:1639
#10 0x00007f1f71b6fd80 in scm_get_bytevector_n (
    port=<error reading variable: ERROR: In procedure gdbscm_memory_port_fill_input: error reading memory>0x22f01a0,
    count=<optimized out>) at r6rs-ports.c:421
#11 0x00007f1f71ba4715 in vm_regular_engine (thread=0x14e9b30, vp=0xc31f30, registers=0xffffffff, resume=1901495969)
    at vm-engine.c:786

[...]

(gdb) p *fds
$1 = {fd = 15, events = 1, revents = 0}
(gdb) shell ls -l /proc/12185/fd
total 0
lr-x------ 1 root root 64 Jan 10 16:56 0 -> 'pipe:[76778016]'
l-wx------ 1 root root 64 Jan 10 16:56 1 -> 'pipe:[76778015]'
lr-x------ 1 root root 64 Jan 10 16:56 10 -> 'pipe:[76838317]'
l-wx------ 1 root root 64 Jan 10 16:56 11 -> 'pipe:[76838317]'
lr-x------ 1 root root 64 Jan 10 16:56 12 -> 'pipe:[76851360]'
l-wx------ 1 root root 64 Jan 10 16:56 13 -> 'pipe:[76851360]'
l-wx------ 1 root root 64 Jan 10 16:56 14 -> /var/guix/offload/overdrive1.guixsd.org/1
lrwx------ 1 root root 64 Jan 10 16:56 15 -> 'socket:[76860702]'
lr-x------ 1 root root 64 Jan 10 16:56 16 -> /dev/urandom
l-wx------ 1 root root 64 Jan 10 16:56 2 -> 'pipe:[76778015]'
lr-x------ 1 root root 64 Jan 10 16:56 3 -> 'pipe:[76838313]'
l-wx------ 1 root root 64 Jan 10 16:56 4 -> 'pipe:[76778017]'
l-wx------ 1 root root 64 Jan 10 16:56 5 -> 'pipe:[76838313]'
lr-x------ 1 root root 64 Jan 10 16:56 6 -> 'pipe:[76838316]'
l-wx------ 1 root root 64 Jan 10 16:56 7 -> 'pipe:[76838316]'
lr-x------ 1 root root 64 Jan 10 16:56 8 -> 'pipe:[76841414]'
l-wx------ 1 root root 64 Jan 10 16:56 9 -> 'pipe:[76841414]'
--8<---------------cut here---------------end--------------->8---

It’s a ‘get-bytevector-n’ for 8 bytes, so it looks like the daemon
protocol.  At that point the socket is actually dead: if I connect on
the remote machine (overdrive1.guixsd.org) I can see that there are no
other open SSH sessions.

A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP
instead of just POLLIN.

Additionally, we could change Guile-SSH so that we can specify a timeout
when reading from a channel.

Ludo’.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#34033: Offloading sometimes hangs
  2019-01-10 16:09 bug#34033: Offloading sometimes hangs Ludovic Courtès
@ 2019-01-14 22:45 ` Ludovic Courtès
  2020-02-22  4:37   ` Maxim Cournoyer
  2020-07-02 14:20   ` Mathieu Othacehe
  0 siblings, 2 replies; 9+ messages in thread
From: Ludovic Courtès @ 2019-01-14 22:45 UTC (permalink / raw)
  To: 34033

Hello,

Ludovic Courtès <ludo@gnu.org> skribis:

> A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP
> instead of just POLLIN.

Reported here:

  https://www.libssh.org/archive/libssh/2019-01/0000000.html

A fix has been proposed by upstream and should be committed shortly.

> Additionally, we could change Guile-SSH so that we can specify a timeout
> when reading from a channel.

Turns out we can set a per-session timeout, which we already do (see
#:timeout in ‘open-ssh-session’ in (guix scripts offload)) but
‘ssh_channel_read’ would ignore it and instead pass an infinite timeout
to poll(2):

  https://www.libssh.org/archive/libssh/2019-01/0000001.html

This issue happens to be fixed in libssh 0.8.x, so I upgraded our libssh
package in commit a8b0556ea1e439c89dc1ba33c8864e8b9b811f08.

(That still doesn’t tell us why our ‘guix offload’ processes would
occasionally be stuck but at least it ensures the build farm keeps
making progress even when that happens.)

Ludo’.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#34033: Offloading sometimes hangs
  2019-01-14 22:45 ` Ludovic Courtès
@ 2020-02-22  4:37   ` Maxim Cournoyer
  2020-02-22 20:35     ` Ludovic Courtès
  2020-07-02 14:20   ` Mathieu Othacehe
  1 sibling, 1 reply; 9+ messages in thread
From: Maxim Cournoyer @ 2020-02-22  4:37 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 34033

Hello Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Hello,
>
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP
>> instead of just POLLIN.
>
> Reported here:
>
>   https://www.libssh.org/archive/libssh/2019-01/0000000.html
>
> A fix has been proposed by upstream and should be committed shortly.
>
>> Additionally, we could change Guile-SSH so that we can specify a timeout
>> when reading from a channel.
>
> Turns out we can set a per-session timeout, which we already do (see
> #:timeout in ‘open-ssh-session’ in (guix scripts offload)) but
> ‘ssh_channel_read’ would ignore it and instead pass an infinite timeout
> to poll(2):
>
>   https://www.libssh.org/archive/libssh/2019-01/0000001.html
>
> This issue happens to be fixed in libssh 0.8.x, so I upgraded our libssh
> package in commit a8b0556ea1e439c89dc1ba33c8864e8b9b811f08.
>
> (That still doesn’t tell us why our ‘guix offload’ processes would
> occasionally be stuck but at least it ensures the build farm keeps
> making progress even when that happens.)
>
> Ludo’.

Seems the patch in the response at the URL you linked is awaiting some
feedback/review.  Is this the reason 'guix substitute' hangs for so long
when the substitute server is down? (like 1 minute or so).

Maxim

^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#34033: Offloading sometimes hangs
  2020-02-22  4:37   ` Maxim Cournoyer
@ 2020-02-22 20:35     ` Ludovic Courtès
  2020-02-24 13:59       ` Maxim Cournoyer
  0 siblings, 1 reply; 9+ messages in thread
From: Ludovic Courtès @ 2020-02-22 20:35 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: 34033

Hi Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Hello,
>>
>> Ludovic Courtès <ludo@gnu.org> skribis:
>>
>>> A simple thing would be to somehow get libssh to pass POLLIN | POLLRDHUP
>>> instead of just POLLIN.
>>
>> Reported here:
>>
>>   https://www.libssh.org/archive/libssh/2019-01/0000000.html
>>
>> A fix has been proposed by upstream and should be committed shortly.
>>
>>> Additionally, we could change Guile-SSH so that we can specify a timeout
>>> when reading from a channel.
>>
>> Turns out we can set a per-session timeout, which we already do (see
>> #:timeout in ‘open-ssh-session’ in (guix scripts offload)) but
>> ‘ssh_channel_read’ would ignore it and instead pass an infinite timeout
>> to poll(2):
>>
>>   https://www.libssh.org/archive/libssh/2019-01/0000001.html
>>
>> This issue happens to be fixed in libssh 0.8.x, so I upgraded our libssh
>> package in commit a8b0556ea1e439c89dc1ba33c8864e8b9b811f08.
>>
>> (That still doesn’t tell us why our ‘guix offload’ processes would
>> occasionally be stuck but at least it ensures the build farm keeps
>> making progress even when that happens.)
>>
>> Ludo’.
>
> Seems the patch in the response at the URL you linked is awaiting some
> feedback/review.  Is this the reason 'guix substitute' hangs for so long
> when the substitute server is down? (like 1 minute or so).

The issues above are in libssh and were fixed a while ago.  ‘guix
substitute’ doesn’t use Guile-SSH/libssh, so the problem you’re seeing
must be something different.

What do you mean by “the substitute server is down”?  You mean ‘guix
publish’ is not running, or the machine is unavailable altogether?

Thanks,
Ludo’.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#34033: Offloading sometimes hangs
  2020-02-22 20:35     ` Ludovic Courtès
@ 2020-02-24 13:59       ` Maxim Cournoyer
  2020-02-24 14:59         ` Ludovic Courtès
  0 siblings, 1 reply; 9+ messages in thread
From: Maxim Cournoyer @ 2020-02-24 13:59 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 34033

Hello Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:


[...]

> The issues above are in libssh and were fixed a while ago.  ‘guix
> substitute’ doesn’t use Guile-SSH/libssh, so the problem you’re seeing
> must be something different.

OK, good to know!

> What do you mean by “the substitute server is down”?  You mean ‘guix
> publish’ is not running, or the machine is unavailable altogether?

The machine is turned off (i.e., the machine is unavailable altogether
:-).  It doesn't hang forever, but the timeout is a rather long.  I'm
using that machine as both a substitute and an offload server.

Maxim

^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#34033: Offloading sometimes hangs
  2020-02-24 13:59       ` Maxim Cournoyer
@ 2020-02-24 14:59         ` Ludovic Courtès
  0 siblings, 0 replies; 9+ messages in thread
From: Ludovic Courtès @ 2020-02-24 14:59 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: 34033

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:

[...]

>> What do you mean by “the substitute server is down”?  You mean ‘guix
>> publish’ is not running, or the machine is unavailable altogether?
>
> The machine is turned off (i.e., the machine is unavailable altogether
> :-).  It doesn't hang forever, but the timeout is a rather long.  I'm
> using that machine as both a substitute and an offload server.

I don’t think it’s Guix-specific though.  Presumably, if you do:

  telnet that-server https

you get the same timeout, right?

Thanks,
Ludo’.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#34033: Offloading sometimes hangs
  2019-01-14 22:45 ` Ludovic Courtès
  2020-02-22  4:37   ` Maxim Cournoyer
@ 2020-07-02 14:20   ` Mathieu Othacehe
  2020-07-03  7:05     ` Mathieu Othacehe
  1 sibling, 1 reply; 9+ messages in thread
From: Mathieu Othacehe @ 2020-07-02 14:20 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 34033


Hello,

> (That still doesn’t tell us why our ‘guix offload’ processes would
> occasionally be stuck but at least it ensures the build farm keeps
> making progress even when that happens.)

I'm still not sure it's directly related to this bug but I observed
several offloading hangs on Berlin today.

For instance, in Cuirass logs:

--8<---------------cut here---------------start------------->8---
2020-07-02T09:59:45 '/gnu/store/rm8ndiichxhwybaizis5pgck77952ilp-halt.drv' offloaded to '141.80.167.164'
2020-07-02T09:54:30 '/gnu/store/dxczkbf5wa6qr37gm7wr995hcxs8s0ya-motion-4.2.2.drv' offloaded to '141.80.167.170'
--8<---------------cut here---------------end--------------->8---

those two builds were offloaded around 10:00 today and there's still no
report from them at 16:00. 

On 141.80.167.164 there's a matching build log:

--8<---------------cut here---------------start------------->8---
-rw-r--r-- 1 root root 1735 Jul  2 10:00 /var/log/guix/drvs/rm/8ndiichxhwybaizis5pgck77952ilp-halt.drv.bz2
--8<---------------cut here---------------end--------------->8---

same on 141.80.167.170,

--8<---------------cut here---------------start------------->8---
-rw-r--r-- 1 root root 6344 Jul  2 09:56 /var/log/guix/drvs/dx/czkbf5wa6qr37gm7wr995hcxs8s0ya-motion-4.2.2.drv.bz2
--8<---------------cut here---------------end--------------->8---

Having those builds "unfinished" keeps the rest of the evaluation
hanging. Running this SQL command in Cuirass database:

--8<---------------cut here---------------start------------->8---
sqlite> select derivation, datetime(starttime, 'unixepoch', 'localtime'),stoptime from Builds where status=-1 and evaluation=14771;
/gnu/store/ncp59nyidli4lm3ff2lkfjym25yb18j5-guix-1.1.0-14.5bd8033.drv|2020-07-02 09:33:04|0
/gnu/store/rm8ndiichxhwybaizis5pgck77952ilp-halt.drv|2020-07-02 09:59:28|0
/gnu/store/71wnjgm2waqgw3fqmxmc4r3f1ifd1l92-cups-test.drv|2020-07-02 10:00:26|0
/gnu/store/9qsqd7jfwnaw9sm323y45cwymn98kyjl-exim-test.drv|2020-07-02 10:00:51|0
/gnu/store/vhcww4fw4qxw0hl1009npd26b22gfj3c-bitlbee-test.drv|2020-07-02 10:00:24|0
/gnu/store/92jrd6dfzgdifr107hwi64s8hf4mls47-iptables.drv|2020-07-02 09:59:49|0
/gnu/store/380nq6sjphd0agrvl43sr6ypli1yraz4-gnunet-0.12.2.drv|2020-07-02 09:51:32|0
/gnu/store/lqs22nbc6vy2z2524rmkcsmbh5mllm62-cuirass-0.0.1-37.882393d.drv|2020-07-02 10:34:37|0
/gnu/store/dxczkbf5wa6qr37gm7wr995hcxs8s0ya-motion-4.2.2.drv|2020-07-02 09:54:02|0
/gnu/store/5ln3r997ycr7rd6fqahd2d426mjw0rxb-gzochi-0.12.drv|2020-07-02 09:53:51|0
--8<---------------cut here---------------end--------------->8---

shows that the evaluation is pretty much pending since 10:00.

According to Cuirass logs again, all those builds were offloaded,
"/gnu/store/380nq6sjphd0agrvl43sr6ypli1yraz4-gnunet-0.12.2.drv",
"/gnu/store/lqs22nbc6vy2z2524rmkcsmbh5mllm62-cuirass-0.0.1-37.882393d.drv"
and /gnu/store/5ln3r997ycr7rd6fqahd2d426mjw0rxb-gzochi-0.12.drv
are reported as failed, and all other are still hanging.

Something is going wrong here! I'll keep investigating.

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#34033: Offloading sometimes hangs
  2020-07-02 14:20   ` Mathieu Othacehe
@ 2020-07-03  7:05     ` Mathieu Othacehe
  2020-07-03 13:58       ` Ludovic Courtès
  0 siblings, 1 reply; 9+ messages in thread
From: Mathieu Othacehe @ 2020-07-03  7:05 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 34033


Hey,

> Something is going wrong here! I'll keep investigating.

To help us investigate those issues I added a "/status" page, which is
also accessible from a new drop-down menu in the Cuirass navigation bar.

See, https://ci.guix.gnu.org/status.

Hydra has the same interface, but also a "Machine status" page, that
breaks down the running builds machine per machine. I plan to implement
that one next. Reading Hydra code, I also discovered that some part of
the offloading is directly done from Hydra, which talks with the
nix-daemon of the connected build machines, interesting!

While I'm writing, we have 5 running builds for ~1 hour, and 76040 queued
builds. Given the computing power of Berlin, there must be a bottleneck
somewhere.

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 9+ messages in thread

* bug#34033: Offloading sometimes hangs
  2020-07-03  7:05     ` Mathieu Othacehe
@ 2020-07-03 13:58       ` Ludovic Courtès
  0 siblings, 0 replies; 9+ messages in thread
From: Ludovic Courtès @ 2020-07-03 13:58 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 34033

Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> Something is going wrong here! I'll keep investigating.
>
> To help us investigate those issues I added a "/status" page, which is
> also accessible from a new drop-down menu in the Cuirass navigation bar.
>
> See, https://ci.guix.gnu.org/status.

Nice!  So it’s roughly like the info at /api/queue, but filtered to
running builds, right?

> Hydra has the same interface, but also a "Machine status" page, that
> breaks down the running builds machine per machine. I plan to implement
> that one next. Reading Hydra code, I also discovered that some part of
> the offloading is directly done from Hydra, which talks with the
> nix-daemon of the connected build machines, interesting!

Yes, Hydra does most of the scheduling by itself.  Since this is
redundant with what the daemon + offload do, I thought Cuirass shouldn’t
do any scheduling at all and instead let the daemon take care of it
all.

This has advantages (the daemon has a global view and can achieve better
scheduling), and drawbacks (the protocol requires us to wait for
‘build-things’ completion before we can queue more builds, and
scheduling decisions are almost invisible to Cuirass).

> While I'm writing, we have 5 running builds for ~1 hour, and 76040 queued
> builds. Given the computing power of Berlin, there must be a bottleneck
> somewhere.

Yes!  I’ve often run “guix processes” on berlin, then stracing the
‘SessionPID’ process.  It’s insightful because you sometimes see the
daemon is stuck waiting for a machine to offload to, sometimes it’s
stuck waiting for a build that will perhaps just eventually timeout…

Ludo’.




^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-07-03 14:00 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-01-10 16:09 bug#34033: Offloading sometimes hangs Ludovic Courtès
2019-01-14 22:45 ` Ludovic Courtès
2020-02-22  4:37   ` Maxim Cournoyer
2020-02-22 20:35     ` Ludovic Courtès
2020-02-24 13:59       ` Maxim Cournoyer
2020-02-24 14:59         ` Ludovic Courtès
2020-07-02 14:20   ` Mathieu Othacehe
2020-07-03  7:05     ` Mathieu Othacehe
2020-07-03 13:58       ` Ludovic Courtès

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).