* bug#30365: Offloading sometimes hangs
@ 2018-02-06 10:04 Ludovic Courtès
2018-02-07 13:42 ` Ludovic Courtès
0 siblings, 1 reply; 7+ messages in thread
From: Ludovic Courtès @ 2018-02-06 10:04 UTC (permalink / raw)
To: 30365
Hi,
On berlin.guixsd.org, offloading would sometimes hang in the middle of
an offloaded build: no more build log output showing up, nothing
happening (this is with guix-0.14.0-6.0dcf675).
On the build machine side, the guile process that forwards data between
the sshd and guix-daemon¹ is stuck on:
read(0, …)
with this stack trace:
--8<---------------cut here---------------start------------->8---
(gdb) bt
#0 0x00007f09d6068aed in read () from /gnu/store/3h31zsqxjjg52da5gp3qmhkh4x8klhah-glibc-2.25/lib/libpthread.so.0
#1 0x00007f09d653fc47 in fport_read ()
from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#2 0x00007f09d656cd77 in scm_i_read_bytes ()
from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#3 0x00007f09d65705fe in scm_fill_input ()
from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#4 0x00007f09d6577897 in scm_get_bytevector_some ()
from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#5 0x00007f09d65abc4d in vm_regular_engine ()
from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#6 0x00007f09d65af2aa in scm_call_n ()
from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#7 0x00007f09d65338d7 in scm_primitive_eval ()
from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
--8<---------------cut here---------------end--------------->8---
In theory this “cannot happen” because it reads from stdin iff ‘select’
said stdin is ready.
On the server side (on berlin itself), the corresponding ‘guix offload’
process is stuck here:
--8<---------------cut here---------------start------------->8---
(gdb) bt
#0 0x00007ff49b3590bd in poll () from target:/gnu/store/3h31zsqxjjg52da5gp3qmhkh4x8klhah-glibc-2.25/lib/libc.so.6
#1 0x00007ff48f4db377 in ssh_poll_ctx_dopoll ()
from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#2 0x00007ff48f4dc319 in ssh_handle_packets ()
from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#3 0x00007ff48f4dc3ed in ssh_handle_packets_termination ()
from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#4 0x00007ff48f4c8eff in ssh_channel_read_timeout ()
from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#5 0x00007ff48f930803 in read_from_channel_port ()
from target:/gnu/store/xfaqdvk060yz7ddc9isk3wkybqmcfj3w-guile-ssh-0.11.2/lib/libguile-ssh.so.11
#6 0x00007ff49cea7d77 in scm_i_read_bytes ()
from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#7 0x00007ff49ceac3fc in scm_c_read_bytes ()
from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#8 0x00007ff49ceb2838 in scm_get_bytevector_n ()
from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#9 0x00007ff49cee6c4d in vm_regular_engine ()
from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#10 0x00007ff49ceea2aa in scm_call_n ()
from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#11 0x00007ff49ce6e8d7 in scm_primitive_eval ()
--8<---------------cut here---------------end--------------->8---
Presumably the ‘scm_get_bytevector_n’ call comes from (guix
serialization) or ‘process-stderr’.
IOW we have a deadlock where both sides are waiting for input data.
Ludo’.
¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102
^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#30365: Offloading sometimes hangs
2018-02-06 10:04 bug#30365: Offloading sometimes hangs Ludovic Courtès
@ 2018-02-07 13:42 ` Ludovic Courtès
2018-02-07 20:54 ` Ludovic Courtès
2018-02-09 23:16 ` Ludovic Courtès
0 siblings, 2 replies; 7+ messages in thread
From: Ludovic Courtès @ 2018-02-07 13:42 UTC (permalink / raw)
To: 30365
ludo@gnu.org (Ludovic Courtès) skribis:
> On the build machine side, the guile process that forwards data between
> the sshd and guix-daemon¹ is stuck on:
>
> read(0, …)
> ¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102
I’ve been able to strace this Guile process. When “things work
normally”, its loop does things like this:
--8<---------------cut here---------------start------------->8---
20456 10:52:17 read(9, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 65536) = 152
20456 10:52:17 write(1, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 152) = 152
20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...>
20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0
20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:17 read(9, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 65536) = 168
20456 10:52:17 write(1, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 168) = 168
20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...>
20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0
20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:17 read(9, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 65536) = 216
20456 10:52:17 write(1, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 216) = 216
--8<---------------cut here---------------end--------------->8---
Here FD 9 is the guix-daemon socket and 1 is stdout, which is connected
to the remote ‘guix offload’ process.
We then see a slightly different sequence leading to the read(0, …) call
that hangs:
--8<---------------cut here---------------start------------->8---
20456 10:52:18 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:18 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:18 read(9, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' succeeded after 0.0 seconds\n\0\0\0", 65536) = 72
20456 10:52:18 write(1, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' succeeded after 0.0 seconds\n\0\0\0", 72) = 72
20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=261868144}) = 0
20456 10:52:18 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
20456 10:52:18 madvise(0xffffb0025000, 4096, MADV_DONTNEED <unfinished ...>
20456 10:52:18 <... madvise resumed> ) = 0
20456 10:52:18 write(6, "\0", 1 <unfinished ...>
20456 10:52:18 <... write resumed> ) = 1
20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, <unfinished ...>
20456 10:52:18 <... clock_gettime resumed> {tv_sec=0, tv_nsec=268384968}) = 0
20456 10:52:18 read(0, <unfinished ...>
--8<---------------cut here---------------end--------------->8---
Notice that there’s no pselect showing up here!
The two clock_gettime correspond to ‘get_internal_run_time’ in libguile,
and ‘madvise’ corresponds to ‘return_unused_stack_to_os’; IOW, this part
corresponds to garbage collection.
So what we have here is that the Scheme procedure ‘select’ returned
stdin as “ready for reading”. How did that happen? I believe this is
due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
returns 1, so ‘select’ returns EINTR but it does so without clearing the
FD sets.
Ludo’.
^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#30365: Offloading sometimes hangs
2018-02-07 13:42 ` Ludovic Courtès
@ 2018-02-07 20:54 ` Ludovic Courtès
2018-02-09 23:16 ` Ludovic Courtès
1 sibling, 0 replies; 7+ messages in thread
From: Ludovic Courtès @ 2018-02-07 20:54 UTC (permalink / raw)
To: 30365
[-- Attachment #1: Type: text/plain, Size: 664 bytes --]
ludo@gnu.org (Ludovic Courtès) skribis:
> ludo@gnu.org (Ludovic Courtès) skribis:
>
>> On the build machine side, the guile process that forwards data between
>> the sshd and guix-daemon¹ is stuck on:
>>
>> read(0, …)
>
>> ¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102
>
> I’ve been able to strace this Guile process.
For reference, one way to test this without going through offloading is by:
1. storing the ‘redirect’ code in a file, say redirect.scm;
2. talk to the daemon through a pipe running redirect.scm as with the
attached patch.
Ludo’.
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-patch, Size: 1324 bytes --]
diff --git a/guix/store.scm b/guix/store.scm
index 6742611c6..b38e5bc23 100644
--- a/guix/store.scm
+++ b/guix/store.scm
@@ -473,6 +473,7 @@ name."
(connect uri))
+(use-modules (ice-9 popen))
(define* (open-connection #:optional (uri (%daemon-socket-uri))
#:key port (reserve-space? #t) cpu-affinity)
"Connect to the daemon at URI (a string), or, if PORT is not #f, use it as
@@ -491,7 +492,14 @@ for this connection will be pinned. Return a server object."
(errno EPROTO))
(&message (message "build daemon handshake failed"))))))
(let*-values (((port)
- (or port (connect-to-daemon uri)))
+ (or port
+ (with-fluids ((%default-port-encoding "ISO-8859-1"))
+ (open-pipe* OPEN_BOTH;; "strace" "-t" "-o"
+ ;; "/home/ludo/redirect.log"
+ ;; "-s" "1024"
+ "guile" "/home/ludo/src/guix/redirect.scm"))
+ ;; (connect-to-daemon uri)
+ ))
((output flush)
(buffering-output-port port
(make-bytevector 8192))))
^ permalink raw reply related [flat|nested] 7+ messages in thread
* bug#30365: Offloading sometimes hangs
2018-02-07 13:42 ` Ludovic Courtès
2018-02-07 20:54 ` Ludovic Courtès
@ 2018-02-09 23:16 ` Ludovic Courtès
2018-02-10 10:17 ` Ricardo Wurmus
2018-02-15 0:06 ` Danny Milosavljevic
1 sibling, 2 replies; 7+ messages in thread
From: Ludovic Courtès @ 2018-02-09 23:16 UTC (permalink / raw)
To: 30365-done
Hello!
ludo@gnu.org (Ludovic Courtès) skribis:
> So what we have here is that the Scheme procedure ‘select’ returned
> stdin as “ready for reading”. How did that happen? I believe this is
> due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
> returns 1, so ‘select’ returns EINTR but it does so without clearing the
> FD sets.
I’ve pushed a workaround here:
https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905
and guix-0.14.0-8.bc880f9 includes that fix.
It’s been running for several hours on berlin, building a bunch of
things notably on aarch64, and it seems to work well!
On the Guix side this can be considered fixed, I think.
On the Guile side, we’ll have to push the fix for
<https://bugs.gnu.org/30368>.
Ludo’.
^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#30365: Offloading sometimes hangs
2018-02-09 23:16 ` Ludovic Courtès
@ 2018-02-10 10:17 ` Ricardo Wurmus
2018-02-10 11:07 ` Ludovic Courtès
2018-02-15 0:06 ` Danny Milosavljevic
1 sibling, 1 reply; 7+ messages in thread
From: Ricardo Wurmus @ 2018-02-10 10:17 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: 30365-done
Hi Ludo,
> ludo@gnu.org (Ludovic Courtès) skribis:
>
>> So what we have here is that the Scheme procedure ‘select’ returned
>> stdin as “ready for reading”. How did that happen? I believe this is
>> due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
>> returns 1, so ‘select’ returns EINTR but it does so without clearing the
>> FD sets.
>
> I’ve pushed a workaround here:
>
> https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905
>
> and guix-0.14.0-8.bc880f9 includes that fix.
>
> It’s been running for several hours on berlin, building a bunch of
> things notably on aarch64, and it seems to work well!
Congratulations on figuring this out!
--
Ricardo
GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC
https://elephly.net
^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#30365: Offloading sometimes hangs
2018-02-10 10:17 ` Ricardo Wurmus
@ 2018-02-10 11:07 ` Ludovic Courtès
0 siblings, 0 replies; 7+ messages in thread
From: Ludovic Courtès @ 2018-02-10 11:07 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: 30365
Hello,
Ricardo Wurmus <rekado@elephly.net> skribis:
>> ludo@gnu.org (Ludovic Courtès) skribis:
>>
>>> So what we have here is that the Scheme procedure ‘select’ returned
>>> stdin as “ready for reading”. How did that happen? I believe this is
>>> due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
>>> returns 1, so ‘select’ returns EINTR but it does so without clearing the
>>> FD sets.
>>
>> I’ve pushed a workaround here:
>>
>> https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905
>>
>> and guix-0.14.0-8.bc880f9 includes that fix.
>>
>> It’s been running for several hours on berlin, building a bunch of
>> things notably on aarch64, and it seems to work well!
>
> Congratulations on figuring this out!
Well it did show up again during the night. :-/ So the problem appears
less frequently it seems, but it still appears.
A related issue is that ‘guix offload’ doesn’t time out in this case
but it probably should.
Ludo’.
^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#30365: Offloading sometimes hangs
2018-02-09 23:16 ` Ludovic Courtès
2018-02-10 10:17 ` Ricardo Wurmus
@ 2018-02-15 0:06 ` Danny Milosavljevic
1 sibling, 0 replies; 7+ messages in thread
From: Danny Milosavljevic @ 2018-02-15 0:06 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: 30365-done
Hi Ludo,
> On the Guile side, we’ll have to push the fix for
> <https://bugs.gnu.org/30368>.
Can't think of a way to test select. But can think of a way to test pselect:
Excuse the C:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/select.h>
#include <signal.h>
static void dummy(int signum) {
}
int main() {
int fds[2];
struct sigaction sa = {
.sa_handler = dummy,
.sa_flags = 0,
};
sigemptyset(&sa.sa_mask);
pipe(fds);
/* Make sure we don't miss it */
sigsetmask(sigmask(SIGCHLD));
/* Make sure we handle it once we DO get it (after unblocking it :P) */
sigaction(SIGCHLD, &sa, NULL);
if (fork() == 0) {
_exit(0); /* Cause SIGCHLD (to be at least pending) in the parent. */
} else {
close(fds[0]);
fd_set read_set;
fd_set write_set;
fd_set except_set;
sigset_t sigmask;
int rc;
FD_ZERO(&read_set);
FD_ZERO(&write_set);
FD_SET(fds[1], &write_set);
FD_ZERO(&except_set);
sigemptyset(&sigmask); /* unblock SIGCHLD */
rc = pselect(fds[0] + 1, &read_set, &write_set, &except_set, NULL, &sigmask);
if (rc == -1) {
perror("select");
} else if (rc == 0) {
printf("ZERO\n");
} else {
printf("N\n");
}
}
return 0;
}
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2018-02-15 0:07 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-02-06 10:04 bug#30365: Offloading sometimes hangs Ludovic Courtès
2018-02-07 13:42 ` Ludovic Courtès
2018-02-07 20:54 ` Ludovic Courtès
2018-02-09 23:16 ` Ludovic Courtès
2018-02-10 10:17 ` Ricardo Wurmus
2018-02-10 11:07 ` Ludovic Courtès
2018-02-15 0:06 ` Danny Milosavljevic
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/guix.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.