* bug#30365: Offloading sometimes hangs @ 2018-02-06 10:04 Ludovic Courtès 2018-02-07 13:42 ` Ludovic Courtès 0 siblings, 1 reply; 7+ messages in thread From: Ludovic Courtès @ 2018-02-06 10:04 UTC (permalink / raw) To: 30365 Hi, On berlin.guixsd.org, offloading would sometimes hang in the middle of an offloaded build: no more build log output showing up, nothing happening (this is with guix-0.14.0-6.0dcf675). On the build machine side, the guile process that forwards data between the sshd and guix-daemon¹ is stuck on: read(0, …) with this stack trace: --8<---------------cut here---------------start------------->8--- (gdb) bt #0 0x00007f09d6068aed in read () from /gnu/store/3h31zsqxjjg52da5gp3qmhkh4x8klhah-glibc-2.25/lib/libpthread.so.0 #1 0x00007f09d653fc47 in fport_read () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1 #2 0x00007f09d656cd77 in scm_i_read_bytes () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1 #3 0x00007f09d65705fe in scm_fill_input () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1 #4 0x00007f09d6577897 in scm_get_bytevector_some () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1 #5 0x00007f09d65abc4d in vm_regular_engine () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1 #6 0x00007f09d65af2aa in scm_call_n () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1 #7 0x00007f09d65338d7 in scm_primitive_eval () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1 --8<---------------cut here---------------end--------------->8--- In theory this “cannot happen” because it reads from stdin iff ‘select’ said stdin is ready. On the server side (on berlin itself), the corresponding ‘guix offload’ process is stuck here: --8<---------------cut here---------------start------------->8--- (gdb) bt #0 0x00007ff49b3590bd in poll () from target:/gnu/store/3h31zsqxjjg52da5gp3qmhkh4x8klhah-glibc-2.25/lib/libc.so.6 #1 0x00007ff48f4db377 in ssh_poll_ctx_dopoll () from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4 #2 0x00007ff48f4dc319 in ssh_handle_packets () from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4 #3 0x00007ff48f4dc3ed in ssh_handle_packets_termination () from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4 #4 0x00007ff48f4c8eff in ssh_channel_read_timeout () from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4 #5 0x00007ff48f930803 in read_from_channel_port () from target:/gnu/store/xfaqdvk060yz7ddc9isk3wkybqmcfj3w-guile-ssh-0.11.2/lib/libguile-ssh.so.11 #6 0x00007ff49cea7d77 in scm_i_read_bytes () from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1 #7 0x00007ff49ceac3fc in scm_c_read_bytes () from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1 #8 0x00007ff49ceb2838 in scm_get_bytevector_n () from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1 #9 0x00007ff49cee6c4d in vm_regular_engine () from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1 #10 0x00007ff49ceea2aa in scm_call_n () from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1 #11 0x00007ff49ce6e8d7 in scm_primitive_eval () --8<---------------cut here---------------end--------------->8--- Presumably the ‘scm_get_bytevector_n’ call comes from (guix serialization) or ‘process-stderr’. IOW we have a deadlock where both sides are waiting for input data. Ludo’. ¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102 ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#30365: Offloading sometimes hangs 2018-02-06 10:04 bug#30365: Offloading sometimes hangs Ludovic Courtès @ 2018-02-07 13:42 ` Ludovic Courtès 2018-02-07 20:54 ` Ludovic Courtès 2018-02-09 23:16 ` Ludovic Courtès 0 siblings, 2 replies; 7+ messages in thread From: Ludovic Courtès @ 2018-02-07 13:42 UTC (permalink / raw) To: 30365 ludo@gnu.org (Ludovic Courtès) skribis: > On the build machine side, the guile process that forwards data between > the sshd and guix-daemon¹ is stuck on: > > read(0, …) > ¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102 I’ve been able to strace this Guile process. When “things work normally”, its loop does things like this: --8<---------------cut here---------------start------------->8--- 20456 10:52:17 read(9, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 65536) = 152 20456 10:52:17 write(1, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 152) = 152 20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...> 20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0 20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...> 20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9]) 20456 10:52:17 read(9, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 65536) = 168 20456 10:52:17 write(1, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 168) = 168 20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...> 20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0 20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...> 20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9]) 20456 10:52:17 read(9, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 65536) = 216 20456 10:52:17 write(1, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 216) = 216 --8<---------------cut here---------------end--------------->8--- Here FD 9 is the guix-daemon socket and 1 is stdout, which is connected to the remote ‘guix offload’ process. We then see a slightly different sequence leading to the read(0, …) call that hangs: --8<---------------cut here---------------start------------->8--- 20456 10:52:18 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...> 20456 10:52:18 <... pselect6 resumed> ) = 1 (in [9]) 20456 10:52:18 read(9, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' succeeded after 0.0 seconds\n\0\0\0", 65536) = 72 20456 10:52:18 write(1, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' succeeded after 0.0 seconds\n\0\0\0", 72) = 72 20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=261868144}) = 0 20456 10:52:18 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 20456 10:52:18 madvise(0xffffb0025000, 4096, MADV_DONTNEED <unfinished ...> 20456 10:52:18 <... madvise resumed> ) = 0 20456 10:52:18 write(6, "\0", 1 <unfinished ...> 20456 10:52:18 <... write resumed> ) = 1 20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, <unfinished ...> 20456 10:52:18 <... clock_gettime resumed> {tv_sec=0, tv_nsec=268384968}) = 0 20456 10:52:18 read(0, <unfinished ...> --8<---------------cut here---------------end--------------->8--- Notice that there’s no pselect showing up here! The two clock_gettime correspond to ‘get_internal_run_time’ in libguile, and ‘madvise’ corresponds to ‘return_unused_stack_to_os’; IOW, this part corresponds to garbage collection. So what we have here is that the Scheme procedure ‘select’ returned stdin as “ready for reading”. How did that happen? I believe this is due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’ returns 1, so ‘select’ returns EINTR but it does so without clearing the FD sets. Ludo’. ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#30365: Offloading sometimes hangs 2018-02-07 13:42 ` Ludovic Courtès @ 2018-02-07 20:54 ` Ludovic Courtès 2018-02-09 23:16 ` Ludovic Courtès 1 sibling, 0 replies; 7+ messages in thread From: Ludovic Courtès @ 2018-02-07 20:54 UTC (permalink / raw) To: 30365 [-- Attachment #1: Type: text/plain, Size: 664 bytes --] ludo@gnu.org (Ludovic Courtès) skribis: > ludo@gnu.org (Ludovic Courtès) skribis: > >> On the build machine side, the guile process that forwards data between >> the sshd and guix-daemon¹ is stuck on: >> >> read(0, …) > >> ¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102 > > I’ve been able to strace this Guile process. For reference, one way to test this without going through offloading is by: 1. storing the ‘redirect’ code in a file, say redirect.scm; 2. talk to the daemon through a pipe running redirect.scm as with the attached patch. Ludo’. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: Type: text/x-patch, Size: 1324 bytes --] diff --git a/guix/store.scm b/guix/store.scm index 6742611c6..b38e5bc23 100644 --- a/guix/store.scm +++ b/guix/store.scm @@ -473,6 +473,7 @@ name." (connect uri)) +(use-modules (ice-9 popen)) (define* (open-connection #:optional (uri (%daemon-socket-uri)) #:key port (reserve-space? #t) cpu-affinity) "Connect to the daemon at URI (a string), or, if PORT is not #f, use it as @@ -491,7 +492,14 @@ for this connection will be pinned. Return a server object." (errno EPROTO)) (&message (message "build daemon handshake failed")))))) (let*-values (((port) - (or port (connect-to-daemon uri))) + (or port + (with-fluids ((%default-port-encoding "ISO-8859-1")) + (open-pipe* OPEN_BOTH;; "strace" "-t" "-o" + ;; "/home/ludo/redirect.log" + ;; "-s" "1024" + "guile" "/home/ludo/src/guix/redirect.scm")) + ;; (connect-to-daemon uri) + )) ((output flush) (buffering-output-port port (make-bytevector 8192)))) ^ permalink raw reply related [flat|nested] 7+ messages in thread
* bug#30365: Offloading sometimes hangs 2018-02-07 13:42 ` Ludovic Courtès 2018-02-07 20:54 ` Ludovic Courtès @ 2018-02-09 23:16 ` Ludovic Courtès 2018-02-10 10:17 ` Ricardo Wurmus 2018-02-15 0:06 ` Danny Milosavljevic 1 sibling, 2 replies; 7+ messages in thread From: Ludovic Courtès @ 2018-02-09 23:16 UTC (permalink / raw) To: 30365-done Hello! ludo@gnu.org (Ludovic Courtès) skribis: > So what we have here is that the Scheme procedure ‘select’ returned > stdin as “ready for reading”. How did that happen? I believe this is > due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’ > returns 1, so ‘select’ returns EINTR but it does so without clearing the > FD sets. I’ve pushed a workaround here: https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905 and guix-0.14.0-8.bc880f9 includes that fix. It’s been running for several hours on berlin, building a bunch of things notably on aarch64, and it seems to work well! On the Guix side this can be considered fixed, I think. On the Guile side, we’ll have to push the fix for <https://bugs.gnu.org/30368>. Ludo’. ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#30365: Offloading sometimes hangs 2018-02-09 23:16 ` Ludovic Courtès @ 2018-02-10 10:17 ` Ricardo Wurmus 2018-02-10 11:07 ` Ludovic Courtès 2018-02-15 0:06 ` Danny Milosavljevic 1 sibling, 1 reply; 7+ messages in thread From: Ricardo Wurmus @ 2018-02-10 10:17 UTC (permalink / raw) To: Ludovic Courtès; +Cc: 30365-done Hi Ludo, > ludo@gnu.org (Ludovic Courtès) skribis: > >> So what we have here is that the Scheme procedure ‘select’ returned >> stdin as “ready for reading”. How did that happen? I believe this is >> due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’ >> returns 1, so ‘select’ returns EINTR but it does so without clearing the >> FD sets. > > I’ve pushed a workaround here: > > https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905 > > and guix-0.14.0-8.bc880f9 includes that fix. > > It’s been running for several hours on berlin, building a bunch of > things notably on aarch64, and it seems to work well! Congratulations on figuring this out! -- Ricardo GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC https://elephly.net ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#30365: Offloading sometimes hangs 2018-02-10 10:17 ` Ricardo Wurmus @ 2018-02-10 11:07 ` Ludovic Courtès 0 siblings, 0 replies; 7+ messages in thread From: Ludovic Courtès @ 2018-02-10 11:07 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: 30365 Hello, Ricardo Wurmus <rekado@elephly.net> skribis: >> ludo@gnu.org (Ludovic Courtès) skribis: >> >>> So what we have here is that the Scheme procedure ‘select’ returned >>> stdin as “ready for reading”. How did that happen? I believe this is >>> due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’ >>> returns 1, so ‘select’ returns EINTR but it does so without clearing the >>> FD sets. >> >> I’ve pushed a workaround here: >> >> https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905 >> >> and guix-0.14.0-8.bc880f9 includes that fix. >> >> It’s been running for several hours on berlin, building a bunch of >> things notably on aarch64, and it seems to work well! > > Congratulations on figuring this out! Well it did show up again during the night. :-/ So the problem appears less frequently it seems, but it still appears. A related issue is that ‘guix offload’ doesn’t time out in this case but it probably should. Ludo’. ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#30365: Offloading sometimes hangs 2018-02-09 23:16 ` Ludovic Courtès 2018-02-10 10:17 ` Ricardo Wurmus @ 2018-02-15 0:06 ` Danny Milosavljevic 1 sibling, 0 replies; 7+ messages in thread From: Danny Milosavljevic @ 2018-02-15 0:06 UTC (permalink / raw) To: Ludovic Courtès; +Cc: 30365-done Hi Ludo, > On the Guile side, we’ll have to push the fix for > <https://bugs.gnu.org/30368>. Can't think of a way to test select. But can think of a way to test pselect: Excuse the C: #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/select.h> #include <signal.h> static void dummy(int signum) { } int main() { int fds[2]; struct sigaction sa = { .sa_handler = dummy, .sa_flags = 0, }; sigemptyset(&sa.sa_mask); pipe(fds); /* Make sure we don't miss it */ sigsetmask(sigmask(SIGCHLD)); /* Make sure we handle it once we DO get it (after unblocking it :P) */ sigaction(SIGCHLD, &sa, NULL); if (fork() == 0) { _exit(0); /* Cause SIGCHLD (to be at least pending) in the parent. */ } else { close(fds[0]); fd_set read_set; fd_set write_set; fd_set except_set; sigset_t sigmask; int rc; FD_ZERO(&read_set); FD_ZERO(&write_set); FD_SET(fds[1], &write_set); FD_ZERO(&except_set); sigemptyset(&sigmask); /* unblock SIGCHLD */ rc = pselect(fds[0] + 1, &read_set, &write_set, &except_set, NULL, &sigmask); if (rc == -1) { perror("select"); } else if (rc == 0) { printf("ZERO\n"); } else { printf("N\n"); } } return 0; } ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2018-02-15 0:07 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-02-06 10:04 bug#30365: Offloading sometimes hangs Ludovic Courtès 2018-02-07 13:42 ` Ludovic Courtès 2018-02-07 20:54 ` Ludovic Courtès 2018-02-09 23:16 ` Ludovic Courtès 2018-02-10 10:17 ` Ricardo Wurmus 2018-02-10 11:07 ` Ludovic Courtès 2018-02-15 0:06 ` Danny Milosavljevic
Code repositories for project(s) associated with this external index https://git.savannah.gnu.org/cgit/guix.git This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.