unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
* bug#30365: Offloading sometimes hangs
@ 2018-02-06 10:04 Ludovic Courtès
  2018-02-07 13:42 ` Ludovic Courtès
  0 siblings, 1 reply; 7+ messages in thread
From: Ludovic Courtès @ 2018-02-06 10:04 UTC (permalink / raw)
  To: 30365

Hi,

On berlin.guixsd.org, offloading would sometimes hang in the middle of
an offloaded build: no more build log output showing up, nothing
happening (this is with guix-0.14.0-6.0dcf675).

On the build machine side, the guile process that forwards data between
the sshd and guix-daemon¹ is stuck on:

  read(0, …)

with this stack trace:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007f09d6068aed in read () from /gnu/store/3h31zsqxjjg52da5gp3qmhkh4x8klhah-glibc-2.25/lib/libpthread.so.0
#1  0x00007f09d653fc47 in fport_read ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#2  0x00007f09d656cd77 in scm_i_read_bytes ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#3  0x00007f09d65705fe in scm_fill_input ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#4  0x00007f09d6577897 in scm_get_bytevector_some ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#5  0x00007f09d65abc4d in vm_regular_engine ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#6  0x00007f09d65af2aa in scm_call_n ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#7  0x00007f09d65338d7 in scm_primitive_eval ()
   from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
--8<---------------cut here---------------end--------------->8---

In theory this “cannot happen” because it reads from stdin iff ‘select’
said stdin is ready.

On the server side (on berlin itself), the corresponding ‘guix offload’
process is stuck here:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007ff49b3590bd in poll () from target:/gnu/store/3h31zsqxjjg52da5gp3qmhkh4x8klhah-glibc-2.25/lib/libc.so.6
#1  0x00007ff48f4db377 in ssh_poll_ctx_dopoll ()
   from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#2  0x00007ff48f4dc319 in ssh_handle_packets ()
   from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#3  0x00007ff48f4dc3ed in ssh_handle_packets_termination ()
   from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#4  0x00007ff48f4c8eff in ssh_channel_read_timeout ()
   from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#5  0x00007ff48f930803 in read_from_channel_port ()
   from target:/gnu/store/xfaqdvk060yz7ddc9isk3wkybqmcfj3w-guile-ssh-0.11.2/lib/libguile-ssh.so.11
#6  0x00007ff49cea7d77 in scm_i_read_bytes ()
   from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#7  0x00007ff49ceac3fc in scm_c_read_bytes ()
   from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#8  0x00007ff49ceb2838 in scm_get_bytevector_n ()
   from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#9  0x00007ff49cee6c4d in vm_regular_engine ()
   from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#10 0x00007ff49ceea2aa in scm_call_n ()
   from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#11 0x00007ff49ce6e8d7 in scm_primitive_eval ()
--8<---------------cut here---------------end--------------->8---

Presumably the ‘scm_get_bytevector_n’ call comes from (guix
serialization) or ‘process-stderr’.

IOW we have a deadlock where both sides are waiting for input data.

Ludo’.

¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102

^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#30365: Offloading sometimes hangs
  2018-02-06 10:04 bug#30365: Offloading sometimes hangs Ludovic Courtès
@ 2018-02-07 13:42 ` Ludovic Courtès
  2018-02-07 20:54   ` Ludovic Courtès
  2018-02-09 23:16   ` Ludovic Courtès
  0 siblings, 2 replies; 7+ messages in thread
From: Ludovic Courtès @ 2018-02-07 13:42 UTC (permalink / raw)
  To: 30365

ludo@gnu.org (Ludovic Courtès) skribis:

> On the build machine side, the guile process that forwards data between
> the sshd and guix-daemon¹ is stuck on:
>
>   read(0, …)

> ¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102

I’ve been able to strace this Guile process.  When “things work
normally”, its loop does things like this:

--8<---------------cut here---------------start------------->8---
20456 10:52:17 read(9, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 65536) = 152
20456 10:52:17 write(1, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 152) = 152
20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL,  <unfinished ...>
20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0
20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:17 read(9, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 65536) = 168
20456 10:52:17 write(1, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 168) = 168
20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL,  <unfinished ...>
20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0
20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:17 read(9, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 65536) = 216
20456 10:52:17 write(1, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 216) = 216
--8<---------------cut here---------------end--------------->8---

Here FD 9 is the guix-daemon socket and 1 is stdout, which is connected
to the remote ‘guix offload’ process.

We then see a slightly different sequence leading to the read(0, …) call
that hangs:

--8<---------------cut here---------------start------------->8---
20456 10:52:18 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:18 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:18 read(9, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' succeeded after 0.0 seconds\n\0\0\0", 65536) = 72
20456 10:52:18 write(1, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' succeeded after 0.0 seconds\n\0\0\0", 72) = 72
20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=261868144}) = 0
20456 10:52:18 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
20456 10:52:18 madvise(0xffffb0025000, 4096, MADV_DONTNEED <unfinished ...>
20456 10:52:18 <... madvise resumed> )  = 0
20456 10:52:18 write(6, "\0", 1 <unfinished ...>
20456 10:52:18 <... write resumed> )    = 1
20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID,  <unfinished ...>
20456 10:52:18 <... clock_gettime resumed> {tv_sec=0, tv_nsec=268384968}) = 0
20456 10:52:18 read(0,  <unfinished ...>
--8<---------------cut here---------------end--------------->8---

Notice that there’s no pselect showing up here!

The two clock_gettime correspond to ‘get_internal_run_time’ in libguile,
and ‘madvise’ corresponds to ‘return_unused_stack_to_os’; IOW, this part
corresponds to garbage collection.

So what we have here is that the Scheme procedure ‘select’ returned
stdin as “ready for reading”.  How did that happen?  I believe this is
due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
returns 1, so ‘select’ returns EINTR but it does so without clearing the
FD sets.

Ludo’.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#30365: Offloading sometimes hangs
  2018-02-07 13:42 ` Ludovic Courtès
@ 2018-02-07 20:54   ` Ludovic Courtès
  2018-02-09 23:16   ` Ludovic Courtès
  1 sibling, 0 replies; 7+ messages in thread
From: Ludovic Courtès @ 2018-02-07 20:54 UTC (permalink / raw)
  To: 30365

[-- Attachment #1: Type: text/plain, Size: 664 bytes --]

ludo@gnu.org (Ludovic Courtès) skribis:

> ludo@gnu.org (Ludovic Courtès) skribis:
>
>> On the build machine side, the guile process that forwards data between
>> the sshd and guix-daemon¹ is stuck on:
>>
>>   read(0, …)
>
>> ¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102
>
> I’ve been able to strace this Guile process.

For reference, one way to test this without going through offloading is by:

  1. storing the ‘redirect’ code in a file, say redirect.scm;
  2. talk to the daemon through a pipe running redirect.scm as with the
     attached patch.

Ludo’.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-patch, Size: 1324 bytes --]

diff --git a/guix/store.scm b/guix/store.scm
index 6742611c6..b38e5bc23 100644
--- a/guix/store.scm
+++ b/guix/store.scm
@@ -473,6 +473,7 @@ name."
 
   (connect uri))
 
+(use-modules (ice-9 popen))
 (define* (open-connection #:optional (uri (%daemon-socket-uri))
                           #:key port (reserve-space? #t) cpu-affinity)
   "Connect to the daemon at URI (a string), or, if PORT is not #f, use it as
@@ -491,7 +492,14 @@ for this connection will be pinned.  Return a server object."
                                             (errno EPROTO))
                      (&message (message "build daemon handshake failed"))))))
     (let*-values (((port)
-                   (or port (connect-to-daemon uri)))
+                   (or port
+                       (with-fluids ((%default-port-encoding "ISO-8859-1"))
+                        (open-pipe* OPEN_BOTH;;  "strace" "-t" "-o"
+                                    ;; "/home/ludo/redirect.log"
+                                    ;; "-s" "1024"
+                                    "guile" "/home/ludo/src/guix/redirect.scm"))
+                       ;; (connect-to-daemon uri)
+                       ))
                   ((output flush)
                    (buffering-output-port port
                                           (make-bytevector 8192))))

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* bug#30365: Offloading sometimes hangs
  2018-02-07 13:42 ` Ludovic Courtès
  2018-02-07 20:54   ` Ludovic Courtès
@ 2018-02-09 23:16   ` Ludovic Courtès
  2018-02-10 10:17     ` Ricardo Wurmus
  2018-02-15  0:06     ` Danny Milosavljevic
  1 sibling, 2 replies; 7+ messages in thread
From: Ludovic Courtès @ 2018-02-09 23:16 UTC (permalink / raw)
  To: 30365-done

Hello!

ludo@gnu.org (Ludovic Courtès) skribis:

> So what we have here is that the Scheme procedure ‘select’ returned
> stdin as “ready for reading”.  How did that happen?  I believe this is
> due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
> returns 1, so ‘select’ returns EINTR but it does so without clearing the
> FD sets.

I’ve pushed a workaround here:

  https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905

and guix-0.14.0-8.bc880f9 includes that fix.

It’s been running for several hours on berlin, building a bunch of
things notably on aarch64, and it seems to work well!

On the Guix side this can be considered fixed, I think.

On the Guile side, we’ll have to push the fix for
<https://bugs.gnu.org/30368>.

Ludo’.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#30365: Offloading sometimes hangs
  2018-02-09 23:16   ` Ludovic Courtès
@ 2018-02-10 10:17     ` Ricardo Wurmus
  2018-02-10 11:07       ` Ludovic Courtès
  2018-02-15  0:06     ` Danny Milosavljevic
  1 sibling, 1 reply; 7+ messages in thread
From: Ricardo Wurmus @ 2018-02-10 10:17 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 30365-done


Hi Ludo,

> ludo@gnu.org (Ludovic Courtès) skribis:
>
>> So what we have here is that the Scheme procedure ‘select’ returned
>> stdin as “ready for reading”.  How did that happen?  I believe this is
>> due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
>> returns 1, so ‘select’ returns EINTR but it does so without clearing the
>> FD sets.
>
> I’ve pushed a workaround here:
>
>   https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905
>
> and guix-0.14.0-8.bc880f9 includes that fix.
>
> It’s been running for several hours on berlin, building a bunch of
> things notably on aarch64, and it seems to work well!

Congratulations on figuring this out!

-- 
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6  2150 197A 5888 235F ACAC
https://elephly.net

^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#30365: Offloading sometimes hangs
  2018-02-10 10:17     ` Ricardo Wurmus
@ 2018-02-10 11:07       ` Ludovic Courtès
  0 siblings, 0 replies; 7+ messages in thread
From: Ludovic Courtès @ 2018-02-10 11:07 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: 30365

Hello,

Ricardo Wurmus <rekado@elephly.net> skribis:

>> ludo@gnu.org (Ludovic Courtès) skribis:
>>
>>> So what we have here is that the Scheme procedure ‘select’ returned
>>> stdin as “ready for reading”.  How did that happen?  I believe this is
>>> due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
>>> returns 1, so ‘select’ returns EINTR but it does so without clearing the
>>> FD sets.
>>
>> I’ve pushed a workaround here:
>>
>>   https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905
>>
>> and guix-0.14.0-8.bc880f9 includes that fix.
>>
>> It’s been running for several hours on berlin, building a bunch of
>> things notably on aarch64, and it seems to work well!
>
> Congratulations on figuring this out!

Well it did show up again during the night.  :-/  So the problem appears
less frequently it seems, but it still appears.

A related issue is that ‘guix offload’ doesn’t time out in this case
but it probably should.

Ludo’.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* bug#30365: Offloading sometimes hangs
  2018-02-09 23:16   ` Ludovic Courtès
  2018-02-10 10:17     ` Ricardo Wurmus
@ 2018-02-15  0:06     ` Danny Milosavljevic
  1 sibling, 0 replies; 7+ messages in thread
From: Danny Milosavljevic @ 2018-02-15  0:06 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 30365-done

Hi Ludo,

> On the Guile side, we’ll have to push the fix for
> <https://bugs.gnu.org/30368>.

Can't think of a way to test select.  But can think of a way to test pselect:

Excuse the C:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/select.h>
#include <signal.h>

static void dummy(int signum) {
}

int main() {
        int fds[2];
        struct sigaction sa = {
                .sa_handler = dummy,
                .sa_flags = 0,
        };
        sigemptyset(&sa.sa_mask);
        pipe(fds);
	/* Make sure we don't miss it */
        sigsetmask(sigmask(SIGCHLD));
	/* Make sure we handle it once we DO get it (after unblocking it :P) */
        sigaction(SIGCHLD, &sa, NULL);
        if (fork() == 0) {
                _exit(0); /* Cause SIGCHLD (to be at least pending) in the parent. */
        } else {
                close(fds[0]);
                fd_set read_set;
                fd_set write_set;
                fd_set except_set;
                sigset_t sigmask;
                int rc;
                FD_ZERO(&read_set);
                FD_ZERO(&write_set);
                FD_SET(fds[1], &write_set);
                FD_ZERO(&except_set);
                sigemptyset(&sigmask); /* unblock SIGCHLD */
                rc = pselect(fds[0] + 1, &read_set, &write_set, &except_set, NULL, &sigmask);
                if (rc == -1) {
                        perror("select");
                } else if (rc == 0) {
                        printf("ZERO\n");
                } else {
                        printf("N\n");
                }
        }
        return 0;
}

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-02-15  0:07 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-06 10:04 bug#30365: Offloading sometimes hangs Ludovic Courtès
2018-02-07 13:42 ` Ludovic Courtès
2018-02-07 20:54   ` Ludovic Courtès
2018-02-09 23:16   ` Ludovic Courtès
2018-02-10 10:17     ` Ricardo Wurmus
2018-02-10 11:07       ` Ludovic Courtès
2018-02-15  0:06     ` Danny Milosavljevic

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).