From: ludo@gnu.org (Ludovic Courtès)
To: 30365@debbugs.gnu.org
Subject: bug#30365: Offloading sometimes hangs
Date: Wed, 07 Feb 2018 14:42:22 +0100 [thread overview]
Message-ID: <87o9l0zzk1.fsf@gnu.org> (raw)
In-Reply-To: <877erq8med.fsf@gnu.org> ("Ludovic \=\?utf-8\?Q\?Court\=C3\=A8s\=22'\?\= \=\?utf-8\?Q\?s\?\= message of "Tue, 06 Feb 2018 11:04:10 +0100")
ludo@gnu.org (Ludovic Courtès) skribis:
> On the build machine side, the guile process that forwards data between
> the sshd and guix-daemon¹ is stuck on:
>
> read(0, …)
> ¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102
I’ve been able to strace this Guile process. When “things work
normally”, its loop does things like this:
--8<---------------cut here---------------start------------->8---
20456 10:52:17 read(9, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 65536) = 152
20456 10:52:17 write(1, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering directory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 152) = 152
20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...>
20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0
20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:17 read(9, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 65536) = 168
20456 10:52:17 write(1, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 168) = 168
20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...>
20456 10:52:17 <... rt_sigprocmask resumed> [], 8) = 0
20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:17 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:17 read(9, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 65536) = 216
20456 10:52:17 write(1, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0", 216) = 216
--8<---------------cut here---------------end--------------->8---
Here FD 9 is the guix-daemon socket and 1 is stdout, which is connected
to the remote ‘guix offload’ process.
We then see a slightly different sequence leading to the read(0, …) call
that hangs:
--8<---------------cut here---------------start------------->8---
20456 10:52:18 pselect6(10, [0 3 9], [], [], NULL, NULL <unfinished ...>
20456 10:52:18 <... pselect6 resumed> ) = 1 (in [9])
20456 10:52:18 read(9, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' succeeded after 0.0 seconds\n\0\0\0", 65536) = 72
20456 10:52:18 write(1, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpath' succeeded after 0.0 seconds\n\0\0\0", 72) = 72
20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=261868144}) = 0
20456 10:52:18 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
20456 10:52:18 madvise(0xffffb0025000, 4096, MADV_DONTNEED <unfinished ...>
20456 10:52:18 <... madvise resumed> ) = 0
20456 10:52:18 write(6, "\0", 1 <unfinished ...>
20456 10:52:18 <... write resumed> ) = 1
20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, <unfinished ...>
20456 10:52:18 <... clock_gettime resumed> {tv_sec=0, tv_nsec=268384968}) = 0
20456 10:52:18 read(0, <unfinished ...>
--8<---------------cut here---------------end--------------->8---
Notice that there’s no pselect showing up here!
The two clock_gettime correspond to ‘get_internal_run_time’ in libguile,
and ‘madvise’ corresponds to ‘return_unused_stack_to_os’; IOW, this part
corresponds to garbage collection.
So what we have here is that the Scheme procedure ‘select’ returned
stdin as “ready for reading”. How did that happen? I believe this is
due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
returns 1, so ‘select’ returns EINTR but it does so without clearing the
FD sets.
Ludo’.
next prev parent reply other threads:[~2018-02-07 13:43 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-06 10:04 bug#30365: Offloading sometimes hangs Ludovic Courtès
2018-02-07 13:42 ` Ludovic Courtès [this message]
2018-02-07 20:54 ` Ludovic Courtès
2018-02-09 23:16 ` Ludovic Courtès
2018-02-10 10:17 ` Ricardo Wurmus
2018-02-10 11:07 ` Ludovic Courtès
2018-02-15 0:06 ` Danny Milosavljevic
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87o9l0zzk1.fsf@gnu.org \
--to=ludo@gnu.org \
--cc=30365@debbugs.gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/guix.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.