From mboxrd@z Thu Jan 1 00:00:00 1970 From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) Subject: bug#30365: Offloading sometimes hangs Date: Wed, 07 Feb 2018 14:42:22 +0100 Message-ID: <87o9l0zzk1.fsf@gnu.org> References: <877erq8med.fsf@gnu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Return-path: Received: from eggs.gnu.org ([2001:4830:134:3::10]:46985) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ejQ0A-0000ip-Qi for bug-guix@gnu.org; Wed, 07 Feb 2018 08:43:08 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ejQ06-0000BE-QD for bug-guix@gnu.org; Wed, 07 Feb 2018 08:43:06 -0500 Received: from debbugs.gnu.org ([208.118.235.43]:53070) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1ejQ06-0000B4-Lk for bug-guix@gnu.org; Wed, 07 Feb 2018 08:43:02 -0500 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1ejQ06-00087Z-8s for bug-guix@gnu.org; Wed, 07 Feb 2018 08:43:02 -0500 Sender: "Debbugs-submit" Resent-Message-ID: In-Reply-To: <877erq8med.fsf@gnu.org> ("Ludovic \=\?utf-8\?Q\?Court\=C3\=A8s\=22'\?\= \=\?utf-8\?Q\?s\?\= message of "Tue, 06 Feb 2018 11:04:10 +0100") List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org Sender: "bug-Guix" To: 30365@debbugs.gnu.org ludo@gnu.org (Ludovic Court=C3=A8s) skribis: > On the build machine side, the guile process that forwards data between > the sshd and guix-daemon=C2=B9 is stuck on: > > read(0, =E2=80=A6) > =C2=B9 https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=3D= 0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102 I=E2=80=99ve been able to strace this Guile process. When =E2=80=9Cthings = work normally=E2=80=9D, its loop does things like this: --8<---------------cut here---------------start------------->8--- 20456 10:52:17 read(9, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering dir= ectory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing = to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 65536) =3D 152 20456 10:52:17 write(1, "gmlo\0\0\0\0\201\0\0\0\0\0\0\0make[2]: Entering di= rectory '/tmp/guix-build-make-4.2.1.drv-0/make-4.2.1/doc'\nmake[2]: Nothing= to be done for 'install-exec-am'.\n\0\0\0\0\0\0\0", 152) =3D 152 20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL, 20456 10:52:17 <... rt_sigprocmask resumed> [], 8) =3D 0 20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL 20456 10:52:17 <... pselect6 resumed> ) =3D 1 (in [9]) 20456 10:52:17 read(9, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6vy= kdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7fl= k1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 65536) =3D= 168 20456 10:52:17 write(1, "gmlo\0\0\0\0\223\0\0\0\0\0\0\0 /gnu/store/m82nil6v= ykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/mkdir -p '/gnu/store/az7f= lk1jg6aaz1g9s7gpgpy0xc0igg7v-make-4.2.1/share/info'\n\0\0\0\0\0", 168) =3D = 168 20456 10:52:17 rt_sigprocmask(SIG_BLOCK, NULL, 20456 10:52:17 <... rt_sigprocmask resumed> [], 8) =3D 0 20456 10:52:17 pselect6(10, [0 3 9], [], [], NULL, NULL 20456 10:52:17 <... pselect6 resumed> ) =3D 1 (in [9]) 20456 10:52:17 read(9, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6vy= kdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.i= nfo ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg7= v-make-4.2.1/share/info'\n\0\0\0\0", 65536) =3D 216 20456 10:52:17 write(1, "gmlo\0\0\0\0\304\0\0\0\0\0\0\0 /gnu/store/m82nil6v= ykdywv9wmpn2k0rnq5zkw61z-bootstrap-binaries-0/bin/install -c -m 644 ./make.= info ./make.info-1 ./make.info-2 '/gnu/store/az7flk1jg6aaz1g9s7gpgpy0xc0igg= 7v-make-4.2.1/share/info'\n\0\0\0\0", 216) =3D 216 --8<---------------cut here---------------end--------------->8--- Here FD 9 is the guix-daemon socket and 1 is stdout, which is connected to the remote =E2=80=98guix offload=E2=80=99 process. We then see a slightly different sequence leading to the read(0, =E2=80=A6)= call that hangs: --8<---------------cut here---------------start------------->8--- 20456 10:52:18 pselect6(10, [0 3 9], [], [], NULL, NULL 20456 10:52:18 <... pselect6 resumed> ) =3D 1 (in [9]) 20456 10:52:18 read(9, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpat= h' succeeded after 0.0 seconds\n\0\0\0", 65536) =3D 72 20456 10:52:18 write(1, "gmlo\0\0\0\0005\0\0\0\0\0\0\0phase `validate-runpa= th' succeeded after 0.0 seconds\n\0\0\0", 72) =3D 72 20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=3D0, tv_nsec= =3D261868144}) =3D 0 20456 10:52:18 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) =3D 0 20456 10:52:18 madvise(0xffffb0025000, 4096, MADV_DONTNEED 20456 10:52:18 <... madvise resumed> ) =3D 0 20456 10:52:18 write(6, "\0", 1 20456 10:52:18 <... write resumed> ) =3D 1 20456 10:52:18 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, 20456 10:52:18 <... clock_gettime resumed> {tv_sec=3D0, tv_nsec=3D268384968= }) =3D 0 20456 10:52:18 read(0, --8<---------------cut here---------------end--------------->8--- Notice that there=E2=80=99s no pselect showing up here! The two clock_gettime correspond to =E2=80=98get_internal_run_time=E2=80=99= in libguile, and =E2=80=98madvise=E2=80=99 corresponds to =E2=80=98return_unused_stack_t= o_os=E2=80=99; IOW, this part corresponds to garbage collection. So what we have here is that the Scheme procedure =E2=80=98select=E2=80=99 = returned stdin as =E2=80=9Cready for reading=E2=80=9D. How did that happen? I beli= eve this is due to : =E2=80=98scm_i_prepare_to_wait_on_fd= =E2=80=99 returns 1, so =E2=80=98select=E2=80=99 returns EINTR but it does so without= clearing the FD sets. Ludo=E2=80=99.