From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mp10.migadu.com ([2001:41d0:2:bcc0::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by ms9.migadu.com with LMTPS id yEw9A2x7PWR6QQEASxT56A (envelope-from ) for ; Mon, 17 Apr 2023 19:01:32 +0200 Received: from aspmx1.migadu.com ([2001:41d0:2:bcc0::]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) by mp10.migadu.com with LMTPS id 2MteAmx7PWQwfwAAG6o9tA (envelope-from ) for ; Mon, 17 Apr 2023 19:01:32 +0200 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by aspmx1.migadu.com (Postfix) with ESMTPS id 647E0341EF for ; Mon, 17 Apr 2023 19:01:31 +0200 (CEST) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1poRrj-0000jD-8L; Mon, 17 Apr 2023 12:38:08 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1poRrf-0000ia-D4 for bug-guix@gnu.org; Mon, 17 Apr 2023 12:38:03 -0400 Received: from debbugs.gnu.org ([209.51.188.43]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1poRre-0000ld-Bz for bug-guix@gnu.org; Mon, 17 Apr 2023 12:38:03 -0400 Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1poRrd-0003RV-NL for bug-guix@gnu.org; Mon, 17 Apr 2023 12:38:01 -0400 X-Loop: help-debbugs@gnu.org Subject: bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time Resent-From: Ludovic =?UTF-8?Q?Court=C3=A8s?= Original-Sender: "Debbugs-submit" Resent-CC: bug-guix@gnu.org Resent-Date: Mon, 17 Apr 2023 16:38:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 61839 X-GNU-PR-Package: guix X-GNU-PR-Keywords: To: 61839@debbugs.gnu.org Cc: Mathieu Othacehe , Andreas Enge Received: via spool by 61839-submit@debbugs.gnu.org id=B61839.168174946713209 (code B ref 61839); Mon, 17 Apr 2023 16:38:01 +0000 Received: (at 61839) by debbugs.gnu.org; 17 Apr 2023 16:37:47 +0000 Received: from localhost ([127.0.0.1]:55923 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1poRrO-0003Qz-U0 for submit@debbugs.gnu.org; Mon, 17 Apr 2023 12:37:47 -0400 Received: from eggs.gnu.org ([209.51.188.92]:43922) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1poRrN-0003Qk-5C for 61839@debbugs.gnu.org; Mon, 17 Apr 2023 12:37:46 -0400 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1poRrH-0000k7-2a; Mon, 17 Apr 2023 12:37:39 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-Version:In-Reply-To:Date:References:Subject:To: From; bh=aJyv1vdXysdiIG2f34skQEptpu3vNlbj/J4KbAfaBwY=; b=b/0jMb6NNKHoJE9X2ERk +QTDkQDLBE6LQRE/nk+bywrxJioUKx8Cj9zkd6ncBbxf1bEVpLE/bV2Jnb6qxqvSJFNCtpQYRaEyT aI29M70gLVyLr+Vh079J1xVWsJP8jSkaG6SqJ5Lg5CH3h14/SzrbqzHlkNLLe9YiGXN9PkU0GsMCp r0NEPeXNFPAp3/sWrhpe95pSkUBN6uxg/Sx+n/z7h0iJOQccO9A3wNrWAJNddYQ9ai8a7PUUZvUvX T+fNBDNHQdqE4KeedO+prLDj2AiVArlDwMYHxhoeNK8lH9WxshDcgTDhNb8QDJ9tVnJ2twJwLjI8c YgZMmaJmtcMd5w==; Received: from [193.50.110.118] (helo=ribbon) by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1poRrG-0006xR-Kf; Mon, 17 Apr 2023 12:37:38 -0400 From: Ludovic =?UTF-8?Q?Court=C3=A8s?= References: <87a60zo0le.fsf@inria.fr> Date: Mon, 17 Apr 2023 18:37:36 +0200 In-Reply-To: <87a60zo0le.fsf@inria.fr> ("Ludovic =?UTF-8?Q?Court=C3=A8s?="'s message of "Mon, 27 Feb 2023 11:37:49 +0100") Message-ID: <87edoicvn3.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-guix@gnu.org List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-guix-bounces+larch=yhetil.org@gnu.org Sender: bug-guix-bounces+larch=yhetil.org@gnu.org X-Migadu-Flow: FLOW_IN X-Migadu-Country: US ARC-Authentication-Results: i=1; aspmx1.migadu.com; dkim=fail ("headers rsa verify failed") header.d=gnu.org header.s=fencepost-gnu-org header.b="b/0jMb6N"; dmarc=pass (policy=none) header.from=gnu.org; spf=pass (aspmx1.migadu.com: domain of "bug-guix-bounces+larch=yhetil.org@gnu.org" designates 209.51.188.17 as permitted sender) smtp.mailfrom="bug-guix-bounces+larch=yhetil.org@gnu.org" ARC-Seal: i=1; s=key1; d=yhetil.org; t=1681750891; a=rsa-sha256; cv=none; b=c7HP+41EYG0e+6gnsrszTYHZov9Zt73aXSXZL4CH5hgRyxaPAkmSOVrv+F2L9zUtmdanhS V7PQb7X+qBV7GuV96DPUf5pArsxbzW4Nn5xwMJBUeNbtO3n5p+31sbS5ZXS+xLicN9afEl Y5iKumu9+3meaPMTHDXMUCzv6VuCltSK7Z29MpU4C2wxgLqHy0P6hp5jlPZvkSPzy+f44U 78dFm8/q0gW9ELJHH+xH5Zt2fiXUizej8TCj0rPSFv4mXz5PqXThfx6y6/CYVE1PLWZl3W yH4MaDnlXK376dbuoDgfkYb/mcirVfsXe3jLN2Qo9Ug8s97vSyTC6W170ACGGg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=yhetil.org; s=key1; t=1681750891; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding:resent-cc: resent-from:resent-sender:resent-message-id:in-reply-to:in-reply-to: references:references:list-id:list-help:list-unsubscribe: list-subscribe:list-post:dkim-signature; bh=aJyv1vdXysdiIG2f34skQEptpu3vNlbj/J4KbAfaBwY=; b=nfR2SIqqdGMrEzyTVXK6BAmthDlrljwwlsXM4q8rC3BsUX711uZZHpZR05makSWUfjRUXp Q3NItzec2rGO+TG7RR4cwkSj8Y1KPGaLVJiuuAaJ8IE7EY0+tsKDHYdxuP1i2+2Q0SdYNd ERtUKOVNGK0pBagotgbz7q+f9K8JAblMQ+HLPSNBFUTw85k8LaMcwv176jrOBIcY+s4uJX l0PpMeyyLLaFK+9u3YfLF+yreXv4OPZ+11LyP16FQqLDe7yK0Vzlcs1TJGMhstaoQZsASj dKWQ3miqg6Wzl/glwS6FXcPxzZGmpWAE5k6pVIQ4WtNein+F4Rh02m9qHamIQg== X-Migadu-Spam-Score: -5.21 X-Spam-Score: -5.21 X-Migadu-Queue-Id: 647E0341EF Authentication-Results: aspmx1.migadu.com; dkim=fail ("headers rsa verify failed") header.d=gnu.org header.s=fencepost-gnu-org header.b="b/0jMb6N"; dmarc=pass (policy=none) header.from=gnu.org; spf=pass (aspmx1.migadu.com: domain of "bug-guix-bounces+larch=yhetil.org@gnu.org" designates 209.51.188.17 as permitted sender) smtp.mailfrom="bug-guix-bounces+larch=yhetil.org@gnu.org" X-Migadu-Scanner: scn0.migadu.com X-TUID: XuhIxwImvvoX Hi! Ludovic Court=C3=A8s skribis: > We often get evaluation errors for =E2=80=98core-updates=E2=80=99, where = the build log > reveals that this is due to a build timeout while building the Guix > instance that will be used for the evaluation. Example: [...] > exporting path `/gnu/store/m2hyx7qk2j1bsvpj1qv89701klv8kqkk-guix-command' > exporting path `/gnu/store/vlm7401d5jvna6j7qdxhcr581cd8g8vh-guix-daemon-1= .4.0-4.01fd830' > exporting path `/gnu/store/zv3c0anfj085d2lyxh4g4aahhczmcaj8-guix-daemon' > exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd0243= 3' > exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-scri= pt.scm' > building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed= out after 3600 seconds of silence With help from Andreas, I was able to reproduce it on berlin by running: guix build /gnu/store/b22580wj63wk4a3i09n22wa60y36kaa5-openjdk-16.0.2.drv Timeout happens while =E2=80=98guix offload=E2=80=99 is sending build requi= sites to the target machine. At that point, the client =E2=80=98guix offload=E2=80=99 p= rocess is stuck like so: --8<---------------cut here---------------start------------->8--- (gdb) bt #0 0x00007f1fd7ad7d6f in __GI___poll (fds=3D0x766b00, nfds=3D1, timeout=3D= 3594784) at ../sysdeps/unix/sysv/linux/poll.c:29 #1 0x00007f1fcc220de7 in ssh_poll_ctx_dopoll () from target:/gnu/store/f3h= n75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4 #2 0x00007f1fcc221f0c in ssh_handle_packets () from target:/gnu/store/f3hn= 75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4 #3 0x00007f1fcc221fdb in ssh_handle_packets_termination () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib= /libssh.so.4 #4 0x00007f1fcc2034bb in channel_write_common () from target:/gnu/store/f3= hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4 #5 0x00007f1fcc28b15b in ?? () from target:/gnu/store/rjhq6cz7xbck4l7zar6g= a3ygph2chwzr-guile-ssh-0.16.0/lib/libguile-ssh.so #6 0x00007f1fd7cf4f34 in scm_i_write_bytes (port=3D0x7f1fc556ba00, src=3D0= x7f1fc3c27000, start=3D0, count=3D65536) at ports.c:2858 #7 0x00007f1fd7cff546 in scm_put_bytevector (port=3D0x7f1fc556ba00, bv=3D0= x7f1fc3c27000, start=3D, count=3D) at r6rs-ports.c:676 #8 0x00007f1fc5ac5197 in ?? () #9 0x000000000069731c in ?? () #10 0x00007f1fc5acdf71 in ?? () #11 0x0000000000697318 in ?? () #12 0x00007f1fd7cd2ccc in scm_jit_enter_mcode (thread=3D0x7f1fd73b1d80, mco= de=3D0x69bf14 "\034;\004") at jit.c:6038 #13 0x00007f1fd7d2874f in vm_regular_engine (thread=3D0x7f1fd73b1d80) at vm= -engine.c:546 #14 0x00007f1fd7d355d9 in scm_call_n (proc=3D, argv=3D, nargs=3D1) at vm.c:1610 #15 0x00007f1fd7c9d457 in scm_primitive_eval (exp=3D, exp@en= try=3D0x7f1fcfab8cf0) at eval.c:671 #16 0x00007f1fd7ca34b6 in scm_eval (exp=3D0x7f1fcfab8cf0, module_or_state= =3D0x7f1fcfaacc80) at eval.c:705 #17 0x00007f1fd7d073b6 in scm_shell (argc=3D11, argv=3D0x6a9fc0) at script.= c:357 #18 0x00007f1fd7cb249c in invoke_main_func (body_data=3D0x7fffd2c0d370) at = init.c:312 #19 0x00007f1fd7c9b85a in c_body (d=3D0x7fffd2c0d290) at continuations.c:430 #20 0x00007f1fd7d28326 in vm_regular_engine (thread=3D0x7f1fd73b1d80) at vm= -engine.c:972 #21 0x00007f1fd7d355d9 in scm_call_n (proc=3D, argv=3D, nargs=3D2) at vm.c:1610 #22 0x00007f1fd7c9d09a in scm_call_2 (proc=3D, arg1=3D, arg2=3D) at eval.c:503 #23 0x00007f1fd7d53742 in scm_c_with_exception_handler.constprop.0 (type=3D= 0x404, handler_data=3Dhandler_data@entry=3D0x7fffd2c0d220, thunk_data=3Dthunk_data@entry=3D0x7fffd2c0d220, thunk=3D= , handler=3D) at exceptions.c:170 #24 0x00007f1fd7d2588f in scm_c_catch (tag=3D, body=3D, body_data=3D, handler=3D, handler_data=3D, pre_unwind_handler=3D, p= re_unwind_handler_data=3D0x7f1fcfa10040) at throw.c:168 #25 0x00007f1fd7c9de66 in scm_i_with_continuation_barrier (pre_unwind_handl= er=3D0x7f1fd7c9db80 , pre_unwind_handler_data=3D0x7f1fcfa10040, handler_data=3D0x7fffd2c0d290= , handler=3D0x7f1fd7ca48b0 , body_data=3D0x7fffd2c0d290, body=3D0x7f1fd7c9b850 ) at continuations.c:368 #26 scm_c_with_continuation_barrier (func=3D, data=3D) at continuations.c:464 #27 0x00007f1fd7d24b39 in with_guile (base=3D0x7fffd2c0d318, data=3D0x7fffd= 2c0d340) at threads.c:645 #28 0x00007f1fd7bfb0ba in GC_call_with_stack_base () from target:/gnu/store= /2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1 #29 0x00007f1fd7d1d8b8 in scm_i_with_guile (dynamic_state=3D= , data=3D, func=3D) at threads.c:688 #30 scm_with_guile (func=3D, data=3D) at thre= ads.c:694 #31 0x00007f1fd7cbb025 in scm_boot_guile (argc=3D, argv=3D, main_func=3D, closure=3D) at init.c:295 #32 0x0000000000401191 in ?? () #33 0x00007f1fd7a117dd in __libc_start_main (main=3D0x401110, argc=3D8, arg= v=3D0x7fffd2c0d4a8, init=3D, fini=3D, rtld_fini=3D, stack_end=3D0x7fffd2c0d498) at ../csu/libc= -start.c:332 #34 0x00000000004011da in ?? () (gdb) p *fds $1 =3D {fd =3D 15, events =3D 1, revents =3D 0} (gdb) shell ls -l /proc/62345/fd/15 lrwx------ 1 root root 64 Apr 17 17:42 /proc/62345/fd/15 -> 'socket:[323897= 5699]' --8<---------------cut here---------------end--------------->8--- The SSH connection is still live, and on the server side I see the =E2=80= =98guix repl -t machine=E2=80=99 process stuck in: --8<---------------cut here---------------start------------->8--- write(2, "linking \342\200\230/gnu/store/0gbj38wswwxq50i68ci973fgi2l531b4-o= penjdk-16.0.2-checkout/test/hotspot/jtreg/compiler/c2/cr6340864/TestLongVec= t.java\342\200\231 to \342\200\230/gnu/store/.links/03zscrls9431zipkpx1sn45= 0llvf5hkhvbkpjk57j22f7d2nb3vs\342\200\231\n", 221 --8<---------------cut here---------------end--------------->8--- (This message comes from optimise-store.cc:192.) This suggests that =E2=80=98guix repl -t machine=E2=80=99 has filled its ou= tput buffer; its output isn=E2=80=99t being read so it gets stuck in that write(2) call.= Its parent =E2=80=98sshd=E2=80=99 process is stuck in: --8<---------------cut here---------------start------------->8--- ppoll([{fd=3D3, events=3DPOLLIN}, {fd=3D4, events=3D0}, {fd=3D10, events=3D= POLLIN}, {fd=3D12, events=3DPOLLIN}, {fd=3D11, events=3DPOLLOUT}], 5, NULL,= [], 8^C --8<---------------cut here---------------end--------------->8--- One hypothesis is that when transferring store items with many files, such as openjdk-checkout here, guix-daemon on the remote machine outputs many such =E2=80=9Clinking =E2=80=A6 to =E2=80=A6=E2=80=9D messages; if the= client doesn=E2=80=99t read them, then we end up with a deadlock like this, with the server-side process stuck writing more messages. However, that =E2=80=9Clinking =E2=80=A6=E2=80=9D message is supposed to be= written only at non-default debugging levels: printMsg(lvlTalkative, format("linking =E2=80=98%1%=E2=80=99 to =E2=80= =98%2%=E2=80=99") % path % linkPath); Turns out that guix-daemon on build machines is running with =E2=80=98--deb= ug=E2=80=99 starting with maintenance.git commit 72d8a62466b71eda657b52ca17e99f7189ea18ad (=E2=80=9Csysadmin: build-machines: Enable guix-daemon debug output.=E2=80=9D) The immediate fix is to undo that=E2=80=94I=E2=80=99ll do that later. The other one is to swallow =E2=80=98current-error-port=E2=80=99 in =E2=80=98store-import-channel=E2=80=99. To be continued=E2=80=A6 Ludo=E2=80=99.