all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: "Ludovic Courtès" <ludo@gnu.org>
To: 61839@debbugs.gnu.org
Cc: Mathieu Othacehe <othacehe@gnu.org>, Andreas Enge <andreas@enge.fr>
Subject: bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time
Date: Mon, 17 Apr 2023 18:37:36 +0200	[thread overview]
Message-ID: <87edoicvn3.fsf@gnu.org> (raw)
In-Reply-To: <87a60zo0le.fsf@inria.fr> ("Ludovic Courtès"'s message of "Mon, 27 Feb 2023 11:37:49 +0100")

Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

> We often get evaluation errors for ‘core-updates’, where the build log
> reveals that this is due to a build timeout while building the Guix
> instance that will be used for the evaluation.  Example:

[...]

> exporting path `/gnu/store/m2hyx7qk2j1bsvpj1qv89701klv8kqkk-guix-command'
> exporting path `/gnu/store/vlm7401d5jvna6j7qdxhcr581cd8g8vh-guix-daemon-1.4.0-4.01fd830'
> exporting path `/gnu/store/zv3c0anfj085d2lyxh4g4aahhczmcaj8-guix-daemon'
> exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
> exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
> building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence

With help from Andreas, I was able to reproduce it on berlin by running:

  guix build /gnu/store/b22580wj63wk4a3i09n22wa60y36kaa5-openjdk-16.0.2.drv

Timeout happens while ‘guix offload’ is sending build requisites to the
target machine.  At that point, the client ‘guix offload’ process is
stuck like so:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007f1fd7ad7d6f in __GI___poll (fds=0x766b00, nfds=1, timeout=3594784) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f1fcc220de7 in ssh_poll_ctx_dopoll () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#2  0x00007f1fcc221f0c in ssh_handle_packets () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#3  0x00007f1fcc221fdb in ssh_handle_packets_termination ()
   from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#4  0x00007f1fcc2034bb in channel_write_common () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#5  0x00007f1fcc28b15b in ?? () from target:/gnu/store/rjhq6cz7xbck4l7zar6ga3ygph2chwzr-guile-ssh-0.16.0/lib/libguile-ssh.so
#6  0x00007f1fd7cf4f34 in scm_i_write_bytes (port=0x7f1fc556ba00, src=0x7f1fc3c27000, start=0, count=65536) at ports.c:2858
#7  0x00007f1fd7cff546 in scm_put_bytevector (port=0x7f1fc556ba00, bv=0x7f1fc3c27000, start=<optimized out>, count=<optimized out>)
    at r6rs-ports.c:676
#8  0x00007f1fc5ac5197 in ?? ()
#9  0x000000000069731c in ?? ()
#10 0x00007f1fc5acdf71 in ?? ()
#11 0x0000000000697318 in ?? ()
#12 0x00007f1fd7cd2ccc in scm_jit_enter_mcode (thread=0x7f1fd73b1d80, mcode=0x69bf14 "\034;\004") at jit.c:6038
#13 0x00007f1fd7d2874f in vm_regular_engine (thread=0x7f1fd73b1d80) at vm-engine.c:546
#14 0x00007f1fd7d355d9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=1) at vm.c:1610
#15 0x00007f1fd7c9d457 in scm_primitive_eval (exp=<optimized out>, exp@entry=0x7f1fcfab8cf0) at eval.c:671
#16 0x00007f1fd7ca34b6 in scm_eval (exp=0x7f1fcfab8cf0, module_or_state=0x7f1fcfaacc80) at eval.c:705
#17 0x00007f1fd7d073b6 in scm_shell (argc=11, argv=0x6a9fc0) at script.c:357
#18 0x00007f1fd7cb249c in invoke_main_func (body_data=0x7fffd2c0d370) at init.c:312
#19 0x00007f1fd7c9b85a in c_body (d=0x7fffd2c0d290) at continuations.c:430
#20 0x00007f1fd7d28326 in vm_regular_engine (thread=0x7f1fd73b1d80) at vm-engine.c:972
#21 0x00007f1fd7d355d9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=2) at vm.c:1610
#22 0x00007f1fd7c9d09a in scm_call_2 (proc=<optimized out>, arg1=<optimized out>, arg2=<optimized out>) at eval.c:503
#23 0x00007f1fd7d53742 in scm_c_with_exception_handler.constprop.0 (type=0x404, handler_data=handler_data@entry=0x7fffd2c0d220,
    thunk_data=thunk_data@entry=0x7fffd2c0d220, thunk=<optimized out>, handler=<optimized out>) at exceptions.c:170
#24 0x00007f1fd7d2588f in scm_c_catch (tag=<optimized out>, body=<optimized out>, body_data=<optimized out>, handler=<optimized out>,
    handler_data=<optimized out>, pre_unwind_handler=<optimized out>, pre_unwind_handler_data=0x7f1fcfa10040) at throw.c:168
#25 0x00007f1fd7c9de66 in scm_i_with_continuation_barrier (pre_unwind_handler=0x7f1fd7c9db80 <pre_unwind_handler>,
    pre_unwind_handler_data=0x7f1fcfa10040, handler_data=0x7fffd2c0d290, handler=0x7f1fd7ca48b0 <c_handler>, body_data=0x7fffd2c0d290,
    body=0x7f1fd7c9b850 <c_body>) at continuations.c:368
#26 scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:464
#27 0x00007f1fd7d24b39 in with_guile (base=0x7fffd2c0d318, data=0x7fffd2c0d340) at threads.c:645
#28 0x00007f1fd7bfb0ba in GC_call_with_stack_base () from target:/gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#29 0x00007f1fd7d1d8b8 in scm_i_with_guile (dynamic_state=<optimized out>, data=<optimized out>, func=<optimized out>) at threads.c:688
#30 scm_with_guile (func=<optimized out>, data=<optimized out>) at threads.c:694
#31 0x00007f1fd7cbb025 in scm_boot_guile (argc=<optimized out>, argv=<optimized out>, main_func=<optimized out>, closure=<optimized out>)
    at init.c:295
#32 0x0000000000401191 in ?? ()
#33 0x00007f1fd7a117dd in __libc_start_main (main=0x401110, argc=8, argv=0x7fffd2c0d4a8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fffd2c0d498) at ../csu/libc-start.c:332
#34 0x00000000004011da in ?? ()
(gdb) p *fds
$1 = {fd = 15, events = 1, revents = 0}
(gdb) shell ls -l /proc/62345/fd/15
lrwx------ 1 root root 64 Apr 17 17:42 /proc/62345/fd/15 -> 'socket:[3238975699]'
--8<---------------cut here---------------end--------------->8---

The SSH connection is still live, and on the server side I see the ‘guix
repl -t machine’ process stuck in:

--8<---------------cut here---------------start------------->8---
write(2, "linking \342\200\230/gnu/store/0gbj38wswwxq50i68ci973fgi2l531b4-openjdk-16.0.2-checkout/test/hotspot/jtreg/compiler/c2/cr6340864/TestLongVect.java\342\200\231 to \342\200\230/gnu/store/.links/03zscrls9431zipkpx1sn450llvf5hkhvbkpjk57j22f7d2nb3vs\342\200\231\n", 221
--8<---------------cut here---------------end--------------->8---

(This message comes from optimise-store.cc:192.)

This suggests that ‘guix repl -t machine’ has filled its output buffer;
its output isn’t being read so it gets stuck in that write(2) call.  Its
parent ‘sshd’ process is stuck in:

--8<---------------cut here---------------start------------->8---
ppoll([{fd=3, events=POLLIN}, {fd=4, events=0}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=11, events=POLLOUT}], 5, NULL, [], 8^C
--8<---------------cut here---------------end--------------->8---

One hypothesis is that when transferring store items with many files,
such as openjdk-checkout here, guix-daemon on the remote machine outputs
many such “linking … to …” messages; if the client doesn’t read them,
then we end up with a deadlock like this, with the server-side process
stuck writing more messages.

However, that “linking …” message is supposed to be written only at
non-default debugging levels:

    printMsg(lvlTalkative, format("linking ‘%1%’ to ‘%2%’") % path % linkPath);

Turns out that guix-daemon on build machines is running with ‘--debug’
starting with maintenance.git commit
72d8a62466b71eda657b52ca17e99f7189ea18ad (“sysadmin: build-machines:
Enable guix-daemon debug output.”)

The immediate fix is to undo that—I’ll do that later.

The other one is to swallow ‘current-error-port’ in
‘store-import-channel’.

To be continued…

Ludo’.




  parent reply	other threads:[~2023-04-17 17:01 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-27 10:37 bug#61839: [Cuirass] Build timeouts during evaluations Ludovic Courtès
2023-03-03 14:24 ` Ludovic Courtès
2023-04-17 16:37 ` Ludovic Courtès [this message]
2023-04-17 22:55   ` bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87edoicvn3.fsf@gnu.org \
    --to=ludo@gnu.org \
    --cc=61839@debbugs.gnu.org \
    --cc=andreas@enge.fr \
    --cc=othacehe@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.