* bug#61839: [Cuirass] Build timeouts during evaluations
@ 2023-02-27 10:37 Ludovic Courtès
2023-03-03 14:24 ` Ludovic Courtès
2023-04-17 16:37 ` bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time Ludovic Courtès
0 siblings, 2 replies; 4+ messages in thread
From: Ludovic Courtès @ 2023-02-27 10:37 UTC (permalink / raw)
To: 61839; +Cc: Mathieu Othacehe
Hello Guix!
We often get evaluation errors for ‘core-updates’, where the build log
reveals that this is due to a build timeout while building the Guix
instance that will be used for the evaluation. Example:
--8<---------------cut here---------------start------------->8---
building of `/gnu/store/x9dyzywwmsp8vmpvpx54yhcnz7rqin0l-asciidoc-9.1.0.drv' timed out after 3600 seconds of silence
cannot build derivation `/gnu/store/bl7aq8999lmp6flbyw3k656m522jmcl3-git-2.39.1.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/ldajjyvb4np1zq9l9s09j0dpsbpscpan-gnulib-2022-12-06-1.440b528.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/fdks470j7xjw3dc6fiyw15hdsxj6cdbc-guile-gnutls-3.7.11.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/71pl6a38bd0b81qgi45wz2ymvlv0x7xn-guix-cli-core.drv': 1 dependencies couldn't be built
[...]
cannot build derivation `/gnu/store/m7r73wypxhhmv8cqza1x8g2zyq9k1xpg-profile.drv': 1 dependencies couldn't be built
[...]
ERROR:
1. &store-protocol-error:
message: "build of `/gnu/store/m7r73wypxhhmv8cqza1x8g2zyq9k1xpg-profile.drv' failed"
status: 101
--8<---------------cut here---------------end--------------->8---
(From <https://ci.guix.gnu.org/eval/233402/log/raw>.)
Likewise for the ‘guile’ jobset:
--8<---------------cut here---------------start------------->8---
exporting path `/gnu/store/m2hyx7qk2j1bsvpj1qv89701klv8kqkk-guix-command'
exporting path `/gnu/store/vlm7401d5jvna6j7qdxhcr581cd8g8vh-guix-daemon-1.4.0-4.01fd830'
exporting path `/gnu/store/zv3c0anfj085d2lyxh4g4aahhczmcaj8-guix-daemon'
exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence
cannot build derivation `/gnu/store/nrcsvww8j24rjzyin90w76qgbkgphhvp-profile.drv': 1 dependencies couldn't be built
Backtrace:
In ice-9/boot-9.scm:
1752:10 12 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
11 (apply-smob/0 #<thunk 7fd8cbe13340>)
In ice-9/boot-9.scm:
724:2 10 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
In ice-9/eval.scm:
619:8 9 (_ #(#(#<directory (guile-user) 7fd8cbe16c80>)))
In cuirass/ui.scm:
104:10 8 (run-cuirass-command _ . _)
In cuirass/scripts/evaluate.scm:
105:27 7 (cuirass-evaluate _)
In ice-9/boot-9.scm:
1752:10 6 (with-exception-handler _ _ #:unwind? _ # _)
In guix/store.scm:
658:37 5 (thunk)
2123:25 4 (run-with-store #<store-connection 256.99 7fd8cad09dc0> ?)
In cuirass/scripts/evaluate.scm:
87:8 3 (_ _)
In guix/store.scm:
1995:38 2 (_ #<store-connection 256.99 7fd8a33e6d70>)
1421:15 1 (_ #<store-connection 256.99 7fd8a33e6d70> _ _)
In ice-9/boot-9.scm:
1685:16 0 (raise-exception _ #:continuable? _)
ice-9/boot-9.scm:1685:16: In procedure raise-exception:
ERROR:
1. &store-protocol-error:
message: "build of `/gnu/store/nrcsvww8j24rjzyin90w76qgbkgphhvp-profile.drv' failed"
status: 101
--8<---------------cut here---------------end--------------->8---
Building this particular derivation is quick though:
--8<---------------cut here---------------start------------->8---
$ time guix build /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv --check
The following derivation will be built:
/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
building /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv...
(repl-version 0 1 1)
(values (value "/gnu/store/rgk1sc2zvllf7cgbyj5g28b9v2x6nvn3-guile/share/guile/site/3.0"))
successfully built /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
successfully built /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
/gnu/store/rgk1sc2zvllf7cgbyj5g28b9v2x6nvn3-guile
real 0m1.253s
user 0m0.751s
sys 0m0.034s
--8<---------------cut here---------------end--------------->8---
So the problem must be elsewhere.
One thing is that this particular .drv is quite big:
--8<---------------cut here---------------start------------->8---
$ guix size /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
store item total self
/gnu/store/wjr1sqxci75jfkvwzmmvlgypkyp3fvqi-guix-17bd024 105.0 105.0 42.9%
/gnu/store/6b3fp2kwqxw7cvnh6jz9g49n9k2n5avy-guix-packages-base-source 36.2 36.2 14.8%
/gnu/store/h3hjni2bd8f5kbn8w0g2xx2jnagfynnr-guile-e2ed33e 29.0 29.0 11.8%
/gnu/store/gfb26xn5f2qyrxzv847cd6iql6h8igh0-packages 23.1 23.1 9.4%
/gnu/store/dj8rq4jd8qgk4k1hzg8djcvi6hlbdb7m-doc 22.3 22.3 9.1%
/gnu/store/a5z192qxni40ly92cknwxkpnxc8zxfgf-guix 7.8 7.8 3.2%
/gnu/store/f8wg2y3rrh4dgi6znjrv8a1l4ymkc7rb-guix-packages-source 2.8 2.8 1.2%
/gnu/store/8iwm8x6472xnf7g6hfxzw029gxm1qbkq-guix-system-source 2.5 2.5 1.0%
/gnu/store/2dxbf11shgvs1cgvzz6wpnfra97f0sfw-guix-extra-source 2.5 2.5 1.0%
/gnu/store/q57az26zdgafic5b8xsbxlh4iwddvr86-doc 2.1 2.1 0.9%
[…]
/gnu/store/0mxnx8l4fgigvd7gakwdk6hc6im4wnai-disarchive-mirrors 0.0 0.0 0.0%
total: 244.8 MiB
--8<---------------cut here---------------end--------------->8---
Merely transferring it for offloading takes time; depending on machine
load, perhaps it can hit the 1h max-silent timeout? That’d mean the
throughput is below 70 KiB/s. Plausible?
The closure of
/gnu/store/x9dyzywwmsp8vmpvpx54yhcnz7rqin0l-asciidoc-9.1.0.drv is
3.0 MiB and it builds in 4 seconds, so there must be something else
going on, perhaps a bug in libssh/Guile-SSH leading to stale
connections?
Thoughts?
Ludo’.
^ permalink raw reply [flat|nested] 4+ messages in thread
* bug#61839: [Cuirass] Build timeouts during evaluations
2023-02-27 10:37 bug#61839: [Cuirass] Build timeouts during evaluations Ludovic Courtès
@ 2023-03-03 14:24 ` Ludovic Courtès
2023-04-17 16:37 ` bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time Ludovic Courtès
1 sibling, 0 replies; 4+ messages in thread
From: Ludovic Courtès @ 2023-03-03 14:24 UTC (permalink / raw)
To: 61839; +Cc: Mathieu Othacehe
Ludovic Courtès <ludo@gnu.org> skribis:
> exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
> exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
> building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence
I experienced something similar while building from the command line on
berlin, so it looks like an offloading problem.
Ludo’.
^ permalink raw reply [flat|nested] 4+ messages in thread
* bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time
2023-02-27 10:37 bug#61839: [Cuirass] Build timeouts during evaluations Ludovic Courtès
2023-03-03 14:24 ` Ludovic Courtès
@ 2023-04-17 16:37 ` Ludovic Courtès
2023-04-17 22:55 ` Ludovic Courtès
1 sibling, 1 reply; 4+ messages in thread
From: Ludovic Courtès @ 2023-04-17 16:37 UTC (permalink / raw)
To: 61839; +Cc: Mathieu Othacehe, Andreas Enge
Hi!
Ludovic Courtès <ludo@gnu.org> skribis:
> We often get evaluation errors for ‘core-updates’, where the build log
> reveals that this is due to a build timeout while building the Guix
> instance that will be used for the evaluation. Example:
[...]
> exporting path `/gnu/store/m2hyx7qk2j1bsvpj1qv89701klv8kqkk-guix-command'
> exporting path `/gnu/store/vlm7401d5jvna6j7qdxhcr581cd8g8vh-guix-daemon-1.4.0-4.01fd830'
> exporting path `/gnu/store/zv3c0anfj085d2lyxh4g4aahhczmcaj8-guix-daemon'
> exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
> exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
> building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence
With help from Andreas, I was able to reproduce it on berlin by running:
guix build /gnu/store/b22580wj63wk4a3i09n22wa60y36kaa5-openjdk-16.0.2.drv
Timeout happens while ‘guix offload’ is sending build requisites to the
target machine. At that point, the client ‘guix offload’ process is
stuck like so:
--8<---------------cut here---------------start------------->8---
(gdb) bt
#0 0x00007f1fd7ad7d6f in __GI___poll (fds=0x766b00, nfds=1, timeout=3594784) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f1fcc220de7 in ssh_poll_ctx_dopoll () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#2 0x00007f1fcc221f0c in ssh_handle_packets () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#3 0x00007f1fcc221fdb in ssh_handle_packets_termination ()
from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#4 0x00007f1fcc2034bb in channel_write_common () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#5 0x00007f1fcc28b15b in ?? () from target:/gnu/store/rjhq6cz7xbck4l7zar6ga3ygph2chwzr-guile-ssh-0.16.0/lib/libguile-ssh.so
#6 0x00007f1fd7cf4f34 in scm_i_write_bytes (port=0x7f1fc556ba00, src=0x7f1fc3c27000, start=0, count=65536) at ports.c:2858
#7 0x00007f1fd7cff546 in scm_put_bytevector (port=0x7f1fc556ba00, bv=0x7f1fc3c27000, start=<optimized out>, count=<optimized out>)
at r6rs-ports.c:676
#8 0x00007f1fc5ac5197 in ?? ()
#9 0x000000000069731c in ?? ()
#10 0x00007f1fc5acdf71 in ?? ()
#11 0x0000000000697318 in ?? ()
#12 0x00007f1fd7cd2ccc in scm_jit_enter_mcode (thread=0x7f1fd73b1d80, mcode=0x69bf14 "\034;\004") at jit.c:6038
#13 0x00007f1fd7d2874f in vm_regular_engine (thread=0x7f1fd73b1d80) at vm-engine.c:546
#14 0x00007f1fd7d355d9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=1) at vm.c:1610
#15 0x00007f1fd7c9d457 in scm_primitive_eval (exp=<optimized out>, exp@entry=0x7f1fcfab8cf0) at eval.c:671
#16 0x00007f1fd7ca34b6 in scm_eval (exp=0x7f1fcfab8cf0, module_or_state=0x7f1fcfaacc80) at eval.c:705
#17 0x00007f1fd7d073b6 in scm_shell (argc=11, argv=0x6a9fc0) at script.c:357
#18 0x00007f1fd7cb249c in invoke_main_func (body_data=0x7fffd2c0d370) at init.c:312
#19 0x00007f1fd7c9b85a in c_body (d=0x7fffd2c0d290) at continuations.c:430
#20 0x00007f1fd7d28326 in vm_regular_engine (thread=0x7f1fd73b1d80) at vm-engine.c:972
#21 0x00007f1fd7d355d9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=2) at vm.c:1610
#22 0x00007f1fd7c9d09a in scm_call_2 (proc=<optimized out>, arg1=<optimized out>, arg2=<optimized out>) at eval.c:503
#23 0x00007f1fd7d53742 in scm_c_with_exception_handler.constprop.0 (type=0x404, handler_data=handler_data@entry=0x7fffd2c0d220,
thunk_data=thunk_data@entry=0x7fffd2c0d220, thunk=<optimized out>, handler=<optimized out>) at exceptions.c:170
#24 0x00007f1fd7d2588f in scm_c_catch (tag=<optimized out>, body=<optimized out>, body_data=<optimized out>, handler=<optimized out>,
handler_data=<optimized out>, pre_unwind_handler=<optimized out>, pre_unwind_handler_data=0x7f1fcfa10040) at throw.c:168
#25 0x00007f1fd7c9de66 in scm_i_with_continuation_barrier (pre_unwind_handler=0x7f1fd7c9db80 <pre_unwind_handler>,
pre_unwind_handler_data=0x7f1fcfa10040, handler_data=0x7fffd2c0d290, handler=0x7f1fd7ca48b0 <c_handler>, body_data=0x7fffd2c0d290,
body=0x7f1fd7c9b850 <c_body>) at continuations.c:368
#26 scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:464
#27 0x00007f1fd7d24b39 in with_guile (base=0x7fffd2c0d318, data=0x7fffd2c0d340) at threads.c:645
#28 0x00007f1fd7bfb0ba in GC_call_with_stack_base () from target:/gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#29 0x00007f1fd7d1d8b8 in scm_i_with_guile (dynamic_state=<optimized out>, data=<optimized out>, func=<optimized out>) at threads.c:688
#30 scm_with_guile (func=<optimized out>, data=<optimized out>) at threads.c:694
#31 0x00007f1fd7cbb025 in scm_boot_guile (argc=<optimized out>, argv=<optimized out>, main_func=<optimized out>, closure=<optimized out>)
at init.c:295
#32 0x0000000000401191 in ?? ()
#33 0x00007f1fd7a117dd in __libc_start_main (main=0x401110, argc=8, argv=0x7fffd2c0d4a8, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffd2c0d498) at ../csu/libc-start.c:332
#34 0x00000000004011da in ?? ()
(gdb) p *fds
$1 = {fd = 15, events = 1, revents = 0}
(gdb) shell ls -l /proc/62345/fd/15
lrwx------ 1 root root 64 Apr 17 17:42 /proc/62345/fd/15 -> 'socket:[3238975699]'
--8<---------------cut here---------------end--------------->8---
The SSH connection is still live, and on the server side I see the ‘guix
repl -t machine’ process stuck in:
--8<---------------cut here---------------start------------->8---
write(2, "linking \342\200\230/gnu/store/0gbj38wswwxq50i68ci973fgi2l531b4-openjdk-16.0.2-checkout/test/hotspot/jtreg/compiler/c2/cr6340864/TestLongVect.java\342\200\231 to \342\200\230/gnu/store/.links/03zscrls9431zipkpx1sn450llvf5hkhvbkpjk57j22f7d2nb3vs\342\200\231\n", 221
--8<---------------cut here---------------end--------------->8---
(This message comes from optimise-store.cc:192.)
This suggests that ‘guix repl -t machine’ has filled its output buffer;
its output isn’t being read so it gets stuck in that write(2) call. Its
parent ‘sshd’ process is stuck in:
--8<---------------cut here---------------start------------->8---
ppoll([{fd=3, events=POLLIN}, {fd=4, events=0}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=11, events=POLLOUT}], 5, NULL, [], 8^C
--8<---------------cut here---------------end--------------->8---
One hypothesis is that when transferring store items with many files,
such as openjdk-checkout here, guix-daemon on the remote machine outputs
many such “linking … to …” messages; if the client doesn’t read them,
then we end up with a deadlock like this, with the server-side process
stuck writing more messages.
However, that “linking …” message is supposed to be written only at
non-default debugging levels:
printMsg(lvlTalkative, format("linking ‘%1%’ to ‘%2%’") % path % linkPath);
Turns out that guix-daemon on build machines is running with ‘--debug’
starting with maintenance.git commit
72d8a62466b71eda657b52ca17e99f7189ea18ad (“sysadmin: build-machines:
Enable guix-daemon debug output.”)
The immediate fix is to undo that—I’ll do that later.
The other one is to swallow ‘current-error-port’ in
‘store-import-channel’.
To be continued…
Ludo’.
^ permalink raw reply [flat|nested] 4+ messages in thread
* bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time
2023-04-17 16:37 ` bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time Ludovic Courtès
@ 2023-04-17 22:55 ` Ludovic Courtès
0 siblings, 0 replies; 4+ messages in thread
From: Ludovic Courtès @ 2023-04-17 22:55 UTC (permalink / raw)
To: 61839; +Cc: Mathieu Othacehe, Andreas Enge
Hi,
Ludovic Courtès <ludo@gnu.org> skribis:
> Turns out that guix-daemon on build machines is running with ‘--debug’
> starting with maintenance.git commit
> 72d8a62466b71eda657b52ca17e99f7189ea18ad (“sysadmin: build-machines:
> Enable guix-daemon debug output.”)
>
> The immediate fix is to undo that—I’ll do that later.
Done:
https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=cbe75edae348bf0aa235e93ebeb58fe9db14fa7a
I redeployed the build machines (on the same Guix commit as before).
The 4 machines that host a childhurd are still being redeployed though.
> The other one is to swallow ‘current-error-port’ in
> ‘store-import-channel’.
Done:
https://git.savannah.gnu.org/cgit/guix.git/commit/?id=166a3e3fdeaa5279b2dd5000088f913c05af9558
Ludo’.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2023-04-17 23:32 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-27 10:37 bug#61839: [Cuirass] Build timeouts during evaluations Ludovic Courtès
2023-03-03 14:24 ` Ludovic Courtès
2023-04-17 16:37 ` bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time Ludovic Courtès
2023-04-17 22:55 ` Ludovic Courtès
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/guix.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.