unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
* bug#61839: [Cuirass] Build timeouts during evaluations
@ 2023-02-27 10:37 Ludovic Courtès
  2023-03-03 14:24 ` Ludovic Courtès
  2023-04-17 16:37 ` bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time Ludovic Courtès
  0 siblings, 2 replies; 4+ messages in thread
From: Ludovic Courtès @ 2023-02-27 10:37 UTC (permalink / raw)
  To: 61839; +Cc: Mathieu Othacehe

Hello Guix!

We often get evaluation errors for ‘core-updates’, where the build log
reveals that this is due to a build timeout while building the Guix
instance that will be used for the evaluation.  Example:

--8<---------------cut here---------------start------------->8---
building of `/gnu/store/x9dyzywwmsp8vmpvpx54yhcnz7rqin0l-asciidoc-9.1.0.drv' timed out after 3600 seconds of silence
cannot build derivation `/gnu/store/bl7aq8999lmp6flbyw3k656m522jmcl3-git-2.39.1.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/ldajjyvb4np1zq9l9s09j0dpsbpscpan-gnulib-2022-12-06-1.440b528.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/fdks470j7xjw3dc6fiyw15hdsxj6cdbc-guile-gnutls-3.7.11.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/71pl6a38bd0b81qgi45wz2ymvlv0x7xn-guix-cli-core.drv': 1 dependencies couldn't be built

[...]

cannot build derivation `/gnu/store/m7r73wypxhhmv8cqza1x8g2zyq9k1xpg-profile.drv': 1 dependencies couldn't be built

[...]

ERROR:
  1. &store-protocol-error:
      message: "build of `/gnu/store/m7r73wypxhhmv8cqza1x8g2zyq9k1xpg-profile.drv' failed"
      status: 101
--8<---------------cut here---------------end--------------->8---

(From <https://ci.guix.gnu.org/eval/233402/log/raw>.)

Likewise for the ‘guile’ jobset:

--8<---------------cut here---------------start------------->8---
exporting path `/gnu/store/m2hyx7qk2j1bsvpj1qv89701klv8kqkk-guix-command'
exporting path `/gnu/store/vlm7401d5jvna6j7qdxhcr581cd8g8vh-guix-daemon-1.4.0-4.01fd830'
exporting path `/gnu/store/zv3c0anfj085d2lyxh4g4aahhczmcaj8-guix-daemon'
exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence
cannot build derivation `/gnu/store/nrcsvww8j24rjzyin90w76qgbkgphhvp-profile.drv': 1 dependencies couldn't be built
Backtrace:
In ice-9/boot-9.scm:
  1752:10 12 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
          11 (apply-smob/0 #<thunk 7fd8cbe13340>)
In ice-9/boot-9.scm:
    724:2 10 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
In ice-9/eval.scm:
    619:8  9 (_ #(#(#<directory (guile-user) 7fd8cbe16c80>)))
In cuirass/ui.scm:
   104:10  8 (run-cuirass-command _ . _)
In cuirass/scripts/evaluate.scm:
   105:27  7 (cuirass-evaluate _)
In ice-9/boot-9.scm:
  1752:10  6 (with-exception-handler _ _ #:unwind? _ # _)
In guix/store.scm:
   658:37  5 (thunk)
  2123:25  4 (run-with-store #<store-connection 256.99 7fd8cad09dc0> ?)
In cuirass/scripts/evaluate.scm:
     87:8  3 (_ _)
In guix/store.scm:
  1995:38  2 (_ #<store-connection 256.99 7fd8a33e6d70>)
  1421:15  1 (_ #<store-connection 256.99 7fd8a33e6d70> _ _)
In ice-9/boot-9.scm:
  1685:16  0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
ERROR:
  1. &store-protocol-error:
      message: "build of `/gnu/store/nrcsvww8j24rjzyin90w76qgbkgphhvp-profile.drv' failed"
      status: 101
--8<---------------cut here---------------end--------------->8---

Building this particular derivation is quick though:

--8<---------------cut here---------------start------------->8---
$ time guix build /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv --check
The following derivation will be built:
  /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
building /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv...
(repl-version 0 1 1)
(values (value "/gnu/store/rgk1sc2zvllf7cgbyj5g28b9v2x6nvn3-guile/share/guile/site/3.0"))
successfully built /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
successfully built /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
/gnu/store/rgk1sc2zvllf7cgbyj5g28b9v2x6nvn3-guile

real    0m1.253s
user    0m0.751s
sys     0m0.034s
--8<---------------cut here---------------end--------------->8---

So the problem must be elsewhere.

One thing is that this particular .drv is quite big:

--8<---------------cut here---------------start------------->8---
$ guix size /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
store item                                                       total    self
/gnu/store/wjr1sqxci75jfkvwzmmvlgypkyp3fvqi-guix-17bd024           105.0   105.0  42.9%
/gnu/store/6b3fp2kwqxw7cvnh6jz9g49n9k2n5avy-guix-packages-base-source    36.2    36.2  14.8%
/gnu/store/h3hjni2bd8f5kbn8w0g2xx2jnagfynnr-guile-e2ed33e           29.0    29.0  11.8%
/gnu/store/gfb26xn5f2qyrxzv847cd6iql6h8igh0-packages                23.1    23.1   9.4%
/gnu/store/dj8rq4jd8qgk4k1hzg8djcvi6hlbdb7m-doc                     22.3    22.3   9.1%
/gnu/store/a5z192qxni40ly92cknwxkpnxc8zxfgf-guix                     7.8     7.8   3.2%
/gnu/store/f8wg2y3rrh4dgi6znjrv8a1l4ymkc7rb-guix-packages-source     2.8     2.8   1.2%
/gnu/store/8iwm8x6472xnf7g6hfxzw029gxm1qbkq-guix-system-source       2.5     2.5   1.0%
/gnu/store/2dxbf11shgvs1cgvzz6wpnfra97f0sfw-guix-extra-source        2.5     2.5   1.0%
/gnu/store/q57az26zdgafic5b8xsbxlh4iwddvr86-doc                      2.1     2.1   0.9%
[…]
/gnu/store/0mxnx8l4fgigvd7gakwdk6hc6im4wnai-disarchive-mirrors       0.0     0.0   0.0%
total: 244.8 MiB
--8<---------------cut here---------------end--------------->8---

Merely transferring it for offloading takes time; depending on machine
load, perhaps it can hit the 1h max-silent timeout?  That’d mean the
throughput is below 70 KiB/s.  Plausible?

The closure of
/gnu/store/x9dyzywwmsp8vmpvpx54yhcnz7rqin0l-asciidoc-9.1.0.drv is
3.0 MiB and it builds in 4 seconds, so there must be something else
going on, perhaps a bug in libssh/Guile-SSH leading to stale
connections?

Thoughts?

Ludo’.




^ permalink raw reply	[flat|nested] 4+ messages in thread

* bug#61839: [Cuirass] Build timeouts during evaluations
  2023-02-27 10:37 bug#61839: [Cuirass] Build timeouts during evaluations Ludovic Courtès
@ 2023-03-03 14:24 ` Ludovic Courtès
  2023-04-17 16:37 ` bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time Ludovic Courtès
  1 sibling, 0 replies; 4+ messages in thread
From: Ludovic Courtès @ 2023-03-03 14:24 UTC (permalink / raw)
  To: 61839; +Cc: Mathieu Othacehe

Ludovic Courtès <ludo@gnu.org> skribis:

> exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
> exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
> building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence

I experienced something similar while building from the command line on
berlin, so it looks like an offloading problem.

Ludo’.




^ permalink raw reply	[flat|nested] 4+ messages in thread

* bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time
  2023-02-27 10:37 bug#61839: [Cuirass] Build timeouts during evaluations Ludovic Courtès
  2023-03-03 14:24 ` Ludovic Courtès
@ 2023-04-17 16:37 ` Ludovic Courtès
  2023-04-17 22:55   ` Ludovic Courtès
  1 sibling, 1 reply; 4+ messages in thread
From: Ludovic Courtès @ 2023-04-17 16:37 UTC (permalink / raw)
  To: 61839; +Cc: Mathieu Othacehe, Andreas Enge

Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

> We often get evaluation errors for ‘core-updates’, where the build log
> reveals that this is due to a build timeout while building the Guix
> instance that will be used for the evaluation.  Example:

[...]

> exporting path `/gnu/store/m2hyx7qk2j1bsvpj1qv89701klv8kqkk-guix-command'
> exporting path `/gnu/store/vlm7401d5jvna6j7qdxhcr581cd8g8vh-guix-daemon-1.4.0-4.01fd830'
> exporting path `/gnu/store/zv3c0anfj085d2lyxh4g4aahhczmcaj8-guix-daemon'
> exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
> exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
> building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence

With help from Andreas, I was able to reproduce it on berlin by running:

  guix build /gnu/store/b22580wj63wk4a3i09n22wa60y36kaa5-openjdk-16.0.2.drv

Timeout happens while ‘guix offload’ is sending build requisites to the
target machine.  At that point, the client ‘guix offload’ process is
stuck like so:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007f1fd7ad7d6f in __GI___poll (fds=0x766b00, nfds=1, timeout=3594784) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f1fcc220de7 in ssh_poll_ctx_dopoll () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#2  0x00007f1fcc221f0c in ssh_handle_packets () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#3  0x00007f1fcc221fdb in ssh_handle_packets_termination ()
   from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#4  0x00007f1fcc2034bb in channel_write_common () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#5  0x00007f1fcc28b15b in ?? () from target:/gnu/store/rjhq6cz7xbck4l7zar6ga3ygph2chwzr-guile-ssh-0.16.0/lib/libguile-ssh.so
#6  0x00007f1fd7cf4f34 in scm_i_write_bytes (port=0x7f1fc556ba00, src=0x7f1fc3c27000, start=0, count=65536) at ports.c:2858
#7  0x00007f1fd7cff546 in scm_put_bytevector (port=0x7f1fc556ba00, bv=0x7f1fc3c27000, start=<optimized out>, count=<optimized out>)
    at r6rs-ports.c:676
#8  0x00007f1fc5ac5197 in ?? ()
#9  0x000000000069731c in ?? ()
#10 0x00007f1fc5acdf71 in ?? ()
#11 0x0000000000697318 in ?? ()
#12 0x00007f1fd7cd2ccc in scm_jit_enter_mcode (thread=0x7f1fd73b1d80, mcode=0x69bf14 "\034;\004") at jit.c:6038
#13 0x00007f1fd7d2874f in vm_regular_engine (thread=0x7f1fd73b1d80) at vm-engine.c:546
#14 0x00007f1fd7d355d9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=1) at vm.c:1610
#15 0x00007f1fd7c9d457 in scm_primitive_eval (exp=<optimized out>, exp@entry=0x7f1fcfab8cf0) at eval.c:671
#16 0x00007f1fd7ca34b6 in scm_eval (exp=0x7f1fcfab8cf0, module_or_state=0x7f1fcfaacc80) at eval.c:705
#17 0x00007f1fd7d073b6 in scm_shell (argc=11, argv=0x6a9fc0) at script.c:357
#18 0x00007f1fd7cb249c in invoke_main_func (body_data=0x7fffd2c0d370) at init.c:312
#19 0x00007f1fd7c9b85a in c_body (d=0x7fffd2c0d290) at continuations.c:430
#20 0x00007f1fd7d28326 in vm_regular_engine (thread=0x7f1fd73b1d80) at vm-engine.c:972
#21 0x00007f1fd7d355d9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=2) at vm.c:1610
#22 0x00007f1fd7c9d09a in scm_call_2 (proc=<optimized out>, arg1=<optimized out>, arg2=<optimized out>) at eval.c:503
#23 0x00007f1fd7d53742 in scm_c_with_exception_handler.constprop.0 (type=0x404, handler_data=handler_data@entry=0x7fffd2c0d220,
    thunk_data=thunk_data@entry=0x7fffd2c0d220, thunk=<optimized out>, handler=<optimized out>) at exceptions.c:170
#24 0x00007f1fd7d2588f in scm_c_catch (tag=<optimized out>, body=<optimized out>, body_data=<optimized out>, handler=<optimized out>,
    handler_data=<optimized out>, pre_unwind_handler=<optimized out>, pre_unwind_handler_data=0x7f1fcfa10040) at throw.c:168
#25 0x00007f1fd7c9de66 in scm_i_with_continuation_barrier (pre_unwind_handler=0x7f1fd7c9db80 <pre_unwind_handler>,
    pre_unwind_handler_data=0x7f1fcfa10040, handler_data=0x7fffd2c0d290, handler=0x7f1fd7ca48b0 <c_handler>, body_data=0x7fffd2c0d290,
    body=0x7f1fd7c9b850 <c_body>) at continuations.c:368
#26 scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:464
#27 0x00007f1fd7d24b39 in with_guile (base=0x7fffd2c0d318, data=0x7fffd2c0d340) at threads.c:645
#28 0x00007f1fd7bfb0ba in GC_call_with_stack_base () from target:/gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#29 0x00007f1fd7d1d8b8 in scm_i_with_guile (dynamic_state=<optimized out>, data=<optimized out>, func=<optimized out>) at threads.c:688
#30 scm_with_guile (func=<optimized out>, data=<optimized out>) at threads.c:694
#31 0x00007f1fd7cbb025 in scm_boot_guile (argc=<optimized out>, argv=<optimized out>, main_func=<optimized out>, closure=<optimized out>)
    at init.c:295
#32 0x0000000000401191 in ?? ()
#33 0x00007f1fd7a117dd in __libc_start_main (main=0x401110, argc=8, argv=0x7fffd2c0d4a8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fffd2c0d498) at ../csu/libc-start.c:332
#34 0x00000000004011da in ?? ()
(gdb) p *fds
$1 = {fd = 15, events = 1, revents = 0}
(gdb) shell ls -l /proc/62345/fd/15
lrwx------ 1 root root 64 Apr 17 17:42 /proc/62345/fd/15 -> 'socket:[3238975699]'
--8<---------------cut here---------------end--------------->8---

The SSH connection is still live, and on the server side I see the ‘guix
repl -t machine’ process stuck in:

--8<---------------cut here---------------start------------->8---
write(2, "linking \342\200\230/gnu/store/0gbj38wswwxq50i68ci973fgi2l531b4-openjdk-16.0.2-checkout/test/hotspot/jtreg/compiler/c2/cr6340864/TestLongVect.java\342\200\231 to \342\200\230/gnu/store/.links/03zscrls9431zipkpx1sn450llvf5hkhvbkpjk57j22f7d2nb3vs\342\200\231\n", 221
--8<---------------cut here---------------end--------------->8---

(This message comes from optimise-store.cc:192.)

This suggests that ‘guix repl -t machine’ has filled its output buffer;
its output isn’t being read so it gets stuck in that write(2) call.  Its
parent ‘sshd’ process is stuck in:

--8<---------------cut here---------------start------------->8---
ppoll([{fd=3, events=POLLIN}, {fd=4, events=0}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=11, events=POLLOUT}], 5, NULL, [], 8^C
--8<---------------cut here---------------end--------------->8---

One hypothesis is that when transferring store items with many files,
such as openjdk-checkout here, guix-daemon on the remote machine outputs
many such “linking … to …” messages; if the client doesn’t read them,
then we end up with a deadlock like this, with the server-side process
stuck writing more messages.

However, that “linking …” message is supposed to be written only at
non-default debugging levels:

    printMsg(lvlTalkative, format("linking ‘%1%’ to ‘%2%’") % path % linkPath);

Turns out that guix-daemon on build machines is running with ‘--debug’
starting with maintenance.git commit
72d8a62466b71eda657b52ca17e99f7189ea18ad (“sysadmin: build-machines:
Enable guix-daemon debug output.”)

The immediate fix is to undo that—I’ll do that later.

The other one is to swallow ‘current-error-port’ in
‘store-import-channel’.

To be continued…

Ludo’.




^ permalink raw reply	[flat|nested] 4+ messages in thread

* bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time
  2023-04-17 16:37 ` bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time Ludovic Courtès
@ 2023-04-17 22:55   ` Ludovic Courtès
  0 siblings, 0 replies; 4+ messages in thread
From: Ludovic Courtès @ 2023-04-17 22:55 UTC (permalink / raw)
  To: 61839; +Cc: Mathieu Othacehe, Andreas Enge

Hi,

Ludovic Courtès <ludo@gnu.org> skribis:

> Turns out that guix-daemon on build machines is running with ‘--debug’
> starting with maintenance.git commit
> 72d8a62466b71eda657b52ca17e99f7189ea18ad (“sysadmin: build-machines:
> Enable guix-daemon debug output.”)
>
> The immediate fix is to undo that—I’ll do that later.

Done:

  https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=cbe75edae348bf0aa235e93ebeb58fe9db14fa7a

I redeployed the build machines (on the same Guix commit as before).
The 4 machines that host a childhurd are still being redeployed though.

> The other one is to swallow ‘current-error-port’ in
> ‘store-import-channel’.

Done:

  https://git.savannah.gnu.org/cgit/guix.git/commit/?id=166a3e3fdeaa5279b2dd5000088f913c05af9558

Ludo’.




^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-04-17 23:32 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-27 10:37 bug#61839: [Cuirass] Build timeouts during evaluations Ludovic Courtès
2023-03-03 14:24 ` Ludovic Courtès
2023-04-17 16:37 ` bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time Ludovic Courtès
2023-04-17 22:55   ` Ludovic Courtès

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).