unofficial mirror of bug-guix@gnu.org 
 help / color / mirror / code / Atom feed
* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
       [not found] ` <87mtfj174l.fsf@gnu.org>
@ 2022-05-16  3:49   ` Maxim Cournoyer
  2022-05-16  7:32     ` Mathieu Othacehe
  0 siblings, 1 reply; 27+ messages in thread
From: Maxim Cournoyer @ 2022-05-16  3:49 UTC (permalink / raw)
  To: 55441; +Cc: Mathieu Othacehe, guix-sysadmin

Hello Guix,

Since a couple days, Craps exhibits a new problem, where the
evaluations on master appear stalled, showing as "In progress..."
forever.

Eventually, it seems the maximum number of PostgreSQL connections
allowed (300, if what Ricardo wrote on #guix is accurate :-)) is
reached, and we get the following error:

--8<---------------cut here---------------start------------->8---
Backtrace:
In ice-9/boot-9.scm:
  1752:10 11 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
          10 (apply-smob/0 #<thunk 7f4455f8b340>)
In ice-9/boot-9.scm:
    724:2  9 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
In ice-9/eval.scm:
    619:8  8 (_ #(#(#<directory (guile-user) 7f4455f8ec80>)))
In cuirass/ui.scm:
   104:10  7 (run-cuirass-command _ . _)
In cuirass/scripts/evaluate.scm:
    101:7  6 (cuirass-evaluate _)
In cuirass/utils.scm:
    123:6  5 (make-worker-thread-channel _ #:parallelism _)
In srfi/srfi-1.scm:
    634:9  4 (for-each #<procedure 7f4451dd9e40 at cuirass/utils.sc?> ?)
In cuirass/utils.scm:
   125:21  3 (_ . _)
In cuirass/scripts/evaluate.scm:
    101:7  2 (_)
In cuirass/database.scm:
   375:13  1 (db-open #:database _)
In squee.scm:
   257:10  0 (connect-to-postgres-paramstring _)

squee.scm:257:10: In procedure connect-to-postgres-paramstring:
Throw to key `psql-connect-error' with args `(connection-bad "FATAL:  remaining connection slots are reserved for non-replication superuser connections\n")'.
--8<---------------cut here---------------end--------------->8---

(see: https://ci.guix.gnu.org/eval/323648/log/raw)

Restarting postgresql workarounds the above error, but it doesn't fix
the hang, and the problem connection problem recurs.

I've logged at the Cuirass and messages logs, but I haven't spot
anything abnormal.

Thanks,

Maxim




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-16  3:49   ` bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections Maxim Cournoyer
@ 2022-05-16  7:32     ` Mathieu Othacehe
  2022-05-16  9:13       ` Ludovic Courtès
  0 siblings, 1 reply; 27+ messages in thread
From: Mathieu Othacehe @ 2022-05-16  7:32 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: ludo, 55441, guix-sysadmin


Hello Maxim,

Thanks for the report, the root cause here is the evaluation
hanging. The Postgres slots starvation is just a symptom caused by
having many concurrent (stuck) evaluation processes using slots.

I noticed this error: https://ci.guix.gnu.org/eval/320682/log/raw.

The following lines can be caused by having a multi-threaded program:
(cuirass scripts evaluate) because of its par-for-each procedure, call
"primitive-fork" in open-bidirectional-pipe of (guix inferior).

--8<---------------cut here---------------start------------->8---
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
--8<---------------cut here---------------end--------------->8---

Not sure that it could cause the observed hang, Ludo what do you think?

Thanks,

Mathieu




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-16  7:32     ` Mathieu Othacehe
@ 2022-05-16  9:13       ` Ludovic Courtès
  2022-05-16 10:32         ` Ludovic Courtès
  2022-05-16 12:49         ` Maxim Cournoyer
  0 siblings, 2 replies; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-16  9:13 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 55441, guix-sysadmin, Maxim Cournoyer

Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

> I noticed this error: https://ci.guix.gnu.org/eval/320682/log/raw.
>
> The following lines can be caused by having a multi-threaded program:
> (cuirass scripts evaluate) because of its par-for-each procedure, call
> "primitive-fork" in open-bidirectional-pipe of (guix inferior).
>
> warning: call to primitive-fork while multiple threads are running;
>          further behavior unspecified.  See "Processes" in the
>          manual, for more information.
>
> Not sure that it could cause the observed hang, Ludo what do you think?

As you write, the warning comes from ‘open-bidirectional-pipe’, added in
commit bd86bbd300474204878e927f6cd3f0defa1662a5.  I noticed it back then
but as far as I could see, it was harmless (but definitely suboptimal):

  https://issues.guix.gnu.org/48007#11

Indeed, “make cuirass-jobs” works for me.

I decided to go ahead and run ‘cuirass evaluate’ by hand to gather more
info:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ sudo su - cuirass -s /bin/sh -c "/gnu/store/qamrzl942khb7w0cb4csyflnxfkwvvxq-cuirass-1.1.0-11.9f08035/bin/cuirass evaluate dbname=cuirass 323183"
Computing Guix derivation for 'x86_64-linux'... |
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
2022-05-16T09:56:43 Registering builds for evaluation 323183.
2022-05-16T10:08:47 Registering builds for evaluation 323183.
--8<---------------cut here---------------end--------------->8---

I straced it in another terminal.  It did its work for a long while, and
eventually activity stopped:

--8<---------------cut here---------------start------------->8---
read(228, "nu/store/7gi4qsakz775z5wcmddg8xan7q806sj1-findutils-4.8.0\")) (#:nix-name . \"findutils-4.8.0\") (#:sys"..., 4096) = 4096
read(228, "nrd-gcc-cross-x86_64-w64-mingw32-10.3.0.drv\" \"/gnu/store/ykrksw421rd22zh4w3pc751r90wc881n-patch-2.7."..., 4096) = 4096
read(228, "wm-binutils-cross-x86_64-w64-mingw32-2.37.drv\" \"/gnu/store/xfh4965w2amrsl7x0h3dslrhndzn5bbz-linux-li"..., 4096) = 4096
read(228, "1ir-gawk-5.1.0.drv\" \"/gnu/store/p1d2w4fg1ljbyfs0ayv854snahcimamk-libunistring-0.9.10.drv\" \"/gnu/stor"..., 4096) = 4096
read(228, "3ysc-guile-2.2.7.drv\" \"/gnu/store/b0ma6l61ad4dfb3030sl1ngvdlicc523-glibc-utf8-locales-2.33.drv\" \"/gn"..., 4096) = 4096
read(228, "j6a5mc7w13xiw8136y4brv7w-xz-5.2.5.drv\" \"/gnu/store/w1bxsf02c4y0hj0nd39awv7i55ych641-libffi-3.3.drv\" "..., 4096) = 4096
read(228, "tore/qbqgamcpwgs2sw8lcr6pxl5y38fpyz0d-sed-4.8.drv\" \"/gnu/store/qmsbxbgb758h6f7018snxlwn69w9w9i4-diff"..., 4096) = 4096
read(228, "oconf-2.69.drv\" \"/gnu/store/6n5mn2vkl4x4fab37sx1rmqyqxqcwd4y-libidn2-2.3.1.drv\" \"/gnu/store/6q5cp0m2"..., 4096) = 4096
mmap(0x7fddd26af000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fddd268f000
read(228, "-binutils-cross-x86_64-w64-mingw32-2.37.drv\" \"/gnu/store/wdmk9llhdzx5y10mziss1i5z6q037a7z-guile-3.0."..., 4096) = 954
futex(0x7fdf015381f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdf01546b00, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fdf015381f4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdf01546b00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPWR {si_signo=SIGPWR, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigsuspend(~[INT QUIT ABRT BUS SEGV TERM XCPU RTMIN RT_1], 8) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGXCPU {si_signo=SIGXCPU, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigreturn({mask=~[INT QUIT ABRT BUS KILL SEGV TERM STOP RTMIN RT_1]}) = -1 EINTR (Interrupted system call)
rt_sigreturn({mask=[]})                 = 202
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPWR {si_signo=SIGPWR, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigsuspend(~[INT QUIT ABRT BUS SEGV TERM XCPU RTMIN RT_1], 8) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGXCPU {si_signo=SIGXCPU, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigreturn({mask=~[INT QUIT ABRT BUS KILL SEGV TERM STOP RTMIN RT_1]}) = -1 EINTR (Interrupted system call)
rt_sigreturn({mask=[]})                 = 202
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[…]
--- SIGXCPU {si_signo=SIGXCPU, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigreturn({mask=~[INT QUIT ABRT BUS KILL SEGV TERM STOP RTMIN RT_1]}) = -1 EINTR (Interrupted system call)
rt_sigreturn({mask=[]})                 = 202
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdf01546c00, FUTEX_WAKE_PRIVATE, 1) = 0
close(224)                              = 0
futex(0x7fdf01538f70, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
[stops here]
--8<---------------cut here---------------end--------------->8---

The ‘read’ calls are about getting the jobset alists; FD 228 is a socket
(to a ‘guix repl’ subprocess I presume).  SIGPWR/SIGXCPU suggests
attempts to make a stop-the-world GC.

GDB shows that the process has 120 threads (it’s a 96-core machine, but
still…) and most threads are stuck in pthread_cond_wait:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007fdf01d62afa in __futex_abstimed_wait_common64 (futex_word=futex_word@entry=0x7fded20cd8b4, 
    expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, 
    cancel=cancel@entry=true) at ../sysdeps/nptl/futex-internal.c:74
#1  0x00007fdf01d62b5b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fded20cd8b4, 
    expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0)
    at ../sysdeps/nptl/futex-internal.c:123
#2  0x00007fdf01d5cab0 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fdef9c15fc0, 
    cond=cond@entry=0x7fded20cd888) at pthread_cond_wait.c:504
#3  __pthread_cond_wait (cond=cond@entry=0x7fded20cd888, mutex=mutex@entry=0x7fdef9c15fc0) at pthread_cond_wait.c:619
#4  0x00007fdf01ea2a55 in scm_pthread_cond_wait (cond=cond@entry=0x7fded20cd888, mutex=mutex@entry=0x7fdef9c15fc0)
    at threads.c:1605
#5  0x00007fdf01ea7b6b in block_self (
    queue=((#<smob thread 7fdedfd05eb0>) #<smob thread 7fdedfd05af0> #<smob thread 7fdedfd05950> #<smob thread 7fdedfd05870> #<smob thread 7fdedfd05790> #<smob thread 7fdedfd056b0> #<smob thread 7fdedfd05570> #<smob thread 7fdedfd05490> #<smob thread 7fdedfd05350> #<smob thread 7fdedfd05150> #<smob thread 7fdedfd05070> #<smob thread 7fdedfd18f50> #<smob thread 7fdedfd18d70> #<smob thread 7fdedfd189a0> #<smob thread 7fdedfd18730> #<smob thread 7fdedfd185e0> #<smob thread 7fdedfd182b0> #<smob thread 7fdedfd18160> #<smob thread 7fdedfd21f50> #<smob thread 7fdedfd21e00> #<smob thread 7fdedfd21cb0> #<smob thread 7fdedfd217a0> #<smob thread 7fdedfd21650> #<smob thread 7fdedfd21500> #<smob thread 7fdedfd213b0> #<smob thread 7fdedfd21260> #<smob thread 7fdedfd2ff90> #<smob thread 7fdedfd2fe60> #<smob thread 7fdedfd2fcf0> #<smob thread 7fdedfd2fba0> #<smob thread 7fdedfd2fa50> #<smob thread 7fdedfd2f900> #<smob thread 7fdedfd2f650> #<smob thread 7fdedfd2f570> #<smob thread 7fdedfd2f430> #<smob thread 7fdedfd2f310> #<smob thread 7fdedfd2f1a0> #<smob thread 7fdedfd2f0b0> #<smob thread 7fdedfd4efd0> #<smob thread 7fdedfd4ee90> #<smob thread 7fdedfd4ebd0> #<smob thread 7fdedfd4e940> #<smob thread 7fdedfd4e760> #<smob thread 7fdedfd4e660> #<smob thread 7fdedfd4e270> #<smob thread 7fdedfdbefc0> #<smob thread 7fdedfdbeec0> #<smob thread 7fdedfdbede0> #<smob thread 7fdedfdbed00> #<smob thread 7fdedfdbeb00> #<smob thread 7fdedfdbe8f0> #<smob thread 7fdedfdbe810> #<smob thread 7fdedfdbe730> #<smob thread 7fdedfdbe650> #<smob thread 7fdedfdbe3f0> #<smob thread 7fdedfdbe220> #<smob thread 7fdedfdbe120> #<smob thread 7fdedfdbe020> #<smob thread 7fdedfa34610> #<smob thread 7fdedfa344d0> #<smob thread 7fdedfa34110> #<smob thread 7fdedfa52e90> #<smob thread 7fdedfa52d30> #<smob thread 7fdedfa52c40> #<smob thread 7fdedfa737e0> #<smob thread 7fdedfa73400> #<smob thread 7fdedfa73000> #<smob thread 7fdedfa88ef0> #<smob thread 7fdedfa88cc0> #<smob thread 7fdedfa888e0> #<smob thread 7fdedfa88800> #<smob thread 7fdedfa88720> #<smob thread 7fdedfa88620> #<smob thread 7fdedfa884b0> #<smob thread 7fdedfa88370> #<smob thread 7fdedfa880b0> #<smob thread 7fdedfb06fa0> #<smob thread 7fdedfb06b90> #<smob thread 7fdedfb066a0> #<smob thread 7fdedfb06470> #<smob thread 7fdedfb062e0> #<smob thread 7fdedfb06140> #<smob thread 7fdedfb58f80> #<smob thread 7fdedfb58af0> #<smob thread 7fdedfb58830> #<smob thread 7fdedfb93cd0> #<smob thread 7fdedfb93ae0> #<smob thread 7fdedfb936c0> #<smob thread 7fdedfb935c0> #<smob thread 7fdedfb93400> #<smob thread 7fdedfb932c0> #<smob thread 7fdedfd05ff0> #<smob thread 7fdedfd05c20> #<smob thread 7fdedfd05eb0>), mutex=mutex@entry=0x7fdef9c15fc0, waittime=waittime@entry=0x0) at threads.c:312
#6  0x00007fdf01eaa1cd in timed_wait (waittime=0x0, current_thread=0x7fded20cd6c0, c=0x7fdef9c10d20, 
    m=0x7fdef9c15fc0, kind=SCM_MUTEX_STANDARD) at threads.c:1331
#7  scm_timed_wait_condition_variable (cond=#<smob condition-variable 7fdef9c122e0>, 
    mutex=#<smob mutex 7fdef9c12320>, timeout=#<undefined 904>) at threads.c:1424
#8  0x00007fded169a7f3 in ?? ()
#9  0x0000000000b83688 in ?? ()
#10 0x00007fded169a750 in ?? ()
#11 0x00007fdeec009e98 in ?? ()
#12 0x00007fdf01e57ccc in scm_jit_enter_mcode (thread=0x7fded20cd6c0, mcode=0xb8369c "\034\330\003") at jit.c:6038
#13 0x00007fdf01eacf3c in vm_regular_engine (thread=0x7fded20cd6c0) at vm-engine.c:360
#14 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#15 0x00007fdf01e1ea0e in scm_call_with_unblocked_asyncs (proc=#<program 7fdef732e880>) at async.c:406
#16 0x00007fdf01ead336 in vm_regular_engine (thread=0x7fded20cd6c0) at vm-engine.c:972
#17 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#18 0x00007fdf01ea9be6 in really_launch (d=0x7fdef730c860) at threads.c:778
#19 0x00007fdf01e2085a in c_body (d=0x7fdec8c44d80) at continuations.c:430
#20 0x00007fded2874532 in ?? ()
#21 0x00007fdef9a227e0 in ?? ()
#22 0x00007fdef7d8c370 in ?? ()
#23 0x0000000000000048 in ?? ()
#24 0x00007fdf01e57ccc in scm_jit_enter_mcode (thread=0x7fded20cd6c0, mcode=0xb81874 "\034<\003") at jit.c:6038
#25 0x00007fdf01eacf3c in vm_regular_engine (thread=0x7fded20cd6c0) at vm-engine.c:360
#26 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=2) at vm.c:1608
#27 0x00007fdf01e2209a in scm_call_2 (proc=<optimized out>, arg1=<optimized out>, arg2=<optimized out>) at eval.c:503
#28 0x00007fdf01ed8752 in scm_c_with_exception_handler.constprop.0 (type=#t, 
    handler_data=handler_data@entry=0x7fdec8c44d10, thunk_data=thunk_data@entry=0x7fdec8c44d10, 
    thunk=<optimized out>, handler=<optimized out>) at exceptions.c:170
#29 0x00007fdf01eaa88f in scm_c_catch (tag=<optimized out>, body=<optimized out>, body_data=<optimized out>, 
    handler=<optimized out>, handler_data=<optimized out>, pre_unwind_handler=<optimized out>, 
    pre_unwind_handler_data=0x7fdef9b97040) at throw.c:168
#30 0x00007fdf01e22e66 in scm_i_with_continuation_barrier (pre_unwind_handler=0x7fdf01e22b80 <pre_unwind_handler>, 
    pre_unwind_handler_data=0x7fdef9b97040, handler_data=0x7fdec8c44d80, handler=0x7fdf01e298b0 <c_handler>, 
    body_data=0x7fdec8c44d80, body=0x7fdf01e20850 <c_body>) at continuations.c:368
#31 scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:464
#32 0x00007fdf01ea9b39 in with_guile (base=0x7fdec8c44e08, data=0x7fdec8c44e30) at threads.c:645
#33 0x00007fdf01d800ba in GC_call_with_stack_base ()
   from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#34 0x00007fdf01ea216d in scm_i_with_guile (dynamic_state=<optimized out>, data=0x7fdef730c860, 
    func=0x7fdf01ea9b70 <really_launch>) at threads.c:688
#35 launch_thread (d=0x7fdef730c860) at threads.c:787
#36 0x00007fdf01d56d7e in start_thread (arg=0x7fdec8c45640) at pthread_create.c:473
#37 0x00007fdf01954eff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
--8<---------------cut here---------------end--------------->8---

There’s also one thread stuck in read(2):

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  __libc_read (nbytes=1, buf=0x7fdedf97bdd0, fd=7) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=7, buf=buf@entry=0x7fdedf97bdd0, nbytes=nbytes@entry=1) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007fdf01e382e8 in fport_read (port=<optimized out>, dst=<optimized out>, start=<optimized out>, count=1)
    at fports.c:597
#3  0x00007fdf01e78d22 in scm_i_read_bytes (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>, 
    dst="#<vu8vector>" = {...}, start=start@entry=0, count=1) at ports.c:1566
#4  0x00007fdf01e7b1c7 in scm_fill_input (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>, 
    minimum_size=1, minimum_size@entry=0, cur_out=cur_out@entry=0x7fde86fcc3e8, 
    avail_out=avail_out@entry=0x7fde86fcc3e0) at ports.c:2693
#5  0x00007fdf01e7b434 in peek_byte_or_eof (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>, 
    buf_out=buf_out@entry=0x7fde86fcc448, cur_out=cur_out@entry=0x7fde86fcc440) at ports.c:1526
#6  0x00007fdf01e7c1b5 in maybe_consume_bom (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>, 
    bom=bom@entry=0x7fdf01ee56ad <scm_utf8_bom> "\357\273\277scm_i_read_bytes", bom_len=bom_len@entry=3)
    at ports.c:2468
#7  0x00007fdf01e7b324 in port_clear_stream_start_for_bom_read (io_mode=BOM_IO_TEXT, 
    port=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>) at ports.c:2504
#8  scm_fill_input (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>, minimum_size=1, 
    minimum_size@entry=0, cur_out=cur_out@entry=0x7fde86fcc4e8, avail_out=avail_out@entry=0x7fde86fcc4e0)
    at ports.c:2648
#9  0x00007fdf01e7b434 in peek_byte_or_eof (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>, 
    buf_out=buf_out@entry=0x7fde86fcc548, cur_out=cur_out@entry=0x7fde86fcc540) at ports.c:1526
#10 0x00007fdf01e7b97c in peek_utf8_codepoint (port=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>, 
    buf_out=buf_out@entry=0x7fde86fcc5a8, cur_out=cur_out@entry=0x7fde86fcc5a0, len_out=len_out@entry=0x7fde86fcc598)
    at ports.c:1790
#11 0x00007fdf01e7bee4 in peek_codepoint (len=0x7fde86fcc598, cur=0x7fde86fcc5a0, buf=0x7fde86fcc5a8, 
    port=<optimized out>) at ports.c:1984
#12 scm_peek_char (port=<optimized out>) at ports.c:2202
#13 0x00007fdef997b1ab in ?? ()
#14 0x0000000000b82ae0 in ?? ()
#15 0x00007fdef997b120 in ?? ()
#16 0x00007fdec4292c48 in ?? ()
#17 0x00007fdf01e57ccc in scm_jit_enter_mcode (thread=0x7fded1366000, mcode=0xb82af0 "\034\234\003") at jit.c:6038
#18 0x00007fdf01eacf3c in vm_regular_engine (thread=0x7fded1366000) at vm-engine.c:360
#19 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#20 0x00007fdf01e1ea0e in scm_call_with_unblocked_asyncs (proc=#<program 7fdef59a0f60>) at async.c:406
--8<---------------cut here---------------end--------------->8---

A plausible scenario is a synchronization issue: ‘cuirass evaluate’ got
all the data it asked for from ‘guix repl’, but it’s still waiting on
read(2) even though there’s nothing left.  (There are two child ‘guix
repl’ processes and both are waiting on read(0, …).)

Thoughts?

Ludo’.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-16  9:13       ` Ludovic Courtès
@ 2022-05-16 10:32         ` Ludovic Courtès
  2022-05-16 12:19           ` Ludovic Courtès
  2022-05-16 12:49         ` Maxim Cournoyer
  1 sibling, 1 reply; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-16 10:32 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 55441, guix-sysadmin, Maxim Cournoyer

Ludovic Courtès <ludo@gnu.org> skribis:

> A plausible scenario is a synchronization issue: ‘cuirass evaluate’ got
> all the data it asked for from ‘guix repl’, but it’s still waiting on
> read(2) even though there’s nothing left.  (There are two child ‘guix
> repl’ processes and both are waiting on read(0, …).)

This reminds me of:

  https://github.com/UMCUGenetics/hpcguix-web/commit/88b07caaf3084f48b181c3f3f84af27dc9b834ec

I’ve added a missing call to ‘close-inferior’.  It’s a good idea, though
I’m not entirely convinced yet it’ll solve the problem:

  https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=f087aaf685dbc7cc18f0254895f4a4b0dfaba631

Ludo’.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-16 10:32         ` Ludovic Courtès
@ 2022-05-16 12:19           ` Ludovic Courtès
  2022-05-16 17:32             ` Maxim Cournoyer
  0 siblings, 1 reply; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-16 12:19 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 55441, guix-sysadmin, Maxim Cournoyer

Ludovic Courtès <ludo@gnu.org> skribis:

> I’ve added a missing call to ‘close-inferior’.  It’s a good idea, though
> I’m not entirely convinced yet it’ll solve the problem:
>
>   https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=f087aaf685dbc7cc18f0254895f4a4b0dfaba631

I tested it like this and it ran to completion without hanging:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~$ guix build cuirass --with-branch=cuirass=master
[…]
/gnu/store/sqsf2q3qf9d485mcw6lm14abwr54na01-cuirass-git.master
ludo@berlin ~$  sudo su - cuirass -s /bin/sh -c "/gnu/store/sqsf2q3qf9d485mcw6lm14abwr54na01-cuirass-git.master/bin/cuirass evaluate dbname=cuirass 323183"
Password:
Computing Guix derivation for 'x86_64-linux'... \
2022-05-16T12:39:18 Registering builds for evaluation 323183.
2022-05-16T12:40:34 Registering builds for evaluation 323183.
GC Warning: Repeated allocation of very large block (appr. size 14385152):
        May lead to memory leak and poor performance
2022-05-16T12:41:25 Registering builds for evaluation 323183.
2022-05-16T12:44:35 Registering builds for evaluation 323183.
--8<---------------cut here---------------end--------------->8---

Should we try and deploy this commit on berlin?

Maxim, how can we proceed?

Ludo’.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-16  9:13       ` Ludovic Courtès
  2022-05-16 10:32         ` Ludovic Courtès
@ 2022-05-16 12:49         ` Maxim Cournoyer
  2022-05-17 12:52           ` Ludovic Courtès
  1 sibling, 1 reply; 27+ messages in thread
From: Maxim Cournoyer @ 2022-05-16 12:49 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 55441, Mathieu Othacehe

Hi Ludo and Mathieu!

Ludovic Courtès <ludo@gnu.org> writes:

> Hi!
>
> Mathieu Othacehe <othacehe@gnu.org> skribis:
>
>> I noticed this error: https://ci.guix.gnu.org/eval/320682/log/raw.
>>
>> The following lines can be caused by having a multi-threaded program:
>> (cuirass scripts evaluate) because of its par-for-each procedure, call
>> "primitive-fork" in open-bidirectional-pipe of (guix inferior).
>>
>> warning: call to primitive-fork while multiple threads are running;
>>          further behavior unspecified.  See "Processes" in the
>>          manual, for more information.
>>
>> Not sure that it could cause the observed hang, Ludo what do you think?
>
> As you write, the warning comes from ‘open-bidirectional-pipe’, added in
> commit bd86bbd300474204878e927f6cd3f0defa1662a5.  I noticed it back then
> but as far as I could see, it was harmless (but definitely suboptimal):
>
>   https://issues.guix.gnu.org/48007#11

Thanks for investigating it!

I'm not yet very knowledgeable about POSIX threads and shortcomings, but
this looks like a dangerous trap to have lying around :-).  Is there an
alternative implementation we could come up with that would avoid the
potential problem (and warnings) entirely?

Otherwise, if we are sure there are no practical problems in a
multi-threaded scenario, I think a comment explaining the reasoning in
the code (of why it's considered safe despite the warnings) and a test
exercising the behavior would be in order to ensure it continues working
as intended in the face of changes.

Thoughts?

Thanks again,

Maxim




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-16 12:19           ` Ludovic Courtès
@ 2022-05-16 17:32             ` Maxim Cournoyer
  2022-05-17  7:45               ` Ludovic Courtès
  0 siblings, 1 reply; 27+ messages in thread
From: Maxim Cournoyer @ 2022-05-16 17:32 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 55441, Mathieu Othacehe

Hello,

Ludovic Courtès <ludo@gnu.org> writes:

> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> I’ve added a missing call to ‘close-inferior’.  It’s a good idea, though
>> I’m not entirely convinced yet it’ll solve the problem:
>>
>>   https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=f087aaf685dbc7cc18f0254895f4a4b0dfaba631
>
> I tested it like this and it ran to completion without hanging:
>
> ludo@berlin ~$ guix build cuirass --with-branch=cuirass=master
> […]
> /gnu/store/sqsf2q3qf9d485mcw6lm14abwr54na01-cuirass-git.master
> ludo@berlin ~$  sudo su - cuirass -s /bin/sh -c "/gnu/store/sqsf2q3qf9d485mcw6lm14abwr54na01-cuirass-git.master/bin/cuirass evaluate dbname=cuirass 323183"
> Password:
> Computing Guix derivation for 'x86_64-linux'... \
> 2022-05-16T12:39:18 Registering builds for evaluation 323183.
> 2022-05-16T12:40:34 Registering builds for evaluation 323183.
> GC Warning: Repeated allocation of very large block (appr. size 14385152):
>         May lead to memory leak and poor performance
> 2022-05-16T12:41:25 Registering builds for evaluation 323183.
> 2022-05-16T12:44:35 Registering builds for evaluation 323183.
>
> Should we try and deploy this commit on berlin?
>
> Maxim, how can we proceed?

Berlin was reconfigured with this commit of Cuirass, and is now running
the derivations with it, but so far still "In progress..." after more
than 100 minutes [0]

[0]  https://ci.guix.gnu.org/eval/325592

Looking in /var/log/cuirass.log, I can see:

--8<---------------cut here---------------start------------->8---
2022-05-16 15:57:41 2022-05-16T15:57:41 next evaluation in 300 seconds
2022-05-16 15:59:02 Uncaught exception in fiber ##f:
2022-05-16 15:59:02 In cuirass/base.scm:
2022-05-16 15:59:02    726:13  3 (_)
2022-05-16 15:59:02 In ice-9/boot-9.scm:
2022-05-16 15:59:02   1752:10  2 (with-exception-handler _ _ #:unwind? _ # _)
2022-05-16 15:59:02   1685:16  1 (raise-exception _ #:continuable? _)
2022-05-16 15:59:02   1683:16  0 (raise-exception _ #:continuable? _)
2022-05-16 15:59:02 ice-9/boot-9.scm:1683:16: In procedure raise-exception:
2022-05-16 15:59:02 ERROR:
2022-05-16 15:59:02   1. &evaluation-error:
2022-05-16 15:59:02       name: "core"
2022-05-16 15:59:02       id: 325375
2022-05-16 16:02:41 2022-05-16T16:02:41 Fetching channels for spec 'core'.
--8<---------------cut here---------------end--------------->8---

But it seems it should only affect another job specification ('core') ?

Before, there was also:

--8<---------------cut here---------------start------------->8---
2022-05-16 15:41:34 Uncaught exception in fiber ##f:
2022-05-16 15:41:34 In cuirass/base.scm:
2022-05-16 15:41:34    726:13  3 (_)
2022-05-16 15:41:34 In ice-9/boot-9.scm:
2022-05-16 15:41:34   1752:10  2 (with-exception-handler _ _ #:unwind? _ # _)
2022-05-16 15:41:34   1685:16  1 (raise-exception _ #:continuable? _)
2022-05-16 15:41:34   1683:16  0 (raise-exception _ #:continuable? _)
2022-05-16 15:41:34 ice-9/boot-9.scm:1683:16: In procedure raise-exception:
2022-05-16 15:41:34 ERROR:
2022-05-16 15:41:34   1. &evaluation-error:
2022-05-16 15:41:34       name: "purge-python2-packages"
2022-05-16 15:41:34       id: 325332
--8<---------------cut here---------------end--------------->8---

and

--8<---------------cut here---------------start------------->8---
2022-05-16 15:05:27 Uncaught exception in fiber ##f:
2022-05-16 15:05:27 In cuirass/base.scm:
2022-05-16 15:05:27    726:13  3 (_)
2022-05-16 15:05:27 In ice-9/boot-9.scm:
2022-05-16 15:05:27   1752:10  2 (with-exception-handler _ _ #:unwind? _ # _)
2022-05-16 15:05:27   1685:16  1 (raise-exception _ #:continuable? _)
2022-05-16 15:05:27   1683:16  0 (raise-exception _ #:continuable? _)
2022-05-16 15:05:27 ice-9/boot-9.scm:1683:16: In procedure raise-exception:
2022-05-16 15:05:27 ERROR:
2022-05-16 15:05:27   1. &evaluation-error:
2022-05-16 15:05:27       name: "master"
2022-05-16 15:05:27       id: 324938
--8<---------------cut here---------------end--------------->8---

and

--8<---------------cut here---------------start------------->8---
2022-05-16 13:02:07 Uncaught exception in fiber ##f:
2022-05-16 13:02:07 In cuirass/base.scm:
2022-05-16 13:02:07    726:13  3 (_)
2022-05-16 13:02:07 In ice-9/boot-9.scm:
2022-05-16 13:02:07   1752:10  2 (with-exception-handler _ _ #:unwind? _ # _)
2022-05-16 13:02:07   1685:16  1 (raise-exception _ #:continuable? _)
2022-05-16 13:02:07   1683:16  0 (raise-exception _ #:continuable? _)
2022-05-16 13:02:07 ice-9/boot-9.scm:1683:16: In procedure raise-exception:
2022-05-16 13:02:07 ERROR:
2022-05-16 13:02:07   1. &evaluation-error:
2022-05-16 13:02:07       name: "guix"
2022-05-16 13:02:07       id: 324937
--8<---------------cut here---------------end--------------->8---

I don't know if these are related or not; probably not, as their
timestamps are older by more than 3 hours, while the last derivations
were started less than 1h30 ago.

Thanks,

Maxim




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-16 17:32             ` Maxim Cournoyer
@ 2022-05-17  7:45               ` Ludovic Courtès
  2022-05-18  7:44                 ` Ludovic Courtès
  0 siblings, 1 reply; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-17  7:45 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: 55441, Mathieu Othacehe

Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> Berlin was reconfigured with this commit of Cuirass, and is now running
> the derivations with it, but so far still "In progress..." after more
> than 100 minutes [0]

Yeah, I’m not sure about the backtrace you report, however there’s again
a bunch of ‘cuirass evaluate’ processes hanging, this time with the main
thread stuck on ‘waitpid’ (presumably from the ‘close-inferior’ call):

--8<---------------cut here---------------start------------->8---
#0  0x00007f0886310f27 in __GI___wait4 (pid=86099, stat_loc=stat_loc@entry=0x7ffea0cb849c, options=options@entry=0, usage=usage@entry=0x0)
    at ../sysdeps/unix/sysv/linux/wait4.c:30
#1  0x00007f0886310ea7 in __GI___waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0x7ffea0cb849c, options=options@entry=0)
    at waitpid.c:38
#2  0x00007f08868ae25e in scm_waitpid (pid=86099, options=<optimized out>) at posix.c:727
#3  0x00007f088689a336 in vm_regular_engine (thread=0x7f08861fad80) at vm-engine.c:972
#4  0x00007f08868a75e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=1) at vm.c:1608
#5  0x00007f088680f457 in scm_primitive_eval (exp=<optimized out>,
    exp@entry=((@ (ice-9 control) %) (begin ((@@ (ice-9 command-line) load/lang) "/gnu/store/z8haznhwck4bjm4gxqy25wvwv4041wvx-cuirass-1.1.0-12.f087aaf/bin/.cuirass-real") (main (command-line)) (quit)))) at eval.c:671
#6  0x00007f08868154b6 in scm_eval (
    exp=((@ (ice-9 control) %) (begin ((@@ (ice-9 command-line) load/lang) "/gnu/store/z8haznhwck4bjm4gxqy25wvwv4041wvx-cuirass-1.1.0-12.f087aaf/bin/.cuirass-real") (main (command-line)) (quit))), module_or_state="#<struct module>" = {...}) at eval.c:705
#7  0x00007f08868793b6 in scm_shell (argc=9, argv=0x7ffea0cb8c98) at script.c:357
--8<---------------cut here---------------end--------------->8---

Process 86099 (the one it’s waiting for) is indeed ‘guix repl’ and it’s
waiting for input in read(0, …).

There’s a second thread stuck in ‘waitpid’:

--8<---------------cut here---------------start------------->8---
(gdb) thread 27
[Switching to thread 27 (LWP 86002)]
#0  0x00007f0886310f27 in __GI___wait4 (pid=86100, stat_loc=stat_loc@entry=0x7f085393c60c, options=options@entry=0, usage=usage@entry=0x0)
    at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
(gdb) bt
#0  0x00007f0886310f27 in __GI___wait4 (pid=86100, stat_loc=stat_loc@entry=0x7f085393c60c, options=options@entry=0, usage=usage@entry=0x0)
    at ../sysdeps/unix/sysv/linux/wait4.c:30
#1  0x00007f0886310ea7 in __GI___waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0x7f085393c60c, options=options@entry=0)
    at waitpid.c:38
#2  0x00007f08868ae25e in scm_waitpid (pid=86100, options=<optimized out>) at posix.c:727
#3  0x00007f088689a336 in vm_regular_engine (thread=0x7f087d54e240) at vm-engine.c:972
#4  0x00007f08868a75e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#5  0x00007f088680ba0e in scm_call_with_unblocked_asyncs (proc=#<program 7f0854a50f40>) at async.c:406
#6  0x00007f088689a336 in vm_regular_engine (thread=0x7f087d54e240) at vm-engine.c:972
#7  0x00007f08868a75e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
--8<---------------cut here---------------end--------------->8---

and then a couple of threads in ‘read’:

--8<---------------cut here---------------start------------->8---
#0  __libc_read (nbytes=1, buf=0x7f085a69ab90, fd=5) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=5, buf=buf@entry=0x7f085a69ab90, nbytes=nbytes@entry=1) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007f08868252e8 in fport_read (port=<optimized out>, dst=<optimized out>, start=<optimized out>, count=1) at fports.c:597
#3  0x00007f0886865d22 in scm_i_read_bytes (port=port@entry=#<port #<port-type file 7f087e821b40> 7f0854af6d20>, dst="#<vu8vector>" = {...}, 
    start=start@entry=0, count=1) at ports.c:1566
#4  0x00007f08868681c7 in scm_fill_input (port=port@entry=#<port #<port-type file 7f087e821b40> 7f0854af6d20>, 
    minimum_size=minimum_size@entry=1, cur_out=cur_out@entry=0x7f085313b5e0, avail_out=avail_out@entry=0x7f085313b588) at ports.c:2693
#5  0x00007f0886868d5c in peek_iconv_codepoint (port=#<port #<port-type file 7f087e821b40> 7f0854af6d20>, buf=buf@entry=0x7f085313b5e8, 
    cur=cur@entry=0x7f085313b5e0, len=len@entry=0x7f085313b5d8) at ports.c:1944
#6  0x00007f0886868e4a in peek_codepoint (len=0x7f085313b5d8, cur=0x7f085313b5e0, buf=0x7f085313b5e8, port=<optimized out>) at ports.c:1988
#7  scm_peek_char (port=<optimized out>) at ports.c:2202
#8  0x00007f087e62582b in ?? ()
#9  0x0000000000857760 in ?? ()
#10 0x00007f087e6257a0 in ?? ()
#11 0x00007f087d850c48 in ?? ()
#12 0x00007f0886844ccc in scm_jit_enter_mcode (thread=0x7f087d54e000, mcode=0x857770 "\034\234\003") at jit.c:6038
#13 0x00007f0886899f3c in vm_regular_engine (thread=0x7f087d54e000) at vm-engine.c:360
--8<---------------cut here---------------end--------------->8---

Normally we call ‘waitpid’ once the pipe has been closed:

--8<---------------cut here---------------start------------->8---
(define* (open-inferior directory
                        #:key (command "bin/guix")
                        (error-port (%make-void-port "w")))
  "Open the inferior Guix in DIRECTORY, running 'DIRECTORY/COMMAND repl' or
equivalent.  Return #f if the inferior could not be launched."
  (let ((pipe pid (inferior-pipe directory command error-port)))
    (port->inferior pipe
                    (lambda (port)
                      (close-port port)
                      (waitpid pid)))))   ;<----- here
--8<---------------cut here---------------end--------------->8---

… and when the pipe is closed, the child ‘guix repl’ process gets EOF
and exits.

So I’m not sure why the ‘guix repl’ process would stick around.

Thoughts?

Ludo’.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-16 12:49         ` Maxim Cournoyer
@ 2022-05-17 12:52           ` Ludovic Courtès
  0 siblings, 0 replies; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-17 12:52 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: 55441, Mathieu Othacehe

Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> I'm not yet very knowledgeable about POSIX threads and shortcomings, but
> this looks like a dangerous trap to have lying around :-).  Is there an
> alternative implementation we could come up with that would avoid the
> potential problem (and warnings) entirely?

It *is* a dangerous trap, mitigated by the fact that our code execs soon
after fork, but that’s not great (“A fork() on the Road” explains the
many shortcomings of ‘fork’, including how it interacts with
multi-threaded code.)

As discussed on IRC yesterday, the longer-term approach is to augment
Guile is a posix_spawn-like primitive implemented in C.  If it’s in C,
we can make it do the right thing safely, as is done with ‘open-pipe’.

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-17  7:45               ` Ludovic Courtès
@ 2022-05-18  7:44                 ` Ludovic Courtès
  2022-05-20 18:32                   ` Ludovic Courtès
  0 siblings, 1 reply; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-18  7:44 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: 55441, Mathieu Othacehe

Hi,

Ludovic Courtès <ludo@gnu.org> skribis:

> Hi,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> Berlin was reconfigured with this commit of Cuirass, and is now running
>> the derivations with it, but so far still "In progress..." after more
>> than 100 minutes [0]
>
> Yeah, I’m not sure about the backtrace you report, however there’s again
> a bunch of ‘cuirass evaluate’ processes hanging, this time with the main
> thread stuck on ‘waitpid’ (presumably from the ‘close-inferior’ call):

An update: I ran this by hand:

--8<---------------cut here---------------start------------->8---
$ sudo su - cuirass -s /bin/sh -c "/gnu/store/z8haznhwck4bjm4gxqy25wvwv4041wvx-cuirass-1.1.0-12.f087aaf/bin/cuirass evaluate dbname=cuirass 326988"
Computing Guix derivation for 'x86_64-linux'... -^-
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
warning: call to primitive-fork while multiple threads are running;
         furthwarning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
warning: call to primitive-fork while multiple threads are running;
         further behavior unspecified.  See "Processes" in the
         manual, for more information.
--8<---------------cut here---------------end--------------->8---

… and attached ‘strace’, which shows things are working as intended:

--8<---------------cut here---------------start------------->8---
read(228, "r6x878i3fyr12lf9v496cl9x5jh6-gzip-1.10.drv\" \"/gnu/store/6280r7vqfp5qkv6cdib02203xw1sk7h3-binutils-cr"..., 4096) = 4096
read(228, "pper-0.drv\" \"/gnu/store/45wsr6x878i3fyr12lf9v496cl9x5jh6-gzip-1.10.drv\" \"/gnu/store/6280r7vqfp5qkv6c"..., 4096) = 4096
read(228, "8qy2pnpn2030xxfmm56h71-gettext-0.21-doc\") (\"out\" . \"/gnu/store/gmrqdvy9v42hfbs6b7vaxfrwf8b5qhqy-gett"..., 4096) = 4096
read(228, "\" \"/gnu/store/k5889v7ms3f5x1rjr3php71k4743fn19-bash-minimal-5.1.8.drv\" \"/gnu/store/lbyw1h3xz33qnb08j"..., 4096) = 4096
read(228, "x2g7ysm16xqnrpbkqdc4hhni1r42dgg-patch-2.7.6.drv\" \"/gnu/store/3fy0f7gy85ddy6rpa4mmhjygns8qzk03-findut"..., 4096) = 4096
read(228, "08s1nz9bpv6k6a56idv6l7r2zjqphszl-file-5.39.drv\" \"/gnu/store/1x2g7ysm16xqnrpbkqdc4hhni1r42dgg-patch-2"..., 4096) = 4096
read(228, "tore/azdhavgwpjc527n81iqwrf91v8sv3x23-help2man-1.48.5.drv\" \"/gnu/store/b1pi5ssin56s4mqq6964hsh2bdkyd"..., 4096) = 4096
read(228, ") (#:nix-name . \"guix-1.3.0-25.c1719a0\") (#:system . \"x86_64-linux\") (#:max-silent-time . 3600) (#:t"..., 4096) = 121
close(228)                              = 0
wait4(83882,    # hangs forever
--8<---------------cut here---------------end--------------->8---

The question to me becomes: why is ‘guix repl’ not getting EOF from its
read(0, …) call?

FWIW, here’s an estimate of the amount of data transferred from the
inferior to the main process:

--8<---------------cut here---------------start------------->8---
$ grep 'read(228, .* = 4096$' <log.evaluate|wc -l
25807
--8<---------------cut here---------------end--------------->8---

I’m trying to reproduce that synthetically.

Ludo’.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-18  7:44                 ` Ludovic Courtès
@ 2022-05-20 18:32                   ` Ludovic Courtès
  2022-05-24 21:02                     ` Ludovic Courtès
  0 siblings, 1 reply; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-20 18:32 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: 55441, Mathieu Othacehe

Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

> wait4(83882,    # hangs forever
>
>
> The question to me becomes: why is ‘guix repl’ not getting EOF from its
> read(0, …) call?

The answer was that another child process (an inferior) was holding a
copy of the socketpair file descriptor connected to that ‘guix repl’
standard input.

Fixed in Guix commit a4994d739306abcf3f36706012fb88b35a970e6b with a
test that reproduces the issue.

Commit d02b7abe24fac84ef1fb1880f51d56fc9fb6cfef updates the ‘guix’
package so we should be able to reconfigure berlin now and hopefully
(crossing fingers!) be done with it.

Ludo’.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-20 18:32                   ` Ludovic Courtès
@ 2022-05-24 21:02                     ` Ludovic Courtès
  2022-05-25 16:32                       ` Ludovic Courtès
  0 siblings, 1 reply; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-24 21:02 UTC (permalink / raw)
  To: Maxim Cournoyer; +Cc: 55441, Mathieu Othacehe

Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

> Fixed in Guix commit a4994d739306abcf3f36706012fb88b35a970e6b with a
> test that reproduces the issue.
>
> Commit d02b7abe24fac84ef1fb1880f51d56fc9fb6cfef updates the ‘guix’
> package so we should be able to reconfigure berlin now and hopefully
> (crossing fingers!) be done with it.

An update: Cuirass is now up-to-date on berlin.guix, built from Guix
commit adf5ae5a412ed13302186dd4ce8e2df783d4515d.

Unfortunately, while evaluations now run to completion, child processes
of ‘cuirass evaluate’ stick around at the end:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x7f5b1d054f08) at ../sysdeps/nptl/futex-internal.h:146
#1  __lll_lock_wait (futex=futex@entry=0x7f5b1d054f08, private=0) at lowlevellock.c:52
#2  0x00007f5b1d873ef3 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f5b1d054f08) at ../nptl/pthread_mutex_lock.c:80
#3  0x00007f5b1d995303 in scm_c_weak_set_remove_x (pred=<optimized out>, closure=0x7f5b13dd8d00, raw_hash=1824276156261873434, set=#<weak-set 7f5b156772f0>) at weak-set.c:794
#4  scm_weak_set_remove_x (obj=#<port #<port-type file 7f5b1567ab40> 7f5b13dd8d00>, set=#<weak-set 7f5b156772f0>) at weak-set.c:817
#5  close_port (explicit=<optimized out>, port=#<port #<port-type file 7f5b1567ab40> 7f5b13dd8d00>) at ports.c:891
#6  close_port (port=#<port #<port-type file 7f5b1567ab40> 7f5b13dd8d00>, explicit=<optimized out>) at ports.c:874
#7  0x00007f5af3a7df82 in ?? ()
#8  0x0000000000dbd860 in ?? ()
#9  0x00007f5af3a7df60 in ?? ()
#10 0x0000000000db82b8 in ?? ()
#11 0x00007f5b1d972ccc in scm_jit_enter_mcode (thread=0x7f5b157bf240, mcode=0xdbd86c "\034\217\003") at jit.c:6038
#12 0x00007f5b1d9c7f3c in vm_regular_engine (thread=0x7f5b157bf240) at vm-engine.c:360
#13 0x00007f5b1d9d55e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#14 0x00007f5b1d939a0e in scm_call_with_unblocked_asyncs (proc=#<program 7f5aebcd7f40>) at async.c:406
#15 0x00007f5b1d9c8336 in vm_regular_engine (thread=0x7f5b157bf240) at vm-engine.c:972
#16 0x00007f5b1d9d55e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#17 0x00007f5b1d9c4be6 in really_launch (d=0x7f5aebccac80) at threads.c:778
#18 0x00007f5b1d93b85a in c_body (d=0x7f5aea691d80) at continuations.c:430
#19 0x00007f5aeeb118c2 in ?? ()
#20 0x00007f5b1553d7e0 in ?? ()
#21 0x00007f5b138a7370 in ?? ()
#22 0x0000000000000048 in ?? ()
#23 0x00007f5b1d972ccc in scm_jit_enter_mcode (thread=0x7f5b157bf240, mcode=0xdbc874 "\034<\003") at jit.c:6038
#24 0x00007f5b1d9c7f3c in vm_regular_engine (thread=0x7f5b157bf240) at vm-engine.c:360
#25 0x00007f5b1d9d55e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=2) at vm.c:1608
#26 0x00007f5b1d93d09a in scm_call_2 (proc=<optimized out>, arg1=<optimized out>, arg2=<optimized out>) at eval.c:503
#27 0x00007f5b1d9f3752 in scm_c_with_exception_handler.constprop.0 (type=#t, handler_data=handler_data@entry=0x7f5aea691d10, thunk_data=thunk_data@entry=0x7f5aea691d10,
    thunk=<optimized out>, handler=<optimized out>) at exceptions.c:170
#28 0x00007f5b1d9c588f in scm_c_catch (tag=<optimized out>, body=<optimized out>, body_data=<optimized out>, handler=<optimized out>, handler_data=<optimized out>,
    pre_unwind_handler=<optimized out>, pre_unwind_handler_data=0x7f5b156b2040) at throw.c:168
#29 0x00007f5b1d93de66 in scm_i_with_continuation_barrier (pre_unwind_handler=0x7f5b1d93db80 <pre_unwind_handler>, pre_unwind_handler_data=0x7f5b156b2040, handler_data=0x7f5aea691d80,
    handler=0x7f5b1d9448b0 <c_handler>, body_data=0x7f5aea691d80, body=0x7f5b1d93b850 <c_body>) at continuations.c:368
#30 scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:464
#31 0x00007f5b1d9c4b39 in with_guile (base=0x7f5aea691e08, data=0x7f5aea691e30) at threads.c:645
#32 0x00007f5b1d89b0ba in GC_call_with_stack_base () from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#33 0x00007f5b1d9bd16d in scm_i_with_guile (dynamic_state=<optimized out>, data=0x7f5aebccac80, func=0x7f5b1d9c4b70 <really_launch>) at threads.c:688
#34 launch_thread (d=0x7f5aebccac80) at threads.c:787
#35 0x00007f5b1d871d7e in start_thread (arg=0x7f5aea692640) at pthread_create.c:473
#36 0x00007f5b1d46feff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) info threads
  Id   Target Id             Frame
* 1    process 53801 "guile" futex_wait (private=0, expected=2, futex_word=0x7f5b1d054f08) at ../sysdeps/nptl/futex-internal.h:146
--8<---------------cut here---------------end--------------->8---

Notice there’s a single thread: it very much looks like the random
results one gets when forking a multithreaded process (in this case,
this one thread is a finalization thread, except it’s running in a
process that doesn’t actually have the other Guile threads).  The
fork+threads problem is already manifesting, after all.

I’ll try and come up with a solution to that, if nobody beats me at it.
What’s annoying is that it’s not easy to test: the problem doesn’t
manifest on my 4-core laptop, but it does on the 96-core berlin.

To be continued…

Ludo’.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-24 21:02                     ` Ludovic Courtès
@ 2022-05-25 16:32                       ` Ludovic Courtès
  2022-05-25 18:21                         ` Ludovic Courtès
  0 siblings, 1 reply; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-25 16:32 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 55441, Maxim Cournoyer

Ludovic Courtès <ludo@gnu.org> skribis:

> Notice there’s a single thread: it very much looks like the random
> results one gets when forking a multithreaded process (in this case,
> this one thread is a finalization thread, except it’s running in a
> process that doesn’t actually have the other Guile threads).  The
> fork+threads problem is already manifesting, after all.

An update: The ‘with-database’ used by (cuirass scripts evaluate),
creates threads unconditionally and leaves them around when its body has
exited.  So there’s currently no way in (cuirass scripts evaluate) to
arrange so that ‘open-inferior’ is called in a single-threaded context
(I tried that in ‘wip-bug-55441’).

Bummer.  Looks like we’re left with unpleasant solutions, such as
writing bindings for ‘posix_spawn’ and using that in (guix inferior).

Mathieu, any ideas?

Ludo’.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-25 16:32                       ` Ludovic Courtès
@ 2022-05-25 18:21                         ` Ludovic Courtès
  2022-05-25 23:13                           ` Christopher Baines
  2022-05-26 12:29                           ` Ludovic Courtès
  0 siblings, 2 replies; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-25 18:21 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 55441, Maxim Cournoyer

[-- Attachment #1: Type: text/plain, Size: 400 bytes --]

For now, I’m going to go with the solution below, which is to use an
older revision of Guix (one where ‘open-inferior’ was using
‘open-pipe*’) as the dependency of the ‘cuirass’ package.

I’m running “cuirass evaluate” manually on berlin to make sure it
actually works.  If everything goes well, I’ll push it and reconfigure
berlin later today or tomorrow.

Ludo’.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: text/x-patch, Size: 2483 bytes --]

diff --git a/gnu/packages/ci.scm b/gnu/packages/ci.scm
index ab421fe870..9cd2f6884e 100644
--- a/gnu/packages/ci.scm
+++ b/gnu/packages/ci.scm
@@ -164,7 +164,14 @@ (define-public cuirass
              ;; FIXME: this is propagated by "guile-git", but it needs to be among
              ;; the inputs to add it to GUILE_LOAD_PATH.
              guile-bytestructures
-             guix))
+
+             ;; FIXME: The 'cuirass evaluate' command is multithreaded, but it
+             ;; uses 'open-inferior', which calls 'primitive-fork', thus
+             ;; potentially creating child processes that deadlock.  To work
+             ;; around that, use the last revision of Guix where
+             ;; 'open-inferior' was using 'open-pipe*' rather than
+             ;; 'primitive-fork'.  See <https://issues.guix.gnu.org/55441>.
+             guix-for-cuirass))
       (native-inputs
        (list autoconf automake pkg-config texinfo ephemeralpg))
       (native-search-paths
diff --git a/gnu/packages/package-management.scm b/gnu/packages/package-management.scm
index 619458a0d3..16ad4fe9db 100644
--- a/gnu/packages/package-management.scm
+++ b/gnu/packages/package-management.scm
@@ -489,6 +489,30 @@ (define code
       (license license:gpl3+)
       (properties '((ftp-server . "alpha.gnu.org"))))))
 
+(define-public guix-for-cuirass
+  ;; Known-good revision before commit
+  ;; bd86bbd300474204878e927f6cd3f0defa1662a5, which introduced
+  ;; 'primitive-fork' in 'open-inferior'.
+  (let ((version "1.3.0")
+        (commit "a27e47f9d1e22dc32bb250cfeef88cfacb930e23")
+        (revision 23))
+    (package
+      (inherit guix)
+      (version (string-append version "-"
+                              (number->string revision)
+                              "." (string-take commit 7)))
+      (source (origin
+                (method git-fetch)
+                (uri (git-reference
+                      (url "https://git.savannah.gnu.org/git/guix.git")
+                      (commit commit)))
+                (sha256
+                 (base32
+                  "12jmvagbw05hmmlrb82i0qazhlv7mcfnl4dmknwx3a9hd760g9y1"))
+                (file-name (string-append "guix-" version "-checkout"))))
+      (properties `((hidden? . #t)
+                    ,@(package-properties guix))))))
+
 (define-public guix-daemon
   ;; This package is for internal consumption: it allows us to quickly build
   ;; the 'guix-daemon' program and use that in (guix self), used by 'guix

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-25 18:21                         ` Ludovic Courtès
@ 2022-05-25 23:13                           ` Christopher Baines
  2022-05-26  9:44                             ` Ludovic Courtès
  2022-05-26 12:29                           ` Ludovic Courtès
  1 sibling, 1 reply; 27+ messages in thread
From: Christopher Baines @ 2022-05-25 23:13 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 55441

[-- Attachment #1: Type: text/plain, Size: 1379 bytes --]


Ludovic Courtès <ludo@gnu.org> writes:

> For now, I’m going to go with the solution below, which is to use an
> older revision of Guix (one where ‘open-inferior’ was using
> ‘open-pipe*’) as the dependency of the ‘cuirass’ package.
>
> I’m running “cuirass evaluate” manually on berlin to make sure it
> actually works.  If everything goes well, I’ll push it and reconfigure
> berlin later today or tomorrow.

To put in an email something I put on IRC earlier.

Maybe the store connection caching could be optional when calling
inferior-eval-with-store, and that could also switch between using
open-pipe* and primitive-fork for starting the inferior process.

I'm guessing the use of primitive-fork for starting the inferior process
is causing problems with Cuirass in some cases, and it's possible that
it'll affect the data service in a similar way as well.

I don't think the connection caching actually benefits Cuirass though,
since it only calls inferior-eval-with-store once per inferior.

Additionally, on the data service side, the caching functionality is
actually undesirable as it leads to the inferior process running out of
memory, so currently the cache is manually cleared in various places
[1].

1: http://git.savannah.gnu.org/cgit/guix/data-service.git/commit/?id=ff116d5e6437ffb916aa4bc5d1458a142297a900

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 987 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-25 23:13                           ` Christopher Baines
@ 2022-05-26  9:44                             ` Ludovic Courtès
  2022-05-26 20:49                               ` Josselin Poiret via Bug reports for GNU Guix
  0 siblings, 1 reply; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-26  9:44 UTC (permalink / raw)
  To: Christopher Baines; +Cc: 55441

Hi,

Christopher Baines <mail@cbaines.net> skribis:

> Maybe the store connection caching could be optional when calling
> inferior-eval-with-store, and that could also switch between using
> open-pipe* and primitive-fork for starting the inferior process.

I don’t think that’s desirable as an interface.  To me, the proper fix
will be to use an interface like ‘posix_spawn’ so we can have both
correctness and speed, rather than let users choose.

Ludo’.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-25 18:21                         ` Ludovic Courtès
  2022-05-25 23:13                           ` Christopher Baines
@ 2022-05-26 12:29                           ` Ludovic Courtès
  1 sibling, 0 replies; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-26 12:29 UTC (permalink / raw)
  To: Mathieu Othacehe; +Cc: 55441, guix-sysadmin, Maxim Cournoyer

Ludovic Courtès <ludo@gnu.org> skribis:

> For now, I’m going to go with the solution below, which is to use an
> older revision of Guix (one where ‘open-inferior’ was using
> ‘open-pipe*’) as the dependency of the ‘cuirass’ package.
>
> I’m running “cuirass evaluate” manually on berlin to make sure it
> actually works.  If everything goes well, I’ll push it and reconfigure
> berlin later today or tomorrow.

Pushed commit as dcb7ce1eb6911f9d503e7cd2bfe380058cee956b.

I reconfigured berlin a couple of hours ago:

--8<---------------cut here---------------start------------->8---
ludo@berlin ~/src/maintenance/hydra$ guix system describe
Generation 576  May 26 2022 12:50:31    (current)
  file name: /var/guix/profiles/system-576-link
  canonical file name: /gnu/store/dd4p4cw67p5wbwwv7szgz0zxsvkq78zb-system
  label: GNU with Linux-Libre 5.17.9
  bootloader: grub
  root device: label: "my-root"
  kernel: /gnu/store/si7fz3pxwq5aiqpdwkw0qs4jmsqh89n5-linux-libre-5.17.9/bzImage
  channels:
    guix:
      repository URL: https://git.savannah.gnu.org/git/guix.git
      branch: master
      commit: 2262ee2c63fd92a17836ff7450623db76ed39d92
  configuration file: /gnu/store/sd59z8h741jilaa3cdf16441f84xghil-configuration.scm
ludo@berlin ~/src/maintenance/hydra$ git log |head -5
commit 07f587cb266f284dee6ac2e744161973d3512b04
Author: Ludovic Courtès <ludo@gnu.org>
Date:   Tue May 24 14:57:36 2022 +0200

    hydra: Add missing service type 'description' field.
--8<---------------cut here---------------end--------------->8---

There’s been a couple of evaluations (346596 and 346651 in the ‘master’
jobset) and so far things seem to work smoothly.

Let’s keep monitoring.

Ludo’.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-26  9:44                             ` Ludovic Courtès
@ 2022-05-26 20:49                               ` Josselin Poiret via Bug reports for GNU Guix
  2022-05-26 20:50                                 ` bug#55441: [PATCH] guix: inferior: Make open-bidirectional-pipe use piped-process Josselin Poiret via Bug reports for GNU Guix
                                                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Josselin Poiret via Bug reports for GNU Guix @ 2022-05-26 20:49 UTC (permalink / raw)
  To: Ludovic Courtès, Christopher Baines; +Cc: 55441

Hello everyone,

With all the recent platform changes, I was looking at `make
cuirass-jobs` for the first time and it left some deadlocked inferiors
as well, so I figured I could try my own solution, since I've already
dabbled in Guile POSIX internals!  Here's a patch that rewrites
open-bidirectional-pipe with a more thread-safe alternative,
`piped-process`, but this also requires some Guile patches, that I will
post as well.  With these, `make cuirass-jobs` doesn't hang at all
anymore for me.

Let me know what you think of these

Best,
-- 
Josselin Poiret




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [PATCH] guix: inferior: Make open-bidirectional-pipe use piped-process.
  2022-05-26 20:49                               ` Josselin Poiret via Bug reports for GNU Guix
@ 2022-05-26 20:50                                 ` Josselin Poiret via Bug reports for GNU Guix
  2022-05-26 21:02                                   ` Maxime Devos
  2022-05-28 17:15                                   ` Ludovic Courtès
  2022-05-26 20:50                                 ` bug#55441: [PATCH 1/2] Fix child spawning closing standard fds prematurely Josselin Poiret via Bug reports for GNU Guix
                                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 27+ messages in thread
From: Josselin Poiret via Bug reports for GNU Guix @ 2022-05-26 20:50 UTC (permalink / raw)
  To: Ludovic Courtès, Christopher Baines; +Cc: 55441, Josselin Poiret

* guix/inferior.scm: Load up scm_init_popen extension to be able to
use piped-process.
(open-bidirectional-pipe): Remove custom fork+exec code and use
piped-process instead.
---
 guix/inferior.scm | 39 ++++++++++++++++++---------------------
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/guix/inferior.scm b/guix/inferior.scm
index 54200b75e4..a4a4c1d40e 100644
--- a/guix/inferior.scm
+++ b/guix/inferior.scm
@@ -134,6 +134,10 @@ (define (write-inferior inferior port)
 
 (set-record-type-printer! <inferior> write-inferior)
 
+(eval-when (expand load eval)
+  (load-extension (string-append "libguile-" (effective-version))
+                  "scm_init_popen"))
+
 (define (open-bidirectional-pipe command . args)
   "Open a bidirectional pipe to COMMAND invoked with ARGS and return it, as a
 regular file port (socket).
@@ -147,27 +151,20 @@ (define (open-bidirectional-pipe command . args)
   ;; the REPL process wouldn't get EOF on standard input.
   (match (socketpair AF_UNIX (logior SOCK_STREAM SOCK_CLOEXEC) 0)
     ((parent . child)
-     (match (primitive-fork)
-       (0
-        (dynamic-wind
-          (lambda ()
-            #t)
-          (lambda ()
-            (close-port parent)
-            (close-fdes 0)
-            (close-fdes 1)
-            (dup2 (fileno child) 0)
-            (dup2 (fileno child) 1)
-            ;; Mimic 'open-pipe*'.
-            (unless (file-port? (current-error-port))
-              (close-fdes 2)
-              (dup2 (open-fdes "/dev/null" O_WRONLY) 2))
-            (apply execlp command command args))
-          (lambda ()
-            (primitive-_exit 127))))
-       (pid
-        (close-port child)
-        (values parent pid))))))
+     (let* ((proc
+             (lambda ()
+               (piped-process command args
+                              ;; Use port->fdes on child to increase the
+                              ;; revealed count, so that the fd does not get
+                              ;; closed again when child gets gc'd, as
+                              ;; piped-process already closes it.
+                              (cons (fileno parent) (port->fdes child))
+                              (cons (fileno child) (fileno parent)))))
+            (pid
+             (if (file-port? (current-error-port))
+                 (proc)
+                 (with-error-to-file "/dev/null" proc))))
+       (values parent pid)))))
 
 (define* (inferior-pipe directory command error-port)
   "Return two values: an input/output pipe on the Guix instance in DIRECTORY
-- 
2.36.0





^ permalink raw reply related	[flat|nested] 27+ messages in thread

* bug#55441: [PATCH 1/2] Fix child spawning closing standard fds prematurely
  2022-05-26 20:49                               ` Josselin Poiret via Bug reports for GNU Guix
  2022-05-26 20:50                                 ` bug#55441: [PATCH] guix: inferior: Make open-bidirectional-pipe use piped-process Josselin Poiret via Bug reports for GNU Guix
@ 2022-05-26 20:50                                 ` Josselin Poiret via Bug reports for GNU Guix
  2022-05-26 20:55                                   ` Maxime Devos
  2022-05-26 20:50                                 ` bug#55441: [PATCH 2/2] Improve thread safety of piped-process Josselin Poiret via Bug reports for GNU Guix
  2022-05-26 20:58                                 ` bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections Maxime Devos
  3 siblings, 1 reply; 27+ messages in thread
From: Josselin Poiret via Bug reports for GNU Guix @ 2022-05-26 20:50 UTC (permalink / raw)
  To: Ludovic Courtès, Christopher Baines; +Cc: 55441, Josselin Poiret

* libguile/posix.c (renumber_file_descriptor): Refactor it as
dup_handle_error.
(dup_handle_error, dup2_handle_error): New functions that wrap around
dup and dup2 by retrying on EINTR or EBUSY, as well as erroring out on
other errors.
(start_child): Close standard file descriptors only
after all of them have been dup2'd.
---
 libguile/posix.c | 82 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 53 insertions(+), 29 deletions(-)

diff --git a/libguile/posix.c b/libguile/posix.c
index 3ab12b99e..dc3080b3c 100644
--- a/libguile/posix.c
+++ b/libguile/posix.c
@@ -1280,14 +1280,14 @@ SCM_DEFINE (scm_fork, "primitive-fork", 0, 0, 0,
 #endif /* HAVE_FORK */
 
 #ifdef HAVE_FORK
-/* 'renumber_file_descriptor' is a helper function for 'start_child'
-   below, and is specialized for that particular environment where it
-   doesn't make sense to report errors via exceptions.  It uses dup(2)
-   to duplicate the file descriptor FD, closes the original FD, and
-   returns the new descriptor.  If dup(2) fails, print an error message
-   to ERR and abort.  */
+/* 'dup_handle_error' is a helper function for 'start_child' below, and
+   is specialized for that particular environment where it doesn't make
+   sense to report errors via exceptions.  It uses dup(2) to duplicate
+   the file descriptor FD, does *not* close the original FD, and returns
+   the new descriptor.  If dup(2) fails, print an error message to ERR
+   and abort.  */
 static int
-renumber_file_descriptor (int fd, int err)
+dup_handle_error (int fd, int err)
 {
   int new_fd;
 
@@ -1304,7 +1304,33 @@ renumber_file_descriptor (int fd, int err)
       _exit (127);  /* Use exit status 127, as with other exec errors. */
     }
 
-  close (fd);
+  return new_fd;
+}
+
+/* 'dup2_handle_error' is a helper function for 'start_child' below, and
+   is specialized for that particular environment where it doesn't make
+   sense to report errors via exceptions.  It uses dup2(2) to duplicate
+   the file descriptor FD, does *not* close the original FD, and returns
+   the new descriptor.  If dup2(2) fails, print an error message to ERR
+   and abort.  */
+static int
+dup2_handle_error (int fd, int to, int err)
+{
+  int new_fd;
+
+  do
+    new_fd = dup2 (fd, to);
+  while (new_fd == -1 && (errno == EINTR || errno == EBUSY));
+
+  if (new_fd == -1)
+    {
+      /* At this point we are in the child process before exec.  We
+         cannot safely raise an exception in this environment.  */
+      const char *msg = strerror (errno);
+      fprintf (fdopen (err, "a"), "start_child: dup failed: %s\n", msg);
+      _exit (127);  /* Use exit status 127, as with other exec errors. */
+    }
+
   return new_fd;
 }
 #endif /* HAVE_FORK */
@@ -1357,27 +1383,25 @@ start_child (const char *exec_file, char **exec_argv,
   if (err == -1)
     err = open ("/dev/null", O_WRONLY);
 
-  if (in > 0)
-    {
-      if (out == 0)
-        out = renumber_file_descriptor (out, err);
-      if (err == 0)
-        err = renumber_file_descriptor (err, err);
-      do dup2 (in, 0); while (errno == EINTR);
-      close (in);
-    }
-  if (out > 1)
-    {
-      if (err == 1)
-        err = renumber_file_descriptor (err, err);
-      do dup2 (out, 1); while (errno == EINTR);
-      close (out);
-    }
-  if (err > 2)
-    {
-      do dup2 (err, 2); while (errno == EINTR);
-      close (err);
-    }
+  /* Dup each non-yet-dup2'd fd that's in the way to the next available fd,
+     so that we can safely dup2 to 0/1/2 without potentially overwriting
+     in/out/err.  Note that dup2 doesn't do anything if its arguments are
+     equal. */
+  if (out == 0)
+    out = dup_handle_error (out, err);
+  if (err == 0)
+    err = dup_handle_error (err, err);
+  dup2_handle_error (in, 0, err);
+
+  if (err == 1)
+    err = dup_handle_error (err, err);
+  dup2_handle_error (out, 1, err);
+
+  dup2_handle_error (err, 2, err);
+
+  if (in > 2) close (in);
+  if (out > 2) close (out);
+  if (err > 2) close (err);
 
   execvp (exec_file, exec_argv);
 
-- 
2.36.0





^ permalink raw reply related	[flat|nested] 27+ messages in thread

* bug#55441: [PATCH 2/2] Improve thread safety of piped-process.
  2022-05-26 20:49                               ` Josselin Poiret via Bug reports for GNU Guix
  2022-05-26 20:50                                 ` bug#55441: [PATCH] guix: inferior: Make open-bidirectional-pipe use piped-process Josselin Poiret via Bug reports for GNU Guix
  2022-05-26 20:50                                 ` bug#55441: [PATCH 1/2] Fix child spawning closing standard fds prematurely Josselin Poiret via Bug reports for GNU Guix
@ 2022-05-26 20:50                                 ` Josselin Poiret via Bug reports for GNU Guix
  2022-05-26 20:58                                 ` bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections Maxime Devos
  3 siblings, 0 replies; 27+ messages in thread
From: Josselin Poiret via Bug reports for GNU Guix @ 2022-05-26 20:50 UTC (permalink / raw)
  To: Ludovic Courtès, Christopher Baines; +Cc: 55441, Josselin Poiret

* libguile/posix.c (scm_piped_process): Avoid double close.
---
 libguile/posix.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libguile/posix.c b/libguile/posix.c
index dc3080b3c..e134408e3 100644
--- a/libguile/posix.c
+++ b/libguile/posix.c
@@ -1486,7 +1486,7 @@ scm_piped_process (SCM prog, SCM args, SCM from, SCM to)
       SCM_SYSERROR;
     }
 
-  if (reading)
+  if (reading && !(writing && c2p[1] == p2c[0]))
     close (c2p[1]);
   if (writing)
     close (p2c[0]);
-- 
2.36.0





^ permalink raw reply related	[flat|nested] 27+ messages in thread

* bug#55441: [PATCH 1/2] Fix child spawning closing standard fds prematurely
  2022-05-26 20:50                                 ` bug#55441: [PATCH 1/2] Fix child spawning closing standard fds prematurely Josselin Poiret via Bug reports for GNU Guix
@ 2022-05-26 20:55                                   ` Maxime Devos
  0 siblings, 0 replies; 27+ messages in thread
From: Maxime Devos @ 2022-05-26 20:55 UTC (permalink / raw)
  To: Josselin Poiret, Ludovic Courtès, Christopher Baines; +Cc: 55441

[-- Attachment #1: Type: text/plain, Size: 345 bytes --]

Josselin Poiret via Bug reports for GNU Guix schreef op do 26-05-2022
om 22:50 [+0200]:
> +      fprintf (fdopen (err, "a"), "start_child: dup failed: %s\n", msg);

fdopen can fail due to out-of-memory problems, so you'll have to check
the return value, or use dprintf instead which accepts a file
descriptor.

Greetings,
Maxime.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 260 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-26 20:49                               ` Josselin Poiret via Bug reports for GNU Guix
                                                   ` (2 preceding siblings ...)
  2022-05-26 20:50                                 ` bug#55441: [PATCH 2/2] Improve thread safety of piped-process Josselin Poiret via Bug reports for GNU Guix
@ 2022-05-26 20:58                                 ` Maxime Devos
  2022-05-28 14:02                                   ` Josselin Poiret via Bug reports for GNU Guix
  3 siblings, 1 reply; 27+ messages in thread
From: Maxime Devos @ 2022-05-26 20:58 UTC (permalink / raw)
  To: Josselin Poiret, Ludovic Courtès, Christopher Baines; +Cc: 55441

[-- Attachment #1: Type: text/plain, Size: 813 bytes --]

Josselin Poiret via Bug reports for GNU Guix schreef op do 26-05-2022
om 22:49 [+0200]:
> Hello everyone,
> 
> With all the recent platform changes, I was looking at `make
> cuirass-jobs` for the first time and it left some deadlocked inferiors
> as well, so I figured I could try my own solution, since I've already
> dabbled in Guile POSIX internals!  Here's a patch that rewrites
> open-bidirectional-pipe with a more thread-safe alternative,
> `piped-process`, but this also requires some Guile patches, that I will
> post as well.  With these, `make cuirass-jobs` doesn't hang at all
> anymore for me.
> 
> Let me know what you think of these
> 
> Best,

The process opening code looks fragile, so I think some tests are
required to avoid causing regressions later.

Greetings,
Maxime.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 260 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [PATCH] guix: inferior: Make open-bidirectional-pipe use piped-process.
  2022-05-26 20:50                                 ` bug#55441: [PATCH] guix: inferior: Make open-bidirectional-pipe use piped-process Josselin Poiret via Bug reports for GNU Guix
@ 2022-05-26 21:02                                   ` Maxime Devos
  2022-05-28 17:15                                   ` Ludovic Courtès
  1 sibling, 0 replies; 27+ messages in thread
From: Maxime Devos @ 2022-05-26 21:02 UTC (permalink / raw)
  To: Josselin Poiret, Ludovic Courtès, Christopher Baines; +Cc: 55441

[-- Attachment #1: Type: text/plain, Size: 322 bytes --]

Josselin Poiret via Bug reports for GNU Guix schreef op do 26-05-2022
om 22:50 [+0200]:
> * guix/inferior.scm: Load up scm_init_popen extension to be able to
> use piped-process.

Can we export the procedure instead from (ice-9 popen) + use @@ for
now, instead of doing this low-level thing?

Greetings,
Maxime.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 260 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-26 20:58                                 ` bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections Maxime Devos
@ 2022-05-28 14:02                                   ` Josselin Poiret via Bug reports for GNU Guix
  2023-01-26 14:39                                     ` Ludovic Courtès
  0 siblings, 1 reply; 27+ messages in thread
From: Josselin Poiret via Bug reports for GNU Guix @ 2022-05-28 14:02 UTC (permalink / raw)
  To: Maxime Devos, Ludovic Courtès, Christopher Baines; +Cc: 55441

Hello Maxime,

Thanks for your review.  I took your comments into account and actually
wrote some more patches to clean up start_child of potential thread
unsafe practices.  Since these are not guix-specific, I sent the whole
patchset to guile-devel instead [1].

Maxime Devos <maximedevos@telenet.be> writes:
> The process opening code looks fragile, so I think some tests are
> required to avoid causing regressions later.

How would you suggest we do that?  Since these are mostly racing issues
between threads, I don't know how we could reproduce these reliably in
tests.

[1] v4 PATCH at
   https://debbugs.gnu.org/cgi/bugreport.cgi?bug=52835
   (20220528124634.17353-1-dev@jpoiret.xyz)

Best,
-- 
Josselin Poiret




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [PATCH] guix: inferior: Make open-bidirectional-pipe use piped-process.
  2022-05-26 20:50                                 ` bug#55441: [PATCH] guix: inferior: Make open-bidirectional-pipe use piped-process Josselin Poiret via Bug reports for GNU Guix
  2022-05-26 21:02                                   ` Maxime Devos
@ 2022-05-28 17:15                                   ` Ludovic Courtès
  1 sibling, 0 replies; 27+ messages in thread
From: Ludovic Courtès @ 2022-05-28 17:15 UTC (permalink / raw)
  To: Josselin Poiret; +Cc: 55441, Christopher Baines

Hi,

Josselin Poiret <dev@jpoiret.xyz> skribis:

> * guix/inferior.scm: Load up scm_init_popen extension to be able to
> use piped-process.
> (open-bidirectional-pipe): Remove custom fork+exec code and use
> piped-process instead.

[...]

> +(eval-when (expand load eval)
> +  (load-extension (string-append "libguile-" (effective-version))
> +                  "scm_init_popen"))

To me this is not an option because it’s using an internal function and
it’s too fragile (that C function is not meant to be called more than
once).

That said, I agree that in principle we should be doing something along
these lines: calling out to a C function that does fork+dup+exec in a
safe way.

Thanks for looking into this!

Ludo’.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
  2022-05-28 14:02                                   ` Josselin Poiret via Bug reports for GNU Guix
@ 2023-01-26 14:39                                     ` Ludovic Courtès
  0 siblings, 0 replies; 27+ messages in thread
From: Ludovic Courtès @ 2023-01-26 14:39 UTC (permalink / raw)
  To: Josselin Poiret; +Cc: 55441, Mathieu Othacehe, Christopher Baines, Maxime Devos

Hi there!

I have some good news!  Time has passed but we didn’t remain idle :-)
and the scary bug at <https://issues.guix.gnu.org/55441> is finding a
happy end!

First, Josselin introduced bindings to ‘posix_spawn’¹ in Guile, to use
as the foundation of a new ‘spawn’ procedure, but also for ‘system*’,
‘open-pipe*’, & co.:

  https://issues.guix.gnu.org/52835

That’s big deal because ‘posix_spawn’ is usable in a multi-threaded
context (POSIX doesn’t explicitly say so, but it stems from the fact
that what happens in the child process before ‘exec’ is “safe”, and in
particular Guile code doesn’t get a chance to run).

Guile 3.0.9 released yesterday includes those changes.

Then we have this commit, where I changed ‘open-bidirectional-pipe’ in
(guix inferior) to use ‘spawn’ where available:

  fed3953d70 inferior: Use 'spawn' on Guile 3.0.9+.

That solves the problem that led to this bug.

Last, we have this update:

  230de2e94b gnu: guix: Update to d5fece6.

It gives us a ‘guix’ package that contains the ‘spawn’ fix and removes
‘guix-for-cuirass’, which is now longer necessary.

Thanks everyone for the team work, very nice when working on touch
issues like this one!

Ludo’.

¹ https://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_spawn.html




^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-01-26 14:41 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <87fslcaznn.fsf@gmail.com>
     [not found] ` <87mtfj174l.fsf@gnu.org>
2022-05-16  3:49   ` bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections Maxim Cournoyer
2022-05-16  7:32     ` Mathieu Othacehe
2022-05-16  9:13       ` Ludovic Courtès
2022-05-16 10:32         ` Ludovic Courtès
2022-05-16 12:19           ` Ludovic Courtès
2022-05-16 17:32             ` Maxim Cournoyer
2022-05-17  7:45               ` Ludovic Courtès
2022-05-18  7:44                 ` Ludovic Courtès
2022-05-20 18:32                   ` Ludovic Courtès
2022-05-24 21:02                     ` Ludovic Courtès
2022-05-25 16:32                       ` Ludovic Courtès
2022-05-25 18:21                         ` Ludovic Courtès
2022-05-25 23:13                           ` Christopher Baines
2022-05-26  9:44                             ` Ludovic Courtès
2022-05-26 20:49                               ` Josselin Poiret via Bug reports for GNU Guix
2022-05-26 20:50                                 ` bug#55441: [PATCH] guix: inferior: Make open-bidirectional-pipe use piped-process Josselin Poiret via Bug reports for GNU Guix
2022-05-26 21:02                                   ` Maxime Devos
2022-05-28 17:15                                   ` Ludovic Courtès
2022-05-26 20:50                                 ` bug#55441: [PATCH 1/2] Fix child spawning closing standard fds prematurely Josselin Poiret via Bug reports for GNU Guix
2022-05-26 20:55                                   ` Maxime Devos
2022-05-26 20:50                                 ` bug#55441: [PATCH 2/2] Improve thread safety of piped-process Josselin Poiret via Bug reports for GNU Guix
2022-05-26 20:58                                 ` bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections Maxime Devos
2022-05-28 14:02                                   ` Josselin Poiret via Bug reports for GNU Guix
2023-01-26 14:39                                     ` Ludovic Courtès
2022-05-26 12:29                           ` Ludovic Courtès
2022-05-16 12:49         ` Maxim Cournoyer
2022-05-17 12:52           ` Ludovic Courtès

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).