Hi! Mathieu Othacehe skribis: > Those two finalizer threads share the same pipe. When we try to > stop the finalizer thread in Shepherd, right before forking a new > process, we send a '\1' byte to the finalizer pipe. > > 1 write(6, "\1", 1 > > > which is received by (line 183597): > > 253 <... read resumed>"\1", 1) = 1 > > the marionette finalizer thread. Then, we pthread_join the Shepherd > finalizer thread, which never stops! Quite unfortunate. While working on a fix for this issue (finalizer pipe shared between parent and child process), I found the ‘sleep_pipe’ of the main thread is also shared between the parent and its child. Attached is a reproducer. It prints something like this before hanging: --8<---------------cut here---------------start------------->8--- $ guile ~/src/guile-debugging/signal-pipe.scm parent: 25947 child: 25953 alarm in parent! alarm in parent! alarm in parent! [...] alarm in parent! alarm in parent! child woken up! --8<---------------cut here---------------end--------------->8--- “child woken up” means that it’s the child process that won the race reading on the sleep pipe (from ‘scm_std_select’). The parent process then hangs because, in ‘scm_std_select’, it did: 1. select(1), which returned due to available data on ‘wakeup_fd’; 2. ‘full_read (wakeup_fd, &dummy, 1)’ gets stuck forever in read(2) because the child process read that byte in the meantime so there’s nothing left to read. Here’s the sequence: --8<---------------cut here---------------start------------->8--- 25947 select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout) 25947 getpid() = 25947 25947 kill(25947, SIGALRM) = 0 25947 --- SIGALRM {si_signo=SIGALRM, si_code=SI_USER, si_pid=25947, si_uid=1000} --- 25947 write(8, "\16", 1) = 1 25947 rt_sigreturn({mask=[]}) = 0 25952 <... read resumed>"\16", 1) = 1 25947 rt_sigprocmask(SIG_BLOCK, NULL, 25952 write(4, "\0", 1 25947 <... rt_sigprocmask resumed>[], 8) = 0 25953 <... select resumed>) = 1 (in [3], left {tv_sec=0, tv_usec=346370}) 25952 <... write resumed>) = 1 25947 select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000} 25953 read(3, 25952 rt_sigprocmask(SIG_BLOCK, NULL, 25947 <... select resumed>) = 1 (in [3], left {tv_sec=0, tv_usec=99999}) 25953 <... read resumed>"\0", 1) = 1 25947 read(3, 25952 <... rt_sigprocmask resumed>~[KILL STOP PWR RTMIN RT_1], 8) = 0 25953 write(1, "child woken up!\n", 16 25952 read(7, --8<---------------cut here---------------end--------------->8--- Notice how “\16” (= SIGALRM) is written by the parent’s main thread and read by the child’s main thread. Thoughts? Ludo’.