Hey Ludo,

On Sat, Mar 03 2018, Ludovic Courtès wrote:
> If they’re zombies, that means nobody called waitpid(2).  Presumably the
> polling code would need to do that.
>
> So I suppose ‘check-for-dead-services’ should do something like:
>
> [ ... ]
>
> Does that make sense?  Please check waitpid(2) carefully though, because
> it’s pretty gnarly and I might have forgotten or misinterpreted
> something here.

Unfortunately we can't do that. We fall back to the polling approach
precisely because the processes we care about aren't our children (so we
never get SIGCHLD for them). The waitpid system call only waits for
processes which are children of the calling process.

I looked into the zombie problem a bit more, and I found what the problem
actually is. In the build environment a guile process is running as
pid 1 (the *-guile-builder script for that job). This guile process never
handles SIGCHLD and never calls wait/waitpid, so any orphaned processes,
which get reparented to pid 1, become zombies.

I tried modifying derivations.scm, but it wanted to rebuild a lot of
things, so I gave up. I think we need to add something like this to the
*-guile-builder script:

    (use-modules (ice-9 match))

    (sigaction SIGCHLD
      (lambda (signum)                ; the handler gets the signal number
        ;; Reap every exited child without blocking.
        (let loop ()
          (match (catch 'system-error ; ECHILD: no children left at all
                   (lambda () (waitpid WAIT_ANY WNOHANG))
                   (lambda _ #f))
            ((0 . _) #f)              ; children exist, none have exited
            ((pid . _) (loop))        ; reaped one, look for more
            (_ #f))))
      SA_NOCLDSTOP)

I've attached the output of `ps axjf` inside the build container, so you
can see why I think that this is the problem.

It's a bit of a shame that this is different from `guix environment
--container`, where /bin/sh is pid 1: the build succeeded in my container
but failed in the build container, which was a confusing experience.
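To illustrate why the polling code can't just call waitpid: here's a small
sketch (my own, not anything in Shepherd) that asks Guile's waitpid about
our parent process, which by definition isn't our child. I'd expect the
kernel to refuse with ECHILD. `waitpid-errno` is a hypothetical helper name
made up for this example.

```scheme
;; Return the errno raised by (waitpid PID WNOHANG), or #f if the call
;; succeeded.  Used here only to show what happens for a non-child PID.
(define (waitpid-errno pid)
  (catch 'system-error
    (lambda ()
      (waitpid pid WNOHANG)
      #f)
    (lambda args
      ;; ARGS is the full exception argument list, as
      ;; system-error-errno expects.
      (system-error-errno args))))

;; Our parent is not our child, so waitpid cannot wait for it.
(define parent-errno (waitpid-errno (getppid)))

(display (if (eqv? parent-errno ECHILD)
             "waitpid on a non-child fails with ECHILD\n"
             "unexpected result\n"))
```

The same thing happens for any service process that Shepherd didn't fork
itself, which is why only pid 1 (or the real parent) can reap them.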