* bug#40981: Graphical installer tests sometimes hang.
@ 2020-04-30 11:51 Mathieu Othacehe
2020-05-04 11:42 ` Mathieu Othacehe
0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Othacehe @ 2020-04-30 11:51 UTC (permalink / raw)
To: 40981
Hello,
Graphical installer tests sometimes hang, before starting marionette
tests. See for instance:
https://ci.guix.gnu.org/log/d31s9sycixhvfak5lpzdg0mzvz5aa9av-gui-installed-os-encrypted.
Restarting tests just after a hang (on the same installed image),
sometimes work. I removed the "quiet" kernel argument to see what's
going on.
--8<---------------cut here---------------start------------->8---
[ 0.862608] Freeing unused kernel image memory: 1964K
[ 0.863177] Run /init as init process
GC Warning: pthread_getattr_np or pthread_attr_getstack failed for main thread
GC Warning: Couldn't read /proc/stat
Welcome, this is GNU's early boot Guile.
Use '--repl' for an initrd REPL.
loading kernel modules...
[ 0.915640] usbcore: registered new interface driver usb-storage
[ 0.917800] usbcore: registered new interface driver uas
[ 0.919569] hidraw: raw HID events driver (C) Jiri Kosina
[ 0.920519] usbcore: registered new interface driver usbhid
[ 0.921177] usbhid: USB HID core driver
[ 0.933506] isci: Intel(R) C600 SAS Controller Driver - version 1.2.0
[ 0.951303] PCI Interrupt Link [LNKD] enabled at IRQ 11
[ 0.970144] PCI Interrupt Link [LNKA] enabled at IRQ 10
[ 0.974033] virtio_blk virtio1: [vda] 4505600 512-byte logical blocks (2.31 GB/2.15 GiB)
[ 0.976186] vda: vda1 vda2
;; hanging here.
--8<---------------cut here---------------end--------------->8---
It seems that the boot freezes, soon after the initrd is started, and
before loading the boot script.
Mathieu
^ permalink raw reply [flat|nested] 10+ messages in thread
* bug#40981: Graphical installer tests sometimes hang.
2020-04-30 11:51 bug#40981: Graphical installer tests sometimes hang Mathieu Othacehe
@ 2020-05-04 11:42 ` Mathieu Othacehe
2020-05-04 12:50 ` Mathieu Othacehe
0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Othacehe @ 2020-05-04 11:42 UTC (permalink / raw)
To: 40981
[-- Attachment #1: Type: text/plain, Size: 1070 bytes --]
Hello,
> It seems that the boot freezes, soon after the initrd is started, and
> before loading the boot script.
I made some progress on this one. Turns out, the system was started and
the tests are in progress when it hangs.
The problem occurs during the reset of the HTTP proxy ("guix-daemon
set-http-proxy action, clear" test). It seems that clearing the proxy
kills Shepherd.
Here's a screenshot of the console when it happens. It can also be
reproduced by spawning a VM and running this script:
--8<---------------cut here---------------start------------->8---
(use-modules (gnu services herd))
(with-shepherd-action 'guix-daemon
('set-http-proxy "http://localhost:8118")
result
result)
(with-shepherd-action 'guix-daemon
('set-http-proxy)
result
result)
--8<---------------cut here---------------end--------------->8---
I'll keep looking!
Thanks,
Mathieu
[-- Attachment #2: crash.ppm --]
[-- Type: image/x-portable-pixmap, Size: 2359312 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* bug#40981: Graphical installer tests sometimes hang.
2020-05-04 11:42 ` Mathieu Othacehe
@ 2020-05-04 12:50 ` Mathieu Othacehe
2020-05-05 10:00 ` Ludovic Courtès
0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Othacehe @ 2020-05-04 12:50 UTC (permalink / raw)
To: 40981
> I'll keep looking!
Ok, getting closer. Here's a suspect part of Shepherd strace log:
--8<---------------cut here---------------start------------->8---
[pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid 1] write(9, "shepherd[1]: changing HTTP/HTTPS"..., 86) = 86
[pid 1] getpgid(194) = 194
[pid 1] kill(-194, SIGTERM) = 0
--8<---------------cut here---------------end--------------->8---
I think the problem is introduced by commit
1e7a91d21f1cc5d02697680e19e3878ff8565710 in Shepherd.
"(getpgid <guix-daemon-pid>") returns 0, and calling "(kill 0 SIGTERM)"
kills all processes.
Now, I really don't get how guix-daemon pgid could be zero. Man page of
setpgid(2) says:
--8<---------------cut here---------------start------------->8---
A child created via fork(2) inherits its parent's process group ID. The PGID is
preserved across an execve(2).
--8<---------------cut here---------------end--------------->8---
WDYT?
Thanks,
Mathieu
^ permalink raw reply [flat|nested] 10+ messages in thread
* bug#40981: Graphical installer tests sometimes hang.
2020-05-04 12:50 ` Mathieu Othacehe
@ 2020-05-05 10:00 ` Ludovic Courtès
2020-05-05 10:22 ` Mathieu Othacehe
0 siblings, 1 reply; 10+ messages in thread
From: Ludovic Courtès @ 2020-05-05 10:00 UTC (permalink / raw)
To: Mathieu Othacehe; +Cc: 40981
Hi!
Mathieu Othacehe <m.othacehe@gmail.com> skribis:
>> I'll keep looking!
>
> Ok, getting closer. Here's a suspect part of Shepherd strace log:
>
> [pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
> [pid 1] write(9, "shepherd[1]: changing HTTP/HTTPS"..., 86) = 86
> [pid 1] getpgid(194) = 194
> [pid 1] kill(-194, SIGTERM) = 0
>
>
> I think the problem is introduced by commit
> 1e7a91d21f1cc5d02697680e19e3878ff8565710 in Shepherd.
OK, but the trace above is “as expected”, isn’t it?
> "(getpgid <guix-daemon-pid>") returns 0, and calling "(kill 0 SIGTERM)"
> kills all processes.
What made you think of this scenario?
I don’t think getpgid(2) can return 0. Or am I missing something?
Since guix-dameon doesn’t actually daemonize, getpgid(pid) = pid.
Running this (in a VM) works fine:
while herd set-http-proxy guix-daemon foo ; do : ; done
Thanks for debugging!
Ludo’.
^ permalink raw reply [flat|nested] 10+ messages in thread
* bug#40981: Graphical installer tests sometimes hang.
2020-05-05 10:00 ` Ludovic Courtès
@ 2020-05-05 10:22 ` Mathieu Othacehe
2020-05-06 10:02 ` Mathieu Othacehe
0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Othacehe @ 2020-05-05 10:22 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: 40981
[-- Attachment #1: Type: text/plain, Size: 541 bytes --]
Hey!
> OK, but the trace above is “as expected”, isn’t it?
Oh sorry, I pasted the wrong part! Here's the part where getpgid returns
zero and the full log in attachment:
--8<---------------cut here---------------start------------->8---
[pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid 1] write(9, "shepherd[1]: clearing HTTP/HTTPS"..., 59) = 59
[pid 1] getpgid(266) = 0
--8<---------------cut here---------------end--------------->8---
Thanks,
Mathieu
[-- Attachment #2: strace.log --]
[-- Type: application/octet-stream, Size: 4974 bytes --]
/gnu/store/9xl2m0qgqcxk4m3qmf957pq6dfpnm5c1-strace-5.5/bin/strace: Process 1 attached with 2 threads
[pid 169] read(10, <unfinished ...>
[pid 1] select(15, [3 14], [], [], {tv_sec=1, tv_usec=768164}) = 0 (Timeout)
[pid 1] wait4(-1, 0x7ffd2b3a79dc, WNOHANG, NULL) = 0
[pid 1] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 1] select(15, [3 14], [], [], {tv_sec=30, tv_usec=0}) = 1 (in [14], left {tv_sec=25, tv_usec=541389})
[pid 1] accept4(14, {sa_family=AF_UNIX}, [112->2], 0) = 16
[pid 1] read(16, "(shepherd-command (version 0) (a"..., 1024) = 134
[pid 1] getcwd("/", 100) = 2
[pid 1] brk(0x5e9000) = 0x5e9000
[pid 1] brk(0x5e1000) = 0x5e1000
[pid 1] brk(0x5d9000) = 0x5d9000
[pid 1] brk(0x5d1000) = 0x5d1000
[pid 1] chdir("/root") = 0
[pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid 1] write(9, "shepherd[1]: changing HTTP/HTTPS"..., 86) = 86
[pid 1] getpgid(194) = 194
[pid 1] kill(-194, SIGTERM) = 0
[pid 1] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=194, si_uid=0, si_status=SIGTERM, si_utime=1, si_stime=2} ---
[pid 1] write(11, "\21", 1 <unfinished ...>
[pid 169] <... read resumed>"\21", 1) = 1
[pid 1] <... write resumed>) = 1
[pid 169] rt_sigprocmask(SIG_BLOCK, NULL, ~[KILL STOP PWR RTMIN RT_1], 8) = 0
[pid 169] read(10, <unfinished ...>
[pid 1] rt_sigreturn({mask=[]}) = 0
[pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid 1] write(9, "shepherd[1]: Service guix-daemon"..., 51) = 51
[pid 1] wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG, NULL) = 194
[pid 1] wait4(-1, 0x7ffd2b3a779c, WNOHANG, NULL) = 0
[pid 1] getuid() = 0
[pid 1] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fc8094a1e50) = 266
[pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid 1] write(9, "shepherd[1]: Service guix-daemon"..., 51) = 51
[pid 1] chdir("/") = 0
[pid 1] write(16, "(reply (version 0) (result (#t))"..., 214) = 214
[pid 1] close(16) = 0
[pid 1] brk(0x5c1000) = 0x5c1000
[pid 1] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
[pid 1] select(15, [3 14], [], [], {tv_sec=30, tv_usec=0}) = 1 (in [14], left {tv_sec=29, tv_usec=995886})
[pid 1] accept4(14, {sa_family=AF_UNIX}, [112->2], 0) = 16
[pid 1] read(16, "(shepherd-command (version 0) (a"..., 1024) = 111
[pid 1] getcwd("/", 100) = 2
[pid 1] brk(0x5e9000) = 0x5e9000
[pid 1] brk(0x5e1000) = 0x5e1000
[pid 1] brk(0x5d9000) = 0x5d9000
[pid 1] brk(0x5d1000) = 0x5d1000
[pid 1] chdir("/root") = 0
[pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid 1] write(9, "shepherd[1]: clearing HTTP/HTTPS"..., 59) = 59
[pid 1] getpgid(266) = 0
[pid 1] kill(0, SIGTERM) = 0
[pid 1] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
[pid 1] write(11, "\17", 1 <unfinished ...>
[pid 169] <... read resumed>"\17", 1) = 1
[pid 1] <... write resumed>) = 1
[pid 169] rt_sigprocmask(SIG_BLOCK, NULL, ~[KILL STOP PWR RTMIN RT_1], 8) = 0
[pid 169] read(10, <unfinished ...>
[pid 1] rt_sigreturn({mask=[]}) = 0
[pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid 1] write(9, "shepherd[1]: Service guix-daemon"..., 51) = 51
[pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid 1] write(9, "shepherd[1]: Exiting shepherd..."..., 33) = 33
[pid 1] getpgid(222) = 222
[pid 1] kill(-222, SIGTERM) = 0
[pid 1] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=0, si_uid=0, si_status=SIGTERM, si_utime=3, si_stime=0} ---
[pid 1] write(11, "\21", 1 <unfinished ...>
[pid 169] <... read resumed>"\21", 1) = 1
[pid 1] <... write resumed>) = 1
[pid 169] rt_sigprocmask(SIG_BLOCK, NULL, ~[KILL STOP PWR RTMIN RT_1], 8) = 0
[pid 169] read(10, <unfinished ...>
[pid 1] rt_sigreturn({mask=[]}) = 0
[pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid 1] write(9, "shepherd[1]: Service mcron has b"..., 45) = 45
[pid 1] stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}) = 0
[pid 1] write(9, "shepherd[1]: Service console-fon"..., 57) = 57
[pid 1] getpgid(203) = 203
[pid 1] kill(-203, SIGTERM) = 0
/gnu/store/9xl2m0qgqcxk4m3qmf957pq6dfpnm5c1-strace-5.5/bin/strace: Process 1 detached
/gnu/store/9xl2m0qgqcxk4m3qmf957pq6dfpnm5c1-strace-5.5/bin/strace: Process 169 detached
^ permalink raw reply [flat|nested] 10+ messages in thread
* bug#40981: Graphical installer tests sometimes hang.
2020-05-05 10:22 ` Mathieu Othacehe
@ 2020-05-06 10:02 ` Mathieu Othacehe
2020-05-07 16:48 ` Mathieu Othacehe
0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Othacehe @ 2020-05-06 10:02 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: 40981
[-- Attachment #1: Type: text/plain, Size: 483 bytes --]
Hello,
I found what happened here. Turns out, when a process is forked and
before "setsid" or "setpgid" is called, it shares the parent PGID. In
that case the parent is Shepherd, with the PGID 0.
When doing the following actions:
* stop guix-daemon
* start guix-daemon
* stop guix-daemon
* start guix-daemon
If the second stop occurs after "fork" has been done, but before
"setsid", then "(getpgid)" returns 0. The naive patch attached could fix
the situation.
WDYT?
Mathieu
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-service-Fix-make-kill-destructor-when-PGID-is-zero.patch --]
[-- Type: text/x-diff, Size: 2149 bytes --]
From 0e4167251a56d6baa4f51fe72250a6e3bffae8c3 Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <m.othacehe@gmail.com>
Date: Wed, 6 May 2020 11:48:26 +0200
Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.
When a process is forked, and before its GID is changed in "exec-command",
it will share the parent GID, which is 0 for Shepherd. In that case, use
the PID instead of the PGID.
* modules/shepherd/service.scm (make-kill-destructor): Handle the case when
PGID is zero, between the process fork and exec.
---
modules/shepherd/service.scm | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 8604d2f..258992c 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -5,6 +5,7 @@
;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
;; Copyright (C) 2019 Ricardo Wurmus <rekado@elephly.net>
+;; Copyright (C) 2020 Mathieu Othacehe <m.othacehe@gmail.com>
;;
;; This file is part of the GNU Shepherd.
;;
@@ -957,11 +958,15 @@ start."
"Return a procedure that sends SIGNAL to the process group of the PID given
as argument, where SIGNAL defaults to `SIGTERM'."
(lambda (pid . args)
- ;; Kill the whole process group PID belongs to. Don't assume that PID
- ;; is a process group ID: that's not the case when using #:pid-file,
- ;; where the process group ID is the PID of the process that
- ;; "daemonized".
- (kill (- (getpgid pid)) signal)
+ ;; Kill the whole process group PID belongs to. Don't assume that PID is
+ ;; a process group ID: that's not the case when using #:pid-file, where
+ ;; the process group ID is the PID of the process that "daemonized". If
+ ;; this procedure is called, between the process fork and exec, the PGID
+ ;; will still be zero (the Shepherd PGID). In that case, use the PID.
+ (let ((pgid (getpgid pid)))
+ (if pgid
+ (kill (- pgid) signal)
+ (kill pid signal)))
#f))
;; Produce a constructor that executes a command.
--
2.26.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* bug#40981: Graphical installer tests sometimes hang.
2020-05-06 10:02 ` Mathieu Othacehe
@ 2020-05-07 16:48 ` Mathieu Othacehe
2020-05-10 10:32 ` Ludovic Courtès
0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Othacehe @ 2020-05-07 16:48 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: 40981
[-- Attachment #1: Type: text/plain, Size: 532 bytes --]
Hello,
The previous patch was not working. The reason is that, when a process
is forked and before execv is called, it shares the parent signal
handler.
So when sending a SIGTERM to the child process, we are stopping
root-service, if the signal is received before the child has forked.
The work-around here is to save the installed SIGTERM handler and reset
it. Then, after forking, the parent can restore the SIGTERM handler. The
child will use the default SIGTERM handler that terminates the process.
WDYT?
Thanks,
Mathieu
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-service-Fix-make-kill-destructor-when-PGID-is-zero.patch --]
[-- Type: text/x-diff, Size: 4964 bytes --]
From aa6f67068f1fdd622673ec0223f05fd8f8a96baa Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <m.othacehe@gmail.com>
Date: Thu, 7 May 2020 18:39:41 +0200
Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.
When a process is forked, and before its GID is changed in "exec-command",
it will share the parent GID, which is 0 for Shepherd. In that case, use
the PID instead of the PGID.
Also make sure to reset the SIGTERM handler before forking a process. Failing
to do so, will result in stopping Shepherd if a SIGTERM is received between
fork and execv calls. Restore the SIGTERM handler once the process has been
forked.
* modules/shepherd/service.scm (fork+exec-command): Save the current SIGTERM
handler and reset it before forking. Then, restore it on the parent after
forking.
(make-kill-destructor): Handle the case when PGID is zero, between the process
fork and exec.
---
modules/shepherd/service.scm | 36 +++++++++++++++++++++++++++++-------
tests/forking-service.sh | 18 ++++++++++++++++++
2 files changed, 47 insertions(+), 7 deletions(-)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 8604d2f..c68d7e0 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -5,6 +5,7 @@
;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
;; Copyright (C) 2019 Ricardo Wurmus <rekado@elephly.net>
+;; Copyright (C) 2020 Mathieu Othacehe <m.othacehe@gmail.com>
;;
;; This file is part of the GNU Shepherd.
;;
@@ -884,7 +885,18 @@ its PID."
(unless %sigchld-handler-installed?
(sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
(set! %sigchld-handler-installed? #t))
- (let ((pid (primitive-fork)))
+
+ ;; When forking a process, the signal handlers are inherited, until it
+ ;; forks. If SIGTERM is received by the forked process, before it calls
+ ;; execv, the installed SIGTERM handler, stopping Shepherd will be called.
+ ;; To avoid this, save the SIGTERM handler, disable it, and restore it once,
+ ;; the process has been forked. This way, the forked process will use the
+ ;; default SIGTERM handler stopping the process.
+ (let ((term-handler (match (sigaction SIGTERM)
+ ((proc . _)
+ proc)))
+ (pid (and (sigaction SIGTERM SIG_DFL)
+ (primitive-fork))))
(if (zero? pid)
(exec-command command
#:user user
@@ -893,7 +905,10 @@ its PID."
#:directory directory
#:file-creation-mask file-creation-mask
#:environment-variables environment-variables)
- pid)))
+ (begin
+ ;; Restore the initial SIGTERM handler.
+ (sigaction SIGTERM term-handler)
+ pid))))
(define* (make-forkexec-constructor command
#:key
@@ -957,11 +972,18 @@ start."
"Return a procedure that sends SIGNAL to the process group of the PID given
as argument, where SIGNAL defaults to `SIGTERM'."
(lambda (pid . args)
- ;; Kill the whole process group PID belongs to. Don't assume that PID
- ;; is a process group ID: that's not the case when using #:pid-file,
- ;; where the process group ID is the PID of the process that
- ;; "daemonized".
- (kill (- (getpgid pid)) signal)
+ ;; Kill the whole process group PID belongs to. Don't assume that PID is
+ ;; a process group ID: that's not the case when using #:pid-file, where
+ ;; the process group ID is the PID of the process that "daemonized". If
+ ;; this procedure is called, between the process fork and exec, the PGID
+ ;; will still be zero (the Shepherd PGID). In that case, use the PID.
+ (let ((current-pgid (getpgid 0))
+ (pgid (getpgid pid)))
+ (if (eq? pgid current-pgid)
+ (begin
+ (kill pid signal))
+ (begin
+ (kill (- pgid) signal))))
#f))
;; Produce a constructor that executes a command.
diff --git a/tests/forking-service.sh b/tests/forking-service.sh
index 7126c3d..47b09a2 100644
--- a/tests/forking-service.sh
+++ b/tests/forking-service.sh
@@ -71,6 +71,17 @@ cat > "$conf"<<EOF
#:pid-file "$PWD/$service2_pid")
#:stop (make-kill-destructor)
#:respawn? #t))
+
+(define %command3
+ '("$SHELL" "-c" "sleep 600"))
+
+(register-services
+ (make <service>
+ ;; A service that forks into a different process.
+ #:provides '(test3)
+ #:start (make-forkexec-constructor %command3)
+ #:stop (make-kill-destructor)
+ #:respawn? #t))
EOF
cat $conf
@@ -113,3 +124,10 @@ sleep 1;
$herd status test2 | grep started
test "`cat $PWD/$service2_started`" = "started
started"
+
+# Try to trigger eventual race conditions, when killing a process between fork
+# and execv calls.
+for i in {1..50}
+do
+ $herd restart test3
+done
--
2.26.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* bug#40981: Graphical installer tests sometimes hang.
2020-05-07 16:48 ` Mathieu Othacehe
@ 2020-05-10 10:32 ` Ludovic Courtès
2020-05-10 11:19 ` Mathieu Othacehe
0 siblings, 1 reply; 10+ messages in thread
From: Ludovic Courtès @ 2020-05-10 10:32 UTC (permalink / raw)
To: Mathieu Othacehe; +Cc: 40981
Hi!
Mathieu Othacehe <m.othacehe@gmail.com> skribis:
> The previous patch was not working. The reason is that, when a process
> is forked and before execv is called, it shares the parent signal
> handler.
>
> So when sending a SIGTERM to the child process, we are stopping
> root-service, if the signal is received before the child has forked.
Woow, good catch!
> The work-around here is to save the installed SIGTERM handler and reset
> it. Then, after forking, the parent can restore the SIGTERM handler. The
> child will use the default SIGTERM handler that terminates the process.
OK, makes sense. (Another occasion to see how something like
‘posix_spawn’ would be more appropriate than fork + exec…)
> From aa6f67068f1fdd622673ec0223f05fd8f8a96baa Mon Sep 17 00:00:00 2001
> From: Mathieu Othacehe <m.othacehe@gmail.com>
> Date: Thu, 7 May 2020 18:39:41 +0200
> Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.
>
> When a process is forked, and before its GID is changed in "exec-command",
> it will share the parent GID, which is 0 for Shepherd. In that case, use
> the PID instead of the PGID.
>
> Also make sure to reset the SIGTERM handler before forking a process. Failing
> to do so, will result in stopping Shepherd if a SIGTERM is received between
> fork and execv calls. Restore the SIGTERM handler once the process has been
> forked.
>
> * modules/shepherd/service.scm (fork+exec-command): Save the current SIGTERM
> handler and reset it before forking. Then, restore it on the parent after
> forking.
> (make-kill-destructor): Handle the case when PGID is zero, between the process
> fork and exec.
[...]
> + ;; Kill the whole process group PID belongs to. Don't assume that PID is
> + ;; a process group ID: that's not the case when using #:pid-file, where
> + ;; the process group ID is the PID of the process that "daemonized". If
> + ;; this procedure is called, between the process fork and exec, the PGID
> + ;; will still be zero (the Shepherd PGID). In that case, use the PID.
> + (let ((current-pgid (getpgid 0))
> + (pgid (getpgid pid)))
> + (if (eq? pgid current-pgid)
> + (begin
> + (kill pid signal))
> + (begin
> + (kill (- pgid) signal))))
Shouldn’t it be:
(let ((pgid (getpgid pid)))
(if (= (getpgid 0) pgid)
(kill pid signal) ;don't kill ourself
(kill (-p pgid) signal)))
?
Note: Use the most specific comparison procedure, ‘=’ in this case,
because we know we’re dealing with numbers (it enables proper type
checking, better compiler optimizations, etc.). More importantly, ‘eq?’
performs pointer comparison, so it shouldn’t be used with numbers (in
practice it works with “fixnums” but not with bignums).
> +# Try to trigger eventual race conditions, when killing a process between fork
> +# and execv calls.
> +for i in {1..50}
> +do
> + $herd restart test3
> +done
Would it reproduce the problem well enough?
Use `seq 1 50` to avoid relying on a Bash-specific construct (which I
think it is, right?).
Could you send an updated patch?
Thanks for the bug hunting and for the patch!
Ludo’.
^ permalink raw reply [flat|nested] 10+ messages in thread
* bug#40981: Graphical installer tests sometimes hang.
2020-05-10 10:32 ` Ludovic Courtès
@ 2020-05-10 11:19 ` Mathieu Othacehe
2020-05-11 21:09 ` Ludovic Courtès
0 siblings, 1 reply; 10+ messages in thread
From: Mathieu Othacehe @ 2020-05-10 11:19 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: 40981
[-- Attachment #1: Type: text/plain, Size: 1692 bytes --]
Hey Ludo,
> Woow, good catch!
Thanks :)
>> The work-around here is to save the installed SIGTERM handler and reset
>> it. Then, after forking, the parent can restore the SIGTERM handler. The
>> child will use the default SIGTERM handler that terminates the process.
>
> OK, makes sense. (Another occasion to see how something like
> ‘posix_spawn’ would be more appropriate than fork + exec…)
Didn't know about that function, but it seems way easier to manipulate
and less error prone indeed!
> Shouldn’t it be:
>
> (let ((pgid (getpgid pid)))
> (if (= (getpgid 0) pgid)
> (kill pid signal) ;don't kill ourself
> (kill (-p pgid) signal)))
Yes, I changed it.
> Note: Use the most specific comparison procedure, ‘=’ in this case,
> because we know we’re dealing with numbers (it enables proper type
> checking, better compiler optimizations, etc.). More importantly, ‘eq?’
> performs pointer comparison, so it shouldn’t be used with numbers (in
> practice it works with “fixnums” but not with bignums).
Ok, noted!
>> +# Try to trigger eventual race conditions, when killing a process between fork
>> +# and execv calls.
>> +for i in {1..50}
>> +do
>> + $herd restart test3
>> +done
>
> Would it reproduce the problem well enough?
On a slow machine one time out of two, and on a faster machine,
never. The 'reproducer' I used, was to add a 'sleep' between fork and
exec, it works way better!
Tell me if you think its better to drop it.
> Could you send an updated patch?
Here it is!
> Thanks for the bug hunting and for the patch!
Thanks for the fast review :)
Mathieu
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-service-Fix-make-kill-destructor-when-PGID-is-zero.patch --]
[-- Type: text/x-diff, Size: 4901 bytes --]
From 79d3603bf15b8f815136178be8c8a236734a7596 Mon Sep 17 00:00:00 2001
From: Mathieu Othacehe <m.othacehe@gmail.com>
Date: Thu, 7 May 2020 18:39:41 +0200
Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.
When a process is forked, and before its GID is changed in "exec-command",
it will share the parent GID, which is 0 for Shepherd. In that case, use
the PID instead of the PGID.
Also make sure to reset the SIGTERM handler before forking a process. Failing
to do so, will result in stopping Shepherd if a SIGTERM is received between
fork and execv calls. Restore the SIGTERM handler once the process has been
forked.
* modules/shepherd/service.scm (fork+exec-command): Save the current SIGTERM
handler and reset it before forking. Then, restore it on the parent after
forking.
(make-kill-destructor): Handle the case when PGID is zero, between the process
fork and exec.
---
modules/shepherd/service.scm | 33 ++++++++++++++++++++++++++-------
tests/forking-service.sh | 18 ++++++++++++++++++
2 files changed, 44 insertions(+), 7 deletions(-)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 8604d2f..45fcf32 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -5,6 +5,7 @@
;; Copyright (C) 2016 Alex Kost <alezost@gmail.com>
;; Copyright (C) 2018 Carlo Zancanaro <carlo@zancanaro.id.au>
;; Copyright (C) 2019 Ricardo Wurmus <rekado@elephly.net>
+;; Copyright (C) 2020 Mathieu Othacehe <m.othacehe@gmail.com>
;;
;; This file is part of the GNU Shepherd.
;;
@@ -884,7 +885,18 @@ its PID."
(unless %sigchld-handler-installed?
(sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
(set! %sigchld-handler-installed? #t))
- (let ((pid (primitive-fork)))
+
+ ;; When forking a process, the signal handlers are inherited, until it
+ ;; forks. If SIGTERM is received by the forked process, before it calls
+ ;; execv, the installed SIGTERM handler, stopping Shepherd will be called.
+ ;; To avoid this, save the SIGTERM handler, disable it, and restore it once,
+ ;; the process has been forked. This way, the forked process will use the
+ ;; default SIGTERM handler stopping the process.
+ (let ((term-handler (match (sigaction SIGTERM)
+ ((proc . _)
+ proc)))
+ (pid (and (sigaction SIGTERM SIG_DFL)
+ (primitive-fork))))
(if (zero? pid)
(exec-command command
#:user user
@@ -893,7 +905,10 @@ its PID."
#:directory directory
#:file-creation-mask file-creation-mask
#:environment-variables environment-variables)
- pid)))
+ (begin
+ ;; Restore the initial SIGTERM handler.
+ (sigaction SIGTERM term-handler)
+ pid))))
(define* (make-forkexec-constructor command
#:key
@@ -957,11 +972,15 @@ start."
"Return a procedure that sends SIGNAL to the process group of the PID given
as argument, where SIGNAL defaults to `SIGTERM'."
(lambda (pid . args)
- ;; Kill the whole process group PID belongs to. Don't assume that PID
- ;; is a process group ID: that's not the case when using #:pid-file,
- ;; where the process group ID is the PID of the process that
- ;; "daemonized".
- (kill (- (getpgid pid)) signal)
+ ;; Kill the whole process group PID belongs to. Don't assume that PID is
+ ;; a process group ID: that's not the case when using #:pid-file, where
+ ;; the process group ID is the PID of the process that "daemonized". If
+ ;; this procedure is called, between the process fork and exec, the PGID
+ ;; will still be zero (the Shepherd PGID). In that case, use the PID.
+ (let ((pgid (getpgid pid)))
+ (if (= (getpgid 0) pgid)
+ (kill pid signal) ;don't kill ourself
+ (kill (- pgid) signal)))
#f))
;; Produce a constructor that executes a command.
diff --git a/tests/forking-service.sh b/tests/forking-service.sh
index 7126c3d..bd9aac9 100644
--- a/tests/forking-service.sh
+++ b/tests/forking-service.sh
@@ -71,6 +71,17 @@ cat > "$conf"<<EOF
#:pid-file "$PWD/$service2_pid")
#:stop (make-kill-destructor)
#:respawn? #t))
+
+(define %command3
+ '("$SHELL" "-c" "sleep 600"))
+
+(register-services
+ (make <service>
+ ;; A service that forks into a different process.
+ #:provides '(test3)
+ #:start (make-forkexec-constructor %command3)
+ #:stop (make-kill-destructor)
+ #:respawn? #t))
EOF
cat $conf
@@ -113,3 +124,10 @@ sleep 1;
$herd status test2 | grep started
test "`cat $PWD/$service2_started`" = "started
started"
+
+# Try to trigger eventual race conditions, when killing a process between fork
+# and execv calls.
+for i in `seq 1 50`
+do
+ $herd restart test3
+done
--
2.26.2
^ permalink raw reply related [flat|nested] 10+ messages in thread
* bug#40981: Graphical installer tests sometimes hang.
2020-05-10 11:19 ` Mathieu Othacehe
@ 2020-05-11 21:09 ` Ludovic Courtès
0 siblings, 0 replies; 10+ messages in thread
From: Ludovic Courtès @ 2020-05-11 21:09 UTC (permalink / raw)
To: Mathieu Othacehe; +Cc: 40981-done
Hello,
Mathieu Othacehe <othacehe@gnu.org> skribis:
>>> The work-around here is to save the installed SIGTERM handler and reset
>>> it. Then, after forking, the parent can restore the SIGTERM handler. The
>>> child will use the default SIGTERM handler that terminates the process.
>>
>> OK, makes sense. (Another occasion to see how something like
>> ‘posix_spawn’ would be more appropriate than fork + exec…)
>
> Didn't know about that function, but it seems way easier to manipulate
> and less error prone indeed!
Make sure to read “A fork() in the Road” on that topic:
https://lwn.net/Articles/785430/
>>> +# Try to trigger eventual race conditions, when killing a process between fork
>>> +# and execv calls.
>>> +for i in {1..50}
>>> +do
>>> + $herd restart test3
>>> +done
>>
>> Would it reproduce the problem well enough?
>
> On a slow machine one time out of two, and on a faster machine,
> never. The 'reproducer' I used, was to add a 'sleep' between fork and
> exec, it works way better!
>
> Tell me if you think its better to drop it.
It’s better than nothing, let’s keep it.
>>From 79d3603bf15b8f815136178be8c8a236734a7596 Mon Sep 17 00:00:00 2001
> From: Mathieu Othacehe <m.othacehe@gmail.com>
> Date: Thu, 7 May 2020 18:39:41 +0200
> Subject: [PATCH] service: Fix 'make-kill-destructor' when PGID is zero.
>
> When a process is forked, and before its GID is changed in "exec-command",
> it will share the parent GID, which is 0 for Shepherd. In that case, use
> the PID instead of the PGID.
>
> Also make sure to reset the SIGTERM handler before forking a process. Failing
> to do so, will result in stopping Shepherd if a SIGTERM is received between
> fork and execv calls. Restore the SIGTERM handler once the process has been
> forked.
>
> * modules/shepherd/service.scm (fork+exec-command): Save the current SIGTERM
> handler and reset it before forking. Then, restore it on the parent after
> forking.
> (make-kill-destructor): Handle the case when PGID is zero, between the process
> fork and exec.
I added a “Fixes” line and pushed it.
Thanks a lot!
I can roll a 0.8.1 release soonish (I’d like to add signalfd support
while at it, we’ll see.)
Ludo’.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2020-05-11 21:09 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-04-30 11:51 bug#40981: Graphical installer tests sometimes hang Mathieu Othacehe
2020-05-04 11:42 ` Mathieu Othacehe
2020-05-04 12:50 ` Mathieu Othacehe
2020-05-05 10:00 ` Ludovic Courtès
2020-05-05 10:22 ` Mathieu Othacehe
2020-05-06 10:02 ` Mathieu Othacehe
2020-05-07 16:48 ` Mathieu Othacehe
2020-05-10 10:32 ` Ludovic Courtès
2020-05-10 11:19 ` Mathieu Othacehe
2020-05-11 21:09 ` Ludovic Courtès
Code repositories for project(s) associated with this public inbox
https://git.savannah.gnu.org/cgit/guix.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).