* bug#58926: Shepherd becomes unresponsive after an interrupt
@ 2022-10-31 12:44 Mathieu Othacehe
2022-11-10 9:59 ` Ludovic Courtès
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Mathieu Othacehe @ 2022-10-31 12:44 UTC (permalink / raw)
To: 58926
Hello,
When running the following command:
--8<---------------cut here---------------start------------->8---
sudo herd restart service-that-hangs-upon-restart
--8<---------------cut here---------------end--------------->8---
then hitting C-c, Shepherd becomes totally unresponsive:
--8<---------------cut here---------------start------------->8---
sudo herd status
--8<---------------cut here---------------end--------------->8---
and all further Shepherd commands hang forever. I was able to reproduce
it in two different configurations:
1. On my laptop with a Wireguard service trying to reach a non-existing
DNS server.
--8<---------------cut here---------------start------------->8---
(service wireguard-service-type
         (wireguard-configuration
          (addresses (list "10.0.0.2/24"))
          (dns '("10.0.0.50"))))   ;does not exit
--8<---------------cut here---------------end--------------->8---
2. On Berlin, while trying to restart nginx.
In both situations, the "reboot" command was also hanging.
Thanks,
Mathieu
* bug#58926: Shepherd becomes unresponsive after an interrupt
2022-10-31 12:44 bug#58926: Shepherd becomes unresponsive after an interrupt Mathieu Othacehe
@ 2022-11-10 9:59 ` Ludovic Courtès
2022-11-12 18:10 ` Ludovic Courtès
2022-11-12 18:28 ` Ludovic Courtès
2 siblings, 0 replies; 6+ messages in thread
From: Ludovic Courtès @ 2022-11-10 9:59 UTC (permalink / raw)
To: Mathieu Othacehe; +Cc: 58926
Hi,
Mathieu Othacehe <othacehe@gnu.org> skribis:
> sudo herd restart service-that-hangs-upon-restart
>
>
> then hitting C-c, Shepherd becomes totally unresponsive:
>
> sudo herd status
>
>
> and all further Shepherd commands hang forever. I was able to reproduce
> it in two different configurations:
>
> 1. On my laptop with a Wireguard service trying to reach a non-existing
> DNS server.
>
> (service wireguard-service-type
> (wireguard-configuration
> (addresses (list "10.0.0.2/24"))
> (dns '("10.0.0.50")) #does not exit
>
> 2. On Berlin, while trying to restart nginx.
I experienced case #2: in that case ‘strace -p1’ showed that shepherd
was stuck on waitpid of the nginx process, which was not terminating.
Killing that process would unlock shepherd.
This might be <https://issues.guix.gnu.org/56674>.
Would be good to see what’s up with WireGuard.
Ludo’.
* bug#58926: Shepherd becomes unresponsive after an interrupt
2022-10-31 12:44 bug#58926: Shepherd becomes unresponsive after an interrupt Mathieu Othacehe
2022-11-10 9:59 ` Ludovic Courtès
@ 2022-11-12 18:10 ` Ludovic Courtès
2022-11-17 10:23 ` bug#53225: " Ludovic Courtès
2022-11-12 18:28 ` Ludovic Courtès
2 siblings, 1 reply; 6+ messages in thread
From: Ludovic Courtès @ 2022-11-12 18:10 UTC (permalink / raw)
To: Mathieu Othacehe; +Cc: 53225, 58926
Mathieu Othacehe <othacehe@gnu.org> skribis:
> 1. On my laptop with a Wireguard service trying to reach a non-existing
> DNS server.
>
> (service wireguard-service-type
> (wireguard-configuration
> (addresses (list "10.0.0.2/24"))
> (dns '("10.0.0.50")) #does not exit
This one is similar to:
https://issues.guix.gnu.org/53225
https://issues.guix.gnu.org/53381
It has to do with the fact that “wg-quick up” blocks until it succeeds
and that ‘invoke’ gets stuck on ‘waitpid’ until the “wg-quick” process
terminates.
The solution will be to use something non-blocking instead of ‘invoke’;
I’m looking into it.
Ludo’.
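To illustrate the blocking behaviour described in the message above, here is a minimal Guile sketch. It is not how Shepherd implements its fix (the thread only says that a non-blocking replacement is planned); it merely contrasts a blocking ‘waitpid’ with a polling one that uses WNOHANG. The helpers ‘spawn-sleep’, ‘child-finished?’ and ‘wait-while-working’ are hypothetical names, and “sleep” stands in for a long-running command such as “wg-quick up”.

```scheme
;; Illustration only: 'spawn-sleep', 'child-finished?' and
;; 'wait-while-working' are hypothetical helpers, not part of
;; Shepherd or Guix.

(define (spawn-sleep seconds)
  "Fork a child that runs 'sleep SECONDS' and return its PID."
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        ;; In the child: replace the process image with 'sleep',
        ;; and exit with status 127 if that fails.
        (catch 'system-error
          (lambda ()
            (execlp "sleep" "sleep" (number->string seconds)))
          (lambda args
            (primitive-exit 127)))
        pid)))

(define (child-finished? pid)
  "Return #t if PID has terminated, without blocking the caller."
  ;; With WNOHANG, 'waitpid' returns a pair whose car is 0 when the
  ;; child is still running, instead of waiting for it.
  (not (zero? (car (waitpid pid WNOHANG)))))

;; Blocking style, as with 'system*' or 'invoke': nothing else can run
;; in this thread until the child exits.
;;   (waitpid (spawn-sleep 5))

;; Polling style: the caller stays free to do other work between checks.
(define (wait-while-working pid)
  "Poll PID until it exits; return the number of polls that found it alive."
  (let loop ((ticks 0))
    (if (child-finished? pid)
        ticks
        (begin
          (usleep 100000)               ;stand-in for handling other events
          (loop (+ ticks 1))))))
```

With the blocking call, a child that never exits keeps the caller stuck forever, which matches the ‘strace’ observation of shepherd sitting in ‘waitpid’; with the polling variant the caller can keep answering other requests while the child runs.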
* bug#53225: bug#58926: Shepherd becomes unresponsive after an interrupt
2022-11-12 18:10 ` Ludovic Courtès
@ 2022-11-17 10:23 ` Ludovic Courtès
0 siblings, 0 replies; 6+ messages in thread
From: Ludovic Courtès @ 2022-11-17 10:23 UTC (permalink / raw)
To: Mathieu Othacehe; +Cc: 53225-done, 58926-done
Hi,
Ludovic Courtès <ludo@gnu.org> skribis:
> Mathieu Othacehe <othacehe@gnu.org> skribis:
>
>> 1. On my laptop with a Wireguard service trying to reach a non-existing
>> DNS server.
>>
>> (service wireguard-service-type
>> (wireguard-configuration
>> (addresses (list "10.0.0.2/24"))
>> (dns '("10.0.0.50")) #does not exit
>
> This one is similar to:
>
> https://issues.guix.gnu.org/53225
> https://issues.guix.gnu.org/53381
>
> It has to do with the fact that “wg-quick up” blocks until it succeeds
> and that ‘invoke’ gets stuck on ‘waitpid’ until the “wg-quick” process
> terminates.
>
> The solution will be to use something non-blocking instead of ‘invoke’;
> I’m looking into it.
This is fixed in Shepherd 0.9.3, which landed in Guix commit
283d7318c5b312d7129adb6dbeea6ad205ce89d1.
As I wrote, I’m not sure whether it fixes the nginx situation since I
could not reproduce it. I’m closing and let’s open a new issue
specifically for nginx if it comes up again with 0.9.3.
Thanks,
Ludo’.
* bug#58926: Shepherd becomes unresponsive after an interrupt
2022-10-31 12:44 bug#58926: Shepherd becomes unresponsive after an interrupt Mathieu Othacehe
2022-11-10 9:59 ` Ludovic Courtès
2022-11-12 18:10 ` Ludovic Courtès
@ 2022-11-12 18:28 ` Ludovic Courtès
2 siblings, 0 replies; 6+ messages in thread
From: Ludovic Courtès @ 2022-11-12 18:28 UTC (permalink / raw)
To: Mathieu Othacehe; +Cc: 58926
Mathieu Othacehe <othacehe@gnu.org> skribis:
> then hitting C-c, Shepherd becomes totally unresponsive:
>
> sudo herd status
>
>
> and all further Shepherd commands hang forever. I was able to reproduce
> it in two different configurations:
[...]
> 2. On Berlin, while trying to restart nginx.
I can’t reproduce it in a VM.
Before I try it on a production system :-), does anyone have a tip on
how to reproduce it? Or perhaps strace output from a system that
exhibits this bug?
TIA!
Ludo’.
* bug#56674: [Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks
@ 2022-07-20 21:39 Ludovic Courtès
2022-11-13 23:16 ` Ludovic Courtès
0 siblings, 1 reply; 6+ messages in thread
From: Ludovic Courtès @ 2022-07-20 21:39 UTC (permalink / raw)
To: 56674
Hi!
We’ve just had a bad experience with the nginx service on berlin, where
‘herd restart nginx’ would cause shepherd to get stuck forever in
‘waitpid’ on the process that was supposed to start nginx.
The details are unclear, but one thing that is clear is that using ‘waitpid’
(either directly or indirectly with ‘system*’, which is what
‘nginx-service-type’ does) is not great:
1. In the best case, shepherd (as of 0.9.1) is stuck while ‘system*’
is in ‘waitpid’ waiting for child process completion (“stuck” as
in: doesn’t do anything, not even answering ‘herd’ requests or
inetd connections.)
2. I don’t think that can happen with ‘system*’ (because it’s in C),
but generally speaking, there’s a possibility that shepherd’s event
loop will handle child process termination before some other
user-made ‘waitpid’ call does.
Anyway, that’s a bad situation.
So I can think of several ways to address it:
1. Change the nginx service ‘stop’ method to just
(make-kill-destructor), which should work just as well as invoking
“nginx -s stop”.
2. Have Shepherd provide a replacement for ‘system*’.
Thoughts?
Ludo’.
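As a rough sketch of option 1 in the message above, a Guix Shepherd service whose ‘stop’ method is just (make-kill-destructor) could look like the following. This is a hypothetical, simplified definition for illustration and not the actual ‘nginx-service-type’ code: the name ‘example-nginx-shepherd-service’, the ‘example-nginx’ provision and the command-line arguments are assumptions.

```scheme
(use-modules (gnu services shepherd)
             (gnu packages web)
             (guix gexp))

;; Hypothetical service, for illustration only; the real nginx service
;; in Guix is more involved (configuration file, log directories, etc.).
(define example-nginx-shepherd-service
  (shepherd-service
   (provision '(example-nginx))
   (documentation "Run nginx in the foreground under shepherd.")
   ;; Let shepherd fork and track the process itself, so that it knows
   ;; the PID to signal later on.
   (start #~(make-forkexec-constructor
             (list #$(file-append nginx "/sbin/nginx")
                   "-g" "daemon off;")))
   ;; Option 1: stop by sending a signal to the tracked process rather
   ;; than spawning "nginx -s stop" and waiting for it with 'system*'.
   (stop #~(make-kill-destructor))))
```

The point of the design is that ‘stop’ no longer launches a helper process that service code then has to wait for, so there is no ‘waitpid’ call left in the service itself.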
* bug#56674: [Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks
2022-07-20 21:39 bug#56674: [Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks Ludovic Courtès
@ 2022-11-13 23:16 ` Ludovic Courtès
2022-11-14 16:32 ` bug#58926: Shepherd becomes unresponsive after an interrupt Ludovic Courtès
0 siblings, 1 reply; 6+ messages in thread
From: Ludovic Courtès @ 2022-11-13 23:16 UTC (permalink / raw)
To: 56674
[-- Attachment #1: Type: text/plain, Size: 2121 bytes --]
Hi,
Ludovic Courtès <ludo@gnu.org> skribis:
> 1. In the best case, shepherd (as of 0.9.1) is stuck while ‘system*’
> is in ‘waitpid’ waiting for child process completion (“stuck” as
> in: doesn’t do anything, not even answering ‘herd’ requests or
> inetd connections.)
>
> 2. I don’t think that can happen with ‘system*’ (because it’s in C),
> but generally speaking, there’s a possibility that shepherd’s event
> loop will handle child process termination before some other
> user-made ‘waitpid’ call does.
>
> Anyway, that’s a bad situation.
>
> So I can think of several ways to address it:
>
> 1. Change the nginx service ‘stop’ method to just
> (make-kill-destructor), which should work just as well as invoking
> “nginx -s stop”.
>
> 2. Have Shepherd provide a replacement for ‘system*’.
These fresh Shepherd commits install a non-blocking ‘system*’ replacement:
975b0aa service: Provide a non-blocking replacement of 'system*'.
039c7a8 service: Spawn a fiber responsible for process monitoring.
We’ll have to do more testing and probably go for a 0.9.3 release soon.
Protip: you can test the latest shepherd with:
--8<---------------cut here---------------start------------->8---
(operating-system
  ;; …
  (essential-services
   (modify-services (operating-system-default-essential-services
                     this-operating-system)
     (shepherd-root-service-type
      config =>
      (shepherd-configuration
       (shepherd (package
                   (inherit shepherd-0.9)
                   (version "0.9.3pre")
                   (source (git-checkout
                            (url "https://git.savannah.gnu.org/git/shepherd.git")))
                   (native-inputs
                    (modify-inputs (package-native-inputs shepherd-0.9)
                      (append autoconf automake help2man texinfo gnu-gettext))))))))))
--8<---------------cut here---------------end--------------->8---
Full example attached.
Ludo’.
[-- Attachment #2: the example --]
[-- Type: text/plain, Size: 3640 bytes --]
;; This is an operating system configuration template
;; for a "bare bones" setup, with no X11 display server.
(use-modules (gnu) (guix) (guix git))
(use-service-modules networking ssh web vpn shepherd)
(use-package-modules linux screen ssh
                     admin autotools gettext man texinfo)
(operating-system
  (host-name "komputilo")
  (timezone "Europe/Berlin")
  (locale "en_US.utf8")
  ;; Boot in "legacy" BIOS mode, assuming /dev/sdX is the
  ;; target hard disk, and "my-root" is the label of the target
  ;; root file system.
  (bootloader (bootloader-configuration
                (bootloader grub-bootloader)
                (targets '("/dev/sdX"))))
  ;; It's fitting to support the equally bare bones ‘-nographic’
  ;; QEMU option, which also nicely sidesteps forcing QWERTY.
  (kernel-arguments (list "console=ttyS0,115200"))
  (file-systems (cons (file-system
                        (device (file-system-label "my-root"))
                        (mount-point "/")
                        (type "ext4"))
                      %base-file-systems))
  ;; This is where user accounts are specified.  The "root"
  ;; account is implicit, and is initially created with the
  ;; empty password.
  (users (cons (user-account
                (name "alice")
                (comment "Bob's sister")
                (group "users")
                ;; Adding the account to the "wheel" group
                ;; makes it a sudoer.  Adding it to "audio"
                ;; and "video" allows the user to play sound
                ;; and access the webcam.
                (supplementary-groups '("wheel"
                                        "audio" "video")))
               %base-user-accounts))
  ;; Globally-installed packages.
  (packages (append (list screen strace) %base-packages))
  (essential-services
   (modify-services (operating-system-default-essential-services
                     this-operating-system)
     (shepherd-root-service-type
      config =>
      (shepherd-configuration
       (shepherd (package
                   (inherit shepherd-0.9)
                   (version "0.9.3pre")
                   (source (git-checkout
                            (url "https://git.savannah.gnu.org/git/shepherd.git")))
                   (native-inputs
                    (modify-inputs (package-native-inputs shepherd-0.9)
                      (append autoconf automake help2man texinfo gnu-gettext)))))))))
  ;; Add services to the baseline: a DHCP client and
  ;; an SSH server.
  (services (append (list (service dhcp-client-service-type)
                          (service nginx-service-type
                                   (nginx-configuration
                                    (server-blocks
                                     (list (nginx-server-configuration
                                            (listen '("80"))
                                            (server-name '("www.example.org"))
                                            (root "/srv/whatever"))))))
                          (service wireguard-service-type
                                   (wireguard-configuration
                                    (addresses (list "10.0.0.2/24"))
                                    (dns '("10.0.0.50"))))   ;does not exit
                          (service openssh-service-type
                                   (openssh-configuration
                                    (openssh openssh-sans-x)
                                    (port-number 2222))))
                    %base-services)))
* bug#58926: Shepherd becomes unresponsive after an interrupt
2022-11-13 23:16 ` Ludovic Courtès
@ 2022-11-14 16:32 ` Ludovic Courtès
0 siblings, 0 replies; 6+ messages in thread
From: Ludovic Courtès @ 2022-11-14 16:32 UTC (permalink / raw)
To: 56674; +Cc: Mathieu Othacehe, 58926
Hello!
Ludovic Courtès <ludo@gnu.org> skribis:
> These fresh Shepherd commits install a non-blocking ‘system*’ replacement:
>
> 975b0aa service: Provide a non-blocking replacement of 'system*'.
> 039c7a8 service: Spawn a fiber responsible for process monitoring.
>
> We’ll have to do more testing and probably go for a 0.9.3 release soon.
Shepherd commit ada88074f0ab7551fd0f3dce8bf06de971382e79 passes my
tests. It definitely solves the wireguard example and similar things
(uses of ‘system*’ in service constructors/destructors); I can’t tell
for sure about nginx because I haven’t been able to reproduce it in a
VM. I’m interested in ways to reproduce it.
It does look like we could go with 0.9.3 real soon now.
Ludo’.