unofficial mirror of bug-guix@gnu.org 
* bug#58485: [shepherd] Restarting guix-publish fails
@ 2022-10-13  7:51 Lars-Dominik Braun
  2022-10-13  9:28 ` Liliana Marie Prikler
  2022-11-17  8:24 ` Ludovic Courtès
  0 siblings, 2 replies; 15+ messages in thread
From: Lars-Dominik Braun @ 2022-10-13  7:51 UTC (permalink / raw)
  To: 58485; +Cc: Ludovic Courtès

Hi,

it seems that `herd restart guix-publish` stopped working after the
introduction of socket activation into shepherd. This is a problem,
because I restart guix-publish automatically after unattended-upgrades. It
fails with the following error for me:

---snip---
Backtrace:
           7 (primitive-load "/gnu/store/7xrg2sbb529ki6hv99n27svg0fi?")
In ice-9/boot-9.scm:
    724:2  6 (call-with-prompt ("prompt") #<procedure 7f8173184940 ?> ?)
  1752:10  5 (with-exception-handler _ _ #:unwind? _ # _)
In ice-9/eval.scm:
    619:8  4 (_ #(#(#<directory (guile-user) 7f817318ac80>)))
In ice-9/boot-9.scm:
   260:13  3 (for-each #<procedure restart-service (name)> _)
In gnu/services/herd.scm:
    168:4  2 (invoke-action guix-publish restart () #<procedure 7f81?>)
    176:7  1 (failure)
In ice-9/boot-9.scm:
  1685:16  0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
ERROR:
  1. &action-exception-error:
      service: guix-publish
      action: start
      key: system-error
      args: ("bind" "~A" ("Address already in use") (98))
---snap---

Note that, because of socket activation, the guix-publish process only
starts once the URL has been visited at least once; if the process was
never started, a restart works fine. It also works fine the second time
I invoke `herd restart guix-publish`, because `guix-publish` is dead by
that time.

Looking at an strace, shepherd is indeed trying to kill `guix-publish`
and then re-bind to the same address:

---snip---
1     read(23, "(shepherd-command (version 0) (action restart) (service guix-publish) (arguments ()) (directory \"/root\"))", 1024) = 105
1     getpgid(18096)                    = 18096
1     getpgid(0)                        = 0
1     kill(-18096, SIGTERM)             = 0
1     newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}, 0) = 0
1     write(17, "shepherd[1]: Service guix-publish has been stopped.\n", 52) = 52
1     socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 36
1     setsockopt(36, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
1     bind(36, {sa_family=AF_INET, sin_port=htons(8082), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)
1     write(23, "(reply (version 0) (result #f) (error (error (version 0) action-exception start guix-publish system-error (\"bind\" \"~A\" (\"Address already in use\") (98)))) (messages (\"Service guix-publish has been stopped.\")))", 208) = 208
1     close(23)
---snap---

The obvious explanation would be that stopping does not wait for the
process to actually exit. It seems make-kill-destructor does not call
waitpid, and 'running is set unconditionally to #f after 'stop has
finished.
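
The suspected race can be reproduced outside of shepherd. The following
is only a minimal sketch in Python, not shepherd's code: the function
names, the 0.3 s shutdown delay and the use of a forked child as a
stand-in daemon are all assumptions made for illustration. It sends
SIGTERM to a child that owns a listening socket and tries to bind the
same port once before and once after waiting for the child to exit
(POSIX only, since it relies on fork).

```python
import errno
import os
import signal
import socket
import time


def try_bind(port):
    """Return 0 if 127.0.0.1:port can be bound, else the errno from bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        sock.bind(("127.0.0.1", port))
        return 0
    except OSError as err:
        return err.errno
    finally:
        sock.close()


def stop_then_bind(shutdown_delay=0.3):
    """Send SIGTERM to a child owning a listening socket, then try to bind
    the port both before and after waiting for the child to exit."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
    listener.listen(8)
    port = listener.getsockname()[1]

    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: hypothetical stand-in daemon
        try:
            os.close(ready_r)

            def on_term(signum, frame):
                time.sleep(shutdown_delay)   # simulate a slow shutdown
                os._exit(0)

            signal.signal(signal.SIGTERM, on_term)
            os.write(ready_w, b"x")   # tell the parent the handler is installed
            while True:
                signal.pause()
        finally:
            os._exit(1)               # never fall through into parent code

    os.close(ready_w)
    listener.close()                  # from now on only the child holds the socket
    os.read(ready_r, 1)               # wait until the child is ready
    os.close(ready_r)

    os.kill(pid, signal.SIGTERM)
    without_wait = try_bind(port)     # child is still shutting down
    os.waitpid(pid, 0)                # block until the child has really exited
    with_wait = try_bind(port)
    return without_wait, with_wait
```

Under these assumptions the first bind fails with EADDRINUSE and the
second succeeds, which matches the behaviour seen in the strace above.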

Cheers,
Lars






* bug#58485: [shepherd] Restarting guix-publish fails
  2022-10-13  7:51 bug#58485: [shepherd] Restarting guix-publish fails Lars-Dominik Braun
@ 2022-10-13  9:28 ` Liliana Marie Prikler
  2022-10-13 11:35   ` Lars-Dominik Braun
  2022-11-17  8:24 ` Ludovic Courtès
  1 sibling, 1 reply; 15+ messages in thread
From: Liliana Marie Prikler @ 2022-10-13  9:28 UTC (permalink / raw)
  To: Lars-Dominik Braun, 58485; +Cc: Ludovic Courtès

On Thursday, 13.10.2022 at 09:51 +0200, Lars-Dominik Braun wrote:
> The obvious explanation would be that stopping does not wait for the
> process to actually exit. make-kill-destructor does not waitpid it
> seems and 'running is set unconditionally to #f after 'stop has
> finished.
Shouldn't [1] address this very issue?

[1]
http://git.savannah.gnu.org/cgit/guix.git/commit/?id=2a37f174becbafd70591f6eb1d98493c5c1df0e2





* bug#58485: [shepherd] Restarting guix-publish fails
  2022-10-13  9:28 ` Liliana Marie Prikler
@ 2022-10-13 11:35   ` Lars-Dominik Braun
  2022-10-13 13:38     ` Liliana Marie Prikler
  0 siblings, 1 reply; 15+ messages in thread
From: Lars-Dominik Braun @ 2022-10-13 11:35 UTC (permalink / raw)
  To: Liliana Marie Prikler; +Cc: Ludovic Courtès, 58485

Hi Liliana,

> Shouldn't [1] address this very issue?
> [1]
> http://git.savannah.gnu.org/cgit/guix.git/commit/?id=2a37f174becbafd70591f6eb1d98493c5c1df0e2
No. If the process is running, make-systemd-destructor is just an alias
for make-kill-destructor, so it does not matter which one we use in this
case.

Lars





* bug#58485: [shepherd] Restarting guix-publish fails
  2022-10-13 11:35   ` Lars-Dominik Braun
@ 2022-10-13 13:38     ` Liliana Marie Prikler
  2022-10-14  6:18       ` Lars-Dominik Braun
  0 siblings, 1 reply; 15+ messages in thread
From: Liliana Marie Prikler @ 2022-10-13 13:38 UTC (permalink / raw)
  To: Lars-Dominik Braun; +Cc: Ludovic Courtès, 58485

On Thursday, 13.10.2022 at 13:35 +0200, Lars-Dominik Braun wrote:
> Hi Liliana,
> 
> > Shouldn't [1] address this very issue?
> > [1]
> > http://git.savannah.gnu.org/cgit/guix.git/commit/?id=2a37f174becbafd70591f6eb1d98493c5c1df0e2
> no, if the process is running make-systemd-destructor is just an
> alias for make-kill-destructor. So it does not matter which one we
> use in this case.
Ahh, so the issue is that shepherd waits neither for the process to be
actually killed nor for the socket to become available, isn't it?





* bug#58485: [shepherd] Restarting guix-publish fails
  2022-10-13 13:38     ` Liliana Marie Prikler
@ 2022-10-14  6:18       ` Lars-Dominik Braun
  2022-10-14  6:57         ` Liliana Marie Prikler
  0 siblings, 1 reply; 15+ messages in thread
From: Lars-Dominik Braun @ 2022-10-14  6:18 UTC (permalink / raw)
  To: Liliana Marie Prikler; +Cc: Ludovic Courtès, 58485

Hi,

> Ahh, so the issue is that shepherd waits neither for the process to be
> actually killed nor for the socket to become available, isn't it?
I would argue it’s the former, but having either of them would solve
the problem, I think.

Lars






* bug#58485: [shepherd] Restarting guix-publish fails
  2022-10-14  6:18       ` Lars-Dominik Braun
@ 2022-10-14  6:57         ` Liliana Marie Prikler
  0 siblings, 0 replies; 15+ messages in thread
From: Liliana Marie Prikler @ 2022-10-14  6:57 UTC (permalink / raw)
  To: Lars-Dominik Braun; +Cc: Ludovic Courtès, 58485

On Friday, 14.10.2022 at 08:18 +0200, Lars-Dominik Braun wrote:
> Hi,
> 
> > Ahh, so the issue is that shepherd waits neither for the process to
> > be
> > actually killed nor for the socket to become available, isn't it?
> I would argue it’s the former, but having either of them would solve
> the problem, I think.
I think you need both: if the process is killed, but the socket
remains, you need to clean it up.  As far as I'm aware, that does not
happen automatically.

Cheers





* bug#58485: [shepherd] Restarting guix-publish fails
  2022-10-13  7:51 bug#58485: [shepherd] Restarting guix-publish fails Lars-Dominik Braun
  2022-10-13  9:28 ` Liliana Marie Prikler
@ 2022-11-17  8:24 ` Ludovic Courtès
  2022-11-17 10:19   ` Ludovic Courtès
                     ` (2 more replies)
  1 sibling, 3 replies; 15+ messages in thread
From: Ludovic Courtès @ 2022-11-17  8:24 UTC (permalink / raw)
  To: Lars-Dominik Braun; +Cc: 58485

Hi,

Lars-Dominik Braun <lars@6xq.net> wrote:

> 1     kill(-18096, SIGTERM)             = 0
> 1     newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2298, ...}, 0) = 0
> 1     write(17, "shepherd[1]: Service guix-publish has been stopped.\n", 52) = 52
> 1     socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 36
> 1     setsockopt(36, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> 1     bind(36, {sa_family=AF_INET, sin_port=htons(8082), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRINUSE (Address already in use)
> 1     write(23, "(reply (version 0) (result #f) (error (error (version 0) action-exception start guix-publish system-error (\"bind\" \"~A\" (\"Address already in use\") (98)))) (messages (\"Service guix-publish has been stopped.\")))", 208) = 208
> 1     close(23)
> ---snap---
>
> The obvious explanation would be that stopping does not wait for the
> process to actually exit. make-kill-destructor does not waitpid it seems
> and 'running is set unconditionally to #f after 'stop has finished.

Indeed.  This is fixed by Shepherd commit
d97592f58603ff51cb280ae57d413c8731e601b3, which will be in the upcoming
0.9.3 release.

Thanks,
Ludo’.





* bug#58485: [shepherd] Restarting guix-publish fails
  2022-11-17  8:24 ` Ludovic Courtès
@ 2022-11-17 10:19   ` Ludovic Courtès
  2023-02-07  8:39   ` Lars-Dominik Braun
       [not found]   ` <Y+IM4IrO4V05o3V9@zpidnb93>
  2 siblings, 0 replies; 15+ messages in thread
From: Ludovic Courtès @ 2022-11-17 10:19 UTC (permalink / raw)
  To: Lars-Dominik Braun; +Cc: 58485-done

Ludovic Courtès <ludo@gnu.org> wrote:

> Indeed.  This is fixed by Shepherd commit
> d97592f58603ff51cb280ae57d413c8731e601b3, which will be in the upcoming
> 0.9.3 release.

The Shepherd 0.9.3 has landed in Guix commit
283d7318c5b312d7129adb6dbeea6ad205ce89d1.

Ludo’.





* bug#58485: [shepherd] Restarting guix-publish fails
  2022-11-17  8:24 ` Ludovic Courtès
  2022-11-17 10:19   ` Ludovic Courtès
@ 2023-02-07  8:39   ` Lars-Dominik Braun
       [not found]   ` <Y+IM4IrO4V05o3V9@zpidnb93>
  2 siblings, 0 replies; 15+ messages in thread
From: Lars-Dominik Braun @ 2023-02-07  8:39 UTC (permalink / raw)
  To: 58485


Hi Ludo,

> Indeed.  This is fixed by Shepherd commit
> d97592f58603ff51cb280ae57d413c8731e601b3, which will be in the upcoming
> 0.9.3 release.
I’m on 0.9.3 and it works fine with `herd restart` now. But ssh-daemon
has the same issue when being restarted by unattended-upgrades (which
is fatal: without SSH access I have to restart the entire box):

---snip---
shepherd: Service nginx has been stopped.
shepherd: Service nginx has been started.
shepherd: Service collectd has been stopped.
shepherd: Service collectd has been started.
shepherd: Service ntpd has been stopped.
shepherd: Service ntpd has been started.
shepherd: Service guix-publish has been stopped.
shepherd: Service guix-publish has been started.
shepherd: Service ssh-daemon has been stopped.
Backtrace:
           7 (primitive-load "/gnu/store/ip5m1n8kb6p0rfglzpkk17k060a?")
In ice-9/boot-9.scm:
    724:2  6 (call-with-prompt ("prompt") #<procedure 7f89a11f3840 ?> ?)
  1752:10  5 (with-exception-handler _ _ #:unwind? _ # _)
In ice-9/eval.scm:
    619:8  4 (_ #(#(#<directory (guile-user) 7f89a11ffc80>)))
In ice-9/boot-9.scm:
   260:13  3 (for-each #<procedure restart-service (name)> _)
In gnu/services/herd.scm:
    168:4  2 (invoke-action ssh-daemon restart () #<procedure 7f89a0?>)
    176:7  1 (failure)
In ice-9/boot-9.scm:
  1685:16  0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
ERROR:
  1. &action-exception-error:
      service: ssh-daemon
      action: start
      key: system-error
      args: ("bind" "~A" ("Address already in use") (98)
---snap---

Maybe I can strace herd and see what happens exactly.

Thanks,
Lars

-- 
Lars-Dominik Braun
Wissenschaftlicher Mitarbeiter/Research Associate

www.leibniz-psychology.org
ZPID - Leibniz-Institut für Psychologie /
ZPID - Leibniz Institute for Psychology
Universitätsring 15
D-54296 Trier - Germany
Tel.: +49–651–201-4964



* bug#58485: [shepherd] Restarting guix-publish fails
       [not found]   ` <Y+IM4IrO4V05o3V9@zpidnb93>
@ 2023-02-20 10:20     ` Ludovic Courtès
  2023-02-20 13:25       ` Lars-Dominik Braun
  0 siblings, 1 reply; 15+ messages in thread
From: Ludovic Courtès @ 2023-02-20 10:20 UTC (permalink / raw)
  To: Lars-Dominik Braun; +Cc: 58485, Lars-Dominik Braun

Hi Lars,

Lars-Dominik Braun <ldb@leibniz-psychology.org> wrote:

>> Indeed.  This is fixed by Shepherd commit
>> d97592f58603ff51cb280ae57d413c8731e601b3, which will be in the upcoming
>> 0.9.3 release.
> I’m on 0.9.3 and it works fine with `herd restart` now. But ssh-daemon
> has the same issue when being restarted by unattended-upgrades (which
> is fatal, because unable to use SSH I have to restart the entire box):
>
> ---snip---
> shepherd: Service nginx has been stopped.
> shepherd: Service nginx has been started.
> shepherd: Service collectd has been stopped.
> shepherd: Service collectd has been started.
> shepherd: Service ntpd has been stopped.
> shepherd: Service ntpd has been started.
> shepherd: Service guix-publish has been stopped.
> shepherd: Service guix-publish has been started.
> shepherd: Service ssh-daemon has been stopped.
> Backtrace:
>            7 (primitive-load "/gnu/store/ip5m1n8kb6p0rfglzpkk17k060a?")
> In ice-9/boot-9.scm:
>     724:2  6 (call-with-prompt ("prompt") #<procedure 7f89a11f3840 ?> ?)
>   1752:10  5 (with-exception-handler _ _ #:unwind? _ # _)
> In ice-9/eval.scm:
>     619:8  4 (_ #(#(#<directory (guile-user) 7f89a11ffc80>)))
> In ice-9/boot-9.scm:
>    260:13  3 (for-each #<procedure restart-service (name)> _)
> In gnu/services/herd.scm:
>     168:4  2 (invoke-action ssh-daemon restart () #<procedure 7f89a0?>)
>     176:7  1 (failure)
> In ice-9/boot-9.scm:
>   1685:16  0 (raise-exception _ #:continuable? _)
>
> ice-9/boot-9.scm:1685:16: In procedure raise-exception:
> ERROR:
>   1. &action-exception-error:
>       service: ssh-daemon
>       action: start
>       key: system-error
>       args: ("bind" "~A" ("Address already in use") (98)
> ---snap---
>
> Maybe I can strace herd and see what happens exactly.

Can you confirm shepherd (PID 1) is 0.9.3?

‘sudo herd restart ssh-daemon’ works fine on my laptop FWIW.

Note that the situation is different from that of ‘guix publish’: here
it’s inetd style, as opposed to systemd style for ‘guix publish’.

Thanks,
Ludo’.





* bug#58485: [shepherd] Restarting guix-publish fails
  2023-02-20 10:20     ` Ludovic Courtès
@ 2023-02-20 13:25       ` Lars-Dominik Braun
  2023-04-27 21:23         ` Ludovic Courtès
  0 siblings, 1 reply; 15+ messages in thread
From: Lars-Dominik Braun @ 2023-02-20 13:25 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 58485, Lars-Dominik Braun


Hi Ludo,

> Can you confirm shepherd (PID 1) is 0.9.3?
it is:

root         1  0.2  0.2 308148 76816 ?        Sl   Feb07  52:08 /gnu/store/kphp5d85rrb3q1rdc2lfqc1mdklwh3qp-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/4nw0zb4swga0cb8i35nvng3rg6z5qm8p-shepherd-0.9.3/bin/shepherd --config /gnu/store/cvrai6z8777jf7860rnvppfznl1lcxi1-shepherd.conf
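
The same check can be scripted. This is a small sketch in Python that
assumes the version is encoded in the store path of the running binary,
as in the process listing above; the helper names and the regular
expression are made up for illustration, and reading /proc/1/cmdline is
Linux-specific.

```python
import re
from typing import Optional

# Matches store paths such as ".../<hash>-shepherd-0.9.3/bin/shepherd".
_VERSION_RE = re.compile(r"-shepherd-(\d+(?:\.\d+)*)/bin/shepherd")


def shepherd_version(cmdline: str) -> Optional[str]:
    """Extract the Shepherd version from a command line, or None if absent."""
    match = _VERSION_RE.search(cmdline)
    return match.group(1) if match else None


def pid1_cmdline() -> str:
    """Return the command line of PID 1 with NUL separators turned into spaces."""
    with open("/proc/1/cmdline", "rb") as handle:
        raw = handle.read()
    return raw.replace(b"\0", b" ").decode(errors="replace").strip()
```

Calling shepherd_version(pid1_cmdline()) reports the version that PID 1
is actually running, which may differ from the version installed in the
current system profile until the next reboot.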

> ‘sudo herd restart ssh-daemon’ works fine on my laptop FWIW.
This works fine too. Only unattended-upgrades seems to have this issue :/

The strace looks unsuspicious right now:

---snip---
1     14:12:15.117035 read(21, "(shepherd-command (version 0) (action restart) (service ssh-daemon) (arguments ()) (directory \"/root\"))", 1024) = 103
1     14:12:15.117254 close(27)         = 0
1     14:12:15.117283 close(30)         = 0
1     14:12:15.117416 newfstatat(AT_FDCWD, "/etc/localtime", {st_dev=makedev(0x8, 0x2), st_ino=110100491, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=2298, st_atime=1676898665 /* 2023-02-20T14:11:05.338746772+0100 */, st_atime_nsec=338746772, st_mtime=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_mtime_nsec=874743456, st_ctime=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_ctime_nsec=874743456}, 0) = 0
1     14:12:15.117475 write(17, "shepherd[1]: Service ssh-daemon has been stopped.\n", 50) = 50
1     14:12:15.117524 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 26
1     14:12:15.117561 setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
1     14:12:15.117598 bind(26, {sa_family=AF_INET, sin_port=htons(2222), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
1     14:12:15.117724 write(21, "(reply (version 0) (result #f) (error (error (version 0) action-exception start ssh-daemon system-error (\"bind\" \"~A\" (\"Address already in use\") (98)))) (messages (\"Service ssh-daemon has been stopped.\")))", 204) = 204
1     14:12:15.117754 close(21)         = 0
---snap---

But nginx seems to have the same issue, except that it does not fail
entirely and succeeds after waiting a short period of time:

---snip---
2023/02/20 14:12:14 [notice] 7136#0: signal 15 (SIGTERM) received from 6644, exiting
2023/02/20 14:12:14 [notice] 7137#0: exiting
2023/02/20 14:12:14 [notice] 7137#0: exit
2023/02/20 14:12:14 [notice] 7136#0: signal 17 (SIGCHLD) received from 7137
2023/02/20 14:12:14 [notice] 7136#0: worker process 7137 exited with code 0
2023/02/20 14:12:14 [emerg] 6645#0: bind() to 0.0.0.0:443 failed (98: Address already in use)
2023/02/20 14:12:14 [emerg] 6645#0: bind() to 0.0.0.0:80 failed (98: Address already in use)
2023/02/20 14:12:14 [emerg] 6645#0: bind() to [::]:80 failed (98: Address already in use)
2023/02/20 14:12:14 [notice] 7136#0: exit
2023/02/20 14:12:14 [notice] 6645#0: try again to bind() after 500ms
2023/02/20 14:12:14 [notice] 6645#0: using the "epoll" event method
2023/02/20 14:12:14 [notice] 6645#0: nginx/1.23.3
2023/02/20 14:12:14 [notice] 6645#0: OS: Linux 6.1.9
2023/02/20 14:12:14 [notice] 6645#0: getrlimit(RLIMIT_NOFILE): 1024:4096
2023/02/20 14:12:14 [notice] 6648#0: start worker processes
2023/02/20 14:12:14 [notice] 6648#0: start worker process 6649
2023/02/20 14:12:32 [info] 6649#0: epoll_wait() failed (4: Interrupted system call)
---snap---

I see we’re already using SO_REUSEADDR, so all of this is a bit of a
mystery to me.
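
For reference, SO_REUSEADDR on its own does not rule out EADDRINUSE. On
Linux, socket(7) documents that a socket with this option may bind
except when there is an active listening socket bound to the address.
The sketch below (Python; the function names are hypothetical) shows
only that documented behaviour and makes no claim about what was
holding port 2222 on the machine in question.

```python
import errno
import socket


def bind_status(port):
    """Try to bind 127.0.0.1:port with SO_REUSEADDR; return 0 or the errno."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        sock.bind(("127.0.0.1", port))
        return 0
    except OSError as err:
        return err.errno
    finally:
        sock.close()


def reuseaddr_demo():
    """Bind a port while another socket listens on it, then after it is closed."""
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    holder.bind(("127.0.0.1", 0))     # port 0: let the kernel pick a free port
    holder.listen(8)
    port = holder.getsockname()[1]

    while_listening = bind_status(port)   # another listener still exists
    holder.close()
    after_close = bind_status(port)       # listener is gone
    return while_listening, after_close
```

So an EADDRINUSE despite SO_REUSEADDR would be consistent with some
socket still listening on the port at the moment of the bind call.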

Thanks,
Lars

-- 
Lars-Dominik Braun
Wissenschaftlicher Mitarbeiter/Research Associate

www.leibniz-psychology.org
ZPID - Leibniz-Institut für Psychologie /
ZPID - Leibniz Institute for Psychology
Universitätsring 15
D-54296 Trier - Germany
Tel.: +49–651–201-4964



* bug#58485: [shepherd] Restarting guix-publish fails
  2023-02-20 13:25       ` Lars-Dominik Braun
@ 2023-04-27 21:23         ` Ludovic Courtès
  2023-04-28 12:31           ` Lars-Dominik Braun
  2023-04-28 13:09           ` bug#58485: [shepherd] EADDRINUSE while restarting ssh-daemon Ludovic Courtès
  0 siblings, 2 replies; 15+ messages in thread
From: Ludovic Courtès @ 2023-04-27 21:23 UTC (permalink / raw)
  To: Lars-Dominik Braun; +Cc: 58485, Lars-Dominik Braun

Hi,

Sorry for the late reply.  I’m going through Shepherd bug reports and I
remembered this discussion…

Lars-Dominik Braun <ldb@leibniz-psychology.org> wrote:

>> Can you confirm shepherd (PID 1) is 0.9.3?
> it is:
>
> root         1  0.2  0.2 308148 76816 ?        Sl   Feb07  52:08 /gnu/store/kphp5d85rrb3q1rdc2lfqc1mdklwh3qp-guile-3.0.9/bin/guile --no-auto-compile /gnu/store/4nw0zb4swga0cb8i35nvng3rg6z5qm8p-shepherd-0.9.3/bin/shepherd --config /gnu/store/cvrai6z8777jf7860rnvppfznl1lcxi1-shepherd.conf
>
>> ‘sudo herd restart ssh-daemon’ works fine on my laptop FWIW.
> This works fine too. Only unattended-upgrades seems to have this issue :/
>
> The strace looks unsuspicious right now:
>
> ---snip---
> 1     14:12:15.117035 read(21, "(shepherd-command (version 0) (action restart) (service ssh-daemon) (arguments ()) (directory \"/root\"))", 1024) = 103
> 1     14:12:15.117254 close(27)         = 0
> 1     14:12:15.117283 close(30)         = 0
> 1     14:12:15.117416 newfstatat(AT_FDCWD, "/etc/localtime", {st_dev=makedev(0x8, 0x2), st_ino=110100491, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=2298, st_atime=1676898665 /* 2023-02-20T14:11:05.338746772+0100 */, st_atime_nsec=338746772, st_mtime=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_mtime_nsec=874743456, st_ctime=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_ctime_nsec=874743456}, 0) = 0
> 1     14:12:15.117475 write(17, "shepherd[1]: Service ssh-daemon has been stopped.\n", 50) = 50
> 1     14:12:15.117524 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 26
> 1     14:12:15.117561 setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> 1     14:12:15.117598 bind(26, {sa_family=AF_INET, sin_port=htons(2222), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
> 1     14:12:15.117724 write(21, "(reply (version 0) (result #f) (error (error (version 0) action-exception start ssh-daemon system-error (\"bind\" \"~A\" (\"Address already in use\") (98)))) (messages (\"Service ssh-daemon has been stopped.\")))", 204) = 204
> 1     14:12:15.117754 close(21)         = 0

This suggests ‘bind’ can return EADDRINUSE even though the sockets have
been closed before (presumably file descriptors 27 and 30 above).

Can you confirm nothing else is competing to bind port 2222 on that
machine?

I tried to reproduce it with something as brutal as:

  while sudo herd restart sshd ; do : ; done

… to no avail (I’m on current Shepherd ‘master’ though).

Maybe we should just have shepherd retry upon EADDRINUSE (like nginx
does, as you wrote), though I’d like to understand under what conditions
we can get EADDRINUSE in the first place.
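
Such a retry loop could look roughly like the following. This is a
sketch in Python, not the code that went into shepherd: the function
name, the number of attempts and the listen backlog are assumptions,
and the 0.5 s default delay merely mirrors the "try again to bind()
after 500ms" message in the nginx log quoted earlier.

```python
import errno
import socket
import time


def bind_with_retry(host, port, attempts=5, delay=0.5):
    """Bind and listen on (host, port), retrying while bind() reports
    EADDRINUSE. Any other error, or running out of attempts, is raised."""
    last_error = None
    for attempt in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except OSError as err:
            sock.close()
            if err.errno != errno.EADDRINUSE:
                raise                      # unrelated failure: do not retry
            last_error = err
            if attempt + 1 < attempts:
                time.sleep(delay)          # give the previous owner time to go away
        else:
            sock.listen(8)
            return sock
    raise last_error
```

A bounded number of attempts keeps a genuinely conflicting service from
blocking the caller forever, while still covering the short window in
which the previous listener has not gone away yet.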

Ludo’.





* bug#58485: [shepherd] Restarting guix-publish fails
  2023-04-27 21:23         ` Ludovic Courtès
@ 2023-04-28 12:31           ` Lars-Dominik Braun
  2023-04-28 13:09           ` bug#58485: [shepherd] EADDRINUSE while restarting ssh-daemon Ludovic Courtès
  1 sibling, 0 replies; 15+ messages in thread
From: Lars-Dominik Braun @ 2023-04-28 12:31 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 58485, Lars-Dominik Braun


Hi,

> Can you confirm nothing else is competing to bind port 2222 on that
> machine?

I am not sure how to confirm that with certainty (it’s hard to run lsof
at exactly the right moment), but according to the OS config only SSHd
is supposed to use port 2222, see [1].

[1] https://github.com/leibniz-psychology/psychnotebook-deploy/blob/master/src/zpid/machines/patna/os.scm

Cheers,
Lars

-- 
Lars-Dominik Braun
Wissenschaftlicher Mitarbeiter/Research Associate

www.leibniz-psychology.org
ZPID - Leibniz-Institut für Psychologie /
ZPID - Leibniz Institute for Psychology
Universitätsring 15
D-54296 Trier - Germany
Tel.: +49–651–201-4964



* bug#58485: [shepherd] EADDRINUSE while restarting ssh-daemon
  2023-04-27 21:23         ` Ludovic Courtès
  2023-04-28 12:31           ` Lars-Dominik Braun
@ 2023-04-28 13:09           ` Ludovic Courtès
  2023-06-11 14:20             ` Ludovic Courtès
  1 sibling, 1 reply; 15+ messages in thread
From: Ludovic Courtès @ 2023-04-28 13:09 UTC (permalink / raw)
  To: Lars-Dominik Braun; +Cc: 58485, Lars-Dominik Braun

Hi,

Ludovic Courtès <ludo@gnu.org> wrote:

>> 1     14:12:15.117035 read(21, "(shepherd-command (version 0) (action restart) (service ssh-daemon) (arguments ()) (directory \"/root\"))", 1024) = 103
>> 1     14:12:15.117254 close(27)         = 0
>> 1     14:12:15.117283 close(30)         = 0
>> 1     14:12:15.117416 newfstatat(AT_FDCWD, "/etc/localtime", {st_dev=makedev(0x8, 0x2), st_ino=110100491, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=2298, st_atime=1676898665 /* 2023-02-20T14:11:05.338746772+0100 */, st_atime_nsec=338746772, st_mtime=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_mtime_nsec=874743456, st_ctime=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_ctime_nsec=874743456}, 0) = 0
>> 1     14:12:15.117475 write(17, "shepherd[1]: Service ssh-daemon has been stopped.\n", 50) = 50
>> 1     14:12:15.117524 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 26
>> 1     14:12:15.117561 setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>> 1     14:12:15.117598 bind(26, {sa_family=AF_INET, sin_port=htons(2222), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
>> 1     14:12:15.117724 write(21, "(reply (version 0) (result #f) (error (error (version 0) action-exception start ssh-daemon system-error (\"bind\" \"~A\" (\"Address already in use\") (98)))) (messages (\"Service ssh-daemon has been stopped.\")))", 204) = 204
>> 1     14:12:15.117754 close(21)         = 0

[...]

> Maybe we should just have shepherd retry upon EADDRINUSE (like nginx
> does, as you wrote), though I’d like to understand under what conditions
> we can get EADDRINUSE in the first place.

Done:

  https://git.savannah.gnu.org/cgit/shepherd.git/commit/?id=41789ee8d0e164967f9ca196db4e9601400a462e

Ludo’.





* bug#58485: [shepherd] EADDRINUSE while restarting ssh-daemon
  2023-04-28 13:09           ` bug#58485: [shepherd] EADDRINUSE while restarting ssh-daemon Ludovic Courtès
@ 2023-06-11 14:20             ` Ludovic Courtès
  0 siblings, 0 replies; 15+ messages in thread
From: Ludovic Courtès @ 2023-06-11 14:20 UTC (permalink / raw)
  To: Lars-Dominik Braun; +Cc: Lars-Dominik Braun, 58485-done

Hi Lars,

Ludovic Courtès <ludo@gnu.org> wrote:

> Ludovic Courtès <ludo@gnu.org> wrote:
>
>>> 1     14:12:15.117035 read(21, "(shepherd-command (version 0) (action restart) (service ssh-daemon) (arguments ()) (directory \"/root\"))", 1024) = 103
>>> 1     14:12:15.117254 close(27)         = 0
>>> 1     14:12:15.117283 close(30)         = 0
>>> 1     14:12:15.117416 newfstatat(AT_FDCWD, "/etc/localtime", {st_dev=makedev(0x8, 0x2), st_ino=110100491, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=2298, st_atime=1676898665 /* 2023-02-20T14:11:05.338746772+0100 */, st_atime_nsec=338746772, st_mtime=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_mtime_nsec=874743456, st_ctime=1676898664 /* 2023-02-20T14:11:04.874743456+0100 */, st_ctime_nsec=874743456}, 0) = 0
>>> 1     14:12:15.117475 write(17, "shepherd[1]: Service ssh-daemon has been stopped.\n", 50) = 50
>>> 1     14:12:15.117524 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 26
>>> 1     14:12:15.117561 setsockopt(26, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>>> 1     14:12:15.117598 bind(26, {sa_family=AF_INET, sin_port=htons(2222), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
>>> 1     14:12:15.117724 write(21, "(reply (version 0) (result #f) (error (error (version 0) action-exception start ssh-daemon system-error (\"bind\" \"~A\" (\"Address already in use\") (98)))) (messages (\"Service ssh-daemon has been stopped.\")))", 204) = 204
>>> 1     14:12:15.117754 close(21)         = 0
>
> [...]
>
>> Maybe we should just have shepherd retry upon EADDRINUSE (like nginx
>> does, as you wrote), though I’d like to understand under what conditions
>> we can get EADDRINUSE in the first place.
>
> Done:
>
>   https://git.savannah.gnu.org/cgit/shepherd.git/commit/?id=41789ee8d0e164967f9ca196db4e9601400a462e

I’m assuming that this is fixed in Shepherd 0.10.x.  Please reopen if
you stumble upon this issue again.

Ludo’.




