bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system


* bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system
@ 2022-01-27 11:32 Attila Lendvai
  2022-01-27 12:13 ` bug#53580: (No Subject) Attila Lendvai
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Attila Lendvai @ 2022-01-27 11:32 UTC (permalink / raw)
  To: 53580

the system seems to work fine. GNOME is up, i can log in with my user, and everything seems to work, except herd.

i encounter this broken state every once in a while. IRC logs also mention this multiple times, but without much insight:

https://logs.guix.gnu.org/guix/search?query=%2Fvar%2Frun%2Fshepherd%2Fsocket

```
# herd status
error: connect: /var/run/shepherd/socket: No such file or directory

# ps afxu | grep shepherd
root         1  0.0  0.3 160788 43684 ?        Sl   11:51   0:00 /gnu/store/cnfsv9ywaacyafkqdqsv2ry8f01yr7a9-guile-3.0.7/bin/guile --no-auto-compile /gnu/store/vza48khbaq0fdmcsrn27xj5y5yy76z6l-shepherd-0.8.1/bin/shepherd --config /gnu/store/q4nd803lxrlkr60s8sx88gvpb6c7lxyd-shepherd.conf

# uptime
12:26:44  up   0:34,  2 users,  load average: 0.00, 0.01, 0.00
```

looking at shepherd's code:

```
(define (call-with-server-socket file-name proc)
  "Call PROC, passing it a listening socket at FILE-NAME and deleting the
socket file at FILE-NAME upon exit of PROC.  Return the values of PROC."
  (let ((sock (open-server-socket file-name)))
    (dynamic-wind
      noop
      (lambda () (proc sock))
      (lambda ()
        (close sock)
        (catch-system-error (delete-file file-name))))))
```

maybe this is caused by some call/cc magic that causes an unwind that deletes the file, but then continues?

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Above all, do not lose your desire to walk: Every day I walk myself into a state of well-being and walk away from every illness; I have walked myself into my best thoughts, and I know of no thought so burdensome that one cannot walk away from it.”
	— Søren Kierkegaard (1813–1855)






* bug#53580: (No Subject)
  2022-01-27 11:32 bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system Attila Lendvai
@ 2022-01-27 12:13 ` Attila Lendvai
  2022-02-01 11:06   ` Efraim Flashner
  2022-02-01 19:28 ` bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system Maxime Devos
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Attila Lendvai @ 2022-01-27 12:13 UTC (permalink / raw)
  To: 53580@debbugs.gnu.org

i forgot to add that i'm working on a shepherd service, and this may be due to errors in the service's user code, like the start gexp.


* bug#53580: (No Subject)
  2022-01-27 12:13 ` bug#53580: (No Subject) Attila Lendvai
@ 2022-02-01 11:06   ` Efraim Flashner
  0 siblings, 0 replies; 15+ messages in thread
From: Efraim Flashner @ 2022-02-01 11:06 UTC (permalink / raw)
  To: Attila Lendvai; +Cc: 53580@debbugs.gnu.org

On Thu, Jan 27, 2022 at 12:13:28PM +0000, Attila Lendvai wrote:
> i forgot to add that i'm working on a shepherd service, and this may be due to errors in the service's user code, like the start gexp.

This is generally when I see this type of error. I normally try to
create a minimal VM and launch that when I'm trying out a new service.

-- 
Efraim Flashner   <efraim@flashner.co.il>   רנשלפ םירפא
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted


* bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system
  2022-01-27 11:32 bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system Attila Lendvai
  2022-01-27 12:13 ` bug#53580: (No Subject) Attila Lendvai
@ 2022-02-01 19:28 ` Maxime Devos
  2022-04-04  7:15   ` Attila Lendvai
  2023-05-18 20:12 ` Ludovic Courtès
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Maxime Devos @ 2022-02-01 19:28 UTC (permalink / raw)
  To: Attila Lendvai, 53580

Attila Lendvai schreef op do 27-01-2022 om 11:32 [+0000]:
> (define (call-with-server-socket file-name proc)
>   "Call PROC, passing it a listening socket at FILE-NAME and deleting the
> socket file at FILE-NAME upon exit of PROC.  Return the values of PROC."
>   (let ((sock (open-server-socket file-name)))
>     (dynamic-wind
>       noop
>       (lambda () (proc sock))
>       (lambda ()
>         (close sock)
>         (catch-system-error (delete-file file-name))))))
>
> maybe this is caused by some call/cc magic that causes an unwind that deletes the file, but then continues?

Shepherd doesn't use call/cc anywhere.  However, it does use
_delimited_ continuations, even though only through let/ec and
'guard'/'catch'/...  More generally, call/cc is typically unused in
(Guile) Scheme code, and call-with-prompt / abort-to-prompt / shift /
reset / % are used instead.

My guess as to what happens: the start code of a shepherd service
fails between 'fork' and 'exec', with an exception.  The exception
isn't caught (or is caught and reraised), so the 'out' guard of the
'dynamic-wind' is entered and the file representing the socket is
deleted.

If that's indeed the case, it might be a good idea to install
some exception handlers in fork+exec-command and friends (including
make-forkexec-constructor/container), to make shepherd more robust
w.r.t. services failing to start.
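
A minimal sketch of that failure mode, using only core Guile.  The names
`socket-present?` and `failing-start` are made-up stand-ins for the socket
file and for a service's start code; this is not shepherd's code:

```scheme
;; Sketch only: a dynamic-wind "out" guard runs when an exception
;; escapes the body, just like the socket cleanup quoted above.
(define socket-present? #t)          ; hypothetical stand-in for the socket file

(define (failing-start)              ; hypothetical service start code
  (error "service failed to start"))

(define outcome
  (catch #t
    (lambda ()
      (dynamic-wind
        (lambda () #t)                            ; in guard: nothing to do
        (lambda () (failing-start) 'started)      ; body raises
        (lambda () (set! socket-present? #f))))   ; out guard: "delete" the socket
    (lambda (key . args) 'failed)))

(display (list outcome socket-present?))
(newline)
```

By the time the handler runs, the out guard has already fired, so the
"socket" is gone even though the process itself keeps running.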

Greetings,
Maxime.


* bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system
  2022-02-01 19:28 ` bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system Maxime Devos
@ 2022-04-04  7:15   ` Attila Lendvai
  0 siblings, 0 replies; 15+ messages in thread
From: Attila Lendvai @ 2022-04-04  7:15 UTC (permalink / raw)
  To: Maxime Devos; +Cc: 53580

FTR,

the issue is that when Shepherd is booting up, i.e. starting from its config file, it calls the start forms without guarding against any possible exceptions. any error propagates up beyond the loop, all the way to an unwind-protect that deletes the socket.

the reason my system seemed fully functional is that my service was pretty much the last one to be started.
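
a rough sketch of what guarding each start form could look like, in plain Guile. `start-all` and the (name . thunk) shape of the service list are assumptions made up for this illustration, they are not how shepherd represents services:

```scheme
;; Sketch only: start every service, but guard each start form so that
;; one failure cannot propagate out of the loop.
(define (start-all services)
  ;; SERVICES is a list of (name . start-thunk) pairs (hypothetical shape).
  ;; Return the names of the services whose start thunk raised.
  (let loop ((services services) (failed '()))
    (if (null? services)
        (reverse failed)
        (let* ((name  (car (car services)))
               (start (cdr (car services)))
               (ok?   (catch #t
                        (lambda () (start) #t)
                        (lambda (key . args) #f))))
          (loop (cdr services)
                (if ok? failed (cons name failed)))))))

(define failed-services
  (start-all
   (list (cons 'networking (lambda () #t))
         (cons 'my-service (lambda () (error "bad start gexp")))
         (cons 'sshd       (lambda () #t)))))

(display failed-services)
(newline)
```

here the failing service is only recorded, and the services after it still get their turn.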

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“I made up the term 'object-oriented', and I can tell you I didn't have C++ in mind.”
	— Alan Kay, OOPSLA '97


* bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system
  2022-01-27 11:32 bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system Attila Lendvai
  2022-01-27 12:13 ` bug#53580: (No Subject) Attila Lendvai
  2022-02-01 19:28 ` bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system Maxime Devos
@ 2023-05-18 20:12 ` Ludovic Courtès
  2023-05-27 10:33 ` bug#53580: shepherd's architecture Attila Lendvai
  2023-06-11 14:18 ` Ludovic Courtès
  4 siblings, 0 replies; 15+ messages in thread
From: Ludovic Courtès @ 2023-05-18 20:12 UTC (permalink / raw)
  To: Attila Lendvai; +Cc: 53580

Hello Attila,

I had totally overlooked this bug report.

Attila Lendvai <attila@lendvai.name> skribis:

> the system seems to work fine. Gnome is up, i can log in with my user, and everything seems to work, except herd.
>
> i encounter this broken state every once in a while. IRC logs also mention this multiple times, but without many insights:
>
> https://logs.guix.gnu.org/guix/search?query=%2Fvar%2Frun%2Fshepherd%2Fsocket
>
> ```
> # herd status
> error: connect: /var/run/shepherd/socket: No such file or directory

[...]

> the issue is that when Shepherd is booting up, i.e. starting from its config file, it calls the start forms without guarding for any possible exceptions. any error propagates up beyond the loop and up until an unwind protect that deletes the socket.
>
> the reason my system seemed fully functional is that my service was pretty much the last one to be started.

Currently (in 0.10.0), the ‘run-daemon’ procedure loads the user’s
config file before listening on /var/run/shepherd/socket.  However, if
an exception is thrown from the config file, it stops:

--8<---------------cut here---------------start------------->8---
$ echo '(error "oops")' > /tmp/conf.scm
$ ./shepherd -I -s sock -c /tmp/conf.scm
Starting service root...
Service root started.
Service root running with value #t.
Service root has been started.
misc-error(#f "~A" ("oops") #f)

Some deprecated features have been used.  Set the environment
variable GUILE_WARN_DEPRECATED to "detailed" and rerun the
program to get more information.  Set it to "no" to suppress
this message.
$ echo $?
1
--8<---------------cut here---------------end--------------->8---

Now, while the config file is being evaluated, shepherd does not listen
on its socket, which isn’t great.

This is mitigated by the use of ‘start-in-the-background’ (introduced in
0.9.0) in the config file, which, as the name implies, doesn’t block
further operation.

So I *think* we’re mostly okay now.  The one thing we could do is load
the whole config file in a separate fiber, and maybe it’s fine to keep
going even when there’s an error during config file evaluation?

WDYT?
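
A minimal sketch of the "keep going" half of that idea, assuming plain
Guile without Fibers.  `load-config-safely` is a made-up name, not a
shepherd procedure, and the temporary file only exists to make the
example self-contained:

```scheme
;; Sketch only: evaluate a config file, report any error, keep going.
(define (load-config-safely file)
  ;; Return #t if FILE loaded without error, #f otherwise.
  (catch #t
    (lambda () (primitive-load file) #t)
    (lambda (key . args)
      (format (current-error-port)
              "error while loading ~a: ~s ~s~%" file key args)
      #f)))

;; Write a config file that fails, as in the transcript above.
(define conf
  (string-append "/tmp/conf-" (number->string (getpid)) ".scm"))
(call-with-output-file conf
  (lambda (port) (write '(error "oops") port)))

(define loaded? (load-config-safely conf))
(delete-file conf)

(display (if loaded? "config loaded" "config failed, still running"))
(newline)
```

Whether continuing with a half-evaluated configuration is desirable is
exactly the open question; the sketch only shows that it is cheap to do.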

Thanks,
Ludo’.





* bug#53580: shepherd's architecture
  2022-01-27 11:32 bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system Attila Lendvai
                   ` (2 preceding siblings ...)
  2023-05-18 20:12 ` Ludovic Courtès
@ 2023-05-27 10:33 ` Attila Lendvai
  2023-05-28 22:23   ` Attila Lendvai
  2023-06-06 15:16   ` bug#53580: " Ludovic Courtès
  2023-06-11 14:18 ` Ludovic Courtès
  4 siblings, 2 replies; 15+ messages in thread
From: Attila Lendvai @ 2023-05-27 10:33 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 53580

[forked from: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system]

> So I think we’re mostly okay now. The one thing we could do is load
> the whole config file in a separate fiber, and maybe it’s fine to keep
> going even when there’s an error during config file evaluation?
>
> WDYT?


i think there's a fundamental issue to be resolved here, and addressing that would implicitly resolve the entire class of issues that this one belongs to.

guile (shepherd) is run as the init process, and because of that it may not exit or be respawned. but at the same time when we reconfigure a guix system, then shepherd's config should not only be reloaded, but its internal state merged with the new config, and potentially even with an evolved shepherd codebase.

i still lack a proper mental model of all this to successfully predict what will happen when i `guix system reconfigure` after i `guix pull`-ed my service code, and/or changed the config of my services.

--------

this problem of migration is pretty much a CS research topic...

ideally, there should be a non-shepherd-specific protocol defined for such migrations, and the new shepherd codebase could migrate its state from the old to the new, with most of the migration code being automatic. some of it must be hand-written, as required by some semantic changes.

even more ideally, we should have reflexive systems; admit that source code is a graph, and store it as one (as opposed to a string of characters); and our systems should have orthogonal persistence, etc, etc... a far cry from what we have now.

Fare's excellent blog has some visionary thoughts on this, especially in:

https://ngnghm.github.io/blog/2015/09/08/chapter-5-non-stop-change/

but given that we will not have these any time soon... what can we do now?

--------

note: what follows are wild ideas, and i'm not sure i have the necessary understanding of the involved subsystems to properly judge their feasibility... so take them with a pinch of salt.

idea 1
--------

it doesn't seem to be an insurmountable task to make sure that guile can safely unlink a module from its heap, check if there are any references into the module to be dropped, and then reload this module from disk.

the already running fibers would keep the required code in the heap until after they are stopped/restarted. then the module would get GC'd eventually.

this would help solve the problem that a reconfigured service may have a completely different start/stop code. and by taking some careful shortcuts we may be able to make reloading work without having to stop the service process in question.
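
a tiny sketch of the "old code stays alive while something still references it" part, in plain Guile. there is no real module reloading or fiber here; `start-service` and `running-fiber` are made-up names, and the `set!` merely stands in for a reload:

```scheme
;; Sketch only: rebinding a top-level name does not affect code that
;; already holds the old procedure.
(define (start-service) 'old-code)

;; Stand-in for a running fiber that captured the procedure earlier.
(define running-fiber
  (let ((start start-service))
    (lambda () (start))))

;; "Reload": rebind the name to new code.
(set! start-service (lambda () 'new-code))

(display (list (running-fiber) (start-service)))
(newline)
```

the closure keeps calling the old procedure until it is itself replaced, which is the property the idea relies on.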

idea 2
--------

another, probably better idea:

split up shepherd's codebase into isolated parts:

 1) the init process

 2) the service runners, which are spawned by 1). let's call this part
    'the runner'.

 3) the CLI scripts that implement stuff like `reboot` by sending a
    message to 1).

the runner would spawn and manage the actual daemon binaries/processes.

the init process would communicate with the runners through a channel/pipe that is created when the runners are spawned. i.e. here we wouldn't need an IPC socket file like we need for the communication between the scripts and the init process.

AFAIU the internal structure of shepherd is already turning into something like this with the use of fibers and channels. i suspect Ludo has something like this on his mind already.

in this setup most of the complexity and the evolution of the shepherd codebase would happen in the runner, and the other two parts could be kept minimal and would rarely need to change (and thus require a reboot).

the need for a reboot could be detected by noticing that the compiled binary of the init process has changed compared to what is currently running as PID 1.

the driver process of a service could be reloaded/respawned the next time when the daemon is stopped or it quits unexpectedly.
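
a sketch of the reboot check, under the assumption that comparing command lines is enough (PID 1 is guile running the shepherd script, so both store paths matter). `init-changed?` is a made-up name and the store paths below are invented:

```scheme
;; Sketch only: decide whether the init process differs from the one the
;; new system generation would run, by comparing command lines.
(define (init-changed? running-command new-command)
  ;; Both arguments are lists of strings.  Store file names differ
  ;; whenever the build differs, so plain equality is enough here.
  (not (equal? running-command new-command)))

;; On a live system RUNNING-COMMAND could come from /proc/1/cmdline
;; (NUL-separated), which typically needs root; these values are invented.
(define running
  '("/gnu/store/aaa-guile/bin/guile" "/gnu/store/bbb-shepherd/bin/shepherd"))
(define new
  '("/gnu/store/aaa-guile/bin/guile" "/gnu/store/ccc-shepherd/bin/shepherd"))

(display (init-changed? running new))
(newline)
```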

--------

recently i've successfully written a shepherd service that spawns a daemon, and from a fiber it does two-way communication with the daemon using a pipe connected to the daemon's stdio. i guess that counts as a proof of concept for the second idea, but i'm not sure about its stability. a stuck/failing service is a different issue than a stuck/failing init process.

for reference, the spawning of the daemon:

https://github.com/attila-lendvai/guix-crypto/blob/8f996239bb8c2a1103c3e54605faf680fe1ed093/src/guix-crypto/services/swarm.scm#L315

the fiber's code that talks to it:

https://github.com/attila-lendvai/guix-crypto/blob/8f996239bb8c2a1103c3e54605faf680fe1ed093/src/guix-crypto/swarm-utils.scm#L133
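
for readers who don't want to follow the links, a much simplified sketch of the same pattern using Guile's (ice-9 popen). this is not the code from the repository above: it blocks instead of running in a fiber, `echo-through-child` is a made-up name, and `cat` merely stands in for the daemon:

```scheme
(use-modules (ice-9 popen) (ice-9 rdelim))

;; Sketch only: spawn a child and talk to it over its stdin/stdout.
;; `cat` stands in for the daemon; it echoes back whatever it reads.
(define (echo-through-child message)
  (let ((port (open-pipe* OPEN_BOTH "cat")))
    (write-line message port)         ; send one line to the child's stdin
    (force-output port)               ; flush, or the child never sees it
    (let ((reply (read-line port)))   ; read one line from the child's stdout
      (close-pipe port)               ; close the pipe and reap the child
      reply)))

(display (echo-through-child "ping"))
(newline)
```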

--
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“We reject: kings, presidents and voting. We believe in: rough consensus and running code.”
	— David Clark for the IETF






* shepherd's architecture
  2023-05-27 10:33 ` bug#53580: shepherd's architecture Attila Lendvai
@ 2023-05-28 22:23   ` Attila Lendvai
  2023-05-29 14:46     ` bug#53580: " Brian Cully via Bug reports for GNU Guix
  2023-06-06 15:16   ` bug#53580: " Ludovic Courtès
  1 sibling, 1 reply; 15+ messages in thread
From: Attila Lendvai @ 2023-05-28 22:23 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 53580, guix-devel

[resending to include the guix-devel list. apologies to everyone who receives this mail twice!]

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Dying societies accumulate laws like dying men accumulate remedies.”
	—  Nicolás Gómez Dávila (1913–1994), 'Escolios a un texto implicito: Seleccion'


* bug#53580: shepherd's architecture
  2023-05-28 22:23   ` Attila Lendvai
@ 2023-05-29 14:46     ` Brian Cully via Bug reports for GNU Guix
  2023-05-29 15:18       ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Cully via Bug reports for GNU Guix @ 2023-05-29 14:46 UTC (permalink / raw)
  To: Attila Lendvai; +Cc: guix-devel, Ludovic Courtès, 53580

Attila Lendvai <attila@lendvai.name> writes:

> it doesn't seem to be an insurmountable task to make sure that guile
> can safely unlink a module from its heap, check if there are any
> references into the module to be dropped, and then reload this module
> from disk.
>
> the already running fibers would keep the required code in the heap
> until after they are stopped/restarted. then the module would get GC'd
> eventually.
>
> this would help solve the problem that a reconfigured service may have
> a completely different start/stop code. and by taking some careful
> shortcuts we may be able to make reloading work without having to stop
> the service process in question.

Erlang has had hot code reloading for decades, built around the 
needs of 100% uptime systems. The problem is more complex than it 
often appears to people who are used to how lisps traditionally do 
it. I strongly recommend reading up on Erlang's migration 
system. Briefly: you can't just swap out function definitions, 
because they rely on non-function state which needs to be migrated 
along with the function itself, and you can't do it whenever you 
want, because external actors may be relying on a view of the 
internal state. To accomplish this, Erlang has a lot of machinery, 
and it fits in to the core design of the language and runtime 
which would be extremely difficult to port over to non-Erlang 
languages. Doing it in Scheme is probably possible in an academic 
sense, but not in a practical one.
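
To make the state problem concrete, here is a minimal sketch in Guile of 
only the state-conversion step, with none of the coordination machinery 
Erlang provides. The versioned list layout and the names `migrate-state` 
and `restart-count` are invented for the illustration:

```scheme
;; Sketch only: when code is upgraded, state made by the old version
;; must be converted before the new version may touch it.
;; Version 1 state: (v1 <restart-count>)
;; Version 2 state: (v2 <restart-count> <last-exit-status>)
(define (migrate-state state)
  (case (car state)
    ((v1) (list 'v2 (cadr state) #f))   ; old layout: add the new field
    ((v2) state)                        ; already current
    (else (error "unknown state version" state))))

(define (restart-count state)           ; new code only understands v2
  (cadr state))

(define old-state '(v1 3))
(define new-state (migrate-state old-state))

(display new-state)
(newline)
```

The hard part is not this function but deciding when it is safe to call 
it, given everything else that may be looking at the old state.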

OTOH, Lisp Flavoured Erlang exists if you want that syntax. There 
would definitely be advantages to writing an init (and, indeed, 
any service that needs 100% uptime) on top of the Erlang virtual 
machine. But going the other way, by porting Erlang's 
functionality into Scheme, is going to be a wash.

> in this setup most of the complexity and the evolution of the shepherd
> codebase would happen in the runner, and the other two parts could be
> kept minimal and would rarely need to change (and thus require a
> reboot).

Accepting that dramatic enough changes to PID 1 are going to 
require a reboot seems reasonable to me. They should be even more 
rare than kernel updates, and we accept rebooting there already.

-bjc


* Re: shepherd's architecture
  2023-05-29 14:46     ` bug#53580: " Brian Cully via Bug reports for GNU Guix
@ 2023-05-29 15:18       ` Felix Lechner via Development of GNU Guix and the GNU System distribution.
  0 siblings, 0 replies; 15+ messages in thread
From: Felix Lechner via Development of GNU Guix and the GNU System distribution. @ 2023-05-29 15:18 UTC (permalink / raw)
  To: Brian Cully; +Cc: Attila Lendvai, Ludovic Courtès, 53580, guix-devel

Hi Brian,

On Mon, May 29, 2023 at 8:02 AM Brian Cully via Development of GNU
Guix and the GNU System distribution. <guix-devel@gnu.org> wrote:
>
> Erlang has had hot code reloading for decades

Thank you for that pointer! I also had Erlang on my mind while reading
Attila's message.

> Lisp Flavoured Erlang exists if you want that syntax. There
> would definitely be advantages to writing an init (and, indeed,
> any service that needs 100% uptime) on top of the Erlang virtual
> machine.

“Twenty years from now you will be more disappointed by the things
that you didn't do than by the ones you did do. So throw off the
bowlines. Sail away from the safe harbor. Catch the trade winds in
your sails. Explore. Dream. Discover.” --- H. Jackson Brown Jr in
"P.S. I Love You"

Kind regards
Felix


* bug#53580: shepherd's architecture
  2023-05-27 10:33 ` bug#53580: shepherd's architecture Attila Lendvai
  2023-05-28 22:23   ` Attila Lendvai
@ 2023-06-06 15:16   ` Ludovic Courtès
  2023-06-08 12:54     ` Csepp
  2023-06-08 20:56     ` Attila Lendvai
  1 sibling, 2 replies; 15+ messages in thread
From: Ludovic Courtès @ 2023-06-06 15:16 UTC (permalink / raw)
  To: Attila Lendvai; +Cc: 53580

Hi Attila,

Attila Lendvai <attila@lendvai.name> skribis:

> [forked from: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system]
>
>> So I think we’re mostly okay now. The one thing we could do is load
>> the whole config file in a separate fiber, and maybe it’s fine to keep
>> going even when there’s an error during config file evaluation?
>>
>> WDYT?
>
>
> i think there's a fundamental issue to be resolved here, and addressing that would implicitly resolve the entire class of issues that this one belongs to.
>
> guile (shepherd) is run as the init process, and because of that it may not exit or be respawned. but at the same time when we reconfigure a guix system, then shepherd's config should not only be reloaded, but its internal state merged with the new config, and potentially even with an evolved shepherd codebase.

Sorry to be direct: is there a concrete bug you’re reporting here?

> i still lack a proper mental model of all this to successfully predict what will happen when i `guix system reconfigure` after i `guix pull`-ed my service code, and/or changed the config of my services.

What happens is that ‘guix system reconfigure’ loads new services into
the running shepherd.  New services simply get started; services for
which a same-named service is already running instead get registered as
a “replacement”, meaning that the new version of the service only gets
started when the user explicitly runs ‘herd restart SERVICE’.
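
As a toy model of that behaviour (not shepherd's implementation; the
registry layout and the names `load-service!`, `restart-service!` and
`running-version` are invented for the illustration):

```scheme
;; Sketch only: each registry entry maps
;;   name -> (running-version . pending-replacement-or-#f)
(define registry (make-hash-table))

(define (load-service! name version)
  ;; Called for each service at "reconfigure" time.
  (let ((entry (hashq-ref registry name)))
    (if entry
        (set-cdr! entry version)                          ; running: queue it
        (hashq-set! registry name (cons version #f)))))   ; new: start now

(define (restart-service! name)
  ;; "herd restart": switch to the replacement if one is pending.
  (let ((entry (hashq-ref registry name)))
    (when (and entry (cdr entry))
      (set-car! entry (cdr entry))
      (set-cdr! entry #f))))

(define (running-version name)
  (car (hashq-ref registry name)))

(load-service! 'nginx 1)
(load-service! 'nginx 2)            ; reconfigure: version 1 keeps running
(define before (running-version 'nginx))
(restart-service! 'nginx)
(define after (running-version 'nginx))

(display (list before after))
(newline)
```

Version 2 only takes over at the explicit restart, which is the
"replacement" behaviour described above.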

Non-stop upgrades are ideal, but shepherd alone cannot do that.  For
instance, nginx supports that, and no init system could implement that
on its behalf.

Ludo’.


* bug#53580: shepherd's architecture
  2023-06-06 15:16   ` bug#53580: " Ludovic Courtès
@ 2023-06-08 12:54     ` Csepp
  2023-06-08 20:56     ` Attila Lendvai
  1 sibling, 0 replies; 15+ messages in thread
From: Csepp @ 2023-06-08 12:54 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: attila, 53580


Ludovic Courtès <ludo@gnu.org> writes:

> Hi Attila,
>
> Attila Lendvai <attila@lendvai.name> skribis:
>
>> [forked from: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system]
>>
>>> So I think we’re mostly okay now. The one thing we could do is load
>>> the whole config file in a separate fiber, and maybe it’s fine to keep
>>> going even when there’s an error during config file evaluation?
>>>
>>> WDYT?
>>
>>
>> i think there's a fundamental issue to be resolved here, and
>> addressing that would implicitly resolve the entire class of issues
>> that this one belongs to.
>>
>> guile (shepherd) is run as the init process, and because of that it
>> may not exit or be respawned. but at the same time when we reconfigure
>> a guix system, then shepherd's config should not only be reloaded,
>> but its internal state merged with the new config, and potentially
>> even with an evolved shepherd codebase.
>
> Sorry to be direct: is there a concrete bug you’re reporting here?
>
>> i still lack a proper mental model of all this to successfully
>> predict what will happen when i `guix system reconfigure` after i
>> `guix pull`-ed my service code, and/or changed the config of my
>> services.
>
> What happens is that ‘guix system reconfigure’ loads new services into
> the running shepherd.  New services simply get started; services for
> which a same-named service is already running instead get registered as
> a “replacement”, meaning that the new version of the service only gets
> started when the user explicitly runs ‘herd restart SERVICE’.
>
> Non-stop upgrades is ideal, but shepherd alone cannot do that.  For
> instance, nginx supports that, and no init system could implement that
> on its behalf.
>
> Ludo’.

Do services get a reference to their previously running version?
The Minix project was experimenting with supporting something like
supervisor trees for high uptime, and one way they were trying to
achieve that was by giving services the memory of their previous
version, so they could read their state and migrate it to their own
memory.





* bug#53580: shepherd's architecture
  2023-06-06 15:16   ` bug#53580: " Ludovic Courtès
  2023-06-08 12:54     ` Csepp
@ 2023-06-08 20:56     ` Attila Lendvai
  2023-06-11 14:16       ` bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system Ludovic Courtès
  1 sibling, 1 reply; 15+ messages in thread
From: Attila Lendvai @ 2023-06-08 20:56 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: 53580

> Sorry to be direct: is there a concrete bug you’re reporting here?

i didn't pay careful enough attention to report something specific, but one thing that pops to mind:

when i'm working on my service code, which is `guix pull`ed in from my channel, then after a reconfigure i seem to have to reboot for my new code to get activated. a simple `herd restart` on the service didn't seem to be enough. i.e. the guile modules that my service code is using did not get reloaded into the PID 1 guile.

keep in mind that this is a non-trivial service that e.g. spawns a long-lived fiber to talk to the daemon through its stdio while the daemon is running. IOW, its start GEXP is not just a simple forkexec, but something more complex that uses functions from guile modules that should be reloaded into PID 1 when the new version of the service is to be started.

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“The unexamined life is not worth living for a human being.”
	— Socrates (c. 470–399 BC, tried and executed), 'Apology' (399 BC)


* bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system
  2023-06-08 20:56     ` Attila Lendvai
@ 2023-06-11 14:16       ` Ludovic Courtès
  0 siblings, 0 replies; 15+ messages in thread
From: Ludovic Courtès @ 2023-06-11 14:16 UTC (permalink / raw)
  To: Attila Lendvai; +Cc: 53580

Hi,

Attila Lendvai <attila@lendvai.name> skribis:

> when i'm working on my service code, which is `guix pull`ed in from my channel, then after a reconfigure i seem to have to reboot for my new code to get activated. a simple `herd restart` on the service didn't seem to be enough. i.e. the guile modules that my service code is using did not get reloaded into the PID 1 guile.

Guile modules do not get reloaded; there’s no mechanism in place to
reload previously-loaded Guile modules.

> keep in mind that this is a non-trivial service that e.g. spawns a long-lived fiber to talk to the daemon through its stdio while the daemon is running. IOW, its start GEXP is not just a simple forkexec, but something more complex that uses functions from guile modules that should be reloaded into PID 1 when the new version of the service is to be started.

OK, got it.  There’s not enough info here to be concrete, but I’d
recommend making it a separate process if you need to reliably
reload/replace the module.  IOW, you’d make it a “regular” service
spawned with ‘make-forkexec-constructor’ or similar.
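
A sketch of what such a "regular" service could look like in a shepherd
configuration file, assuming the procedural `service' interface of
0.10.  The service name and the command line are invented, and the
fragment only makes sense when evaluated by shepherd itself:

```scheme
;; Sketch only: the daemon runs as its own process, so restarting the
;; service starts whatever binary the new definition points to.
(define my-daemon
  (service '(my-daemon)
           #:documentation "Run my-daemon as a separate process."
           #:start (make-forkexec-constructor
                    '("/path/to/my-daemon" "--foreground"))
           #:stop (make-kill-destructor)))

(register-services (list my-daemon))
```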

However this doesn’t have anything to do with the initial bug report and
the title of this message; for clarity, please move further discussion
to guix-devel.

Thanks,
Ludo’.


* bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system
  2022-01-27 11:32 bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system Attila Lendvai
                   ` (3 preceding siblings ...)
  2023-05-27 10:33 ` bug#53580: shepherd's architecture Attila Lendvai
@ 2023-06-11 14:18 ` Ludovic Courtès
  4 siblings, 0 replies; 15+ messages in thread
From: Ludovic Courtès @ 2023-06-11 14:18 UTC (permalink / raw)
  To: Attila Lendvai; +Cc: 53580

Attila Lendvai <attila@lendvai.name> skribis:

> (define (call-with-server-socket file-name proc)
>   "Call PROC, passing it a listening socket at FILE-NAME and deleting the
> socket file at FILE-NAME upon exit of PROC.  Return the values of PROC."
>   (let ((sock (open-server-socket file-name)))
>     (dynamic-wind
>       noop
>       (lambda () (proc sock))
>       (lambda ()
>         (close sock)
>         (catch-system-error (delete-file file-name))))))

For the record, ‘dynamic-wind’ here was replaced by ‘catch’ in
46790f9d924af2a9521adccb9e6db6afd9c1a2e7, which corresponds to the
introduction of Fibers in 0.9.x.
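
One plausible reason, which is an assumption on top of this thread and
not something taken from the commit: a suspended fiber leaves its
dynamic extent through a delimited continuation, and that runs
‘dynamic-wind’ out guards each time.  A sketch with core Guile prompts
only, no Fibers, with made-up names throughout:

```scheme
;; Sketch only: with delimited continuations, a dynamic-wind "out"
;; guard runs every time the body is suspended, not just at final exit.
(define tag (make-prompt-tag "suspend"))
(define events '())
(define (note! event) (set! events (cons event events)))

(define (run)
  (dynamic-wind
    (lambda () (note! 'enter))
    (lambda ()
      (abort-to-prompt tag)         ; suspend, as a scheduler would
      (note! 'resumed)
      'done)
    (lambda () (note! 'cleanup))))  ; imagine: delete the socket file

;; First run: the body suspends and we get its continuation back.
(define k
  (call-with-prompt tag
    run
    (lambda (k) k)))

;; Resume the suspended body; it now runs to completion.
(define result (k))

(display (reverse events))
(newline)
```

The cleanup guard fires once at the suspension and once more at the
real exit, so a guard that deletes a file would delete it while the
body is still due to continue.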

Ludo’.



