shepherd's architecture - Attila Lendvai

unofficial mirror of guix-devel@gnu.org 
 help / color / mirror / code / Atom feed

From: Attila Lendvai <attila@lendvai.name>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: 53580@debbugs.gnu.org, guix-devel <guix-devel@gnu.org>
Subject: shepherd's architecture
Date: Sun, 28 May 2023 22:23:00 +0000	[thread overview]
Message-ID: <Fe3OtPtYH2PHkXerCVPLsOIdvUw04jv5hYL_lms1t-V-JQ8vTvdHyu7Lk-PYfYIBns96NX34wxf8Yb3uNiiPGQzOMjB2hO7_yn3lrILV6fA=@lendvai.name> (raw)
In-Reply-To: <Jf0lcTW5Lw4gnNDSPsv037iYNAMvK28S6tL4Zh0FdGp7nnQCgCD_uITYxJ4PFxqKkaS5CUH_7mUucz2tvKVJKdQt2uhizTDQaiJ0Jup2wbs=@lendvai.name>

[resending to include the guix-devel list. apologies for everyone who receives this mail twice!]

------

[forked from: bug#53580: /var/run/shepherd/socket is missing on an otherwise functional system]

> So I think we’re mostly okay now. The one thing we could do is load
> the whole config file in a separate fiber, and maybe it’s fine to keep
> going even when there’s an error during config file evaluation?
> 
> WDYT?

i think there's a fundamental issue to be resolved here, and addressing that would implicitly resolve the entire class of issues that this one belongs to.

guile (shepherd) is run as the init process, and because of that it may not exit or be respawn. but at the same time when we reconfigure a guix system, then shepherd's config should not only be reloaded, but its internal state merged with the new config, and potentially even with an evolved shepherd codebase.

i still lack a proper mental model of all this to succesfully predict what will happen when i `guix system reconfigure` after i `guix pull`-ed my service code, and/or changed the config of my services.

--------

this problem of migration is pretty much a CS research topic...

ideally, there should be a non-shepherd-specific protocol defined for such migrations, and the new shpeherd codebase could migrate its state from the old to the new, with most of the migration code being automatic. some of it must be hand written as rquired by some semantic changes.

even more ideally, we should reflexive systems; admit that source code is a graph, and store it as one (as opposed to a string of characters); and our systems should have orthogonal persistency, etc, etc... a far cry from what we have now.

Fare's excellent blog has some visionary thoughts on this, especially in:

https://ngnghm.github.io/blog/2015/09/08/chapter-5-non-stop-change/

but given that we will not have these any time soon... what can we do now?

--------

note: what follows are wild ideas, and i'm not sure i have the necessary understanding of the involved subsystems to properly judge their feasibility... so take them with a pinch of salt.

idea 1
--------

it doesn't seem to be an insurmontable task to make sure that guile can safely unlink a module from its heap, check if there are any references into the module to be dropped, and then reload this module from disk.

the already runing fibers would keep the required code in the heap until after they are stopped/restarted. then the module would get GC'd eventually.

this would help solve the problem that a reconfigured service may have a completely different start/stop code. and by taking some careful shortcuts we may be able to make reloading work without having to stop the service process in question.

idea 2
--------

another, probably better idea:

split up shepherd's codebase into isolated parts:

1) the init process

2) the service runners, which are spawned by 1). let's call this part
'the runner'.

3) the CLI scripts that implement stuff like `reboot` by sending a
message to 1).

the runner would spawn and manage the actual daemon binaries/processes.

the init process would communicate with the runners through a channel/pipe that is created when the runner are spawn. i.e. here we wouldn't need an IPC socket file like we need for the communication between the scripts and the init process.

AFAIU the internal structure of shepherd is already turning into something like this with the use of fibers and channels. i suspect Ludo has something like this on his mind already.

in this setup most of the complexity and the evolution of the shepherd codebase would happen in the runner, and the other two parts could be kept minimal and would rarely need to change (and thus require a reboot).

the need for a reboot could be detected by noticing that the compiled binary of the init process has changed compared to what is currently running as PID 1.

the driver process of a service could be reloaded/respawned the next time when the daemon is stopped or it quits unexpectedly.

--------

recently i've succesfully wrote a shepherd service that spawns a daemon, and from a fiber it does two way communication with the daemon using a pipe connected to the daemon's stdio. i guess that counts as a proof of concept for the second idea, but i'm not sure about its stability. a stuck/failing service is a different issue than a stuck/failing init process.

for reference, the spawning of the daemon:

https://github.com/attila-lendvai/guix-crypto/blob/8f996239bb8c2a1103c3e54605faf680fe1ed093/src/guix-crypto/services/swarm.scm#L315

the fiber's code that talks to it:

https://github.com/attila-lendvai/guix-crypto/blob/8f996239bb8c2a1103c3e54605faf680fe1ed093/src/guix-crypto/swarm-utils.scm#L133

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Dying societies accumulate laws like dying men accumulate remedies.”
	—  Nicolás Gómez Dávila (1913–1994), 'Escolios a un texto implicito: Seleccion'

next      parent reply	other threads:[~2023-05-28 22:23 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <Jf0lcTW5Lw4gnNDSPsv037iYNAMvK28S6tL4Zh0FdGp7nnQCgCD_uITYxJ4PFxqKkaS5CUH_7mUucz2tvKVJKdQt2uhizTDQaiJ0Jup2wbs=@lendvai.name>
2023-05-28 22:23 ` Attila Lendvai [this message]
2023-05-29 14:46   ` bug#53580: shepherd's architecture Brian Cully via Bug reports for GNU Guix
2023-05-29 15:18     ` Felix Lechner via Development of GNU Guix and the GNU System distribution.

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='Fe3OtPtYH2PHkXerCVPLsOIdvUw04jv5hYL_lms1t-V-JQ8vTvdHyu7Lk-PYfYIBns96NX34wxf8Yb3uNiiPGQzOMjB2hO7_yn3lrILV6fA=@lendvai.name' \
    --to=attila@lendvai.name \
    --cc=53580@debbugs.gnu.org \
    --cc=guix-devel@gnu.org \
    --cc=ludo@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).