From: raid5atemyhomework <raid5atemyhomework@protonmail.com>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: "guix-devel@gnu.org" <guix-devel@gnu.org>
Subject: Re: A Critique of Shepherd Design
Date: Wed, 24 Mar 2021 14:29:47 +0000
Message-ID: <Ld_TMjNUA0aUyDY-8tl7YN9sjcTmkWxxKLHbROq_RKwg-2vTm1AH6PwJJUnPyK5ToPAuxCpfgNeDdrrxfbhfA7PQ4t1sgBRkXi6Q2kjHphc=@protonmail.com>
In-Reply-To: <87h7l3w28b.fsf@gnu.org>




Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, March 23, 2021 1:02 AM, Ludovic Courtès <ludo@gnu.org> wrote:

> Hi,
>
> raid5atemyhomework raid5atemyhomework@protonmail.com skribis:
>
> > I'm not sure you can afford to keep it simple.
>
> It has limitations but it does the job—just like many other init systems
> did the job before the advent of systemd.
>
> > Consider: https://issues.guix.gnu.org/47253
> > In that issue, the `networking` provision comes up potentially before the network is, in fact, up. This means that other daemons that require `networking` could potentially be started before the network connection is up.
>
> The ‘networking’ service is just here to say that some network interface
> is up or will be up. It’s admittedly vague and weak, but it’s enough
> for most daemons; they just need to be able to bind and listen to some
> port.
>
> > One example of such a daemon is `transmission-daemon`. This daemon will bind itself to port 9091 so you can control it. Unfortunately, if it gets started while the network is down, it will be unable to bind to 9091 (so you can't control it) but will still keep running. On my system that means that on reboot I have to manually `sudo herd restart transmission-daemon`.
>
> Could you report a bug for this one? I don’t see why it’d fail to bind.

Let me see if I can still get to the old syslogs where `transmission-daemon` claims it cannot bind but keeps running anyway.  I've pulled my `transmission-daemon` service definition directly into my `configuration.scm` so I can edit its `requirement` to include a custom `networking-online` service that runs `nm-online`.


> > In another example, I have a custom daemon that I have set up to use
> > the Tor proxy over 127.0.0.1:9050. It requires both `networking` and
> > `tor`. When it starts after `networking` comes up but before the
> > actual network does, it dies because it can't access the proxy at
> > 127.0.0.1:9050 (apparently NetworkManager handles loopback as well).
>
> Loopback is handled by the ‘loopback’ shepherd service, which is
> provided via ‘%base-services’. Perhaps you just need to have your
> service depend on it?
>
> > Switching to a concurrent design for Shepherd --- any concurrent design --- is probably best done sooner rather than later, because it risks strongly affecting customized `configuration.scm`s like mine that have almost a half dozen custom Shepherd daemons.
>
> I suspect the main issue here is undeclared dependencies of some of the
> Shepherd services you mention.
>
> I like the “sooner rather than later” bit, though: it sounds like you’re
> about to send patches or announce some sponsorship program?… :-)

Not particularly, but I *have* looked over Shepherd, and here are some notes.  Maybe I'll even send patches, but the reaction to the ZFS patches makes me just shrug: I'd need to devote more time than what I spent on ZFS, and the ZFS patches aren't getting into Guix, so why bother.  If I get annoyed enough I'll just patch my own system and call it a day.  My own system uses `nm-online` and I don't expect to be without networking, so the `nm-online` delay is unlikely to be an issue.  And I don't intend to mess with my `configuration.scm` anymore because it's just too brittle; I'll just host VMs instead and use a SystemD-based fully-free OS like Trisquel.  I only need Guix for ZFS (which Trisquel does not have, for some reason).

Anyway...

It seems to me that a good design would be for each `<service>` to have its own process.  The big loop in `modules/shepherd.scm` then becomes "just" a completely event-based command dispatcher that forwards commands to the correct per-`<service>` process.
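
To make this concrete, here is a rough sketch of what that dispatcher core could look like --- `handle-ready-port` is hypothetical; only `select` and `match` are real Guile:

    (use-modules (ice-9 match))

    ;; One completely event-based dispatch loop: wait for any port to
    ;; become readable, hand it to its handler, repeat.
    (define (mainloop ports)
      (let loop ()
        (match (select ports '() '())
          ((readable _ _)
           ;; Each ready port is a command socket, a service->mainloop
           ;; socket, or the socket-file process (see below).
           (for-each handle-ready-port readable)))
        (loop)))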

Now, one feature Shepherd has is that it can be started with `--socket-file=-`, which causes GNU Shepherd to enable GNU readline and use the readline library to read `herd`-like (?) commands from standard input.

Unfortunately the `readline` interface is inherently blocking instead of event-based.  The C API of GNU readline has an alternative callback interface that is compatible with event loops (I've used it in the past to create a toy chat program that displays messages from other users while you are typing your own), but it looks like this interface is not exposed in Guile.  I checked `readline-port` as well, but the code I could find online suggests that it just uses the blocking `readline` interface and would (?) be incompatible with the Guile `select`.  (Side note: the `SIGCHLD` problem could probably be fixed if Guile exposed `pselect`, but apparently it doesn't, and I'm not prepared to dedicate even more time fiddling with the lack of syscalls in Guile.  Maybe a signal-via-pipe technique would work as an alternative, since that supposedly works on every UNIX-like --- but presumably the Shepherd authors already knew that, so maybe there is a good reason not to use it.)
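
For reference, a minimal sketch of that signal-via-pipe ("self-pipe") technique in Guile --- with the caveat that Guile delivers signals as asyncs at safe points, so this is an approximation of the C-level trick, not a verified fix for the `SIGCHLD` race:

    ;; Create a pipe once; the handler writes a byte, which makes the
    ;; read end readable and so wakes up any `select` watching it.
    (define %sigchld-pipe (pipe))   ; (read-port . write-port)

    (sigaction SIGCHLD
      (lambda (signum)
        (write-char #\! (cdr %sigchld-pipe))
        (force-output (cdr %sigchld-pipe))))

    ;; The mainloop then watches the read end alongside its sockets:
    ;;   (select (cons (car %sigchld-pipe) other-ports) '() '())
    ;; and drains the pipe (and calls waitpid) when it fires.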

Since `readline` is blocking, one possibility would be to *fork off* the process that does the stdin communication.  Then it can continue to use the blocking `readline`.  It just needs to invoke `herd stop root` when it gets EOF (note: I need to check how commands are sent; my unverified impression is that they are sent as plain text).
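
In outline, the forked stdin process could be as simple as this --- assuming Guile was built with readline support, and with `send-command-to-mainloop` purely hypothetical (per the note above, the actual wire format still needs checking):

    (use-modules (ice-9 readline))

    (define (stdin-process)
      (let loop ()
        (let ((line (readline "> ")))
          (if (eof-object? line)
              (send-command-to-mainloop "stop root")  ; EOF: shut down
              (begin
                (send-command-to-mainloop line)
                (loop))))))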

Since the goal is to make the mainloop into a very parallel dispatcher, we need some way to send commands to the mainloop in stdin-mode.  We can take advantage of the little-known fact that UNIX domain sockets can pass file descriptors across processes, with the file descriptor number being remapped in the receiving process via magic kernel stuff.  So, we create a `socketpair` (NOTE: CHECK IF GUILE HAS `socketpair`!!!  Also review how the fd-passing gets done, maybe Guile doesn't expose the needed syscalls either, sigh), then each time the `readline`-process receives a command, it creates a new `socketpair`, sends one end over to the mainloop, sends the command via the other end, then waits for a response and prints it.  This should make the experience very close to that of the blocking Shepherd.
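
On the first NOTE: Guile does expose `socketpair`, e.g. this works at the REPL:

    (use-modules (ice-9 rdelim))

    ;; A connected pair of AF_UNIX stream sockets, as two Guile ports.
    (define pair (socketpair AF_UNIX SOCK_STREAM 0))
    (display "ping\n" (car pair))
    (force-output (car pair))
    (read-line (cdr pair))   ; => "ping"

The fd-passing half is the open question: SCM_RIGHTS needs `sendmsg`/`recvmsg` with ancillary data, and I don't know of plain Guile exposing that, so that part may indeed need extra bindings.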

If the above pattern is workable, ***we can use the same pattern for `--socket-file=/file/path`***.  We ***always*** fork off *some* process to handle `--socket-file`, whether in `stdin`-mode or not.  In `--socket-file=/file/path` mode, the `socket-file` process binds the socket file, `listen`s on it, `accept`s connections in a loop, and passes each opened socket over to the mainloop.
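
In outline (with the fd-passing elided, since that is exactly the part that needs checking):

    (define (run-socket-file-process path)
      (let ((sock (socket AF_UNIX SOCK_STREAM 0)))
        (bind sock AF_UNIX path)
        (listen sock 10)
        (let loop ()
          (let ((client (car (accept sock))))   ; (port . peer-address)
            ;; ... pass `client` to the mainloop via SCM_RIGHTS ...
            (loop)))))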

We also need this pattern as a general feature of the mainloop.  An action on one `<service>` can trigger actions on another service (in theory; my cursory review of the Guix code suggests that current services only trigger actions on themselves, e.g. `ganeti` and `guix-daemon`, but this was not a full review and there may be other services that do more); note in particular that `start` causes every `requirement` to be started anyway.  So I think we need a general mechanism for the mainloop to receive commands from remote processes; we might as well use the same mechanism for both the Shepherd<->user interaction and the Shepherd<->service interaction.

So for clarity of exposition, let me list the processes created by Shepherd:

* The `mainloop` process which handles the main massively-parallel event loop.  This is PID 1.
* The `socket-file` process which either gets commands from `stdin`, or via the `socket-file`.
* Multiple per-`<service>` processes, one per registered `<service>`.

Now, the mainloop has to parse each command in order to learn which per-`<service>` process it should be forwarded to.  And as mentioned, each per-`<service>` process also needs a command-sending socket going back to the mainloop.  So for each per-`<service>` process (one possible record layout is sketched after this list):

* The mainloop maintains a mainloop->service socket to send commands over.
* The mainloop maintains a service->mainloop socket it receives command socketfds over.
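
One possible shape for that bookkeeping, as a plain record (all names hypothetical):

    (use-modules (srfi srfi-9))

    (define-record-type <service-channel>
      (make-service-channel pid command-socket event-socket)
      service-channel?
      (pid            service-channel-pid)             ; per-<service> process
      (command-socket service-channel-command-socket)  ; mainloop -> service
      (event-socket   service-channel-event-socket))   ; service -> mainloop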

The mainloop process also special-cases the `root` service --- it handles commands for `root` directly (BUT this probably needs a lot of fiddling with the data structures involved: `herd status root` can now occur while a `herd start <whatever>` is still running, so we need status reporting for "being started up" and "being stopped" as well --- `/` for "starting up", `\` for "stopping"?).

Now, the `action` and other procedures need to be special-cased.  We need a global variable indicating:

* The current process is `root`, i.e. the mainloop process.
* The current process is some `<service>`.

Every `action` behaves differently depending on this variable (`%process-identity`?); a sketch of the dispatch follows the list below.

* IF the action is going to the same `<service>` (including `root`):
  * Just tail-call into the action code.
* If the current process is `root` and a non-`root` action is being performed:
  * Check if the per-`<service>` process has been started, and start if needed.
  * Schedule the command to be sent via the event loop.
  * Keep operating the mainloop until the command has completed.
    * Use an event-loop stepper function (i.e. just calls `select` and dispatches appropriately, then returns, so caller has to implement the loop).
    * Initially set a mutable variable to `#f`.
    * Schedule the command with a callback that sets the above variable to `#t`.
    * Call the event-loop stepper function until the mutable variable is true.
    * This implements the current semantics where a `.conf` file running an action will block until the action finishes, while still allowing commands to be sent to the Shepherd daemon.
* If the current process is not `root` and the action to be performed is of a different process:
  * Create a socketpair to send the command over to the mainloop and send it (blocking).
  * Send the command to the mainloop (blocking).
  * Wait for completion (blocking).
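
Putting these cases together (using the `%process-identity` variable from above; every other helper --- `local-action`, `ensure-service-process-started!`, `schedule-command!`, `event-loop-step!`, `send-command-and-wait` --- is hypothetical):

    (define (perform-action service action . args)
      (cond
       ((eq? service %process-identity)
        ;; Same <service> (or root-on-root): tail-call into the action.
        (apply local-action service action args))
       ((eq? %process-identity 'root)
        ;; Mainloop performing a non-root action: schedule it, then
        ;; keep stepping the event loop until it completes.
        (ensure-service-process-started! service)
        (let ((done? #f))
          (schedule-command! service action args
                             (lambda (result) (set! done? #t)))
          (let loop ()
            (unless done?
              (event-loop-step!)   ; one select + dispatch round
              (loop)))))
       (else
        ;; A <service> process acting on another service: ship the
        ;; command to the mainloop over a fresh socketpair and block.
        (send-command-and-wait service action args))))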

Each per-`<service>` process has a simple blocking loop: It waits for commands from the mainloop process, executes those commands, then loops again.
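
That is, something like (helpers hypothetical):

    (define (service-process-loop from-mainloop)
      (let loop ()
        (let ((cmd (read-command from-mainloop)))
          (write-response from-mainloop (execute-command cmd))
          (loop))))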

In particular, this means that any `start` actions in the `.conf` file will block (which is the expected behavior of legacy `.conf` files), but even so, the Shepherd will be able to handle commands while it is still loading the `.conf`.


=== Concurrency is Like Regexps, Now You Have Two Problems ===

But note that this means it is possible to deadlock services in this way:

* There are two services `A` and `B`.
  * `A` has an action `AtoB` which invokes an action `BtoA` on service `B`.
  * The `B` `BtoA` action invokes an action `Anoop` on service `A`.
    * In the above structure, because the `A` per-`<service>` process is waiting on the `BtoA` action on service `B`, it cannot handle the `Anoop` action!
    * In the current single-threaded Shepherd, such a thing is actually possible and would not cause a deadlock if the `A` `Anoop` terminates normally.

HOWEVER, this is probably a very unusual setup!  It may be tolerable to simply require that services which perform actions on other services form an acyclic relationship.  (The general case is Turing-complete --- consider the case where the `B` `BtoA` action reinvokes the `A` `AtoB` action with the same arguments, causing it to invoke the `B` `BtoA` action again ad infinitum --- whereas an acyclic relationship provably terminates.)  It may even be possible to detect cycles dynamically: pass a "stack" (i.e. the list of service names whose actions led to a particular action of a particular service being invoked) along with each command, with the `root` service always passing an empty stack and each `<service>`-to-`<service>` action prepending the name of the calling service; the mainloop can then detect a dynamic cycle and just fail the command without forwarding it.
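
The dynamic check in the mainloop would then be almost a one-liner, assuming commands carry that stack (all helpers hypothetical):

    (define (forward-command! command from target stack)
      ;; `stack` lists the services whose actions led here;
      ;; `root` always passes '().
      (if (memq target stack)
          (fail-command! command "dynamic cycle detected")
          (send-to-service! target command (cons from stack))))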

IF this restriction is too onerous, then it may be possible to use an event loop in the per-`<service>` process as well, and use the same wait-for-events-while-blocking logic as in the mainloop --- the code might even be reusable.  BUT I worry about this bit, as it could potentially mean that an action is invoked in the dynamic context (including fluids, sigh) of another ongoing, completely unrelated action, which is BAAAAAD.  This is fine for the `root` service since the `root` service is Shepherd itself (? I think) and we can ensure that the Shepherd code does not depend on dynamic context.


Another problem: in the current single-threaded Shepherd, any service's action can (re-)register a new `<service>`.  This is problematic since a registration done inside a per-`<service>` process obviously cannot affect the mainloop process post-fork.

Again, this is probably a very unusual setup.  While such a thing would be cute, I think it would be tolerable to require that the mainloop process (which is what loads the `.conf` file) is the only one allowed to (re-)register `<service>`s.


=== Taking Advantage of Concurrency ===

Now that each `<service>` gets its own process, we can add a new `force-destroy-service` action on `root`.

    herd force-destroy-service root <some-service>

This forces the per-`<service>` process, and that of every dependent `<service>`, as well as all processes in the process trees of the affected per-`<service>` processes, to be force-killed (IMPORTANT: check how to get the process tree; also look into "process groups", maybe that would work better, but does it exist on the Hurd?).
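
On the process-group idea: if each per-`<service>` process calls `(setpgid 0 0)` right after the fork, then killing the whole tree is a single signal to the negated group id --- though whether spawned daemons stay in the group (they may call `setsid` and escape), and how this behaves on the Hurd, still needs checking:

    ;; kill(2) with a negative pid signals every process in the group.
    (define (force-destroy-pgroup! pgid)
      (kill (- pgid) SIGKILL))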

This new command helps mitigate the issue where Shepherd `start` actions are Turing-complete and can potentially contain infinite-loop bugs.  If this happens, the sysadmin can `herd force-destroy-service root <some-service>` the misbehaving service, reconfigure the system to correct the issue, and restart it.


Another issue is that the current Guix startup is like this (paraphrased):

    (for-each (lambda (service) (start service)) '(guix-daemon #;...))

Now consider this in the context of the `nm-online` example above.  The reason why Guix+Shepherd cannot just ***always*** use `nm-online` like SystemD does is that `nm-online` can delay boot for an arbitrary number of seconds.  And since `start` is a known-blocking legacy interface, it should remain a known-blocking interface (backwards compatibility is a bitch).

This means that for the Guix startup, we should expose a new Shepherd API:

    (start-multiple '(guix-daemon #;...))

This basically does the following (a sketch in code follows the list):

* Set a mutable variable to the length of the input list.
* For each service, if the per-`<service>` process isn't started yet, start it.
* For each service, schedule to `start` the service; have the callback decrement the above variable (maybe also set another variable to a list of services that fail to start).
* In a loop:
  * If the above mutable variable is zero, exit.
  * Otherwise, call the mainloop stepper function, then loop again.
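
A sketch of `start-multiple` along those lines (`ensure-service-process-started!`, `schedule-command!`, and `event-loop-step!` as in the earlier dispatch sketch; all hypothetical):

    (define (start-multiple services)
      (let ((pending (length services))
            (failed  '()))
        ;; Make sure every per-<service> process exists.
        (for-each ensure-service-process-started! services)
        ;; Schedule all the `start`s; each callback decrements the
        ;; counter and records failures.
        (for-each
         (lambda (service)
           (schedule-command! service 'start '()
                              (lambda (ok?)
                                (unless ok?
                                  (set! failed (cons service failed)))
                                (set! pending (- pending 1)))))
         services)
        ;; Keep stepping the event loop until everything finished.
        (let loop ()
          (unless (zero? pending)
            (event-loop-step!)
            (loop)))
        failed))   ; list of services that failed to start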

This allows multiple `start`s to be executed in parallel (which could trigger the `start` of their requirements again, but if a service is already started then its `start` action should do nothing).  There may be a thundering-herd effect on the mainloop though, because of hammering `start` on the requirements.  Hmm.  Concurrency is hard.

Then Guix also has to be modified to use `start-multiple` instead.

This further step also ensures that infinite-loop problems in custom service definitions do not delay startup --- startup is fully parallel (modulo thundering herds of daemons).


=== Overall ===

It seems to me that with the above redesign, there would be very little code left of the original Shepherd, making this more of a reimplementation than a patch or even a fork.


Thanks
raid5atemyhomework

