A Critique of Shepherd Design - raid5atemyhomework

unofficial mirror of guix-devel@gnu.org 
 help / color / mirror / code / Atom feed

From: raid5atemyhomework <raid5atemyhomework@protonmail.com>
To: "guix-devel@gnu.org" <guix-devel@gnu.org>
Subject: A Critique of Shepherd Design
Date: Fri, 19 Mar 2021 17:33:57 +0000	[thread overview]
Message-ID: <sQ_4CGBvkWpMP0GawFJgssNTNnBggFDtD_Q4z5DKOkx_VMfHkhNs16yMJd10VeB9LEWiyQ2B_WrYiM_XxS0N1vxedKyDkHS5AACi1TFY7Xc=@protonmail.com> (raw)

GNU Shepherd is the `init` system used by GNU Guix.  It features:

* A rich full Scheme language to describe actions.
* A simple core that is easy to maintain.

However, in this critique, I contend that these features are bugs.

The Shepherd language for describing actions on Shepherd daemons is a Turing-complete Guile language.  Turing completeness runs afoul of the Principle of Least Power.  In principle, all that actions have to do is invoke `exec`, `fork`, `kill`, and `waitpid` syscalls.  Yet the language is a full Turing-complete language, including the major weakness of Turing-completeness: the inability to solve the halting problem.

The fact that the halting problem is unsolved in the language means it is possible to trivially write an infinite loop in the language.  In the context of an `init` system, the possibility of an infinite loop is dangerous, as it means the system may never complete bootup.

Now, let us combine this with the second feature (really a bug): GNU shepherd is a simple, single-threaded Scheme program.  That means that if the single thread enters an infinite loop (because of a Shepherd service description that entered an infinite loop), then Shepherd itself hangs.  This means that the system is severely broken.  You cannot `sudo reboot` because that communicates with the Shepherd daemon.  You cannot even `kill 1` because signals are handled in the mainloop, which never get entered if the `start` action of some Shepherd daemon managed to enter an infinite loop.

I have experienced this firsthand since I wrote a Shepherd service to launch a daemon, and I needed to wait for the daemon initialization to complete.  My implementation of this had a bug that caused an infinite loop to be entered, but would only tickle this bug when a very particular (persistent on-disk) state existed.  I wrote this code about a month or so ago, and never got to test it until last week, when the bug was tickled.  Unfortunately, by then I had removed older system versions that predated the bug.  When I looked at a backup copy of the `configuration.scm`, I discovered the bug soon afterwards.  But the amount of code verbiage needed here had apparently overwhelmed me at the time I wrote the code to do the waiting, and the bug got into the code and broke my system.  I had to completely reinstall Guix (fortunately the backup copy of `configuration.scm` was helpful in recovering most of the system, also ZFS rocks).

Yes, I made a mistake.  I'm only human.  It should be easy to recover from mistakes.  The full Turing-completeness of the language invites mistakes, and the single-threadedness makes the infinite-loop mistakes that Turing-completeness invites, into catastrophic system breakages.

So what can we do?

For one, a Turing-complete language is a strict *superset* of non-Turing-complete languages.  So one thing we can do is to define a more dedicated language for Shepherd actions, strongly encourage the use of that sub-language, and, at some point, require that truly Turing-complete actions need to be wrapped in a `(unsafe-turing-complete ...)` form.

For example, in addition to the current existing `make-forkexec-constructor` and `make-kill-destructor`, let us also add these syntax forms:

`(wait-for-condition <timeout> <form>)` - Return a procedure that accepts a numeric `pid`, that does: Check if evaluating `<form>` in the lexical context results in `#f`.  If so, wait one second and re-evaluate, but exit anyway and return `#f` if `<timeout>` seconds has passed, or if the `pid` has died.  If `<form>` evaluates to non-`#f` then return it immediately.

`(timed-action <timeout> <form> ...)` - Return a procedure that accepts a numeric `pid`, that does: Spawn a new thread (or maybe better a `fork`ed process group?).  The new thread evaluates `<form> ...` in sequence.  If the thread completes or throws in `<timeout>` seconds, return the result or throw the exception in the main thread.  IF the `<timeout>` is reached or the given `pid` has died, kill the thread and any processes it may have spawned, then return `#f`.

`(wait-to-die <timeout>)` - Return a procedure that accepts a `pid` that does: Check if the `pid` has died, if so, return #t.  Else sleep 1 second and try again, and if `<timeout>` is reached, return `#f`.

The above forms should also report as well what they are doing (e.g. `Have been waiting 10 of 30 seconds for <form>`) on the console and/or syslog.

In addition, `make-forkexec-constructor` should accept a `#:post-fork`, which is a procedure that accepts a `pid`, and `make-kill-destructor` should accept a `#:pre-kill`, which is a procedure that accepts a `pid`.  Possibly we need to add combinators for the above actions as well.  For example a `sub-actions` procedural form that accepts any number of functions of the above `pid -> bool` type and creates a single combined `pid -> bool`.

So for example, in my case, I would have written a `make-forkexec-constructor` that accepted a `#:post-fork` that had a `wait-for-condition` on the code that checked if the daemon initialization completed correctly.

I think it is a common enough pattern that:

* We have to spawn a daemon.
* We have to wait for the daemon to be properly initialized (`#:post-fork`)
* When shutting down the daemon, it's best to at least try to politely ask it to finish using whatever means the daemon has (`#:pre-kill`).
* If the daemon doesn't exit soon after we politely asked it to, be less polite and start sending signals.

So I think the above should cover a good number of necessities.

Then, we can define a strict subset of Guile, containing a set of forms we know are total (i.e. we have a proof they will always halt for all inputs).  Then any Shepherd actions, being Lisp code and therefore homoiconic, can be analyzed.  Every list must have a `car` that is a symbol naming a syntax or procedure that is known to be safe --- `list`, `append`, `cons*`, `cons`, `string-append`, `string-join`, `quote` (which also prevents analysis of sub-lists), `and`, `or`, `begin`, thunk combinators, as well as the domain-specific `make-forkexec-constructor`, `make-kill-destructor`, `wait-for-condition`, `timed-action`, and probably some of the `(guix build utils)` like `invoke`, `invoke/quiet`, `mkdir-p` etc.

Sub-forms (or the entire form for an action) can be wrapped in `(unsafe-turing-complete ...)` to skip this analysis for the sub-form, but otherwise, by default, the specific subset must be used, and users have to explicitly put `(unsafe-turing-complete ...)` so they are aware that they can seriously break their system if they make a mistake here.  Ideally, as much of the code for any Shepherd action should be *outside* an ``unsafe-turing-complete`, and only parts of the code that really need the full Guile language to implement should be rapped in `unsafe-turing-complete`.

(`lambda` and `let-syntax` and `let`, because they can rebind the meanings of symbols, would need to be in `unsafe-turing-complete` --- otherwise the analysis routine would require a full syntax expander as well)

Turing-completeness is a weakness, not a strength, and restricting languages to be the Least Powerful necessary is better.  The `unsafe-turing-complete` form allows an escape, but by default Shepherd code should be in the restricted non-Turing-complete subset, to reduce the scope for catastrophic mistakes.

Thanks
raid5atemyhomework

next             reply	other threads:[~2021-03-19 17:34 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-19 17:33 raid5atemyhomework [this message]
2021-03-19 21:49 ` A Critique of Shepherd Design Maxime Devos
2021-03-20  2:42   ` raid5atemyhomework
2021-03-20  4:02 ` Maxim Cournoyer
2021-03-20  5:07 ` Mark H Weaver
2021-03-20 11:10   ` raid5atemyhomework
2021-03-20 16:58 ` Ludovic Courtès
2021-03-21  0:22   ` raid5atemyhomework
2021-03-22 17:02     ` Ludovic Courtès
2021-03-24 14:29       ` raid5atemyhomework
2021-03-24 14:48       ` raid5atemyhomework
2021-03-22 13:42 ` raingloom
2021-03-22 17:50   ` Martin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://guix.gnu.org/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='sQ_4CGBvkWpMP0GawFJgssNTNnBggFDtD_Q4z5DKOkx_VMfHkhNs16yMJd10VeB9LEWiyQ2B_WrYiM_XxS0N1vxedKyDkHS5AACi1TFY7Xc=@protonmail.com' \
    --to=raid5atemyhomework@protonmail.com \
    --cc=guix-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).