From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sat, 20 Mar 2021 02:42:38 +0000
To: Maxime Devos <maximedevos@telenet.be>
From: raid5atemyhomework <raid5atemyhomework@protonmail.com>
Cc: "guix-devel@gnu.org" <guix-devel@gnu.org>
Subject: Re: A Critique of Shepherd Design
Message-ID: <i2RYnzzEDl6LOUMEsZlhEAuqjNObTyJM7gpg2en5tQEAvqCjIBM2JkocbhWoNxSIukHZNCT3ATP2roxZvyQ6J8w4LPDw0F-qkLq3ZW4f2Ws=@protonmail.com>
In-Reply-To: <6286d7101ae8219a539bc34437ab46bd48d38476.camel@telenet.be>
References: <sQ_4CGBvkWpMP0GawFJgssNTNnBggFDtD_Q4z5DKOkx_VMfHkhNs16yMJd10VeB9LEWiyQ2B_WrYiM_XxS0N1vxedKyDkHS5AACi1TFY7Xc=@protonmail.com>
 <6286d7101ae8219a539bc34437ab46bd48d38476.camel@telenet.be>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-BeenThere: guix-devel@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Development of GNU Guix and the GNU System distribution."
 <guix-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guix-devel>,
 <mailto:guix-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/guix-devel>
List-Post: <mailto:guix-devel@gnu.org>
List-Help: <mailto:guix-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guix-devel>,
 <mailto:guix-devel-request@gnu.org?subject=subscribe>
Reply-To: raid5atemyhomework <raid5atemyhomework@protonmail.com>
Errors-To: guix-devel-bounces+larch=yhetil.org@gnu.org
Sender: "Guix-devel" <guix-devel-bounces+larch=yhetil.org@gnu.org>

Hello Maxime,

> Multi-threading seems complicated (but can be worth it). What work would you put
> on which thread (please give us something concrete to reason about, ‘thread X does A,
> and is created on $EVENT ...’)? A complication is that "fork" is ‘unsafe’ in a
> process with multiple threads (I don't know the details). Maybe that could be
> worked around. (E.g. by using system* on a helper binary that closes unneeded
> file descriptors and sets the umask .... and eventually uses exec to run the
> service binary.)

Perhaps the point is not so much to be *multi-threaded* but to be *concurrent*, and whether to be multi-*thread* or multi-*process* or multi-*fiber* is a lower-level detail.

For example, we can consider that for 99% of shepherd services, the `start` action returns a numeric PID, and the remaining ones return `#t`.  Rather than calling the `start` action in the same thread/process/fiber that implements the Shepherd main loop, we could do this:

* Create a pipe.
* `fork`.
  * On the child:
    * Close the pipe's output (read) end, make the input (write) end cloexec.
    * Call the `start` action.
    * If an error is thrown, print out `(error <>)` with the serialization of the error to the pipe.
    * Otherwise, print out `(return <>)` with the return value of the `start` action.
  * On the parent:
    * Close the pipe's input (write) end, and add the output (read) end and the pid of the forked child to the set of things we will poll in the mainloop.
    * Return to the mainloop.
* In the mainloop:
  * `poll` (or `select`, or whatever) the action pipes in addition to the `herd` command socket(s).
  * If an action pipe is ready, read it in:
    * If it EOFed then the sub-process died or execed without returning anything, treat as an error.
    * Similarly if unable to `read` from it.
    * Otherwise deserialize it and react appropriately as to whether it is an error or a return of some value.
  * If an action pipe has been around too long, kill the pid (and all child pids) and close the pipe and treat as an error.
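
The steps above could be sketched in Guile roughly as follows.  This is only a sketch under assumptions: `spawn-action` and `read-action-result` are hypothetical names (they are not Shepherd procedures), the action's return value is assumed to be something `write` and `read` can round-trip (a PID or `#t`), and the timeout handling and the mainloop integration are left out.

```scheme
;; Hypothetical sketch; SPAWN-ACTION and READ-ACTION-RESULT are not Shepherd APIs.

(define (spawn-action action)
  "Run the thunk ACTION in a forked child.  Return (PID . PORT) in the parent,
where PORT is the read end of the pipe the child reports its result on."
  (let* ((ends      (pipe))
         (read-end  (car ends))
         (write-end (cdr ends))
         (pid       (primitive-fork)))
    (cond
     ((zero? pid)
      ;; Child: keep only the write end, and mark it close-on-exec so that
      ;; an exec without a report shows up as EOF in the parent.
      (close-port read-end)
      (fcntl write-end F_SETFD
             (logior FD_CLOEXEC (fcntl write-end F_GETFD)))
      (write (catch #t
               (lambda () (list 'return (action)))
               (lambda (key . args)
                 (list 'error key (format #f "~s" args))))
             write-end)
      (force-output write-end)
      (primitive-exit 0))
     (else
      ;; Parent: keep only the read end, to be polled by the mainloop.
      (close-port write-end)
      (cons pid read-end)))))

(define (read-action-result pid port)
  "Read the child's report from PORT, then reap PID.  EOF or an unreadable
report is treated as an error."
  (let ((datum (catch #t
                 (lambda () (read port))
                 (lambda _ 'unreadable))))
    (close-port port)
    (waitpid pid)
    (if (or (eof-object? datum) (eq? datum 'unreadable))
        '(error no-result)
        datum)))
```

In a real mainloop the read end would be handed to `select` together with the `herd` socket(s), and `read-action-result` would only be called once the port is reported ready, so the mainloop itself never blocks on a misbehaving action.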

I mean --- single-threaded/process/fibered, non-concurrent `init` systems have fallen by the wayside because of their brittleness.  SystemD is getting wide support precisely because it's an excellent concurrent `init` mechanism that is fairly tolerant of mistakes; a problem in one service unit is generally isolated in that service unit and its dependents and will not cause problems with the other service units.

Presumably, `fork`ing within a `fork`ed process has no issues since that is what is normally done in Unix anyway; it's threads that are weird.


> That's a bit of an over-simplification. At least in a singly-threaded Shepherd,
> waitpid could not be used by an individual service, as it
> (1) returns not often enough (if a process of another service exits and this
> action was waiting on a single pid),
> (2) too often (likewise, and this action was ‘just waiting’),
> (3) potentially never (imagine every process has been started and an action
> was waiting for one to exit, and now the user tries to reboot. Then shepherd
> daemon would never accept the connection from the user!).

Yes, I agree it is an oversimplification, but in any case, there should be some smaller subset of things that can be (typically) done.

Then any escape hatch like `unsafe-turing-complete` or `without-static-analysis` can retain the full power of the language for those times when you *really* do need it.

>
> And in a multi-threaded Shepherd, "fork" is unsafe. (Anyway, "fork" captures
> too much state, and no action should ever call "exec" without "fork".) Also,
> "exec" replaces too little state (e.g. open file descriptors, umask, root
> directory, current working directory ...).
>
> But perhaps that's just a call for a different set of primitives (which could
> be implemented on top of these system calls by shepherd, or directly on top
> of the relevant Hurd RPCs when running on GNU/Hurd.).

Yes.

> > Yet the language is a full Turing-complete language, including the major
> > weakness of Turing-completeness: the inability to solve the halting problem.
>
> IIUC, there are plans to perform the network configuration inside shepherd
> (search for guile-netlink). So arbitrary code still needs to be allowed.
> This is the Procedural (extensible) I'll be referring to later.
>
> > The fact that the halting problem is unsolved in the language means it is
> > possible to trivially write an infinite loop in the language. In the context
> > of an `init` system, the possibility of an infinite loop is dangerous, as it
> > means the system may never complete bootup.
>
> I'm assuming the hypothetical infinite loop is in, say, the code for starting
> guix-daemon which isn't very essential (try "herd stop guix-daemon"!)
> (If there's an infinite loop in e.g. udev-service-type then there's indeed
> a problem, then that infinite loop might as well be in eudev itself, so I'm
> only speaking of non-essential services blocking the boot process.)
>
> I'm not convinced we need Turing-incompleteness here, at least if the
> start / do-a-dance action / stop code of each service definition is run in
> a separate thread (not currently the case).
>
> > I have experienced this firsthand since I wrote a Shepherd service to launch
> > a daemon, and I needed to wait for the daemon initialization to complete.
> > My implementation of this had a bug that caused an infinite loop to be entered,
> > but would only tickle this bug when a very particular (persistent on-disk) state
> > existed.
> > I wrote this code about a month or so ago, and never got to test it until last week,
> > when the bug was tickled. Unfortunately, by then I had removed older system versions
> > that predated the bug. When I looked at a backup copy of the `configuration.scm`, I
> > discovered the bug soon afterwards.
> > But the amount of code verbiage needed here had apparently overwhelmed me at the time I
> > wrote the code to do the waiting, and the bug got into the code and broke my system.
> > I had to completely reinstall Guix (fortunately the backup copy of `configuration.scm`
> > was helpful in recovering most of the system, also ZFS rocks).
>
> Ok, shepherd should be more robust (at least for non-essential services like guix-publish).
>
> > Yes, I made a mistake. I'm only human. It should be easy to recover from mistakes.
>
> I would need to know what the mistake and what ‘recovering’ would be exactly in this
> case, before agreeing or disagreeing whether this is something shepherd should help
> with.

It's what I described in the previous paragraph.  I wrote code to wait for startup, and that code had a bug that caused it to enter an infinite loop under a certain edge case.

The edge case was not reached for a long time (almost a month), so I only found it a week ago when it triggered and broke my system.  In order to recover I had to reinstall Guix System completely, since none of the saved system generations had a configuration that predated the introduction of the bug.

If you want specifics --- as a sketch, what I did was something similar to this:

    (let wait-for-startup ((count 0))
      (cond
        ((daemon-started?) ; if the daemon died this would return #f
         #t)
        ((> count 30)
         (format #t "daemon taking too long, continuing with boot~%")
         #f)
        (else
         (sleep 1)
         (wait-for-startup (- count 1))))) ;;; <--- (fuck) counts down, not up

The daemon died (this was the persistent on-disk edge case, arguably a bug in `daemon-started?` as well) causing `(daemon-started?)` to return `#f` permanently, and then I started counting *in the wrong direction*.


> > The full Turing-completeness of the language invites mistakes,
>
> Agreed (though ‘invites’ seems a little too strong here).

Come on.  Something as simple as "Wait for some condition to be true" needs code like the above to implement fully in the Shepherd action-description language (i.e. Scheme).  And because I was debating on "just put `(count 30)` in the `let` and decrement" versus "Naaa just count up like a normal person" and got my wires crossed, I got into the above completely dumb mistake.

Indeed, not only should `-` be changed to `+` in the above, but it's also better to do something like this as well:

    (let wait-for-startup ((count 0))
      (cond
        ((daemon-started?) ; if the daemon died this would return #f
         #t)
        ;; waitpid with WNOHANG returns pid 0 while the daemon is still running
        ((not (zero? (car (waitpid daemon-pid WNOHANG))))
         (format #t "daemon died!~%")
         #f)
        ((> count 30)
         (format #t "daemon taking too long, continuing with boot~%")
         #f)
        (else
         (sleep 1)
         (wait-for-startup (+ count 1)))))

Because if the daemon died, waiting for 30 seconds would be fairly dumb as well.

That is a fair bit of verbiage, and much harder to read through for bugs.

As per information theory: the more possible things I can talk about, the more bits I have to push through a communication channel to convey it.  Turing-complete languages can do more things than more restricted domain-specific languages, so on average it requires more code to implement an arbitrary thing in the Turing-complete language (thus requiring more careful review) than in the restricted domain-specific language.

> -   Section 3: A nicely behaving API.
>
> > So what can we do?
> > For one, a Turing-complete language is a strict superset of non-Turing-complete
> > languages. So one thing we can do is to define a more dedicated language for Shepherd
> > actions, strongly encourage the use of that sub-language, and, at some point, require
> > that truly Turing-complete actions need to be wrapped in a `(unsafe-turing-complete ...)` form.
> > For example, in addition to the current existing `make-forkexec-constructor` and
> > `make-kill-destructor`, let us also add these syntax forms:
>
> > -   `(wait-for-condition <timeout> <form>)` - Return a procedure that accepts a numeric `pid`, that does: Check if evaluating `<form>` in the lexical context results in `#f`. If so, wait one second and re-evaluate, but exit anyway and return `#f` if `<timeout>` seconds have passed, or if the `pid` has died. If `<form>` evaluates to non-`#f` then return it immediately.
> > -   `(timed-action <timeout> <form> ...)` - Return a procedure that accepts a numeric `pid`, that does: Spawn a new thread (or maybe better a `fork`ed process group?). The new thread evaluates `<form> ...` in sequence. If the thread completes or throws in `<timeout>` seconds, return the result or throw the exception in the main thread. If the `<timeout>` is reached or the given `pid` has died, kill the thread and any processes it may have spawned, then return `#f`.
> > -   `(wait-to-die <timeout>)` - Return a procedure that accepts a `pid` that does: Check if the `pid` has died, if so, return `#t`. Else sleep 1 second and try again, and if `<timeout>` is reached, return `#f`.
>
> Hard to say if these syntaxes are useful in advance.

In my particular case I could have just written `(wait-for-condition 30 (daemon-started?))`.  I imagine a fair number of wait-for-daemon-startup things can use that or `(timed-action 30 (invoke #$waiter-program))`.
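
As a rough illustration, `wait-for-condition` as specified above might be written like this.  It is a minimal sketch, not an existing Shepherd form: `process-dead?` is a hypothetical helper, the timeout is counted in one-second polls, and it assumes the `pid` is a direct child so that `waitpid` can be used on it (which also reaps the child, something a real implementation would have to coordinate with Shepherd's own child handling).

```scheme
;; Hypothetical sketch of the proposed form; none of this exists in Shepherd.

(define (process-dead? pid)
  ;; waitpid with WNOHANG returns (0 . STATUS) while the child is still
  ;; running.  If PID is not (or no longer) our child, treat it as dead.
  (catch 'system-error
    (lambda () (not (zero? (car (waitpid pid WNOHANG)))))
    (lambda _ #t)))

(define-syntax-rule (wait-for-condition timeout form)
  ;; Expands to a `pid -> value` procedure.  The counter lives in the macro,
  ;; so the user cannot count in the wrong direction.
  (lambda (pid)
    (let loop ((remaining timeout))
      (let ((value form))
        (cond
         (value value)                 ; condition holds: return it
         ((process-dead? pid) #f)      ; daemon died: stop waiting
         ((<= remaining 0) #f)         ; timed out
         (else
          (sleep 1)
          (loop (- remaining 1))))))))
```

With this, the whole loop from my `configuration.scm` collapses to `((wait-for-condition 30 (daemon-started?)) daemon-pid)`.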

>
> In any case, I don't like numeric pids, as they are reused.
> Could we use something like pidfds (Linux) or whatever the Hurd has instead?
> (Wrapped in nice type in Scheme, maybe named <task> with a predicate task?,
> a predicate task-dead?, a wait-for-task-dead operation (if we use guile-fibers).)

Certainly, but that may require breaking changes to existing specific shepherd actions.


> Note: (perform-operation (choice-operation (wait-for-task-dead task) (sleep 5)))
> == wait until task is dead OR 5 seconds have passed. Operations in guile-fibers
> combine nicely!
>
> Also, sprinkling time-outs everywhere seems rather dubious to me (if that's
> what timed-action is for). If some process is taking long to start,
> then by killing it it will take forever to start.

Sprinkling timeouts makes the halting problem trivial: for any code that is wrapped in a timeout, that code will halt (either by itself, or by hitting the timeout).

Indeed, you can make a practical total functional language simply by requiring that recursions have a decrementing counter that will abort computation when it reaches 0.  It could use mostly partial code patterns with the addition of that decrementing counter, and it becomes total.  In many ways the programmer is just assuring the compiler "this operation will not take more than N recursions".
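
To make the decrementing-counter idea concrete, here is a small sketch.  `iterate-with-fuel` is a hypothetical name, and the totality claim assumes that `step` and `done?` themselves terminate and that `fuel` is a non-negative integer.

```scheme
;; Hypothetical sketch of "fuel"-bounded recursion.
(define (iterate-with-fuel fuel step state done?)
  "Apply STEP to STATE until DONE? holds, making at most FUEL steps.
Return the final state, or #f if the fuel ran out first."
  (cond
   ((done? state) state)
   ((zero? fuel) #f)                  ; out of fuel: abort
   (else
    (iterate-with-fuel (- fuel 1) step (step state) done?))))

;; Reaches 5 within 10 steps:
(iterate-with-fuel 10 1+ 0 (lambda (n) (= n 5)))   ; => 5
;; Would need 5 steps but only 3 are allowed:
(iterate-with-fuel 3 1+ 0 (lambda (n) (= n 5)))    ; => #f
```

Every recursive call strictly decreases `fuel`, so the recursion depth is bounded by the initial value no matter what `step` computes.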


>
> > The above forms should also report what they are doing (e.g. `Have been waiting 10 of 30 seconds for <form>`) on the console and/or syslog.
>
> Something like that would be useful, yes. But printing <form> itself seems a bit
> low-level to me.

Yes, but if it's something that the end-user wrote in their `configuration.scm`, it seems fine to me.

>
> > In addition, `make-forkexec-constructor` should accept a `#:post-fork`, which is
> > a procedure that accepts a `pid`, and `make-kill-destructor` should accept a
> > `#:pre-kill`, which is a procedure that accepts a `pid`. Possibly we need to
> > add combinators for the above actions as well. For example a `sub-actions`
> > procedural form that accepts any number of functions of the above `pid -> bool`
> > type and creates a single combined `pid -> bool`.
>
> > So for example, in my case, I would have written a `make-forkexec-constructor`
> > that accepted a `#:post-fork` that had a `wait-for-condition` on the code that
> > checked if the daemon initialization completed correctly.
> > I think it is a common enough pattern that:
> >
> > -   We have to spawn a daemon.
> > -   We have to wait for the daemon to be properly initialized (`#:post-fork`)
> > -   When shutting down the daemon, it's best to at least try to politely ask it
> >     to finish using whatever means the daemon has (`#:pre-kill`).
> > -   If the daemon doesn't exit soon after we politely asked it to, be less polite and start sending signals.
> >
> > So I think the above should cover a good number of necessities.
>
> Agreed. My own proposal, which I like better (-: :
> (this is the ‘procedural’ (extensible) interface)
>
> -   Multi-threading using guile-fibers (we'll use condition variables & channels).
> -   Instead of make-forkexec-constructor, we have
>     make-process-constructor, as fork + exec as-is is troublesome in
>     a multi-threaded application (see POSIX).
>
>     (It still accepts #:user, #:group & the like)
>
>     make-process-constructor returns a thunk as usual,
>     but this thunk THUNK works differently:
>
>     THUNK : () -> <process-task>.
>
>
> <process-task> is a GOOPS class representing processes (parent: <shepherd-task>),
> that do not necessarily exist yet / anymore. <process-task>
> has some fields:
>
> * stopped?-cond
>   (condition variable, only signalled after exit-value is set)
> * exit-value
>   (an atomic box, value #f or an integer,
>   only read after stopped?-cond is signalled
>   or whenever the user is feeling impatient and asking shepherd
>   how it is going.)
> * stop!-cond (condition variable)
> * stop-forcefully!-cond (condition variable)
> * started-cond (condition variable, only signalled when ‘pid’ has been set)
> * pid (an atomic box)
> * maybe some other condition variables?
> * other atomic boxes?
>
> THUNK also starts a fiber associated with the returned <process-task>,
> which tries to start the service binary in the background (could also be used
> for other things than the service binary). When the binary has been started,
> the fiber signals started-cond.
>
> The fiber will wait for the process to exit, and sets exit-value and
> signals stopped?-cond when it does. It will also wait for a polite request
> to stop the service binary (stop!-cond) and impolite requests (for SIGKILL?)
> (stop-forcefully!-cond).
>
> (Does shepherd even have an equivalent to stop-forcefully! ?)
>
> kill-destructor is adjusted to not work with pids directly, but rather
> with <process-task>. It signals stop!-cond or stop-forcefully!-cond.
>
> (So process creation & destruction happens asynchronously, but there
> still is an easy way to wait until they are actually started.)
>
> -   #:pre-kill & #:post-fork seem rather ad-hoc to me. I advise
>     still wrapping make-process-constructor & kill-destructor.
>     However, the thunk still needs to return a <shepherd-task> (or subclass
>     of course).
>
>     If the thunk does anything that could block / loop (e.g. if it's a HTTP
>     server, then the wrapper might want to wait until the process is actually
>     listening at port 80), then this blocking / looping must be done in a fiber.
>
>     Likewise if the killer does anything that could block / loop (e.g.
>     first asking the process to exit nicely before killing it after a time-out).

Sure, this design could work better as well.  But *some* restricted dialect should exist, and we should discourage people from using the full Guile language in Shepherd actions they write bespoke in their `configuration.scm`, not tout it as a feature.

> -   Section 4: Static analysis
>
> > Then, we can define a strict subset of Guile, containing a set of forms we know are
> > total (i.e. we have a proof they will always halt for all inputs). Then any Shepherd
> > actions, being Lisp code and therefore homoiconic, can be analyzed. Every list must
> > have a `car` that is a symbol naming a syntax or procedure that is known to be safe
> > --- `list`, `append`, `cons*`, `cons`, `string-append`, `string-join`, `quote`
> > (which also prevents analysis of sub-lists), `and`, `or`, `begin`, thunk combinators,
> > as well as the domain-specific `make-forkexec-constructor`, `make-kill-destructor`,
> > `wait-for-condition`, `timed-action`, and probably some of the `(guix build utils)`
> > like `invoke`, `invoke/quiet`, [...]
>
> `invoke' has no place in your subset, as the invoked program can take arbitrarily long.
> (Shouldn't be a problem when multi-threading, but then there's less need for this kind
> of static analysis.)

Well, the thing is --- I can probably write a shell script and test it **outside of Shepherd**.  In fact, in my mistake above, on a previous non-Guix system, I ***did*** use a shell script to wait for daemon startup.  But when I ported it over to Guix, I decided to rewrite it as Scheme code to "really dive into the Guix system".

Before, when writing it as a shell script I was able to test it and indeed found problems and fixed them.  But as Scheme code, well, it's a bit harder, since I have to go import a bunch of things (that I have to go search for, and I have to figure out Guile path settings for the bits that are provided by Guix, etc.), and then afterwards I have to copy back the tested code into my `configuration.scm`.  This was enough of an annoyance to me that it discouraged me from testing it, which could have caught the above bug.

Since `invoke` calls an external program, which runs isolated in its own process (and can be `kill`ed in order to unstick Shepherd), and can more easily be independently tested outside the `configuration.scm`, it's substantially safer than writing the same logic directly in the Shepherd language, so it gets a pass.

Many other daemons that have some kind of "wait-for-daemon-startup" program will often include tests for the "wait" program as well in their test suites, so using `invoke` is substantially safer than writing bespoke code in Scheme that performs the same action.
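
For concreteness, the whitelist check discussed here could start out as something like the following.  This is a deliberately naive sketch over raw s-expressions (not over tree-il as suggested later in this mail): `safe-heads` and `safe-form?` are hypothetical names, the list of heads is just the one from my earlier mail with `invoke` kept in, and gexp syntax such as `#$` is not handled.

```scheme
(use-modules (srfi srfi-1))            ; for `every'

;; Hypothetical whitelist of syntax/procedure names considered safe.
(define safe-heads
  '(list append cons* cons string-append string-join and or begin
    make-forkexec-constructor make-kill-destructor
    wait-for-condition timed-action invoke invoke/quiet))

(define (safe-form? form)
  "Return #t if every list in FORM has a whitelisted symbol as its car."
  (cond
   ((not (pair? form)) #t)                         ; atoms are fine
   ((eq? (car form) 'quote) #t)                    ; quoted data is not analyzed
   ((eq? (car form) 'unsafe-turing-complete) #t)   ; explicit escape hatch
   ((memq (car form) safe-heads)
    (and (list? form) (every safe-form? (cdr form))))
   (else #f)))

(safe-form? '(invoke "waiter" "--timeout" "30"))                      ; => #t
(safe-form? '(let loop () (loop)))                                    ; => #f
(safe-form? '(begin (unsafe-turing-complete (let loop () (loop)))))   ; => #t
```

The infinite `let` loop is rejected unless the user explicitly wraps it in `unsafe-turing-complete`, which is exactly the "you have been warned" speed bump I am asking for.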

> > `mkdir-p` etc.
>
> I have vague plans to replace 'mkdir-p', 'mkdir-p/perms' etc. with some procedure
> (prepare-directory
> `("/var/stuff #:bits #o700 #:user ... #:group
> ("blabla" #:bits ...
> #:atomic-generate-contents ,(lambda (port) (display 'stuff port))))
> ...))
>
> ... automatically taking care of symlinks & detecting whether there are conflicts
> with different services.
>
> > Sub-forms (or the entire form for an action) can be wrapped in `(unsafe-turing-complete ...)`
> > to skip this analysis for the sub-form,
>
> An escape hatch seems good to me, though I would rather call it
> `(without-static-analysis (totality) ...)'. The forms are not necessarily ‘unsafe’,
> only potentially so. We could also define checkers for other issues this way.
>
> (I'm thinking of potential TOCTTOU (time of check to time of use) problems involving
> symbolic links.)
>
> > but otherwise, by default, the specific subset must be used, and users have to
> > explicitly put `(unsafe-turing-complete ...)` so they are aware that they can
> > seriously break their system if they make a mistake here. Ideally, as much as
> > possible of the code for any Shepherd action should be outside an
> > `unsafe-turing-complete`, and only parts of the code that really need the full
> > Guile language to implement should be wrapped in `unsafe-turing-complete`.
>
> I'm getting Rusty vibes here. Sounds sound to me.
>
> > (`lambda` and `let-syntax` and `let`, because they can rebind the meanings of symbols,
> > would need to be in `unsafe-turing-complete` --- otherwise the analysis routine would
> > require a full syntax expander as well)
>
> No. The analysis routine should not directly work on Scheme code, but rather on
> the 'tree-il' (or maybe on the CPS code, I dunno).
> Try (macro-expand '(let ((a 0)) a)) from a Guile REPL.
>
> > Turing-completeness is a weakness, not a strength, and restricting languages to be
> > the Least Powerful necessary is better. The `unsafe-turing-complete` form allows
> > an escape, but by default Shepherd code should be in the restricted non-Turing-complete
> > subset, to reduce the scope for catastrophic mistakes.
>
> I'm assuming make-fork+exec-constructor/container would be defined in Scheme
> and be whitelisted in raid5atemyhomework-scheme?

Yes.

>
> -   Section 5 -- Declarative
>
>     Something you don't seem to have considered: defining services in a declarative
>     language! Hypothetical example à la SystemD, written in something vaguely
>     resembling Wisp (Wisp is an alternative read syntax for Scheme with less
>     parentheses):

I prefer SRFI 110 myself.

> (I'm a bit rusty with the details on defining shepherd services
>     in Guix:
>
>     shepherd-service
>     start `I-need-to-think-of-a-name-constructor #:binary #$(file-append package "/bin/stuff") #:arguments stuff #:umask suff #:user stuff #:group stuff ,@(if linux `(#:allowed-syscalls ',(x y z ...)))
>     #:friendly-shutdown
>     thunk
>     invoke #$(file-append package "/bin/stop-stuff-please")
>     #:polling-test-started?
>     thunk
>     if . file-exists? "/run/i-am-really-started"
>     #true
>
>     ;; tell I-need-to-think-of.... to wait 4 secs
>     ;; before trying again
>     seconds 4
>     ;; how much time starting may take
>     ;; until we consider things failed
>     ;; & stop the process
>     #:startup-timeout?
>     thunk
>     if spinning-disk?
>     plenty
>     little
>     #:actions etcetera
>     ...
>
>     Well, that's technically still procedural, but this starts to look
>     like a SystemD configuration file due to the many keyword arguments!
>
>     (Here (thunk X ...) == (lambda () X ...), and
>     I-need-to-think-of-a-name-constructor
>     is a procedure with very many options, that
>     ‘should be enough for everyone’. If it doesn't support
>     enough arguments,
>     look in SystemD for inspiration.)
>
>     That seems easier to analyse. Although it's bit kitchen-sinky,
>     no kitchen is complete without a sink, so maybe that's ok ...
>     (I assume the kitchen sink I-need-to-think-of-a-name-constructor
>     would be defined in the guix source code.)


Well, yes --- but then we can start arguing that using SystemD would be better, as it is more battle-tested and already exists and all the kitchen-sink features are already there and a lot more people can hack its language than can hack the specific dialect of Scheme (augmented with Guixy things like `make-forkexec-constructor`) used by GNU Shepherd.

Thanks
raid5atemyhomework