Date: Fri, 19 Mar 2021 17:33:57 +0000
To: "guix-devel@gnu.org"
From: raid5atemyhomework
Subject: A Critique of Shepherd Design

GNU Shepherd is the `init` system used by GNU Guix. It features:

* A rich full Scheme language to describe actions.
* A simple core that is easy to maintain.

However, in this critique, I contend that these features are bugs.

The Shepherd language for describing actions on Shepherd daemons is a Turing-complete Guile language. Turing completeness runs afoul of the Principle of Least Power. In principle, all that actions have to do is invoke the `exec`, `fork`, `kill`, and `waitpid` syscalls. Yet the language is a full Turing-complete language, including the major weakness of Turing-completeness: the halting problem is undecidable for it.

Because halting cannot be decided in general for programs in the language, it is trivially possible to write an infinite loop in it, and no general check can rule one out beforehand. In the context of an `init` system, the possibility of an infinite loop is dangerous, as it means the system may never complete bootup.

Now, let us combine this with the second feature (really a bug): GNU Shepherd is a simple, single-threaded Scheme program.
That means that if the single thread enters an infinite loop (because of a Shepherd service description that entered an infinite loop), then Shepherd itself hangs. This means that the system is severely broken. You cannot `sudo reboot` because that communicates with the Shepherd daemon. You cannot even `kill 1` because signals are handled in the mainloop, which is never entered if the `start` action of some Shepherd daemon managed to enter an infinite loop.

I have experienced this firsthand since I wrote a Shepherd service to launch a daemon, and I needed to wait for the daemon initialization to complete. My implementation of this had a bug that caused an infinite loop to be entered, but the bug was only tickled when a very particular (persistent on-disk) state existed. I wrote this code about a month or so ago, and never got to test it until last week, when the bug was tickled. Unfortunately, by then I had removed older system versions that predated the bug. When I looked at a backup copy of the `configuration.scm`, I discovered the bug soon afterwards. But the amount of code verbiage needed here had apparently overwhelmed me at the time I wrote the code to do the waiting, and the bug got into the code and broke my system. I had to completely reinstall Guix (fortunately the backup copy of `configuration.scm` was helpful in recovering most of the system, also ZFS rocks).

Yes, I made a mistake. I'm only human. It should be easy to recover from mistakes. The full Turing-completeness of the language invites mistakes, and the single-threadedness turns the infinite-loop mistakes that Turing-completeness invites into catastrophic system breakages.

So what can we do?

For one, a Turing-complete language is a strict *superset* of non-Turing-complete languages.
So one thing we can do is to define a more dedicated language for Shepherd actions, strongly encourage the use of that sub-language, and, at some point, require that truly Turing-complete actions be wrapped in an `(unsafe-turing-complete ...)` form.

For example, in addition to the existing `make-forkexec-constructor` and `make-kill-destructor`, let us also add these syntax forms:

`(wait-for-condition <timeout> <form>)` - Return a procedure that accepts a numeric `pid`, that does: Check if evaluating `<form>` in the lexical context results in `#f`. If so, wait one second and re-evaluate, but exit anyway and return `#f` if `<timeout>` seconds have passed, or if the `pid` has died. If `<form>` evaluates to non-`#f` then return it immediately.

`(timed-action <timeout> <form> ...)` - Return a procedure that accepts a numeric `pid`, that does: Spawn a new thread (or maybe better a `fork`ed process group?). The new thread evaluates `<form> ...` in sequence. If the thread completes or throws within `<timeout>` seconds, return the result or throw the exception in the main thread. If the `<timeout>` is reached or the given `pid` has died, kill the thread and any processes it may have spawned, then return `#f`.

`(wait-to-die <timeout>)` - Return a procedure that accepts a `pid` that does: Check if the `pid` has died, if so, return `#t`. Else sleep 1 second and try again, and if `<timeout>` is reached, return `#f`.

The above forms should also report what they are doing (e.g. `Have been waiting 10 of 30 seconds for <form>`) on the console and/or syslog.

In addition, `make-forkexec-constructor` should accept a `#:post-fork`, which is a procedure that accepts a `pid`, and `make-kill-destructor` should accept a `#:pre-kill`, which is a procedure that accepts a `pid`. Possibly we need to add combinators for the above actions as well. For example a `sub-actions` procedural form that accepts any number of functions of the above `pid -> bool` type and creates a single combined `pid -> bool`.

So for example, in my case, I would have written a `make-forkexec-constructor` that accepted a `#:post-fork` that had a `wait-for-condition` on the code that checked if the daemon initialization completed correctly.

I think it is a common enough pattern that:

* We have to spawn a daemon.
* We have to wait for the daemon to be properly initialized (`#:post-fork`).
* When shutting down the daemon, it's best to at least try to politely ask it to finish using whatever means the daemon has (`#:pre-kill`).
* If the daemon doesn't exit soon after we politely asked it to, be less polite and start sending signals.

So I think the above should cover a good number of necessities.

Then, we can define a strict subset of Guile, containing a set of forms we know are total (i.e. we have a proof they will always halt for all inputs). Then any Shepherd actions, being Lisp code and therefore homoiconic, can be analyzed. Every list must have a `car` that is a symbol naming a syntax or procedure that is known to be safe --- `list`, `append`, `cons*`, `cons`, `string-append`, `string-join`, `quote` (which also prevents analysis of sub-lists), `and`, `or`, `begin`, thunk combinators, as well as the domain-specific `make-forkexec-constructor`, `make-kill-destructor`, `wait-for-condition`, `timed-action`, and probably some of the `(guix build utils)` like `invoke`, `invoke/quiet`, `mkdir-p` etc.

Sub-forms (or the entire form for an action) can be wrapped in `(unsafe-turing-complete ...)` to skip this analysis for the sub-form, but otherwise, by default, the specific subset must be used, and users have to explicitly put `(unsafe-turing-complete ...)` so they are aware that they can seriously break their system if they make a mistake here. Ideally, as much as possible of the code for any Shepherd action should be *outside* an `unsafe-turing-complete`, and only parts of the code that really need the full Guile language to implement should be wrapped in `unsafe-turing-complete`.

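As a rough illustration of the analysis described above, here is a minimal sketch in Guile. It is not Shepherd code: the names `%safe-heads` and `safe-action?` are hypothetical, the allow-list simply copies the forms named in the text, and whether each of those forms is really total is assumed rather than proven. The check walks the quoted form and accepts it only if the head of every list is on the allow-list, skipping anything inside `quote` or `unsafe-turing-complete`.

```scheme
(use-modules (srfi srfi-1))             ; for `every`

;; Hypothetical allow-list, copied from the forms named in the text.
;; A real list would need a totality argument for every entry.
(define %safe-heads
  '(list append cons* cons string-append string-join and or begin
    make-forkexec-constructor make-kill-destructor
    wait-for-condition timed-action
    invoke invoke/quiet mkdir-p))

(define (safe-action? form)
  "Return #t if every list in FORM has an allow-listed head, else #f."
  (cond
   ;; Atoms (strings, numbers, keywords, symbols, the empty list) are accepted.
   ((not (pair? form)) #t)
   ;; Improper lists are rejected outright.
   ((not (list? form)) #f)
   ;; Quoted data is not code, so it is not analyzed further.
   ((eq? (car form) 'quote) #t)
   ;; Explicit opt-out: the user has taken responsibility for this sub-form.
   ((eq? (car form) 'unsafe-turing-complete) #t)
   ;; Allow-listed head: every argument must itself pass the check.
   ((memq (car form) %safe-heads)
    (every safe-action? (cdr form)))
   ;; Anything else (including `let', `lambda', named-let loops) is rejected.
   (else #f)))

;; Usage, with a made-up daemon path:
(safe-action? '(make-forkexec-constructor
                (list "/usr/bin/mydaemon" "--foreground")))   ; => #t
(safe-action? '(let loop () (loop)))                          ; => #f
(safe-action? '(begin (unsafe-turing-complete
                       (let loop () (loop)))))                ; => #t
```

The last call shows the trade-off: the analysis accepts the form only because the loop was explicitly marked, so it cannot prevent the hang, only make the risk visible in the source.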
(`lambda` and `let-syntax` and `let`, because they can rebind the meanings of symbols, would need to be in `unsafe-turing-complete` --- otherwise the analysis routine would require a full syntax expander as well.)

Turing-completeness is a weakness, not a strength, and restricting languages to be the Least Powerful necessary is better. The `unsafe-turing-complete` form allows an escape, but by default Shepherd code should be in the restricted non-Turing-complete subset, to reduce the scope for catastrophic mistakes.

Thanks
raid5atemyhomework