Date: Mon, 18 May 2020 07:49:00 -0500
From: Pjotr Prins
To: guix-devel
Subject: Slurm with containers (i.e., orchestration)
Message-ID: <20200518124900.jkr5rts5bnslrkqg@thebird.nl>

I am looking into some lightweight orchestration. One possibility is to use Slurm with Guix containers. On a cluster with Guix that is almost trivial (we use Guix containers a lot! They are great) and it would also allow non-container jobs. Once we have containers and Slurm it should also be possible to deploy in some cloud infrastructure, provided there are no dependencies on the cluster itself. I think it would make a terrific blog story if we put something like that together.

Bcbio describes an architecture that uses the Common Workflow Language (CWL) to run pipelines with containers:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/cwl.html#running-with-cromwell-local-hpc

I am not promoting the use of this, but it shows that infrastructure exists that can deploy workflows on containers in different setups (Bcbio supports Slurm).

I know the Guix infrastructure uses guix deploy to achieve similar roll-outs. What that lacks is the orchestration mechanism itself, which should handle dependencies between jobs (i.e., a workflow). The Guix Workflow Language (GWL) goes some way, but it does not handle orchestration itself.
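
To make the "Slurm with Guix containers" idea concrete, here is a minimal sketch of how job dependencies could be handed to Slurm while each step runs inside a Guix container. It is only an illustration under assumptions, not something Guix or the GWL provides: the workflow dictionary, the step names, and the helper functions are hypothetical. It relies on sbatch's --parsable, --dependency=afterok and --wrap options, and on guix environment --container --ad-hoc (newer Guix releases offer guix shell --container for the same purpose).

```python
import shlex
import subprocess
from graphlib import TopologicalSorter

# Hypothetical three-step workflow. Each step names the Guix packages its
# container needs, the command to run, and the steps it must wait for.
WORKFLOW = {
    "prepare": {"packages": ["hello"],
                "command": ["hello", "--greeting=prepare"], "after": []},
    "analyse": {"packages": ["hello"],
                "command": ["hello", "--greeting=analyse"], "after": ["prepare"]},
    "report": {"packages": ["hello"],
               "command": ["hello", "--greeting=report"], "after": ["analyse"]},
}


def sbatch_command(packages, command, dep_job_ids):
    """Build an sbatch command line that runs `command` in a Guix container."""
    guix = ["guix", "environment", "--container", "--ad-hoc", *packages,
            "--", *command]
    cmd = ["sbatch", "--parsable"]
    if dep_job_ids:
        # Start only after all upstream jobs have finished successfully.
        cmd.append("--dependency=afterok:" + ":".join(dep_job_ids))
    cmd.append("--wrap=" + shlex.join(guix))
    return cmd


def run_sbatch(cmd):
    """Submit for real; with --parsable, sbatch prints the job id on stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def submit_workflow(workflow, submit=run_sbatch):
    """Submit steps in dependency order and return {step name: Slurm job id}."""
    graph = {name: step["after"] for name, step in workflow.items()}
    job_ids = {}
    for name in TopologicalSorter(graph).static_order():
        step = workflow[name]
        deps = [job_ids[dep] for dep in step["after"]]
        output = submit(sbatch_command(step["packages"], step["command"], deps))
        # --parsable may print "jobid;cluster"; keep only the job id.
        job_ids[name] = output.strip().split(";")[0]
    return job_ids
```

Calling submit_workflow(WORKFLOW) on a cluster with both Slurm and Guix would queue the three jobs in order. Passing a different submit function allows a dry run without Slurm. Slurm then enforces the ordering, but it still does not decide which machines exist or where software comes from, which is the missing orchestration part.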
In other words, we almost have the pieces, but one thing is missing :).

Thoughts? I know I have brought this up before in different guises, but we are starting to really need something here.

What makes orchestration? I guess it concerns a dynamic database of machines that can execute jobs and some type of software registry (Guix). Next, it should be able to schedule and execute jobs using some constraint specifiers (like network/CPU/RAM). It could be a 'dynamic' Slurm that makes use of real machines and VMs, or it could hook into an existing cloud service. A Slurm job could send a container into a cloud service and monitor it.

I think we can build this up a step at a time. Thoughts?

Pj.
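
As a starting point for discussion, the two ingredients named above (a database of machines and jobs with constraint specifiers) can be sketched in a few lines. This is a toy under stated assumptions, not a proposal for the real design: the Machine and Job classes, their fields, the example machine names, and the greedy first-fit policy are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """One entry in the (hypothetical) dynamic machine database."""
    name: str
    cpus: int
    ram_gb: int
    network: bool = True


@dataclass
class Job:
    """A job with simple constraint specifiers (CPU, RAM, network)."""
    name: str
    cpus: int
    ram_gb: int
    needs_network: bool = False


def schedule(jobs, machines):
    """Greedy first-fit: place each job on the first machine that still has
    enough free CPUs and RAM and satisfies the network constraint."""
    free = {m.name: [m.cpus, m.ram_gb] for m in machines}
    placement = {}
    for job in jobs:
        placement[job.name] = None  # stays None if no machine fits
        for m in machines:
            cpus, ram = free[m.name]
            network_ok = m.network or not job.needs_network
            if job.cpus <= cpus and job.ram_gb <= ram and network_ok:
                free[m.name] = [cpus - job.cpus, ram - job.ram_gb]
                placement[job.name] = m.name
                break
    return placement


# Hypothetical registry: one real node and one VM without network access.
machines = [Machine("node1", cpus=4, ram_gb=16),
            Machine("vm1", cpus=16, ram_gb=64, network=False)]
jobs = [Job("align", cpus=8, ram_gb=32),
        Job("fetch", cpus=1, ram_gb=2, needs_network=True),
        Job("big", cpus=32, ram_gb=128)]

print(schedule(jobs, machines))
# {'align': 'vm1', 'fetch': 'node1', 'big': None}
```

A job that maps to None is the case where a 'dynamic' Slurm could start a new VM or hand the container to a cloud service instead of leaving the job queued.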