From: Roel Janssen
Subject: Re: Workflow management with GNU Guix
Date: Tue, 14 Jun 2016 11:16:31 +0200
Message-ID: <87bn343zbk.fsf@gnu.org>
References: <87wpmzhdk2.fsf@gnu.org> <87twhyp505.fsf@mdc-berlin.de>
In-reply-to: <87twhyp505.fsf@mdc-berlin.de>
List-Id: "Development of GNU Guix and the GNU System distribution."
To: Ricardo Wurmus
Cc: guix-devel@gnu.org

Hello all,

Thank you for your replies.  I will reply to Ricardo's response below.

Ricardo Wurmus writes:

> (Resending this as it could not be delivered.)
>
> Ricardo Wurmus writes:
>
>> Hi Roel,
>>
>>> With GNU Guix we are able to install programs on our machines with an
>>> amazing level of control over the dependency graph of the programs.
>>> We now know what code will run when we invoke a program.  We now know
>>> what the impact of an upgrade will be.  And we can safely roll back
>>> to previous states.
>>>
>>> What seems to be common practice in research involving data analysis
>>> is running multiple programs in a chain to transform data from raw
>>> input to specific results.  This is often referred to as a "pipeline"
>>> or a "workflow".  Because data sets can be quite large in comparison
>>> to the computing power of our laptops, the data analysis is performed
>>> on computing clusters instead of single machines.
>>>
>>> The usage of a pipeline/workflow is somewhat different from package
>>> construction, because we want to run the sequence of commands on
>>> different data sets (as opposed to running it on the same source
>>> code).  Plus, I would like to integrate it with existing computing
>>> clusters that have a job scheduling system in place.
>>>
>>> The reason I think this should be possible with Guix is that it has
>>> everything in place to do software deployment and run-time isolation
>>> (containers).  From there it is a small step to executing programs in
>>> an automated way.
>>>
>>> So, I would like to propose a new Guix subcommand and an extension to
>>> the package management language to add workflow management features.
>>
>> I probably don’t understand your idea well enough, but from what I
>> understand it doesn’t really have much to do with packages (other than
>> using them) and store manipulation per se (produced artifacts are not
>> added to the store).  Exactly what features of Guix do you want to
>> build on?

I would like to build on the language used to express packages.  What is
nice about package recipes is that they are understandable, they are
shareable (just copy and paste the recipe), and a reproducible output can
be produced from them.  A package recipe describes its entire dependency
graph because the symbols in its inputs are resolved to specific versions
of the external packages.  This makes recipes a very powerful way of
specifying how to run things.

>> My perspective on pipelines is that they should be developed like any
>> other software package, treating individual tools as you would treat
>> libraries.  This means that a pipeline would have a configuration step
>> in which it checks for the paths of all tools it needs internally, and
>> then use the full paths rather than assume all tools to be in a
>> directory listed in the PATH variable.

If we used Guix package recipes to describe tools, we wouldn't need to
search for them.
We could just set up a profile with these tools and set the environment
variables suggested by Guix accordingly.  This way we can generate the
exact dependency graph of a pipeline, leaving no ambiguity about the
run-time environment.

>> Distributing jobs to clusters would be the responsibility of the
>> pipeline, e.g. by using DRMAA, which supports several resource
>> management backends and has bindings for a wide range of programming
>> languages.

Wouldn't it be easier to write a pipeline in a language that has the
infrastructure to uniquely describe and deploy a program and its
dependencies?  You don't need to search for available tools; you can
just install them.  If they are already available, installing is just a
matter of creating a couple of symbolic links.

Here is a translation of a "real-world" process definition to my record
type, taken from one of the pipelines I studied.  It isn't a perfect
example because it uses a package that isn't in Guix.  Anyway:

===
(define (rnaseq-fastq-quality-control in out)
  (process
   (name "rnaseq-fastq-quality-control")
   (version "1.0")
   (environment `(("fastqc" ,fastqc-bin-0.11.4)))
   (input in)
   (output (string-append out "/" name))
   (procedure
    (script
     (interpreter 'guile)
     (source
      (let ((sample-files (find-files in #:directories? #f)))
        `(begin
           ;; Create output directories.
           (unless (access? ,out F_OK) (mkdir ,out))
           (unless (access? ,output F_OK) (mkdir ,output))
           ;; Perform the analysis step.
           (for-each
            (lambda (file)
              (when (string-suffix? ".fastq.gz" file)
                (system* "fastqc" "-q" file "-o" ,output)))
            ',sample-files))))))
   (synopsis "Generate quality control reports for FastQ files")
   (description "This process generates a quality control report for a
single FastQ file.")))
===

The resulting expression in `source' can be executed with Guile anywhere
on a computing cluster (as long as the files are accessible at the same
location on other machines).
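To make that concrete, here is roughly what the quasiquoted `source'
expression would expand to for a hypothetical run; the directories and
sample file names below are made up, and `fastqc' is assumed to come
from the profile built from the process's `environment' field:

```scheme
;; Hypothetical expansion of `source' for in = "/data/in" and
;; out = "/data/out".  Paths and file names are illustrative only;
;; the `fastqc' executable must be on PATH via the Guix profile.
(begin
  ;; Create output directories.
  (unless (access? "/data/out" F_OK)
    (mkdir "/data/out"))
  (unless (access? "/data/out/rnaseq-fastq-quality-control" F_OK)
    (mkdir "/data/out/rnaseq-fastq-quality-control"))
  ;; Perform the analysis step.
  (for-each (lambda (file)
              (when (string-suffix? ".fastq.gz" file)
                (system* "fastqc" "-q" file
                         "-o" "/data/out/rnaseq-fastq-quality-control")))
            '("/data/in/sample-a.fastq.gz"
              "/data/in/sample-b.fastq.gz")))
```

Because the let-bound `sample-files' are spliced in when the script is
generated, the script carries its concrete file list with it and needs
nothing but Guile and the process's profile at run time.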
This snippet can be copy-pasted elsewhere and included in another
pipeline without adjusting it for whichever job distribution system is
used.  We can deal with that at the "workflow" level instead of the
"process" level.  I left the option open to use other scripting
languages, but we could compact it a bit more if we only used Guile.

>>> Would this be a feature you are interested in adding to GNU Guix?
>>
>> Even if it wasn’t part of Guix itself, you could develop it separately
>> and still add it as a Guix command, much like it is currently done for
>> “guix web” (which I think should eventually be part of Guix).

That may be a good idea.

>>> I'm currently working on a proof-of-concept implementation that has
>>> three record types/levels of abstraction:
>>>
>>> <workflow>: Describes which <process>es should be run, and concerns
>>> itself with the order of execution.
>>>
>>> <process>: Describes what packages are needed to run the programs
>>> involved, and its relationship to other processes.  Processes take
>>> input and generate output much like the package construction process.
>>>
>>>