From: Roel Janssen
Subject: Re: Workflow management with GNU Guix
Date: Tue, 14 Jun 2016 11:16:31 +0200
Message-ID: <87bn343zbk.fsf@gnu.org>
References: <87wpmzhdk2.fsf@gnu.org> <87twhyp505.fsf@mdc-berlin.de>
In-reply-to: <87twhyp505.fsf@mdc-berlin.de>
List-Id: "Development of GNU Guix and the GNU System distribution."
To: Ricardo Wurmus
Cc: guix-devel@gnu.org

Hello all,

Thank you for your replies.  I will reply to Ricardo's response below.

Ricardo Wurmus writes:

> (Resending this as it could not be delivered.)
>
> Ricardo Wurmus writes:
>
>> Hi Roel,
>>
>>> With GNU Guix we are able to install programs on our machines with an
>>> amazing level of control over the dependency graph of the programs.
>>> We now know what code will run when we invoke a program.  We now know
>>> what the impact of an upgrade will be.  And we can safely roll back
>>> to previous states.
>>>
>>> What seems to be common practice in research involving data analysis
>>> is running multiple programs in a chain to transform data from raw
>>> input to specific results.  This is often referred to as a "pipeline"
>>> or a "workflow".  Because data sets can be quite large in comparison
>>> to the computing power of our laptops, the data analysis is performed
>>> on computing clusters instead of single machines.
>>>
>>> The usage of a pipeline/workflow is somewhat different from package
>>> construction, because we want to run the sequence of commands on
>>> different data sets (as opposed to running it on the same source
>>> code).  Plus, I would like to integrate it with existing computing
>>> clusters that have a job scheduling system in place.
>>>
>>> The reason I think this should be possible with Guix is that it has
>>> everything in place to do software deployment and run-time isolation
>>> (containers).  From there it is a small step to executing programs in
>>> an automated way.
>>>
>>> So, I would like to propose a new Guix subcommand and an extension to
>>> the package management language to add workflow management features.
>>
>> I probably don’t understand your idea well enough, but from what I
>> understand it doesn’t really have much to do with packages (other than
>> using them) and store manipulation per se (produced artifacts are not
>> added to the store).  Exactly what features of Guix do you want to
>> build on?

I would like to build on the language used to express packages.  What is
nice about package recipes is that they are understandable, they are
shareable (just copy and paste the recipe), and a reproducible output can
be produced from them.  A package recipe describes its entire dependency
graph because the symbols in its inputs are resolved to specific versions
of the external packages.  This makes recipes a very powerful way of
specifying how to run things.

>> My perspective on pipelines is that they should be developed like any
>> other software package, treating individual tools as you would treat
>> libraries.  This means that a pipeline would have a configuration step
>> in which it checks for the paths of all tools it needs internally, and
>> then use the full paths rather than assume all tools to be in a
>> directory listed in the PATH variable.

If we used Guix package recipes to describe tools, we wouldn't need to
search for them.
We could just set up a profile with these tools and set the environment
variables suggested by Guix accordingly.  This way we can generate the
exact dependency graph of a pipeline, leaving no ambiguity about the
run-time environment.

>> Distributing jobs to clusters would be the responsibility of the
>> pipeline, e.g. by using DRMAA, which supports several resource
>> management backends and has bindings for a wide range of programming
>> languages.

Wouldn't it be easier to write a pipeline in a language that has the
infrastructure to uniquely describe and deploy a program and its
dependencies?  You don't need to search for available tools; you can
just install them.  If they are already available, installing is just a
matter of creating a couple of symbolic links.

Here is a translation of a "real-world" process definition to my record
type, taken from one of the pipelines I studied.  It isn't a perfect
example because it uses a package that isn't in Guix.  Anyway:

===
(define (rnaseq-fastq-quality-control in out)
  (process
   (name "rnaseq-fastq-quality-control")
   (version "1.0")
   (environment `(("fastqc" ,fastqc-bin-0.11.4)))
   (input in)
   (output (string-append out "/" name))
   (procedure
    (script
     (interpreter 'guile)
     (source
      (let ((sample-files (find-files in #:directories? #f)))
        `(begin
           ;; Create output directories.
           (unless (access? ,out F_OK) (mkdir ,out))
           (unless (access? ,output F_OK) (mkdir ,output))
           ;; Perform the analysis step.
           (for-each
            (lambda (file)
              (when (string-suffix? ".fastq.gz" file)
                (system* "fastqc" "-q" file "-o" ,output)))
            ',sample-files))))))
   (synopsis "Generate quality control reports for FastQ files")
   (description "This process generates a quality control report for a
single FastQ file.")))
===

The resulting expression in `source' can be executed with Guile anywhere
on a computing cluster (as long as the files are accessible at the same
location on other machines).
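To make that concrete, here is roughly what the quasiquoted `source'
expression would expand to for a hypothetical run; the directories and
sample file names below are made up, and `fastqc' is assumed to come
from the profile built from the process's `environment' field:

```scheme
;; Hypothetical expansion of `source' for in = "/data/in" and
;; out = "/data/out".  Paths and file names are illustrative only;
;; the `fastqc' executable must be on PATH via the Guix profile.
(begin
  ;; Create output directories.
  (unless (access? "/data/out" F_OK)
    (mkdir "/data/out"))
  (unless (access? "/data/out/rnaseq-fastq-quality-control" F_OK)
    (mkdir "/data/out/rnaseq-fastq-quality-control"))
  ;; Perform the analysis step.
  (for-each (lambda (file)
              (when (string-suffix? ".fastq.gz" file)
                (system* "fastqc" "-q" file
                         "-o" "/data/out/rnaseq-fastq-quality-control")))
            '("/data/in/sample-a.fastq.gz"
              "/data/in/sample-b.fastq.gz")))
```

Because the let-bound `sample-files' are spliced in when the script is
generated, the script carries its concrete file list with it and needs
nothing but Guile and the process's profile at run time.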
This snippet can be copy-pasted elsewhere and included in another
pipeline without adjusting it for whichever job distribution system is
used.  We can deal with that at the "workflow" level instead of the
"process" level.  I left the option open to use other scripting
languages, but we could compact it a bit more if we only used Guile.

>>> Would this be a feature you are interested in adding to GNU Guix?
>>
>> Even if it wasn’t part of Guix itself, you could develop it separately
>> and still add it as a Guix command, much like it is currently done for
>> “guix web” (which I think should eventually be part of Guix).

That may be a good idea.

>>> I'm currently working on a proof-of-concept implementation that has
>>> three record types/levels of abstraction:
>>>
>>> <workflow>: Describes which <process>es should be run, and concerns
>>> itself with the order of execution.
>>>
>>> <process>: Describes what packages are needed to run the programs
>>> involved, and its relationship to other processes.  Processes take
>>> input and generate output much like the package construction process.
>>>
>>>