From mboxrd@z Thu Jan 1 00:00:00 1970
From: myglc2
Subject: Re: leaky pipelines and Guix
Date: Fri, 04 Mar 2016 18:29:20 -0500
Message-ID: <87vb51hlf3.fsf@gmail.com>
References: <87egci10tz.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
To: help-guix@gnu.org

ludo@gnu.org (Ludovic Courtès) writes:

> Ricardo Wurmus skribis:
>
>> [...]
>> So, how could I package something like that?  Is packaging the wrong
>> approach here and should I really just be using “guix environment” to
>> prepare a suitable environment, run the pipeline, and then exit?
>
> Maybe packages are the wrong abstraction here?
>
> IIUC, a pipeline is really a function that takes inputs and produces
> output(s).
So it can definitely be modeled as a derivation.  I built and ran
reproducible pipelines on HPC clusters for the last 5 years.  IMO the
derivation model fits (disclaimer, I am still trying to figure Guix
out ;)

I think of a generic pipeline as a series of job steps (a series of
derivations).  Job steps must be configured at a meta level in terms of
parameters, dependencies, inputs, and outputs.  I found Grid Engine
qmake (which is GNU Make integrated with the Grid Engine scheduler)
extremely useful for this.  I used it to configure & manage the
pipeline, express dependencies, partition & manage parallel tasks, deal
with error conditions, and manage starts and re-starts.  Such pipeline
jobs ran for weeks without incident.

I dealt with the input/output problem using a recursive sub-make
architecture in which data flowed up the analysis (make) directory
tree.  I dealt with modularity by using git submodules, and I checked
results into git for provenance.  The only real fly in the ointment was
that make uses time stamps; what you really want is a hash or a git
SHA.  Of course there were also hideous problems with software
configuration, but I expect Guix will solve those :=)

> Perhaps as a first step you could try and write a procedure and a CLI
> around it that simply runs a given pipeline:
>
>   $ guix biopipeline foo frog.dna human.dna
>   …
>   /gnu/store/…-freak.dna
>
> The procedure itself would be along the lines of:
>
>   (define (foo-pipeline input1 input2)
>     (gexp->derivation "result"
>       #~(begin
>           (setenv "PATH" "/foo/bar")
>           (invoke-make-and-co #$input1 #$input2
>                               #$output))))

Sidebar:

- What is "biopipeline" above?  A new guix command?
- Should "foo-pipeline" read "foo", or vice versa?

>> [...]
>> However, most pipelines do not take this approach.  Pipelines are
>> often designed as glue (written in Perl, or as Makefiles) that ties
>> together other tools in some particular order.  These tools are
>> usually assumed to be available on the PATH.
Yes, these pipelines are generally badly designed.

>> [...]
>> I can easily create a shared profile containing the tools that are
>> needed by a particular pipeline and provide a wrapper script that
>> does something like this (pseudo-code):
>>
>>   bash
>>   eval $(guix package --search-paths=prefix)
>>   do things
>>   exit
>>
>> But I wouldn’t want to do this for individual users, letting them
>> install all tools in a separate profile to run that pipeline, run
>> something like the above to set up the environment, then fetch the
>> tarball containing the glue code that constitutes the pipeline
>> (because we wouldn’t offer a Guix package for something that’s not
>> usable without so much effort to prepare an environment first),
>> unpack it and then run it inside that environment.
>>
>> To me this seems to be in the twilight zone between proper packaging
>> and a use-case for “guix environment”.  I welcome any comments about
>> how to approach this and I’m looking forward to the many practical
>> tricks that I must have overlooked.

An attraction of Guix is the possibility of placing job step inputs and
outputs in the store, or something like the store.  So how about
integrating GNU Make with Guix to enable job steps that are equivalent
to ...

  step1: guix environment foo bar && \
         read the store, do things, save in store

... or maybe something like ...

  step2: send to guix-daemon:

> (define (foo-pipeline input1 input2)
>   (gexp->derivation "result"
>     #~(begin
>         (setenv "PATH" "/foo/bar")
>         (invoke-make-and-co #$input1 #$input2
>                             #$output))))

Then you can provide a pipeline by providing a makefile.  Things needed
to make this work:

- make integration with the guix store
- make integration with guix-daemon
- HPC scheduler: qmake allows specific HPC resources to be requested
  for each job step (e.g. memory, slots (CPUs)).  Grid Engine uses
  these to determine where the steps run.

Maybe these features could be achieved by running the guix daemon over
a scheduler, like slurm.
Or maybe by submitting job steps to slurm which are in turn run by the
daemon?  (disclaimer, I am still trying to figure Guix out ;)

- George
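P.S. To make the hash-vs-timestamp point above concrete, here is a toy
shell sketch (nothing Guix- or qmake-specific; the file names and the
`step' helper are made up for illustration): a job step re-runs only
when the hash of its input changes, not when its timestamp does.

```shell
#!/bin/sh
# Toy content-addressed "make": re-run a step only if the sha256 of
# its input differs from the hash recorded on the previous run.
set -e

step () {
    input=$1; output=$2; shift 2          # remaining args: the command
    new_hash=$(sha256sum "$input" | cut -d' ' -f1)
    stamp="$output.inhash"                # records the input hash used
    if [ -f "$output" ] && [ -f "$stamp" ] \
       && [ "$(cat "$stamp")" = "$new_hash" ]; then
        echo "step up to date: $output"   # input unchanged: skip
    else
        "$@" < "$input" > "$output"       # run the tool
        echo "$new_hash" > "$stamp"
    fi
}

echo hello > in.txt
step in.txt out.txt tr a-z A-Z            # runs tr, records hash
step in.txt out.txt tr a-z A-Z            # skipped: input hash unchanged
```

Touching in.txt without changing its contents would still skip the
step, which is exactly what timestamps get wrong; hashing the whole
input closure is, of course, what a derivation already does.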