From: myglc2 <myglc2@gmail.com>
To: help-guix@gnu.org
Subject: Re: leaky pipelines and Guix
Date: Fri, 04 Mar 2016 18:29:20 -0500
Message-ID: <87vb51hlf3.fsf@gmail.com>
In-Reply-To: 87egci10tz.fsf@gnu.org

ludo@gnu.org (Ludovic Courtès) writes:

> Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> skribis:

>> [...]
>> So, how could I package something like that?  Is packaging the wrong
>> approach here and should I really just be using “guix environment” to
>> prepare a suitable environment, run the pipeline, and then exit?

> Maybe packages are the wrong abstraction here?
>
> IIUC, a pipeline is really a function that takes inputs and produces
> output(s).  So it can definitely be modeled as a derivation.

I have built and run reproducible pipelines on HPC clusters for the last
5 years. IMO the derivation model fits (disclaimer, I am still trying to
figure Guix out ;)

I think of a generic pipeline as a series of job steps (series of
derivations). Job steps must be configured at a meta level in terms of
parameters, dependencies, inputs and outputs. I found Grid Engine qmake
(which is GNU Make integrated with the Grid Engine scheduler) extremely
useful for this. I used it to configure & manage the pipeline, express
dependencies, partition & manage parallel tasks, deal with error
conditions, and manage starts and re-starts. Such pipeline jobs ran for
weeks without incident.
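
To make "job steps configured in terms of dependencies" concrete, here
is a minimal sketch in Python (3.9 or later, for graphlib). It is not
the qmake setup described above; the step names are hypothetical and
only illustrate how a dependency table determines a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical job steps (assumption, not from the original pipeline):
# each step maps to the set of steps it depends on.
steps = {
    "fetch": set(),
    "align": {"fetch"},
    "call-variants": {"align"},
    "report": {"align", "call-variants"},
}


def run_order(steps):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(steps).static_order())


order = run_order(steps)
for name in order:
    print("would run:", name)
```

A scheduler-aware tool such as qmake does the same ordering, but also
dispatches independent steps in parallel.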

I dealt with the input/output problem using a recursive sub-make
architecture in which data flowed up the analysis (make) directory
tree. I dealt with modularity by using git submodules. I checked results
into git for provenance. The only real fly in the ointment was that make
uses time stamps. What you really want is a hash or a git SHA. Of course
there were also hideous problems w/software configuration, but I expect
guix will solve those :=)
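
To illustrate the time stamp point: a rough sketch, in Python, of
deciding whether a step must re-run by comparing content hashes of its
inputs instead of modification times. The helper names (file_digest,
is_stale, record) and the stamp-file layout are assumptions made for
this example, not part of make, qmake, or Guix.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_stale(target, inputs, stamp_file):
    """True if the target is missing, no digests were recorded yet,
    or any input's content differs from the recorded digests."""
    stamp = Path(stamp_file)
    if not Path(target).exists() or not stamp.exists():
        return True
    current = {str(p): file_digest(p) for p in inputs}
    return json.loads(stamp.read_text()) != current


def record(inputs, stamp_file):
    """Store the input digests after a successful run of the step."""
    digests = {str(p): file_digest(p) for p in inputs}
    Path(stamp_file).write_text(json.dumps(digests, sort_keys=True))
```

With this scheme, touching an input without changing its contents does
not trigger a rebuild, while any real change does, whatever the time
stamps say.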

> Perhaps as a first step you could try and write a procedure and a CLI
> around it that simply runs a given pipeline:
>
>   $ guix biopipeline foo frog.dna human.dna
>   …
>   /gnu/store/…-freak.dna
>
> The procedure itself would be along the lines of:
>
>   (define (foo-pipeline input1 input2)
>     (gexp->derivation "result"
>                       #~(begin
>                           (setenv "PATH" "/foo/bar")
>                           (invoke-make-and-co #$input1 #$input2
>                                               #$output))))

Sidebar:

- What is "biopipeline" above? A new guix command?

- Should "foo-pipeline" read "foo", or vice versa?

>> [...]
>> However, most pipelines do not take this approach.  Pipelines are often
>> designed as glue (written in Perl, or as Makefiles) that ties together
>> other tools in some particular order.  These tools are usually assumed
>> to be available on the PATH.

Yes, these pipelines are generally badly designed.

>> [...]
>> I can easily create a shared profile containing the tools that are
>> needed by a particular pipeline and provide a wrapper script that
>> does something like this (pseudo-code):
>>
>>     bash
>>     eval $(guix package --search-paths=prefix)
>>     do things
>>     exit
>>
>> But I wouldn’t want to do this for individual users, letting them
>> install all tools in a separate profile to run that pipeline, run
>> something like the above to set up the environment, then fetch the
>> tarball containing the glue code that constitutes the pipeline
>> (because we wouldn’t offer a Guix package for something that’s not
>> usable without so much effort to prepare an environment first),
>> unpack it and then run it inside that environment.
>>
>> To me this seems to be in the twilight zone between proper packaging and
>> a use-case for “guix environment”.  I welcome any comments about how to
>> approach this and I’m looking forward to the many practical tricks that
>> I must have overlooked.

An attraction of Guix is the possibility of placing job step inputs and
outputs in the store, or something like the store. So how about
integrating GNU Make with Guix to enable job steps that are equivalent
to ...

step1:
        guix environment foo bar && read the store, do things, save in store

... Or maybe something like ...

step2:
        send to guix-daemon:
>   (define (foo-pipeline input1 input2)
>     (gexp->derivation "result"
>                       #~(begin
>                           (setenv "PATH" "/foo/bar")
>                           (invoke-make-and-co #$input1 #$input2
>                                               #$output))))

Then you can provide a pipeline by providing a makefile.

Things needed to make this work:

- make integration with guix store

- make integration with guix-daemon

- HPC scheduler: qmake allows specific HPC resources to be requested for
  each job step (e.g. memory, slots (CPUs)). Grid Engine uses these to
  determine where the steps run. Maybe these features could be achieved
  by running the guix daemon over a scheduler, like slurm. Or maybe by
  submitting job steps to slurm which are in turn run by the daemon?
  (disclaimer, I am still trying to figure Guix out ;)

- George

