leaky pipelines and Guix

unofficial mirror of help-guix@gnu.org 

* leaky pipelines and Guix
@ 2016-02-09 11:25 Ricardo Wurmus
  2016-02-12 14:04 ` Ludovic Courtès
  0 siblings, 1 reply; 7+ messages in thread
From: Ricardo Wurmus @ 2016-02-09 11:25 UTC
  To: help-guix

Hi Guix,

although I’m comfortable packaging software for Guix I’m still not
confident enough to tackle bioinformatics pipelines, as they don’t play
well with isolation.

In the pipeline that I’m currently working on as a consultant packager
I’m trying to treat the pipeline itself as a first-class package.  This
means that the locations of the tools it calls out to are all
configurable (thanks to auto{conf,make}) and they certainly do not have
to be in the PATH.  This allows us to install this pipeline (and the
tools it needs) easily alongside other variants of tools.  The pipeline
is also not just a bare Makefile but has a wrapper script to provide a
simplified user interface.

However, most pipelines do not take this approach.  Pipelines are often
designed as glue (written in Perl, or as Makefiles) that ties together
other tools in some particular order.  These tools are usually assumed
to be available on the PATH.  Pipelines aren’t treated enough like
packages (which will be the subject of an inflammatory, click-baiting
blog post that I’m working on), so they usually come without a
configuration script to override implicit assumptions.

In the context of Guix this means that each pipeline would need its very
own isolated environment where the PATH is set up to contain the
locations of all tools that are needed at runtime (that’s what I mean by
“leaky”).  As many pipelines do not come with wrapper scripts there is
no easy way to sneakily set up such an environment for the duration of
the run.

So, how could I package something like that?  Is packaging the wrong
approach here and should I really just be using “guix environment” to
prepare a suitable environment, run the pipeline, and then exit?  I know
that there is work in progress to support profile-based environments
that would make this a little more feasible (as the environment wouldn’t
be as volatile as they are now), but it seems somewhat inconvenient.

This pains me especially in the context of multi-user systems.  I can
easily create a shared profile containing the tools that are needed by a
particular pipeline and provide a wrapper script that does something
like this (pseudo-code):

    bash
    eval $(guix package --search-paths=prefix)
    do things
    exit

But I wouldn’t want to do this for individual users, letting them
install all tools in a separate profile to run that pipeline, run
something like the above to set up the environment, then fetch the
tarball containing the glue code that constitutes the pipeline (because
we wouldn’t offer a Guix package for something that’s not usable without
so much effort to prepare an environment first), unpack it and then run
it inside that environment.

To me this seems to be in the twilight zone between proper packaging and
a use-case for “guix environment”.  I welcome any comments about how to
approach this and I’m looking forward to the many practical tricks that
I must have overlooked.

~~ Ricardo


* Re: leaky pipelines and Guix
  2016-02-09 11:25 leaky pipelines and Guix Ricardo Wurmus
@ 2016-02-12 14:04 ` Ludovic Courtès
  2016-03-04 23:29   ` myglc2
  2016-03-05 11:05   ` Ricardo Wurmus
  0 siblings, 2 replies; 7+ messages in thread
From: Ludovic Courtès @ 2016-02-12 14:04 UTC
  To: Ricardo Wurmus; +Cc: help-guix

Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> skribis:

> So, how could I package something like that?  Is packaging the wrong
> approach here and should I really just be using “guix environment” to
> prepare a suitable environment, run the pipeline, and then exit?

Maybe packages are the wrong abstraction here?

IIUC, a pipeline is really a function that takes inputs and produces
output(s).  So it can definitely be modeled as a derivation.

Perhaps as a first step you could try and write a procedure and a CLI
around it that simply runs a given pipeline:

  $ guix biopipeline foo frog.dna human.dna
  …
  /gnu/store/…-freak.dna

The procedure itself would be along the lines of:

  (define (foo-pipeline input1 input2)
    (gexp->derivation "result"
                      #~(begin
                          (setenv "PATH" "/foo/bar")
                          (invoke-make-and-co #$input1 #$input2
                                              #$output))))

Once you’ve done this exercise for a couple of pipelines, perhaps you’ll
find a higher-level abstraction that captures properties common to all
bioinfo pipelines?

HTH,
Ludo’.


* Re: leaky pipelines and Guix
  2016-02-12 14:04 ` Ludovic Courtès
@ 2016-03-04 23:29   ` myglc2
  2016-03-07  9:56     ` Ludovic Courtès
  2016-03-05 11:05   ` Ricardo Wurmus
  1 sibling, 1 reply; 7+ messages in thread
From: myglc2 @ 2016-03-04 23:29 UTC
  To: help-guix

ludo@gnu.org (Ludovic Courtès) writes:

> Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> skribis:

>> [...]
>> So, how could I package something like that?  Is packaging the wrong
>> approach here and should I really just be using “guix environment” to
>> prepare a suitable environment, run the pipeline, and then exit?

> Maybe packages are the wrong abstraction here?
>
> IIUC, a pipeline is really a function that takes inputs and produces
> output(s).  So it can definitely be modeled as a derivation.

I built and ran reproducible pipelines on HPC clusters for the last 5
years. IMO the derivation model fits (disclaimer, I am still trying to
figure Guix out ;)

I think of a generic pipeline as a series of job steps (series of
derivations). Job steps must be configured at a meta level in terms of
parameters, dependencies, inputs and outputs. I found Grid Engine qmake
(which is GNU Make integrated with the Grid Engine scheduler) extremely
useful for this. I used it to configure & manage the pipeline, express
dependencies, partition & manage parallel tasks, deal with error
conditions, and manage starts and re-starts. Such pipeline jobs ran for
weeks without incident.

I dealt with the input/output problem using a recursive sub-make
architecture in which data flowed up the analysis (make) directory
tree. I dealt with modularity by using git submodules. I checked results
into git for provenance. The only real fly in the ointment was that make
uses time stamps. What you really want is a hash or a git SHA. Of course
there were also hideous problems w/software configuration, but I expect
guix will solve those :=)
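
The timestamp problem mentioned above can be worked around in plain
shell.  A minimal sketch, assuming GNU coreutils' `sha256sum` is
available; the function name `run_if_changed` and the stamp-file
convention are hypothetical, not something qmake or Guix provides:

```shell
# Re-run a step only when the *content* of its input changed, by comparing
# a SHA-256 digest against the one recorded after the last successful run.
# Usage: run_if_changed INPUT STAMP COMMAND [ARGS...]
run_if_changed() {
    input=$1
    stamp=$2
    shift 2
    # Digest of the current input content (assumes GNU sha256sum).
    new=$(sha256sum < "$input") || return 1
    if [ -f "$stamp" ] && [ "$new" = "$(cat "$stamp")" ]; then
        return 0        # same content: skip, whatever the mtime says
    fi
    "$@" || return 1    # run the step; record the digest only on success
    printf '%s\n' "$new" > "$stamp"
}
```

With this, touching an input file does not trigger a re-run, while
changing its content does.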

> Perhaps as a first step you could try and write a procedure and a CLI
> around it that simply runs a given pipeline:
>
>   $ guix biopipeline foo frog.dna human.dna
>   …
>   /gnu/store/…-freak.dna
>
> The procedure itself would be along the lines of:
>
>   (define (foo-pipeline input1 input2)
>     (gexp->derivation "result"
>                       #~(begin
>                           (setenv "PATH" "/foo/bar")
>                           (invoke-make-and-co #$input1 #$input2
>                                               #$output))))

Sidebar:

- What is "biopipeline" above? A new guix command?

- Should "foo-pipeline" read "foo", or vice versa?

>> [...]
>> However, most pipelines do not take this approach.  Pipelines are often
>> designed as glue (written in Perl, or as Makefiles) that ties together
>> other tools in some particular order.  These tools are usually assumed
>> to be available on the PATH.

Yes, these pipelines are generally badly designed.

>> [...]
>> I can easily create a shared profile containing the tools that are
>> needed by a particular pipeline and provide a wrapper script that
>> does something like this (pseudo-code):
>>
>>     bash
>>     eval $(guix package --search-paths=prefix)
>>     do things
>>     exit
>>
>> But I wouldn’t want to do this for individual users, letting them
>> install all tools in a separate profile to run that pipeline, run
>> something like the above to set up the environment, then fetch the
>> tarball containing the glue code that constitutes the pipeline
>> (because we wouldn’t offer a Guix package for something that’s not
>> usable without so much effort to prepare an environment first),
>> unpack it and then run it inside that environment.
>>
>> To me this seems to be in the twilight zone between proper packaging and
>> a use-case for “guix environment”.  I welcome any comments about how to
>> approach this and I’m looking forward to the many practical tricks that
>> I must have overlooked.

An attraction of Guix is the possibility of placing job step inputs and
outputs in the store, or something like the store. So how about
integrating GNU Make with Guix to enable job steps that are equivalent
to ...

step1:
        guix environment foo bar && read the store, do things, save in store

... Or maybe something like ...

step2:
        send to guix-daemon:
>   (define (foo-pipeline input1 input2)
>     (gexp->derivation "result"
>                       #~(begin
>                           (setenv "PATH" "/foo/bar")
>                           (invoke-make-and-co #$input1 #$input2
>                                               #$output))))

Then you can provide a pipeline by providing a makefile.

Things needed to make this work:

- make integration with guix store

- make integration with guix-daemon

- HPC scheduler: qmake allows specific HPC resources to be requested for
  each job step (e.g. memory, slots(cpus)). Grid engine uses these to
  determine where the steps run. Maybe these features could be achieved
  by running the guix daemon over a scheduler, like slurm. Or maybe by
  submitting job steps to slurm which are in turn run by the daemon?
  (disclaimer, I am still trying to figure Guix out ;) - George


* Re: leaky pipelines and Guix
  2016-02-12 14:04 ` Ludovic Courtès
  2016-03-04 23:29   ` myglc2
@ 2016-03-05 11:05   ` Ricardo Wurmus
  2016-03-07  9:54     ` Ludovic Courtès
  1 sibling, 1 reply; 7+ messages in thread
From: Ricardo Wurmus @ 2016-03-05 11:05 UTC
  To: Ludovic Courtès; +Cc: help-guix

Ludovic Courtès <ludo@gnu.org> writes:

> Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> skribis:
>
>> So, how could I package something like that?  Is packaging the wrong
>> approach here and should I really just be using “guix environment” to
>> prepare a suitable environment, run the pipeline, and then exit?
>
> Maybe packages are the wrong abstraction here?
>
> IIUC, a pipeline is really a function that takes inputs and produces
> output(s).  So it can definitely be modeled as a derivation.

This may be true and the basic abstraction you propose seems correct and
useful, but I was talking about existing pipelines.  They have already
been implemented using snakemake or make to keep track of individual
steps, etc.  My primary concern is with making these pipelines work, not
to rewrite them.

For a particularly nasty pipeline I’m just using a separate profile
just for the pipeline dependencies.  Users build the pipeline glue code
themselves by whatever means they deem appropriate and then load the
profile in a subshell:

    bash
    source /path/to/pipeline-profile/etc/profile
    # run the pipeline here
    exit
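
The subshell steps above can be folded into one helper so that users
need not remember to leave the subshell.  A minimal sketch;
`run_in_profile` is a hypothetical name, and the only assumption about
the profile is that it has an `etc/profile` file to source:

```shell
# Run a command with a profile's environment loaded.  The function body is
# a subshell, so the caller's PATH and other variables are left untouched.
# Usage: run_in_profile PROFILE_DIR COMMAND [ARGS...]
run_in_profile() (
    profile=$1
    shift
    if [ ! -r "$profile/etc/profile" ]; then
        echo "run_in_profile: no etc/profile in $profile" >&2
        exit 1
    fi
    # Load the search paths (PATH etc.) that the profile defines.
    . "$profile/etc/profile"
    # Replace the subshell with the pipeline command.
    exec "$@"
)

# Example (path is a placeholder):
#   run_in_profile /path/to/pipeline-profile make -f pipeline.mk
```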

I think that these existing bio pipelines should really be treated more
like configurable packages.  For a pipeline that we’re currently working
on I’m involved in making sure that it can be packaged and installed.
We chose to use autoconf to substitute tool placeholders at configure
time.  This allows us to install the pipeline easily with Guix as we can
treat tools just as regular runtime dependencies.  At configure time the
actual full paths to the needed tools are injected into the sources, so
we don’t need to propagate anything and make assumptions about PATH.
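
To illustrate the configure-time substitution described above, here is
a rough sketch of its effect, not the actual autoconf machinery.  The
helper `substitute_tools` and the `@ALIGNER@` placeholder are
hypothetical; a real configure script would locate the tool itself and
perform the substitution when generating files from their `.in`
templates.

```shell
# Replace @NAME@ placeholders in a template with absolute tool paths.
# Assumes the paths contain no '|' or '&' characters (they are special
# in the sed expression below).
# Usage: substitute_tools TEMPLATE OUTPUT NAME=/abs/path ...
substitute_tools() {
    template=$1
    output=$2
    shift 2
    cp "$template" "$output" || return 1
    for pair in "$@"; do
        name=${pair%%=*}    # text before the first '='
        path=${pair#*=}     # text after the first '='
        # Use '|' as the sed delimiter because paths contain '/'.
        sed "s|@${name}@|${path}|g" "$output" > "$output.tmp" &&
            mv "$output.tmp" "$output" || return 1
    done
}

# Example (paths are placeholders):
#   substitute_tools Makefile.in Makefile ALIGNER=/opt/tools/bin/aligner
```

After this step the generated file calls each tool by its full path, so
nothing has to be found on PATH at run time.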

Many problems with bio pipelines stem from the fact that they are not
treated as first-class applications, so they often don’t have a wrapper
script, nor a configuration or installation step.  I think the easiest
way to fix this is to encourage the design of pipelines as real software
packages rather than distributing bland Makefiles/snakefiles and
assuming that the user will arrange for a suitable environment.

~~ Ricardo


* Re: leaky pipelines and Guix
  2016-03-05 11:05   ` Ricardo Wurmus
@ 2016-03-07  9:54     ` Ludovic Courtès
  0 siblings, 0 replies; 7+ messages in thread
From: Ludovic Courtès @ 2016-03-07  9:54 UTC
  To: Ricardo Wurmus; +Cc: help-guix

Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> skribis:
>>
>>> So, how could I package something like that?  Is packaging the wrong
>>> approach here and should I really just be using “guix environment” to
>>> prepare a suitable environment, run the pipeline, and then exit?
>>
>> Maybe packages are the wrong abstraction here?
>>
>> IIUC, a pipeline is really a function that takes inputs and produces
>> output(s).  So it can definitely be modeled as a derivation.
>
> This may be true and the basic abstraction you propose seems correct and
> useful, but I was talking about existing pipelines.  They have already
> been implemented using snakemake or make to keep track of individual
> steps, etc.  My primary concern is with making these pipelines work, not
> to rewrite them.

Oh, got it.

> For a particularly nasty pipeline I’m just using a separate profile
> just for the pipeline dependencies.  Users build the pipeline glue code
> themselves by whatever means they deem appropriate and then load the
> profile in a subshell:
>
>     bash
>     source /path/to/pipeline-profile/etc/profile
>     # run the pipeline here
>     exit
>
> I think that these existing bio pipelines should really be treated more
> like configurable packages.  For a pipeline that we’re currently working
> on I’m involved in making sure that it can be packaged and installed.
> We chose to use autoconf to substitute tool placeholders at configure
> time.  This allows us to install the pipeline easily with Guix as we can
> treat tools just as regular runtime dependencies.  At configure time the
> actual full paths to the needed tools are injected into the sources, so
> we don’t need to propagate anything and make assumptions about PATH.
>
> Many problems with bio pipelines stem from the fact that they are not
> treated as first-class applications, so they often don’t have a wrapper
> script, nor a configuration or installation step.  I think the easiest
> way to fix this is to encourage the design of pipelines as real software
> packages rather than distributing bland Makefiles/snakefiles and
> assuming that the user will arrange for a suitable environment.

Indeed.  Then I think if existing pipelines are shell scripts or small
programs, it makes sense to treat them as packages.

Ludo’.


* Re: leaky pipelines and Guix
  2016-03-04 23:29   ` myglc2
@ 2016-03-07  9:56     ` Ludovic Courtès
  2016-03-07 23:21       ` myglc2
  0 siblings, 1 reply; 7+ messages in thread
From: Ludovic Courtès @ 2016-03-07  9:56 UTC
  To: myglc2; +Cc: help-guix

myglc2 <myglc2@gmail.com> skribis:

> ludo@gnu.org (Ludovic Courtès) writes:

[...]

>> Perhaps as a first step you could try and write a procedure and a CLI
>> around it that simply runs a given pipeline:
>>
>>   $ guix biopipeline foo frog.dna human.dna
>>   …
>>   /gnu/store/…-freak.dna
>>
>> The procedure itself would be along the lines of:
>>
>>   (define (foo-pipeline input1 input2)
>>     (gexp->derivation "result"
>>                       #~(begin
>>                           (setenv "PATH" "/foo/bar")
>>                           (invoke-make-and-co #$input1 #$input2
>>                                               #$output))))
>
> Sidebar:
>
> - What is "biopipeline" above? A new guix command?

Right.  Basically I was suggesting implementing the pipeline as a Scheme
procedure (‘foo-pipeline’ above), and adding a command-line interface on
top of it (‘guix biopipeline’.)

This means that all inputs and outputs would go through the store, so
you would get caching and all that for free.

But I now understand that I was slightly off-topic.  ;-)

Ludo’.


* Re: leaky pipelines and Guix
  2016-03-07  9:56     ` Ludovic Courtès
@ 2016-03-07 23:21       ` myglc2
  0 siblings, 0 replies; 7+ messages in thread
From: myglc2 @ 2016-03-07 23:21 UTC
  To: Ludovic Courtès; +Cc: help-guix

ludo@gnu.org (Ludovic Courtès) writes:

> myglc2 <myglc2@gmail.com> skribis:
>
>> ludo@gnu.org (Ludovic Courtès) writes:
>
> [...]
>
>>> Perhaps as a first step you could try and write a procedure and a CLI
>>> around it that simply runs a given pipeline:
>>>
>>>   $ guix biopipeline foo frog.dna human.dna
>>>   …
>>>   /gnu/store/…-freak.dna
>>>
>>> The procedure itself would be along the lines of:
>>>
>>>   (define (foo-pipeline input1 input2)
>>>     (gexp->derivation "result"
>>>                       #~(begin
>>>                           (setenv "PATH" "/foo/bar")
>>>                           (invoke-make-and-co #$input1 #$input2
>>>                                               #$output))))
>>
>> Sidebar:
>>
>> - What is "biopipeline" above? A new guix command?
>
> Right.  Basically I was suggesting implementing the pipeline as a Scheme
> procedure (‘foo-pipeline’ above), and adding a command-line interface on
> top of it (‘guix biopipeline’.)
>
> This means that all inputs and outputs would go through the store, so
> you would get caching and all that for free.
>
> But I now understand that I was slightly off-topic.  ;-)

Thanks. Having built bespoke analysis pipelines for the last five years,
I find your idea intriguing. So my response to the original post was
slightly off-topic, also ;)
