Workflow management with GNU Guix

unofficial mirror of guix-devel@gnu.org 

* Workflow management with GNU Guix
@ 2016-05-12  8:43 Roel Janssen
  2016-05-12 11:41 ` Taylan Ulrich Bayırlı/Kammer
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Roel Janssen @ 2016-05-12  8:43 UTC
  To: guix-devel

Dear Guix,

With GNU Guix we are able to install programs to our machines with an amazing
level of control over the dependency graph of the programs.  We can now know
what code will run when we invoke a program.  We can now know what the impact
of an upgrade will be.  And we can now safely roll-back to previous states.

What seems to be a common practice in research involving data analysis is
running multiple programs in a chain to transform raw data into specific
results.  This is often referred to as a "pipeline" or a "workflow".  Because
data sets can be quite large in comparison to the computing power of our
laptops, the data analysis is performed on computing clusters instead of
single machines.

The usage of a pipeline/workflow is somewhat different from package
construction, because we want to run the sequence of commands on different data
sets (as opposed to running it on the same source code).  Plus, I would like to
integrate it with existing computing clusters that have a job scheduling system
in place.

The reason I think this should be possible with Guix is that it has
everything in place to do software deployment and run-time isolation
(containers).  From there it is a small step to executing programs in an
automated way.

So, I would like to propose a new Guix subcommand and an extension to
the package management language to add workflow management features.

Would this be a feature you are interested in adding to GNU Guix?

I'm currently working on a proof-of-concept implementation that has three
record types/levels of abstraction:
<workflow>:  Describes which <process>es should be run, and concerns itself with
             the order of execution.

<process>:   Describes what packages are needed to run the programs involved,
             and its relationship to other processes.  Processes take input and
             generate output much like the package construction process.

<script>:    Short and simple imperative instructions to perform a task. They are
             part of a <process>.  Currently, my implementation generates a
             script that can be written in Guile, sh, Perl or Python.
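
As a rough illustration of these three levels (a sketch only, not the
proof-of-concept code itself), they could be modelled in plain Guile with
SRFI-9 records.  The field names, the `workflow-run-order' helper and the
example process and package names below are all assumptions made up for this
sketch.

```scheme
(use-modules (srfi srfi-1)   ; partition, every
             (srfi srfi-9))  ; define-record-type

;; Hypothetical record layouts; every field name here is an assumption.
(define-record-type <script>
  (make-script language source)
  script?
  (language script-language)   ; e.g. 'guile, 'sh, 'perl or 'python
  (source   script-source))    ; the instructions, as a string

(define-record-type <process>
  (make-process name packages depends-on script)
  process?
  (name       process-name)
  (packages   process-packages)    ; names of packages needed at run time
  (depends-on process-depends-on)  ; names of processes that must run first
  (script     process-script))

(define-record-type <workflow>
  (make-workflow name processes)
  workflow?
  (name      workflow-name)
  (processes workflow-processes))

(define (workflow-run-order workflow)
  "Return the names of WORKFLOW's processes, dependencies first."
  (let loop ((pending (workflow-processes workflow))
             (done '()))               ; names already scheduled, newest first
    (if (null? pending)
        (reverse done)
        (call-with-values
            (lambda ()
              ;; A process is ready once all its dependencies are scheduled.
              (partition (lambda (p)
                           (every (lambda (d) (member d done))
                                  (process-depends-on p)))
                         pending))
          (lambda (ready blocked)
            (when (null? ready)
              (error "cyclic or unknown dependency"
                     (map process-name blocked)))
            (loop blocked
                  (append (reverse (map process-name ready)) done)))))))

;; Illustrative workflow with two made-up processes, listed out of order.
(define example
  (make-workflow
   "align-and-call"
   (list (make-process "call-variants" '("samtools") '("align")
                       (make-script 'sh "echo calling"))
         (make-process "align" '("bwa") '()
                       (make-script 'sh "echo aligning")))))

(workflow-run-order example)
```

Here the <workflow> only decides ordering, each <process> carries its package
requirements, and the <script> is opaque text; how a process is actually
executed (locally or through a job scheduler) is left out on purpose.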

The subcommand I envision is:
  guix workflow

With primarily:
  guix workflow --run=<name-of-workflow-definition>

If you are interested in adding any form of workflow management to GNU Guix, I
can elaborate on my proof-of-concept implementation, so we can work from there.
(or throw everything out of the window and start from scratch ;-))

Thanks again for your time.

Kind regards,
Roel Janssen


* Re: Workflow management with GNU Guix
  2016-05-12  8:43 Workflow management with GNU Guix Roel Janssen
@ 2016-05-12 11:41 ` Taylan Ulrich Bayırlı/Kammer
  2016-05-12 16:06 ` Ludovic Courtès
       [not found] ` <idjfutnih58.fsf@bimsb-sys02.mdc-berlin.net>
  2 siblings, 0 replies; 13+ messages in thread
From: Taylan Ulrich Bayırlı/Kammer @ 2016-05-12 11:41 UTC
  To: Roel Janssen; +Cc: guix-devel

Roel Janssen <roel@gnu.org> writes:

> The usage of a pipeline/workflow is somewhat different from the
> package construction, because we want to run the sequence of commands
> on different data sets (as opposed to running it on the same source
> code).

Is this not conceptually the same thing as changing the 'source' field
of a package recipe?  With the new package transformation feature[0],
this can be done "on the fly" like:

    guix build emacs --with-source=emacs-25.1-alpha.tar.xz

Maybe a "process" can just be a build phase, and a "workflow" a build
system, as they currently exist in Guix.  Not sure what a "script" would
be, though build phases can easily execute shell commands, scripts, and
so on within the build directory.

That means one could write a "package recipe" that doesn't really build
a package from source code, but rather creates arbitrary output files
from arbitrary input files.  (Same thing to Guix anyway.)  The 'source'
field of the recipe would contain some dummy value, and one would
specify the real input like:

    guix build processed-data --with-source=raw-data-2016-05-12.txt

So maybe Guix already has everything you need? :-) Not sure if I fully
understand the problem domain though, so apologies if I'm missing the
point.

Taylan

[0] https://lists.gnu.org/archive/html/guix-devel/2016-02/msg00001.html


* Re: Workflow management with GNU Guix
  2016-05-12  8:43 Workflow management with GNU Guix Roel Janssen
  2016-05-12 11:41 ` Taylan Ulrich Bayırlı/Kammer
@ 2016-05-12 16:06 ` Ludovic Courtès
  2016-10-25 13:28   ` Roel Janssen
       [not found] ` <idjfutnih58.fsf@bimsb-sys02.mdc-berlin.net>
  2 siblings, 1 reply; 13+ messages in thread
From: Ludovic Courtès @ 2016-05-12 16:06 UTC
  To: Roel Janssen; +Cc: guix-devel

Hello Guix!

Roel Janssen <roel@gnu.org> skribis:

> So, I would like to propose a new Guix subcommand and an extension to
> the package management language to add workflow management features.
>
> Would this be a feature you are interested in adding to GNU Guix?

I don’t know if it should be in Guix itself (and it’s probably too early
to think about it), but there’s definitely interest in it!

Pjotr mentioned it before, and Ricardo started a thread on this topic on
help-guix in February¹, where we discussed something similar to what you
proposed.  I agree with you that Guix should be a nice tool for the job.

¹ https://lists.gnu.org/archive/html/help-guix/2016-02/msg00019.html

> I'm currently working on a proof-of-concept implementation that has three
> record types/levels of abstraction:
> <workflow>:  Describes which <process>es should be run, and concerns itself with
>              the order of execution.
>
> <process>:   Describes what packages are needed to run the programs involved,
>              and its relationship to other processes.  Processes take input and
>              generate output much like the package construction process.
>
> <script>:    Short and simple imperative instructions to perform a task. They are
>              part of a <process>.  Currently, my implementation generates a shell
>              script that can be either Guile, Sh, Perl or Python.

In the previous discussion, I thought that a gexp would be enough to
write a derivation that implements a workflow.  That is, basically you’d
write:

  (define (my-workflow input)
    (gexp->derivation "result" #~(process-the-thing #$input #$output)))

Maybe it’s all it takes to represent a workflow?  Or maybe my idea of
what workflows look like is too naive.

> The subcommand I envision is:
>   guix workflow
>
> With primarily:
>   guix workflow --run=<name-of-workflow-definition>
>
> If you are interested in adding any form of workflow management to GNU Guix, I
> can elaborate on my proof-of-concept implementation, so we can work from there.
> (or throw everything out of the window and start from scratch ;-))

I’m interested in seeing what it’s like, and examples of it!

Thanks,
Ludo’.


* Re: Workflow management with GNU Guix
  2016-05-12 16:06 ` Ludovic Courtès
@ 2016-10-25 13:28   ` Roel Janssen
  2016-10-26 12:41     ` Ludovic Courtès
  0 siblings, 1 reply; 13+ messages in thread
From: Roel Janssen @ 2016-10-25 13:28 UTC
  To: Ludovic Courtès; +Cc: guix-devel

[-- Attachment #1: gwl.tar.gz --]
[-- Type: application/gzip, Size: 6419 bytes --]

[-- Attachment #2: workflow-language.pdf --]
[-- Type: application/pdf, Size: 664813 bytes --]

[-- Attachment #3: Type: text/plain, Size: 2725 bytes --]


Ludovic Courtès writes:

> Hello Guix!
>
> Roel Janssen <roel@gnu.org> skribis:
>
>> So, I would like to propose a new Guix subcommand and an extension to
>> the package management language to add workflow management features.
>>
>> Would this be a feature you are interested in adding to GNU Guix?
>
> I don’t know if it should be in Guix itself (and it’s probably too early
> to think about it), but there’s definitely interest in it!
>
> Pjotr mentioned it before, and Ricardo started a thread on this topic on
> help-guix in February¹, where we discussed something similar to what you
> proposed.  I agree with you that Guix should be a nice tool for the job.
>
> ¹ https://lists.gnu.org/archive/html/help-guix/2016-02/msg00019.html
>
>> I'm currently working on a proof-of-concept implementation that has three
>> record types/levels of abstraction:
>> <workflow>:  Describes which <process>es should be run, and concerns itself with
>>              the order of execution.
>>
>> <process>:   Describes what packages are needed to run the programs involved,
>>              and its relationship to other processes.  Processes take input and
>>              generate output much like the package construction process.
>>
>> <script>:    Short and simple imperative instructions to perform a task. They are
>>              part of a <process>.  Currently, my implementation generates a shell
>>              script that can be either Guile, Sh, Perl or Python.
>
> In the previous discussion, I thought that a gexp would be enough to
> write a derivation that implements a workflow.  That is, basically you’d
> write:
>
>   (define (my-workflow input)
>     (gexp->derivation "result" #~(process-the-thing #$input #$output)))
>
> Maybe it’s all it takes to represent a workflow?  Or maybe my idea of
> what workflows look like is too naive.
>
>> The subcommand I envision is:
>>   guix workflow
>>
>> With primarily:
>>   guix workflow --run=<name-of-workflow-definition>
>>
>> If you are interested in adding any form of workflow management to GNU Guix, I
>> can elaborate on my proof-of-concept implementation, so we can work from there.
>> (or throw everything out of the window and start from scratch ;-))
>
> I’m interested in seeing what it’s like, and examples of it!

I realize I never shared my proof-of-concept implementation.  I attached
my motivations for having a workflow language in Guix, and my code.

The subcommand "guix workflow" does not work (yet) here.  I currently
execute a workflow directly from the REPL.

A final point to note is that I would like to do a second attempt at
designing the workflow language, changing the way we can execute
programs.

Kind regards,
Roel Janssen


* Re: Workflow management with GNU Guix
  2016-10-25 13:28   ` Roel Janssen
@ 2016-10-26 12:41     ` Ludovic Courtès
  2016-10-26 13:41       ` Roel Janssen
  0 siblings, 1 reply; 13+ messages in thread
From: Ludovic Courtès @ 2016-10-26 12:41 UTC
  To: Roel Janssen; +Cc: guix-devel

Roel Janssen <roel@gnu.org> skribis:

> I realize I never shared my proof-of-concept implementation.  I attached
> my motivations for having a workflow language in Guix, and my code.

Nice work, thanks for sharing!

> The subcommand "guix workflow" does not work (yet) here.  I currently
> execute a workflow directly from the REPL.
>
> A final point to note is that I would like to do a second attempt at
> designing the workflow language, changing the way we can execute
> programs.

IIUC, (guix workflows) from the tarball you sent executes workflows in
the current environment, as opposed to creating a derivation that would
actually perform the workflow.  What motivated this approach?

Workflows could be compiled to derivations, which in turn could be “built”,
and their build result would be the workflow’s output file.

I guess in practice it only works if users of the cluster can build
derivations on the cluster and have them scheduled on compute nodes.

Thoughts?

Thank you!

Ludo’.


* Re: Workflow management with GNU Guix
  2016-10-26 12:41     ` Ludovic Courtès
@ 2016-10-26 13:41       ` Roel Janssen
  2016-10-28 13:15         ` Ludovic Courtès
  0 siblings, 1 reply; 13+ messages in thread
From: Roel Janssen @ 2016-10-26 13:41 UTC
  To: Ludovic Courtès; +Cc: guix-devel

Ludovic Courtès writes:

> Roel Janssen <roel@gnu.org> skribis:
>
>> I realize I never shared my proof-of-concept implementation.  I attached
>> my motivations for having a workflow language in Guix, and my code.
>
> Nice work, thanks for sharing!
>
>> The subcommand "guix workflow" does not work (yet) here.  I currently
>> execute a workflow directly from the REPL.
>>
>> A final point to note is that I would like to do a second attempt at
>> designing the workflow language, changing the way we can execute
>> programs.
>
> IIUC, (guix workflows) from the tarball you sent executes workflows in
> the current environment, as opposed to creating a derivation that would
> actually perform the workflow.  What motivated this approach?

The short answer:
Lack of time to implement it properly ;).

The slightly longer answer:
I want to avoid storing results in the store, because we could be
analyzing files of 100GB or more that we do not want to copy into the
store, neither do we want to store the results of the run in the store.

I now realize we could only put the derivation in the store, and not the
build output itself..

> Workflows could be compiled to derivations, which in turn could be “built”,
> and their build result would be the workflow’s output file.
>
> I guess in practice it only works if users of the cluster can build
> derivations on the cluster and have them scheduled on compute nodes.
>
> Thoughts?

For building derivations, I think we need super user privileges, right?
Why can't the scripts "just" output the environment variables required as
@code{guix package --search-paths} provides, and then run the commands
with the newly set environment?
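
One way to picture this idea: read the `export NAME="VALUE"' lines that
`guix package --search-paths' prints and apply them before running a command.
The sketch below is plain Guile and assumes that line format; the helper
names are made up, and shell expansions inside a value (such as `$PATH') are
copied literally rather than expanded.

```scheme
(use-modules (ice-9 regex))   ; match:substring

;; Matches lines of the assumed form: export NAME="VALUE"
(define %export-rx
  (make-regexp "^export ([A-Za-z_][A-Za-z0-9_]*)=\"(.*)\"$"))

(define (parse-export line)
  "Return a (NAME . VALUE) pair for an export line, or #f for other lines."
  (let ((m (regexp-exec %export-rx line)))
    (and m
         (cons (match:substring m 1) (match:substring m 2)))))

(define (apply-search-paths! lines)
  "Set each variable found in LINES in the current process environment."
  (for-each (lambda (line)
              (let ((pair (parse-export line)))
                (when pair
                  (setenv (car pair) (cdr pair)))))
            lines))
```

After `apply-search-paths!' has run, a later `system*' call in the same
process inherits the new variables, which is all the "run the commands with
the newly set environment" step needs.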

Kind regards,
Roel Janssen


* Re: Workflow management with GNU Guix
  2016-10-26 13:41       ` Roel Janssen
@ 2016-10-28 13:15         ` Ludovic Courtès
  2016-10-28 14:40           ` Roel Janssen
  0 siblings, 1 reply; 13+ messages in thread
From: Ludovic Courtès @ 2016-10-28 13:15 UTC
  To: Roel Janssen; +Cc: guix-devel

Hi!

Roel Janssen <roel@gnu.org> skribis:

> Ludovic Courtès writes:

[...]

>> IIUC, (guix workflows) from the tarball you sent executes workflows in
>> the current environment, as opposed to creating a derivation that would
>> actually perform the workflow.  What motivated this approach?
>
> The short answer:
> Lack of time to implement it properly ;).
>
> The slightly longer answer:
> I want to avoid storing results in the store, because we could be
> analyzing files of 100GB or more that we do not want to copy into the
> store, neither do we want to store the results of the run in the store.

Good point!

> I now realize we could only put the derivation in the store, and not the
> build output itself..

A derivation has to get its inputs from the store, and to write its
output to the store.  There’s no other option.

So I guess that’s an argument in favor of the approach you chose.

>> Workflows could be compiled to derivations, which in turn could be “built”,
>> and their build result would be the workflow’s output file.
>>
>> I guess in practice it only works if users of the cluster can build
>> derivations on the cluster and have them scheduled on compute nodes.
>>
>> Thoughts?
>
> For building derivations, I think we need super user privileges, right?

Well guix-daemon needs to run as root, unless --disable-chroot is used.

> Why can't the scripts "just" output the environment variables required as
> @code{guix package --search-paths} provides, and then run the commands
> with the newly set environment?

Fundamentally, a derivation just describes a command, its arguments, its
dependencies, its outputs, and its environment variables.

So you’re right: you can very much run a derivation “by hand” instead of
letting the daemon do it on your behalf.  The only difference is that
you won’t have write access to the store.

Here’s an example:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,use(guix)
scheme@(guile-user)> ,use(gnu packages base)
scheme@(guile-user)> (define s (open-connection))
scheme@(guile-user)> (package-derivation s coreutils)
$4 = #<derivation /gnu/store/rmnb2x5vh9d9gdn1zb8q83hpyfnici18-coreutils-8.25.drv => /gnu/store/81pkzgzjwbnxfd5izgmgam8hfmjn20v8-coreutils-8.25-debug /gnu/store/apx87qb8g3f6x0gbx555qpnfm1wkdv4v-coreutils-8.25 5baea00>
scheme@(guile-user)> (derivation-builder $4)
$5 = "/gnu/store/ik15p8lrbk6jfa3fs3x34m78lj2c0ix1-guile-2.0.11/bin/guile"
scheme@(guile-user)> (derivation-builder-arguments $4)
$6 = ("--no-auto-compile" "-L" "/gnu/store/mn706n39l8z37w8wdqcm9v8pg6zcn33v-module-import" "/gnu/store/1a559p1yki9x1g676r8z0p3cf1f3pq7l-coreutils-8.25-guile-builder")
scheme@(guile-user)> (apply system* $5 $6)
[ Well, chdir to /tmp or something before trying it at home… ]
--8<---------------cut here---------------end--------------->8---

Ludo’.


* Re: Workflow management with GNU Guix
  2016-10-28 13:15         ` Ludovic Courtès
@ 2016-10-28 14:40           ` Roel Janssen
  2016-10-28 15:27             ` Ludovic Courtès
  0 siblings, 1 reply; 13+ messages in thread
From: Roel Janssen @ 2016-10-28 14:40 UTC
  To: Ludovic Courtès; +Cc: guix-devel


Ludovic Courtès writes:

> Hi!
>
> Roel Janssen <roel@gnu.org> skribis:
>
>> Ludovic Courtès writes:
>
> [...]
>
>>> IIUC, (guix workflows) from the tarball you sent executes workflows in
>>> the current environment, as opposed to creating a derivation that would
>>> actually perform the workflow.  What motivated this approach?
>>
>> The short answer:
>> Lack of time to implement it properly ;).
>>
>> The slightly longer answer:
>> I want to avoid storing results in the store, because we could be
>> analyzing files of 100GB or more that we do not want to copy into the
>> store, neither do we want to store the results of the run in the store.
>
> Good point!
>
>> I now realize we could only put the derivation in the store, and not the
>> build output itself..
>
> A derivation has to get its inputs from the store, and to write its
> output to the store.  There’s no other option.
>
> So I guess that’s an argument in favor of the approach you chose.

Can't a derivation write its output to some other place than the store?
Maybe by running it "by hand"?

>>> Workflows could be compiled to derivations, which in turn could be “built”,
>>> and their build result would be the workflow’s output file.
>>>
>>> I guess in practice it only works if users of the cluster can build
>>> derivations on the cluster and have them scheduled on compute nodes.
>>>
>>> Thoughts?
>>
>> For building derivations, I think we need super user privileges, right?
>
> Well guix-daemon needs to run as root, unless --disable-chroot is used.

Yeah ok..  But as long as the guix-daemon doesn't build any derivation it
doesn't need super user privileges ;).

>> Why can't the scripts "just" output the environment variables required as
>> @code{guix package --search-paths} provides, and then run the commands
>> with the newly set environment?
>
> Fundamentally, a derivation just describes a command, its arguments, its
> dependencies, its outputs, and its environment variables.
>
> So you’re right: you can very much run a derivation “by hand” instead of
> letting the daemon do it on your behalf.  The only difference is that
> you won’t have write access to the store.

And that's fine, because we don't want to write the output to the store :).

So, the workflow language should create a derivation, but then
guix-daemon should not execute the derivation, but instead, the workflow
execution engine can do it so it can avoid writing the output to the
store.. right?

Kind regards,
Roel Janssen


* Re: Workflow management with GNU Guix
  2016-10-28 14:40           ` Roel Janssen
@ 2016-10-28 15:27             ` Ludovic Courtès
  2016-10-28 17:25               ` Roel Janssen
  0 siblings, 1 reply; 13+ messages in thread
From: Ludovic Courtès @ 2016-10-28 15:27 UTC
  To: Roel Janssen; +Cc: guix-devel

Roel Janssen <roel@gnu.org> skribis:

> Ludovic Courtès writes:

[...]

>> So I guess that’s an argument in favor of the approach you chose.
>
> Can't a derivation write its output to some other place than the store?
> Maybe by running it "by hand"?

Yes, if you run it “by hand”, then you can tweak things as you see fit.

>> Fundamentally, a derivation just describes a command, its arguments, its
>> dependencies, its outputs, and its environment variables.
>>
>> So you’re right: you can very much run a derivation “by hand” instead of
>> letting the daemon do it on your behalf.  The only difference is that
>> you won’t have write access to the store.
>
> And that's fine, because we don't want to write the output to the store :).
>
> So, the workflow language should create a derivation, but then
> guix-daemon should not execute the derivation, but instead, the workflow
> execution engine can do it so it can avoid writing the output to the
> store.. right?

Right.  In addition to the snippet I gave, you’d need to set the
environment variables that are specified in the derivation.

For each output of the derivation, one environment variable is defined
that points to its /gnu/store/… file name.  So for instance, you’d also
need to do:

  (setenv "out" "/home/roel/something")

if you want to “redirect” the “out” output to a place that’s not its
normal place in the store.
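
Putting the pieces together, a minimal sketch could look as follows.  The two
helpers are plain Guile and their names are made up for illustration; the
commented-out call assumes `drv' is bound to a derivation and that the
(guix derivations) module is loaded, and it uses
`derivation-builder-environment-vars' to obtain the environment as an alist.

```scheme
(define (redirect-outputs env overrides)
  "Return ENV, an alist of (NAME . VALUE) pairs, where every entry whose
name also appears in OVERRIDES is replaced by the entry from OVERRIDES."
  (map (lambda (pair)
         (or (assoc (car pair) overrides) pair))
       env))

(define (run-builder builder arguments env)
  "Set the variables of ENV in the current process, then run BUILDER with
ARGUMENTS.  Return the status value of `system*'."
  ;; Note: this changes the environment of the calling process as well.
  (for-each (lambda (pair)
              (setenv (car pair) (cdr pair)))
            env)
  (apply system* builder arguments))

;; With Guix loaded and DRV bound to a derivation, this might be used as:
;;
;;   (run-builder (derivation-builder drv)
;;                (derivation-builder-arguments drv)
;;                (redirect-outputs
;;                 (derivation-builder-environment-vars drv)
;;                 '(("out" . "/home/roel/something"))))
```

Only the "out" entry is swapped here; a derivation with several outputs would
need one override per output name.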

With user namespaces, you could simply bind mount /home/roel/something
to /gnu/store/… in the process that runs the derivation builder, instead
of using the ‘setenv’ hack above.

Ludo’.


* Re: Workflow management with GNU Guix
  2016-10-28 15:27             ` Ludovic Courtès
@ 2016-10-28 17:25               ` Roel Janssen
  2016-10-29 20:56                 ` Ludovic Courtès
  0 siblings, 1 reply; 13+ messages in thread
From: Roel Janssen @ 2016-10-28 17:25 UTC
  To: Ludovic Courtès; +Cc: guix-devel


Ludovic Courtès writes:

> Roel Janssen <roel@gnu.org> skribis:
>
>> Ludovic Courtès writes:
>
> [...]
>
>>> So I guess that’s an argument in favor of the approach you chose.
>>
>> Can't a derivation write its output to some other place than the store?
>> Maybe by running it "by hand"?
>
> Yes, if you run it “by hand”, then you can tweak things as you see fit.
>
>>> Fundamentally, a derivation just describes a command, its arguments, its
>>> dependencies, its outputs, and its environment variables.
>>>
>>> So you’re right: you can very much run a derivation “by hand” instead of
>>> letting the daemon do it on your behalf.  The only difference is that
>>> you won’t have write access to the store.
>>
>> And that's fine, because we don't want to write the output to the store :).
>>
>> So, the workflow language should create a derivation, but then
>> guix-daemon should not execute the derivation, but instead, the workflow
>> execution engine can do it so it can avoid writing the output to the
>> store.. right?
>
> Right.  In addition to the snippet I gave, you’d need to set the
> environment variables that are specified in the derivation.
>
> For each output of the derivation, one environment variable is defined
> that points to its /gnu/store/… file name.  So for instance, you’d also
> need to do:
>
>   (setenv "out" "/home/roel/something")
>
> if you want to “redirect” the “out” output to a place that’s not its
> normal place in the store.
>
> With user namespaces, you could simply bind mount /home/roel/something
> to /gnu/store/… in the process that runs the derivation builder, instead
> of using the ‘setenv’ hack above.

Ideally, we would do the equivalent of @code{guix environment
--container --ad-hoc --pure <packages>}  and execute the programs inside
the environment.  Unfortunately, that requires super user privileges as
well (for good reasons!).

It would be great to build this in though.. just for those who want to
do things properly and have the luxury of doing so...

I'll try to implement this in the upcoming week(s) so we have something
to try out.

Kind regards,
Roel Janssen


* Re: Workflow management with GNU Guix
  2016-10-28 17:25               ` Roel Janssen
@ 2016-10-29 20:56                 ` Ludovic Courtès
  0 siblings, 0 replies; 13+ messages in thread
From: Ludovic Courtès @ 2016-10-29 20:56 UTC
  To: Roel Janssen; +Cc: guix-devel

Roel Janssen <roel@gnu.org> skribis:

> Ludovic Courtès writes:
>
>> Roel Janssen <roel@gnu.org> skribis:
>>
>>> Ludovic Courtès writes:
>>
>> [...]
>>
>>>> So I guess that’s an argument in favor of the approach you chose.
>>>
>>> Can't a derivation write its output to some other place than the store?
>>> Maybe by running it "by hand"?
>>
>> Yes, if you run it “by hand”, then you can tweak things as you see fit.
>>
>>>> Fundamentally, a derivation just describes a command, its arguments, its
>>>> dependencies, its outputs, and its environment variables.
>>>>
>>>> So you’re right: you can very much run a derivation “by hand” instead of
>>>> letting the daemon do it on your behalf.  The only difference is that
>>>> you won’t have write access to the store.
>>>
>>> And that's fine, because we don't want to write the output to the store :).
>>>
>>> So, the workflow language should create a derivation, but then
>>> guix-daemon should not execute the derivation, but instead, the workflow
>>> execution engine can do it so it can avoid writing the output to the
>>> store.. right?
>>
>> Right.  In addition to the snippet I gave, you’d need to set the
>> environment variables that are specified in the derivation.
>>
>> For each output of the derivation, one environment variable is defined
>> that points to its /gnu/store/… file name.  So for instance, you’d also
>> need to do:
>>
>>   (setenv "out" "/home/roel/something")
>>
>> if you want to “redirect” the “out” output to a place that’s not its
>> normal place in the store.
>>
>> With user namespaces, you could simply bind mount /home/roel/something
>> to /gnu/store/… in the process that runs the derivation builder, instead
>> of using the ‘setenv’ hack above.
>
> Ideally, we would do the equivalent of @code{guix environment
> --container --ad-hoc --pure <packages>}  and execute the programs inside
> the environment.  Unfortunately, that requires super user privileges as
> well (for good reasons!).

… if user namespaces are disabled, but yeah.

> It would be great to build this in though.. just for those who want to
> do things properly and have the luxury of doing so...
>
> I'll try to implement this in the upcoming week(s) so we have something
> to try out.

Cool!  Check out ‘call-with-container’.  Essentially you need to tell it
to bind-mount all of (derivation-inputs drv), where drv is the
derivation you want to build.

Ludo’.


[parent not found: <idjfutnih58.fsf@bimsb-sys02.mdc-berlin.net>]

* Re: Workflow management with GNU Guix
       [not found] ` <idjfutnih58.fsf@bimsb-sys02.mdc-berlin.net>
@ 2016-05-16 12:22   ` Ricardo Wurmus
  2016-06-14  9:16     ` Roel Janssen
  0 siblings, 1 reply; 13+ messages in thread
From: Ricardo Wurmus @ 2016-05-16 12:22 UTC
  To: Roel Janssen; +Cc: guix-devel

(Resending this as it could not be delivered.)

Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes:

> Hi Roel,
>
>> With GNU Guix we are able to install programs to our machines with an amazing
>> level of control over the dependency graph of the programs.  We can now know
>> what code will run when we invoke a program.  We can now know what the impact
>> of an upgrade will be.  And we can now safely roll-back to previous states.
>>
>> What seems to be a common practice in research involving data analysis, is
>> running multiple programs in a chain to transform data from raw to specific. 
>> This is often referred to as a "pipeline" or a "workflow".  Because data sets
>> can be quite large in comparison to the computing power of our laptops, the
>> data analysis is performed on computing clusters instead of single machines.
>>
>> The usage of a pipeline/workflow is somewhat different from the package
>> construction, because we want to run the sequence of commands on different data
>> sets (as opposed to running it on the same source code).  Plus, I would like to
>> integrate it with existing computing clusters that have a job scheduling system
>> in place.  
>>
>> The reason I think this should be possible with Guix is that it has
>> everything in place to do software deployment and run-time isolation
>> (containers).  From there it is a small step to executing programs in an
>> automated way.
>>
>> So, I would like to propose a new Guix subcommand and an extension to
>> the package management language to add workflow management features.
>
> I probably don’t understand your idea well enough, but from what I
> understand it doesn’t really have much to do with packages (other than
> using them) and store manipulation per se (produced artifacts are not
> added to the store).  Exactly what features of Guix do you want to build
> on?
>
> My perspective on pipelines is that they should be developed like any
> other software package, treating individual tools as you would treat
> libraries.  This means that a pipeline would have a configuration step
> in which it checks for the paths of all tools it needs internally, and
> then use the full paths rather than assume all tools to be in a
> directory listed in the PATH variable.
>
> Distributing jobs to clusters would be the responsibility of the
> pipeline, e.g. by using DRMMA, which supports several resource
> management backends and has bindings for a wide range of programming
> languages.
>
>> Would this be a feature you are interested in adding to GNU Guix?
>
> Even if it wasn’t part of Guix itself, you could develop it separately
> and still add it as a Guix command, much like it is currently done for
> “guix web” (which I think should eventually be part of Guix).
>
>> I'm currently working on a proof-of-concept implementation that has three
>> record types/levels of abstraction:
>> <workflow>:  Describes which <process>es should be run, and concerns itself with
>>              the order of execution.
>>
>> <process>:   Describes what packages are needed to run the programs involved,
>>              and its relationship to other processes.  Processes take input and
>>              generate output much like the package construction process.
>>
>> <script>:    Short and simple imperative instructions to perform a task. They are
>>              part of a <process>.  Currently, my implementation generates a shell
>>              script that can be either Guile, Sh, Perl or Python.
>
> From that list it seems as if the only link to Guix is ensuring the
> environment contains required programs.  This can be done right now with
> the help of manifests and profiles.
>
> I wonder if maybe we could add Guix as a package management backend to
> existing workflow specification systems (instead of the curiously
> popular and IMO barely adequate Conda, for example).
>
>> The subcommand I envision is:
>>   guix workflow
>>
>> With primarily:
>>   guix workflow --run=<name-of-workflow-definition>
>>
>> If you are interested in adding any form of workflow management to GNU Guix, I
>> can elaborate on my proof-of-concept implementation, so we can work from there.
>> (or throw everything out of the window and start from scratch ;-))
>
> Could you show us an example workflow?
>
> ~~ Ricardo

* Re: Workflow management with GNU Guix
  2016-05-16 12:22   ` Ricardo Wurmus
@ 2016-06-14  9:16     ` Roel Janssen
  0 siblings, 0 replies; 13+ messages in thread
From: Roel Janssen @ 2016-06-14  9:16 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: guix-devel

Hello all,

Thank you for your replies.  I will reply inline to Ricardo's response.

Ricardo Wurmus writes:

> (Resending this as it could not be delivered.)
>
> Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes:
>
>> Hi Roel,
>>
>>> With GNU Guix we are able to install programs to our machines with an amazing
>>> level of control over the dependency graph of the programs.  We can now know
>>> what code will run when we invoke a program.  We can now know what the impact
>>> of an upgrade will be.  And we can now safely roll-back to previous states.
>>>
>>> What seems to be a common practice in research involving data analysis, is
>>> running multiple programs in a chain to transform data from raw to specific. 
>>> This is often referred to as a "pipeline" or a "workflow".  Because data sets
>>> can be quite large in comparison to the computing power of our laptops, the
>>> data analysis is performed on computing clusters instead of single machines.
>>>
>>> The usage of a pipeline/workflow is somewhat different from the package
>>> construction, because we want to run the sequence of commands on different data
>>> sets (as opposed to running it on the same source code).  Plus, I would like to
>>> integrate it with existing computing clusters that have a job scheduling system
>>> in place.  
>>>
>>> The reason I think this should be possible with Guix is that it has
>>> everything in place to do software deployment and run-time isolation
>>> (containers).  From there it is a small step to executing programs in an
>>> automated way.
>>>
>>> So, I would like to propose a new Guix subcommand and an extension to
>>> the package management language to add workflow management features.
>>
>> I probably don’t understand your idea well enough, but from what I
>> understand it doesn’t really have much to do with packages (other than
>> using them) and store manipulation per se (produced artifacts are not
>> added to the store).  Exactly what features of Guix do you want to build
>> on?

I would like to build on the language used to express packages.  What's
nice about package recipes is that they are understandable, they are
shareable (just copy and paste the recipe), and a reproducible output
can be produced from them.

A package recipe describes its entire dependency graph because the
symbols in the inputs are turned into specific versions of the external
packages.  This is a very powerful feature for specifying how to run
things.

>> My perspective on pipelines is that they should be developed like any
>> other software package, treating individual tools as you would treat
>> libraries.  This means that a pipeline would have a configuration step
>> in which it checks for the paths of all tools it needs internally, and
>> then use the full paths rather than assume all tools to be in a
>> directory listed in the PATH variable.

If we used Guix package recipes to describe tools, we wouldn't need to
search for them.  We could just set up a profile with these tools and
set the environment variables suggested by Guix accordingly.

This way we can generate the exact dependency graph of a pipeline,
leaving no ambiguity about the run-time environment.
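
As a rough illustration of the profile idea, a manifest along these
lines could list the tools of a pipeline.  This is only a sketch: the
file name is made up, and `samtools' and `bwa' merely stand in for
whatever tools a real pipeline needs.

```scheme
;; Sketch of a manifest file, e.g. "pipeline-tools.scm" (hypothetical name).
;; The two packages below are placeholders for a pipeline's actual tools.
(use-modules (gnu packages bioinformatics)
             (guix profiles))

;; Evaluates to a manifest that Guix can turn into a profile.
(packages->manifest (list samtools bwa))
```

Such a file could then be installed into a dedicated profile with
`guix package --manifest=pipeline-tools.scm --profile=<profile>', after
which `guix package --search-paths --profile=<profile>' prints the
environment variables to set.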

>> Distributing jobs to clusters would be the responsibility of the
>> pipeline, e.g. by using DRMAA, which supports several resource
>> management backends and has bindings for a wide range of programming
>> languages.

Wouldn't it be easier to write a pipeline in a language that has the
infrastructure to uniquely describe and deploy a program and its
dependencies?  You don't need to search for available tools; you can
just install them.  If they are already available, installing them is a
matter of creating a couple of symbolic links.

Here is a translation of a "real-world" process definition to my
<process> record type from one of the pipelines I studied.  It isn't a
perfect example because it uses a package that isn't in Guix.  Anyway:

===
(define (rnaseq-fastq-quality-control in out)
  (process
    (name "rnaseq-fastq-quality-control")
    (version "1.0")
    (environment
     `(("fastqc" ,fastqc-bin-0.11.4)))
    (input in)
    (output (string-append out "/" name))
    (procedure
     (script
      (interpreter 'guile)
      (source
        (let ((sample-files (find-files in #:directories? #f)))
         `(begin
            ;; Create output directories.
            (unless (access? ,out F_OK) (mkdir ,out))
            (unless (access? ,output F_OK) (mkdir ,output))
            ;; Perform the analysis step.
            (for-each (lambda (file)
                        (when (string-suffix? ".fastq.gz" file)
                          (system* "fastqc" "-q" file "-o" ,output)))
                      ',sample-files))))))
    (synopsis "Generate quality control reports for FastQ files")
    (description "This process generates a quality control report
for each FastQ file found under the input directory.")))
===

The resulting expression in `source' can be executed with Guile in any
place on a computing cluster (as long as the files are accessible at the
same location on other machines).
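
To illustrate how a quasiquoted expression like the one in `source' can
end up as a standalone script, here is a minimal sketch.  It is not
taken from the proof-of-concept: `write-script' and the paths under
/tmp are hypothetical names chosen for the example.

```scheme
;; Sketch only: `write-script' and the /tmp paths are hypothetical.
(define (write-script expression file)
  "Write EXPRESSION, an s-expression, to FILE so that it can be run
later with `guile FILE'."
  (call-with-output-file file
    (lambda (port)
      (write expression port)
      (newline port))))

(define output "/tmp/example-output")

;; The unquoted value is filled in now; the rest is evaluated only when
;; the generated script runs.
(define source
  `(begin
     (unless (access? ,output F_OK) (mkdir ,output))
     (display "output directory is ready\n")))

(write-script source "/tmp/example-script.scm")
;; On any machine that sees the same paths:
;;   guile /tmp/example-script.scm
```

Because the generated file contains only literal values and calls to
core Guile procedures, it needs nothing from the code that produced it.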

This snippet can be copy-pasted elsewhere and be included in another
pipeline without adjusting what job distribution system should be used.
We can deal with that on the "workflow" level instead of the "process"
level.

I left the option open to use other scripting languages, but we could
make it a bit more compact if we only supported Guile.

>>> Would this be a feature you are interested in adding to GNU Guix?
>>
>> Even if it wasn’t part of Guix itself, you could develop it separately
>> and still add it as a Guix command, much like it is currently done for
>> “guix web” (which I think should eventually be part of Guix).

That may be a good idea.

>>> I'm currently working on a proof-of-concept implementation that has three
>>> record types/levels of abstraction:
>>> <workflow>:  Describes which <process>es should be run, and concerns itself with
>>>              the order of execution.
>>>
>>> <process>:   Describes what packages are needed to run the programs involved,
>>>              and its relationship to other processes.  Processes take input and
>>>              generate output much like the package construction process.
>>>
>>> <script>:    Short and simple imperative instructions to perform a task. They are
>>>              part of a <process>.  Currently, my implementation generates a shell
>>>              script that can be either Guile, Sh, Perl or Python.
>>
>> From that list it seems as if the only link to Guix is ensuring the
>> environment contains required programs.  This can be done right now with
>> the help of manifests and profiles.
>>
>> I wonder if maybe we could add Guix as a package management backend to
>> existing workflow specification systems (instead of the curiously
>> popular and IMO barely adequate Conda, for example).

That is an option too.  However, workflow specification systems overlap
with Guix in how they describe tools.  Take, for example, the Common
Workflow Language (CWL):
  http://www.commonwl.org/draft-3/CommandLineTool.html#CommandLineTool

The `requirements' field is the equivalent of `inputs' and
`propagated-inputs' in Guix.

With Guix, we could describe a command-line tool by referring to the
package recipe, and then write the command to run.

>>> The subcommand I envision is:
>>>   guix workflow
>>>
>>> With primarily:
>>>   guix workflow --run=<name-of-workflow-definition>
>>>
>>> If you are interested in adding any form of workflow management to GNU Guix, I
>>> can elaborate on my proof-of-concept implementation, so we can work from there.
>>> (or throw everything out of the window and start from scratch ;-))
>>
>> Could you show us an example workflow?

So, the <process>es look like the snippet provided above.  Then the
workflow itself looks like:

===
(define (rnaseq-pipeline in out)
  (workflow
   (name "rnaseq-pipeline")
   (version "1.0")
   (input in)
   (output (string-append
            out "/" name "-" (date->string (current-date) "~Y-~m-~d")))
   (processes
    '(rnaseq-initialize
      rnaseq-fastq-quality-control
      rnaseq-align
      rnaseq-add-read-groups
      rnaseq-index
      rnaseq-feature-readcount
      rnaseq-collect-alignment-metrics
      rnaseq-merge-read-features
      rnaseq-compute-rpkm-values
      rnaseq-normalize-read-counts
      rnaseq-differential-expression))
   (restrictions
    `((,rnaseq-fastq-quality-control ,rnaseq-initialize)
      (,rnaseq-align ,rnaseq-initialize)
      (,rnaseq-add-read-groups ,rnaseq-align)
      (,rnaseq-index ,rnaseq-add-read-groups)
      (,rnaseq-collect-alignment-metrics ,rnaseq-index)
      (,rnaseq-feature-readcount ,rnaseq-index)
      (,rnaseq-merge-read-features ,rnaseq-feature-readcount)
      (,rnaseq-compute-rpkm-values ,rnaseq-merge-read-features)
      (,rnaseq-normalize-read-counts ,rnaseq-merge-read-features)
      (,rnaseq-differential-expression ,rnaseq-merge-read-features)))
   (synopsis "RNA sequencing pipeline used at the UMCU")
   (description "The RNAseq pipeline can do quality control on FastQ and BAM
files; align reads against a reference genome; count reads in features;
normalize read counts; calculate RPKMs and perform DE analysis of standard
designs.")))
===

The `restrictions' are dependency pairs (A B) where A depends on the
successful completion of B.  From this, the execution order can be
determined.  
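
One way to derive such an order from the (A B) pairs is a simple
topological sort.  The sketch below is an assumption about how this
could be done, not the code of the proof-of-concept: `execution-order'
is a hypothetical helper, and processes are represented as plain
symbols to keep the example self-contained.

```scheme
;; Sketch only: `execution-order' is a hypothetical helper.
(use-modules (srfi srfi-1))

(define (execution-order processes restrictions)
  "Return PROCESSES ordered so that every process comes after the
processes it depends on.  RESTRICTIONS is a list of (A B) pairs meaning
that A depends on B.  Raise an error when no valid order exists."
  (let loop ((pending processes)
             (done '()))                ;finished processes, newest first
    (if (null? pending)
        (reverse done)
        ;; A process is ready once all of its dependencies are done.
        (let ((ready (filter
                      (lambda (process)
                        (every (lambda (pair)
                                 (or (not (eq? (first pair) process))
                                     (memq (second pair) done)))
                               restrictions))
                      pending)))
          (when (null? ready)
            (error "unsatisfiable restrictions among" pending))
          (loop (remove (lambda (process) (memq process ready)) pending)
                (append (reverse ready) done))))))

(execution-order
 '(index align initialize quality-control)
 '((quality-control initialize)
   (align initialize)
   (index align)))
;; => (initialize align quality-control index)
```

Processes that become ready in the same round (here `align' and
`quality-control') do not depend on each other, so a job scheduler
could run them in parallel.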

Thank you all for your time.

Kind regards,
Roel Janssen

Thread overview: 13+ messages
2016-05-12  8:43 Workflow management with GNU Guix Roel Janssen
2016-05-12 11:41 ` Taylan Ulrich Bayırlı/Kammer
2016-05-12 16:06 ` Ludovic Courtès
2016-10-25 13:28   ` Roel Janssen
2016-10-26 12:41     ` Ludovic Courtès
2016-10-26 13:41       ` Roel Janssen
2016-10-28 13:15         ` Ludovic Courtès
2016-10-28 14:40           ` Roel Janssen
2016-10-28 15:27             ` Ludovic Courtès
2016-10-28 17:25               ` Roel Janssen
2016-10-29 20:56                 ` Ludovic Courtès
     [not found] ` <idjfutnih58.fsf@bimsb-sys02.mdc-berlin.net>
2016-05-16 12:22   ` Ricardo Wurmus
2016-06-14  9:16     ` Roel Janssen
