* Workflow management with GNU Guix
@ 2016-05-12  8:43 Roel Janssen
  2016-05-12 11:41 ` Taylan Ulrich Bayırlı/Kammer
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Roel Janssen @ 2016-05-12  8:43 UTC
To: guix-devel

Dear Guix,

With GNU Guix we are able to install programs on our machines with an
amazing level of control over the dependency graph of the programs.  We can
now know what code will run when we invoke a program.  We can now know what
the impact of an upgrade will be.  And we can now safely roll back to
previous states.

What seems to be a common practice in research involving data analysis is
running multiple programs in a chain to transform data from raw to specific.
This is often referred to as a "pipeline" or a "workflow".  Because data sets
can be quite large in comparison to the computing power of our laptops, the
data analysis is performed on computing clusters instead of single machines.

The usage of a pipeline/workflow is somewhat different from package
construction, because we want to run the sequence of commands on different
data sets (as opposed to running it on the same source code).  Plus, I would
like to integrate it with existing computing clusters that have a job
scheduling system in place.

The reason I think this should be possible with Guix is that it has
everything in place to do software deployment and run-time isolation
(containers).  From there it is a small step to executing programs in an
automated way.

So, I would like to propose a new Guix subcommand and an extension to
the package management language to add workflow management features.

Would this be a feature you are interested in adding to GNU Guix?

I'm currently working on a proof-of-concept implementation that has three
record types/levels of abstraction:

<workflow>: Describes which <process>es should be run, and concerns itself
            with the order of execution.

<process>:  Describes what packages are needed to run the programs involved,
            and its relationship to other processes.  Processes take input
            and generate output much like the package construction process.

<script>:   Short and simple imperative instructions to perform a task.  They
            are part of a <process>.  Currently, my implementation generates
            a shell script that can be either Guile, Sh, Perl or Python.

The subcommand I envision is:
  guix workflow

With primarily:
  guix workflow --run=<name-of-workflow-definition>

If you are interested in adding any form of workflow management to GNU Guix,
I can elaborate on my proof-of-concept implementation, so we can work from
there.  (or throw everything out of the window and start from scratch ;-))

Thanks again for your time.

Kind regards,
Roel Janssen
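The three levels of abstraction described in the message above can be
sketched in plain Guile.  This is only an illustration under stated
assumptions, not the proof-of-concept code: the record fields, the
`restrictions' alist and the `workflow-run-order' procedure are all
hypothetical names chosen for this sketch.

```scheme
(use-modules (srfi srfi-1)    ; filter, every
             (srfi srfi-9))   ; define-record-type

;; Hypothetical record types; the field names are assumptions.
(define-record-type <script>
  (make-script interpreter source)
  script?
  (interpreter script-interpreter)   ; e.g. 'guile, 'sh, 'perl or 'python
  (source      script-source))       ; the instructions to run

(define-record-type <process>
  (make-process name packages procedure)
  process?
  (name      process-name)
  (packages  process-packages)       ; names of the packages needed at run time
  (procedure process-procedure))     ; a <script>

(define-record-type <workflow>
  (make-workflow name processes restrictions)
  workflow?
  (name         workflow-name)
  (processes    workflow-processes)      ; list of <process>
  (restrictions workflow-restrictions))  ; alist: process -> processes it needs

(define (workflow-run-order workflow)
  "Return the processes of WORKFLOW ordered so that each process comes after
the processes it depends on.  Raise an error on a dependency cycle."
  (define (dependencies process)
    (or (assq-ref (workflow-restrictions workflow) process) '()))
  (let loop ((pending (workflow-processes workflow))
             (done    '()))
    (if (null? pending)
        (reverse done)
        (let ((ready (filter (lambda (p)
                               (every (lambda (d) (memq d done))
                                      (dependencies p)))
                             pending)))
          (when (null? ready)
            (error "dependency cycle in workflow" (workflow-name workflow)))
          (loop (filter (lambda (p) (not (memq p ready))) pending)
                (append (reverse ready) done))))))

;; Usage: "align" must run after "quality-control".
(define qc
  (make-process "quality-control" '("fastqc")
                (make-script 'sh "fastqc -q sample.fastq.gz")))
(define align
  (make-process "align" '("bwa")
                (make-script 'sh "bwa mem ref.fa sample.fastq.gz")))
(define example
  (make-workflow "example" (list align qc)
                 (list (cons align (list qc)))))

(map process-name (workflow-run-order example))
;; => ("quality-control" "align")
```

Keeping the ordering in the <workflow> rather than in each <process> means a
process can be reused in another workflow unchanged.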
* Re: Workflow management with GNU Guix
From: Taylan Ulrich Bayırlı/Kammer @ 2016-05-12 11:41 UTC
To: Roel Janssen; +Cc: guix-devel

Roel Janssen <roel@gnu.org> writes:

> The usage of a pipeline/workflow is somewhat different from the
> package construction, because we want to run the sequence of commands
> on different data sets (as opposed to running it on the same source
> code).

Is this not conceptually the same thing as changing the 'source' field
of a package recipe?  With the new package transformation feature[0],
this can be done "on the fly" like:

    guix build emacs --with-source=emacs-25.1-alpha.tar.xz

Maybe a "process" can just be a build phase, and a "workflow" a build
system, as they currently exist in Guix.  Not sure what a "script" would
be, though build phases can easily execute shell commands, scripts, and
so on within the build directory.

That means one could write a "package recipe" that doesn't really build
a package from source code, but rather creates arbitrary output files
from arbitrary input files.  (Same thing to Guix anyway.)  The 'source'
field of the recipe would contain some dummy value, and one would
specify the real input like:

    guix build processed-data --with-source=raw-data-2016-05-12.txt

So maybe Guix already has everything you need? :-)

Not sure if I fully understand the problem domain though, so apologies
if I'm missing the point.

Taylan

[0] https://lists.gnu.org/archive/html/guix-devel/2016-02/msg00001.html
* Re: Workflow management with GNU Guix
From: Ludovic Courtès @ 2016-05-12 16:06 UTC
To: Roel Janssen; +Cc: guix-devel

Hello Guix!

Roel Janssen <roel@gnu.org> skribis:

> So, I would like to propose a new Guix subcommand and an extension to
> the package management language to add workflow management features.
>
> Would this be a feature you are interested in adding to GNU Guix?

I don’t know if it should be in Guix itself (and it’s probably too early
to think about it), but there’s definitely interest in it!

Pjotr mentioned it before, and Ricardo started a thread on this topic on
help-guix in February¹, where we discussed something similar to what you
proposed.  I agree with you that Guix should be a nice tool for the job.

¹ https://lists.gnu.org/archive/html/help-guix/2016-02/msg00019.html

> I'm currently working on a proof-of-concept implementation that has three
> record types/levels of abstraction:
> <workflow>: Describes which <process>es should be run, and concerns itself with
>             the order of execution.
>
> <process>:  Describes what packages are needed to run the programs involved,
>             and its relationship to other processes.  Processes take input and
>             generate output much like the package construction process.
>
> <script>:   Short and simple imperative instructions to perform a task.  They are
>             part of a <process>.  Currently, my implementation generates a shell
>             script that can be either Guile, Sh, Perl or Python.

In the previous discussion, I thought that a gexp would be enough to
write a derivation that implements a workflow.  That is, basically you’d
write:

  (define (my-workflow input)
    (gexp->derivation "result"
                      #~(process-the-thing #$input #$output)))

Maybe it’s all it takes to represent a workflow?  Or maybe my idea of
what workflows look like is too naive.

> The subcommand I envision is:
>   guix workflow
>
> With primarily:
>   guix workflow --run=<name-of-workflow-definition>
>
> If you are interested in adding any form of workflow management to GNU Guix, I
> can elaborate on my proof-of-concept implementation, so we can work from there.
> (or throw everything out of the window and start from scratch ;-))

I’m interested in seeing what it’s like, and examples of it!

Thanks,
Ludo’.
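To make the gexp idea above concrete, here is a minimal sketch of a one-step
"workflow" expressed as a derivation.  It assumes a working Guix installation
with a running guix-daemon; `count-lines-workflow' and `run-workflow' are
hypothetical names, and the line counting merely stands in for
`process-the-thing'.  Note that the input file is copied into the store and
the result is written to the store.

```scheme
(use-modules (guix gexp)          ; gexp->derivation, local-file, #~ and #$
             (guix store)         ; with-store, run-with-store, %store-monad
             (guix monads)        ; mlet, mbegin, return
             (guix derivations))  ; built-derivations, derivation->output-path

(define (count-lines-workflow input)
  "Return a monadic derivation that writes the number of lines of INPUT, a
file-like object, to its output."
  (gexp->derivation "line-count"
    #~(begin
        (use-modules (ice-9 rdelim))
        (call-with-output-file #$output
          (lambda (port)
            (call-with-input-file #$input
              (lambda (in)
                (let loop ((count 0))
                  (if (eof-object? (read-line in))
                      (write count port)
                      (loop (+ count 1)))))))))))

(define (run-workflow file-name)
  "Build the derivation for FILE-NAME, an absolute file name, and return the
store file name holding the result."
  (with-store store
    (run-with-store store
      (mlet %store-monad ((drv (count-lines-workflow (local-file file-name))))
        (mbegin %store-monad
          (built-derivations (list drv))
          (return (derivation->output-path drv)))))))
```

Building the same input twice reuses the cached result, which is attractive
for small files but means every data set ends up in the store.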
* Re: Workflow management with GNU Guix
From: Roel Janssen @ 2016-10-25 13:28 UTC
To: Ludovic Courtès; +Cc: guix-devel

[-- Attachment #1: gwl.tar.gz --]
[-- Type: application/gzip, Size: 6419 bytes --]

[-- Attachment #2: workflow-language.pdf --]
[-- Type: application/pdf, Size: 664813 bytes --]

[-- Attachment #3: Type: text/plain, Size: 2725 bytes --]

Ludovic Courtès writes:

[...]

>> If you are interested in adding any form of workflow management to GNU Guix, I
>> can elaborate on my proof-of-concept implementation, so we can work from there.
>> (or throw everything out of the window and start from scratch ;-))
>
> I’m interested in seeing what it’s like, and examples of it!

I realize I never shared my proof-of-concept implementation.  I attached
my motivations for having a workflow language in Guix, and my code.

The subcommand "guix workflow" does not work (yet) here.  I currently
execute a workflow directly from the REPL.

A final point to note is that I would like to do a second attempt at
designing the workflow language, changing the way we can execute
programs.

Kind regards,
Roel Janssen
* Re: Workflow management with GNU Guix
From: Ludovic Courtès @ 2016-10-26 12:41 UTC
To: Roel Janssen; +Cc: guix-devel

Roel Janssen <roel@gnu.org> skribis:

> I realize I never shared my proof-of-concept implementation.  I attached
> my motivations for having a workflow language in Guix, and my code.

Nice work, thanks for sharing!

> The subcommand "guix workflow" does not work (yet) here.  I currently
> execute a workflow directly from the REPL.
>
> A final point to note is that I would like to do a second attempt at
> designing the workflow language, changing the way we can execute
> programs.

IIUC, (guix workflows) from the tarball you sent executes workflows in
the current environment, as opposed to creating a derivation that would
actually perform the workflow.  What motivated this approach?

Workflows could be compiled to derivations, which in turn could be “built”,
and their build result would be the workflow’s output file.

I guess in practice it only works if users of the cluster can build
derivations on the cluster and have them scheduled on compute nodes.

Thoughts?

Thank you!
Ludo’.
* Re: Workflow management with GNU Guix
From: Roel Janssen @ 2016-10-26 13:41 UTC
To: Ludovic Courtès; +Cc: guix-devel

Ludovic Courtès writes:

[...]

> IIUC, (guix workflows) from the tarball you sent executes workflows in
> the current environment, as opposed to creating a derivation that would
> actually perform the workflow.  What motivated this approach?

The short answer:
Lack of time to implement it properly ;).

The slightly longer answer:
I want to avoid storing results in the store, because we could be
analyzing files of 100GB or more that we do not want to copy into the
store, nor do we want to store the results of the run in the store.

I now realize we could put only the derivation in the store, and not the
build output itself..

> Workflows could compiled to derivations, which in turn could be “built”,
> and their build result would be the workflow’s output file.
>
> I guess in practice it only works if users of the cluster can build
> derivations on the cluster and have them scheduled on compute nodes.
>
> Thoughts?

For building derivations, I think we need super user privileges, right?

Why can't the scripts "just" output the environment variables required as
@code{guix package --search-paths} provides, and then run the commands
with the newly set environment?

Kind regards,
Roel Janssen
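The "set the environment, then run the command" idea from the message above
can be sketched in plain Guile.  The sketch assumes the search-path
definitions arrive as lines of the form export NAME="VALUE";
`parse-export-line' and `run-with-search-paths' are hypothetical helpers, not
part of Guix.

```scheme
(use-modules (ice-9 regex))   ; string-match, match:substring

(define (parse-export-line line)
  "Parse LINE of the form export NAME=\"VALUE\" and return (NAME . VALUE),
or #f when LINE does not have that form."
  (let ((m (string-match "^export ([A-Za-z_][A-Za-z0-9_]*)=\"(.*)\"$" line)))
    (and m
         (cons (match:substring m 1) (match:substring m 2)))))

(define (run-with-search-paths export-lines program . arguments)
  "Set every variable described in EXPORT-LINES in the current process, then
run PROGRAM with ARGUMENTS and return its exit status as given by `system*'."
  (for-each (lambda (line)
              (let ((binding (parse-export-line line)))
                (when binding
                  (setenv (car binding) (cdr binding)))))
            export-lines)
  (apply system* program arguments))

;; Usage:
(parse-export-line "export PATH=\"/tmp/profile/bin\"")
;; => ("PATH" . "/tmp/profile/bin")
```

Because `setenv' changes the environment of the calling process, a workflow
engine would probably want to do this in a child process per step.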
* Re: Workflow management with GNU Guix
From: Ludovic Courtès @ 2016-10-28 13:15 UTC
To: Roel Janssen; +Cc: guix-devel

Hi!

Roel Janssen <roel@gnu.org> skribis:

> Ludovic Courtès writes:

[...]

>> IIUC, (guix workflows) from the tarball you sent executes workflows in
>> the current environment, as opposed to creating a derivation that would
>> actually perform the workflow.  What motivated this approach?
>
> The short answer:
> Lack of time to implement it properly ;).
>
> The slightly longer answer:
> I want to avoid storing results in the store, because we could be
> analyzing files of 100GB or more that we do not want to copy into the
> store, neither do we want to store the results of the run in the store.

Good point!

> I now realize we could only put the derivation in the store, and not the
> build output itself..

A derivation has to get its inputs from the store, and to write its
output to the store.  There’s no other option.

So I guess that’s an argument in favor of the approach you chose.

>> Workflows could compiled to derivations, which in turn could be “built”,
>> and their build result would be the workflow’s output file.
>>
>> I guess in practice it only works if users of the cluster can build
>> derivations on the cluster and have them scheduled on compute nodes.
>>
>> Thoughts?
>
> For building derivations, I think we need super user privileges, right?

Well guix-daemon needs to run as root, unless --disable-chroot is used.

> Why can't the scripts "just" output the environment variables required as
> @code{guix package --search-paths} provides, and then run the commands
> with the newly set environment?

Fundamentally, a derivation just describes a command, its arguments, its
dependencies, its outputs, and its environment variables.

So you’re right: you can very much run a derivation “by hand” instead of
letting the daemon do it on your behalf.  The only difference is that
you won’t have write access to the store.

Here’s an example:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,use(guix)
scheme@(guile-user)> ,use(gnu packages base)
scheme@(guile-user)> (define s (open-connection))
scheme@(guile-user)> (package-derivation s coreutils)
$4 = #<derivation /gnu/store/rmnb2x5vh9d9gdn1zb8q83hpyfnici18-coreutils-8.25.drv => /gnu/store/81pkzgzjwbnxfd5izgmgam8hfmjn20v8-coreutils-8.25-debug /gnu/store/apx87qb8g3f6x0gbx555qpnfm1wkdv4v-coreutils-8.25 5baea00>
scheme@(guile-user)> (derivation-builder $4)
$5 = "/gnu/store/ik15p8lrbk6jfa3fs3x34m78lj2c0ix1-guile-2.0.11/bin/guile"
scheme@(guile-user)> (derivation-builder-arguments $4)
$6 = ("--no-auto-compile" "-L" "/gnu/store/mn706n39l8z37w8wdqcm9v8pg6zcn33v-module-import" "/gnu/store/1a559p1yki9x1g676r8z0p3cf1f3pq7l-coreutils-8.25-guile-builder")
scheme@(guile-user)> (apply system* $5 $6)
[ Well, chdir to /tmp or something before trying it at home… ]
--8<---------------cut here---------------end--------------->8---

Ludo’.
* Re: Workflow management with GNU Guix
From: Roel Janssen @ 2016-10-28 14:40 UTC
To: Ludovic Courtès; +Cc: guix-devel

Ludovic Courtès writes:

[...]

>> I now realize we could only put the derivation in the store, and not the
>> build output itself..
>
> A derivation has to get its inputs from the store, and to write its
> output to the store.  There’s no other option.
>
> So I guess that’s an argument in favor of the approach you chose.

Can't a derivation write its output to some other place than the store?
Maybe by running it "by hand"?

>> For building derivations, I think we need super user privileges, right?
>
> Well guix-daemon needs to run as root, unless --disable-chroot is used.

Yeah ok..  But as long as the guix-daemon doesn't build any derivation
it doesn't need super user privileges ;).

>> Why can't the scripts "just" output the environment variables required as
>> @code{guix package --search-paths} provides, and then run the commands
>> with the newly set environment?
>
> Fundamentally, a derivation just describes a command, its arguments, its
> dependencies, its outputs, and its environment variables.
>
> So you’re right: you can very much run a derivation “by hand” instead of
> letting the daemon do it on your behalf.  The only difference is that
> you won’t have write access to the store.

And that's fine, because we don't want to write the output to the store :).

So, the workflow language should create a derivation, but then
guix-daemon should not execute the derivation, but instead, the workflow
execution engine can do it so it can avoid writing the output to the
store.. right?

Kind regards,
Roel Janssen
* Re: Workflow management with GNU Guix
From: Ludovic Courtès @ 2016-10-28 15:27 UTC
To: Roel Janssen; +Cc: guix-devel

Roel Janssen <roel@gnu.org> skribis:

> Ludovic Courtès writes:

[...]

>> So I guess that’s an argument in favor of the approach you chose.
>
> Can't a derivation write its output to some other place than the store?
> Maybe by running it "by hand"?

Yes, if you run it “by hand”, then you can tweak things as you see fit.

>> Fundamentally, a derivation just describes a command, its arguments, its
>> dependencies, its outputs, and its environment variables.
>>
>> So you’re right: you can very much run a derivation “by hand” instead of
>> letting the daemon do it on your behalf.  The only difference is that
>> you won’t have write access to the store.
>
> And that's fine, because we don't want to write the output to the store :).
>
> So, the workflow language should create a derivation, but then
> guix-daemon should not execute the derivation, but instead, the workflow
> execution engine can do it so it can avoid writing the output to the
> store.. right?

Right.  In addition to the snippet I gave, you’d need to set the
environment variables that are specified in the derivation.

For each output of the derivation, one environment variable is defined
that points to its /gnu/store/… file name.  So for instance, you’d also
need to do:

  (setenv "out" "/home/roel/something")

if you want to “redirect” the “out” output to a place that’s not its
normal place in the store.

With user namespaces, you could simply bind mount /home/roel/something
to /gnu/store/… in the process that runs the derivation builder, instead
of using the ‘setenv’ hack above.

Ludo’.
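The steps described in the message above (set the derivation's environment
variables, override "out", then run the builder) can be sketched in plain
Guile.  `run-builder-by-hand' is a hypothetical helper that takes the builder,
its arguments and its environment as ordinary values; with Guix loaded, these
could presumably come from `derivation-builder',
`derivation-builder-arguments' and `derivation-builder-environment-vars' in
(guix derivations).

```scheme
(define (run-builder-by-hand builder arguments environment output)
  "Run BUILDER with ARGUMENTS after setting each name/value pair of the alist
ENVIRONMENT, with the \"out\" variable redirected to OUTPUT.  Return the
status value of `system*'.  This changes the environment of the calling
process, so a real engine may prefer to fork first."
  (for-each (lambda (variable)
              (setenv (car variable) (cdr variable)))
            environment)
  ;; Set "out" last so that it overrides the store file name.
  (setenv "out" output)
  (apply system* builder arguments))

;; Usage with a stand-in builder: a shell command that writes to $out.
(run-builder-by-hand "/bin/sh"
                     '("-c" "printf '%s' \"$greeting\" > \"$out\"")
                     '(("greeting" . "hello")
                       ("out" . "/gnu/store/…-result"))
                     "/tmp/workflow-sketch-output")
```

The inputs of the derivation still have to exist in the store for the builder
to find them; only the output location is redirected.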
* Re: Workflow management with GNU Guix
From: Roel Janssen @ 2016-10-28 17:25 UTC
To: Ludovic Courtès; +Cc: guix-devel

Ludovic Courtès writes:

[...]

> Right.  In addition to the snippet I gave, you’d need to set the
> environment variables that are specified in the derivation.
>
> For each output of the derivation, one environment variable is defined
> that points to its /gnu/store/… file name.  So for instance, you’d also
> need to do:
>
>   (setenv "out" "/home/roel/something")
>
> if you want to “redirect” the “out” output to a place that’s not its
> normal place in the store.
>
> With user namespaces, you could simply bind mount /home/roel/something
> to /gnu/store/… in the process that runs the derivation builder, instead
> of using the ‘setenv’ hack above.

Ideally, we would do the equivalent of @code{guix environment
--container --ad-hoc --pure <packages>} and execute the programs inside
the environment.  Unfortunately, that requires super user privileges as
well (for good reasons!).

It would be great to build this in though.. just for those who want to
do things properly and have the luxury of doing so...

I'll try to implement this in the upcoming week(s) so we have something
to try out.

Kind regards,
Roel Janssen
* Re: Workflow management with GNU Guix
From: Ludovic Courtès @ 2016-10-29 20:56 UTC
To: Roel Janssen; +Cc: guix-devel

Roel Janssen <roel@gnu.org> skribis:

> Ludovic Courtès writes:

[...]

> Ideally, we would do the equivalent of @code{guix environment
> --container --ad-hoc --pure <packages>} and execute the programs inside
> the environment.  Unfortunately, that requires super user privileges as
> well (for good reasons!).

… if user namespaces are disabled, but yeah.

> It would be great to build this in though.. just for those who want to
> do things properly and have the luxury of doing so...
>
> I'll try to implement this in the upcoming week(s) so we have something
> to try out.

Cool!  Check out ‘call-with-container’.  Essentially you need to tell it
to bind-mount all of (derivation-inputs drv), where drv is the
derivation you want to build.

Ludo’.
[parent not found: <idjfutnih58.fsf@bimsb-sys02.mdc-berlin.net>]
* Re: Workflow management with GNU Guix
From: Ricardo Wurmus @ 2016-05-16 12:22 UTC
To: Roel Janssen; +Cc: guix-devel

(Resending this as it could not be delivered.)

Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes:

> Hi Roel,
>
> [...]
>
>> So, I would like to propose a new Guix subcommand and an extension to
>> the package management language to add workflow management features.
>
> I probably don’t understand your idea well enough, but from what I
> understand it doesn’t really have much to do with packages (other than
> using them) and store manipulation per se (produced artifacts are not
> added to the store).  Exactly what features of Guix do you want to build
> on?
>
> My perspective on pipelines is that they should be developed like any
> other software package, treating individual tools as you would treat
> libraries.  This means that a pipeline would have a configuration step
> in which it checks for the paths of all tools it needs internally, and
> then use the full paths rather than assume all tools to be in a
> directory listed in the PATH variable.
>
> Distributing jobs to clusters would be the responsibility of the
> pipeline, e.g. by using DRMAA, which supports several resource
> management backends and has bindings for a wide range of programming
> languages.
>
>> Would this be a feature you are interested in adding to GNU Guix?
>
> Even if it wasn’t part of Guix itself, you could develop it separately
> and still add it as a Guix command, much like it is currently done for
> “guix web” (which I think should eventually be part of Guix).
>
>> I'm currently working on a proof-of-concept implementation that has three
>> record types/levels of abstraction:
>> <workflow>: Describes which <process>es should be run, and concerns itself with
>>             the order of execution.
>>
>> <process>:  Describes what packages are needed to run the programs involved,
>>             and its relationship to other processes.  Processes take input and
>>             generate output much like the package construction process.
>>
>> <script>:   Short and simple imperative instructions to perform a task.  They are
>>             part of a <process>.  Currently, my implementation generates a shell
>>             script that can be either Guile, Sh, Perl or Python.
>
> From that list it seems as if the only link to Guix is ensuring the
> environment contains required programs.  This can be done right now with
> the help of manifests and profiles.
>
> I wonder if maybe we could add Guix as a package management backend to
> existing workflow specification systems (instead of the curiously
> popular and IMO barely adequate Conda, for example).
>
>> The subcommand I envision is:
>>   guix workflow
>>
>> With primarily:
>>   guix workflow --run=<name-of-workflow-definition>
>>
>> If you are interested in adding any form of workflow management to GNU Guix, I
>> can elaborate on my proof-of-concept implementation, so we can work from there.
>> (or throw everything out of the window and start from scratch ;-))
>
> Could you show us an example workflow?
>
> ~~ Ricardo
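The "configuration step" described in the message above, where a pipeline
resolves every tool to a full path before running, can be sketched in plain
Guile.  `locate-tools' is a hypothetical helper; it only checks that a file
with the given name exists in a PATH directory, not that it is executable.

```scheme
(define (locate-tools tool-names)
  "Return an alist mapping each name in TOOL-NAMES to the file name found by
searching the directories of PATH.  Raise an error for a missing tool."
  (let ((directories (parse-path (getenv "PATH"))))
    (map (lambda (tool)
           (let ((file (search-path directories tool)))
             (unless file
               (error "required tool not found in PATH:" tool))
             (cons tool file)))
         tool-names)))

;; Usage: resolve once at configuration time, then run by full path.
(define tools (locate-tools '("sh")))
(system* (assoc-ref tools "sh") "-c" "exit 0")
```

With a Guix profile the lookup becomes unnecessary in principle, since the
profile already pins each tool to one store item.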
* Re: Workflow management with GNU Guix 2016-05-16 12:22 ` Ricardo Wurmus @ 2016-06-14 9:16 ` Roel Janssen 0 siblings, 0 replies; 13+ messages in thread From: Roel Janssen @ 2016-06-14 9:16 UTC (permalink / raw) To: Ricardo Wurmus; +Cc: guix-devel Hello all, Thank you for your replies. I will use Ricardo's response to reply to. Ricardo Wurmus writes: > (Resending this as it could not be delivered.) > > Ricardo Wurmus <ricardo.wurmus@mdc-berlin.de> writes: > >> Hi Roel, >> >>> With GNU Guix we are able to install programs to our machines with an amazing >>> level of control over the dependency graph of the programs. We can now know >>> what code will run when we invoke a program. We can now know what the impact >>> of an upgrade will be. And we can now safely roll-back to previous states. >>> >>> What seems to be a common practice in research involving data analysis, is >>> running multiple programs in a chain to transform data from raw to specific. >>> This is often referred to as a "pipeline" or a "workflow". Because data sets >>> can be quite large in comparison to the computing power of our laptops, the >>> data analysis is performed on computing clusters instead of single machines. >>> >>> The usage of a pipeline/workflow is somewhat different from the package >>> construction, because we want to run the sequence of commands on different data >>> sets (as opposed to running it on the same source code). Plus, I would like to >>> integrate it with existing computing clusters that have a job scheduling system >>> in place. >>> >>> The reason I think this should be possible with Guix is that it has >>> everything in place to do software deployment and run-time isolation >>> (containers). From there it is a small step to executing programs in an >>> automated way. >>> >>> So, I would like to propose a new Guix subcommand and an extension to >>> the package management language to add workflow management features. 
>>
>> I probably don’t understand your idea well enough, but from what I
>> understand it doesn’t really have much to do with packages (other than
>> using them) and store manipulation per se (produced artifacts are not
>> added to the store).  Exactly what features of Guix do you want to build
>> on?

I would like to build on the language used to express packages.  What's
nice about the package recipes is that they are understandable, they are
shareable (just copy and paste the recipe) and from them a reproducible
output can be produced.

A package recipe describes its entire dependency graph because the
symbols in the inputs are turned into specific versions of the external
packages.  This is a very powerful feature for specifying how to run
things.

>> My perspective on pipelines is that they should be developed like any
>> other software package, treating individual tools as you would treat
>> libraries.  This means that a pipeline would have a configuration step
>> in which it checks for the paths of all tools it needs internally, and
>> then use the full paths rather than assume all tools to be in a
>> directory listed in the PATH variable.

If we used Guix package recipes to describe tools, we wouldn't need to
search for them.  We could just set up a profile with these tools and
set the environment variables suggested by Guix accordingly.

This way we can generate the exact dependency graph of a pipeline,
leaving no ambiguity about the run-time environment.

>> Distributing jobs to clusters would be the responsibility of the
>> pipeline, e.g. by using DRMAA, which supports several resource
>> management backends and has bindings for a wide range of programming
>> languages.

Wouldn't it be easier to write a pipeline in a language that has the
infrastructure to uniquely describe and deploy a program and its
dependencies?  You don't need to search for available tools, you can
just install them.
If they are already available, installing them is only a matter of
creating a couple of symbolic links.

Here is a translation of a "real-world" process definition to my
<process> record type from one of the pipelines I studied.  It isn't a
perfect example because it uses a package that isn't in Guix.  Anyway:

===
(define (rnaseq-fastq-quality-control in out)
  (process
    (name "rnaseq-fastq-quality-control")
    (version "1.0")
    (environment
     `(("fastqc" ,fastqc-bin-0.11.4)))
    (input in)
    (output (string-append out "/" name))
    (procedure
     (script
      (interpreter 'guile)
      (source
       (let ((sample-files (find-files in #:directories? #f)))
         `(begin
            ;; Create output directories.
            (unless (access? ,out F_OK) (mkdir ,out))
            (unless (access? ,output F_OK) (mkdir ,output))
            ;; Perform the analysis step.
            (map (lambda (file)
                   (when (string-suffix? ".fastq.gz" file)
                     (system* "fastqc" "-q" file "-o" ,output)))
                 ',sample-files))))))
    (synopsis "Generate quality control reports for FastQ files")
    (description
     "This process generates a quality control report for a single FastQ file.")))
===

The resulting expression in `source' can be executed with Guile anywhere
on a computing cluster (as long as the files are accessible at the same
location on the other machines).

This snippet can be copy-pasted elsewhere and be included in another
pipeline without adjusting which job distribution system should be used.
We can deal with that on the "workflow" level instead of the "process"
level.

I left the option open to use other scripting languages, but we could
compact it a bit more when only using Guile.

>>> Would this be a feature you are interested in adding to GNU Guix?
>>
>> Even if it wasn’t part of Guix itself, you could develop it separately
>> and still add it as a Guix command, much like it is currently done for
>> “guix web” (which I think should eventually be part of Guix).

That may be a good idea.
>>> I'm currently working on a proof-of-concept implementation that has three
>>> record types/levels of abstraction:
>>> <workflow>: Describes which <process>es should be run, and concerns itself with
>>>             the order of execution.
>>>
>>> <process>:  Describes what packages are needed to run the programs involved,
>>>             and its relationship to other processes.  Processes take input and
>>>             generate output much like the package construction process.
>>>
>>> <script>:   Short and simple imperative instructions to perform a task.  They are
>>>             part of a <process>.  Currently, my implementation generates a shell
>>>             script that can be either Guile, Sh, Perl or Python.
>>
>> From that list it seems as if the only link to Guix is ensuring the
>> environment contains required programs.  This can be done right now with
>> the help of manifests and profiles.
>>
>> I wonder if maybe we could add Guix as a package management backend to
>> existing workflow specification systems (instead of the curiously
>> popular and IMO barely adequate Conda, for example).

That is an option too.  The workflow specification systems overlap in
describing tools though.  Take the Common Workflow Language (CWL), for
example.  If we look at:
http://www.commonwl.org/draft-3/CommandLineTool.html#CommandLineTool

The `requirements' field is the equivalent of `inputs' and
`propagated-inputs' in Guix.  With Guix, we could describe a
command-line tool by referring to the package recipe, and then write the
command to run.

>>> The subcommand I envision is:
>>> guix workflow
>>>
>>> With primarily:
>>> guix workflow --run=<name-of-workflow-definition>
>>>
>>> If you are interested in adding any form of workflow management to GNU Guix, I
>>> can elaborate on my proof-of-concept implementation, so we can work from there.
>>> (or throw everything out of the window and start from scratch ;-))
>>
>> Could you show us an example workflow?

So, the <process>es look like the snippet provided above.
Then the workflow itself looks like:

===
(define (rnaseq-pipeline in out)
  (workflow
    (name "rnaseq-pipeline")
    (version "1.0")
    (input in)
    (output (string-append out "/" name "-"
                           (date->string (current-date) "~Y-~m-~d")))
    (processes
     '(rnaseq-initialize
       rnaseq-fastq-quality-control
       rnaseq-align
       rnaseq-add-read-groups
       rnaseq-index
       rnaseq-feature-readcount
       rnaseq-collect-alignment-metrics
       rnaseq-merge-read-features
       rnaseq-compute-rpkm-values
       rnaseq-normalize-read-counts
       rnaseq-differential-expression))
    (restrictions
     `((,rnaseq-fastq-quality-control ,rnaseq-initialize)
       (,rnaseq-align ,rnaseq-initialize)
       (,rnaseq-add-read-groups ,rnaseq-align)
       (,rnaseq-index ,rnaseq-add-read-groups)
       (,rnaseq-collect-alignment-metrics ,rnaseq-index)
       (,rnaseq-feature-readcount ,rnaseq-index)
       (,rnaseq-merge-read-features ,rnaseq-feature-readcount)
       (,rnaseq-compute-rpkm-values ,rnaseq-merge-read-features)
       (,rnaseq-normalize-read-counts ,rnaseq-merge-read-features)
       (,rnaseq-differential-expression ,rnaseq-merge-read-features)))
    (synopsis "RNA sequencing pipeline used at the UMCU")
    (description "The RNAseq pipeline can do quality control on FastQ and BAM
files; align reads against a reference genome; count reads in features;
normalize read counts; calculate RPKMs and perform DE analysis of standard
designs.")))
===

The `restrictions' are dependency pairs (A B) where A depends on the
successful completion of B.  From this, the execution order can be
determined.

Thank you all for your time.

Kind regards,
Roel Janssen
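As an illustration of how an execution order could be derived from
(A B) pairs like the `restrictions' in the workflow above, here is a
minimal sketch in Guile.  It is not taken from the proof-of-concept
implementation: the procedure name `execution-order' is hypothetical,
and processes are represented by plain symbols rather than <process>
records to keep the example self-contained.

```scheme
(use-modules (srfi srfi-1))

;; Hypothetical helper, not part of the proof-of-concept.
;; Return PROCESSES ordered so that every process comes after the
;; processes it depends on.  RESTRICTIONS is a list of pairs (A B),
;; meaning "A depends on the successful completion of B".
(define (execution-order processes restrictions)
  (let loop ((pending processes)
             (done '()))
    (if (null? pending)
        (reverse done)
        ;; A process is ready when each of its dependencies is done.
        (let ((ready (filter
                      (lambda (p)
                        (every (lambda (r)
                                 (or (not (eq? (first r) p))
                                     (memq (second r) done)))
                               restrictions))
                      pending)))
          (if (null? ready)
              ;; Nothing can run: the restrictions contain a cycle.
              (error "cyclic restrictions among" pending)
              (loop (remove (lambda (p) (memq p ready)) pending)
                    (append (reverse ready) done)))))))

;; Assumed example data: a shortened version of the RNAseq pipeline.
(define example-restrictions
  '((fastq-quality-control initialize)
    (align initialize)
    (add-read-groups align)
    (index add-read-groups)))

(execution-order
 '(index align initialize fastq-quality-control add-read-groups)
 example-restrictions)
;; => (initialize align fastq-quality-control add-read-groups index)
```

Processes that become ready in the same round (here `align' and
`fastq-quality-control') do not depend on each other, so a workflow
engine could in principle hand them to a job scheduler at the same
time.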
Thread overview: 13+ messages
2016-05-12  8:43 Workflow management with GNU Guix Roel Janssen
2016-05-12 11:41 ` Taylan Ulrich Bayırlı/Kammer
2016-05-12 16:06 ` Ludovic Courtès
2016-10-25 13:28 ` Roel Janssen
2016-10-26 12:41 ` Ludovic Courtès
2016-10-26 13:41 ` Roel Janssen
2016-10-28 13:15 ` Ludovic Courtès
2016-10-28 14:40 ` Roel Janssen
2016-10-28 15:27 ` Ludovic Courtès
2016-10-28 17:25 ` Roel Janssen
2016-10-29 20:56 ` Ludovic Courtès
     [not found] ` <idjfutnih58.fsf@bimsb-sys02.mdc-berlin.net>
2016-05-16 12:22 ` Ricardo Wurmus
2016-06-14  9:16 ` Roel Janssen