* [GWL] (random) next steps?
From: zimoun @ 2018-12-14 19:16 UTC
To: Guix Devel

Dear Guixers, ... or at least some of them :-)

Here, I would like to collect some discussions and ideas about the Guix
Workflow Language (GWL) and the next steps for this awesome tool!

For those who do not know: about workflow languages, Wikipedia says:

https://en.wikipedia.org/wiki/Scientific_workflow_system

About the GWL, the basic idea is to apply Guix's functional principles
to data processing. More details here:

https://www.guixwl.org/

Roel Janssen is the original author of the GWL, and the project is now
part of GNU.

Well, I have condensed Ludo's notes from the Paris meeting and added
commentary, as was suggested. :-)

** HPC, “workflows”, and all that

- overview & status
- supporting “the cloud”
  + service that produces Docker/Singularity images
  + todo: produce layered Docker images like the Nix folks
- workflows
  + Snakemake doesn’t handle software deployment
  + or does so through Docker
  + GWL = workflow + deployment
  + add support for Docker
  + add “recency” checks
  + data storage: iRODS?
- Docker arguments
  + security: handling patient data with untrusted “:latest” images
  + Guix allows for “bisect”

1.
Even though I was not a big fan of wisp (I remember struggling with a
"closing parenthesis" issue last time I tried it), I am now a fan of
what Ricardo showed! Let's push out the wisp way of doing GWL... or
not. :-) What are your opinions?

  (pa (ren (the (sis))))
vs
  pa: ren: the: sis

With wisp, the workflow description looks close to CWL, which is an
argument in its favour ;-). Last time I checked, CWL files seemed to be
flat YAML-like files that lack programmable extension, such as
Snakemake provides with Python. A wisp GWL would have both: an easy
syntax and programmability, because it is still Lisp under the hood.

https://www.draketo.de/proj/wisp/
https://www.commonwl.org/
https://snakemake.readthedocs.io/en/stable/getting_started/examples.html
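To make the comparison concrete, here is what a minimal process could
look like in both notations. This is a purely illustrative sketch (the
wispy GWL syntax is not fixed yet), assuming only the standard wisp
rules: each line becomes a parenthesized form headed by its first
token, and indentation nests forms.

Plain Scheme:

  (process
    (name 'greet)
    (procedure '(display "hello")))

The same thing in wisp, where indentation replaces the outer
parentheses:

  process
    name 'greet
    procedure '(display "hello")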
2.
One of the missing features of the GWL is some kind of
content-addressable store (CAS) for data. Another workflow language,
FunFlow (implemented as a Haskell DSL), implements such ideas. To quote
Ricardo's explanation: "we could copy [it] for the GWL (thus avoiding
the need for recency checks). The GWL lacks a data provenance story and
a CAS could fit the bill."

https://github.com/tweag/funflow

3.
The OpenMole project, about parametric exploration, seems to implement
an IPFS-based way of dealing with data. Not sure exactly what that
means. :-)

https://openmole.org/

Talking about data, Zenodo is always around. ;-)

https://zenodo.org/

4.
Some GWL scripts already exist. Could we centralize them in one
repository, even if they are not clean? I mean something in this
flavour:

https://github.com/common-workflow-language/workflows

5.
I recently discovered the Elisp package `s.el` via this blog post:

http://kitchingroup.cheme.cmu.edu/blog/2018/05/14/f-strings-in-emacs-lisp/

or, put differently:

https://github.com/alphapapa/elexandria/blob/master/elexandria.el#L224

Does writing a "formatter" in this flavour, instead of using
`string-append`, seem like the right path to you? I mean, e.g.,

  `(system ,(string-command "gzip ${data-inputs} -c > ${outputs}"))

instead of, e.g.,

  `(system ,(string-append "gzip " data-inputs " -c > " outputs))

It is closer to the flavour of Snakemake.

6.
The graph of dependencies between the processes/units/rules is written
by hand. What would be the best strategy to capture it? With files "à
la" Snakemake? Something else?

7.
Does providing a command `guix workflow pack` seem like a good idea to
you? It would produce an archive with the binaries, the commit hashes
of the channels, etc.

Last, the webpage [1] points to the gwl-devel mailing list, which seems
broken. Does gwl-devel need to be activated so that GWL discussions
happen there, or should everything stay here so as not to scatter
things too much?

[1] https://www.guixwl.org/community

What do you think? What is doable? Science-fiction dreaming?

Thank you. Have a nice week-end,
simon

--
Then, just pointers to threads (that I am aware of), as a reminder of
what the list has already discussed:

https://lists.gnu.org/archive/html/help-guix/2016-02/msg00019.html
https://lists.gnu.org/archive/html/guix-devel/2016-05/msg00380.html
https://lists.gnu.org/archive/html/guix-devel/2016-10/msg00947.html
https://lists.gnu.org/archive/html/guix-devel/2016-10/msg01248.html
https://lists.gnu.org/archive/html/guix-devel/2018-01/msg00371.html
https://lists.gnu.org/archive/html/guix-devel/2018-02/msg00177.html
https://lists.gnu.org/archive/html/help-guix/2018-05/msg00241.html
* Re: [GWL] (random) next steps?
From: Ricardo Wurmus @ 2018-12-15 9:09 UTC
To: zimoun
Cc: Guix Devel, gwl-devel

Hi simon,

> Here, I would like to collect some discussions and ideas about the
> Guix Workflow Language (GWL) and the next steps for this awesome tool!

thanks for kicking off this discussion!

> With wisp, the workflow description looks close to CWL, which is an
> argument in its favour ;-). Last time I checked, CWL files seemed to
> be flat YAML-like files that lack programmable extension, such as
> Snakemake provides with Python. A wisp GWL would have both: an easy
> syntax and programmability, because it is still Lisp under the hood.

I’m working on updating the GWL manual to show a simple wispy example.

> 4.
> Some GWL scripts already exist. Could we centralize them in one
> repository, even if they are not clean? I mean something in this
> flavour:
> https://github.com/common-workflow-language/workflows

I only know of Roel’s ATACseq workflow[1], but we could add a few more
independent process definitions for simple tasks such as sequence
alignment, trimming, etc. This could be a git subtree that includes an
independent repository.

[1]: https://github.com/UMCUGenetics/gwl-atacseq/

> 5.
> I recently discovered the Elisp package `s.el` [...]
>
> Does writing a "formatter" in this flavour, instead of using
> `string-append`, seem like the right path to you? I mean, e.g.,
>
>   `(system ,(string-command "gzip ${data-inputs} -c > ${outputs}"))
>
> instead of, e.g.,
>
>   `(system ,(string-append "gzip " data-inputs " -c > " outputs))
>
> It is closer to the flavour of Snakemake.

Scheme itself has (format #f "…" foo bar) for string interpolation.
With a little macro we could generate the right “format” invocation, so
that the user could write something similar to what you suggested:

  (shell "gzip ${data-inputs} -c > ${outputs}")

  ->  (system (format #f "gzip ~a -c > ~a" data-inputs outputs))

String concatenation is one possibility, but I hope we can do better
than that. scsh offers special process forms that would allow us to do
things like this:

  (shell (gzip ,data-inputs -c > ,outputs))

or

  (run (gzip ,data-inputs -c)
       (> 1 ,outputs))

Maybe we can take some inspiration from scsh.
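For the ${...} variant, a rough sketch of such a `shell' macro in Guile
could look like the following. This is untested illustration, not a
finished design; it assumes the template is a string literal and that
the names between ${ and } are identifiers bound at the call site:

  (use-modules (ice-9 regex))

  (eval-when (expand load eval)
    ;; Split "gzip ${data-inputs} -c > ${outputs}" into the format
    ;; string "gzip ~a -c > ~a" and the symbols (data-inputs outputs).
    (define (parse-template str)
      (let loop ((s str) (fmt "") (vars '()))
        (let ((m (string-match "[$][{]([^}]+)[}]" s)))
          (if m
              (loop (match:suffix m)
                    (string-append fmt (match:prefix m) "~a")
                    (cons (string->symbol (match:substring m 1)) vars))
              (values (string-append fmt s) (reverse vars)))))))

  (define-syntax shell
    (lambda (x)
      (syntax-case x ()
        ((_ template)
         (string? (syntax->datum #'template))
         (call-with-values
             (lambda () (parse-template (syntax->datum #'template)))
           (lambda (fmt vars)
             (with-syntax ((fmt fmt)
                           ((var ...)
                            (map (lambda (v)
                                   ;; resolve each name at the use site
                                   (datum->syntax #'template v))
                                 vars)))
               #'(system (format #f fmt var ...)))))))))

With this, (shell "gzip ${data-inputs} -c > ${outputs}") expands to
exactly the “format” call shown above.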
> 6.
> The graph of dependencies between the processes/units/rules is written
> by hand. What would be the best strategy to capture it? With files "à
> la" Snakemake? Something else?

The GWL currently does not use the input information provided by the
user in the data-inputs field. For the content-addressable store we
will need to change this. The GWL will then be able to determine that
data-inputs are in fact the outputs of other processes.

> 7.
> Does providing a command `guix workflow pack` seem like a good idea
> to you? It would produce an archive with the binaries, the commit
> hashes of the channels, etc.

This shouldn’t be difficult to implement as all the needed pieces
already exist.

> Last, the webpage [1] points to the gwl-devel mailing list, which
> seems broken. Does gwl-devel need to be activated so that GWL
> discussions happen there, or should everything stay here so as not to
> scatter things too much?
>
> [1] https://www.guixwl.org/community

Hmm, it looks like the mailing list exists but has never been used;
that’s why there is no archive. Let’s see if this email creates the
archive.

--
Ricardo
* Re: [GWL] (random) next steps?
From: zimoun @ 2018-12-17 17:33 UTC
To: Ricardo Wurmus
Cc: Guix Devel, gwl-devel

Dear,

> I’m working on updating the GWL manual to show a simple wispy example.

Nice! I am also taking a deeper look at the manual. :-)

> > Some GWL scripts already exist.
> I only know of Roel’s ATACseq workflow[1], but we could add a few more
> independent process definitions for simple tasks such as sequence
> alignment, trimming, etc. This could be a git subtree that includes an
> independent repository.

Yes, it should be a git subtree. One idea would be to collect examples
and, at the same time, improve some kind of test suite. I mean, I have
in mind collecting simple and minimal examples that would also populate
tests/. As a starting point (and with minimal effort), I would like to
rewrite the minimal Snakemake examples, e.g.:

https://snakemake.readthedocs.io/en/stable/getting_started/examples.html

Once a wispy "syntax" is fixed, it will be a good exercise. ;-)
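For instance, a classic read-mapping rule in the spirit of those
examples might translate to something like the sketch below. This is
untested and only reuses the process fields that appear elsewhere in
this thread (not necessarily the final GWL syntax); `bwa' and
`samtools' stand for the corresponding Guix package variables:

  (define-public bwa-map
    (process
     (name "bwa-map")
     (package-inputs `(("bwa" ,bwa) ("samtools" ,samtools)))
     (data-inputs '("data/genome.fa" "data/samples/A.fastq"))
     (outputs '("mapped_reads/A.bam"))
     (procedure
      '(system
        "bwa mem data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam"))))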
> Scheme itself has (format #f "…" foo bar) for string interpolation.
[...]
> Maybe we can take some inspiration from scsh.

I did not know about scsh. I am taking a look...

What I have in mind is to reduce the "gap" between the Lisp syntax and
more mainstream-ish syntaxes such as Snakemake or CWL. The comma forms
such as

  (shell (gzip ,data-inputs -c > ,outputs))

are nice! But they are less "natural" than simple string interpolation,
at least to the people in my environment. ;-) What do you think?

> > 6.
> > The graph of dependencies between the processes/units/rules is
> > written by hand. What would be the best strategy to capture it?
> > With files "à la" Snakemake? Something else?
>
> The GWL currently does not use the input information provided by the
> user in the data-inputs field. For the content-addressable store we
> will need to change this. The GWL will then be able to determine that
> data-inputs are in fact the outputs of other processes.

Hum? Nice, but how? I mean, the graph cannot be deduced; it needs to be
written by hand somehow, doesn't it?

Last, just to fix ideas about what we are talking about in terms of
input/output sizes. An aligner such as Bowtie2 or BWA uses as inputs:
 - a fixed dataset (the reference): approx. 25GB for the human species;
 - experimental data (a specific genome): approx. 10GB for some kinds
   of sequencing; and say a series has approx. 50 experiments or more
   (one cohort), so you have to deal with 500GB for one analysis.
The output for each input is around 20GB. This output is then used by
other tools to trim, filter, compare, etc. I mean, part of the time is
spent moving data around (read/write), contrary to HPC simulations
(another story, with other issues: MPI, etc.).

Strategies à la git-annex (Haskell, again! ;-) would be nice. But is
the history useful?

Thank you for any comments or ideas.

All the best,
simon
* Re: [GWL] (random) next steps?
From: Ricardo Wurmus @ 2018-12-21 20:06 UTC
To: zimoun
Cc: Guix Devel, gwl-devel

Hi simon,

>> > 6.
>> > The graph of dependencies between the processes/units/rules is
>> > written by hand. What would be the best strategy to capture it?
>> > With files "à la" Snakemake? Something else?
>>
>> The GWL currently does not use the input information provided by the
>> user in the data-inputs field. For the content-addressable store we
>> will need to change this. The GWL will then be able to determine
>> that data-inputs are in fact the outputs of other processes.
>
> Hum? Nice, but how? I mean, the graph cannot be deduced; it needs to
> be written by hand somehow, doesn't it?

We can connect a graph by joining the inputs of one process with the
outputs of another.

With a content-addressed store we would run processes in isolation and
map the declared data inputs into the environment. Instead of working
on the global namespace of the shared file system, we can learn from
Guix and strictly control the execution environment. After a process
has run to completion, only files that were declared as outputs end up
in the content-addressed store.

A process could declare outputs like this:

  (define the-process
    (process
     (name 'foo)
     (outputs
      '((result "path/to/result.bam")
        (meta   "path/to/meta.xml")))))

Other processes can then access these files with:

  (output the-process 'result)

i.e. the file corresponding to the declared output “result” of the
process named by the variable “the-process”.

The question here is just how far we want to take the idea of “content
addressed”: is it enough to take the hash of all inputs, or do we need
to compute the output hash, which could be much more expensive?
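To illustrate the cheap variant, here is a minimal, untested sketch:
a process run is addressed by a hash derived only from what goes into
it, never from the result itself, much like Guix derivations. It
assumes Guile-Gcrypt for sha256 and Guix's (guix base16) module for
hex encoding:

  (use-modules (gcrypt hash)          ; sha256, from Guile-Gcrypt
               (guix base16)          ; bytevector->base16-string
               (rnrs bytevectors))

  ;; Key for a process run: hash the process name together with the
  ;; keys of all its inputs.  Cheap, because the input keys are already
  ;; known; only fresh data inputs need to be hashed once, on entering
  ;; the store.  The trade-off: identical outputs produced from
  ;; different inputs are not deduplicated, unlike with output hashing.
  (define (process-key name input-keys)
    (bytevector->base16-string
     (sha256
      (string->utf8
       (string-join (cons name (sort input-keys string<?)) ":")))))

A run of "bwa-map" over two data inputs would then live under
(process-key "bwa-map" (list reference-key sample-key)), where the two
key variables are hypothetical.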
--
Ricardo

* Re: [GWL] (random) next steps?
From: zimoun @ 2019-01-04 17:48 UTC
To: Ricardo Wurmus
Cc: Guix Devel, gwl-devel

Hi Ricardo,

Happy New Year!!

> We can connect a graph by joining the inputs of one process with the
> outputs of another.
>
> With a content-addressed store we would run processes in isolation
> and map the declared data inputs into the environment. [...] After a
> process has run to completion, only files that were declared as
> outputs end up in the content-addressed store.
>
> A process could declare outputs like this:
>
>   (define the-process
>     (process
>      (name 'foo)
>      (outputs
>       '((result "path/to/result.bam")
>         (meta   "path/to/meta.xml")))))
>
> Other processes can then access these files with:
>
>   (output the-process 'result)
>
> i.e. the file corresponding to the declared output “result” of the
> process named by the variable “the-process”.

Ok, in this spirit?

https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#rule-dependencies

From my point of view, there are two different paths:
 1. the inputs/outputs are attached to the process/rule/unit;
 2. the processes/rules/units are pure functions, and the `workflow'
    describes how to glue them together.

If I understand correctly, Snakemake takes path 1: the graph is deduced
from the input/output chain. Attached is a dummy example with Snakemake
where I reuse one `shell' between two different rules. It is ugly
because it works with strings. And the rule `filter' cannot be used
without the rule `move_1', since the two rules are explicitly connected
by their input/output.

The other approach is to define a function that returns a process. Then
one needs to specify the graph with the `restrictions', in other words
which function composes with which one. However, because we also want
to track the intermediate outputs, the inputs/outputs are specified for
each process; this should be optional, shouldn't it? If I understand
correctly, this is one possible approach to Dynamic Workflows with the
GWL:

https://www.guixwl.org/beyond-started

On the one hand, with path 1, it is hard to reuse a process/rule
because the composition is hard-coded in the inputs/outputs (requiring
duplication of the same process/rule with different inputs/outputs).
The graph is written by the user when writing the input/output chain.
On the other hand, with path 2, it is difficult to provide both the
inputs/outputs to the function and the graph without duplicating some
code.

My mind is not really clear on this, and I have no idea how to achieve
the idea below in the functional paradigm: the process/rule/unit is a
function with free inputs/outputs (arguments or variables) that returns
a process, and the workflow is a scope where these functions are
combined through some inputs/outputs.

For example, let's define two processes, move and filter:
  (define* (move in out #:optional (opt ""))
    (process
     (package-inputs `(("mv" ,mv)))
     (input in)
     (output out)
     (procedure
      `(system ,(string-append "mv " opt " " in " " out)))))

  (define (filter in out)
    (process
     (package-inputs `(("sed" ,sed)))
     (input in)
     (output out)
     (procedure
      `(system ,(string-append "sed '1d' " in " > " out)))))

Then let's create the workflow that encodes the graph:

  (define wkflow:move->filter->move
    (workflow
     (let ((tmpA (temp-file))
           (tmpB (temp-file)))
       (processes
        `((,move "my-input" ,tmpA)
          (,filter ,tmpA ,tmpB)
          (,move ,tmpB "my-output" " -v "))))))

From `processes', it should be possible to deduce the graph. I am not
sure it is... Even then, it lacks the entry point, but that could be
fixed by the `input' and `output' fields of `workflow'.

Since move and filter are just pure functions, one can easily reuse
them and, e.g., apply them in a different order:

  (define wkflow:filter->move
    (workflow
     (let ((tmp (temp-file)))
       (processes
        `((,move ,tmp "one-output")
          (,filter "one-input" ,tmp))))))

As you said, another possibility would be:

  (processes
   `((,move ,(output filter) "one-output")
     (,filter "one-input" ,(temp-file #:hold #t))))

Do you think this is doable? How hard would it be?

> The question here is just how far we want to take the idea of
> “content addressed”: is it enough to take the hash of all inputs, or
> do we need to compute the output hash, which could be much more
> expensive?

Yes, I agree. Moreover, if the output is hashed, then the hash should
depend on the hash of the inputs and on the hash of the tools,
shouldn't it?

To me, the scenario is: once the workflow has run, one is happy with
the results. Then, after a couple of months or years, one still has a
copy of the working folder but can no longer find out how the results
were computed: which versions of the tools, the binaries do not work
anymore, etc. Therefore, it should be easy to extract from the results
how they were computed: versions, etc.

Last, is it useful to write the intermediate files to disk if they are
not stored? In the thread [0], we discussed the possibility of
streaming through pipes. Take the simple case:

  filter input > filtered
  quality filtered > output

The piped version is better if you do not care about the filtered file:

  filter input | quality > output

However, the classic pipe does not fit this case:

  filter input_R1 > R1_filtered
  filter input_R2 > R2_filtered
  align R1_filtered R2_filtered > output_aligned

In general, one is not interested in conserving the files
R{1,2}_filtered. So why spend time writing them to disk and hashing
them? In other words, is it doable to stream the `processes' at the
process level? It is a different point of view, but it reaches the same
aim, I guess.
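One classic trick that could cover this non-linear case is named pipes
(FIFOs): the intermediate "files" exist in the namespace, but the bytes
flow directly from the filters to the aligner without touching the
disk. A rough Guile sketch, assuming hypothetical `filter' and `align'
commands on PATH:

  ;; Create the two FIFOs that stand in for the intermediate files.
  (for-each (lambda (f) (system* "mkfifo" f))
            '("R1_filtered" "R2_filtered"))

  ;; The writers run in the background; the aligner blocks reading
  ;; from both pipes until the filters feed it data.
  (system "filter input_R1 > R1_filtered &")
  (system "filter input_R2 > R2_filtered &")
  (system "align R1_filtered R2_filtered > output_aligned")

  ;; The FIFOs carried no payload on disk; drop them afterwards.
  (for-each delete-file '("R1_filtered" "R2_filtered"))

The flip side is the same as with plain pipes: since R{1,2}_filtered
never exist on disk, they can be neither hashed nor kept.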
""" try: options = params.options except: options = '' pass cmd = """ mv {options} {inputs} {outputs} """.format(inputs=inputs, outputs=outputs, options=options) return cmd # ### ### # # Because Python rocks! ;-) # def generator(): """ Simple generator of temporary name. Example of use: > name = generator() > next(name) 0 > next(name) 1 etc. """ i = 0 while True: yield str(i) + '.tmp' i += 1 name = generator() # ### ### # # The internal rules # ### rule move_1: input: {rules.fake_data.output} output: temp(next(name)) params: options = '-v', run: shell(move(input, output, params)) rule filter: input: {rules.move_1.output} output: temp(next(name)) shell: """ sed '1d' {input} > {output} """ rule move_2: input: {rules.filter.output} output: {rules.out.input} run: shell(move(input, output)) ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [GWL] (random) next steps?
From: Ricardo Wurmus @ 2019-01-16 22:08 UTC
To: zimoun
Cc: gwl-devel

Hi simon,

[- guix-devel@gnu.org]

I wrote:

> We can connect a graph by joining the inputs of one process with the
> outputs of another.
>
> With a content-addressed store we would run processes in isolation
> and map the declared data inputs into the environment. [...] After a
> process has run to completion, only files that were declared as
> outputs end up in the content-addressed store.
>
> A process could declare outputs like this:
>
>   (define the-process
>     (process
>      (name 'foo)
>      (outputs
>       '((result "path/to/result.bam")
>         (meta   "path/to/meta.xml")))))
>
> Other processes can then access these files with:
>
>   (output the-process 'result)
>
> i.e. the file corresponding to the declared output “result” of the
> process named by the variable “the-process”.

You wrote:

> From my point of view, there are two different paths:
>  1. the inputs/outputs are attached to the process/rule/unit;
>  2. the processes/rules/units are pure functions, and the `workflow'
>     describes how to glue them together.

[…]

> On the one hand, with path 1, it is hard to reuse a process/rule
> because the composition is hard-coded in the inputs/outputs
> (requiring duplication of the same process/rule with different
> inputs/outputs). The graph is written by the user when writing the
> input/output chain. On the other hand, with path 2, it is difficult
> to provide both the inputs/outputs to the function and the graph
> without duplicating some code.

I agree with this assessment. I would like to note, though, that at
least the declaration of outputs works in both systems. Only when an
exact input is tightly attached to a process/rule do we limit ourselves
to the first path, where composition is inflexible.

> Last, is it useful to write the intermediate files to disk if they
> are not stored? In the thread [0], we discussed the possibility of
> streaming through pipes. [...]
>
> In general, one is not interested in conserving the files
> R{1,2}_filtered. So why spend time writing them to disk and hashing
> them? In other words, is it doable to stream the `processes' at the
> process level?

[…]

> [0] http://lists.gnu.org/archive/html/guix-devel/2018-07/msg00231.html

For this to work at all, inputs and outputs must be declared. This
wasn’t mentioned before, but it could of course be done in the workflow
declaration rather than in the individual process descriptions.

But even then it isn’t clear to me how to do this in a general fashion.
It may work fine for tools that write to I/O streams, but we would
probably need mechanisms to declare this behaviour. It cannot be
inferred in general, nor can a process automatically change the
behaviour of its procedure to switch between generating intermediate
files and writing to a stream.
The GWL examples show the use of the (system "foo > out.file") idiom,
which I don’t like very much. I’d prefer to run "foo" directly and
declare its output to be a stream.
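Such a declaration might look like this. This is entirely hypothetical
syntax, just to make the idea concrete; no #:stream option exists in
the GWL today:

  (process
    (name 'filter-reads)
    (outputs
     '((filtered "R1_filtered" #:stream)))   ; a pipe, not a file
    (procedure
     '(system* "filter" "input_R1")))

The engine could then choose to connect "filtered" to a downstream
consumer through a pipe, or spill it to a file when the consumer cannot
stream.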
> Last, could we add a GWL session to the before-FOSDEM days?

The Guix Days are what we make of them, so yes, we can have a GWL
session there :)

--
Ricardo