* [GWL] (random) next steps?
@ 2018-12-14 19:16 zimoun
2018-12-15 9:09 ` Ricardo Wurmus
0 siblings, 1 reply; 6+ messages in thread
From: zimoun @ 2018-12-14 19:16 UTC (permalink / raw)
To: Guix Devel
Dear Guixers,
... or at least some of them :-)
Here, I would like to collect some discussions or ideas about the Guix
Workflow Language (GWL) and the next steps of this awesome tool!
For those who do not know: about workflow languages in general,
Wikipedia has this overview:
https://en.wikipedia.org/wiki/Scientific_workflow_system
About the GWL, the basic idea is to apply Guix's functional
principles to data processing. More details here:
https://www.guixwl.org/
Roel Janssen is the original author of the GWL and now the project is
part of GNU.
Below, I narrow down Ludo's notes from the Paris meeting and add
commentary, as was suggested. :-)
** HPC, “workflows”, and all that
- overview & status
- supporting “the cloud”
+ service that produces Docker/Singularity images
+ todo: produce layered Docker images like Nix folks
- workflows
+ snakemake doesn’t handle software deployment
+ or does so through Docker
+ GWL = workflow + deployment
+ add support for Docker
+ add “recency” checks
+ data storage: IRODS?
- Docker arguments
+ security: handling patient data with untrusted “:latest” images
+ Guix allows for “bisect”
1.
Even though I am not a big fan of Wisp (I remember struggling to
track down a missing closing parenthesis last time I tried it), I am
now a fan of what Ricardo showed!
Let's push the wisp way of GWL... or not. :-)
What are your opinions?
(pa (ren (the (sis))))
vs
pa:
ren:
the:
sis
With wisp, the workflow description seems close to CWL, which is an argument ;-)
Last time I checked, CWL files seemed to be flat YAML-like files
lacking programmable extensions, such as those Snakemake provides
with Python. A wispy GWL would have both: an easy syntax and
programmability, because it is still Lisp under the hood.
https://www.draketo.de/proj/wisp/
https://www.commonwl.org/
https://snakemake.readthedocs.io/en/stable/getting_started/examples.html
2.
One feature the GWL lacks is a kind of content-addressable store
(CAS) for data. Another workflow language, FunFlow (a Haskell DSL),
implements such ideas. To quote Ricardo: "we could copy [this] for
the GWL (thus avoiding the need for recency checks). The GWL lacks a
data provenance story and a CAS could fit the bill."
https://github.com/tweag/funflow
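To make the CAS idea more concrete, here is a toy Python sketch (my
own illustration under simplifying assumptions, not FunFlow's or the
GWL's actual design): data is keyed by the hash of its bytes, so
identical data is stored once and any change yields a new key, which
makes recency checks unnecessary.

```python
import hashlib

class ContentStore:
    """Toy content-addressable store: files are keyed by the SHA-256
    of their bytes, so identical data is stored only once and any
    change in the data yields a new key."""
    def __init__(self):
        self._store = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._store[key] = data  # idempotent: same bytes, same key
        return key

    def get(self, key: str) -> bytes:
        return self._store[key]

cas = ContentStore()
k1 = cas.put(b"chr1\tACGT\n")
k2 = cas.put(b"chr1\tACGT\n")  # same content, same address
```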
3.
The OpenMOLE project, about parametric exploration, seems to
implement an IPFS-like way of dealing with data. I am not sure what
that means exactly. :-)
https://openmole.org/
Talking about data, Zenodo is always around. ;-)
https://zenodo.org/
4.
Some GWL scripts already exist.
Could we centralize them in one repository, even if they are not
polished? I mean something in this flavour:
https://github.com/common-workflow-language/workflows
5.
I recently discovered the Elisp package `s.el` via this blog post:
http://kitchingroup.cheme.cmu.edu/blog/2018/05/14/f-strings-in-emacs-lisp/
or, said differently:
https://github.com/alphapapa/elexandria/blob/master/elexandria.el#L224
Does it seem to you the right path to write a "formatter" in this
flavour instead of using `string-append`?
I mean, e.g.,
`(system ,(string-command "gzip ${data-inputs} -c > ${outputs}"))
instead of, e.g.,
`(system ,(string-append "gzip " data-inputs " -c > " outputs))
It is more in the flavour of Snakemake.
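For what it is worth, here is a Python sketch of how such a
hypothetical `string-command` could behave, built on
`string.Template` (the name and behaviour are my assumptions, not an
existing GWL function):

```python
from string import Template

def string_command(template, **bindings):
    """Hypothetical `string-command': substitute ${name} placeholders,
    in the spirit of Snakemake's {input}/{output} interpolation.
    Note: ${} identifiers cannot contain hyphens, so Scheme's
    `data-inputs' would have to map to `data_inputs' here."""
    return Template(template).substitute(bindings)

cmd = string_command("gzip ${data_inputs} -c > ${outputs}",
                     data_inputs="sample.fastq",
                     outputs="sample.fastq.gz")
# cmd is now a shell command string, ready to hand to `system'
```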
6.
The graph of dependencies between the processes/units/rules is
written by hand. What would be the best strategy to capture it?
Through files, "à la" Snakemake? Something else?
7.
Does it seem to you a good idea to provide a `guix workflow pack`
command, to produce an archive with the binaries, the commit hashes
of the channels, etc.?
Last, the webpage [1] points to the gwl-devel mailing list, which
seems broken. Does gwl-devel need to be activated so that GWL
discussions happen there, or should everything stay here so we do not
scatter too much?
[1] https://www.guixwl.org/community
What do you think?
What is doable, and what is a science-fiction dream?
Thank you.
Have a nice week-end,
simon
--
Finally, some pointers to threads (that I am aware of) as a reminder
of what the list has already discussed:
https://lists.gnu.org/archive/html/help-guix/2016-02/msg00019.html
https://lists.gnu.org/archive/html/guix-devel/2016-05/msg00380.html
https://lists.gnu.org/archive/html/guix-devel/2016-10/msg00947.html
https://lists.gnu.org/archive/html/guix-devel/2016-10/msg01248.html
https://lists.gnu.org/archive/html/guix-devel/2018-01/msg00371.html
https://lists.gnu.org/archive/html/guix-devel/2018-02/msg00177.html
https://lists.gnu.org/archive/html/help-guix/2018-05/msg00241.html
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [GWL] (random) next steps?
2018-12-14 19:16 [GWL] (random) next steps? zimoun
@ 2018-12-15 9:09 ` Ricardo Wurmus
2018-12-17 17:33 ` zimoun
0 siblings, 1 reply; 6+ messages in thread
From: Ricardo Wurmus @ 2018-12-15 9:09 UTC (permalink / raw)
To: zimoun; +Cc: Guix Devel, gwl-devel
Hi simon,
> Here, I would like to collect some discussions or ideas about the Guix
> Workflow Language (GWL) and the next steps of this awesome tool!
thanks for kicking off this discussion!
> With wisp, the workflow description seems close to CWL, which is an argument ;-)
> Last time I checked, CWL files seemed to be flat YAML-like files
> lacking programmable extensions, such as those Snakemake provides
> with Python. A wispy GWL would have both: an easy syntax and
> programmability, because it is still Lisp under the hood.
I’m working on updating the GWL manual to show a simple wispy example.
> 4.
> Some GWL scripts are already there.
> Could we centralize them to one repo?
> Even if they are not clean. I mean something in this flavor:
> https://github.com/common-workflow-language/workflows
I only know of Roel’s ATACseq workflow[1], but we could add a few more
independent process definitions for simple tasks such as sequence
alignment, trimming, etc. This could be a git subtree that includes an
independent repository.
[1]: https://github.com/UMCUGenetics/gwl-atacseq/
> 5.
> I recently have discovered the ELisp package `s.el` via the blog post:
> http://kitchingroup.cheme.cmu.edu/blog/2018/05/14/f-strings-in-emacs-lisp/
> or other said:
> https://github.com/alphapapa/elexandria/blob/master/elexandria.el#L224
>
> Does it seem to you the right path to write a "formatter" in this
> flavour instead of `string-append`?
> I mean, e.g.,
> `(system ,(string-command "gzip ${data-inputs} -c > ${outputs}"))
> instead of e.g.,
> `(system ,(string-append "gzip " data-inputs " -c > " outputs))
>
> It seems more on the flavour of Snakemake.
Scheme itself has (format #f "…" foo bar) for string interpolation.
With a little macro we could generate the right “format” invocation, so
that the user could do something similar to what you suggested:
(shell "gzip ${data-inputs} -c > ${outputs}")
–> (system (format #f "gzip ~a -c > ~a" data-inputs outputs))
String concatenation is one possibility, but I hope we can do better
than that. scsh offers special process forms that would allow us to do
things like this:
(shell (gzip ,data-inputs -c > ,outputs))
or
(run (gzip ,data-inputs -c)
     (> 1 ,outputs))
Maybe we can take some inspiration from scsh.
> 6.
> The graph of dependencies between the processes/units/rules is written
> by hand. What should be the best strategy to capture it ? By files "à
> la" Snakemake ? Other ?
The GWL currently does not use the input information provided by the
user in the data-inputs field. For the content-addressable store we
will need to change this. The GWL will then be able to determine that
data-inputs are in fact the outputs of other processes.
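To fix ideas, here is a toy sketch of that deduction (my own
illustration, not the GWL implementation): each process is linked to
the producers of its declared inputs, which yields the dependency
graph without the user writing it by hand.

```python
def dependency_graph(processes):
    """Given {name: {"inputs": [...], "outputs": [...]}}, link each
    process to every process whose outputs it consumes."""
    # Map each declared output file to the process that produces it.
    producers = {out: name
                 for name, p in processes.items()
                 for out in p["outputs"]}
    # A process depends on the producers of its inputs (if any).
    return {name: sorted({producers[i]
                          for i in p["inputs"] if i in producers})
            for name, p in processes.items()}

graph = dependency_graph({
    "align": {"inputs": ["reads.fq"], "outputs": ["out.bam"]},
    "index": {"inputs": ["out.bam"],  "outputs": ["out.bam.bai"]},
})
# "index" depends on "align" because they share "out.bam"
```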
> 7.
> Does it appear to you a good idea to provide a command `guix workflow pack` ?
> to produce an archive with the binaries or the commit hashes of the
> channels, etc.
This shouldn’t be difficult to implement as all the needed pieces
already exist.
> Last, the webpage [1] points to gwl-devel mailing list which seems broken.
> Does gwl-devel need to be activated so that GWL discussions happen
> there, or should everything stay here so we do not scatter too much?
>
> [1] https://www.guixwl.org/community
Hmm, it looks like the mailing list exists but has never been used;
that's why there is no archive. Let's see if this email creates it.
--
Ricardo
* Re: [GWL] (random) next steps?
2018-12-15 9:09 ` Ricardo Wurmus
@ 2018-12-17 17:33 ` zimoun
2018-12-21 20:06 ` Ricardo Wurmus
0 siblings, 1 reply; 6+ messages in thread
From: zimoun @ 2018-12-17 17:33 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: Guix Devel, gwl-devel
Dear Ricardo,
> I’m working on updating the GWL manual to show a simple wispy example.
Nice!
I am also giving a deeper look at the manual. :-)
> > Some GWL scripts are already there.
> I only know of Roel’s ATACseq workflow[1], but we could add a few more
> independent process definitions for simple tasks such as sequence
> alignment, trimming, etc. This could be a git subtree that includes an
> independent repository.
Yes, it should be a git subtree.
An idea would be to collect examples and, at the same time, improve a
kind of test suite. I mean, I have in mind collecting simple and
minimal examples that could also populate tests/.
As a starting point (and with minimal effort), I would like to
rewrite the minimal Snakemake examples, e.g.,
https://snakemake.readthedocs.io/en/stable/getting_started/examples.html
Once a wispy "syntax" is fixed, it will be a good exercise. ;-)
> Scheme itself has (format #f "…" foo bar) for string interpolation.
> With a little macro we could generate the right “format” invocation, so
> that the user could do something similar to what you suggested:
>
> (shell "gzip ${data-inputs} -c > ${outputs}")
>
> –> (system (format #f "gzip ~a -c > ~a" data-inputs outputs))
>
> String concatenation is one possibility, but I hope we can do better
> than that. scsh offers special process forms that would allow us to do
> things like this:
>
> (shell (gzip ,data-inputs -c > ,outputs))
>
> or
>
> (run (gzip ,data-inputs -c)
> (> 1 ,outputs))
>
> Maybe we can take some inspiration from scsh.
I did not know about scsh. I am having a look...
What I have in mind is to reduce the "gap" between the Lisp syntax
and more mainstream syntaxes such as Snakemake or CWL.
The commas, as in (shell (gzip ,data-inputs -c > ,outputs)), are nice!
But they are less "natural" than simple string interpolation, at
least to the people around me. ;-)
What do you think?
> > 6.
> > The graph of dependencies between the processes/units/rules is written
> > by hand. What should be the best strategy to capture it ? By files "à
> > la" Snakemake ? Other ?
>
> The GWL currently does not use the input information provided by the
> user in the data-inputs field. For the content-addressable store we
> will need to change this. The GWL will then be able to determine that
> data-inputs are in fact the outputs of other processes.
Hum? Nice, but how?
I mean, the graph cannot be deduced; it needs to be written by hand,
somehow. Doesn't it?
Last, just to fix ideas about the input/output sizes we are talking
about: an aligner such as Bowtie2/BWA uses as inputs:
- a fixed dataset (the reference): approx. 25GB for the human species;
- experimental data (a specific genome): approx. 10GB for some kinds
of sequencing; a series is, say, approx. 50 experiments or more (one
cohort), so you have to deal with 500GB for one analysis.
The output for each dataset is around 20GB. This output is then used
by other tools to trim, filter out, compare, etc.
I mean, a large part of the time is spent moving data (read/write),
contrary to HPC simulations (another story, other issues: MPI, etc.).
Strategies à la git-annex (Haskell, again! ;-)) would be nice. But is
the history useful?
Thank you for any comments or ideas.
All the best,
simon
* Re: [GWL] (random) next steps?
2018-12-17 17:33 ` zimoun
@ 2018-12-21 20:06 ` Ricardo Wurmus
2019-01-04 17:48 ` zimoun
0 siblings, 1 reply; 6+ messages in thread
From: Ricardo Wurmus @ 2018-12-21 20:06 UTC (permalink / raw)
To: zimoun; +Cc: Guix Devel, gwl-devel
Hi simon,
>> > 6.
>> > The graph of dependencies between the processes/units/rules is written
>> > by hand. What should be the best strategy to capture it ? By files "à
>> > la" Snakemake ? Other ?
>>
>> The GWL currently does not use the input information provided by the
>> user in the data-inputs field. For the content-addressable store we
>> will need to change this. The GWL will then be able to determine that
>> data-inputs are in fact the outputs of other processes.
>
> Hum? nice but how?
> I mean, the graph cannot be deduced and it needs to be written by
> hand, somehow. Isn't it?
We can connect a graph by joining the inputs of one process with the
outputs of another.
With a content addressed store we would run processes in isolation and
map the declared data inputs into the environment. Instead of working
on the global namespace of the shared file system we can learn from Guix
and strictly control the execution environment. After a process has run
to completion, only files that were declared as outputs end up in the
content addressed store.
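A rough Python sketch of this isolation idea (my own illustration
with stand-in shell commands, not GWL code): the command runs in a
throwaway working directory, and afterwards only the declared outputs
are copied into a content-keyed store; anything else the process
wrote vanishes with the directory.

```python
import hashlib, os, shutil, subprocess, tempfile

def run_isolated(command, declared_outputs, store_dir):
    """Run `command` in a fresh working directory and keep only the
    files declared as outputs, stored under their content hash."""
    os.makedirs(store_dir, exist_ok=True)
    kept = {}
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(command, cwd=workdir, shell=True, check=True)
        for name, relpath in declared_outputs.items():
            src = os.path.join(workdir, relpath)
            with open(src, "rb") as f:
                key = hashlib.sha256(f.read()).hexdigest()
            dst = os.path.join(store_dir, key)
            shutil.copy(src, dst)
            kept[name] = dst
    return kept  # maps declared names to store paths

store = tempfile.mkdtemp()
# "scratch.txt" is written but never declared, so it is discarded.
outs = run_isolated("echo hi > result.txt; echo junk > scratch.txt",
                    {"result": "result.txt"}, store)
```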
A process could declare outputs like this:
(define the-process
(process
(name 'foo)
(outputs
'((result "path/to/result.bam")
(meta "path/to/meta.xml")))))
Other processes can then access these files with:
(output the-process 'result)
i.e. the file corresponding to the declared output “result” of the
process named by the variable “the-process”.
The question here is just how far we want to take the idea of “content
addressed” – is it enough to take the hash of all inputs or do we need
to compute the output hash, which could be much more expensive?
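To illustrate the trade-off (my own sketch, not GWL code): keying by
inputs plus tool versions is cheap and Guix-like, while keying by
output content requires reading the whole result, which may be tens
of gigabytes.

```python
import hashlib

def cache_key_from_inputs(input_hashes, tool_versions):
    """Cheap: key a result by what went into it (input hashes plus
    tool versions), like a Guix derivation; the output itself never
    needs to be read."""
    h = hashlib.sha256()
    for item in sorted(input_hashes) + sorted(tool_versions):
        h.update(item.encode())
    return h.hexdigest()

def cache_key_from_output(output_bytes):
    """Expensive: hash the output itself; a 20 GB BAM file would have
    to be read end to end."""
    return hashlib.sha256(output_bytes).hexdigest()

k = cache_key_from_inputs(["abc123"], ["bowtie2-2.3.4"])
```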
--
Ricardo
* Re: [GWL] (random) next steps?
2018-12-21 20:06 ` Ricardo Wurmus
@ 2019-01-04 17:48 ` zimoun
2019-01-16 22:08 ` Ricardo Wurmus
0 siblings, 1 reply; 6+ messages in thread
From: zimoun @ 2019-01-04 17:48 UTC (permalink / raw)
To: Ricardo Wurmus; +Cc: Guix Devel, gwl-devel
[-- Attachment #1: Type: text/plain, Size: 6167 bytes --]
Hi Ricardo,
Happy New Year !!
> We can connect a graph by joining the inputs of one process with the
> outputs of another.
>
> With a content addressed store we would run processes in isolation and
> map the declared data inputs into the environment. Instead of working
> on the global namespace of the shared file system we can learn from Guix
> and strictly control the execution environment. After a process has run
> to completion, only files that were declared as outputs end up in the
> content addressed store.
>
> A process could declare outputs like this:
>
> (define the-process
>   (process
>    (name 'foo)
>    (outputs
>     '((result "path/to/result.bam")
>       (meta "path/to/meta.xml")))))
>
> Other processes can then access these files with:
>
> (output the-process 'result)
>
> i.e. the file corresponding to the declared output “result” of the
> process named by the variable “the-process”.
Ok, in this spirit?
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#rule-dependencies
From my point of view, there are 2 different paths:
1- the inputs-outputs are attached to the process/rule/unit;
2- the processes/rules/units are pure functions and the
`workflow' describes how to glue them together.
If I understand correctly, Snakemake follows path 1-: the graph is
deduced from the inputs-outputs chain.
Attached is a dummy example with Snakemake where I reuse one `shell'
between 2 different rules. It is ugly because it works with strings.
And the rule `filter' cannot be used without the rule `move_1', since
the two rules are explicitly connected through their input-output.
The other approach is to define a function that returns a process.
Then one needs to specify the graph with `restrictions', i.e. which
function composes with which one. However, because we also want to
track the intermediate outputs, the inputs-outputs are specified for
each process; they should be optional, shouldn't they? If I
understand correctly, this is one possible approach to Dynamic
Workflows in the GWL:
https://www.guixwl.org/beyond-started
On the one hand, with path 1-, it is hard to reuse a process/rule
because the composition is hard-coded in the inputs-outputs (leading
to duplication of the same process/rule with different
inputs-outputs). The user writes the graph when they write the
inputs-outputs chain.
On the other hand, with path 2-, it is difficult to provide both the
inputs-outputs to the function and the graph without duplicating some
code.
My mind is not completely clear on this, and I do not know yet how to
achieve the functional idea below:
- the process/rule/unit is a function with free inputs-outputs
(arguments or variables) that returns a process;
- the workflow is a scope where these functions are combined through
some inputs-outputs.
For example, let's define 2 processes: move and filter.
(define* (move in out #:optional (opt ""))
  (process
   (package-inputs
    `(("mv" ,mv)))
   (input in)
   (output out)
   (procedure
    `(system ,(string-append "mv " opt " " in " " out)))))
(define (filter in out)
  (process
   (package-inputs
    `(("sed" ,sed)))
   (input in)
   (output out)
   (procedure
    `(system ,(string-append "sed '1d' " in " > " out)))))
Then let's create the workflow that encodes the graph:
(define wkflow:move->filter->move
  (workflow
   (let ((tmpA (temp-file))
         (tmpB (temp-file)))
     (processes
      `((,move "my-input" ,tmpA)
        (,filter ,tmpA ,tmpB)
        (,move ,tmpB "my-output" " -v "))))))
From `processes', it would be nice to deduce the graph.
I am not sure it is possible... and it also lacks which process is
the entry point. But that could be fixed by the `input' and `output'
fields of `workflow'.
Since move and filter are just pure functions, one can easily reuse
them and, e.g., apply them in a different order:
(define wkflow:filter->move
  (workflow
   (let ((tmp (temp-file)))
     (processes
      `((,move ,tmp "one-output")
        (,filter "one-input" ,tmp))))))
As you said, another variant could be:
(processes
 `((,move ,(output filter) "one-output")
   (,filter "one-input" ,(temp-file #:hold #t))))
Do you think it is doable? How hard would it be?
> The question here is just how far we want to take the idea of “content
> addressed” – is it enough to take the hash of all inputs or do we need
> to compute the output hash, which could be much more expensive?
Yes, I agree.
Moreover, if the output is hashed, then the hash should depend on the
hash of the inputs and on the hash of the tools, shouldn't it?
To me, once the workflow has run, one is happy with the results.
Then, after a couple of months or years, one still has a copy of the
working folder, but one can no longer find out how the results were
computed: which version of the tools was used, the binaries no longer
run, etc.
Therefore, it should be easy to extract from the results how they
were computed: versions, etc.
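For instance, a provenance record stored next to each result could
look like this sketch (a hypothetical format of my own, just to fix
ideas): input hashes, tool versions, and channel commits, enough to
reconstruct the computation years later.

```python
import hashlib, json

def provenance_manifest(inputs, tools, channels):
    """Sketch of a manifest written next to each result so that, years
    later, one can recover exactly how it was computed."""
    manifest = {
        "inputs": {name: hashlib.sha256(data).hexdigest()
                   for name, data in inputs.items()},
        "tools": tools,        # e.g. {"bwa": "0.7.17"}
        "channels": channels,  # e.g. Guix channel commit hashes
    }
    return json.dumps(manifest, sort_keys=True, indent=2)

m = provenance_manifest({"reads": b"ACGT"},
                        {"bwa": "0.7.17"},
                        {"guix": "cabba9e"})
```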
Last, is it useful to write intermediate files to disk if they are
not stored?
In the thread [0], we discussed the possibility of streaming through
pipes. Take the simple case:
filter input > filtered
quality filtered > output
The piped version is better if you do not care about the filtered file:
filter input | quality > output
However, the classic pipe does not fit this case:
filter input_R1 > R1_filtered
filter input_R2 > R2_filtered
align R1_filtered R2_filtered > output_aligned
In general, one is not interested in keeping the files
R{1,2}_filtered. So why spend time writing them to disk and hashing
them?
In other words, is it doable to stream the `processes' at the process level?
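One classic POSIX answer for the multi-input case is named pipes
(FIFOs): the consumer reads while the producers write, so the
intermediate files never hit the disk. A small Python sketch with
stand-in commands (not GWL code; `cat` stands in for `align`):

```python
import os, subprocess, tempfile

with tempfile.TemporaryDirectory() as d:
    r1 = os.path.join(d, "R1_filtered")
    r2 = os.path.join(d, "R2_filtered")
    os.mkfifo(r1)
    os.mkfifo(r2)
    # Writers (stand-ins for `filter input_R1/R2`) start first; each
    # blocks until the reader opens the other end of its FIFO.
    w1 = subprocess.Popen(f"printf 'r1-data\\n' > {r1}", shell=True)
    w2 = subprocess.Popen(f"printf 'r2-data\\n' > {r2}", shell=True)
    # Reader (stand-in for `align R1_filtered R2_filtered`):
    aligned = subprocess.run(f"cat {r1} {r2}", shell=True,
                             capture_output=True, text=True).stdout
    w1.wait()
    w2.wait()
```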
It is a different point of view, but it reaches the same aim, I guess.
Last, could we add a GWL session to the before-FOSDEM days?
What do you think?
Thank you.
All the best,
simon
[0] http://lists.gnu.org/archive/html/guix-devel/2018-07/msg00231.html
[-- Attachment #2: func.smk --]
[-- Type: application/octet-stream, Size: 1564 bytes --]
rule out:
    input:
        "output.txt"

###
#
# Generate fake data
# (top here because Snakemake is not a declarative language)
#
rule fake_data:
    output:
        "init.txt"
    shell:
        """
        echo -e 'first line\nok!' > {output}
        """
#
##

###
#
# Example of re-usable processing between rules
#
def move(inputs, outputs, params=None):
    """
    Move inputs to outputs.

    Note: params is optional and provides the options of the mv command.
    """
    try:
        options = params.options
    except AttributeError:
        options = ''
    cmd = """
    mv {options} {inputs} {outputs}
    """.format(inputs=inputs,
               outputs=outputs,
               options=options)
    return cmd
#
###

###
#
# Because Python rocks! ;-)
#
def generator():
    """
    Simple generator of temporary file names.

    Example of use:
    > name = generator()
    > next(name)
    '0.tmp'
    > next(name)
    '1.tmp'
    etc.
    """
    i = 0
    while True:
        yield str(i) + '.tmp'
        i += 1

name = generator()
#
###

###
#
# The internal rules
#
###
rule move_1:
    input:
        rules.fake_data.output
    output:
        temp(next(name))
    params:
        options = '-v',
    run:
        shell(move(input, output, params))

rule filter:
    input:
        rules.move_1.output
    output:
        temp(next(name))
    shell:
        """
        sed '1d' {input} > {output}
        """

rule move_2:
    input:
        rules.filter.output
    output:
        rules.out.input
    run:
        shell(move(input, output))
* Re: [GWL] (random) next steps?
2019-01-04 17:48 ` zimoun
@ 2019-01-16 22:08 ` Ricardo Wurmus
0 siblings, 0 replies; 6+ messages in thread
From: Ricardo Wurmus @ 2019-01-16 22:08 UTC (permalink / raw)
To: zimoun; +Cc: gwl-devel
Hi simon,
[- guix-devel@gnu.org]
I wrote:
> We can connect a graph by joining the inputs of one process with the
> outputs of another.
>
> With a content addressed store we would run processes in isolation and
> map the declared data inputs into the environment. Instead of working
> on the global namespace of the shared file system we can learn from Guix
> and strictly control the execution environment. After a process has run
> to completion, only files that were declared as outputs end up in the
> content addressed store.
>
> A process could declare outputs like this:
>
> (define the-process
>   (process
>    (name 'foo)
>    (outputs
>     '((result "path/to/result.bam")
>       (meta "path/to/meta.xml")))))
>
> Other processes can then access these files with:
>
> (output the-process 'result)
>
> i.e. the file corresponding to the declared output “result” of the
> process named by the variable “the-process”.
You wrote:
> From my point of view, there are 2 different paths:
> 1- the inputs-outputs are attached to the process/rule/unit;
> 2- the processes/rules/units are pure functions and the
> `workflow' describes how to glue them together.
[…]
> On one hand, from the path 1-, it is hard to reuse the process/rule
> because the composition is hard-coded in the inputs-outputs
> (duplication of the same process/rule with different inputs-outputs).
> The graph is written by the user when it writes the inputs-outputs
> chain.
> On the other hand, from the path 2-, it is difficult to provide both
> the inputs-outputs to the function and also the graph without
> duplicate some code.
I agree with this assessment.
I would like to note, though, that at least the declaration of outputs
works in both systems. Only when an exact input is tightly attached to
a process/rule do we limit ourselves to the first path where composition
is inflexible.
> Last, is it useful to write on disk the intermediate files if they are
> not stored?
> In the tread [0], we discussed the possibility to stream the pipes.
> Let say, the simple case:
> filter input > filtered
> quality filtered > output
> and the piped version is better is you do not mind about the filtered file:
> filter input | quality > ouput
>
> However, the classic pipe does not fit for this case:
> filter input_R1 > R1_filtered
> filter input_R2 > R2_filtered
> align R1_filtered R2_filtered > output_aligned
> In general, one is not interested to conserve the files
> R{1,2}_filtered. So why spend time to write them on disk and to hash
> them.
>
> In other words, is it doable to stream the `processes' at the process
> level?
[…]
> [0] http://lists.gnu.org/archive/html/guix-devel/2018-07/msg00231.html
For this to work at all, inputs and outputs must be declared. This
wasn't mentioned before, but it could of course be done in the
workflow declaration rather than in the individual process
descriptions.
But even then it isn’t clear to me how to do this in a general fashion.
It may work fine for tools that write to I/O streams, but we would
probably need mechanisms to declare this behaviour. It cannot be
generally inferred, nor can a process automatically change the behaviour
of its procedure to switch between the generation of intermediate files
and output to a stream.
The GWL examples show the use of the (system "foo > out.file")
idiom, which I don't like very much. I'd prefer to use "foo" directly
and declare the output to be a stream.
> Last, could we add a GWL session to the before-FOSDEM days?
The Guix Days are what we make of them, so yes, we can have a GWL
session there :)
--
Ricardo