unofficial mirror of gwl-devel@gnu.org
* Re: [GWL] (random) next steps?
From: Ricardo Wurmus @ 2018-12-15  9:09 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel, gwl-devel


Hi simon,

> Here, I would like to collect some discussions or ideas about the Guix
> Workflow Language (GWL) and the next steps of this awesome tool!

thanks for kicking off this discussion!

> With wisp, the workflow description seems close to CWL, which is an argument ;-)
> Last time I checked, CWL files seemed to be flat YAML-like files, and they lack
> the programmable extensions that Snakemake provides with Python, for example.
> GWL with wisp would have both, easy syntax and programmability,
> because it is still Lisp under the hood.

I’m working on updating the GWL manual to show a simple wispy example.

> 4.
> Some GWL scripts are already there.
> Could we centralize them in one repo,
> even if they are not clean?  I mean something in this flavor:
> https://github.com/common-workflow-language/workflows

I only know of Roel’s ATACseq workflow[1], but we could add a few more
independent process definitions for simple tasks such as sequence
alignment, trimming, etc.  This could be a git subtree that includes an
independent repository.

[1]: https://github.com/UMCUGenetics/gwl-atacseq/

> 5.
> I recently discovered the ELisp package `s.el` via the blog post:
> http://kitchingroup.cheme.cmu.edu/blog/2018/05/14/f-strings-in-emacs-lisp/
> or, put differently:
> https://github.com/alphapapa/elexandria/blob/master/elexandria.el#L224
>
> Does it seem to you the right path to write a "formatter" in this
> flavour instead of using `string-append`?
> I mean, e.g.,
>   `(system ,(string-command "gzip  ${data-inputs}  -c >  ${outputs}"))
> instead of, e.g.,
>   `(system ,(string-append "gzip " data-inputs " -c > " outputs))
>
> It seems closer to the flavour of Snakemake.

Scheme itself has (format #f "…" foo bar) for string interpolation.
With a little macro we could generate the right “format” invocation, so
that the user could do something similar to what you suggested:

    (shell "gzip ${data-inputs} -c > ${outputs}")

    –> (system (format #f "gzip ~a -c > ~a" data-inputs outputs))
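
Such a macro does not exist yet; here is a minimal sketch of what it
could look like in Guile, splitting the template on "${...}" at
expansion time (the name "shell" and the template syntax are just the
hypothetical ones from above):

    (use-modules (ice-9 regex))

    (define-syntax shell
      (lambda (stx)
        (syntax-case stx ()
          ((_ template)
           (string? (syntax->datum #'template))
           (let* ((str   (syntax->datum #'template))
                  ;; Collect the variable names between "${" and "}".
                  (names (map (lambda (m) (match:substring m 1))
                              (list-matches "\\$\\{([^}]+)\\}" str)))
                  ;; Replace each ${...} with format's ~a placeholder.
                  (fmt   (regexp-substitute/global
                          #f "\\$\\{[^}]+\\}" str 'pre "~a" 'post)))
             (with-syntax (((var ...)
                            (map (lambda (n)
                                   (datum->syntax #'template
                                                  (string->symbol n)))
                                 names))
                           (fmt-str fmt))
               #'(system (format #f fmt-str var ...))))))))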

String concatenation is one possibility, but I hope we can do better
than that.  scsh offers special process forms that would allow us to do
things like this:

    (shell (gzip ,data-inputs -c > ,outputs))

or

    (run (gzip ,data-inputs -c)
         (> 1 ,outputs))

Maybe we can take some inspiration from scsh.
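
Not GWL code, but as a rough sketch of how far plain syntax-rules can
take us, a toy "run" macro could turn such a process form into a
system* call (redirections like ">" are left out; "word" and
"word->string" are made-up helpers):

    (define (word->string w)
      ;; Render one word of a process form as a command-line string.
      (cond ((symbol? w) (symbol->string w))
            ((number? w) (number->string w))
            (else w)))

    (define-syntax word
      (syntax-rules (unquote)
        ((_ (unquote e)) e)   ; ,expr: evaluate it
        ((_ w) 'w)))          ; bare word: quote it

    (define-syntax run
      (syntax-rules ()
        ((_ (w ...))
         (apply system* (map word->string (list (word w) ...))))))

    ;; (run (gzip ,data-inputs -c))
    ;; runs the equivalent of (system* "gzip" data-inputs "-c")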

> 6.
> The graph of dependencies between the processes/units/rules is written
> by hand.  What would be the best strategy to capture it?  Files "à la"
> Snakemake?  Something else?

The GWL currently does not use the input information provided by the
user in the data-inputs field.  For the content-addressable store we
will need to change this.  The GWL will then be able to determine that
data-inputs are in fact the outputs of other processes.

> 7.
> Does it seem to you a good idea to provide a `guix workflow pack`
> command to produce an archive with the binaries or the commit hashes
> of the channels, etc.?

This shouldn’t be difficult to implement as all the needed pieces
already exist.

> Last, the webpage [1] points to the gwl-devel mailing list, which
> seems broken.  Does gwl-devel need to be activated so that the GWL
> discussion can happen there, or should everything stay here so as not
> to scatter things too much?
>
> [1] https://www.guixwl.org/community

Hmm, it looks like the mailing list exists but has never been used.
That’s why there is no archive.  Let’s see if this email creates the
archive.

--
Ricardo


* Re: [GWL] (random) next steps?
From: zimoun @ 2018-12-17 17:33 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: Guix Devel, gwl-devel

Dear Ricardo,

> I’m working on updating the GWL manual to show a simple wispy example.

Nice!
I am also taking a deeper look at the manual. :-)


> > Some GWL scripts are already there.
> I only know of Roel’s ATACseq workflow[1], but we could add a few more
> independent process definitions for simple tasks such as sequence
> alignment, trimming, etc.  This could be a git subtree that includes an
> independent repository.

Yes, it should be a git subtree.
One idea would be to collect examples and, at the same time, improve
the test suite.  I mean, I have in mind collecting simple and minimal
examples that could also populate tests/.

As a starting point (and with minimal effort), I would like to rewrite
the minimal Snakemake examples, e.g.,
https://snakemake.readthedocs.io/en/stable/getting_started/examples.html

Once a wispy "syntax" is fixed, it will be a good exercise. ;-)


> Scheme itself has (format #f "…" foo bar) for string interpolation.
> With a little macro we could generate the right “format” invocation, so
> that the user could do something similar to what you suggested:
>
>     (shell "gzip ${data-inputs} -c > ${outputs}")
>
>     –> (system (format #f "gzip ~a -c > ~a" data-inputs outputs))
>
> String concatenation is one possibility, but I hope we can do better
> than that.  scsh offers special process forms that would allow us to do
> things like this:
>
>     (shell (gzip ,data-inputs -c > ,outputs))
>
> or
>
>     (run (gzip ,data-inputs -c)
>          (> 1 ,outputs))
>
> Maybe we can take some inspiration from scsh.

I did not know about scsh.  I am taking a look...

What I have in mind is to reduce the "gap" between the Lisp syntax and
more mainstream syntaxes such as Snakemake or CWL.
The commas in (shell (gzip ,data-inputs -c > ,outputs)) are nice!
But they are less "natural" than simple string interpolation, at
least to the people in my environment. ;-)

What do you think?


> > 6.
> > The graph of dependencies between the processes/units/rules is written
> > by hand.  What would be the best strategy to capture it?  Files "à la"
> > Snakemake?  Something else?
>
> The GWL currently does not use the input information provided by the
> user in the data-inputs field.  For the content-addressable store we
> will need to change this.  The GWL will then be able to determine that
> data-inputs are in fact the outputs of other processes.

Hmm, nice, but how?
I mean, the graph cannot be deduced; it needs to be written by hand,
somehow.  Doesn't it?



Last, just to fix ideas about the input/output sizes we are talking
about: an aligner such as Bowtie2/BWA uses as inputs:
 - a fixed dataset (the reference): approx. 25GB for the human species;
 - experimental data (a specific genome): approx. 10GB for some kinds
of sequencing; with a series of approx. 50 experiments or more (one
cohort), you have to deal with 500GB for one analysis.
The output for each dataset is around 20GB.  This output is then used
by other tools to trim, filter out, compare, etc.
I mean, much of the time is spent moving data (read/write), in
contrast to HPC simulations---another story, with other issues (MPI,
etc.).

Strategies à la git-annex (Haskell, again! ;-) would be nice.  But is
the history useful?


Thank you for any comments or ideas.

All the best,
simon


* Re: [GWL] (random) next steps?
From: Ricardo Wurmus @ 2018-12-21 20:06 UTC (permalink / raw)
  To: zimoun; +Cc: Guix Devel, gwl-devel


Hi simon,

>> > 6.
>> > The graph of dependencies between the processes/units/rules is written
>> > by hand.  What would be the best strategy to capture it?  Files "à la"
>> > Snakemake?  Something else?
>>
>> The GWL currently does not use the input information provided by the
>> user in the data-inputs field.  For the content-addressable store we
>> will need to change this.  The GWL will then be able to determine that
>> data-inputs are in fact the outputs of other processes.
>
> Hmm, nice, but how?
> I mean, the graph cannot be deduced; it needs to be written by hand,
> somehow.  Doesn't it?

We can connect a graph by joining the inputs of one process with the
outputs of another.

With a content-addressed store we would run processes in isolation and
map the declared data inputs into the environment.  Instead of working
on the global namespace of the shared file system, we can learn from
Guix and strictly control the execution environment.  After a process
has run to completion, only files that were declared as outputs end up
in the content-addressed store.

A process could declare outputs like this:

    (define the-process
      (process
        (name 'foo)
        (outputs
         '((result "path/to/result.bam")
           (meta   "path/to/meta.xml")))))

Other processes can then access these files with:

    (output the-process 'result)

i.e. the file corresponding to the declared output “result” of the
process named by the variable “the-process”.
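
The “output” accessor is only proposed here, but it could be as simple
as an alist lookup; a rough sketch, with a simplified record type
standing in for the real process record:

    (use-modules (srfi srfi-9))

    ;; Simplified stand-in for the GWL process record.
    (define-record-type <process>
      (make-process name outputs)
      process?
      (name    process-name)
      (outputs process-outputs))

    (define (output process tag)
      ;; Look up TAG among the declared outputs of PROCESS and
      ;; return the corresponding file name.
      (let ((entry (assq tag (process-outputs process))))
        (unless entry
          (error "no such output:" tag (process-name process)))
        (cadr entry)))

    ;; (output the-process 'result) => "path/to/result.bam"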

The question here is just how far we want to take the idea of “content
addressed” – is it enough to take the hash of all inputs or do we need
to compute the output hash, which could be much more expensive?
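
For instance, hashing only the declared inputs (and, if we like, the
tools) is cheap compared to hashing outputs that may be tens of
gigabytes.  A rough sketch, assuming guile-gcrypt's (gcrypt hash) and
(gcrypt base16) modules:

    (use-modules (gcrypt hash)
                 (gcrypt base16)
                 (rnrs bytevectors))

    (define (inputs-hash files)
      ;; Hash the concatenation of each input file's content hash;
      ;; the tools' store paths could be mixed in the same way.
      (bytevector->base16-string
       (sha256
        (string->utf8
         (string-join
          (map (lambda (f)
                 (bytevector->base16-string (file-sha256 f)))
               (sort files string<?))
          ":")))))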

--
Ricardo


* Re: [GWL] (random) next steps?
From: zimoun @ 2019-01-04 17:48 UTC (permalink / raw)
  To: Ricardo Wurmus; +Cc: Guix Devel, gwl-devel


Hi Ricardo,

Happy New Year!!


> We can connect a graph by joining the inputs of one process with the
> outputs of another.
>
> With a content-addressed store we would run processes in isolation and
> map the declared data inputs into the environment.  Instead of working
> on the global namespace of the shared file system, we can learn from
> Guix and strictly control the execution environment.  After a process
> has run to completion, only files that were declared as outputs end up
> in the content-addressed store.
>
> A process could declare outputs like this:
>
>     (define the-process
>       (process
>         (name 'foo)
>         (outputs
>          '((result "path/to/result.bam")
>            (meta   "path/to/meta.xml")))))
>
> Other processes can then access these files with:
>
>     (output the-process 'result)
>
> i.e. the file corresponding to the declared output “result” of the
> process named by the variable “the-process”.

Ok, in this spirit?
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#rule-dependencies

From my point of view, there are 2 different paths:
 1- the inputs-outputs are attached to the process/rule/unit;
 2- the processes/rules/units are pure functions and the `workflow'
describes how to glue them together.

If I understand correctly, Snakemake takes path 1-: the graph is
deduced from the inputs-outputs chain.
Attached is a dummy example with Snakemake where I reuse one `shell'
between 2 different rules.  It is ugly because it works with strings.
And the rule `filter' cannot be used without the rule `move_1', since
the two rules are explicitly connected by their input-output.

The other approach is to define a function that returns a process.
Then one needs to specify the graph with the `restrictions', in other
words which function composes with which one.  However, because we
also want to track the intermediate outputs, the inputs-outputs are
specified for each process; they should be optional, shouldn't they?
If I understand correctly, this is one possible approach to the
Dynamic Workflows of the GWL:
https://www.guixwl.org/beyond-started

On the one hand, with path 1-, it is hard to reuse a process/rule
because the composition is hard-coded in the inputs-outputs
(duplication of the same process/rule with different inputs-outputs);
the user writes the graph when writing the inputs-outputs chain.
On the other hand, with path 2-, it is difficult to provide both
the inputs-outputs to the function and the graph without duplicating
some code.

My mind is not entirely clear on this, and I do not yet know how to
achieve the functional idea below.
The process/rule/unit is a function with free inputs-outputs
(arguments or variables), and it returns a process.
The workflow is a scope where these functions are combined through
some inputs-outputs.

For example, let define 2 processes: move and filter.

(define* (move in out #:optional (opt ""))
  (process
   (package-inputs
    `(("mv" ,mv)))
   (input in)
   (output out)
   (procedure
    ;; Note the separating spaces between the arguments.
    `(system ,(string-append "mv " opt " " in " " out)))))

(define (filter in out)
  (process
   (package-inputs
    `(("sed" ,sed)))
   (input in)
   (output out)
   (procedure
    `(system ,(string-append "sed '1d' " in " > " out)))))


Then let create the workflow that encodes the graph:

(define wkflow:move->filter->move
  (workflow
   (let ((tmpA (temp-file))
         (tmpB (temp-file)))
     (processes
      `((,move "my-input" ,tmpA)
        (,filter ,tmpA ,tmpB)
        (,move ,tmpB "my-output" " -v "))))))

It would be nice to deduce the graph from the `processes' field.
I am not sure it is possible...  It also lacks an indication of which
process is the entry point, but that could be fixed by the `input' and
`output' fields of `workflow'.
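
Deducing the edges does seem doable whenever the file names match up;
a rough sketch, assuming each entry of `processes' has the shape
(procedure input output . rest):

    (use-modules (srfi srfi-1))

    (define (workflow-edges processes)
      ;; Return (upstream . downstream) pairs: entry B depends on
      ;; entry A whenever A's output is B's input.
      (append-map
       (lambda (a)
         (filter-map (lambda (b)
                       (and (equal? (third a) (second b))
                            (cons a b)))
                     processes))
       processes))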


Since move and filter are just pure functions, one can easily reuse
them and, e.g., apply them in a different order:

(define wkflow:filter->move
  (workflow
   (let ((tmp (temp-file)))
     (processes
      `((,move ,tmp "one-output")
        (,filter "one-input" ,tmp))))))

As you said, one could also write:

(processes
      `((,move ,(output filter) "one-output")
        (,filter "one-input" ,(temp-file #:hold #t))))



Do you think it is doable?  How hard would it be?


> The question here is just how far we want to take the idea of “content
> addressed” – is it enough to take the hash of all inputs or do we need
> to compute the output hash, which could be much more expensive?

Yes, I agree.
Moreover, if the output is hashed, then its hash should depend on the
hash of the inputs and on the hash of the tools, shouldn't it?

To me, once the workflow has run, one is happy with the results.
Then, after a couple of months or years, one still has a copy of the
working folder, but one can no longer tell how the results were
computed: which versions of the tools were used, the binaries no
longer work, etc.
Therefore, it should be easy to extract from the results how they were
computed: versions, etc.

Last, is it useful to write the intermediate files to disk if they are
not stored?
In the thread [0], we discussed the possibility of streaming through
pipes.  Take the simple case:
   filter input > filtered
   quality filtered > output
The piped version is better if you do not care about the filtered file:
   filter input | quality > output
However, the classic pipe does not fit this case:
   filter input_R1 > R1_filtered
   filter input_R2 > R2_filtered
   align R1_filtered R2_filtered > output_aligned
In general, one is not interested in keeping the files
R{1,2}_filtered, so why spend time writing them to disk and hashing
them?

In other words, is it doable to stream the `processes' at the process
level?
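
Named pipes (FIFOs) might make this doable even for the two-input
case; a rough Guile sketch, with "filter" and "align" standing in for
the real commands:

    (define (align-streaming)
      ;; Data flows through the FIFOs without ever hitting the disk.
      (for-each (lambda (f) (mknod f 'fifo #o600 0))
                '("R1_filtered" "R2_filtered"))
      ;; The writers run in the background; opening a FIFO for
      ;; writing blocks until "align" opens it for reading.
      (system "filter input_R1 > R1_filtered &")
      (system "filter input_R2 > R2_filtered &")
      (system "align R1_filtered R2_filtered > output_aligned")
      (for-each delete-file '("R1_filtered" "R2_filtered")))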

It is a different point of view, but it reaches the same aim, I guess.


Last, could we add a GWL session to the before-FOSDEM days?



What do you think?

Thank you.

All the best,
simon


[0] http://lists.gnu.org/archive/html/guix-devel/2018-07/msg00231.html

[-- Attachment #2: func.smk --]




rule out:
    input:
        "output.txt"


###
#
# Generate fake data
# (near the top because Snakemake is not a declarative language)
#
rule fake_data:
    output:
        "init.txt"
    shell:
       """
       echo -e 'first line\nok!' > {output}
       """
#
##


###
#
# Example of re-usable processing between rules
#
def move(inputs, outputs, params=None):
    """
    Move inputs to outputs.

    Note: params is optional and provides options for the mv command.
    """
    try:
        options = params.options
    except AttributeError:
        # No params given, or params has no 'options' attribute.
        options = ''
    cmd = """

    mv {options} {inputs} {outputs}

    """.format(inputs=inputs,
               outputs=outputs,
               options=options)
    return cmd
#
###


###
#
# Because Python rocks! ;-)
#
def generator():
    """
    Simple generator of temporary file names.

    Example of use:

     > name = generator()
     > next(name)
      '0.tmp'
     > next(name)
      '1.tmp'
    etc.
    """
    i = 0
    while True:
        yield str(i) + '.tmp'
        i += 1
name = generator()
#
###



###
#
# The internal rules
#
###

rule move_1:
    input:
        rules.fake_data.output
    output:
        temp(next(name))
    params:
        options = '-v',
    run:
        shell(move(input, output, params))


rule filter:
    input:
        rules.move_1.output
    output:
        temp(next(name))
    shell:
        """

        sed '1d' {input} > {output}

        """

rule move_2:
    input:
        rules.filter.output
    output:
        rules.out.input
    run:
        shell(move(input, output))


* Re: [GWL] (random) next steps?
From: Ricardo Wurmus @ 2019-01-16 22:08 UTC (permalink / raw)
  To: zimoun; +Cc: gwl-devel


Hi simon,

[- guix-devel@gnu.org]

I wrote:

> We can connect a graph by joining the inputs of one process with the
> outputs of another.
>
> With a content-addressed store we would run processes in isolation and
> map the declared data inputs into the environment.  Instead of working
> on the global namespace of the shared file system, we can learn from
> Guix and strictly control the execution environment.  After a process
> has run to completion, only files that were declared as outputs end up
> in the content-addressed store.
>
> A process could declare outputs like this:
>
>     (define the-process
>       (process
>         (name 'foo)
>         (outputs
>          '((result "path/to/result.bam")
>            (meta   "path/to/meta.xml")))))
>
> Other processes can then access these files with:
>
>     (output the-process 'result)
>
> i.e. the file corresponding to the declared output “result” of the
> process named by the variable “the-process”.

You wrote:

> From my point of view, there are 2 different paths:
>  1- the inputs-outputs are attached to the process/rule/unit;
>  2- the processes/rules/units are pure functions and the `workflow'
> describes how to glue them together.
[…]
> On the one hand, with path 1-, it is hard to reuse a process/rule
> because the composition is hard-coded in the inputs-outputs
> (duplication of the same process/rule with different inputs-outputs);
> the user writes the graph when writing the inputs-outputs chain.
> On the other hand, with path 2-, it is difficult to provide both
> the inputs-outputs to the function and the graph without duplicating
> some code.

I agree with this assessment.

I would like to note, though, that at least the declaration of outputs
works in both systems.  Only when an exact input is tightly attached to
a process/rule do we limit ourselves to the first path where composition
is inflexible.
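
Concretely, the two paths can be combined: a function whose inputs are
free parameters but which still declares its outputs.  A hypothetical
sketch, reusing simon's filter example:

    (define (filter-process in)
      (process
        (name 'filter)
        (data-inputs (list in))     ; free: supplied by the workflow
        (outputs                    ; declared: known statically
         '((filtered "filtered.txt")))
        (procedure
         `(system ,(string-append "sed '1d' " in " > filtered.txt")))))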

> Last, is it useful to write the intermediate files to disk if they are
> not stored?
> In the thread [0], we discussed the possibility of streaming through
> pipes.  Take the simple case:
>    filter input > filtered
>    quality filtered > output
> The piped version is better if you do not care about the filtered file:
>    filter input | quality > output
> However, the classic pipe does not fit this case:
>    filter input_R1 > R1_filtered
>    filter input_R2 > R2_filtered
>    align R1_filtered R2_filtered > output_aligned
> In general, one is not interested in keeping the files
> R{1,2}_filtered, so why spend time writing them to disk and hashing
> them?
>
> In other words, is it doable to stream the `processes' at the process
> level?
[…]
> [0] http://lists.gnu.org/archive/html/guix-devel/2018-07/msg00231.html

For this to work at all, inputs and outputs must be declared.  This
wasn’t mentioned before, but it could of course be done in the workflow
declaration rather than the individual process descriptions.

But even then it isn’t clear to me how to do this in a general fashion.
It may work fine for tools that write to I/O streams, but we would
probably need mechanisms to declare this behaviour.  It cannot be
generally inferred, nor can a process automatically change the behaviour
of its procedure to switch between the generation of intermediate files
and output to a stream.

The GWL examples show the (system "foo > out.file") idiom, which I
don’t like very much.  I’d prefer to run "foo" directly and declare
the output to be a stream.
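
Hypothetically, such a declaration could look like this (not current
GWL syntax, just a sketch):

    (define filter-R1
      (process
        (name 'filter-R1)
        (outputs
         ;; #:stream? would mark an output to be mapped to a pipe
         ;; rather than to a file in the store.
         '((filtered "R1_filtered" #:stream? #t)))
        ;; The command writes to stdout; the runtime would connect
        ;; stdout to the declared stream.
        (procedure '(system "filter input_R1"))))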

> Last, could we add a GWL session to the before-FOSDEM days?

The Guix Days are what we make of them, so yes, we can have a GWL
session there :)

--
Ricardo


